• deque<> vs. list<> for a queue

    From Bonita Montero@21:1/5 to All on Fri Nov 17 11:06:02 2023
    I know that a deque is superior to a list if you want to have a queue
    because it does chunk-allocations, thereby allogating the storage for
    multiple nodes at once. But I was curious about what's the actual per- formance-impact. And I was interested in how many allocations were
    saved through a deque. So I overloaded new and delete, redirected the allocation to malloc and free and counted the number of allocations.

    #include <iostream>
    #include <deque>
    #include <string>
    #include <sstream>
    #include <chrono>
    #include <optional>
    #include <list>

    using namespace std;
    using namespace chrono;

    uint64_t allocs = 0;

    void *operator new( size_t n )
    {
    ++allocs;
    return malloc( n );
    }

    void operator delete( void *p ) noexcept
    {
    free( p );
    }

    int main()
    {
    auto bench = []( auto queue )
    {
    for( size_t i = 0; i != 10'000; ++i )
    queue.emplace_back( (ostringstream() << "no short string
    optimization: " << i).str() );
    auto start = high_resolution_clock::now();
    constexpr size_t ROUNDS = 100'000'000;
    uint64_t allocsBefore = ::allocs;
    for( size_t r = ROUNDS; r--; )
    {
    string str( move( queue.front() ) );
    queue.pop_front();
    queue.emplace_back( move( str ) );
    }
    double ns = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ROUNDS;
    uint64_t allocs = ::allocs - allocsBefore;
    cout << ns << ": " << allocs << endl;
    };
    bench( deque<string>() );
    bench( list<string>() );
    }

    With MSVC I get the following results on a AMD 7950X Zen4 CPU:
    4.01774: 6384
    23.5851: 100000000
    So the deque is about eight times faster and has 15.600 times less
    allocations. So the chunk size must be really large with MSVC's
    deque.
    With g++ 12.3.0 the results are the following:
    3.13869: 6250000
    11.8536: 100000000
    So the deque is about 3.5 times faster and the chunks seem rather
    small with 16 elements per chunk. Nevertheless libstdc++ seems to
    have the faster memory allocator so that allocating so much more
    chunks doesn't result in less performance. The better allocation
    performance shows also with the list-performance.
    So the decision is for a deque if you can afford the memory blend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Fri Nov 17 11:29:27 2023
    I guess that MSVC's deque-implementation attaches an just emptied front
    chunk to the back's capactity. Maybe I'll find a way to prove that.

    Am 17.11.2023 um 11:06 schrieb Bonita Montero:
    I know that a deque is superior to a list if you want to have a queue
    because it does chunk-allocations, thereby allogating the storage for multiple nodes at once. But I was curious about what's the actual per- formance-impact. And I was interested in how many allocations were
    saved through a deque. So I overloaded new and delete, redirected the allocation to malloc and free and counted the number of allocations.

    #include <iostream>
    #include <deque>
    #include <string>
    #include <sstream>
    #include <chrono>
    #include <optional>
    #include <list>

    using namespace std;
    using namespace chrono;

    uint64_t allocs = 0;

    void *operator new( size_t n )
    {
        ++allocs;
        return malloc( n );
    }

    void operator delete( void *p ) noexcept
    {
        free( p );
    }

    int main()
    {
        auto bench = []( auto queue )
        {
            for( size_t i = 0; i != 10'000; ++i )
                queue.emplace_back( (ostringstream() << "no short string
    optimization: " << i).str() );
            auto start = high_resolution_clock::now();
            constexpr size_t ROUNDS = 100'000'000;
            uint64_t allocsBefore = ::allocs;
            for( size_t r = ROUNDS; r--; )
            {
                string str( move( queue.front() ) );
                queue.pop_front();
                queue.emplace_back( move( str ) );
            }
            double ns = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ROUNDS;
            uint64_t allocs = ::allocs - allocsBefore;
            cout << ns << ": " << allocs << endl;
        };
        bench( deque<string>() );
        bench( list<string>() );
    }

    With MSVC I get the following results on a AMD 7950X Zen4 CPU:
        4.01774: 6384
        23.5851: 100000000
    So the deque is about eight times faster and has 15.600 times less allocations. So the chunk size must be really large with MSVC's
    deque.
    With g++ 12.3.0 the results are the following:
        3.13869: 6250000
        11.8536: 100000000
    So the deque is about 3.5 times faster and the chunks seem rather
    small with 16 elements per chunk. Nevertheless libstdc++ seems to
    have the faster memory allocator so that allocating so much more
    chunks doesn't result in less performance. The better allocation
    performance shows also with the list-performance.
    So the decision is for a deque if you can afford the memory blend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Pavel@21:1/5 to Bonita Montero on Fri Nov 17 21:42:49 2023
    Bonita Montero wrote:
    I know that a deque is superior to a list if you want to have a queue
    because it does chunk-allocations, thereby allogating the storage for multiple nodes at once. But I was curious about what's the actual per- formance-impact. And I was interested in how many allocations were
    saved through a deque. So I overloaded new and delete, redirected the allocation to malloc and free and counted the number of allocations.

    #include <iostream>
    #include <deque>
    #include <string>
    #include <sstream>
    #include <chrono>
    #include <optional>
    #include <list>

    using namespace std;
    using namespace chrono;

    uint64_t allocs = 0;

    void *operator new( size_t n )
    {
        ++allocs;
        return malloc( n );
    }

    void operator delete( void *p ) noexcept
    {
        free( p );
    }

    int main()
    {
        auto bench = []( auto queue )
        {
            for( size_t i = 0; i != 10'000; ++i )
                queue.emplace_back( (ostringstream() << "no short string
    optimization: " << i).str() );
            auto start = high_resolution_clock::now();
            constexpr size_t ROUNDS = 100'000'000;
            uint64_t allocsBefore = ::allocs;
            for( size_t r = ROUNDS; r--; )
            {
                string str( move( queue.front() ) );
                queue.pop_front();
                queue.emplace_back( move( str ) );
            }
            double ns = (double)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / ROUNDS;
            uint64_t allocs = ::allocs - allocsBefore;
            cout << ns << ": " << allocs << endl;
        };
        bench( deque<string>() );
        bench( list<string>() );
    }

    With MSVC I get the following results on a AMD 7950X Zen4 CPU:
        4.01774: 6384
        23.5851: 100000000
    So the deque is about eight times faster and has 15.600 times less allocations. So the chunk size must be really large with MSVC's
    deque.
    With g++ 12.3.0 the results are the following:
        3.13869: 6250000
        11.8536: 100000000
    So the deque is about 3.5 times faster and the chunks seem rather
    small with 16 elements per chunk. Nevertheless libstdc++ seems to
    have the faster memory allocator so that allocating so much more
    chunks doesn't result in less performance. The better allocation
    performance shows also with the list-performance.
    So the decision is for a deque if you can afford the memory blend.

    This particular test will most likely be bested by boost::circular_buffer.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bonita Montero@21:1/5 to All on Sat Nov 18 05:19:04 2023
    Am 18.11.2023 um 03:42 schrieb Pavel:

    This particular test will most likely be bested by boost::circular_buffer.

    I've got about 18 cycles per rotate from the back to the front unter
    linux. A circular buffer wouldn't be much more efficient and has the
    drawback that it is expensive to extend.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)