• Why is sorting a pointer-array to strings slower than sorting the strings themselves?

    From Bonita Montero@21:1/5 to All on Mon Nov 29 21:26:17 2021
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves? I've checked this with the following code:

    #include <iostream>
    #include <vector>
    #include <fstream>
    #include <string>
    #include <algorithm>
    #include <random>
    #include <concepts>
    #include <chrono>
    #include <execution>

    using namespace std;
    using namespace chrono;

    int main( int argc, char **argv )
    {
        if( argc < 2 )
            return EXIT_FAILURE;
        using vc_t = vector<char>;
        using vc_it = vc_t::iterator;
        vector<char> block;
        {
            // slurp the whole file named on the command line into memory
            ifstream ifs;
            ifs.exceptions( ios_base::badbit );
            ifs.open( argv[1], ios_base::binary );
            ifs.seekg( 0, ios_base::end );
            streampos size = ifs.tellg();
            if( size > (size_t)-1 || !size )
                return EXIT_FAILURE;
            ifs.seekg( 0, ios_base::beg );
            block.resize( (size_t)size );
            ifs.read( &*block.begin(), (streamsize)size );
        }
        cout << "read" << endl;
        // invoke the callback once per whitespace-separated token and return the token count
        auto parse = [&]<typename Callback>( Callback callback ) -> size_t
            requires requires( Callback callback, char const *str, size_t len )
            {
                { callback( str, len ) };
            }
        {
            size_t n = 0;
            for( vc_it cit = block.begin(); cit != block.end(); )
            {
                vc_it lineBegin = cit;
                for( ; cit != block.end() && (unsigned char)*cit > ' '; ++cit );
                if( cit != lineBegin )
                    ++n, callback( &*lineBegin, cit - lineBegin );
                for( ; cit != block.end() && (unsigned char)*cit <= ' '; ++cit );
            }
            return n;
        };
        vector<string> strings;
        size_t sumLen = 0;
        // first pass: count the tokens and their total length so the vector can be reserved up front
        strings.reserve( parse( [&]( char const *, size_t len ) { sumLen += len; } ) );
        if( !strings.capacity() )
            return EXIT_FAILURE;
        cout << "counted, avg length: " << (double)(ptrdiff_t)sumLen / (ptrdiff_t)strings.capacity() << endl;
        // second pass: materialize the strings
        parse( [&]( char const *str, size_t len ) { strings.emplace_back( str, len ); } );
        cout << "parsed" << endl;
        sort( strings.begin(), strings.end() );
        strings.erase( unique( strings.begin(), strings.end() ), strings.end() );
        cout << "uniqued" << endl;
        strings.shrink_to_fit();
        block.resize( 0 ), block.reserve( 0 );
        // put the elements into a random order by swapping each one with another randomly chosen element
        auto randomize = []<typename RandomIt>( RandomIt begin, size_t n )
            requires random_access_iterator<RandomIt>
        {
            mt19937_64 mt;
            uniform_int_distribution<size_t> uidIdx( 0, n - 1 );
            for( size_t i = 0, j; i != n; ++i )
            {
                while( (j = uidIdx( mt )) == i );
                swap( begin[i], begin[j] );
            }
        };
        randomize( strings.begin(), strings.size() );
        auto start = high_resolution_clock::now();
        sort( strings.begin(), strings.end() );
        double ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (ptrdiff_t)strings.size();
        cout << "string-sort: " << ns << endl;
        // now sort pointers to the very same strings with a dereferencing comparator
        vector<string *> pStrings;
        pStrings.reserve( strings.size() );
        for( string &str : strings )
            pStrings.emplace_back( &str );
        randomize( pStrings.begin(), pStrings.size() );
        start = high_resolution_clock::now();
        sort( pStrings.begin(), pStrings.end(),
            []( string *left, string *right ) { return *left < *right; } );
        ns = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (ptrdiff_t)strings.size();
        cout << "pString-sort: " << ns << endl;
    }

    I've read 1.4 million strings with an average length of 9 characters,
    so most strings should fall under the short string optimization, and
    swapping the string-objects would have a higher cost than swapping
    two pointers or a pointer and a length. And sorting string-pointers
    is actually nearly twice as slow as sorting the string-objects.
    Why is that?
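    As an aside about the test harness itself: the hand-rolled randomize
    lambda above is the classic "swap every element with a random other
    element" loop, which is not an unbiased Fisher-Yates shuffle. That does
    not change the comparison, since both containers are randomized the
    same way, but a minimal sketch of the same step done with std::shuffle
    (hypothetical helper name) would be:

    #include <algorithm>
    #include <random>
    #include <string>
    #include <vector>

    // Standard Fisher-Yates shuffle instead of the hand-rolled swap loop.
    void shuffle_strings( std::vector<std::string> &strings )
    {
        std::mt19937_64 mt;                                  // deterministic seed, as in the posted code
        std::shuffle( strings.begin(), strings.end(), mt );
    }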

  • From Marcel Mueller@21:1/5 to All on Mon Nov 29 22:54:49 2021
    On 29.11.21 at 21:26, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves? I've checked this with the following code:
    [...]
    I've read 1.4 million strings with an average length of 9 characters,
    so most strings should fall under the short string optimization, and
    swapping the string-objects would have a higher cost than swapping
    two pointers or a pointer and a length.

    Swapping a few bytes is not the performance killer. Cache misses are the performance killer. Every time you touch a memory address not in the
    cache you get the latency of the RAM. And this latency has not improved significantly in the last two decades.

    And sorting string-pointers
    is actually nearly twice as slow as sorting the string-objects.
    Why is that?

    Using pointers touches more distinct memory locations: the pointer array
    and the string storage. With an array that large, almost any (random)
    access is a cache miss.
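    A stripped-down sketch (hypothetical function names, not the code from
    the original post) of the two comparison paths makes this concrete:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Direct sort: every comparison reads two string objects that live
    // contiguously inside the vector itself.
    void sort_direct( std::vector<std::string> &v )
    {
        std::sort( v.begin(), v.end() );
    }

    // Indirect sort: every comparison first loads two pointers from the
    // pointer array and then follows them to string objects that live
    // elsewhere -- an extra dependent memory access on both sides of
    // each comparison, and a likely cache miss for a large data set.
    void sort_indirect( std::vector<std::string *> &p )
    {
        std::sort( p.begin(), p.end(),
            []( std::string const *l, std::string const *r ) { return *l < *r; } );
    }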


    Marcel

  • From Scott Lurndal@21:1/5 to Marcel Mueller on Tue Nov 30 14:58:00 2021
    Marcel Mueller <news.5.maazl@spamgourmet.org> writes:
    On 29.11.21 at 21:26, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves? I've checked this with the following code:
    [...]
    I've read 1.4 million strings with an average length of 9 characters,
    so most strings should fall under the short string optimization, and
    swapping the string-objects would have a higher cost than swapping
    two pointers or a pointer and a length.

    Swapping a few bytes is not the performance killer. Cache misses are the
    performance killer. Every time you touch a memory address not in the
    cache you get the latency of the RAM. And this latency has not improved
    significantly in the last two decades.

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns. That's a five-fold improvement (granted, Westmere
    was abnormally slow when accessing the second socket).
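    For what it's worth, the load-to-use latency of a given machine is easy
    to estimate with a pointer-chasing loop. The sketch below is a common
    microbenchmark pattern (not code from this thread); it walks a random
    cycle through a buffer much larger than the caches, so nearly every
    load is a miss:

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <numeric>
    #include <random>
    #include <vector>

    int main()
    {
        using namespace std;
        using namespace chrono;
        constexpr size_t N = size_t( 1 ) << 25;   // 32 Mi slots * 8 bytes = 256 MiB, far beyond the caches
        vector<size_t> next( N ), order( N );
        iota( order.begin(), order.end(), size_t( 0 ) );
        shuffle( order.begin() + 1, order.end(), mt19937_64( 42 ) );
        for( size_t i = 0; i + 1 < N; ++i )       // link all slots into one random cycle
            next[order[i]] = order[i + 1];
        next[order[N - 1]] = order[0];
        size_t p = 0;
        auto start = high_resolution_clock::now();
        for( size_t i = 0; i != N; ++i )          // every load depends on the previous one
            p = next[p];
        double ns = duration_cast<duration<double, nano>>( high_resolution_clock::now() - start ).count() / N;
        cout << "avg dependent load: " << ns << " ns (" << p << ")\n";
    }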

  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Nov 30 16:02:56 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 30.11.2021 at 15:58, Scott Lurndal wrote:

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns. That's a five-fold improvement (granted, Westmere
    was abnormally slow when accessing the second socket).

    Westmere was 2010:
    https://en.wikipedia.org/wiki/Westmere_(microarchitecture)

    Nehalem suffered from the same issue, and we had samples of
    Westmere in 2008.

  • From Bonita Montero@21:1/5 to All on Tue Nov 30 16:42:10 2021
    On 30.11.2021 at 15:58, Scott Lurndal wrote:

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns. That's a five-fold improvement (granted, Westmere
    was abnormally slow when accessing the second socket).

    Westmere was 2010:
    https://en.wikipedia.org/wiki/Westmere_(microarchitecture)
    The DRAM latency of the 2000s wasn't significantly worse, but the Pentium 4
    CPUs of that time had 128-byte cache-line sizes, fetching a cache line
    with two bursts, which results in a higher latency.

  • From Bonita Montero@21:1/5 to All on Tue Nov 30 17:58:13 2021
    On 30.11.2021 at 17:46, Scott Lurndal wrote:

    I never, ever, mentioned 128-byte cache lines. Why do you insist
    on changing the topic, which was memory latency; which on Nehalem
    was significantly higher than on today's CPUs (mainly due to inter-socket
    communication latency, it being a NUMA architecture).

    DRAM access latency hasn't varied much since DDR1.
    The snoops are faster than the parallel speculative fetches,
    so NUMA isn't a problem here.

  • From Bonita Montero@21:1/5 to All on Tue Nov 30 17:19:33 2021
    On 30.11.2021 at 17:02, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 30.11.2021 at 15:58, Scott Lurndal wrote:

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns. That's a five-fold improvement (granted, Westmere
    was abnormally slow when accessing the second socket).

    Westmere was 2010:
    https://en.wikipedia.org/wiki/Westmere_(microarchitecture)

    Nehalem suffered from the same issue, and we had samples of
    Westmere in 2008.

    No, there was no CPU architecture except the Pentium 4 that had
    128-byte cache lines.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Nov 30 16:46:42 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 30.11.2021 at 17:02, Scott Lurndal wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On 30.11.2021 at 15:58, Scott Lurndal wrote:

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns. That's a five-fold improvement (granted, Westmere
    was abnormally slow when accessing the second socket).

    Westmere was 2010:
    https://en.wikipedia.org/wiki/Westmere_(microarchitecture)

    Nehalem suffered from the same issue, and we had samples of
    Westmere in 2008.

    No, there was no CPU architecture except the Pentium 4 that had
    128-byte cache lines.

    I never, ever, mentioned 128-byte cache lines. Why do you insist
    on changing the topic, which was memory latency; which on Nehalem
    was significantly higher than on today's CPUs (mainly due to inter-socket
    communication latency, it being a NUMA architecture).

    Note also that there are current CPU architectures with 128-byte
    cache lines, for example Octeon (MIPS) and OcteonTx (ARM64).
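    As a side note, C++17 added a portable compile-time hint for this; where
    the standard library provides the constant (support is optional in
    practice), it can be queried like this:

    #include <iostream>
    #include <new>

    int main()
    {
        // Compile-time constant describing the targeted CPU's cache-line
        // granularity -- typically 64 on x86-64 and 128 on some ARM64 parts.
    #ifdef __cpp_lib_hardware_interference_size
        std::cout << std::hardware_destructive_interference_size << '\n';
    #else
        std::cout << "hardware_destructive_interference_size not provided\n";
    #endif
    }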

  • From Marcel Mueller@21:1/5 to All on Tue Nov 30 19:17:11 2021
    On 30.11.21 at 15:58, Scott Lurndal wrote:
    Swapping a few bytes is not the performance killer. Cache misses are the
    performance killer. Every time you touch a memory address not in the
    cache you get the latency of the RAM. And this latency has not improved
    significantly in the last two decades.

    Define significantly. Mid-2000s server DRAM access latency could
    be as much as 250ns (Intel Westmere), while modern DRAM latency is
    closer to 50-60ns.

    1. Westmere appeared in 2010.

    2. Let's take the CAS latency as an example:
    DDR1: typ. 15 ns
    DDR4: typ. 14 ns
    Other values like tRAS are similar.

    3. Maybe some CPUs are badly designed and introduce additional latency.
    But this is no general rule. Even an old AMD Athlon took only about 50ns,
    or 70ns on a page miss.

    Did you confuse it with the latency of NUMA architectures?


    Marcel

  • From Öö Tiib@21:1/5 to Bonita Montero on Tue Nov 30 13:26:39 2021
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.

  • From Bonita Montero@21:1/5 to All on Wed Dec 1 08:29:02 2021
    On 30.11.2021 at 22:26, Öö Tiib wrote:
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.

    I already said that I'm swapping strings that fit inside the string
    objects themselves. They have an average length of 9 chars, and they
    fit inside the small string optimization.
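    Whether a given standard library really keeps these 9-character strings
    in the internal buffer is easy to verify. The following is a rough,
    implementation-dependent heuristic (nothing the standard guarantees):
    a string is in SSO mode if its data pointer points into the string
    object itself.

    #include <iostream>
    #include <string>

    // Heuristic SSO check: does the data pointer lie inside the object?
    static bool uses_sso( std::string const &s )
    {
        char const *p = s.data();
        char const *obj = reinterpret_cast<char const *>( &s );
        return p >= obj && p < obj + sizeof( std::string );
    }

    int main()
    {
        std::string small( "123456789" );   // 9 chars, like the average in the test data
        std::string large( 100, 'x' );      // long enough to force a heap allocation
        std::cout << "9-char string uses SSO buffer:   " << uses_sso( small ) << '\n'
                  << "100-char string uses SSO buffer: " << uses_sso( large ) << '\n';
    }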

  • From Öö Tiib@21:1/5 to Bonita Montero on Wed Dec 1 13:36:08 2021
    On Wednesday, 1 December 2021 at 09:29:19 UTC+2, Bonita Montero wrote:
    On 30.11.2021 at 22:26, Öö Tiib wrote:
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.
    I already said that I'm swapping strings that fit inside the string
    objects themselves. They have an average length of 9 chars, and they
    fit inside the small string optimization.

    What difference does it supposedly make? You still use a quarter more
    memory for those pointers, add one indirection, and have to access
    all that memory while sorting. Pure pessimization. Why did you expect
    it to be quicker?
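    Rough numbers for the data set in question, assuming a typical 32-byte
    std::string and 8-byte pointers (both are implementation details, not
    guarantees):

    #include <cstddef>
    #include <iostream>
    #include <string>

    int main()
    {
        constexpr std::size_t count = 1'400'000;                       // strings in the test set
        std::size_t strings_bytes  = count * sizeof( std::string );    // the string objects themselves (~45 MB at 32 bytes each)
        std::size_t pointers_bytes = count * sizeof( std::string * );  // the extra pointer array (~11 MB)
        std::cout << "string objects: " << strings_bytes / 1e6 << " MB, "
                  << "extra pointer array: " << pointers_bytes / 1e6 << " MB\n";
    }

    The pointer sort has to touch both the pointer array and, through every
    dereference, the string objects themselves.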

  • From Bonita Montero@21:1/5 to All on Thu Dec 2 09:09:42 2021
    On 01.12.2021 at 22:36, Öö Tiib wrote:
    On Wednesday, 1 December 2021 at 09:29:19 UTC+2, Bonita Montero wrote:
    On 30.11.2021 at 22:26, Öö Tiib wrote:
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.
    I already said that I'm swapping strings that fit inside the string
    objects themselves. They have an average length of 9 chars, and they
    fit inside the small string optimization.

    What difference does it supposedly make? ...

    It makes the difference that you swap the string contents byte-by-byte
    and not with two pointer swaps. Loading or storing a byte costs the same
    as loading or storing a pointer.
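    To isolate just the swap cost this argument is about, a made-up
    micro-benchmark like the one below (names and sizes are invented)
    measures swaps alone, without the comparisons and without the cache
    effects the other replies point to:

    #include <chrono>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    int main()
    {
        using namespace std;
        using namespace chrono;
        constexpr size_t N = 4'000'000;
        vector<string> a( N, string( "123456789" ) ), b( N, string( "abcdefghi" ) );
        vector<string *> pa( N ), pb( N );
        for( size_t i = 0; i != N; ++i )
            pa[i] = &a[i], pb[i] = &b[i];

        auto start = high_resolution_clock::now();
        for( size_t i = 0; i != N; ++i )          // swap whole SSO string objects
            swap( a[i], b[i] );
        auto strSwap = high_resolution_clock::now() - start;

        start = high_resolution_clock::now();
        for( size_t i = 0; i != N; ++i )          // swap plain 8-byte pointers
            swap( pa[i], pb[i] );
        auto ptrSwap = high_resolution_clock::now() - start;

        cout << "string swap:  " << duration_cast<nanoseconds>( strSwap ).count() / double( N ) << " ns\n"
             << "pointer swap: " << duration_cast<nanoseconds>( ptrSwap ).count() / double( N ) << " ns\n"
             << a.front() << ' ' << *pb.front() << '\n';   // keep the results observable
    }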

  • From Juha Nieminen@21:1/5 to Bonita Montero on Thu Dec 2 08:39:27 2021
    Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 01.12.2021 at 22:36, Öö Tiib wrote:
    On Wednesday, 1 December 2021 at 09:29:19 UTC+2, Bonita Montero wrote:
    On 30.11.2021 at 22:26, Öö Tiib wrote:
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.
    I already said that I'm swapping strings that fit inside the string
    objects themselves. They have an average length of 9 chars, and they
    fit inside the small string optimization.

    What difference does it supposedly make? ...

    It makes the difference that you swap the string contents byte-by-byte
    and not with two pointer swaps. Loading or storing a byte costs the same
    as loading or storing a pointer.

    If the string objects are indeed using short string optimization, I doubt
    that the object contents are being swapped byte-by-byte. Most likely
    8 bytes at a time (just like the pointers).

    One advantage over using pointers to the objects is, of course, that you
    don't get the extra indirection.

  • From Bonita Montero@21:1/5 to All on Thu Dec 2 10:44:23 2021
    On 02.12.2021 at 09:39, Juha Nieminen wrote:
    Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 01.12.2021 at 22:36, Öö Tiib wrote:
    On Wednesday, 1 December 2021 at 09:29:19 UTC+2, Bonita Montero wrote:
    On 30.11.2021 at 22:26, Öö Tiib wrote:
    On Monday, 29 November 2021 at 22:26:35 UTC+2, Bonita Montero wrote:
    Why is sorting a pointer-array to strings slower than sorting the
    strings themselves?

    Because std::string itself is a kind of "smart pointer" to the text it contains.
    Its swap after a compare costs next to nothing. So using additional
    indirection and memory for raw pointers is a clear "premature
    pessimization" ... more complicated code for no gain.
    I already said that I'm swapping strings that fit inside the string
    objects themselves. They have an average length of 9 chars, and they
    fit inside the small string optimization.

    What difference does it supposedly make? ...

    It makes the difference that you swap the string contents byte-by-byte
    and not with two pointer swaps. Loading or storing a byte costs the same
    as loading or storing a pointer.

    If the string objects are indeed using short string optimization, I doubt that the object contents are being swapped byte-by-byte. Most likely
    8 bytes at a time (just like the pointers).

    One advantage over using pointers to the objects is, of course, that you don't get the extra indirection.

    With MSVC, clang and g++ the contents are swapped as four SSE
    128-bit words; the size of the string, and whether its storage is
    external or internal, is ignored. But it's more effort to swap four
    128-bit words than two pointers.
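    This is straightforward to check: compile a tiny stand-alone swap
    function with optimizations (e.g. g++ -O2 -S, or on Compiler Explorer)
    and look at the generated code to see how the particular library
    version actually moves the object representation:

    #include <string>

    // Inspect the assembly of this function to see whether the swap is
    // done member-wise, in 16-byte chunks, or with a branch on the
    // SSO/heap case.
    void swap_strings( std::string &a, std::string &b )
    {
        a.swap( b );
    }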

  • From Juha Nieminen@21:1/5 to Bonita Montero on Thu Dec 2 10:27:19 2021
    Bonita Montero <Bonita.Montero@gmail.com> wrote:
    With MSVC, clang and g++ the contents are swapped as four SSE
    128-bit words; the size of the string, and whether its storage is
    external or internal, is ignored.

    That's strange. 4x128 bits would be 64 bytes. I just checked what
    sizeof(std::string) is with a very recent version of gcc 11, and
    it was 32 bytes.

    If compiling for a modern enough architecture, that would be one
    single SSE register swap. But even if it uses two 128-bit registers,
    it doesn't make all that much of a difference.

    But it's more effort to swap four
    128-bit words than two pointers.

    You are forgetting that you are not only swapping the strings, you are
    also comparing them. With pointers you'll be using an extra level of indirection.
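    The size check is a one-liner, and the answer differs between standard
    libraries (commonly 32 bytes with libstdc++ and MSVC, 24 with libc++),
    which also determines how many bytes a swap has to move:

    #include <iostream>
    #include <string>

    int main()
    {
        std::cout << "sizeof(std::string):   " << sizeof( std::string )    << '\n'
                  << "sizeof(std::string *): " << sizeof( std::string * )  << '\n'
                  << "SSO capacity:          " << std::string().capacity() << '\n';
    }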
