• Most efficient prefetching distance

    From Bonita Montero@21:1/5 to All on Fri Oct 1 19:48:01 2021
    On today's x86 CPUs there is a prefetch instruction which loads a
    cacheline into a cache level chosen by a parameter of the instruction.
    But I often wondered what the most appropriate prefetch distance is.
    So I wrote a program to which you can give a maximum block size; it
    scans blocks linearly, starting at a block size of 4kB and doubling
    up to a default of 64MB, but you can choose a larger maximum (nk, nm,
    ng parameter; the parameter can be a float). The prefetching is done
    with an incrementing distance, from zero (special case without
    prefetching) to 256 cachelines (assuming a cacheline size of 64
    bytes, which has been true for all x86 CPUs for decades). It first
    CLFLUSHes the cachelines it scans afterwards. A prefetch distance is
    only tested if it is at most one fourth of the block size. The tail
    of the block, where prefetching would reach beyond the block, is
    scanned without prefetching to simulate a common optimization. Each
    test runs multiple times and the fastest timing is taken. On Windows
    the program sets its thread affinity to CPU 0 and the priority as
    high as possible. On Linux it sets only the affinity, so it would be
    best to run the benchmark with "nice -20 ./a.out".
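    As a rough plausibility check on what distance to expect (my numbers,
    assumed rather than measured): with ~80ns of DRAM latency and a scan
    that consumes one 64-byte cacheline per ~2ns, a prefetch has to be
    issued about

        80ns / 2ns = 40 cachelines

    ahead of the current read to hide the latency completely, so the
    probing below should find its best distances somewhere in the tens
    of cachelines.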

    So here's the source (C++17):

    #if defined(_MSC_VER)
    #define NOMINMAX
    #include <Windows.h>
    #elif defined(__unix__)
    #include <unistd.h>
    #include <sched.h>
    #include <pthread.h>
    #endif
    #include <iostream>
    #include <charconv>
    #include <cstdlib>
    #include <vector>
    #include <cstdint>
    #include <limits>
    #include <cstring>
    #include <cmath>
    #include <cctype>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #elif defined(__GNUC__)
    #include <x86intrin.h>
    #endif

    using namespace std;

    size_t parseSize( char const *str );

    int main( int argc, char **argv )
    {
        static size_t const DEFAULT_SIZE = (size_t)64 * 1024 * 1024;
        size_t blockSize = argc >= 2 ? parseSize( argv[1] ) : DEFAULT_SIZE;
        if( blockSize == (size_t)-1 )
            return EXIT_FAILURE;
    #if defined(_MSC_VER)
        // incrementally get more priority until we're denied
        SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
        SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
        SetThreadAffinityMask( GetCurrentThread(), 1 );
    #elif defined(__unix__)
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        CPU_SET(0, &cpuSet);
        pthread_setaffinity_np( pthread_self(), sizeof cpuSet, &cpuSet );
    #endif
        using vchar_t = vector<char>;
        using vchar_it = vchar_t::iterator;
        vchar_t block( blockSize );
        static size_t const CACHELINE_SIZE = 64;
        size_t size = 4096;
        do
        {
            if( size > blockSize )
                size = blockSize;
            uint64_t fastestTicks = numeric_limits<uint64_t>::max();
            unsigned fastestDistance = 0;
            // more repetitions for small blocks, but at least three
            size_t nTests = (ptrdiff_t)((double)(8.0 * 1024) / (ptrdiff_t)size * 25.0 + 0.5);
            nTests = nTests >= 3 ? nTests : 3;
            for( unsigned nClDistance = 0; nClDistance <= 256; ++nClDistance )
            {
                size_t distance = (size_t)nClDistance * CACHELINE_SIZE;
                if( distance > size / 4 )
                    continue;
                for( size_t t = nTests; t; --t )
                {
                    vchar_it it = block.begin();
                    // flush the lines we're about to scan so every run starts cold
                    for( vchar_it end = it + size; it != end; it += CACHELINE_SIZE )
                        _mm_clflush( &*it );
                    uint64_t start = __rdtsc();
                    if( nClDistance )
                    {
                        it = block.begin();
                        // touch one line, prefetch `distance` bytes ahead of it
                        for( vchar_it beforeEnd = it + (size - distance); it != beforeEnd; it += CACHELINE_SIZE )
                            _mm_prefetch( &*it + distance, _MM_HINT_NTA ),
                            *(char volatile *)&*it;
                        // tail without prefetching so we stay inside the block
                        for( vchar_it end = block.begin() + size; it != end; it += CACHELINE_SIZE )
                            *(char volatile *)&*it;
                    }
                    else
                    {
                        it = block.begin();
                        for( vchar_it end = block.begin() + size; it != end; it += CACHELINE_SIZE )
                            *(char volatile *)&*it;
                    }
                    uint64_t ticks = __rdtsc() - start;
                    if( ticks < fastestTicks )
                        fastestTicks = ticks,
                        fastestDistance = nClDistance;
                }
            }
            if( fastestTicks != numeric_limits<uint64_t>::max() )
                cout << "block-size: " << size << " fastest distance: "
                     << fastestDistance << " cachelines (" << nTests << ")" << endl;
        } while( (size *= 2) <= blockSize );
    }

    size_t parseSize( char const *str )
    {
        double dSize;
        from_chars_result fcr = from_chars( str, str + strlen( str ), dSize, chars_format::general );
        if( fcr.ec != errc() )
            return -1;
        if( !*(str = fcr.ptr) || str[1] )
            return -1;
        static const
        struct suffix_t
        {
            char suffix;
            size_t mult;
        } suffixes[]
        {
            { 'k', 1024 },
            { 'm', (size_t)1024 * 1024 },
            { 'g', (size_t)1024 * 1024 * 1024 }
        };
        char cSuf = tolower( *str );
        for( suffix_t const &suf : suffixes )
            if( suf.suffix == cSuf )
            {
                dSize = trunc( dSize * (ptrdiff_t)suf.mult );
                if( dSize < 1.0 || dSize >= (double)numeric_limits<ptrdiff_t>::max() )
                    return -1;
                return (ptrdiff_t)dSize;
            }
        return -1;
    }
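
    For reference, a typical build and run under the stated conditions
    (my command line, not from the original post; assuming GCC on Linux,
    with a hypothetical file name and the k/m/g suffix scheme described
    above):

        g++ -O2 -std=c++17 prefetch_probe.cpp
        nice -20 ./a.out 256m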

    It would be nice to see your results; please also mention your CPU
    type and RAM type.

  • From Scott Lurndal@21:1/5 to Bonita Montero on Fri Oct 1 18:36:00 2021
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On today's x86 CPUs there is a prefetch instruction which loads a
    cacheline into a cache level chosen by a parameter of the instruction.

    Modern CPUs for the last decade have included automatic prefetchers
    in the cache subsystems, usually a mix of stride-based and/or
    predictive fetchers.

    It's very seldom necessary for an application to provide an explicit
    prefetching hint except in very unusual circumstances. And most
    programmers trying to insert hints manually will get it wrong.
    The behavior of such hints is also heavily microarchitecture
    dependent, so what works on one chip may really slow things down on
    another.

    Note that they are, after all, hints. The processor need not
    actually do anything for a prefetch instruction.

    Let the hardware handle it.
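
    To illustrate the point with a minimal sketch (my code, not Scott's;
    assuming x86-64 with <xmmintrin.h>): for a plain linear scan the
    hardware stride prefetcher already recognizes the pattern after a few
    misses, so a manual hint like the one below is typically redundant:

    #include <xmmintrin.h>
    #include <cstddef>

    // Plain linear scan: a hardware stride prefetcher detects the
    // constant stride after a few misses and runs ahead on its own.
    long sumPlain( char const *p, size_t n )
    {
        long sum = 0;
        for( size_t i = 0; i != n; ++i )
            sum += p[i];
        return sum;
    }

    // Same scan with a manual hint eight cachelines ahead. The hint is
    // advisory: the CPU may drop it, and on most modern cores it adds
    // nothing here because the pattern is already predicted.
    long sumHinted( char const *p, size_t n )
    {
        long sum = 0;
        for( size_t i = 0; i != n; ++i )
        {
            if( i % 64 == 0 )
                _mm_prefetch( p + i + 8 * 64, _MM_HINT_T0 );
            sum += p[i];
        }
        return sum;
    }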

  • From Branimir Maksimovic@21:1/5 to Scott Lurndal on Fri Oct 1 18:58:13 2021
    On 2021-10-01, Scott Lurndal <scott@slp53.sl.home> wrote:
    Bonita Montero <Bonita.Montero@gmail.com> writes:
    On today's x86 CPUs there is a prefetch instruction which loads a
    cacheline into a cache level chosen by a parameter of the instruction.

    Modern CPUs for the last decade have included automatic prefetchers
    in the cache subsystems, usually a mix of stride-based and/or
    predictive fetchers.

    It's very seldom necessary for an application to provide an explicit
    prefetching hint except in very unusual circumstances. And most
    programmers trying to insert hints manually will get it wrong.
    The behavior of such hints is also heavily microarchitecture
    dependent, so what works on one chip may really slow things down on
    another.

    Note that they are, after all, hints. The processor need not
    actually do anything for a prefetch instruction.

    Let the hardware handle it.
    Let her try:
    format elf64 executable 3
    include 'import64.inc'
    interpreter '/lib64/ld-linux-x86-64.so.2'
    needed 'libc.so.6'
    import printf,atoi,exit

    segment executable
    entry $
    mov r8,100
    mov r10,100000
    cmp dword[rsp],2
    jl .skip
    mov rdi, [rsp+16]
    call [atoi]
    movsxd r8,eax
    test r8,r8
    jz .errzero
    xor edx,edx
    mov rax,4000000
    idiv r8
    mov r10,rax
    test r10,r10
    jz .errexit
    js .errsexit
    .skip:
    ; warm up
    imul r9,r8,128
    mov rcx,r9
    mov rdi,outbuf
    mov rsi,inbuf
    rep movsb

    rdtscp
    shl rdx,32
    or rax,rdx
    mov [r1],rax
    mov rbx,r10
    @@:
    imul r9,r8,128
    mov rcx,r9
    mov rdi,outbuf
    mov rsi,inbuf
    rep movsb
    dec rbx
    jnz @b

    rdtscp
    shl rdx,32
    or rax,rdx
    sub rax,[r1]
    cvtsi2sd xmm0,rax
    cvtsi2sd xmm1,r10
    mulsd xmm1,qword[clock]
    divsd xmm0,xmm1
    movsd [r1],xmm0

    rdtscp
    shl rdx,32
    or rax,rdx
    mov [r2],rax
    mov rbx,r10
    @@:
    imul r9,r8,128/8
    mov rcx,r9
    mov rdi,outbuf
    mov rsi,inbuf
    rep movsq
    dec rbx
    jnz @b

    rdtscp
    shl rdx,32
    or rax,rdx
    sub rax,[r2]
    cvtsi2sd xmm0,rax
    cvtsi2sd xmm1,r10
    mulsd xmm1,qword[clock]
    divsd xmm0,xmm1
    movsd [r2],xmm0

    rdtscp
    shl rdx,32
    or rax,rdx
    mov [r3],rax
    mov rbx,r10
    @@:
    mov rcx,r8
    mov rdi,outbuf
    mov rsi,inbuf
    .L0:
    movdqa xmm0,[rsi]
    movdqa xmm1,[rsi+0x10]
    movdqa xmm2,[rsi+0x20]
    movdqa xmm3,[rsi+0x30]
    movdqa xmm4,[rsi+0x40]
    movdqa xmm5,[rsi+0x50]
    movdqa xmm6,[rsi+0x60]
    movdqa xmm7,[rsi+0x70]
    movntdq [rdi],xmm0
    movntdq [rdi+0x10],xmm1
    movntdq [rdi+0x20],xmm2
    movntdq [rdi+0x30],xmm3
    movntdq [rdi+0x40],xmm4
    movntdq [rdi+0x50],xmm5
    movntdq [rdi+0x60],xmm6
    movntdq [rdi+0x70],xmm7
    add rsi,128
    add rdi,128
    dec rcx
    jnz .L0
    dec rbx
    jnz @b

    rdtscp
    shl rdx,32
    or rax,rdx
    sub rax,[r3]
    cvtsi2sd xmm0,rax
    cvtsi2sd xmm1,r10
    mulsd xmm1,qword[clock]
    divsd xmm0,xmm1
    movsd [r3],xmm0

    rdtscp
    shl rdx,32
    or rax,rdx
    mov [r4],rax
    mov rbx,r10
    @@:
    mov rcx,r8
    mov rdi,outbuf
    mov rsi,inbuf
    prefetch [rsi]
    prefetch [rsi+0x40]
    .L1:
    prefetch [rsi+0x80]
    prefetch [rsi+0xc0]
    movdqa xmm0,[rsi]
    movdqa xmm1,[rsi+0x10]
    movdqa xmm2,[rsi+0x20]
    movdqa xmm3,[rsi+0x30]
    movdqa xmm4,[rsi+0x40]
    movdqa xmm5,[rsi+0x50]
    movdqa xmm6,[rsi+0x60]
    movdqa xmm7,[rsi+0x70]
    movntdq [rdi],xmm0
    movntdq [rdi+0x10],xmm1
    movntdq [rdi+0x20],xmm2
    movntdq [rdi+0x30],xmm3
    movntdq [rdi+0x40],xmm4
    movntdq [rdi+0x50],xmm5
    movntdq [rdi+0x60],xmm6
    movntdq [rdi+0x70],xmm7
    add rsi,128
    add rdi,128
    dec rcx
    jnz .L1
    dec rbx
    jnz @b

    rdtscp
    shl rdx,32
    or rax,rdx
    sub rax,[r4]
    cvtsi2sd xmm0,rax
    cvtsi2sd xmm1,r10
    mulsd xmm1,qword[clock]
    divsd xmm0,xmm1
    movsd [r4],xmm0

    rdtscp
    shl rdx,32
    or rax,rdx
    mov [r5],rax
    mov rbx,r10
    @@:
    mov rcx,r8
    mov rdi,outbuf
    mov rsi,inbuf
    prefetch [rsi]
    prefetch [rsi+0x40]
    .L2:
    prefetch [rsi+0x80]
    prefetch [rsi+0xc0]
    vmovdqa ymm0,[rsi]
    vmovdqa ymm1,[rsi+0x20]
    vmovdqa ymm2,[rsi+0x40]
    vmovdqa ymm3,[rsi+0x60]
    vmovntdq [rdi],ymm0
    vmovntdq [rdi+0x20],ymm1
    vmovntdq [rdi+0x40],ymm2
    vmovntdq [rdi+0x60],ymm3
    add rsi,128
    add rdi,128
    dec rcx
    jnz .L2
    dec rbx
    jnz @b

    rdtscp
    shl rdx,32
    or rax,rdx
    sub rax,[r5]
    cvtsi2sd xmm0,rax
    cvtsi2sd xmm1,r10
    mulsd xmm1,qword[clock]
    divsd xmm0,xmm1
    movsd [r5],xmm0

    mov rdi,fmth
    mov rsi,r8
    mov rdx,r10
    xor eax,eax
    call [printf]

    mov rdi,fmt
    mov rsi,fmtmovsb
    movsd xmm0, [r1]
    mov eax,1
    call [printf]

    mov rdi,fmt
    mov rsi,fmtmovsq
    movsd xmm0, [r2]
    mov eax,1
    call [printf]

    mov rdi,fmt
    mov rsi,fmtmovntdq
    movsd xmm0, [r3]
    mov eax,1
    call [printf]

    mov rdi,fmt
    mov rsi,fmtmovntdqp
    movsd xmm0, [r4]
    mov eax,1
    call [printf]

    mov rdi,fmt
    mov rsi,fmtmovntdqy
    movsd xmm0, [r5]
    mov eax,1
    call [printf]

    call [exit]
    .errexit:
    mov rdi,fmtbig
    jmp .next
    .errsexit:
    mov rdi,fmtsgn
    jmp .next
    .errzero:
    mov rdi,fmtzero
    .next:
    mov rsi,r8
    xor eax,eax
    call [printf]
    xor edi,edi
    call [exit]

    segment readable
    fmt db '%-32s%16.14f',0ah,0
    fmtmovsb db 'rep movsb',0
    fmtmovsq db 'rep movsq',0
    fmtmovntdq db 'movntdq',0
    fmtmovntdqp db 'movntdq prefetch',0
    fmtmovntdqy db 'movntdq prefetch ymm',0
    fmth db '%d 128 byte blocks, loops:%d',0ah,0
    fmtbig db 'value of %d is too big, maximum value is 4000000',0ah,0
    fmtsgn db 'value of %d is negative, should be positive',0ah,0
    fmtzero db 'nothing to do 0',0ah,0
    align 8
    clock dq 3.8e9
    segment writeable
    align 32
    inbuf rb 4000000*128
    outbuf rb 4000000*128
    r1 rq 1
    r2 rq 1
    r3 rq 1
    r4 rq 1
    r5 rq 1
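
    For reference, here is roughly the .L2 loop above expressed as C++
    intrinsics rather than assembly (my sketch, not part of the posted
    program; it assumes AVX, 32-byte-aligned buffers and a length that is
    a multiple of 128 bytes, and the function name is mine):

    #include <immintrin.h>
    #include <cstddef>

    // 256-bit loads plus non-temporal stores, prefetching two cachelines
    // ahead like the prefetch [rsi+0x80] / [rsi+0xc0] pair in the asm.
    void ntCopy( char *dst, char const *src, size_t n )
    {
        for( size_t i = 0; i != n; i += 128 )
        {
            _mm_prefetch( src + i + 0x80, _MM_HINT_T0 );
            _mm_prefetch( src + i + 0xc0, _MM_HINT_T0 );
            __m256i a = _mm256_load_si256( (__m256i const *)(src + i) );
            __m256i b = _mm256_load_si256( (__m256i const *)(src + i + 0x20) );
            __m256i c = _mm256_load_si256( (__m256i const *)(src + i + 0x40) );
            __m256i d = _mm256_load_si256( (__m256i const *)(src + i + 0x60) );
            _mm256_stream_si256( (__m256i *)(dst + i), a );
            _mm256_stream_si256( (__m256i *)(dst + i + 0x20), b );
            _mm256_stream_si256( (__m256i *)(dst + i + 0x40), c );
            _mm256_stream_si256( (__m256i *)(dst + i + 0x60), d );
        }
        _mm_sfence(); // order the non-temporal stores before further use
    }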


    --

    7-77-777
    Evil Sinner!

  • From Bonita Montero@21:1/5 to All on Sat Oct 2 07:09:07 2021
    Modern CPUs for the last decade have included automatic prefetchers
    in the cache subsystems, usually a mix of stride-based and/or
    predictive fetchers.

    If they were better, my program would report zero prefetching as the
    best result. And there would be no prefetch instructions at all.

    It's very seldom necessary for an application to provide an
    explicit prefetching hint except in very unusual circumstances.

    Automatic prefetchers are dumb.

  • From Chris M. Thomasson@21:1/5 to Bonita Montero on Fri Oct 1 23:15:19 2021
    On 10/1/2021 10:09 PM, Bonita Montero wrote:
    Modern CPUs for the last decade have included automatic prefetchers
    in the cache subsystems, usually a mix of stride-based and/or
    predictive fetchers.

    If they were better, my program would report zero prefetching as the
    best result. And there would be no prefetch instructions at all.

    It's very seldom necessary for an application to provide an
    explicit prefetching hint except in very unusual circumstances.

    Automatic prefetchers are dumb.

    Oh.... shit. You make me feel like a full blown moron for even
    responding to you, Bonita. YIKES! Let me guess, you agree with me, and
    say I am stupid for responding to you. ;^)

    lol.

  • From Bonita Montero@21:1/5 to All on Sat Oct 2 08:40:23 2021
    On 02.10.2021 at 08:15, Chris M. Thomasson wrote:
    On 10/1/2021 10:09 PM, Bonita Montero wrote:
    Modern CPUs for the last decade have included automatic prefetchers
    in the cache subsystems, usually a mix of stride-based and/or
    predictive fetchers.

    If they were better, my program would report zero prefetching as the
    best result. And there would be no prefetch instructions at all.

    It's very seldom necessary for an application to provide an
    explicit prefetching hint except in very unusual circumstances.

    Automatic prefetchers are dumb.

    Oh.... shit. You make me feel like a full blown moron for even
    responding to you, Bonita. YIKES! Let me guess, you agree with me, and
    say I am stupid for responding to you. ;^)

    This is my improved probing algorithm. It also compares no prefetching
    against the fastest prefetching distance. I get an average improvement
    of 10% and a best improvement of 30% when I test up to 256m:

    #if defined(_MSC_VER)
    #define NOMINMAX
    #include <Windows.h>
    #elif defined(__unix__)
    #include <unistd.h>
    #include <sched.h>
    #include <pthread.h>
    #endif
    #include <iostream>
    #include <charconv>
    #include <cstdlib>
    #include <vector>
    #include <cstdint>
    #include <limits>
    #include <cstring>
    #include <cmath>
    #include <cctype>
    #include <sstream>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #elif defined(__GNUC__)
    #include <x86intrin.h>
    #endif

    using namespace std;

    size_t parseSize( char const *str );
    string blockSizeStr( size_t blockSize );

    int main( int argc, char **argv )
    {
        static size_t const DEFAULT_SIZE = (size_t)64 * 1024 * 1024;
        size_t blockSize = argc >= 2 ? parseSize( argv[1] ) : DEFAULT_SIZE;
        if( blockSize == (size_t)-1 )
            return EXIT_FAILURE;
    #if defined(_MSC_VER)
        // incrementally get more priority until we're denied
        SetPriorityClass( GetCurrentProcess(), HIGH_PRIORITY_CLASS );
        SetPriorityClass( GetCurrentProcess(), REALTIME_PRIORITY_CLASS );
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL );
        SetThreadPriority( GetCurrentThread(), THREAD_PRIORITY_HIGHEST );
        SetThreadAffinityMask( GetCurrentThread(), 1 );
    #elif defined(__unix__)
        cpu_set_t cpuSet;
        CPU_ZERO(&cpuSet);
        CPU_SET(0, &cpuSet);
        pthread_setaffinity_np( pthread_self(), sizeof cpuSet, &cpuSet );
    #endif
        vector<char> block( blockSize );
        char *begin = &*block.begin();
        static size_t const CACHELINE_SIZE = 64;
        size_t size = 4096;
        double fastestImprovement = 0.0, avgImprovement = 0.0;
        int avgDiv = 0;
        do
        {
            if( size > blockSize )
                size = blockSize;
            uint64_t fastestTicks = numeric_limits<uint64_t>::max();
            unsigned fastestDistance = 0;
            uint64_t fastestZeroTicks = fastestTicks;
            // more repetitions for small blocks, but at least three
            size_t nTests = (ptrdiff_t)((double)(8.0 * 1024) / (ptrdiff_t)size * 25.0 + 0.5);
            nTests = nTests >= 3 ? nTests : 3;
            bool hadTest = false;
            for( unsigned nClDistance = 0; nClDistance <= 256; ++nClDistance )
            {
                size_t distance = (size_t)nClDistance * CACHELINE_SIZE;
                if( distance > size / 4 )
                    continue;
                hadTest = true;
                for( size_t t = nTests; t; --t )
                {
                    // flush the lines we're about to scan so every run starts cold
                    for( char *p = begin, *end = p + size; p != end; p += CACHELINE_SIZE )
                        _mm_clflush( p );
                    uint64_t start = __rdtsc();
                    if( nClDistance )
                    {
                        char *p = begin;
                        // touch one line, prefetch `distance` bytes ahead of it
                        for( char *end = begin + size - distance; p < end; p += CACHELINE_SIZE )
                            _mm_prefetch( p + distance, _MM_HINT_NTA ),
                            *(char volatile *)p;
                        // tail without prefetching so we stay inside the block
                        for( char *end = begin + size; p != end; p += CACHELINE_SIZE )
                            *(char volatile *)p;
                    }
                    else
                    {
                        for( char *p = begin, *end = begin + size; p != end; p += CACHELINE_SIZE )
                            *(char volatile *)p;
                    }
                    uint64_t ticks = __rdtsc() - start;
                    if( ticks < fastestTicks )
                        fastestTicks = ticks,
                        fastestDistance = nClDistance;
                    if( !nClDistance && ticks < fastestZeroTicks )
                        fastestZeroTicks = ticks;
                }
            }
            double improvement = (double)(int64_t)fastestZeroTicks / (int64_t)fastestTicks - 1.0;
            if( fastestTicks != numeric_limits<uint64_t>::max() )
                cout << "block-size: " << blockSizeStr( size )
                     << " fastest distance: " << fastestDistance
                     << " cachelines (" << nTests << ") ("
                     << improvement * 100.0 << "%)" << endl;
            avgImprovement += improvement;
            fastestImprovement = improvement > fastestImprovement ? improvement : fastestImprovement;
            avgDiv += hadTest;
        } while( (size *= 2) <= blockSize );
        avgImprovement /= (double)avgDiv;
        cout << "fastest improvement: " << fastestImprovement * 100.0 << "%" << endl;
        cout << "avg. improvement: " << avgImprovement * 100.0 << "%" << endl;
    }

    size_t parseSize( char const *str )
    {
        double dSize;
        from_chars_result fcr = from_chars( str, str + strlen( str ), dSize, chars_format::general );
        if( fcr.ec != errc() )
            return -1;
        if( !*(str = fcr.ptr) || str[1] )
            return -1;
        static const
        struct suffix_t
        {
            char suffix;
            size_t mult;
        } suffixes[]
        {
            { 'k', 1024 },
            { 'm', (size_t)1024 * 1024 },
            { 'g', (size_t)1024 * 1024 * 1024 }
        };
        char cSuf = tolower( *str );
        for( suffix_t const &suf : suffixes )
            if( suf.suffix == cSuf )
            {
                dSize = trunc( dSize * (ptrdiff_t)suf.mult );
                if( dSize < 1.0 || dSize >= (double)numeric_limits<ptrdiff_t>::max() )
                    return -1;
                return (ptrdiff_t)dSize;
            }
        return -1;
    }

    string blockSizeStr( size_t blockSize )
    {
        ostringstream oss;
        if( blockSize < 1024 )
            oss << blockSize;
        else if( blockSize < (size_t)1024 * 1024 )
            oss << (double)blockSize / 1024.0 << "kB";
        else if( blockSize < (size_t)1024 * 1024 * 1024 )
            oss << (double)blockSize / 1024.0 / 1024.0 << "MB";
        else
            oss << (double)blockSize / 1024.0 / 1024.0 / 1024.0 << "GB";
        return oss.str();
    }

  • From Branimir Maksimovic@21:1/5 to Chris M. Thomasson on Sat Oct 2 11:02:07 2021
    On 2021-10-02, Chris M. Thomasson <chris.m.thomasson.1@gmail.com> wrote:
    Automatic prefetchers are dumb.

    Oh.... shit. You make me feel like a full blown moron for even
    responding to you, Bonita. YIKES! Let me guess, you agree with me, and
    say I am stupid for responding to you. ;^)

    lol.
    Enlightenment is when you realise that everything that happens to you
    is from self-beliefs, good or bad; and when you realise that you
    transfer that to others by convincing, good or bad, you start to
    convince only in good, or stop completely, which is even better :P ME

    --

    7-77-777
    Evil Sinner!
    https://github.com/rofl0r/chaos-pp

  • From Bonita Montero@21:1/5 to All on Sun Oct 3 16:00:14 2021
    On 03.10.2021 at 15:33, Marcel Mueller wrote:
    On 01.10.21 at 20:36, Scott Lurndal wrote:
    It's very seldom necessary for an application to provide an explicit
    prefetching hint except in very unusual circumstances. And most
    programmers trying to insert hints manually will get it wrong.
    The behavior of such is also heavily microarchitecture dependent,
    so what works on one chip may really slow things down on another.

    I can confirm this.
    I did several tests with __builtin_prefetch to reduce the collision rate
    in lock free algorithms. ...

    Why should a lockfree algorithm employ prefetching?

  • From Marcel Mueller@21:1/5 to All on Sun Oct 3 15:33:35 2021
    On 01.10.21 at 20:36, Scott Lurndal wrote:
    It's very seldom necessary for an application to provide an explicit
    prefetching hint except in very unusual circumstances. And most
    programmers trying to insert hints manually will get it wrong.
    The behavior of such hints is also heavily microarchitecture
    dependent, so what works on one chip may really slow things down on
    another.

    I can confirm this.
    I did several tests with __builtin_prefetch to reduce the collision rate
    in lock free algorithms. While this worked on some platforms it did not
    work or is even counterproductive on other platforms. I am in doubt that
    there is any use of prefetching in /platform independent code/.
    Of course, if you are coding only for a specific set of similar
    platforms the situation changes.


    Marcel

  • From Bonita Montero@21:1/5 to All on Mon Oct 4 16:30:42 2021
    There's the Unix command wc, which counts words and lines. The wc
    implementation in the current GNU core utilities contains an optional,
    very tricky AVX implementation. This improves the speed of wc on my
    Linux computer by a factor of 29.
    I improved this algorithm further by partitioning the data into three
    parts which I handle interleaved, i.e. one 32-byte chunk from each
    part in lockstep, incrementing the common offset by 32 each round.
    The original algorithm has a dependency chain which limits out-of
    -order execution. I partitioned the data into three and not four
    parts because there wouldn't be enough integer registers - I need 14
    in my interleaving loop. With four parts I'd have a lot of register
    spilling and reloading, which brings performance back down to about
    that of the original algorithm.
    The speedup of the interleaved code over the original wc algorithm
    is about 60%.
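
    Stripped to its essence, the interleaving idea is three independent
    scans over three partitions advanced in lockstep, so each loop
    iteration carries three short dependency chains instead of one long
    one (a simplified byte-wise sketch of the principle, not the AVX code
    below; it only counts spaces and assumes count is divisible by 3):

    // Three accumulators walk three partitions at a common offset; the
    // additions into nA, nB and nC are independent of each other and can
    // overlap in the out-of-order core.
    size_t countSpaces3Way( char const *mem, size_t count )
    {
        size_t part = count / 3;
        char const *a = mem, *b = mem + part, *c = mem + 2 * part;
        size_t nA = 0, nB = 0, nC = 0;
        for( size_t i = 0; i != part; ++i )
        {
            nA += a[i] == ' ';
            nB += b[i] == ' ';
            nC += c[i] == ' ';
        }
        return nA + nB + nC;
    }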
    The reason why I tell this is that I benchmarked the code under
    different conditions. I also added prefetching and partitioning to
    the trivial algorithm, and each can be switched on independently,
    like in the AVX code. These are the results on my Linux
    Ryzen 7 1800X:

    trivial / non-interleaved / non-prefetched
    1 thread: 468MB/s
    trivial / non-interleaved / prefetched
    1 thread: 492MB/s
    trivial / interleaved / non-prefetched
    1 thread: 778MB/s
    trivial / interleaved / prefetched
    1 thread: 694MB/s
    AVX / non-interleaved / non-prefetched
    1 thread: 13731MB/s
    AVX / non-interleaved / prefetched
    1 thread: 13757MB/s
    AVX / interleaved / non-prefetched
    1 thread: 19722MB/s
    AVX / interleaved / prefetched
    1 thread: 23558MB/s

    As you can see, manual prefetching gives only a small gain for the
    trivial non-interleaved code, and the trivial interleaved /
    prefetched code is even slower than the trivial interleaved / non
    -prefetched code. But for the AVX code there's a significant speedup
    of the interleaved / prefetched code over the interleaved / non
    -prefetched code. So there are cases where prefetching gives a
    significant speedup.
    With interleaving there are more complex memory access patterns, and
    I suspect that the prefetcher doesn't work as well under such
    conditions.

    If you're interested in the code: the relevant functions are the
    lambdas trivialSpaceCount and avxSpaceCount. The code compiles only
    as C++20.

    #if defined(_MSC_VER)
    #include <Windows.h>
    #elif defined(__unix__)
    #include <pthread.h>
    #endif
    #include <iostream>
    #include <utility>
    #include <fstream>
    #include <vector>
    #include <cstdint>
    #include <algorithm>
    #include <chrono>
    #include <thread>
    #include <mutex>
    #include <condition_variable>
    #include <atomic>
    #include <vector>
    #include <cstdlib>
    #include <charconv>
    #include <cmath>
    #include <sstream>
    #include <limits>
    #include <cctype>
    #include <functional>
    #include <array>
    #include <string.h>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #elif defined(__GNUC__)
    #include <x86intrin.h>
    #include <cpuid.h>
    #endif

    #if defined(_MSC_VER)
    #pragma warning(disable: 26495)
    #endif

    using namespace std;
    using namespace chrono;

    struct cmline_params
    {
    char const *fileName;
    size_t blockSize;
    unsigned nCPUs;
    bool invert;
    enum class priority_t : unsigned
    {
    UNSET, NORMAL, HIGH, REALTIME, BEST_AS_CAN
    } priority;
    vector<string> parse( int argc, char const *const *argv );
    };

    static void setThreadAffinity( thread::native_handle_type handle,
    unsigned affinity );
    static unsigned popCnt32( uint32_t value );
    static vector<char> readFileRepeated( char const *fileName, size_t
    blockSize );
    static int xstricmp( char const *a, char const *b );

    int main( int argc, char **argv )
    {
    cmline_params params;
    vector<string> errs( params.parse( argc, argv ) );
    if( errs.size() )
    {
    for( string &err : errs )
    cout << err << endl;
    return EXIT_FAILURE;
    }
    #if defined(_MSC_VER)
    if( params.priority != cmline_params::priority_t::UNSET )
    {
    auto setPriority = []( DWORD dwPriorityClass )
    {
    // SetPriorityClass always returns false !
    SetPriorityClass( GetCurrentProcess(), dwPriorityClass );
    return GetPriorityClass( GetCurrentProcess() ) == dwPriorityClass;
    };
    static const
    struct prio_map_t
    {
    cmline_params::priority_t priority;
    DWORD dwPriorityClass;
    } prioMappings[] =
    {
    { cmline_params::priority_t::NORMAL, NORMAL_PRIORITY_CLASS },
    { cmline_params::priority_t::HIGH, HIGH_PRIORITY_CLASS },
    { cmline_params::priority_t::REALTIME, REALTIME_PRIORITY_CLASS }
    };
    DWORD dwPriorityClass = -1;
    for( prio_map_t const &pm : prioMappings )
    if( pm.priority == params.priority )
    {
    dwPriorityClass = pm.dwPriorityClass;
    break;
    }
    if( dwPriorityClass != -1 )
    if( !setPriority( dwPriorityClass ) )
    return EXIT_FAILURE;
    else;
    else
    {
    ptrdiff_t p = 2;
    bool succ;
    do
    succ = setPriority( prioMappings[p].dwPriorityClass );
    while( !succ && --p >= 0 );
    }
    }
    #endif
    vector<char> block;
    try
    {
    if( !(block = readFileRepeated( params.fileName, params.blockSize
    )).size() )
    throw 123;
    }
    catch( ... )
    {
    cout << "error reading file" << endl;
    return EXIT_FAILURE;
    }
    struct words_and_lines
    {
    size_t words, lines;
    words_and_lines( size_t words = 0, size_t lines = 0 ) :
    words( words ), lines( lines )
    {
    }
    };
    using count_fn_t = void (*)( words_and_lines &, char *, size_t, bool, bool * );
    struct state_t
    {
    char *mem;
    bool wasSpace;
    words_and_lines counters;
    state_t( char *mem, bool wasSpace, words_and_lines const &counters ) :
    mem( mem ),
    wasSpace( wasSpace ),
    counters( counters )
    {
    }
    };
    static size_t const PREFETCH_DISTANCE = 32 * 64;
    static
    auto trivialSpaceCount = []( bool interleave, bool prefetch, words_and_lines &counters, char *mem, size_t count, bool extend, bool *pWasSpace )
    {
    bool wasSpace = pWasSpace ? *pWasSpace : false;
    if( !count )
    {
    counters.words += !extend && !wasSpace;
    return;
    }
    auto stateBlock = [&]<bool prefetch>( state_t &state, size_t offset )
    {
    if constexpr( prefetch )
    _mm_prefetch( &state.mem[offset] + PREFETCH_DISTANCE, _MM_HINT_NTA );
    bool isSpace = (unsigned char)state.mem[offset] <= ' ';
    // compare against this partition's own previous-byte state, not the
    // outer wasSpace, which only holds the state of the first partition
    state.counters.words += isSpace && !state.wasSpace;
    state.counters.lines += state.mem[offset] == '\n';
    state.wasSpace = isSpace;
    };
    if( interleave && count >= 3 )
    {
    size_t partitionSize = count / 3;
    char *ends[] = { mem + partitionSize, mem + partitionSize * 2 };
    state_t states[] =
    {
    state_t( mem, wasSpace, words_and_lines() ),
    state_t( ends[0], (unsigned char)ends[0][-1] <= ' ', words_and_lines() ),
    state_t( ends[1], (unsigned char)ends[1][-1] <= ' ', words_and_lines() ),
    };
    size_t offset = 0;
    if( prefetch )
    for( ; (ptrdiff_t)offset < (ptrdiff_t)(partitionSize - PREFETCH_DISTANCE); ++offset )
    stateBlock.operator ()<true>( states[0], offset ),
    stateBlock.operator ()<true>( states[1], offset ),
    stateBlock.operator ()<true>( states[2], offset );
    for( ; offset != partitionSize; ++offset )
    stateBlock.operator ()<false>( states[0], offset ),
    stateBlock.operator ()<false>( states[1], offset ),
    stateBlock.operator ()<false>( states[2], offset );
    mem += partitionSize * 3;
    count -= partitionSize * 3;
    counters.words += states[0].counters.words + states[1].counters.words + states[2].counters.words;
    counters.lines += states[0].counters.lines + states[1].counters.lines + states[2].counters.lines;
    wasSpace = states[2].wasSpace;
    }
    if( count )
    {
    state_t state( mem, wasSpace, counters );
    size_t offset = 0;
    if( prefetch )
    for( ; (ptrdiff_t)offset < (ptrdiff_t)(count - PREFETCH_DISTANCE); ++offset )
    stateBlock.operator ()<true>( state, offset );
    for( ; offset != count; ++offset )
    stateBlock.operator ()<false>( state, offset );
    counters = state.counters;
    wasSpace = state.wasSpace;
    }
    if( pWasSpace )
    *pWasSpace = wasSpace;
    };
    static
    auto avxSpaceCount = []( bool interleave, bool prefetch, words_and_lines &counters, char *mem, size_t count, bool extend, bool *pWasSpace )
    {
    bool wasSpace = pWasSpace ? *pWasSpace : false;
    if( !count )
    {
    counters.words += !extend && !wasSpace;
    return;
    }
    size_t prefix = ((size_t)mem + 31 & -32) - (size_t)mem;
    prefix = prefix <= count ? prefix : count;
    trivialSpaceCount( interleave, prefetch, counters, mem, prefix, count > prefix || extend, &wasSpace );
    mem += prefix;
    count -= prefix;
    if( count >= 32 )
    {
    __m256i spaces = _mm256_set1_epi8( ' ' + 1 ),
    newlines = _mm256_set1_epi8( '\n' );
    auto stateBlock = [&]<bool prefetch>( state_t &state, size_t offset )
    {
    if constexpr( prefetch )
    _mm_prefetch( &state.mem[offset] + PREFETCH_DISTANCE, _MM_HINT_NTA );
    __m256i chunk = _mm256_load_si256( (__m256i *)&state.mem[offset] );
    uint32_t isSpaceMask = _mm256_movemask_epi8( _mm256_andnot_si256(
    chunk, _mm256_sub_epi8( chunk, spaces ) ) ),
    wasSpaceMask = isSpaceMask << 1 | (uint32_t)state.wasSpace,
    newlineMask = _mm256_movemask_epi8( _mm256_cmpeq_epi8(
    chunk, newlines ) );
    state.counters.words += popCnt32( isSpaceMask & ~wasSpaceMask );
    state.counters.lines += popCnt32( newlineMask );
    state.wasSpace = (int32_t)isSpaceMask < 0 ? 1 : 0;
    };
    if( interleave && count >= (3 * 32) )
    {
    size_t partitionSize = count / (3 * 32) * 32;
    char *ends[] = { mem + partitionSize, mem + partitionSize * 2 };
    state_t states[] =
    {
    state_t( mem, wasSpace, words_and_lines() ),
    state_t( ends[0], (unsigned char)ends[0][-1] <= ' ', words_and_lines() ),
    state_t( ends[1], (unsigned char)ends[1][-1] <= ' ', words_and_lines() ),
    };
    size_t offset = 0;
    if( prefetch )
    // with prefetching
    for( ; (ptrdiff_t)offset < (ptrdiff_t)(partitionSize -
    PREFETCH_DISTANCE); offset += 32 )
    stateBlock.operator ()<true>( states[0], offset ),
    stateBlock.operator ()<true>( states[1], offset ),
    stateBlock.operator ()<true>( states[2], offset );
    // without prefetching
    for( ; offset != partitionSize; offset += 32 )
    stateBlock.operator ()<false>( states[0], offset ),
    stateBlock.operator ()<false>( states[1], offset ),
    stateBlock.operator ()<false>( states[2], offset );
    mem += partitionSize * 3;
    count -= partitionSize * 3;
    counters.words += states[0].counters.words + states[1].counters.words + states[2].counters.words;
    counters.lines += states[0].counters.lines + states[1].counters.lines + states[2].counters.lines;
    wasSpace = states[2].wasSpace;
    }
    if( count >= 32 )
    {
    state_t state( mem, wasSpace, counters );
    size_t offset = 0;
    do
    stateBlock.operator ()<false>( state, offset );
    while( (offset += 32) != (count & -32) );
    mem += count & -32;
    count %= 32;
    counters = state.counters;
    wasSpace = state.wasSpace;
    }
    }
    trivialSpaceCount( interleave, prefetch, counters, mem, count, extend,
    &wasSpace );
    if( pWasSpace )
    *pWasSpace = wasSpace;
    };
    using spacecount_fn_t = function<void( bool, bool, words_and_lines &, char *, size_t, bool, bool * )>;
    struct descr_count_fn
    {
    char const *descr;
    spacecount_fn_t countFn;
    };
    static
    array<descr_count_fn, 2> const descrCountFns(
    {
    { "trivial", bind( trivialSpaceCount, placeholders::_1, placeholders::_2, placeholders::_3, placeholders::_4, placeholders::_5, placeholders::_6, placeholders::_7 ) },
    { "AVX", bind( avxSpaceCount, placeholders::_1, placeholders::_2,
    placeholders::_3, placeholders::_4, placeholders::_5, placeholders::_6, placeholders::_7 ) }
    } );
    mutex mtx;
    unsigned ready;
    condition_variable cvRready;
    bool run;
    condition_variable cvRun;
    atomic_int64_t sumDur;
    auto theThread = [&]( bool interleave, bool prefetch, spacecount_fn_t const &countFn, char *mem, size_t blockSize, size_t repeats )
    {
    unique_lock<mutex> lock( mtx );
    if( !--ready )
    cvRready.notify_one();
    cvRun.wait( lock, [&]() -> bool { return run; } );
    lock.unlock();
    auto start = high_resolution_clock::now();
    size_t volatile sum = 0;
    words_and_lines wordsAndLines;
    for( size_t r = repeats; r; --r )
    {
    wordsAndLines = words_and_lines();
    sum = 0;
    countFn( interleave, prefetch, wordsAndLines, mem, blockSize, false,
    nullptr );
    sum += wordsAndLines.words + wordsAndLines.lines;
    }
    sumDur += (int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    };
    vector<thread> threads;
    threads.reserve( params.nCPUs );
    #if defined(NDEBUG)
    double const MBS = 256.0;
    #else
    double const MBS = 1.0;
    #endif
    size_t repeats = (ptrdiff_t)(MBS * 1000 * 1000 / (ptrdiff_t)params.blockSize + 0.5);
    repeats += repeats == 0;
    unsigned hc = thread::hardware_concurrency();
    using vresult_t = vector<double>;
    using vvresult_t = vector<vresult_t>;
    for( descr_count_fn const &dfn : descrCountFns )
    {
    for( unsigned interleave = 0; interleave <= 1; ++interleave )
    for( unsigned prefetch = 0; prefetch <= 1; ++prefetch )
    {
    std::cout << dfn.descr;

    cout << (!interleave ? " / non-interleaved" : " / interleaved");
    cout << (!prefetch ? " / non-prefetched" : " / prefetched ") << endl;
    for( unsigned nThreads = 1; nThreads <= params.nCPUs; ++nThreads )
    {
    ready = nThreads;
    run = false;
    sumDur = 0;
    threads.resize( 0 );
    for( unsigned t = 0; t != nThreads; ++t )
    {
    threads.emplace_back( theThread, (bool)interleave, (bool)prefetch,
    ref( dfn.countFn ), &block[0], params.blockSize, repeats );
    unsigned affinity = !params.invert ? t : (t % 2) * (hc / 2) + t / 2;
    setThreadAffinity( threads.back().native_handle(), affinity );
    }
    unique_lock<mutex> lock( mtx );
    cvRready.wait( lock, [&]() -> bool { return !ready; } );
    run = true;
    cvRun.notify_all();
    lock.unlock();
    for( thread &thr : threads )
    thr.join();
    static double const MEGABYTE = 1000.0 * 1000.0;
    double secs = sumDur / (1.0e9 * nThreads),
    mbsPerSec = ((double)nThreads * (ptrdiff_t)params.blockSize *
    (ptrdiff_t)repeats / MEGABYTE) / secs;
    std::cout << "\t\t" << nThreads << (nThreads > 1 ? " threads: " : "
    thread: ") << (int64_t)(mbsPerSec + 0.5) << "MB/s";
    cout << endl;
    }
    }
    }
    }

    inline
    unsigned popCnt32( uint32_t value )
    {
    #if defined(_MSC_VER)
    return __popcnt( value );
    #elif defined(__GNUC__)
    return __builtin_popcount( value );
    #endif
    }

    vector<string> cmline_params::parse( int argc, char const *const *argv )
    {
    vector<string> errs;
    unsigned hc = thread::hardware_concurrency();
    if( hc )
    {
    fileName = nullptr;
    blockSize = (size_t)256 * 1024 * 1024;
    nCPUs = hc;
    invert = false;
    priority = priority_t::UNSET;
    }
    else
    errs.emplace_back( "thread::hardware_concurrency() == 0" );
    char const *const *param = argv + 1,
    *const *paramEnd = argv + argc;
    auto addErrString = [&]( char const *prefix, char const *param )
    {
    ostringstream oss;
    oss << ": \"" << param << "\"";
    errs.emplace_back( prefix + oss.str() );
    };
    while( param < paramEnd )
    {
    if( xstricmp( *param, "--file" ) == 0 )
    {
    if( ++param == paramEnd )
    {
    errs.emplace_back( "supply filename !" );
    goto ret;
    }
    fileName = *param++;
    continue;
    }
    if( xstricmp( *param, "--size" ) == 0 )
    {
    if( ++param == paramEnd )
    {
    errs.emplace_back( "supply size !" );
    goto ret;
    }
    char const *sizeParam = *param;
    double dSizeParam;
    from_chars_result fcr = from_chars( sizeParam, sizeParam + strlen(
    sizeParam ), dSizeParam, chars_format::general );
    auto invalidSize = [&]()
    {
    addErrString( "invalid size", *param );
    };
    if( fcr.ec == errc() && (dSizeParam = trunc( dSizeParam )) >= 1.0 )
    {
    char const *suffixPtr = fcr.ptr;
    static const
    struct suffix_mult
    {
    char suffix;
    size_t mult;
    } sms[]
    {
    { 'g', (size_t)1000 * 1000 * 1000 },
    { 'm', (size_t)1000 * 1000 },
    { 'k', (size_t)1000},
    { 'b', (size_t)1 }
    };
    suffix_mult const *pSm = nullptr;
    if( *suffixPtr )
    {
    char suffix = tolower( *suffixPtr );
    for( suffix_mult const &sm : sms )
    if( suffix == sm.suffix )
    {
    pSm = &sm;
    break;
    }
    }
    static
    auto dblSizeCvt = []( double dbl, size_t &st ) -> bool
    {
    if( dbl >= (double)((int64_t)1 << 53) )
    return false;
    st = (ptrdiff_t)dbl;
    return true;
    };
    if( pSm )
    if( !suffixPtr[1] )
    if( !dblSizeCvt( dSizeParam * (ptrdiff_t)pSm->mult, blockSize ) )
    invalidSize();
    else;
    else
    invalidSize();
    else
    if( !*suffixPtr )
    if( !dblSizeCvt( dSizeParam, blockSize ) )
    invalidSize();
    else;
    else
    invalidSize();
    }
    else
    invalidSize();
    ++param;
    continue;
    }
    if( xstricmp( *param, "--bound" ) == 0 )
    {
    if( ++param == paramEnd )
    {
    errs.emplace_back( "supply CPU-bound !" );
    goto ret;
    }
    unsigned cpuBound = -1;
    from_chars_result fcr = from_chars( *param, *param + strlen( *param
    ), cpuBound );
    if( fcr.ec == errc() && !*fcr.ptr )
    nCPUs = nCPUs <= cpuBound ? nCPUs : cpuBound;
    else
    addErrString( "invalid CPU-bound", argv[3] );
    ++param;
    continue;
    }
    #if defined(_MSC_VER)
    static const
    struct str_prio
    {
    char const *prioStr;
    priority_t priority;
    } prios[] =
    {
    { "--normal", priority_t::NORMAL },
    { "--high", priority_t::HIGH },
    { "--realtime", priority_t::REALTIME },
    { "--best", priority_t::BEST_AS_CAN },
    };
    bool prioSet = false;
    for( str_prio const &strPrio : prios )
    if( xstricmp( *param, strPrio.prioStr ) == 0 )
    {
    priority = strPrio.priority;
    ++param;
    prioSet = true;
    break;
    }
    if( prioSet )
    continue;
    #endif
    if( xstricmp( *param, "--invert" ) == 0 )
    {
    ++param;
    int cpuIdRegs[2][4];
    #if defined(_MSC_VER)
    __cpuid( cpuIdRegs[0], 0 );
    __cpuid( cpuIdRegs[1], 1 );
    #elif defined(__GNUC__)
    __cpuid(0, cpuIdRegs[0][0], cpuIdRegs[0][1], cpuIdRegs[0][2],
    cpuIdRegs[0][3]);
    __cpuid(1, cpuIdRegs[1][0], cpuIdRegs[1][1], cpuIdRegs[1][2],
    cpuIdRegs[1][3]);
    #endif
    if( (unsigned)cpuIdRegs[0][0] < 1 || !((unsigned)cpuIdRegs[1][3] & 1
    << 28) )
    {
    errs.emplace_back( "inversion impossible - CPU hasn't SMT" );
    continue;
    }
    invert = true;
    continue;
    }
    addErrString( "invalid option", *param++ );
    }
    ret:
    if( !fileName )
    errs.insert( errs.begin(), "supply filename !" );
    return errs;
    }

    static
    vector<char> readFileRepeated( char const *fileName, size_t blockSize )
    {
    if( !blockSize )
    return vector<char>();
    ifstream ifs;
    ifs.exceptions( ifstream::failbit | ifstream::badbit );
    ifs.open( fileName, ifstream::binary );
    ifs.seekg( 0, ios_base::end );
    streampos fileSize = ifs.tellg();
    if( !fileSize || fileSize > (size_t)-1 )
    return vector<char>();
    ifs.seekg( 0, ios_base::beg );
    vector<char> block( blockSize, 0 );
    size_t repSize = (size_t)fileSize <= blockSize ? (size_t)fileSize : blockSize;
    ifs.read( &*block.begin(), repSize );
    bool lastNewline = block[repSize - 1] == '\n';
    size_t remaining = block.size() - repSize;
    do
    {
    size_t cpy = remaining >= repSize ? repSize : remaining;
    copy( block.begin(), block.begin() + cpy, block.end() - remaining );
    remaining -= cpy;
    if( !lastNewline && remaining )
    block.end()[-(ptrdiff_t)remaining--] = '\n';
    } while( remaining );
    return block;
    }

    static
    void setThreadAffinity( thread::native_handle_type handle, unsigned
    affinity )
    {
    #if defined(_MSC_VER)
    SetThreadAffinityMask( handle, (DWORD_PTR)1 << affinity );
    #elif defined(__unix__)
    cpu_set_t cpuSet;
    CPU_ZERO(&cpuSet);
    CPU_SET(affinity, &cpuSet);
    pthread_setaffinity_np( handle, sizeof cpuSet, &cpuSet );
    #endif
    }

    inline
    int xstricmp( char const *a, char const *b )
    {
    using uchar_t = unsigned char;
    uchar_t lA, lB;
    for( size_t i = 0; a[i] | b[i]; ++i )
    if( (lA = tolower( a[i] )) != (lB = tolower( b[i] )) )
    return (int)lA - (int)lB;
    return 0;
    }
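
    A typical build and invocation (my command line, not from the post;
    assuming GCC on Linux, a hypothetical source file name, and the
    --file/--size options implemented by the parser above):

        g++ -O2 -std=c++20 -mavx2 -pthread wc_bench.cpp
        ./a.out --file bible.txt --size 256m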

  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Mon Oct 4 16:36:24 2021
    On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    There's the Unix command wc, which counts words and lines. The wc
    implementation in the current GNU core utilities contains an optional,
    very tricky AVX implementation. This improves the speed of wc on my
    Linux computer by a factor of 29.
    I improved this algorithm further by partitioning the data into three
    parts which I handle interleaved, i.e. 32-byte chunks synchronously

    static
    vector<char> readFileRepeated( char const *fileName, size_t blockSize )
    [function body snipped - quoted in full upthread]

    Talking about efficiency :P
    Who will pay you for overcomplicating simple things?

    --

    7-77-777
    Evil Sinner!
    to weak you should be meek, and you should brainfuck stronger https://github.com/rofl0r/chaos-pp

  • From Bonita Montero@21:1/5 to All on Mon Oct 4 18:55:28 2021
    On 04.10.2021 at 18:36, Branimir Maksimovic wrote:
    On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    There's the Unix command wc, which counts words and lines. The wc
    implementation in the current GNU core utilities contains an optional,
    very tricky AVX implementation. This improves the speed of wc on my
    Linux computer by a factor of 29.
    I improved this algorithm further by partitioning the data into three
    parts which I handle interleaved, i.e. 32-byte chunks synchronously

    static
    vector<char> readFileRepeated( char const *fileName, size_t blockSize )
    [function body snipped - quoted in full upthread]

    Talking about efficiency :P
    Who will pay you for overcomplicating simple things?

    Why should this be overcomplicated? I repeatedly copy a file into a
    buffer until the buffer is full; maybe not even once completely, if
    the file doesn't fit into the buffer. That's the most direct way.

  • From Marcel Mueller@21:1/5 to All on Mon Oct 4 21:59:20 2021
    On 03.10.21 at 16:00, Bonita Montero wrote:
    On 03.10.2021 at 15:33, Marcel Mueller wrote:
    I did several tests with __builtin_prefetch to reduce the collision
    rate in lock free algorithms. ...

    Why should a lockfree algorithm employ prefetching?

    Prefetch may access invalid memory without faulting. So prefetching a
    shared memory area behind a pointer can significantly decrease the
    probability of a failed CAS when implementing strong thread safety.
    But on some platforms I observed excessive cacheline hopping with
    this strategy.
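
    A sketch of the pattern being described (my reconstruction, not
    Marcel's code; GCC/Clang __builtin_prefetch on a Treiber-stack pop;
    the point is that the hint may legally target a node another thread
    has already popped and freed, because a prefetch never faults):

    #include <atomic>

    struct Node { Node *next; };

    Node *pop( std::atomic<Node *> &head )
    {
        Node *old = head.load( std::memory_order_acquire );
        while( old )
        {
            // Warm the cacheline behind the shared pointer before the
            // dereference and CAS; safe even if old is already stale,
            // since a prefetch cannot fault. Shortening the window
            // between loading old->next and the CAS lowers the CAS
            // failure rate, but can also cause the cacheline hopping
            // mentioned above.
            __builtin_prefetch( old, 0, 3 );
            Node *next = old->next; // real code also needs ABA protection
            if( head.compare_exchange_weak( old, next,
                    std::memory_order_acquire, std::memory_order_acquire ) )
                break;
        }
        return old;
    }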


    Marcel

  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Mon Oct 4 20:22:48 2021
    On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 04.10.2021 at 18:36, Branimir Maksimovic wrote:
    On 2021-10-04, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    There's the Unix command wc, which counts words and lines. The wc
    implementation in the current GNU core utilities contains an optional,
    very tricky AVX implementation. This improves the speed of wc on my
    Linux computer by a factor of 29.
    I improved this algorithm further by partitioning the data into three
    parts which I handle interleaved, i.e. 32-byte chunks synchronously

    static
    vector<char> readFileRepeated( char const *fileName, size_t blockSize )
    [function body snipped - quoted in full upthread]

    Talking about efficiency :P
    Who will pay you for overcomplicating simple things?

    Why should this be overcomplicated? I repeatedly copy a file into a
    buffer until the buffer is full; maybe not even once completely, if
    the file doesn't fit into the buffer. That's the most direct way.
    Take a look at this simple and professionally done program that does
    all that :P
    (Ian Collins is the author, I think :P)
    #include <map>
    #include <unordered_map>
    #include <iostream>
    #include <fstream>
    #include <algorithm>
    #include <iomanip>
    #include <string>
    #include <cctype>

    using namespace std;
    using Pairs = unordered_map<string,int>;

    void fill( Pairs& pairs, char c )
    {
        static string word;

        if( ispunct(c) ) return;

        if( isspace(c) )
        {
            if( word.size() )
            {
                pairs[word]++;
                word.clear();
            }
        }
        else
        {
            word += tolower(c);
        }
    }

    int main()
    {
        ifstream bible {"bible.txt"};

        using citerator = istreambuf_iterator<char>;

        Pairs pairs;

        for_each( citerator(bible.rdbuf()), citerator(),
                  [&pairs]( char c ){ fill( pairs, c ); } );

        multimap<unsigned,string> sorted;

        // Sort the {word, count} pairs.
        for_each( pairs.begin(), pairs.end(),
                  [&sorted]( const Pairs::value_type& p )
                  { sorted.insert(make_pair(p.second,p.first)); } );

        // Print the top 20.
        auto item = sorted.rbegin();

        for( auto n = 0; n < 20; ++n, ++item )
        {
            cout << "Position " << setw(2) << n+1
                 << ": count = " << setw(6) << item->first
                 << " " << item->second << '\n';
        }

        return 0;
    }

    --

    7-77-777
    Evil Sinner!
    to weak you should be meek, and you should brainfuck stronger https://github.com/rofl0r/chaos-pp

  • From Bonita Montero@21:1/5 to All on Tue Oct 5 07:11:14 2021
    On 04.10.2021 at 21:59, Marcel Mueller wrote:

    Prefetch may access invalid memory without faulting. So prefetching a
    shared memory area behind a pointer can significantly decrease the
    probability of a failed CAS when implementing strong thread safety.
    But on some platforms I observed excessive cacheline hopping with
    this strategy.

    That doesn't make sense. When you prefetch you usually process a lot
    of data before the point you prefetched. When you have CASes you
    repeatedly process the same data; prefetching here is nonsense.

  • From Bonita Montero@21:1/5 to All on Tue Oct 5 07:08:14 2021
    On 04.10.2021 at 22:22, Branimir Maksimovic wrote:
    [...]
    Take a look at this simple and professionally done program that does
    all that :P
    (Ian Collins is the author, I think :P)
    [word-count program snipped - quoted in full upthread]

    Ok, you don't understand what I do.

  • From Branimir Maksimovic@21:1/5 to Bonita Montero on Tue Oct 5 09:55:46 2021
    On 2021-10-05, Bonita Montero <Bonita.Montero@gmail.com> wrote:
    On 04.10.2021 at 21:59, Marcel Mueller wrote:

    Prefetch may access invalid memory without faulting. So prefetching a
    shared memory area behind a pointer can significantly decrease the
    probability of a failed CAS when implementing strong thread safety.
    But on some platforms I observed excessive cacheline hopping with
    this strategy.

    That doesn't make sense. When you prefetch you usually process a lot
    of data before the point you prefetched. When you have CASes you
    repeatedly process the same data; prefetching here is nonsense.

    Prefetching is nonsense in HLL :P

    --

    7-77-777
    Evil Sinner!
    to weak you should be meek, and you should brainfuck stronger https://github.com/rofl0r/chaos-pp

  • From Bonita Montero@21:1/5 to All on Tue Oct 5 13:23:02 2021
    On 05.10.2021 at 11:55, Branimir Maksimovic wrote:

    Prefetch may access invalid memory without faulting. So prefetching a
    shared memory area behind a pointer can significantly decrease the
    probability of a failed CAS when implementing strong thread safety.
    But on some platforms I observed excessive cacheline hopping with
    this strategy.

    That doesn't make sense. When you prefetch you usually process a lot
    of data before the point you prefetched. When you have CASes you
    repeatedly process the same data; prefetching here is nonsense.

    Prefetching is nonsense in HLL :P

    No, automatic prefetchers are dumb, and there are a lot of patterns
    they're unable to predict. With my 3-way interleaved access I've
    even shown a very simple pattern where manual prefetching helps.
