• CLZERO

    From Bonita Montero@21:1/5 to All on Mon May 16 13:58:56 2022
    AMD x86 CPUs since Zen 1 have an instruction called CLZERO.
    According to WikiChip it is meant for recovering from certain memory
    errors, but this is pure nonsense. A posting on the LKML reveals
    the actual purpose: it zeroes memory quickly without polluting
    the cache, i.e. clzero is non-temporal.
    I thought it would be nice to have a comparison between a looped
    clzero and a plain memset, which itself is usually very well
    optimized by today's compilers. So I wrote a little benchmark
    in C++20 to compare both:

    #include <iostream>
    #include <chrono>
    #include <vector>
    #include <memory>
    #include <type_traits>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #elif defined(__GNUC__) || defined(__clang__)
    #include <x86intrin.h>
    #endif

    using namespace std;
    using namespace chrono;

    template<bool MemSet = false>
    size_t clZeroRange( void *p, size_t n );

    int main()
    {
        constexpr size_t
            N = 0x4000000,   // 64 MiB buffer
            ROUNDS = 1'000;
        vector<char> vc( N, 0 );
        // prints the throughput in GiB/s for one variant
        auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
        {
            auto start = high_resolution_clock::now();
            size_t n = 0;
            for( size_t r = ROUNDS; r--; )
                n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
            double secs = (double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9;
            double GBS = (double)(ptrdiff_t)n / 0x1.0p30; // GiB zeroed in total
            cout << GBS / secs << endl;
        };
        bench( false_type() ); // clzero loop
        bench( true_type() );  // memset
    }

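    // with g++ / clang++, _mm_clzero needs -mclzero (or -march=znver1 or later)
    template<bool MemSet>
    size_t clZeroRange( void *p, size_t n )
    {
        // round p up to the next 64-byte cache line boundary
        char *pAlign = (char *)(((size_t)p + 63) & (ptrdiff_t)-64);
        // drop the skipped head bytes, then truncate to whole cache lines
        n -= pAlign - (char *)p;
        n &= (ptrdiff_t)-64;
        if constexpr( !MemSet )
            for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
                _mm_clzero( pAlign );
        else
            memset( pAlign, 0, n ); // same aligned range as the clzero loop
        return n;
    }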
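
The alignment arithmetic in clZeroRange can be checked in isolation. The following is only a sketch: alignedRange, AlignedRange and CACHE_LINE are hypothetical names that are not part of the benchmark, and unlike the benchmark code it returns a length of zero when the range is too short to contain a full cache line.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; CLZERO operates on 64-byte lines.
constexpr std::size_t CACHE_LINE = 64;

// Hypothetical result type: first aligned address and usable byte count.
struct AlignedRange
{
    std::uintptr_t begin;
    std::size_t bytes;
};

// Rounds addr up to the next cache line boundary and truncates the
// remaining length to a whole number of cache lines.
constexpr AlignedRange alignedRange( std::uintptr_t addr, std::size_t n )
{
    std::uintptr_t begin = (addr + CACHE_LINE - 1) & ~(std::uintptr_t)(CACHE_LINE - 1);
    std::size_t skipped = (std::size_t)(begin - addr);
    if( skipped >= n )
        return { begin, 0 }; // no complete cache line inside the range
    return { begin, (n - skipped) & ~(CACHE_LINE - 1) };
}

// 0x1001 rounds up to 0x1040 (63 bytes skipped); 137 bytes remain,
// of which two whole cache lines (128 bytes) are usable.
static_assert( alignedRange( 0x1001, 200 ).begin == 0x1040 );
static_assert( alignedRange( 0x1001, 200 ).bytes == 128 );
// An already aligned address is left unchanged.
static_assert( alignedRange( 0x1000, 64 ).begin == 0x1000 );
static_assert( alignedRange( 0x1000, 64 ).bytes == 64 );
```

For the 64 MiB vector of the benchmark at most 63 head bytes and 63 tail bytes are left out, so the effect on the measured throughput is negligible.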

    Interestingly, I get the same performance for both variants with
    MSVC++ 2022. With g++ / glibc, memset() reaches only about one
    third of the throughput of the clzero() solution. I think the
    memset() of glibc just isn't optimized as well. The memset()
    of Visual C++ uses non-temporal SSE stores, which explains its
    good performance.
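
To illustrate what zeroing with non-temporal SSE stores looks like, here is a minimal sketch. It is not the actual Visual C++ memset() implementation; zeroNonTemporal is a hypothetical helper that assumes p is 16-byte aligned and n is a multiple of 16. On targets without SSE2 it simply falls back to memset().

```cpp
#include <cstddef>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

// Hypothetical helper: zeroes n bytes at p with streaming stores.
// Assumptions: p is 16-byte aligned and n is a multiple of 16.
void zeroNonTemporal( void *p, std::size_t n )
{
#if defined(HAVE_SSE2_STREAM)
    __m128i zero = _mm_setzero_si128();
    char *cur = (char *)p, *end = cur + n;
    for( ; cur != end; cur += 16 )
        _mm_stream_si128( (__m128i *)cur, zero ); // store with a non-temporal hint
    _mm_sfence(); // order the streaming stores before later stores
#else
    std::memset( p, 0, n ); // fallback without SSE2: ordinary stores
#endif
}
```

Such a function could be dropped into the benchmark as a third variant next to the clzero loop and memset() to see how plain streaming stores compare on a given machine.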

    Would someone here be so nice as to post their results?
