Forum: >>> Magnum BBS <<<

CLZERO

From Bonita Montero@21:1/5 to All on Mon May 16 13:58:56 2022

x86 on AMD CPUs since Zen 1 has an instruction called CLZERO.
According to WikiChip it is meant for recovering from some memory
errors, but this is pure nonsense. There was a posting on the LKML
that reveals the actual purpose: it zeroes memory quickly, one
64-byte cache line at a time, without polluting the cache, i.e.
CLZERO is non-temporal.
I thought it would be nice to have a comparison between a looped
CLZERO and a plain memset(), which itself is usually optimized
very well by today's compilers. So I wrote a little benchmark
in C++20 to compare both:

#include <iostream>
#include <chrono>
#include <vector>
#include <memory>
#include <cstring>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h> // g++ / clang++ need -mclzero for _mm_clzero
#endif

using namespace std;
using namespace chrono;

template<bool MemSet = false>
size_t clZeroRange( void *p, size_t n );

int main()
{
    constexpr size_t
        N = 0x4000000, // 64 MiB buffer
        ROUNDS = 1'000;
    vector<char> vc( N, 0 );
    // runs one variant ROUNDS times and prints its throughput in GiB/s
    auto bench = [&]<bool MemSet>( bool_constant<MemSet> )
    {
        auto start = high_resolution_clock::now();
        size_t n = 0; // total number of bytes zeroed
        for( size_t r = ROUNDS; r--; )
            n += clZeroRange<MemSet>( to_address( vc.begin() ), N );
        double GBS = (double)(ptrdiff_t)n / 0x1.0p30;
        cout << GBS / ((double)(int64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / 1.0e9) << endl;
    };
    bench( false_type() ); // first line of output: CLZERO loop
    bench( true_type() );  // second line of output: memset()
}

template<bool MemSet>
size_t clZeroRange( void *p, size_t n )
{
    // round p up to the next 64-byte cache line boundary
    char *pAlign = (char *)(((size_t)p + 63) & ~(size_t)63);
    n -= pAlign - (char *)p;
    // round the remaining length down to whole cache lines
    n &= ~(size_t)63;
    if constexpr( !MemSet )
        for( char *end = pAlign + n; pAlign != end; pAlign += 64 )
            _mm_clzero( pAlign );
    else
        memset( pAlign, 0, n ); // same aligned range as the CLZERO loop
    return n;
}
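
The alignment arithmetic in clZeroRange() is easy to get wrong, so here
is a minimal sketch of it as a separate, testable function. The names
LineRange and wholeLines are hypothetical and not part of the benchmark
above. The line size is assumed to be a power of two (64 on the CPUs
discussed here). Unlike the benchmark, the sketch also handles a range
that is too short to contain a cache line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: the largest run of whole cache lines
// that lies completely inside [p, p + n).
struct LineRange
{
    std::uintptr_t first; // address of the first fully contained line
    std::size_t bytes;    // length of the run, a multiple of lineSize
};

inline LineRange wholeLines( std::uintptr_t p, std::size_t n, std::size_t lineSize = 64 )
{
    // assumption: lineSize is a non-zero power of two
    assert( lineSize != 0 && (lineSize & (lineSize - 1)) == 0 );
    std::uintptr_t const mask = lineSize - 1;
    std::uintptr_t const first = (p + mask) & ~mask; // round the address up
    std::size_t const skipped = first - p;           // bytes before the boundary
    if( skipped >= n )
        return { first, 0 };                         // no whole line fits
    std::size_t const remaining = n - skipped;
    // round the remaining length down to a multiple of lineSize
    return { first, remaining - remaining % lineSize };
}
```

For example, wholeLines( 0x1001, 256 ) skips 63 bytes up to 0x1040 and
leaves 192 bytes, i.e. three whole lines.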

Interestingly, I get the same performance for both variants with
MSVC++ 2022. With g++ / glibc, memset() reaches only about one
third of the throughput of the CLZERO solution. I think
the memset() of glibc is just not optimized that well. The memset()
of Visual C++ uses non-temporal SSE stores, which explains the good
performance.
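
To illustrate what "non-temporal SSE stores" means, here is a rough
sketch of a zeroing loop built on _mm_stream_si128. It is only meant to
show the technique and is not the actual code of any memset()
implementation. The function name zeroNonTemporal and the macro
HAVE_SSE2_STREAM are made up for this example. The sketch assumes a
16-byte aligned pointer and a length that is a multiple of 16, and it
falls back to memset() when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1 // hypothetical macro, local to this sketch
#endif

// Hypothetical helper: zeroes n bytes at p with streaming stores.
// Assumptions: p is 16-byte aligned and n is a multiple of 16.
inline void zeroNonTemporal( void *p, std::size_t n )
{
    assert( reinterpret_cast<std::uintptr_t>( p ) % 16 == 0 && n % 16 == 0 );
#if defined(HAVE_SSE2_STREAM)
    __m128i const zero = _mm_setzero_si128();
    char *dst = static_cast<char *>( p );
    for( char *end = dst + n; dst != end; dst += 16 )
        _mm_stream_si128( reinterpret_cast<__m128i *>( dst ), zero );
    _mm_sfence(); // order the weakly ordered streaming stores before later stores
#else
    std::memset( p, 0, n ); // fallback for targets without SSE2
#endif
}
```

Timing this function as a third variant of the benchmark could show how
much of the difference comes from the store type alone.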

Would someone here be so kind as to post their values?
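
Before running the benchmark on another machine, it may be worth
checking that the CPU actually reports CLZERO, since executing the
instruction without support raises an invalid-opcode fault. Support is
reported in CPUID leaf 0x80000008, EBX bit 0. The following is a
minimal sketch of that check. The function names clzeroBitSet and
cpuHasClzero are made up for this example, and on non-x86 targets the
check simply returns false.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Hypothetical helper: decodes the CLZERO flag from the EBX value
// returned by CPUID leaf 0x80000008 (bit 0).
constexpr bool clzeroBitSet( std::uint32_t ebx )
{
    return (ebx & 1u) != 0;
}

// Hypothetical helper: true if the running CPU reports CLZERO support.
inline bool cpuHasClzero()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid( regs, (int)0x80000000u ); // EAX = highest extended leaf
    if( (unsigned)regs[0] < 0x80000008u )
        return false;
    __cpuid( regs, (int)0x80000008u );
    return clzeroBitSet( (std::uint32_t)regs[1] );
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax, ebx, ecx, edx;
    // __get_cpuid returns 0 if the leaf is not supported
    if( !__get_cpuid( 0x80000008u, &eax, &ebx, &ecx, &edx ) )
        return false;
    return clzeroBitSet( ebx );
#else
    return false; // not an x86 target
#endif
}
```

Calling cpuHasClzero() at the start of main() and skipping the CLZERO
variant when it returns false would let the memset() variant still run
on CPUs without the instruction.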

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
