I derived two hash algorithms from FNV32/64 and CRC64 that don't yield
the same results, and don't provide the opportunity for error correction
as CRC does, but which have the same uniform distribution and are much
more performant on modern OoO CPUs. How did I do that?
These are the results on my Linux Ryzen 7 1800X:

fnv streamed 32:  0.964425 GB/s
fnv blocked 32:   1.96624 GB/s  (+104%)
fnv streamed 64:  0.939418 GB/s
fnv blocked 64:   3.11791 GB/s  (+232%)
crc64:            0.478093 GB/s
crc64 blocked:    2.39144 GB/s  (+400%)
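The FNV variants themselves never get posted in this thread, but judging
from the blocked CRC64 further down, "blocked" means splitting the input
into several independent lanes so an out-of-order core can overlap the
work instead of waiting on one serial dependency chain. A plausible
sketch of such a blocked FNV-1a-style hash — lane count, seeds, and the
folding step are guesses, not the author's code:

#include <cstdint>
#include <cstddef>

// Four independent FNV-1a-style lanes; the output differs from real FNV.
uint64_t fnv64_blocked( void const *p, size_t n )
{
    uint64_t const FNV_PRIME = 0x100000001B3u;      // FNV-1a 64-bit prime
    uint64_t const FNV_BASIS = 0xCBF29CE484222325u; // FNV-1a 64-bit offset basis
    uint8_t const *s = (uint8_t const *)p;
    // Distinct seeds so identical lanes don't cancel when folded.
    uint64_t h[4] = { FNV_BASIS, FNV_BASIS ^ 1, FNV_BASIS ^ 2, FNV_BASIS ^ 3 };
    for( uint8_t const *end = s + (n & -4); s != end; s += 4 )
        for( size_t i = 0; i != 4; ++i )
            h[i] = (h[i] ^ s[i]) * FNV_PRIME;   // lanes carry no mutual dependency
    uint64_t hash = h[0];
    for( size_t i = 1; i != 4; ++i )            // fold the lanes together
        hash = (hash ^ h[i]) * FNV_PRIME;
    for( uint8_t const *end = (uint8_t const *)p + n; s != end; ++s )
        hash = (hash ^ *s) * FNV_PRIME;         // tail bytes, streamed
    return hash;
}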
On 10/10/2021 9:51 AM, Bonita Montero wrote:
> I derived two hash algorithms from FNV32/64 and CRC64 that don't yield
> the same results, and don't provide the opportunity for error correction
> as CRC does, but which have the same uniform distribution and are much
> more performant on modern OoO CPUs. How did I do that?
I don't know.
Btw, if you can come up with a really fast SHA2 impl, I would be
interested because of my experimental HMAC cipher. I have a C version:

On 11.10.2021 at 07:23, Branimir Maksimovic wrote:
> On 2021-10-10, Bonita Montero <Bonita.Montero@gmail.com> wrote:
>> I derived two hash algorithms from FNV32/64 and CRC64 ...
> Proprietary?
> Dunno, haven't had a need to calculate CRC64 yet :P
That's the improved "CRC64":
#include "crc64.h"
using namespace std;
uint64_t CRC64_ECMA182::operator ()( void const *p, size_t n, uint64_t startCrc ) const
{
uint64_t crc = startCrc;
uint8_t const *s = (uint8_t *)p,
*end = s + n;
size_t t;
for( ; s != end; ++s )
t = (size_t)(crc >> 56) ^ *s,
crc = table.t[t] ^ (crc << 8);
return crc;
}
uint64_t CRC64_ECMA182::blocked( void const *p, size_t n, uint64_t
startCrc ) const
{
auto crc64_8x8 = []( uint8_t const *s ) -> uint64_t
{
uint64_t crcs[8] =
{
table.t[s[ 0]],
table.t[s[ 8]],
table.t[s[16]],
table.t[s[24]],
table.t[s[32]],
table.t[s[40]],
table.t[s[48]],
table.t[s[56]]
};
size_t t;
uint8_t const *end = ++s + 7;
do
t = (size_t)(crcs[0] >> 56) ^ s[ 0],
crcs[0] = table.t[t] ^ (crcs[0] << 8),
t = (size_t)(crcs[1] >> 56) ^ s[ 8],
crcs[1] = table.t[t] ^ (crcs[1] << 8),
t = (size_t)(crcs[2] >> 56) ^ s[16],
crcs[2] = table.t[t] ^ (crcs[2] << 8),
t = (size_t)(crcs[3] >> 56) ^ s[24],
crcs[3] = table.t[t] ^ (crcs[3] << 8),
t = (size_t)(crcs[4] >> 56) ^ s[32],
crcs[4] = table.t[t] ^ (crcs[4] << 8),
t = (size_t)(crcs[5] >> 56) ^ s[40],
crcs[5] = table.t[t] ^ (crcs[5] << 8),
t = (size_t)(crcs[6] >> 56) ^ s[48],
crcs[6] = table.t[t] ^ (crcs[6] << 8),
t = (size_t)(crcs[7] >> 56) ^ s[56],
crcs[7] = table.t[t] ^ (crcs[7] << 8);
while( ++s != end );
uint64_t crc = 0;
for( size_t i = 0; i != 8; ++i )
crc ^= crcs[i];
return crc;
};
uint8_t const *s = (uint8_t *)p;
uint64_t crc = startCrc;
for( uint8_t const *end = s + (n & -64); s != end; s += 64 )
crc ^= crc64_8x8( s );
crc ^= (*this)( s, n % 64, 0 );
return crc;
}
CRC64_ECMA182::crc64_table::crc64_table()
{
uint64_t const CRC64_ECMA182_POLY = 0x42F0E1EBA9EA3693u;
for( uint64_t i = 0; i != 256; ++i )
{
uint64_t crc = 0,
c = i << 56;
for( unsigned j = 0; j != 8; ++j )
crc = (int64_t)(crc ^ c) < 0 ? (crc << 1) ^ CRC64_ECMA182_POLY : crc
<< 1,
c <<= 1;
t[(size_t)i] = crc;
}
}
CRC64_ECMA182::crc64_table CRC64_ECMA182::table;
Why does it run faster?
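The streamed loop above is one serial dependency chain - each step needs
the previous crc - while the blocked kernel keeps eight independent
chains in flight for the out-of-order core to overlap; that is where the
+400% comes from. The posted .cpp also assumes a crc64.h roughly like
the following - a reconstruction from the definitions above, not the
author's actual header:

// crc64.h - hypothetical reconstruction
#include <cstdint>
#include <cstddef>

struct CRC64_ECMA182
{
    // streamed, byte-at-a-time variant
    uint64_t operator ()( void const *p, size_t n, uint64_t startCrc = 0 ) const;
    // blocked variant: eight interleaved lanes per 64-byte block
    uint64_t blocked( void const *p, size_t n, uint64_t startCrc = 0 ) const;
private:
    struct crc64_table
    {
        crc64_table();
        uint64_t t[256];
    };
    static crc64_table table;
};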

On 11.10.2021 at 07:23, Branimir Maksimovic wrote:
> Proprietary?
Of course - but the same uniform distribution.

On 2021-10-11, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> That's the improved "CRC64":
> Why does it run faster?
Better to generate the table beforehand, don't waste time on generation :P
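If "generate the table beforehand" means building it at compile time
rather than in a static constructor, a minimal C++17 sketch (not from
the thread) could look like this:

#include <array>
#include <cstdint>

// Same table the constructor above computes, but evaluated entirely at
// compile time, so program startup does no work at all.
constexpr std::array<uint64_t, 256> makeCrc64Table()
{
    constexpr uint64_t POLY = 0x42F0E1EBA9EA3693u;
    std::array<uint64_t, 256> t {};
    for( uint64_t i = 0; i != 256; ++i )
    {
        uint64_t crc = 0, c = i << 56;
        for( unsigned j = 0; j != 8; ++j )
        {
            crc = (crc ^ c) >> 63 ? (crc << 1) ^ POLY : crc << 1;
            c <<= 1;
        }
        t[i] = crc;
    }
    return t;
}

constexpr auto kCrc64Table = makeCrc64Table();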

On 2021-10-11, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> Of course - but the same uniform distribution.
Dunno, those hashing algos work much better if there is
support from hardware. But of course, faster is better :P

On 11.10.2021 at 08:32, Branimir Maksimovic wrote:
> Better to generate the table beforehand, don't waste time on generation :P
Eeeh, I'm also using a table, as you can see from above.

On 11.10.2021 at 09:46, Branimir Maksimovic wrote:
> What do you think about the following:
> https://github.com/intel/isa-l/tree/master/crc
> If you want to be a hacker, you have to program in ASM
I won't check this ASM code. And I don't know why people use ASM.
C / C++ and intrinsics usually result in better code.
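As an example of the intrinsics route: SSE 4.2 exposes a hardware CRC
instruction directly in C++. Note that the instruction implements
CRC-32C (Castagnoli polynomial), not CRC-64/ECMA-182, so this sketch
illustrates the approach rather than replacing the code above; compile
with -msse4.2 on GCC/clang:

#include <nmmintrin.h>   // SSE4.2 intrinsics: _mm_crc32_u64 / _mm_crc32_u8
#include <cstdint>
#include <cstddef>
#include <cstring>

// CRC-32C over a buffer, one instruction per 8 bytes where possible.
uint32_t crc32c( void const *p, size_t n, uint32_t startCrc )
{
    uint8_t const *s = (uint8_t const *)p;
    uint64_t crc = ~startCrc;
    for( uint8_t const *end = s + (n & -8); s != end; s += 8 )
    {
        uint64_t v;
        std::memcpy( &v, s, 8 );        // unaligned-safe 8-byte load
        crc = _mm_crc32_u64( crc, v );
    }
    for( uint8_t const *end = (uint8_t const *)p + n; s != end; ++s )
        crc = _mm_crc32_u8( (uint32_t)crc, *s );    // tail bytes
    return ~(uint32_t)crc;
}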

On 2021-10-11, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> I won't check this ASM code. And I don't know why people use ASM.
> C / C++ and intrinsics usually result in better code.
ASM code is always the most efficient, and works as tested, without
surprises :P

On 11.10.2021 at 11:11, Branimir Maksimovic wrote:
> ASM code is always the most efficient, and works as tested, without
> surprises :P
ASM code can be faster in rare cases when you know everything about
your OoO CPU, but in most cases the compiler generates better code.
I've seen code from clang where you might think: there is no ASM
programmer who knows all of these optimization tricks.

On 11.10.2021 at 12:42, Branimir Maksimovic wrote:
> Human always beats the compiler, as you can always examine the
> compiler-generated code, learn, and beat it :P
Humans tend to write Asm that is readable. Compilers generate
Asm that's often not readable, for performance reasons.

On 2021-10-11, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> Humans tend to write Asm that is readable. Compilers generate
> Asm that's often not readable, for performance reasons.
Readable and compact, not bloated :P
Optimize in iterations :P

On 2021-10-11 at 11:18, Bonita Montero wrote:
> ASM code can be faster in rare cases when you know everything about
> your OoO CPU, but in most cases the compiler generates better code.
In this case the asm code is supplied by Intel.
I bet they qualify for "you know everything about your OoO-CPU". :-)

On 11/10/2021 13:58, Bo Persson wrote:
> In this case the asm code is supplied by Intel.
> I bet they qualify for "you know everything about your OoO-CPU". :-)
Well, kind of. Intel as a whole probably knows most of what there is to
know about Intel processors. But Intel does not write code - people
working at (or for) Intel write code, and there is absolutely no
guarantee that the person or people who wrote the code know all about
all of Intel's processors - never mind non-Intel x86 processors, or
non-x86 processors, or any other device. At best, you can probably be
quite confident that the code is close to optimal if you run it on the
same processor the assembly author used.
How well it will run on the dozen other current Intel processor
variations is another matter (by "dozen", I am ignoring devices that
differ only in clock speed or core count, and ignoring older devices).
How well it will run on AMD processors is also another matter.
You write these routines in C (or C++), and you tune the optimisation.
You compile with "-march=native", or whatever flag your compiler has to
get the fastest code for your particular processor. You use compiler
features for multi-versioning of target-specific optimisations, so that
the compiler generates versions for different SIMD and other instruction
set extensions, and picks the best version for the real cpu when the
code starts up. You use inline assembly or intrinsics for specific
target versions if you are /sure/ your assembly works faster, and have
measured it.
General use of assembly language is something that comes /way/ down on
the list when you are trying to get fast implementation of code.
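For example, with GCC (and recent clang) the multi-versioning described
above is a single attribute; the function and target list here are made
up for illustration:

#include <cstdint>
#include <cstddef>

// Function multi-versioning: the compiler emits one clone per listed
// target plus a resolver that picks the best clone at load time.
// The loop body is just a stand-in for a real hash kernel.
__attribute__(( target_clones( "default", "sse4.2", "avx2" ) ))
uint64_t hashBlock( uint8_t const *p, size_t n )
{
    uint64_t h = 0;
    for( size_t i = 0; i != n; ++i )
        h = (h << 8 | h >> 56) ^ p[i];
    return h;
}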

On 11/10/2021 13:58, Bo Persson wrote:
> In this case the asm code is supplied by Intel.
> I bet they qualify for "you know everything about your OoO-CPU". :-)
The compiler at best can produce generic optimistic code that does not
stand a chance against a dedicated human :P

On 2021-10-11, David Brown <david.brown@hesbynett.no> wrote:
> General use of assembly language is something that comes /way/ down on
> the list when you are trying to get fast implementation of code.
The only reason we do not program in assembler is because the learning
curve is steep :P

On 11.10.2021 at 14:56, Branimir Maksimovic wrote:
> The compiler at best can produce generic optimistic code that does not
> stand a chance against a dedicated human :P
It was easy to beat a compiler ten years ago, but not anymore.
And in five years it will be almost impossible.

> The only reason we do not program in assembler is because the learning
> curve is steep :P
I'm going to assume the ":P" smiley means you are being sarcastic.

On 2021-10-11, Bonita Montero <Bonita.Montero@gmail.com> wrote:
> It was easy to beat a compiler ten years ago, but not anymore.
> And in five years it will be almost impossible.
Look, humans write compilers, start at that :P

On Mon, 11 Oct 2021 16:55:59 GMT,
Branimir Maksimovic <branimir.maksimovic@icloud.com> wrote:
> Look, humans write compilers, start at that :P
Humans wrote AlphaZero. Good luck beating it at chess, however.

On 2021-10-12, RadicalRabbit@theburrow.co.uk <RadicalRabbit@theburrow.co.uk> wrote:
> Humans wrote AlphaZero. Good luck beating it at chess, however.
AlphaZero is nothing special, just a better determination of the position
quality function :P
A compiler is a much larger byte :P
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
> Btw, if you can come up with a really fast SHA2 impl, I would be
> interested because of my experimental HMAC cipher. I have a C version:
https://en.wikipedia.org/wiki/Intel_SHA_extensions
https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-cryptographic-algorithms/a64-cryptographic-instructions?lang=en
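Whichever implementation you pick, it has to be gated by a runtime
check. A minimal x86 probe for the Intel SHA extensions, assuming the
GCC/clang <cpuid.h> helper (the feature flag is CPUID leaf 7, sub-leaf 0,
EBX bit 29):

#include <cpuid.h>

// True if the CPU supports SHA-NI, i.e. the hardware SHA-1/SHA-256
// round instructions documented at the links above.
bool hasShaExtensions()
{
    unsigned eax, ebx, ecx, edx;
    if( !__get_cpuid_count( 7, 0, &eax, &ebx, &ecx, &edx ) )
        return false;
    return (ebx >> 29) & 1;
}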