Forum: >>> Magnum BBS <<<

hardware encryption

From Diederik de Haas@21:1/5 to brainfart@posteo.net on Thu Jun 3 16:49:54 2021

On Wednesday 20 January 2021 11:40:26 CEST brainfart@posteo.net wrote:

hardware accelerated encryption is a bit of a mystery to me
some processors advertise it but how do we know if it's being used
is there a way to test if hardware accelerated encryption is being used
or if it's just advertising hype

I would very much like to understand this as well.
I have several Rock64 devices, which are supposed to have the ARMv8 Cryptography Extensions according to https://wiki.pine64.org/wiki/ROCK64#CPU_Architecture.

Due to bug #976635 several CRYPTO modules got enabled in the 5.10 kernel.
But I don't know whether that's relevant for ARMv8 CE.

https://turecki.net/content/getting-most-out-ssh-hardware-acceleration-tuning-aes-ni
contains a test to check the speed of some crypto operations.
Based on that I've made a procedure which I've now run on several devices:

# adduser test
$ ssh-add                 # make sure the ssh agent is running
$ ssh-copy-id test@localhost
$ ssh test@localhost      # verify that key-based auth works
$ exit
$ for i in $(ssh -Q cipher); do dd if=/dev/zero bs=1M count=100 2> /dev/null | \
    ssh -c "$i" test@localhost "(time -p cat) > /dev/null" 2>&1 | grep real | \
    awk -v c="$i" '{print c ": " 100 / $2 " MB/s"}'; done
$ grep -i -E "(flags|features)" /proc/cpuinfo | tail -n1

On a Rock64 with kernel 5.8.0-1-arm64, I got these results:
aes128-ctr: 45.8716 MB/s
aes192-ctr: 45.6621 MB/s
aes256-ctr: 44.6429 MB/s
aes128-gcm@openssh.com: 49.505 MB/s
aes256-gcm@openssh.com: 48.7805 MB/s
chacha20-poly1305@openssh.com: 36.9004 MB/s

Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

But on kernel 5.10.0-7-arm64, with those CRYPTO modules, I got this:
aes128-ctr: 42.735 MB/s
aes192-ctr: 44.4444 MB/s
aes256-ctr: 44.0529 MB/s
aes128-gcm@openssh.com: 48.0769 MB/s
aes256-gcm@openssh.com: 46.0829 MB/s
chacha20-poly1305@openssh.com: 37.037 MB/s

Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

If you run the test several times, you'll get slightly different results
each time, so I consider these results the same.

For comparison (I don't remember which kernel version) on Ryzen 7 1800X:
aes128-ctr: 714.286 MB/s
aes192-ctr: 714.286 MB/s
aes256-ctr: 769.231 MB/s
aes128-gcm@openssh.com: 1000 MB/s
aes256-gcm@openssh.com: 1000 MB/s
chacha20-poly1305@openssh.com: 294.118 MB/s

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp
lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx
f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1
avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1
xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

with kernel 5.10.0-7-amd64:
aes128-ctr: 714.286 MB/s
aes192-ctr: 769.231 MB/s
aes256-ctr: 714.286 MB/s
aes128-gcm@openssh.com: 909.091 MB/s
aes256-gcm@openssh.com: 909.091 MB/s
chacha20-poly1305@openssh.com: 500 MB/s

very odd that aes192-ctr and aes256-ctr seem to have switched, but the values are otherwise EXACTLY the same :-O
Very impressive speed improvement with chacha20-poly1305 though :D
(Note that the aforementioned bug report was about arm64, not amd64)
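
One possible explanation for the identical-looking values, assuming the remote `time -p` reports elapsed time with a resolution of 0.01 s (as bash's builtin does): with only 100 MB transferred, every reported figure is 100 divided by a two-decimal time, so a fast machine can only produce a handful of distinct results. A small sketch (the function name `throughput_for` is made up for illustration) that lists what the test loop would print for a few elapsed times:

```shell
# Throughput the test loop would print for a given elapsed time in seconds,
# assuming 100 MB transferred and awk's default number formatting.
throughput_for() {
    awk -v t="$1" 'BEGIN { print 100 / t }'
}

for t in 0.10 0.11 0.13 0.14 0.20 0.34; do
    echo "$t s -> $(throughput_for "$t") MB/s"
done
```

All of the Ryzen figures above show up in that list, which hints at timer granularity rather than identical cipher speed. If that is the cause, a larger `count=` on the `dd` side should give finer-grained numbers.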

On a RPi2, the values were around 12 MB/s

I don't find the Rock64 scores impressive, but that may be because
I've read somewhere that the ARMv8 Cryptography Extensions could/should
result in a FACTOR 10 speed improvement in cryptography.

There could be a number of issues here:
1) The 'factor 10' is horseshit
2) The 'factor 10' is true, but it doesn't work on Rock64 (yet?)
3) The 'factor 10' is true and working and without it, the scores would be abysmal.
4) The test is all wrong

If I do 'cat /proc/crypto' I get a long list, but I have no idea what the output means.

So essentially I have the same question as OP.
How can I/we know if it's present and working as intended?
What kind of speed improvement can/should one expect?
What is needed to take advantage of it? Kernel modules and if so, which?
The CRYPTO_XYZ_CE ones? Others? Something else entirely?
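
As a rough aid for reading /proc/crypto: each entry is a block of "key : value" lines separated by a blank line, and for a given algorithm name the kernel prefers the driver with the highest priority. The sketch below condenses each block to one line. The helper name `summarize_crypto` is made up, and the idea that ARMv8 CE drivers can be spotted by a "-ce" suffix in the driver name is an assumption to verify on the actual device. Note that this list only describes the in-kernel crypto API; it says nothing about what a userspace library does.

```shell
# Condense /proc/crypto-style text on stdin to "priority driver name module",
# one line per registered implementation.
summarize_crypto() {
    awk -F' *: *' '
        function emit() {
            if (name != "") printf "%s %s %s %s\n", prio, driver, name, module
            name = driver = module = prio = ""
        }
        $1 == "name"     { name = $2 }
        $1 == "driver"   { driver = $2 }
        $1 == "module"   { module = $2 }
        $1 == "priority" { prio = $2 }
        NF == 0          { emit() }   # blank line ends an entry
        END              { emit() }   # last entry may lack a trailing blank line
    '
}

# Typical use (output depends on the kernel and loaded modules):
#   summarize_crypto < /proc/crypto | sort -k3,3 -k1,1nr
```

Sorted that way, the first line for each algorithm name is the implementation the kernel would normally pick.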

Cheers,
Diederik


--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Jeffrey Walton@21:1/5 to didi.debian@cknow.org on Thu Jun 3 18:00:02 2021

On Thu, Jun 3, 2021 at 10:50 AM Diederik de Haas <didi.debian@cknow.org> wrote:

So essentially I have the same question as OP.
How can I/we know if it's present and working as intended?
What kind of speed improvement can/should one expect?
What is needed to take advantage of it? Kernel modules and if so, which?
The CRYPTO_XYZ_CE ones? Others? Something else entirely?

I _think_ OpenSSH uses OpenSSL, not kernel crypto. Or they use that
LibreSSL port of OpenSSL.
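
One way to check this on a given system, sketched under the assumption that `ldd` is available and that the ssh binary is dynamically linked: look for a libcrypto shared object in the `ldd` output. The helper name `crypto_lib_of` is made up for illustration.

```shell
# Print the first libcrypto shared-object name found in ldd-style text on stdin.
crypto_lib_of() {
    grep -E -o 'libcrypto\.so[.0-9a-z]*' | head -n1
}

# Typical use (output depends on the system):
#   ldd "$(command -v ssh)" | crypto_lib_of
```

Both OpenSSL and LibreSSL ship a libcrypto, so this alone does not say which one it is; `ssh -V` prints the library name and version next to the OpenSSH version.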

To benchmark OpenSSL, you use something like:

# C implementation
openssl speed aes-128-cbc

# Hardware acceleration
openssl speed -evp aes-128-cbc

You can see the difference in the numbers below. Below, I'm on a Core i7-8700.

$ openssl speed aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 57736814 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 14943316 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 3741357 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 944345 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 118246 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 16384 size blocks: 59132 aes-128 cbc's in 3.00s
OpenSSL 1.1.1f 31 Mar 2020
built on: Wed Apr 28 00:37:28 2021 UTC
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128 cbc    307929.67k   318790.74k   319262.46k   322336.43k   322890.41k   322939.56k

$ openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 186837731 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 64 size blocks: 78857865 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 20276035 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 5088201 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 636732 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 318374 aes-128-cbc's in 3.00s
OpenSSL 1.1.1f 31 Mar 2020
built on: Wed Apr 28 00:37:28 2021 UTC
...
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc    999800.57k  1682301.12k  1730221.65k  1736772.61k  1738702.85k  1738746.54k
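
To put a number on the difference, the two result rows can be divided column by column. A minimal sketch (the helper name `speedup` is made up for illustration) using the 16384-byte and 16-byte figures from the two runs above:

```shell
# Ratio of two throughput figures as printed by "openssl speed" (e.g. 322939.56k).
speedup() {
    awk -v base="${1%k}" -v fast="${2%k}" 'BEGIN { printf "%.2f\n", fast / base }'
}

speedup 322939.56k 1738746.54k   # 16384-byte blocks
speedup 307929.67k 999800.57k    # 16-byte blocks
```

On these numbers the -evp run comes out roughly 5.4 times faster for large blocks and a bit over 3 times faster for 16-byte blocks.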

I don't like OpenSSL's output. They should provide cycles per byte (cpb),
since as a performance metric it is mostly independent of clock speed.
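
The conversion itself is simple: cycles per byte is the clock frequency in Hz divided by the throughput in bytes per second. A sketch, with the loud caveat that the clock speed during the runs above is not stated anywhere in this thread, so the 4000000000 Hz below is purely a placeholder and the helper name `cycles_per_byte` is made up:

```shell
# Cycles per byte = clock frequency (Hz) / throughput (bytes per second).
# The second argument is an "openssl speed" figure in 1000s of bytes per second.
cycles_per_byte() {
    awk -v hz="$1" -v kbytes="${2%k}" 'BEGIN { printf "%.2f\n", hz / (kbytes * 1000) }'
}

# 4000000000 Hz is a placeholder clock, not a measured value.
cycles_per_byte 4000000000 1738746.54k
cycles_per_byte 4000000000 322939.56k
```

For a real comparison the frequency would have to be pinned or measured while the benchmark runs.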
Jeff


From Jeffrey Walton@21:1/5 to brainfart@posteo.net on Thu Jun 3 17:40:02 2021

On Wed, Jan 20, 2021 at 5:40 AM <brainfart@posteo.net> wrote:

...
this thing about hardware accelerated encryption is a bit of a mystery
to me
some processors advertise it but how do we know if it's being used
is there a way to test if hardware accelerated encryption is being used
or if it's just advertising hype

You usually cannot tell when hardware acceleration is being used.
Most libraries don't expose their implementation details.
About all you can do is check whether the CPU advertises the acceleration.
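
On Linux that check can be scripted against /proc/cpuinfo. The sketch below is only an illustration: the helper name `has_cpu_flag` is made up, and the flag names are the ones that appear in the cpuinfo output quoted earlier in this thread (aes, pmull, sha1, sha2 on arm64; aes, pclmulqdq, sha_ni on x86). A flag being present says the CPU offers the instructions, not that any given program uses them.

```shell
# Succeed if the first flags/Features line of a cpuinfo file lists the exact flag.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    grep -i -E '^(flags|features)[[:space:]]*:' "$file" | head -n1 |
        tr ' \t' '\n\n' | grep -qx -- "$flag"
}

# Report which of the crypto-related flags seen in this thread are advertised.
if [ -r /proc/cpuinfo ]; then
    for f in aes pmull sha1 sha2 pclmulqdq sha_ni; do
        if has_cpu_flag "$f"; then echo "$f: advertised"; fi
    done
fi
```

Matching whole words avoids false hits such as "sha" matching "sha1".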

One library that provides the algorithmic details is Crypto++.
Crypto++ is a C++ class library. Classes like AES and SHA have a
member function AlgorithmProvider(). If the C++ implementation is
used, then the string "C++" is returned. If hardware acceleration is
used, then the string will be "AES", "SHA" or "NEON", "ASIMD" or
"ARMv7", depending what is fastest.

I can't tell if you are asking how to check that a hardware
implementation, like AES or SHA acceleration, is actually faster than
C, C++, ASM, etc. For that you have to benchmark the algorithm.

And one thing to be aware of... NEON (ARMv7) and ASIMD (ARMv8) are
like Intel SSE acceleration. Some algorithms slow down when using NEON
or ASIMD. For example, BLAKE2 is fastest when using C or C++ code. If
you use NEON or ASIMD then the code slows down by about 3 cycles per
byte (cpb).[1] The slowdown is due to a slow double-word (64-bit)
shift that can only be issued from one port. That holds for ARM A53's,
A57's and Apple's M1.

[1] https://github.com/weidai11/cryptopp/blob/master/blake2.cpp#L30

if i'm encrypting my data and want to reduce the load on the cpu as much
as possible what processor would be best

Efficiency is one reason, but a more important one is side channels.
Using AES acceleration will avoid most side channel attacks.

Once the implementation is correct, then it can be sped-up to be faster :)

Jeff


From Diederik de Haas@21:1/5 to All on Thu Jun 3 19:34:19 2021

To: noloader@gmail.com

On Thursday 3 June 2021 17:52:50 CEST Jeffrey Walton wrote:

I _think_ OpenSSH uses OpenSSL, not kernel crypto.

If that means that hardware/accelerated crypto is dependent on
the program being used, that would suck

To benchmark OpenSSL, you use something like:
# C implementation
openssl speed aes-128-cbc
# Hardware acceleration
openssl speed -evp aes-128-cbc

You can see the difference in the numbers below ... on a Core i7-8700.

$ openssl speed aes-128-cbc
...
OpenSSL 1.1.1f 31 Mar 2020
built on: Wed Apr 28 00:37:28 2021 UTC
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128 cbc 307929.67k 318790.74k 319262.46k 322336.43k 322890.41k 322939.56k

$ openssl speed -evp aes-128-cbc
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 999800.57k 1682301.12k 1730221.65k 1736772.61k 1738702.85k 1738746.54k

$ openssl speed aes-128-cbc
...
version: 3.0.0-alpha16
built on: Thu May 6 19:54:38 2021 UTC
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 84716.70k 269243.61k 584986.37k 830015.83k 944873.47k 953417.73k

$ openssl speed -evp aes-128-cbc
...
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
AES-128-CBC 95904.58k 297023.53k 611697.15k 855083.69k 966412.97k 956033.71k

At first glance there seems to be some improvement, particularly with 16/64 bytes,
but the difference is nowhere near as significant as in your results.

But I also tried it a few more times and, generally speaking, 16/64 bytes saw higher scores with '-evp', but I've also had higher scores on the larger block sizes WITHOUT '-evp' ?!?

(I included the version as it was very different; it turns out mine is from experimental.)

Thanks for your reply,
Diederik
