Ever since the Westmere microarchitecture (2010), many Intel CPUs have shipped with hardware-accelerated AES support (aka “AES-NI”, for “AES New Instructions”). I figured it would be interesting to see a comparison between AES with and without the hardware acceleration on my Intel Core i5-3317U CPU (Ivy Bridge) on Arch Linux.
According to a post on the OpenSSL Users mailing list, you can force openssl to avoid hardware AES instructions using the OPENSSL_ia32cap environment variable.
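For the curious, the mask value used for the masked run in this post (~0x200000200000000) is not magic. As far as I can tell from the OpenSSL documentation for OPENSSL_ia32cap, the variable describes a 64-bit capability vector in which bit 57 stands for AES-NI and bit 33 for PCLMULQDQ, and a leading ~ means "clear these bits". Here is a small Python sketch that rebuilds the value; the constant and function names are my own and not part of OpenSSL.

```python
# Bit positions in OpenSSL's x86 capability vector, as described in the
# OPENSSL_ia32cap documentation (low 32 bits: CPUID.1:EDX, high 32 bits: CPUID.1:ECX).
PCLMULQDQ_BIT = 33  # ECX bit 1
AESNI_BIT = 57      # ECX bit 25


def disable_mask(*bits):
    """Build an OPENSSL_ia32cap value that clears the given capability bits."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    # The leading "~" asks OpenSSL to clear these bits rather than set them.
    return "~" + hex(mask)


print(disable_mask(AESNI_BIT, PCLMULQDQ_BIT))  # prints ~0x200000200000000
```

So the mask turns off two features at once: the AES instructions themselves and the carry-less multiply used by GCM.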
Benchmarks
First, with AES-NI enabled (the default, on hardware that supports it):
$ openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 57196857 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 15343650 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 3897351 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 978726 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 122310 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e 11 Feb 2013
built on: Sun Oct 20 14:49:13 CEST 2013
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 305049.90k 327331.20k 332573.95k 334071.81k 333987.84k
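The summary row is just arithmetic on the "Doing ..." lines: operations times block size, divided by elapsed seconds, shown in thousands of bytes per second. A quick Python check of two columns from the run above (the function name is mine):

```python
def throughput_k(ops, block_bytes, seconds):
    """Throughput in 1000s of bytes per second, the unit openssl speed reports."""
    return ops * block_bytes / seconds / 1000.0


# 16-byte column: 57196857 operations in 3.00s
print(f"{throughput_k(57196857, 16, 3.00):.2f}k")   # prints 305049.90k
# 8192-byte column: 122310 operations in 3.00s
print(f"{throughput_k(122310, 8192, 3.00):.2f}k")   # prints 333987.84k
```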
Then, setting the capability mask to turn off the hardware AES features:
$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 27883366 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 7736907 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 1949328 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 498847 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 62446 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e 11 Feb 2013
built on: Sun Oct 20 14:49:13 CEST 2013
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 148711.29k 165054.02k 166342.66k 170273.11k 170519.21k
You can see that hardware-accelerated AES is pretty consistently about twice as fast as the implementation without AES-NI (between 1.96x and 2.05x, depending on block size). So it’s not an order-of-magnitude win, but getting twice the performance is certainly very serious! This is great not only for servers using AES encryption (SSL/TLS, hello!), but also for consumers wanting to connect to said servers, as well as for things like full-disk encryption.
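To put numbers on the "twice as fast" claim, here is a short Python sketch that divides the two summary rows from my runs column by column. The values are copied from the output above; the variable names are only for illustration.

```python
block_sizes = [16, 64, 256, 1024, 8192]

# Summary rows from the two runs above, in 1000s of bytes per second.
with_aesni = [305049.90, 327331.20, 332573.95, 334071.81, 333987.84]
without_aesni = [148711.29, 165054.02, 166342.66, 170273.11, 170519.21]

speedups = [fast / slow for fast, slow in zip(with_aesni, without_aesni)]

for size, speedup in zip(block_sizes, speedups):
    print(f"{size:>5} bytes: {speedup:.2f}x")
```

The ratio is highest for the 16-byte blocks and settles just under 2x for the larger ones.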
Note: It seems Arch Linux’s OpenSSL is built with AES-NI support but not as an engine, so plain openssl speed (without -evp) could be misleading (i.e., you’d see no difference with or without the capabilities masked). To get the AES-NI support you need to use -evp mode, which runs the benchmark through EVP (“envelope”), OpenSSL’s high-level interface for crypto functions.
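If you want to repeat the comparison from a script, something like the following Python sketch should work. It assumes an openssl binary on the PATH and that the summary table is printed on standard output, which is what I see here but have not checked across versions; parse_summary and run_speed are hypothetical helper names of my own.

```python
import os
import subprocess


def parse_summary(output):
    """Return (cipher name, throughputs in 1000s of bytes/s) from openssl speed output."""
    for line in reversed(output.strip().splitlines()):
        fields = line.split()
        # The result row is the only one whose value columns all end in "k".
        if len(fields) > 1 and all(f.endswith("k") for f in fields[1:]):
            return fields[0], [float(f[:-1]) for f in fields[1:]]
    raise ValueError("no throughput row found")


def run_speed(cipher="aes-128-cbc", ia32cap=None):
    """Run `openssl speed -elapsed -evp <cipher>`, optionally with OPENSSL_ia32cap set."""
    env = dict(os.environ)
    if ia32cap is not None:
        env["OPENSSL_ia32cap"] = ia32cap
    result = subprocess.run(
        ["openssl", "speed", "-elapsed", "-evp", cipher],
        env=env, capture_output=True, text=True, check=True,
    )
    return parse_summary(result.stdout)


# Parsing the summary from the first run in this post:
sample = (
    "type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes\n"
    "aes-128-cbc 305049.90k 327331.20k 332573.95k 334071.81k 333987.84k\n"
)
print(parse_summary(sample))
```

Calling run_speed() and then run_speed(ia32cap="~0x200000200000000") would give the two rows to compare; each call takes about 15 seconds, since every block size runs for 3 seconds.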
This is some seriously good stuff.
It seems to scale in a similar fashion in older chips such as Arrandale, with and without AES-NI acceleration.
I’ll post benchmarks once I get off the bus.
Ah! Arrandale has AES-NI? Nice. Lucky you!
As promised, here are my benchmarks, with AES-NI HWaccel activated:
Seems my Intel Core i7 640M is 2.32X faster than your ULV core i5.
Without AES-NI acceleration, my scores are:
With AES-NI HWaccel enabled, the OpenSSL benchmark scales by a factor of 2.84x. That’s on a Nehalem 😉
Nice stuff!!
Finally, on Arch Linux.
With AES-NI H/W acceleration badassery:
Without AES-NI H/W Acceleration:
On Arch (everything up to date and using the latest OpenSSL build).
Processor:
Intel Core i7 640M, Arrandale.
It seems the build on Arch performs much, much faster than what Fedora 19 packages, haha.
After an update today:
openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 123841672 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 34140840 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 8555372 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2246614 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 282338 aes-128-cbc's in 3.00s
OpenSSL 1.0.1f 6 Jan 2014
built on: Mon Jan 6 21:23:11 CET 2014
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 660488.92k 728337.92k 730058.41k 766844.25k 770970.97k
With AES-NI only.
Without AES-NI, just to catch up with your slow ULV:
OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 46593943 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12691229 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 3338377 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 854280 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 107370 aes-128-cbc's in 3.00s
OpenSSL 1.0.1f 6 Jan 2014
built on: Mon Jan 6 21:23:11 CET 2014
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -Wa,--noexecstack -march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4 -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 248501.03k 270746.22k 284874.84k 291594.24k 293191.68k
And to mess with the results, hail unto:
Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
Results with all the zestiness:
openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 71153330 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 24093729 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 6118901 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1536165 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 192017 aes-128-cbc's in 3.00s
OpenSSL 1.0.1 14 Mar 2012
built on: Mon Apr 15 15:27:18 UTC 2013
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50 -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 379484.43k 512291.91k 522146.22k 522602.31k 524334.42k
Now, without AES-NI:
OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 30915053 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 12543885 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 3204918 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 812027 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 80782 aes-128-cbc's in 3.01s
OpenSSL 1.0.1 14 Mar 2012
built on: Mon Apr 15 15:27:18 UTC 2013
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50 -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 164880.28k 266713.83k 273486.34k 277171.88k 219855.86k
That’s on Ubuntu 12.04 LTS.
And yes… my Arrandale kicks its 16-core monstrous ass.
Alan,
It’s all about the code optimizations. In Windows cygwin64, where these geniuses saw it wisest to compile openssl using mtune=generic and, probably, with no AES-NI acceleration, here are my results. Very interesting:
Brainiarc7@Brainiarc7-PC ~
$ openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 23181988 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 6403056 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 1653828 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 420094 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 52060 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e 11 Feb 2013
built on: Thu Mar 7 05:51:56 CST 2013
options:bn(64,64) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: x86_64-pc-cygwin-gcc -D_WINDLL -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -DL_ENDIAN -O3 -Wall
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 123637.27k 136462.07k 140985.67k 143248.84k 141969.21k
AES-NI acceleration disabled via OPENSSL_ia32cap:
Brainiarc7@Brainiarc7-PC ~
$ OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 23114849 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 6451259 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 256 size blocks: 1658972 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 418238 aes-128-cbc's in 3.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 52383 aes-128-cbc's in 3.00s
OpenSSL 1.0.1e 11 Feb 2013
built on: Thu Mar 7 05:51:56 CST 2013
options:bn(64,64) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: x86_64-pc-cygwin-gcc -D_WINDLL -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -DL_ENDIAN -O3 -Wall
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 123279.19k 137352.15k 141377.11k 142473.62k 142850.05k
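Conclusion: With AES-NI disabled in the cygwin64 build of OpenSSL, I get essentially the same results (the differences are within noise), which suggests this build never uses AES-NI in the first place; its compiler line has none of the -DAES_ASM style flags that the Linux builds show. In other words, cygwin sucks.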
But so does Windows.
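A quick way to spot this kind of build is to look at the compiler line that openssl speed prints. The Python sketch below checks for the assembly-related defines; that their absence means "no AES-NI code path" is my assumption based on the identical numbers above, not something I have verified in the OpenSSL sources, and the function name is made up for this example.

```python
# Defines that show up in the Arch and Ubuntu compiler lines but not in the cygwin one.
ASM_FLAGS = ("-DAES_ASM", "-DVPAES_ASM", "-DBSAES_ASM")


def missing_aes_asm_flags(compiler_line):
    """Return the AES assembly defines that are absent from a compiler line."""
    present = set(compiler_line.split())
    return [flag for flag in ASM_FLAGS if flag not in present]


cygwin = ("compiler: x86_64-pc-cygwin-gcc -D_WINDLL -DOPENSSL_PIC -DZLIB "
          "-DOPENSSL_THREADS -DDSO_DLFCN -DHAVE_DLFCN_H -DTERMIOS -DL_ENDIAN -O3 -Wall")
# Excerpt (tail only) of the Arch compiler line quoted earlier in the thread.
arch_excerpt = "-DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM"

print(missing_aes_asm_flags(cygwin))        # all three flags are missing
print(missing_aes_asm_flags(arch_excerpt))  # nothing is missing
```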
Let’s see how Ubuntu 12.04 fares on an Ivy Bridge Intel Core i5:
model name : Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Scores with AES-NI:
openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 107115069 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 28602044 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 7269322 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 1824450 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 227851 aes-128-cbc's in 3.00s
OpenSSL 1.0.1 14 Mar 2012
built on: Wed Jan 8 20:45:51 UTC 2014
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50 -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 571280.37k 610176.94k 620315.48k 622745.60k 622185.13k
With AES-NI explicitly disabled:
OPENSSL_ia32cap="~0x200000200000000" openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 20751858 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 5770370 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 1497761 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 801904 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 102059 aes-128-cbc's in 3.00s
OpenSSL 1.0.1 14 Mar 2012
built on: Wed Jan 8 20:45:51 UTC 2014
options:bn(64,64) rc4(8x,int) des(idx,cisc,16,int) aes(partial) blowfish(idx)
compiler: cc -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_NO_TLS1_2_CLIENT -DOPENSSL_MAX_TLS1_2_CIPHER_LENGTH=50 -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 110676.58k 123101.23k 127808.94k 273716.57k 278689.11k
Meh.