Slow AES and RC4 performance on Intel Westmere

Iain Morgan
Hello,

As previously noted on this mailing list, the AES performance (without
AES-NI) of v1.0.0 on Intel Westmere chips seems a bit slow. In addition,
RC4 seems a bit slow compared to previous Intel chips.

I've included below the speed output for several versions of OpenSSL for
comparison. For simplicity, I chose a subset of the more interesting
algorithms. All tests were done on an Intel Westmere running at 3.0 GHz.
For reference, the output from cpuid.c is below:

0000000b:756e6547:6c65746e:49656e69
000206c2:00200800:029ee3ff:bfebfbff
3c004121:01c0003f:0000003f:00000000

OpenSSL 0.9.8n 24 Mar 2010
built on: Tue Mar 30 17:29:17 PDT 2010
options:bn(64,64) md2(int) rc4(1x,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              40045.37k   125791.36k   304323.58k   470531.07k   560005.12k
sha1             37855.97k   108750.44k   234854.06k   330731.18k   375278.25k
rc4             319218.02k   344750.72k   291458.47k   292768.77k   293093.38k
blowfish cbc     99528.55k   103911.65k   105671.68k   105717.08k   105851.56k
aes-128 cbc     132494.65k   176655.57k   191974.14k   196217.86k   197528.23k
aes-192 cbc     116965.39k   150690.07k   162185.98k   165767.17k   166857.39k
aes-256 cbc     105400.30k   131841.28k   140635.48k   143151.79k   143720.45k


OpenSSL 1.0.0 29 Mar 2010
built on: Mon Mar 29 10:24:52 PDT 2010
options:bn(64,64) rc4(1x,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DWHIRLPOOL_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              25110.83k    88038.31k   245494.95k   442503.85k   578172.25k
sha1             26626.11k    83364.31k   201984.26k   312384.17k   372091.56k
rc4             314936.59k   342759.70k   290525.35k   293123.94k   292855.81k
blowfish cbc     99449.09k   104068.71k   105179.99k   105560.06k   105769.64k
aes-128 cbc      85512.84k    92059.22k    93251.07k    94923.43k    94836.05k
aes-192 cbc      72161.34k    77346.69k    78668.03k    78866.09k    79207.60k
aes-256 cbc      62707.46k    66420.74k    67309.91k    67600.73k    67908.67k


OpenSSL 1.1.0-dev xx XXX xxxx
built on: Tue Apr 20 16:50:15 PDT 2010
options:bn(64,64) rc4(1x,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DWHIRLPOOL_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              27442.73k    93375.47k   255399.94k   450797.23k   580198.40k
sha1             28238.43k    87491.35k   214036.31k   335593.23k   399572.99k
rc4             312835.40k   341760.87k   290699.35k   292358.14k   292571.82k
blowfish cbc     99555.57k   103852.71k   105283.07k   105647.79k   105690.45k
aes-128 cbc      85725.71k    91975.21k    93809.92k    94533.97k    94857.33k
aes-192 cbc      72206.59k    76802.57k    78558.72k    78909.78k    78963.76k
aes-256 cbc      62567.38k    66193.28k    67344.30k    67723.26k    67758.76k

(The 1.1 version is from the 20100420 snapshot.)
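
To put a number on the AES slowdown, the rows can be compared directly. The following is a minimal sketch rather than part of the original measurements: it parses the two aes-128 cbc rows copied from the 0.9.8n and 1.0.0 outputs above and prints the throughput ratio per block size. The helper name parse_speed_row is hypothetical, and the parsing assumes each row ends with exactly five throughput columns.

```python
# Illustrative only: rows copied verbatim from the speed output above.
ROW_098N = "aes-128 cbc     132494.65k   176655.57k   191974.14k   196217.86k   197528.23k"
ROW_100 = "aes-128 cbc      85512.84k    92059.22k    93251.07k    94923.43k    94836.05k"
BLOCK_SIZES = (16, 64, 256, 1024, 8192)


def parse_speed_row(line):
    """Split one `openssl speed` row into (name, [throughput values]).

    Assumes the last len(BLOCK_SIZES) fields are throughputs with a
    trailing 'k' (1000s of bytes per second); everything before is the name.
    """
    fields = line.split()
    values = [float(f.rstrip("k")) for f in fields[-len(BLOCK_SIZES):]]
    name = " ".join(fields[:-len(BLOCK_SIZES)])
    return name, values


name, old = parse_speed_row(ROW_098N)
_, new = parse_speed_row(ROW_100)

# Ratio below 1.0 means 1.0.0 is slower than 0.9.8n at that block size.
ratios = [n / o for n, o in zip(new, old)]
for size, ratio in zip(BLOCK_SIZES, ratios):
    print(f"{name} {size:>5} bytes: 1.0.0 / 0.9.8n = {ratio:.2f}")
```

On these figures the ratio falls from about 0.65 at 16 bytes to about 0.48 at 8192 bytes, so the gap is widest at the larger block sizes.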

For comparison, here is the 1.0.0 speed output for a somewhat older
3.0 GHz Intel chip.

OpenSSL 1.0.0 29 Mar 2010
built on: Mon Mar 29 10:24:52 PDT 2010
options:bn(64,64) rc4(1x,char) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DWHIRLPOOL_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              27344.09k    94055.85k   249754.62k   428658.51k   539028.05k
sha1             21816.47k    80017.71k   207663.70k   342875.14k   428610.35k
rc4             343765.07k   396592.60k   376180.05k   386428.59k   424370.09k
blowfish cbc    102917.62k   108506.22k   109450.89k   109555.71k   109770.07k
aes-128 cbc      87576.06k    96398.03k    98596.52k   207228.25k   210130.26k
aes-192 cbc      74433.59k    80153.79k    82086.91k   175156.05k   177280.34k
aes-256 cbc      64822.50k    69178.15k    70595.58k   150416.73k   151677.19k

Note the difference in RC4 performance between these two systems,
both of which are nominally running at 3.0 GHz.
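
Since both machines are nominally clocked the same, the RC4 figures can also be expressed as cycles per byte. This is a rough sketch under a stated assumption: it takes the nominal 3.0 GHz clock at face value and ignores turbo modes or throttling, so the results are approximate. The function name cycles_per_byte is hypothetical; the inputs are the 8192-byte RC4 figures from the two 1.0.0 tables above.

```python
CLOCK_HZ = 3.0e9  # assumption: nominal clock, no turbo or throttling


def cycles_per_byte(kbytes_per_sec, clock_hz=CLOCK_HZ):
    """Convert an `openssl speed` figure (1000s of bytes/s) to cycles/byte."""
    return clock_hz / (kbytes_per_sec * 1000.0)


# 8192-byte RC4 throughput from the two OpenSSL 1.0.0 tables above.
rc4_westmere = cycles_per_byte(292855.81)
rc4_older = cycles_per_byte(424370.09)

print(f"Westmere:   {rc4_westmere:.2f} cycles/byte")
print(f"older chip: {rc4_older:.2f} cycles/byte")
print(f"Westmere throughput relative to older chip: {rc4_older / rc4_westmere:.2f}")
```

Under that assumption the Westmere spends roughly 10.2 cycles per byte against roughly 7.1 on the older chip, i.e. about 69% of the older chip's RC4 throughput at the same nominal clock.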

--
Iain Morgan
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]
Re: Slow AES and RC4 performance on Intel Westmere

Andy Polyakov
> As previously noted on this mailing list, the AES performance (without
> AES-NI) of v1.0.0 on Intel Westmere chips seems a bit slow.

As also noted, this is the price of side-channel countermeasures. Normally
CBC performance is lower for small block sizes and "usual fast" for
>=512B blocks. Intel has reintroduced Hyper-Threading, meaning that the L1
cache is shared between two threads, which is why the "usual fast" code is
never engaged. The cpuid.c output (thanks!) confirms that it's intentional.
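
For readers who want to check this against the register dump in the original post, here is a minimal decoding sketch. It rests on assumptions that the thread does not state: that each line is eax:ebx:ecx:edx, and that the three lines are CPUID leaves 0, 1 and 4 (sub-leaf 0) in that order. The vendor string decoding cleanly supports the register order, but the leaf assignment is a guess. Bit positions follow the Intel CPUID definition (leaf 1 EDX bit 28 = HTT, leaf 1 ECX bit 25 = AES-NI, leaf 4 EAX bits 25:14 = maximum addressable logical-processor IDs sharing the cache, minus one). The helper names are hypothetical.

```python
def parse_regs(line):
    """Parse 'eax:ebx:ecx:edx' hex words into a tuple of ints."""
    return tuple(int(word, 16) for word in line.split(":"))


# Lines copied from the cpuid.c output in the original post.
leaf0 = parse_regs("0000000b:756e6547:6c65746e:49656e69")
leaf1 = parse_regs("000206c2:00200800:029ee3ff:bfebfbff")
leaf4 = parse_regs("3c004121:01c0003f:0000003f:00000000")  # assumed leaf 4


def vendor_string(regs):
    """Vendor ID is the bytes of EBX, EDX, ECX (in that order), little-endian."""
    _, ebx, ecx, edx = regs
    raw = b"".join(r.to_bytes(4, "little") for r in (ebx, edx, ecx))
    return raw.decode("ascii")


vendor = vendor_string(leaf0)
htt = bool((leaf1[3] >> 28) & 1)        # leaf 1 EDX bit 28: Hyper-Threading
aesni = bool((leaf1[2] >> 25) & 1)      # leaf 1 ECX bit 25: AES-NI
cache_type = leaf4[0] & 0x1F            # 1 = data cache
cache_level = (leaf4[0] >> 5) & 0x7     # 1 = L1
sharing_ids = ((leaf4[0] >> 14) & 0xFFF) + 1  # addressable IDs sharing it

print(vendor, "HTT:", htt, "AES-NI:", aesni)
print(f"cache type {cache_type}, level {cache_level}, shared by up to {sharing_ids} logical CPUs")
```

Read this way, the dump reports GenuineIntel with the HTT flag set and a level-1 data cache addressable by two logical processors, which is consistent with the shared-L1 explanation above. The sharing field is an upper bound on addressable IDs, not a count of threads actually enabled.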

> In addition,
> RC4 seems a bit slow compared to previous Intel chips.

See http://cvs.openssl.org/chngview?cn=19636

A.