[openssl.org #3615] [PATCH] ChaCha20 with Poly1305 TLS Cipher Suites via the EVP interface


Rich Salz via RT
Hello everyone,
This patch is a contribution to OpenSSL.
It includes efficient implementations of Dan Bernstein's Poly1305 (authenticator) and ChaCha20 (stream cipher).
The patch is an adaptation of a previous implementation by Adam Langley of chacha20+poly1305 TLS based suites.
That previous implementation was made for the AEAD branch, and we adapted it for the current master head.
We also changed the code so as to implement the latest draft, and added optimized implementations for processors that support AVX/AVX2, as well as for future processors that will have the AVX512F and AVX512IFMA instruction sets.
(In particular, using the VPMADD52LUQ and VPMADD52HUQ instructions announced in https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf)
The patch attempts to provide optimized performance, even for short buffers when used in TLS.
Our implementation supports the following cipher suites:
ECDHE-ECDSA-CHACHA20-POLY1305
ECDHE-RSA-CHACHA20-POLY1305
DHE-RSA-CHACHA20-POLY1305
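
These suites can be exercised from any TLS stack built on a patched OpenSSL. As a hedged illustration (using Python's ssl module, which wraps OpenSSL, rather than the C API; availability of the suites depends on the OpenSSL build Python is linked against), a client context could be restricted to the three suites above:

```python
import ssl

# Restrict a client context to the ChaCha20-Poly1305 suites listed above.
# Suite names as given in the patch; whether set_ciphers() accepts them
# depends on the linked OpenSSL supporting ChaCha20-Poly1305.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.set_ciphers(
    "ECDHE-ECDSA-CHACHA20-POLY1305:"
    "ECDHE-RSA-CHACHA20-POLY1305:"
    "DHE-RSA-CHACHA20-POLY1305"
)
# Inspect which of the requested suites the library actually enabled.
names = [c["name"] for c in ctx.get_ciphers()]
```

Note that with TLS 1.3-capable builds, get_ciphers() may additionally report the fixed TLS 1.3 suites alongside the requested ones.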

Previous related work:
A. Langley, chacha20poly1305 commit to OpenSSL. https://git.openssl.org/gitweb/?p=openssl.git;a=commit;h=9a8646510b3d0a48e950748f7a2aaa12ed40d5e0
M. Goll, S. Gueron, "Vectorization of ChaCha Stream Cipher". IEEE Proceedings of 11th International Conference on Information Technology: New Generations (ITNG 2014), 612-615 (2014).
M. Goll, S. Gueron, "Vectorization of Poly1305 Message Authentication Code", to be published.
M. Goll, S. Gueron, Chacha20 and Poly1305 AVX2/AVX512 code.
Currently implemented draft:
https://tools.ietf.org/html/draft-nir-cfrg-chacha20-poly1305-06
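
For reference, the ChaCha20 block function that the draft specifies can be sketched in pure Python (an illustrative sketch, not the optimized patch code). The 16-word state layout follows the draft: four constant words, eight key words, a 32-bit block counter, and a 96-bit nonce:

```python
import struct

def quarter_round(s, a, b, c, d):
    # The ChaCha quarter round on four 32-bit words of the state.
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] ^= s[a]; s[d] = ((s[d] << 16) | (s[d] >> 16)) & 0xffffffff
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] ^= s[c]; s[b] = ((s[b] << 12) | (s[b] >> 20)) & 0xffffffff
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] ^= s[a]; s[d] = ((s[d] << 8) | (s[d] >> 24)) & 0xffffffff
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] ^= s[c]; s[b] = ((s[b] << 7) | (s[b] >> 25)) & 0xffffffff

def chacha20_block(key, counter, nonce):
    # State: 4 constant words, 256-bit key, 32-bit counter, 96-bit nonce,
    # all little-endian.
    state = list(struct.unpack("<4I", b"expand 32-byte k")) \
          + list(struct.unpack("<8I", key)) \
          + [counter] + list(struct.unpack("<3I", nonce))
    working = state[:]
    for _ in range(10):  # 20 rounds = 10 column+diagonal double rounds
        quarter_round(working, 0, 4,  8, 12)
        quarter_round(working, 1, 5,  9, 13)
        quarter_round(working, 2, 6, 10, 14)
        quarter_round(working, 3, 7, 11, 15)
        quarter_round(working, 0, 5, 10, 15)
        quarter_round(working, 1, 6, 11, 12)
        quarter_round(working, 2, 7,  8, 13)
        quarter_round(working, 3, 4,  9, 14)
    # Feed-forward: add the initial state, then serialize little-endian.
    return struct.pack("<16I", *[(w + s) & 0xffffffff for w, s in zip(working, state)])
```

The stream cipher XORs successive keystream blocks (with an incrementing counter) into the plaintext.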

Performance on current processors, in cycles per byte (C/B), for 8KB buffers:

        chacha20    poly1305    chacha20poly1305
IVB:    2.7 C/B     1.33 C/B    4.03 C/B
HSW:    1.4 C/B     0.75 C/B    2.15 C/B
BDW:    1.4 C/B     0.72 C/B    2.12 C/B

Developers and authors:
***************************************************************************
Shay Gueron (1, 2), and Vlad Krasnov (1)
(1) Intel Corporation, Israel Development Center, Haifa, Israel
(2) University of Haifa, Israel
***************************************************************************
Copyright(c) 2014, Intel Corp.







Attachment: chacha20poly1305_patch_avx123ifma52.patch (224K)

Re: [openssl.org #3615] [PATCH] ChaCha20 with Poly1305 TLS Cipher Suites via the EVP interface

Rich Salz via RT
Hi,

> This patch is a contribution to OpenSSL.
> It includes efficient implementations of Dan Bernstein's Poly1305 (authenticator) and ChaCha20 (stream cipher).

Incidentally, I'm working on this too and already have a ChaCha module.
What I've learned is that ChaCha SIMD performance is a delicate balance
between instruction issue rate, latencies, and register availability.
The thing is that, at least on x86 processors, it is possible to achieve
better performance by arranging data "vertically". I mean the following.
Customarily, key material is loaded into registers as follows (numbers
are *word* offsets; 0xN0 refers to the next key block):

   3    2    1    0
   7    6    5    4
 0xb  0xa    9    8
 0xf  0xe  0xd  0xc
0x13 0x12 0x11 0x10
...

whereas "vertically" means:

0x30 0x20 0x10  0
0x31 0x21 0x11  1
0x32 0x22 0x12  2
0x33 0x23 0x13  3
...
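
The idea can be modeled in plain Python, with each SIMD "register" represented as a list of lanes, each lane holding the same state word from an independent block (a hypothetical model of the layout, not the actual assembly):

```python
MASK = 0xffffffff

def rotl(v, n):
    # Rotate every 32-bit lane of "register" v left by n bits.
    return [((x << n) | (x >> (32 - n))) & MASK for x in v]

def qr_lanes(a, b, c, d):
    # One ChaCha quarter round where each argument is a vector register
    # holding the same state word from several independent blocks (the
    # "vertical" layout): every operation advances all blocks at once,
    # with no cross-lane shuffles needed inside the round.
    a = [(x + y) & MASK for x, y in zip(a, b)]; d = rotl([x ^ y for x, y in zip(d, a)], 16)
    c = [(x + y) & MASK for x, y in zip(c, d)]; b = rotl([x ^ y for x, y in zip(b, c)], 12)
    a = [(x + y) & MASK for x, y in zip(a, b)]; d = rotl([x ^ y for x, y in zip(d, a)], 8)
    c = [(x + y) & MASK for x, y in zip(c, d)]; b = rotl([x ^ y for x, y in zip(b, c)], 7)
    return a, b, c, d
```

The trade-off is that the "vertical" layout needs a transpose on input/output and as many parallel blocks as there are lanes, which is why it only pays off beyond a certain input length.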

Naturally, this can be used only for longer inputs, and the question is
when it is appropriate to switch to it. The fact that the question is
posed means that I use both "horizontal" and "vertical" layouts, for
short and long inputs respectively. When to switch is governed by the
results (quoting the attached code, attached for reference):

#                IALU/gcc 4.8(i) 1xSSSE3/SSE2    4xSSSE3     8xAVX2
#
# P4             9.48/+99%       -/22.7(ii)      -
# Core2          7.83/+55%       7.90/8.08       4.35
# Westmere       7.19/+50%       5.60/6.70       3.00
# Sandy Bridge   8.31/+42%       5.45/6.76       2.72
# Ivy Bridge     6.71/+46%       5.40/6.49       2.41
# Haswell        5.92/+43%       5.20/6.45       2.42        1.23
# Silvermont     12.0/+33%       7.75/7.40       7.03(iii)
# Opteron        7.28/+52%       -/14.2(ii)      -
# Bulldozer      9.66/+28%       9.85/11.1       3.06(iv)
# VIA Nano       10.5/+46%       6.72/8.60       6.05
#
# (i)   compared to older gcc 3.x one can observe >2x improvement on
#       most platforms;
# (ii)  as it can be seen, SSE2 performance is too low on legacy
#       processors; NxSSE2 results are naturally better, but not
#       impressively better than IALU ones, which is why you won't
#       find SSE2 code below;
# (iii) this is not optimal result for Atom because of MSROM
#       limitations, SSE2 can do better, but gain is considered too
#       low to justify the [maintenance] effort;
# (iv)  Bulldozer actually executes 4xXOP code path that delivers 2.20;

As can be seen, on most processors it makes sense to switch to the
"vertical" layout when processing more than one block. Not all, but most.

Additional notes.

Note that there is no AVX1 code. The rationale is that it provides only
a modest improvement over SSSE3, too little to justify the maintenance cost.

You may notice that the attached code operates on a 32-bit counter. This
is because a 64-bit counter is not needed in the TLS context, while
operating on a 32-bit one makes the programming effort easier. In
situations where a 64-bit counter is required, it would be the caller's
responsibility to track overflows (in a manner similar to the one
implemented in ctr128.c). "Caller" here doesn't mean "application
programmer", but the OpenSSL code that calls the assembly; in other
words, the 32-bit counter issue won't be visible to developers.
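
That caller-side responsibility can be sketched as follows (hypothetical helper names; the real logic would live in the C glue, in the spirit of crypto/modes/ctr128.c):

```python
def bytes_before_wrap(counter, block_size=64):
    # Bytes that can be processed before a 32-bit block counter
    # starting at `counter` wraps back to zero.
    return ((1 << 32) - counter) * block_size

def split_for_wrap(length, counter, block_size=64):
    # Split an input of `length` bytes into two pieces so that each
    # piece stays within one pass of the 32-bit counter; the caller
    # would bump the high 32 bits of a 64-bit counter between pieces.
    first = min(length, bytes_before_wrap(counter, block_size))
    return first, length - first
```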

The reason AVX512 is not implemented yet is the following. As mentioned
at the beginning, performance here is a matter of delicate balance
between instruction issue rate, latencies, and register availability.
The assessment and choice of approach are therefore deferred until more
detailed data about AVX512 hardware is available.

There are even 32-bit x86 (not final) and ARM results available:

#                       IALU/gcc        4xSSSE3
# Pentium               17.5/+80%
# PIII                  14.2/+60%
# P4                    18.6/+84%
# Core2                 9.56/+89%       4.90
# Westmere              9.50/+45%       3.50
# Sandy Bridge          10.5/+47%       3.25
# Haswell               8.15/+50%       2.85
# Silvermont            17.4/+36%       8.35
# Opteron               10.2/+54%
# Bulldozer             13.4/+50%       4.40

#                       IALU/gcc        1xNEON      3xNEON+1xIALU
# Cortex-A5             19.3(*)/+130%   21.8        14.1
# Cortex-A8             10.5(*)/+110%   13.9        6.35
# Cortex-A9             12.9(**)/+170%  14.3        6.50
# Snapdragon S4         11.5/+150%      13.6        4.90
#
# (*)   most "favourable" result for aligned data on little-endian
#       processor, result for misaligned data is 10-15% lower;
# (**)  this result is a trade-off: it can be improved by 20%,
#       but then Snapdragon S4 and Cortex-A8 results get
#       20-25% worse;

As for Poly1305: I'm going to look into it next. Just as with ChaCha,
the scope will not be limited to the latest x86_64 processors and ELF
platforms. This doesn't preclude the possibility that parts of this code
submission will be used.
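
For reference, one-shot Poly1305 as the draft defines it is small enough to sketch in pure Python (an illustrative sketch, not optimized code): r is clamped, each 16-byte chunk is absorbed into an accumulator mod 2^130 - 5, and s is added at the end.

```python
def poly1305_tag(key, msg):
    # key is the 32-byte (r, s) pair; r is clamped per the spec.
    r = int.from_bytes(key[:16], "little") & 0x0ffffffc0ffffffc0ffffffc0fffffff
    s = int.from_bytes(key[16:32], "little")
    p = (1 << 130) - 5
    acc = 0
    for i in range(0, len(msg), 16):
        # Each chunk gets a high 0x01 byte appended before being added,
        # which also encodes the length of a short final chunk.
        n = int.from_bytes(msg[i:i + 16] + b"\x01", "little")
        acc = (acc + n) * r % p
    return ((acc + s) & ((1 << 128) - 1)).to_bytes(16, "little")
```

The Horner-style evaluation above is exactly what makes vectorization interesting: with powers of r precomputed, several chunks can be folded in per iteration.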



Attachment: chacha-x86_64.pl (75K)