Re: unrolled RC4 for ia64


Re: unrolled RC4 for ia64

Andy Polyakov
Hi, David! Long time, no see:-)

 > From: David Mosberger
 > To: [hidden email]
 > Cc: [hidden email]

First of all, note that openssl-dev is a subscribers-only list, meaning
that you have to be subscribed to post to it. But no harm is done, as
you've sent a copy to my personal address and the submission will be
considered. I apologize for the inconvenience.

> Attached below is a patch that is based on the RC4 code recently
> released by HP Labs (http://www.hpl.hp.com/research/linux/crypto/).
> Compared to the HP Labs code, I just changed things to generate the
> code via a perl-script since that seems to be the standard way for
> OpenSSL.

There is no requirement to express it in perl. All IA-64 OSes share the
same procedure calling convention and assembler syntax, which are the
two most common reasons why most of the other assembler modules are
[required to be] written in perl. But never mind:-)

> Compared to the existing RC4 code for ia64 (which is very
> good already), this version has two improvements: it unrolls the loop
> to avoid the 1-cycle penalty that McKinley-type cores exhibit when a
> byte-store to the same word occurs faster than once per 4 cycles

For my future reference regarding this issue: in the commentary section
you mention that "McKinley/Madison can issue "st1" to the same bank at a
rate of at most one per 4 cycles." I wonder how large the bank is?

> (RC4 would like to do it once per 3 cycles).  Additionally, the code
> carefully prefetches the data and the key table.  This was measured to
> achieve real speedup in complex workloads and also lets RC4 run at
> basically the same speed no matter whether the data is in memory or in
> the cache.
>
> With the patch applied, "make tests" still succeeds.  Performance
> looks like this (openssl speed rc4):
>
>
> type            16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> rc4 original  146981.83k   178195.65k   187632.68k   190110.93k   190661.86k
> rc4 revised   134713.29k   185199.72k   248433.36k   265739.02k   271370.10k
>
> This was on a (lowly) 900MHz McKinley.

The currently available version is already the second one [the original
was not playing well with OpenSSH], and it is a tad slower than the
original. This explains the discrepancy between the OpenSSL results
mentioned above [190MBps] and those quoted in our current source code
[~210MBps]. Just for reference...

> If the 16-byte speed regression is considered serious, I suspect we
> could avoid that by not prefetching the table for small data sizes.

Not really a concern.

> Please consider this code for inclusion into the next OpenSSL release.

I'll look into it shortly and for now I'd like to thank you for your
submission. A.

P.S. I omit the original copy of the patch to spare the public
bandwidth; the code is available at the referred URL if anybody is
anxious to examine it now.

Re: unrolled RC4 for ia64

Andy Polyakov
>>I'll look into it shortly and for now I'd like to thank you for your
>>submission.

1. RC4 implementation.

The original submission is at http://cvs.openssl.org/chngview?cn=14248,
and my adaptation is at http://cvs.openssl.org/chngview?cn=14249. Most
notably, I've eliminated the need for key->x=1 in RC4_set_key, as it's
important to us that the C and assembler versions of RC4 are perfectly
interchangeable. Secondly, I have eliminated the remaining byte-order
dependency [look for RC4_BIG_ENDIAN]. The code has been tested on Linux
and HP-UX and compiles on Win64. There are other small changes listed at
the latter URL.
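
To make the interchangeability point concrete, a plain C sketch of the
conventional byte loop [illustrative only, not the committed code; the
struct below merely mirrors RC4_KEY's layout]. An assembler version that
wanted x pre-incremented would need RC4_set_key to plant 1 there and
would no longer be a drop-in replacement for the C loop; the two table
stores per output byte are also the st1 traffic discussed in the
previous mail.

    #include <stddef.h>

    /* Illustrative layout mirroring RC4_KEY: x, y, then the 256-entry table. */
    typedef struct { unsigned int x, y, data[256]; } rc4_key_sketch;

    /* Conventional byte loop: x and y are picked up exactly as left by
     * RC4_set_key [both 0 after key setup] or by a previous call, and are
     * written back with the same meaning, so either implementation can
     * continue a stream started by the other. */
    static void rc4_sketch(rc4_key_sketch *key, size_t len,
                           const unsigned char *in, unsigned char *out)
    {
        unsigned int x = key->x, y = key->y, tx, ty;

        while (len--) {
            x = (x + 1) & 0xff;
            tx = key->data[x];
            y = (y + tx) & 0xff;
            ty = key->data[y];
            key->data[y] = tx;      /* table store #1 */
            key->data[x] = ty;      /* table store #2 */
            *out++ = *in++ ^ (unsigned char)key->data[(tx + ty) & 0xff];
        }
        key->x = x;                 /* same convention on exit as on entry */
        key->y = y;
    }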

2. MD5 implementation.

There seems to be a misconception about md5_*_[host|data]_order.
md5_*_host_order is called with input in host byte order, i.e.
big-endian on big-endian and little-endian on little-endian. This means
that the endianness switch can't take place in the md5_*_host_order
function. In other words, the switch belongs in the md5_*_data_order
function, which in turn means that yes, it's appropriate to implement
one. I already did this [tested on Linux and HP-UX], but haven't
committed the code yet, as I have one remaining issue, namely the HP-UX
assembler failing to compile it [see "common comments" below]. I'll
commit when I resolve the issue. BTW, you also make a wrong assumption
about the moment when the endianness switch is appropriate. It has to
take place *after* you pick up A,B,C,D and *before* you write them back,
not vice versa.

Once again. I *already* have MD5 working on both Linux and HP-UX and the
code will be committed shortly.
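
To spell out the host/data order distinction [an illustrative sketch,
not the committed IA-64 code]: the *_data_order flavour consumes the
message as bytes and assembles each 32-bit word in the little-endian
order the MD5 spec prescribes, which is a plain load on a little-endian
host and an explicit swap on a big-endian one, while the *_host_order
flavour assumes that conversion has already been done by the caller.

    #include <stdint.h>

    /* Illustrative sketch: assemble one 32-bit MD5 input word from the byte
     * stream. This byte-wise little-endian load is what makes a *_data_order
     * routine endian-neutral; a *_host_order routine reads the word as-is. */
    static uint32_t md5_load_le32(const unsigned char *p)
    {
        return (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    }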

3. DES implementation.

I'd rather not:-( The point is that we always have to weigh the
performance gain against the complexity inherent to an assembler
implementation, and a 20-30% improvement is exactly where the line is
drawn. Yes, I know that one can find assembler code in OpenSSL which
gives less gain, most notably md5-sparcv9.S. But that doesn't mean I
don't regret it by now. At least I can assure you that if/when it fails,
it will simply be excluded rather than fixed. On a side note, speaking
of the 64-bit alignment of the key schedule with a long long dummy: I'd
recommend double, as long long is not as portable as one expects:-)
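
For the record, the alignment idea looks like this [illustrative names
only, not OpenSSL's actual DES_key_schedule]: overlaying the schedule
with a double dummy gives 8-byte alignment on essentially every ABI,
whereas long long is still missing from some older compilers.

    /* Illustrative only: force 8-byte alignment of a key schedule by pairing
     * it with a double dummy member instead of the less portable long long. */
    typedef union {
        double align;           /* present only to force natural 8-byte alignment */
        unsigned char ks[128];  /* illustrative storage for the actual schedule */
    } aligned_key_schedule;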

4. Endian-neutral AES mentioned in your TODO list.

It would require AES_set_[encrypt|decrypt]_key implemented in assembler,
because the key schedule has to be maintained in big-endian order [but
not the number of rounds?]. Then we also have to think about
cache-timing attacks. I have already sketched some code mitigating this
kind of attack [similar to the one found in the x86 implementation,
where only CBC mode is "protected"], so I'd say let's cooperate on this
in order to avoid duplicate effort.
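
As for the kind of mitigation I have in mind [an illustrative C sketch;
the names and the 64-byte line size are assumptions, and the real x86
code does this in assembler on its own tables]: touch every cache line
of a lookup table before the secret-indexed accesses start, so that the
cache footprint observable from outside is largely independent of the
key.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative mitigation: pre-load every cache line of a lookup table so
     * that subsequent secret-indexed reads hit the cache regardless of index. */
    static void preload_table(const uint32_t *table, size_t nwords)
    {
        volatile uint32_t sink = 0;
        size_t i;

        for (i = 0; i < nwords; i += 64 / sizeof(uint32_t))
            sink ^= table[i];       /* one read per assumed 64-byte line */
        (void)sink;                 /* volatile accumulator keeps the loads alive */
    }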

5. Endian-neutral SHA mentioned in your TODO list.

The performance difference would be just a few percent. I don't think
it's worth the effort...

Common comments.

Usage of the LP64 macro to guard usage of addp4: HP-UX in 32-bit mode is
the only platform using addp4, therefore it's more appropriate to write
"#if defined(_HPUX_SOURCE) && !defined(_LP64)" rather than just "#if
!defined(_LP64)" as you suggested, because the latter would have an
undesired effect on Win64 [and VMS?].
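
In other words, something along these lines in the .S files [a sketch of
the guard; the macro name is illustrative]:

    /* Sketch: use addp4 [which forms a 64-bit address from a 32-bit pointer]
     * only on 32-bit HP-UX, and a plain add on every LP64 target, Win64
     * included. */
    #if defined(_HPUX_SOURCE) && !defined(_LP64)
    # define ADDP   addp4
    #else
    # define ADDP   add
    #endif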

The HP-UX assembler can't handle labels and assembler directives in long
C preprocessor macros expanded as a single line. In the RC4 case it was
rather easy to work around by splitting the macro in two, but in the MD5
case one would have to split it into something like 10, which is nothing
but ridiculous. We have to ponder how to deal with this...

Feedback to OpenSSL. The timing of the patch found at
http://www.hpl.hp.com/research/linux/crypto/ couldn't be worse. Most
notably, the quoted DEVRANDOM issue emerged during beta-testing and was
fixed shortly after. Secondly, DES_INT [and alignment] could have made
it into 0.9.8, but now it has to wait till the next release, because it
affects binary compatibility. As we have already established, it can be
a frustrating experience for non-subscribers to provide feedback to
<openssl-dev> [which must be the explanation for the bad timing], but it
should be noted that there *is* a way to contribute such feedback,
bug-fixes, improvements and new code without having to subscribe,
[hidden email] to be specific. Hoping for and appreciating
cooperation:-)

Cheers. A.

Re: unrolled RC4 for ia64

Andy Polyakov
> 2. MD5 implementation.
>
> Once again. I *already* have MD5 working on both Linux and HP-UX and the
> code will be committed shortly.

Original code is at http://cvs.openssl.org/chngview?cn=14252, and my
endian-neutral adaptation at http://cvs.openssl.org/chngview?cn=14253.

> The HP-UX assembler can't handle labels and assembler directives in long
> C preprocessor macros expanded as a single line. In the RC4 case it was
> rather easy to work around by splitting the macro in two, but in the MD5
> case one would have to split it into something like 10, which is nothing
> but ridiculous. We have to ponder how to deal with this...

In this case I settled for the butt-ugly "$(CC) $(CFLAGS) -E
asm/md5-ia64.S | $(PERL) -ne 's/;\s+/;\n/g; print;'" in the Makefile,
which scrupulously splits the directives and instructions in
multi-directive/multi-instruction lines onto separate lines. In the long
run I'd prefer to avoid using the C preprocessor to handle long code
snippets and opt instead for perl-generated assembler code which does
not require C preprocessing [such as the code found in sha/asm].
Cheers. A.

Re: unrolled RC4 for ia64

Andy Polyakov
> 1. RC4 implementation.

I wonder why the key schedule prefetch is performed with a 128-byte
stride? As far as I understand, 128 bytes is the L2 line size. But the
loop is scheduled for L1D access, which [unlike L2] has a 64-byte line
size. In other words it appears that the prefetch fills only every
second line in L1D. Is that intentional? I do realize that there is a
potential trade-off between the number of lfetch instructions and a
couple of stalls in the first loop spin... A.
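
P.S. For concreteness, a C rendering of that trade-off [illustrative
only; the real code issues lfetch in assembler and the struct layout is
an assumption]: with 4-byte table entries the schedule is 1KB, so a
64-byte stride costs 16 prefetches and touches every L1D line, while a
128-byte stride costs 8 but leaves every second L1D line to be faulted
in on the first loop spin.

    #include <stddef.h>

    /* Illustrative layout and prefetch loop; GCC's __builtin_prefetch stands
     * in for the lfetch instructions used by the assembler code. */
    typedef struct { unsigned int x, y, data[256]; } rc4_key_layout;

    static void prefetch_key_schedule(const rc4_key_layout *key, size_t stride)
    {
        const char *p = (const char *)key->data;
        size_t off;

        for (off = 0; off < sizeof(key->data); off += stride)
            __builtin_prefetch(p + off, 0, 3);  /* read access, keep in cache */
    }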

Re: unrolled RC4 for ia64

Andy Polyakov
david mosberger wrote:
> IIRC, the loop should be scheduled for L2 latency.

With respect to the input data, maybe, but there is no way one can
schedule a 3*n [or even 4*n] RC4 loop for L2 latency. Loads from the key
schedule are commonly used already in the next cycle; in other words,
the key schedule is expected to reside in L1D. A.

>>>1. RC4 implementation.
>>
>>I wonder why the key schedule prefetch is performed with a 128-byte
>>stride? As far as I understand, 128 bytes is the L2 line size. But the
>>loop is scheduled for L1D access, which [unlike L2] has a 64-byte line
>>size. In other words it appears that the prefetch fills only every
>>second line in L1D. Is that intentional? I do realize that there is a
>>potential trade-off between the number of lfetch instructions and a
>>couple of stalls in the first loop spin...
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]