SHA-256 implementation improvement

SHA-256 implementation improvement

Pavel Semjanov
Hello again,

as I promised, here is the optimized code for the SHA-256 hash on the
x86 platform. It should run up to 20% faster on Core 2/iX. You are free
to use (or modify) this code in any form in OpenSSL and CRYPTOGAMS. I
guess you should make it PIC, like any other code for x86 (I didn't,
because I don't need it in my projects).

Thanks again Andy!

--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Attachment: sha256-586.pl (12K)

Re: SHA-256 implementation improvement

Andy Polyakov-2
> as I promised, here is the optimized code for SHA-256 hash, x86
> platform. Should work faster on Core 2/iX up to 20%.

I can't replicate the results, not on Intel CPUs. Well, I can get 20% on
Sandy Bridge if I replace rotate with double-precision shift, but that's
not a fair comparison (in the sense that the switch would improve the
original code as well). I did observe more than 20% on Opteron, but on
Core2/Sandy Bridge I get only 13-11%... I've taken one of the ideas, the
alternative Maj, and managed to squeeze ~5% on Opteron and Sandy Bridge,
none on Core2 and a whole 13% on Atom, see
http://cvs.openssl.org/chngview?cn=22587.
Compared to this updated code I observe your code being
+20%/+13%/+6%/-18% faster/slower on Opteron/Core2/Sandy Bridge/Atom. So
the full unroll helps, but apparently less on the most recent CPUs
(modulo the lack of results for AMD Bulldozer and Bobcat). Something to
attempt at some later point... From the Sandy Bridge viewpoint it makes
more sense to arrange a run-time switch to a shrd-based non-unrolled code
path. This way the code size increase would be minimal, while the
performance difference between the tight and the fully unrolled loop
would be nominal.
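
For readers unfamiliar with the "rotate vs. double precision shift" remark: on x86, `shrd` with the same register as both operands produces the same value as `ror`. Below is a minimal Python model of that equivalence, assuming 32-bit operands and shift counts between 1 and 31. The function names are illustrative only and are not taken from sha256-586.pl.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit value right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def shrd32(dst, src, n):
    """Model of x86 SHRD: shift dst right by n bits and fill the
    vacated high bits with the low n bits of src (0 < n < 32)."""
    return ((dst >> n) | (src << (32 - n))) & MASK32


def big_sigma1(e):
    """SHA-256 Sigma1(e) = ROTR6(e) ^ ROTR11(e) ^ ROTR25(e),
    written with the shrd model (same value as using ror32)."""
    return shrd32(e, e, 6) ^ shrd32(e, e, 11) ^ shrd32(e, e, 25)
```

Whether `shrd` is actually faster than `ror` depends on the microarchitecture, which is why the thread talks about a run-time switch rather than an unconditional replacement.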

> I guess
> you should make it PIC, as any other code for x86 (I didn't make it
> because I don't need it in my projects).

Pure code, i.e. without references to data, is always
position-independent. As you effectively embed constants into
instructions, it already is PIC.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: SHA-256 implementation improvement

Pavel Semjanov
> I did observe more than 20% on Opteron, but on Core2/Sandy Bridge
> I get only 13-11%...

Well, I've got 984 / 1170 clocks on Core 2 (17%)
and 1033 / 1250 on Core i5 (Westmere) (18%)
Anyway, I guess your measurement is more precise.
I'm glad some of my ideas were useful. Thanks.


--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Andy Polyakov-2
>> I did observe more than 20% on Opteron, but on Core2/Sandy Bridge
>> I get only 13-11%...
>
> Well, I've got 984 / 1170 clocks on Core 2 (17%)
> and 1033 / 1250 on Core i5 (Westmere) (18%)

Out of curiosity, how fast is updated code from CVS on Westmere?

> Anyway, I guess your measurement is more precise.

Why would you assume that? My measurements are based on 'openssl speed
sha256' output. It's probably the least precise, but it has proven to be
adequate on multiple occasions. Your numbers are likely to be more
precise, while 'openssl speed'-based results are likely to be more
practical...

Re: SHA-256 implementation improvement

Pavel Semjanov
On 19.05.2012 19:04, Andy Polyakov wrote:
>>> I did observe more than 20% on Opteron, but on Core2/Sandy Bridge
>>> I get only 13-11%...
>>
>> Well, I've got 984 / 1170 clocks on Core 2 (17%)
>> and 1033 / 1250 on Core i5 (Westmere) (18%)
>
> Out of curiosity, how fast is updated code from CVS on Westmere?

Sorry, too many codenames. It is Lynnfield. And the result for Lynnfield
specifically is unexpected, see below:

clocks for 1.5 / 1.6 / my version:
Core2         1170 / 1131 / 984
Core i5       1250 / 1430 (!) / 1033
P4 Northwood  2108 / 2046 / 1957
AMD K10       1270 / 1200 / 1058

I observed the same slowdown for version 1.6 on Clarkdale.


--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Jan Just Keijser-2
In reply to this post by Pavel Semjanov
Hi Pavel,

Pavel Semjanov wrote:
> Hello again,
>
> as I promised, here is the optimized code for SHA-256 hash, x86
> platform. Should work faster on Core 2/iX up to 20%. This code you are
> free to use (or modify) in any form on OpenSSL and CRYPTOGAMS. I guess
> you should make it PIC, as any other code for x86 (I didn't make it
> because I don't need it in my projects).
>
FWIW:

I've grabbed this .pl file, downloaded openssl 1.0.0j and compared the
performance of 'openssl speed sha256' with and without the patch;
initially I found *NO* noticeable performance difference on any of the
64bit machines I tested. Then it occurred to me that the patch was for
the 32bit version only (the file sha512-x86_64.pl also covers sha256); I
modified the 'Configure' script to allow the compilation of a 32bit
version of openssl *with* the assembly routines. The results for this
version on various Intel CPUs are:

Core2 E6550 (Conroe):  22 - 32 % speed up
Xeon E5440 (Harpertown): 24 - 33% speed up
Xeon X5660 (Westmere-EP): 19 - 27% speed up
i5-560M (Arrandale): 18 - 23 % speed up

Note that for the i5-560M the unpatched 64bit version still outperforms
the patched 32bit version....

How can the sha256 patch be applied to the 64bit code base?

cheers,

JJK / Jan Just Keijser

Re: SHA-256 implementation improvement

Pavel Semjanov
Hello,

> Note that for the i5-560M the unpatched 64bit version still outperforms
> the patched 32bit version....
>
> How can the sha256 patch be applied to the 64bit code base?

It's not done for 64-bit yet.


--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Andy Polyakov-2
In reply to this post by Pavel Semjanov
>>>> I did observe more than 20% on Opteron, but on Core2/Sandy Bridge
>>>> I get only 13-11%...
>>>
>>> Well, I've got 984 / 1170 clocks on Core 2 (17%)
>>> and 1033 / 1250 on Core i5 (Westmere) (18%)
>>
>> Out of curiosity, how fast is updated code from CVS on Westmere?
>
> Sorry, too many codenames. It is Lynnfield.

Let's refer to "significant designs" instead. Among contemporary Intel
cores one can recognize Core 2, Nehalem, Sandy Bridge, [Atom] ...
Westmere, Lynnfield and Clarkdale all fall into the Nehalem category.

> And the result exactly for Lynnfield is unexpected,

Don't you feel sometimes that Intel mocks you? :-) :-) :-)

> see below:
> clocks for 1.5 / 1.6 / my version:
> Core2 1170 / 1131 / 984
> Core i5 1250 / 1430 (!) / 1033

Ouch! http://cvs.openssl.org/chngview?cn=22597.

> P4 Northwood 2108 / 2046 / 1957

This contradicts my tests. Specifically, I measured a slow-down for your
code on P4. Though my P4 is the first available model, while Northwood is
a later, improved core. Incidentally, version 1.7 runs even faster on my
P4: I measured a 31->29 cpb improvement. Could you retest 1.7 on your P4?

> AMD K10 1270 / 1200 / 1058

Just to clarify: the purpose of the exercise is not to dismiss the
submission, but to figure out the pros and cons on as many CPU
implementations as possible. Though I admit I am a bit reluctant to
accept a ~10x size blow-up, especially for small blocks...

Re: SHA-256 implementation improvement

Andy Polyakov-2
In reply to this post by Jan Just Keijser-2
> I
> modified the 'Configure' script to allow the compilation of a 32bit
> version of openssl *with* the assembly routines.

What does it mean? Configure supports 32-bit builds *with* assembly as
it is. To build 32-bit version on 64-bit Linux, run './Configure
linux-elf -m32'.

> The results for this
> version are on various Intel CPUs
>
> Core2 E6550 (Conroe):  22 - 32 % speed up
> Xeon E5440 (Harpertown): 24 - 33% speed up
> Xeon X5660 (Westmere-EP): 19 - 27% speed up
> i5-560M (Arrandale): 18 - 23 % speed up

What are the ranges? If we assume that largest coefficient is for
largest block size, then these are too high. What is the base line
exactly? Is it possible that you compare to compiler-generated code?

> Note that for the i5-560M the unpatched 64bit version still outperforms
> the patched 32bit version....
>
> How can the sha256 patch be applied to the 64bit code base?

You have to realize that there is a limit to performance. I'm not
actually asserting that it's reached in the SHA256 case, but there is no
doubt that the 64-bit code is [much] closer to it. This means that you
can't expect comparable improvement coefficients (for integer-only code).

On a side note: you mentioned that you tested 1.0.0. It should be noted
that 1.0.1 has an updated sha512-x86_64 module that performs ~15% better
on Nehalem [and Atom], and nominally better on other Intel CPUs. There is
also a simple way to boost performance specifically on Sandy Bridge, and
there is a pending submission from Intel that features SIMD operation for
X[] calculations...

Re: SHA-256 implementation improvement

Jan Just Keijser-2
Andy Polyakov wrote:

>> I
>> modified the 'Configure' script to allow the compilation of a 32bit
>> version of openssl *with* the assembly routines.
>>    
>
> What does it mean? Configure supports 32-bit builds *with* assembly as
> it is. To build 32-bit version on 64-bit Linux, run './Configure
> linux-elf -m32'.
>
>  
ah, I did not know about that option - I was looking for a specific
./Configure target ...

>> The results for this
>> version are on various Intel CPUs
>>
>> Core2 E6550 (Conroe):  22 - 32 % speed up
>> Xeon E5440 (Harpertown): 24 - 33% speed up
>> Xeon X5660 (Westmere-EP): 19 - 27% speed up
>> i5-560M (Arrandale): 18 - 23 % speed up
>>    
>
> What are the ranges? If we assume that largest coefficient is for
> largest block size, then these are too high. What is the base line
> exactly? Is it possible that you compare to compiler-generated code?
>  
here are the raw 'openssl speed sha256' results with and without the
patch; all I did was
 
  tar xzf openssl-1.0.0j.tar.gz
  cd openssl-1.0.0j
  <apply patch or not>
  ./Configure linux-elf -m32
  make
  cd apps
  ./openssl speed -evp sha256 | grep ^sha
  ./openssl speed sha256 | grep ^sha

This result is on a Core2duo T9300 laptop:
                     
no patch:
sha256-evp    15721    42178     84527    113902    127184  
sha256        26851    58249     97794    119593    127668  
                       
patch:                      
sha256-evp    18178    51411    108741    150649    169099  
sha256        34380    76627    130753    159497    171054  
                       
               116%     122%      129%      132%      133%  
               128%     132%      134%      133%      134%  


So I'm seeing an increase in performance ranging from 16 to 34% for this
(artificial) test.
I'm seeing similar results on an AMD Opteron 2372HE and Opteron 6140
(non Bulldozer).
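
The percentage rows in the table above can be reproduced from the raw numbers. The sketch below assumes the five columns are the usual 'openssl speed' block sizes (16, 64, 256, 1024 and 8192 bytes; the header line was filtered out by the grep, so this is an assumption) and that each percentage is the patched throughput divided by the unpatched one. The variable and function names are illustrative.

```python
# Throughput figures copied from the Core2duo T9300 table above.
NO_PATCH = {
    "sha256-evp": [15721, 42178, 84527, 113902, 127184],
    "sha256":     [26851, 58249, 97794, 119593, 127668],
}
PATCHED = {
    "sha256-evp": [18178, 51411, 108741, 150649, 169099],
    "sha256":     [34380, 76627, 130753, 159497, 171054],
}


def ratios_percent(before, after):
    """Patched throughput as a percentage of unpatched, per column."""
    return [round(100 * a / b) for b, a in zip(before, after)]


def all_ratios():
    """Percentage rows for every benchmark line in the table."""
    return {name: ratios_percent(NO_PATCH[name], PATCHED[name])
            for name in NO_PATCH}
```

Calling `all_ratios()` gives one list of percentages per benchmark line, which makes it easy to see that the gain grows with block size for the EVP variant.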

Let me know if you would like to see another test (including
commandlines, please ;)); and how should I inspect the generated code?


cheers,

JJK / Jan Just Keijser


PS the reason I have not touched openssl 1.0.1 yet is because most of
the systems I use are CentOS 5 or 6 based ; CentOS 6 comes with 1.0.0,
hence I'm focusing mostly on that.


Re: SHA-256 implementation improvement

Jan Just Keijser-2
Jan Just Keijser wrote:

> Andy Polyakov wrote:
>>> I
>>> modified the 'Configure' script to allow the compilation of a 32bit
>>> version of openssl *with* the assembly routines.
>>>    
>>
>> What does it mean? Configure supports 32-bit builds *with* assembly as
>> it is. To build 32-bit version on 64-bit Linux, run './Configure
>> linux-elf -m32'.
>>
>>  
> ah, I did not know about that option - I was looking for a specific
> ./Configure target ...
>
>>> The results for this
>>> version are on various Intel CPUs
>>>
>>> Core2 E6550 (Conroe):  22 - 32 % speed up
>>> Xeon E5440 (Harpertown): 24 - 33% speed up
>>> Xeon X5660 (Westmere-EP): 19 - 27% speed up
>>> i5-560M (Arrandale): 18 - 23 % speed up
>>>    
>>
>> What are the ranges? If we assume that largest coefficient is for
>> largest block size, then these are too high. What is the base line
>> exactly? Is it possible that you compare to compiler-generated code?
>>  
> here are the raw 'openssl speed sha256' results with and without the
> patch; all I did was
>
>  tar xzf openssl-1.0.0j.tar.gz
>  cd openssl-1.0.0j
>  <apply patch or not>
>  ./Configure linux-elf -m32
>  make
>  cd apps
>  ./openssl speed -evp sha256 | grep ^sha
>  ./openssl speed sha256 | grep ^sha
>
> This result is on a Core2duo T9300 laptop:
>                      no patch:
> sha256-evp    15721    42178     84527    113902    127184  
> sha256        26851    58249     97794    119593    127668  
>                       patch:                       sha256-evp    
> 18178    51411    108741    150649    169099   sha256        34380    
> 76627    130753    159497    171054                        
>               116%     122%      129%      132%      133%  
>               128%     132%      134%      133%      134%  
>
arrrgh, the output got mangled on my first post: here's a second attempt:

no patch:
sha256-evp    15721    42178     84527    113902    127184
sha256        26851    58249     97794    119593    127668

patch:
sha256-evp    18178    51411    108741    150649    169099
sha256        34380    76627    130753    159497    171054

               116%     122%      129%      132%      133%
               128%     132%      134%      133%      134%


JJK

Re: SHA-256 implementation improvement

Jan Just Keijser-2
In reply to this post by Jan Just Keijser-2
Jan Just Keijser wrote:

> Andy Polyakov wrote:
>>> I
>>> modified the 'Configure' script to allow the compilation of a 32bit
>>> version of openssl *with* the assembly routines.
>>>    
>>
>> What does it mean? Configure supports 32-bit builds *with* assembly as
>> it is. To build 32-bit version on 64-bit Linux, run './Configure
>> linux-elf -m32'.
>>
>>  
> ah, I did not know about that option - I was looking for a specific
> ./Configure target ...
>
>>> The results for this
>>> version are on various Intel CPUs
>>>
>>> Core2 E6550 (Conroe):  22 - 32 % speed up
>>> Xeon E5440 (Harpertown): 24 - 33% speed up
>>> Xeon X5660 (Westmere-EP): 19 - 27% speed up
>>> i5-560M (Arrandale): 18 - 23 % speed up
>>>    
>>
>> What are the ranges? If we assume that largest coefficient is for
>> largest block size, then these are too high. What is the base line
>> exactly? Is it possible that you compare to compiler-generated code?
>>  
> here are the raw 'openssl speed sha256' results with and without the
> patch; all I did was
>
>  tar xzf openssl-1.0.0j.tar.gz
>  cd openssl-1.0.0j
>  <apply patch or not>
>  ./Configure linux-elf -m32
>  make
>  cd apps
>  ./openssl speed -evp sha256 | grep ^sha
>  ./openssl speed sha256 | grep ^sha
>
> This result is on a Core2duo T9300 laptop:
>
> no patch:
> sha256-evp    15721    42178     84527    113902    127184
> sha256        26851    58249     97794    119593    127668
>
> patch:
> sha256-evp    18178    51411    108741    150649    169099
> sha256        34380    76627    130753    159497    171054
>
>               116%     122%      129%      132%      133%
>               128%     132%      134%      133%      134%
>
> So I'm seeing an increase in performance ranging from 16 to 34% for
> this (artificial) test.
> I'm seeing similar results on an AMD Opteron 2372HE and Opteron 6140
> (non Bulldozer).
>
> Let me know if you would like to see another test (including
> commandlines, please ;)); and how should I inspect the generated code?

and as a follow up to my previous results: I just reran the test with
openssl 1.0.1c on the same hardware. The existing sha256 code in 1.0.1c
seems more efficient than 1.0.0j, so the performance gain is now between
12% and 21% - is this more in line with what you expected?

cheers,

JJK / Jan Just Keijser

Re: SHA-256 implementation improvement

Andy Polyakov-2
In reply to this post by Jan Just Keijser-2
> here are the raw 'openssl speed sha256' results with and without the
> patch; all I did was
>
>  tar xzf openssl-1.0.0j.tar.gz
>  cd openssl-1.0.0j
>  <apply patch or not>
>  ./Configure linux-elf -m32
>  make
>  cd apps
>  ./openssl speed -evp sha256 | grep ^sha
>  ./openssl speed sha256 | grep ^sha
>
> This result is on a Core2duo T9300 laptop:
>                      no patch:
> sha256-evp    15721    42178     84527    113902    127184
> sha256        26851    58249     97794    119593    127668
>
>                       patch:
> sha256-evp    18178    51411    108741    150649    169099
> sha256        34380    76627    130753    159497    171054
>               116%     122%      129%      132%      133%
>               128%     132%      134%      133%      134%

The explanation must be the fact that you use 1.0.0 as the reference. I
mean 1.0.1 has a bit faster code (yes, even the 32-bit version), which is
likely why the improvement appears higher than expected. Well, for some
reason my improvement coefficients are even lower than those reported by
Pavel, not to mention yours... A probable explanation is that in my
specific compilation the fully unrolled code causes cache contention...
But as you reported and suggested in another letter, ~20% is more in line
with what was reported [for Core 2].

> PS the reason I have not touched openssl 1.0.1 yet is because most of
> the systems I use are CentOS 5 or 6 based ; CentOS 6 comes with 1.0.0,
> hence I'm focusing mostly on that.

You contradict yourself. Indeed, if you are in the business of replacing
code in a specific OpenSSL version, you could as well have taken the code
from the development branch ;-)

Re: SHA-256 implementation improvement

Pavel Semjanov
In reply to this post by Andy Polyakov-2
>
>> And the result exactly for Lynnfield is unexpected,
>
> Don't you feel sometimes that Intel mocks you? :-) :-) :-)

:-)

>
>> see below:
>> clocks for 1.5 / 1.6 / my version:
>> Core2 1170 / 1131 / 984
>> Core i5 1250 / 1430 (!) / 1033
>
> Ouch! http://cvs.openssl.org/chngview?cn=22597.

Core i5 1.7 is 1270 clocks (a bit worse than 1.5, but 1.7 is better for
all other architectures)

> Could you retest 1.7 on your P4?

Surely, on Monday, I'll test on Northwood and Prescott.



--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Andy Polyakov-2
http://cvs.openssl.org/chngview?cn=22599
http://cvs.openssl.org/chngview?cn=22600

For reference. As for the full unroll, I've taken a different approach.
Instead of trying to accommodate an additional a-h variable in a freed
register, I keep a^b->b^c in a "rotating" pair of registers instead of on
the stack. And I've taken the instruction sequences from the folded loop.
As a result I get better performance even in cases where your code
exhibits a regression; the biggest gap is >40% on Atom. Note that the
unrolled loop is executed for inputs >=1024 bytes. If you want to
experiment, adjust the $unroll_after variable. I also avoid the unrolled
loop on P4 for reasons discussed below.
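
One way to read the "alternative Maj" and "a^b->b^c" remarks is the identity Maj(a,b,c) = ((a^b) & (b^c)) ^ b. Because the SHA-256 working variables shift by one position each round (this round's a and b become the next round's b and c), the a^b computed in one round can be reused as b^c in the next. The Python sketch below only illustrates that identity and the reuse; it is an assumption about what is meant here, not a transcription of the instruction sequences in sha256-586.pl, and all names are illustrative.

```python
def maj_classic(a, b, c):
    """Textbook SHA-256 Maj: bitwise majority of a, b, c."""
    return (a & b) ^ (a & c) ^ (b & c)


def maj_alt(a_xor_b, b_xor_c, b):
    """Alternative form: ((a^b) & (b^c)) ^ b."""
    return (a_xor_b & b_xor_c) ^ b


def rotating_demo(words):
    """Feed successive values of 'a' through a simplified state rotation
    (only a, b, c are tracked) and return (alternative, classic) Maj
    pairs for each step, carrying a^b over as the next step's b^c."""
    a, b, c = words[0], words[1], words[2]
    b_xor_c = b ^ c                  # computed once up front
    pairs = []
    for new_a in words[3:]:
        a_xor_b = a ^ b              # the only new XOR per step
        pairs.append((maj_alt(a_xor_b, b_xor_c, b), maj_classic(a, b, c)))
        b_xor_c = a_xor_b            # this a^b is the next step's b^c
        a, b, c = new_a, a, b        # variables shift by one position
    return pairs
```

The point of the alternative form is that each round needs one fresh XOR, one AND and one more XOR for Maj, provided the previous round's a^b is kept around.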

>> Could you retest 1.7 on your P4?
>
> Surely, on Monday, I'll test on Northwood and Prescott.

It appears that I was wrong about "my" P4 being the "initial" version.
It's 2.4GHz and has to be Northwood, i.e. "second wave", as well. But
there were "better" P4s released later; at least the 64-bit capable ones
ought to be of that kind. On "my" P4 I measure 30 cpb for the folded loop
and a whole 40[!] cpb for the unrolled one, while you reported an
improvement for the unrolled loop. Presumably this is how sensitive it
*can* get to larger code size. As for "better" P4s: I've found a 64-bit
capable P4 that executes the folded loop in 23.6 cpb and the unrolled one
in 19.7. Yet, as P4 is not "hot" anymore, I've chosen to opt for the
folded loop. If you want to experiment with my unrolled loop on P4,
locate "check for P4" in sha256-586.pl and comment out the following jump
instruction.

Re: SHA-256 implementation improvement

Pavel Semjanov
Andy,

here are my tests of all versions on some hardware:

                     1.5    1.6    1.7    1.8    my
P III (Coppermine) 1821 / 1850 / 1742 / 1574 / 1614
P4 (Prescott)      1544 / 1546 / 1541 / 1375 / 1450
P4 (Northwood)     2200 / 1963 / 1931 / 2483 / 1957
AMD Sempron        1537 / 1450 / 1394 / 1205 / 1305
AMD K10            1270 / 1210 / 1215 / 988  / 1057
Core 2             1170 / 1131 / 1130 / 985  / 984
i5 Lynnfield       1250 / 1426 / 1271 / 1121 / 1033
Sandy Bridge       1265 / 1225 / 1228 / 1115 / 981 (*) with shrd
Atom               2300 / 2050 / 1984 / 1700 / 2455

So, 1.8 version is quite good. It's the best for almost all old/slow
architectures, and my version is still the best for modern/powerful ones.
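
To make the "best for which CPU" claim easy to check, the table above can be loaded into a small Python structure and the fastest version (lowest clock count per block) picked per row. This is only a convenience sketch over the numbers posted in this message; the names are illustrative.

```python
VERSIONS = ["1.5", "1.6", "1.7", "1.8", "my"]

# Clocks per block, copied from the table above.
CLOCKS = {
    "P III (Coppermine)": [1821, 1850, 1742, 1574, 1614],
    "P4 (Prescott)":      [1544, 1546, 1541, 1375, 1450],
    "P4 (Northwood)":     [2200, 1963, 1931, 2483, 1957],
    "AMD Sempron":        [1537, 1450, 1394, 1205, 1305],
    "AMD K10":            [1270, 1210, 1215, 988, 1057],
    "Core 2":             [1170, 1131, 1130, 985, 984],
    "i5 Lynnfield":       [1250, 1426, 1271, 1121, 1033],
    "Sandy Bridge":       [1265, 1225, 1228, 1115, 981],  # last value: with shrd
    "Atom":               [2300, 2050, 1984, 1700, 2455],
}


def fastest_version(row):
    """Name of the version with the lowest clock count in one table row."""
    return VERSIONS[row.index(min(row))]


def summary():
    """Fastest version for every CPU in the table."""
    return {cpu: fastest_version(row) for cpu, row in CLOCKS.items()}
```

Note that some rows are decided by a single clock (Core 2: 985 vs. 984), which is well within measurement noise.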

--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Andy Polyakov-2
Interleaved are my results translated to your units, basically just
multiplied by 64 and rounded to three significant digits.

>                     1.5    1.6    1.7    1.8    my
> P III (Coppermine) 1821 / 1850 / 1742 / 1574 / 1614
                                           1540
> P4 (Prescott)      1544 / 1546 / 1541 / 1375 / 1450
                                           1510
> P4 (Northwood)     2200 / 1963 / 1931 / 2483 / 1957
                                           1920
> AMD Sempron        1537 / 1450 / 1394 / 1205 / 1305
                                           n/a
> AMD K10            1270 / 1210 / 1215 / 988  / 1057
                                           990
> Core 2             1170 / 1131 / 1130 / 985  / 984
                                           1010
> i5 Lynnfield       1250 / 1426 / 1271 / 1121 / 1033
                                           1100
> Sandy Bridge       1265 / 1225 / 1228 / 1115 / 981 (*) with shrd
                                           1010 (folded loop with shrd)
> Atom               2300 / 2050 / 1984 / 1700 / 2455
                                           1660

Results are consistent except for P4, Core 2 and Sandy Bridge.
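
The unit conversion described at the top of this message (cycles per byte times 64, rounded to three significant digits) can be sketched in Python as follows. The factor 64 is the SHA-256 block size in bytes; the function name is illustrative.

```python
SHA256_BLOCK_BYTES = 64  # SHA-256 processes input in 64-byte blocks


def cpb_to_clocks_per_block(cpb):
    """Convert cycles per byte to clocks per 64-byte block,
    rounded to three significant digits."""
    clocks = cpb * SHA256_BLOCK_BYTES
    return int(float(f"{clocks:.3g}"))
```

For example, the 30 cpb quoted earlier in the thread for the folded loop on the Northwood P4 maps to the 1920 shown in the table, and 23.6 cpb maps to 1510.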

As for P4, it's probably best to just shrug, accept whatever the result
is and forget about it. It's a bit hard to accept, but it's hardly worth
figuring out why our results vary that much.

As for Core 2. Difference is nominal and if I execute my binary with
varying stack seed(*) I can also measure 990 cycles per block. In other
words variation can be explained by environmental factors such as cache
contention.

As for Sandy Bridge. I don't know... I could observe nominal variations,
2-3%, on my machine, but nothing close to 10%, so this is odd... If you
have energy, test with varying stack seed(*)...

(*) because environment variables reside below the stack, the simplest
way to reseed the stack is 'env A=`perl -e 'print "A"x1024'` ...' and to
experiment with the number after x.
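
The same experiment can be scripted. The sketch below runs a command several times, each time with an environment variable `A` padded to a different length, which shifts where the initial stack ends up. It is a minimal Python equivalent of the `env A=...` trick above; the function names and the example padding lengths are assumptions chosen for illustration.

```python
import os
import subprocess


def run_with_padding(cmd, pad_len, var="A"):
    """Run cmd with environment variable `var` set to pad_len 'A'
    characters and return its standard output as text."""
    env = dict(os.environ)
    env[var] = "A" * pad_len
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


def sweep(cmd, pad_lengths):
    """Run cmd once per padding length; map padding length to output."""
    return {n: run_with_padding(cmd, n) for n in pad_lengths}
```

A possible invocation, using the benchmark command from this thread, would be `sweep(["./openssl", "speed", "sha256"], [0, 256, 1024, 4096])`, followed by comparing the reported throughput across the runs.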

> So, 1.8 version is quite good. It's the best for almost all old/slow
> architectures,  and my version is still the best for modern/powerful ones.

Come on, apart from your Sandy Bridge result for 1.8, it's virtually
equivalent. The nominal difference can be explained by environmental
factors, and if not, it's a really low price to pay for a >40%
improvement on Atom. Besides, it's actually the "slow" architectures that
need optimization more :-)

Cheers.

Re: SHA-256 implementation improvement

Pavel Semjanov

> As for Sandy Bridge. I don't know... I could observe nominal variations,
> 2-3%, on my machine, but nothing close to 10%, so this is odd... If you
> have energy, test with varying stack seed(*)...

It was my error, because I measured it in a special application. It
doesn't know about OPENSSL_ia32cap_P and takes the non-shrd path. The
right numbers are 1005 for the small loop and 971 for the unrolled one.
971 is the best value I've ever seen! Great work!


> Come on, apart from your Sandy Bridge result for 1.8, it's virtually
> equivalent. Nominal difference can be explained by environmental
> factors, and if not, it's really low price to pay for >40% improvement
> on Atom. Besides, it's actually "slow" architectures that need
> optimization more :-)

Now I agree ;) 1.8 version is "best-balanced" for all architectures.

--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|

Re: SHA-256 implementation improvement

Jan Just Keijser-2
Hi,

Pavel Semjanov wrote:

>
>> As for Sandy Bridge. I don't know... I could observe nominal variations,
>> 2-3%, on my machine, but nothing close to 10%, so this is odd... If you
>> have energy, test with varying stack seed(*)...
>
> It was my error, because I measured it in special application. It
> doesn't know about OPENSSL_ia32cap_P and goes on non-shrd path. The
> right numbers are 1005 for small loop and 971 for unrolled one. 971 is
> the best value I've ever seen! Great work!
>
>
>> Come on, apart from your Sandy Bridge result for 1.8, it's virtually
>> equivalent. Nominal difference can be explained by environmental
>> factors, and if not, it's really low price to pay for >40% improvement
>> on Atom. Besides, it's actually "slow" architectures that need
>> optimization more :-)
>
> Now I agree ;) 1.8 version is "best-balanced" for all architectures.
>

I'm not sure I agree: I've grabbed the 1.8 version and rebuilt openssl
1.0.1c and tested it on an i5 and a Core 2 Duo; performance is better
than the non-patched version but it is WORSE compared to the original
version of the sha256-586.pl script that was posted here before on May 11th.

version 11/05/2012:
sha256           39017.64k    87648.54k   150106.58k   183705.94k   197330.99k

version 1.8:
sha256           33560.42k    73153.83k   121472.43k   167948.67k   180955.23k

all my tests were done using 'openssl speed sha256' , I'm unsure how you
did your testing.

cheers,

JJK / Jan Just Keijser


Re: SHA-256 implementation improvement

Pavel Semjanov

> I'm not sure I agree: I've grabbed the 1.8 version and rebuilt openssl
> 1.0.1c and tested it on an i5 and a Core 2 Duo; performance is better
> than the non-patched version but it is WORSE compared to the original
> version of the sha256-586.pl script that was posted here before on May
> 11th.

That's why I said "best-balanced". It's possible to get better
performance for _given_ architecture, for example my first version is
better for i5, as you observed.

--

    SY / C4acT/\uBo             Pavel Semjanov
    _   _         _        http://www.semjanov.com
   | | |-| |_|_| |-|