BN_MUL_MONT for ARM64 v8

BN_MUL_MONT for ARM64 v8

Vijay Chander
  Is big number Montgomery multiplication as optimized as it can be for ARM64, compared to x86-64, in the latest OpenSSL from GitHub?
  We are not seeing vmull (or pmull/pmull2) instructions in armv8-mont.pl.

   On an ARM Cortex-A72 (1 GHz) and an E5-2620 (2.1 GHz) we are seeing a difference on the order of 10x in RSA signing performance for 2048-bit keys.


  Ran

          openssl speed rsa2048


Here are the openssl speed numbers.

x86-64

[root@nuosrv2 openssl]# ./apps/openssl speed rsa2048 
Doing 2048 bit private rsa's for 10s: 13134 2048 bit private RSA's in 9.97s
Doing 2048 bit public rsa's for 10s: 379019 2048 bit public RSA's in 9.98s
OpenSSL 1.1.1-dev  xx XXX xxxx
built on: reproducible build, date unspecified
options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib64/engines-1.1\""  -Wa,--noexecstack
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000759s 0.000026s   1317.4  37977.9


arm64:

[root@juno openssl]# ./apps/openssl speed rsa2048
Doing 2048 bit private rsa's for 10s: 1319 2048 bit private RSA's in 9.92s
Doing 2048 bit public rsa's for 10s: 49209 2048 bit public RSA's in 9.93s
OpenSSL 1.1.1-dev  xx XXX xxxx
built on: reproducible build, date unspecified
options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib/engines-1.1\""  -Wa,--noexecstack
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.007521s 0.000202s    133.0   4955.6



    ARM64 heavy hitters

  

    69.70%  openssl        libcrypto.so.1.1         [.] __bn_sqr8x_mont
    18.64%  openssl        libcrypto.so.1.1         [.] __bn_mul4x_mont
     4.92%  openssl        libcrypto.so.1.1         [.] MOD_EXP_CTIME_COPY_FROM_PREBUF
     1.50%  openssl        libcrypto.so.1.1         [.] bn_mul_add_words


    x86-64 heavy hitters

          

    30.93%  openssl          libcrypto.so.1.1         [.] __bn_sqrx8x_reduction
    17.65%  openssl          libcrypto.so.1.1         [.] bn_sqrx8x_internal
    12.65%  openssl          libcrypto.so.1.1         [.] mulx4x_internal
     8.91%  openssl          libcrypto.so.1.1         [.] bn_mul_add_words
     7.14%  openssl          libcrypto.so.1.1         [.] bn_mulx4x_mont
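
For reference, the `*_mont` routines that dominate both profiles implement Montgomery multiplication. Below is a minimal Python sketch of the arithmetic, assuming whole-integer operations rather than the word-by-word loops of armv8-mont.pl. The helper names (`montgomery_setup`, `mont_mul`, `to_mont`, `from_mont`) are made up for illustration and are not OpenSSL APIs.

```python
def montgomery_setup(n, word_bits=64):
    """Return (R, n_prime) for an odd modulus n, with R = 2^(words * word_bits) > n."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    words = (n.bit_length() + word_bits - 1) // word_bits
    r = 1 << (words * word_bits)
    # n * n_prime == -1 (mod R); pow(n, -1, r) needs Python 3.8+
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime


def mont_mul(a_bar, b_bar, n, r, n_prime):
    """Return a_bar * b_bar * R^-1 mod n without ever dividing by n."""
    t = a_bar * b_bar
    m = ((t % r) * n_prime) % r   # "mod R" is just a bit mask in real code
    u = (t + m * n) // r          # exact: the low bits of t + m*n are all zero
    return u - n if u >= n else u


def to_mont(x, n, r):
    """Map x into the Montgomery domain: x * R mod n."""
    return (x * r) % n


def from_mont(x_bar, n, r, n_prime):
    """Map back out of the Montgomery domain by multiplying with 1."""
    return mont_mul(x_bar, 1, n, r, n_prime)


# Tiny worked example with an illustrative modulus.
n = 101
r, n_prime = montgomery_setup(n)
a, b = 55, 77
product = from_mont(mont_mul(to_mont(a, n, r), to_mont(b, n, r), n, r, n_prime), n, r, n_prime)
print(product == (a * b) % n)
```

The point of the transformation is that the only divisions are by R, a power of two, so the inner loops reduce to multiply, add and carry operations on machine words.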


The code looks different between x86-64 and ARM64. Is that due to the ISA, or has the ARM64 code
not yet caught up with the highly tuned x86-64 code?

Basically, are we stuck with 1:5 (if we extrapolate the A72 to 2 GHz), or is there more optimal code that
we need to pick up for ARM64? I compiled OpenSSL from GitHub (latest).
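
The 1:5 figure can be checked from the sign/s numbers in the two runs above. This is a rough sketch that assumes throughput scales linearly with clock frequency, which is only an approximation; the variable and function names are illustrative.

```python
# Figures copied from the two `openssl speed rsa2048` runs above.
X86_SIGN_PER_S = 1317.4   # E5-2620 at 2.1 GHz
ARM_SIGN_PER_S = 133.0    # Cortex-A72 at 1 GHz
X86_GHZ = 2.1
ARM_GHZ = 1.0


def signs_per_gigacycle(signs_per_s, ghz):
    """Signatures per 10^9 clock cycles, assuming linear scaling with clock."""
    return signs_per_s / ghz


raw_ratio = X86_SIGN_PER_S / ARM_SIGN_PER_S
scaled_ratio = (signs_per_gigacycle(X86_SIGN_PER_S, X86_GHZ)
                / signs_per_gigacycle(ARM_SIGN_PER_S, ARM_GHZ))

print(f"raw x86:arm ratio       = {raw_ratio:.1f}")     # about 9.9
print(f"per-clock x86:arm ratio = {scaled_ratio:.1f}")  # about 4.7
```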





Any pointers will be extremely helpful.


Thanks,
-vijay

--
openssl-users mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-users

Re: BN_MUL_MONT for ARM64 v8

Andy Polyakov-2
>   Is big number Montgomery multiplication as optimized as it can be for
> ARM64 as compared to X86-64 from the latest openssl github ?
>   We are not seeing vmull ( or pmull/pmull2) instructions in
> armv8-mont.pl.
>  
>    On an ARM cortex-A72 (1GHz)  and E5-2620 (2.1 Ghz)  we are seeing an
> order of 10 difference in RSA signing perf for 2048 bit keys.

When it comes to performance, the correct question is not what the
result is in absolute terms, but how far it is from the possible maximum
for the specific processor [taking into consideration all the factors,
from ISA capabilities to the specific hardware implementation]. So implying
that a 10x difference between the processors in question is the result of
insufficient optimization for one of them is somewhat unjustified. Well, to be
completely honest, there are some minor tricks one can pull on ARMv8, but
they will only make the gap a *little* bit smaller. In other words, suck
it up, that's the way Cortex [currently?] is. If it's so critical *and*
you're in a position to choose the processor, then the Samsung Mongoose core would
be a much better choice (but I don't know anything about Qualcomm Kryo).
Yet even though it would be a better choice, it still wouldn't actually
close the gap, so don't get your hopes too high :-)

As for not seeing vector instructions: pmull[2] is about something
completely different. As for vmull, you have to recognize that it's
limited to 32-bit inputs and that there is no carry handling in vector
instructions. This means that it would take more instructions to do the same
job, even though you perform a pair of multiplications in one vector
instruction. Well, it's more complicated than just the number of
instructions, but nevertheless, the scalar 64x64 multiplication with carry
processing offered by the ARMv8 ISA does deliver a better result than 128-bit
vector instructions would.
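
To illustrate the instruction-count argument, here is a rough Python model, not actual ARM code. It compares a scalar 64x64 multiply (on AArch64, a `mul` plus `umulh` pair) with the same product assembled from 32x32 partial products, which is all that 32-bit-input vector lanes can provide. The function names are illustrative, and the model ignores scheduling and lane parallelism.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul64_scalar(a, b):
    """Model of a scalar 64x64 -> 128 multiply: low half and high half of the product."""
    p = a * b
    return p & MASK64, p >> 64


def mul64_from_32bit_inputs(a, b):
    """Same product built from 32x32 -> 64 partial products.

    Returns (lo, hi, number_of_32x32_multiplies).
    """
    a0, a1 = a & MASK32, a >> 32
    b0, b1 = b & MASK32, b >> 32
    # Four partial products instead of one full-width multiply.
    p00 = a0 * b0
    p01 = a0 * b1
    p10 = a1 * b0
    p11 = a1 * b1
    # With no carry flag available, column carries need explicit extra steps.
    mid = (p00 >> 32) + (p01 & MASK32) + (p10 & MASK32)
    lo = (p00 & MASK32) | ((mid & MASK32) << 32)
    hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32)
    return lo, hi, 4


a = 0xFFFFFFFFFFFFFFFF
b = 0xFEDCBA9876543210
lo, hi, multiplies = mul64_from_32bit_inputs(a, b)
print((lo, hi) == mul64_scalar(a, b), multiplies)
```

Each 64-bit limb product costs four narrow multiplies plus shift, mask and add steps to recombine them, which is the extra work the reply above refers to.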


Re: BN_MUL_MONT for ARM64 v8

Vijay Chander
Thanks Andy. 

The A72 is running at 1 GHz compared to the x86 at 2.1 GHz, so that should hopefully get down to ~1:5.
There is no L3 cache on the A72 eval board, and performance counters do show 9x more DRAM accesses for ARM compared to x86.

Will check out Mongoose and Kryo.

Do you know of any good hardware crypto IP (from Synopsys, for example) that can help with the asymmetric crypto part of the TLS handshake?

Thanks, 

Vijay 



Re: BN_MUL_MONT for ARM64 v8

Andy Polyakov-2
> A72 is running 1GHz compared to x86 at 2.1Ghz. So that should hopefully
> get down to ~1:5.

And Mongoose will take you to ~1:2.5 (scaled to the same frequency, that is),
which I'd say is a fair result. Well, it still could have been a bit
better, but it's not unreasonable given the ISA differences. Keep in mind
that the presented x86_64 result is for code utilizing Intel-specific
instruction extensions.

> There is no L3 cache on the A72 eval board and performance counters do
> show 9x more DRAM accesses for ARM compared to x86.

This is unexpected, because it takes *fewer* references to memory to
perform it on ARMv8, since it has a larger register bank. And the cache
requirement is not high enough for L3 to kick in... But in any case, memory
is not the bottleneck here...


Re: BN_MUL_MONT for ARM64 v8

Vijay Chander
Andy,
   1:2.5 is pretty good in my opinion for ARM!

   We  will check out Mongoose.

   Hmm - will try to get to the bottom of those cache misses (at a lower priority).

Thanks,
-vijay

   


Re: BN_MUL_MONT for ARM64 v8

Mike Mohr
Have you considered using GMP as a big integer backend for OpenSSL? It has support for several ARM variants using handwritten assembly code, and the developers go to great lengths to optimize runtime on all supported platforms.


Re: BN_MUL_MONT for ARM64 v8

Vijay Chander
Mike,
   Tried with GMP. Same result for A72.

   Thanks, 
Vijay 


Re: BN_MUL_MONT for ARM64 v8

OpenSSL - User mailing list
In reply to this post by Mike Mohr
> Have you considered using GMP as a big integer backend for OpenSSL? It has support for several ARM variants using handwritten assembly code, and the developers go to great lengths to optimize runtime on all supported platforms.

It might be interesting if we could figure out how to handle it as a dynamic library.  License issues prevent anything else.

Re: BN_MUL_MONT for ARM64 v8

Jakob Bohm-7
In reply to this post by Mike Mohr
OpenSSL also has a lot of handwritten assembly language for ARM,
x86 etc.  Most of it written by Andy Polyakov.

His response about what can and cannot be done on various ARM CPU
models is most probably a result of this work.

Also, OpenSSL has a more permissive license than GMP, so using
GMP in OpenSSL would cause problems for many applications that use
OpenSSL.



Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded

Re: BN_MUL_MONT for ARM64 v8

Vijay Chander
Yes. Already took Andy's word from his previous replies for precisely this reason. 

GMP exercise was easy enough to get it out of the way. 

Thanks, 
Vijay 


Re: BN_MUL_MONT for ARM64 v8

Mike Mohr
Licensing issues are indeed thorny. Why can't OpenSSL perform a dynamic link? The soversion should handle any ABI issues introduced in later versions of GMP.

Are you cross-compiling GMP for use on a target device? If so, you'll need to ensure that MPN_PATH is set appropriately. If you don't, you'll get the generic C code instead of the optimized assembly routines. The performance difference can be dramatic, potentially several orders of magnitude. I had to deal with this myself when cross-compiling GMP for Android.


Re: BN_MUL_MONT for ARM64 v8

Mike Mohr
In reply to this post by Jakob Bohm-7
Of course OpenSSL contains hand-optimized assembly routines.  However, GMP has been around since at least 1993 and the library specifically targets heavily optimized multiple precision arithmetic.  OpenSSL is a TLS/SSL toolkit, and necessarily focuses on implementing SSL/TLS correctly - I'd argue that the bigint subsystem is almost tangential to the other parts of any SSL library.  A less optimized bigint subsystem should be reasonably expected.  I would be surprised if the native bigint code could compete against GMP performance-wise, even when OpenSSL's optimized assembly code is used.  I haven't benchmarked OpenSSL's bigint subsystem and would be interested in seeing a comparison against a correctly configured GMP.


Re: BN_MUL_MONT for ARM64 v8

OpenSSL - User mailing list
In reply to this post by Mike Mohr
> Licensing issues are indeed thorny. Why can't openssl perform a dynamic link? The soversion should handle any ABI issues introduced in later versions of GMP.

Anything is possible; it is just  code.

I don't think this is a priority for the team.  A pull request ...

Re: BN_MUL_MONT for ARM64 v8

Michael Wojcik
In reply to this post by Mike Mohr
> From: openssl-users [mailto:[hidden email]] On Behalf Of Mike Mohr
> Sent: Tuesday, February 07, 2017 22:21

> Licensing issues are indeed thorny. Why can't openssl perform a dynamic link? The soversion should handle any ABI issues
> introduced in later versions of GMP.

Replace "thorny" with "completely unacceptable" for at least some commercial users of OpenSSL. And dynamic linking does not solve the problem, because customers would still have to get GMP. Some companies refuse to ship GPL code in any form, regardless of whether they've made modifications, and forcing customers to find and install GMP is hardly reasonable.

Providing TLS support in commercial software is already difficult enough. Let's not make it harder in the hypothetical hope of eking out a bit more performance.

Anyone who really wants GMP could implement it as an OpenSSL engine. That is, take the OpenSSL code for the algorithms you're using, copy it into an engine, and then replace the BN math operations with calls to GMP.

Michael Wojcik
Distinguished Engineer, Micro Focus




Re: BN_MUL_MONT for ARM64 v8

Matt Caswell-2


On 08/02/17 14:12, Michael Wojcik wrote:

>> From: openssl-users [mailto:[hidden email]] On
>> Behalf Of Mike Mohr Sent: Tuesday, February 07, 2017 22:21
>
>> Licensing issues are indeed thorny. Why can't openssl perform a
>> dynamic link? The soversion should handle any ABI issues introduced
>> in later versions of GMP.
>
> Replace "thorny" with "completely unacceptable" for at least some
> commercial users of  OpenSSL. And dynamic linking does not solve the
> problem, because customers would still have to get GMP. Some
> companies refuse to ship GPL code in any form, regardless of whether
> they've made modifications, and forcing customers to find and install
> GMP is hardly reasonable.
>
> Providing TLS support in commercial software is already difficult
> enough. Let's not make it harder in the hypothetical hope of eking
> out a bit more performance.
>
> Anyone who really wants GMP could implement it as an OpenSSL engine.
> That is, take the OpenSSL code for the algorithms you're using, copy
> them into an  engine, and then replace the BN math operations with
> calls to GMP.

FYI, there already *is* a GMP engine in 1.0.2. It got removed from 1.1.0
due to lack of use. It is not compiled by default. You have to use
"enable-gmp". Not tried it though.

Matt




Re: BN_MUL_MONT for ARM64 v8

Vijay Chander
In reply to this post by Mike Mohr
Mike,
   I was compiling natively on the A72 (64-bit) using libgmp version 10.2.

   Thanks,
-vijay

On Feb 7, 2017 7:21 PM, "Mike Mohr" <[hidden email]> wrote:
Licensing issues are indeed thorny. Why can't openssl perform a dynamic link? The soversion should handle any ABI issues introduced in later versions of GMP.

Are you cross-compiling GMP for use on a target device? If so, you'll need to ensure that MPN_PATH is set appropriately. If you don't, you'll get the generic C code instead of the optimized assembly routines. The performance difference can be dramatic, potentially several orders of magnitude. I had to deal with this myself when cross-compiling GMP for Android.

On Feb 7, 2017 4:51 PM, "Vijay Chander" <[hidden email]> wrote:
Yes. Already took Andy's word from his previous replies for precisely this reason. 

GMP exercise was easy enough to get it out of the way. 

Thanks, 
Vijay 

On Feb 7, 2017 4:46 PM, "Jakob Bohm" <[hidden email]> wrote:
OpenSSL also has a lot of handwritten assembly language for ARM,
x86 etc.  Most of it written by Andy Polyakov.

His response about what can and cannot be done on various ARM CPU
models is most probably a result of this work.

Also, OpenSSL has a more permissive license than GMP, so using
GMP in OpenSSL would cause problems for many applications that
use OpenSSL.

On 08/02/2017 00:31, Mike Mohr wrote:
Have you considered using GMP as a big integer backend for openssl?  It
has support for several ARM variants using handwritten assembly code
and the developers go to great lengths to optimize runtime on all
supported platforms.

On Feb 7, 2017 2:26 PM, "Vijay Chander" <[hidden email]> wrote:

    Andy,
       1:2.5 is pretty good in my opinion for ARM!

       We  will check out Mongoose.

       Hmm - will try to get to the bottom of those cache misses (at a
    lower priority).

    Thanks,
    -vijay


    On Tue, Feb 7, 2017 at 11:07 AM, Andy Polyakov <[hidden email]> wrote:

        > A72 is running 1GHz compared to x86 at 2.1GHz. So that should hopefully
        > get down to ~1:5.

        And Mongoose will take you to ~1:2.5 (scaled to same frequency that is).
        Which I'd say is a fair result. Well, still could have been a bit
        better, but it's not unreasonable given ISA differences. Keep in mind
        that presented x86_64 result is for code utilizing Intel-specific code
        extensions.

        > There is no L3 cache on the A72 eval board and performance counters do
        > show 9x more DRAM accesses for ARM compared to x86.

        This is unexpected, because it takes *less* references to memory to
        perform it on ARMv8. Because it has larger register bank. And cache
        requirement is not that high for L3 to kick in... But at any case memory
        is not bottleneck here...
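
The frequency scaling discussed above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it takes the `openssl speed rsa2048` sign counts posted at the start of this thread and the clock speeds quoted there (1 GHz for the A72, 2.1 GHz for the E5-2620), and assumes throughput scales linearly with clock frequency, which ignores turbo modes and memory effects. The function and variable names are made up for this example.

```python
# Figures copied from the `openssl speed rsa2048` output earlier in this thread.
X86_SIGNS, X86_SECONDS, X86_GHZ = 13134, 9.97, 2.1   # E5-2620
ARM_SIGNS, ARM_SECONDS, ARM_GHZ = 1319, 9.92, 1.0    # Cortex-A72


def signs_per_second(count, seconds):
    """Raw RSA-2048 signing throughput."""
    return count / seconds


def per_ghz(rate, ghz):
    """Throughput per GHz: a crude per-clock normalization (assumes linear scaling)."""
    return rate / ghz


x86_rate = signs_per_second(X86_SIGNS, X86_SECONDS)
arm_rate = signs_per_second(ARM_SIGNS, ARM_SECONDS)

# How many times faster x86 is, before and after scaling to the same frequency.
raw_ratio = x86_rate / arm_rate
scaled_ratio = per_ghz(x86_rate, X86_GHZ) / per_ghz(arm_rate, ARM_GHZ)

print(f"raw ARM:x86 = 1:{raw_ratio:.1f}")
print(f"frequency-scaled ARM:x86 = 1:{scaled_ratio:.1f}")
```

Under these assumptions the raw gap is roughly 1:10 and the frequency-scaled gap is a little under 1:5, which matches the figures quoted in the exchange.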



--
Jakob Bohm, CIO, partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Soborg, Denmark. direct: +45 31 13 16 10
This message is only for its intended recipient, delete if misaddressed.
WiseMo - Remote Service Management for PCs, Phones and Embedded


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded