[openssl.org #3607] nistz256 is broken.


[openssl.org #3607] nistz256 is broken.

Rich Salz via RT
(Affects 1.0.2 only.)

In crypto/ec/asm/ecp_nistz256-x86_64.pl, __ecp_nistz256_sqr_montq,
under "Now the reduction" there are a number of comments saying
"doesn't overflow". Unfortunately, they aren't correct.

Let f be a field element with value
52998265219372519138277318009572834528257482223861497652862868020346603903843.

In Montgomery form, it's represented in memory as f*2^256 mod p, which
is 58733536287848456860684025065811053850702581988990452502702607007944524443511.

When passed to ecp_nistz256_sqr_mont, this results in the intermediate
value (before any reduction)
0x41dd6e8bcf7e19f499c19d0f5f3bba78272201eee64c6a44ca8a4ff275b53fa93b41d5b7035af3effffffff40a05dc36f424ab9438cdec4fa193faebf6ce951.

r10 in this case contains 0xffffffff40a05dc3, and the high-word output
of the multiplication after "# First iteration" is 0xfa193fad. The
addition of r8 and r9 overflows into it, leaving it as 0xfa193fae. The
addition of rax and r9 also sets the carry flag, so the final
add-with-carry of rdx into r10 easily overflows and leaves r10 as
0x3ab99d72.
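
For reference, the numbers above can be reproduced with a few lines of
Python (an illustrative sketch, Python 3; not part of the original
report):

    p  = 2**256 - 2**224 + 2**192 + 2**96 - 1   # the P-256 prime
    f  = 52998265219372519138277318009572834528257482223861497652862868020346603903843
    fm = f * 2**256 % p        # Montgomery representation, f*2^256 mod p
    print(fm)                  # the second decimal value quoted above
    sq = fm * fm               # full 512-bit square before any reduction
    print(hex(sq))             # the intermediate value quoted above
    for i in range(8):         # the eight 64-bit words, least significant first
        print(i, hex((sq >> (64 * i)) & (2**64 - 1)))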

Additionally, I'm not sure about any of the other cases in the same
function that have been annotated the same way. There is also a
similar annotation in ecp_nistz256_mul_mont that I've not
investigated.


Cheers

AGL


Re: [openssl.org #3607] nistz256 is broken.

Billy Brumley
1. Where's the security analysis? Does https://eprint.iacr.org/2011/633 apply?

2. When will RT2574 be integrated to protect our ECC keys in the
inevitable presence of software defects like this?
http://rt.openssl.org/Ticket/Display.html?id=2574&user=guest&pass=guest

These questions are not necessarily for Adam, but the OpenSSL team.

BBB

On Sun, Nov 23, 2014 at 8:09 PM, Adam Langley via RT <[hidden email]> wrote:

> (Affects 1.0.2 only.)
>
> In crypto/ec/asm/ecp_nistz256-x86_64.pl, __ecp_nistz256_sqr_montq,
> under "Now the reduction" there are a number of comments saying
> "doesn't overflow". Unfortunately, they aren't correct.
>
> Let f be a field element with value
> 52998265219372519138277318009572834528257482223861497652862868020346603903843.
>
> In Montgomery form, it's represented in memory as f*2^256 mod p, which
> is 58733536287848456860684025065811053850702581988990452502702607007944524443511.
>
> When passed to ecp_nistz256_sqr_mont, this results in the intermediate
> value (before any reduction)
> 0x41dd6e8bcf7e19f499c19d0f5f3bba78272201eee64c6a44ca8a4ff275b53fa93b41d5b7035af3effffffff40a05dc36f424ab9438cdec4fa193faebf6ce951.
>
> r10 in this case contains 0xffffffff40a05dc3 and the high-word output
> of the multiplication after "# First iteration" is 0xfa193fad. The
> addition of r8 and r9 overflows into it leaving it as 0xfa193fae. The
> addition of rax and r9 also sets the carry flag thus the final
> add-with-carry of rdx into r10 easily overflows and leaves r10 as
> 0x3ab99d72.
>
> Additionally, I'm not sure about any of the other cases in the same
> function that have been annotated the same way. There is also a
> similar annotation in ecp_nistz256_mul_mont that I've not
> investigated.
>
>
> Cheers
>
> AGL
>

RE: [openssl.org #3607] nistz256 is broken.

Salz, Rich
> 2. When will RT2574 be integrated to protect our ECC keys in the inevitable
> presence of software defects like this?
> http://rt.openssl.org/Ticket/Display.html?id=2574&user=guest&pass=guest

Timing attacks on ECC aren't a very high priority right now, given all the other bigger, easier-to-exploit issues with wider deployment :(

I wish it weren't so, but I do want to set your expectations properly.

(Now, of course, having said that, the constant-time folks will swoop in and submit this to master next week :)

Re: [openssl.org #3607] nistz256 is broken.

Billy Brumley
Thanks for the reply, Rich.

>> 2. When will RT2574 be integrated to protect our ECC keys in the inevitable
>> presence of software defects like this?
>> http://rt.openssl.org/Ticket/Display.html?id=2574&user=guest&pass=guest
>
> Timing attacks on ECC isn't a very high priority right now, given all the other bigger easier to exploit issues with wider deployment :(

The way I see it, the main purpose of that patch is to prevent bug
attacks, not timing attacks. Hence the relation to the arithmetic bug.
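
As a generic illustration only (not necessarily what RT2574 proposes):
one cheap defence against arithmetic bugs of this kind is to check that
computed points actually satisfy the curve equation before they are
used, so a broken field routine fails closed rather than leaking key
material. A minimal sketch in Python:

    # NIST P-256: y^2 = x^3 - 3x + b (mod p)
    p = 2**256 - 2**224 + 2**192 + 2**96 - 1
    b = 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B

    def on_curve(x, y):
        # True iff the affine point (x, y) lies on P-256
        return (y * y - (x * x * x - 3 * x + b)) % p == 0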

BBB

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
In reply to this post by Rich Salz via RT
> (Affects 1.0.2 only.)
>
> In crypto/ec/asm/ecp_nistz256-x86_64.pl, __ecp_nistz256_sqr_montq,
> under "Now the reduction" there are a number of comments saying
> "doesn't overflow". Unfortunately, they aren't correct.

Got the math wrong :-( Attached is not only a fixed version, but an even
faster one. On a related note, it's possible to improve server-side DSA
by ~5% by switching [back] to scatter-gather. [The change away from
scatter-gather was prompted by concern about a timing dependency, but I
argue that concern is not valid in most cases.] There are also x86 and
ARM versions pending:

#               with/without -DECP_NISTZ256_ASM
# Pentium       +66-168%
# PIII          +73-175%
# P4            +68-140%
# Core2         +90-215%
# Sandy Bridge  +105-265% (contemporary i[57]-* are all close to this)
# Atom          +66-160%
# Opteron       +54-112%
# Bulldozer     +99-240%
# VIA Nano      +93-300%

#                       with/without -DECP_NISTZ256_ASM
# Cortex-A8             +53-173%
# Cortex-A9             +76-205%
# Cortex-A15            +100-316%
# Snapdragon S4         +66-187%

No, the bug in question is not there. Nor is the AD*X code path affected.
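
For what it's worth, the rewritten reduction step below relies on the
low half of the prime being p256[0..1] = 2^96 - 1 with p256[2] = 0, so
that a*p256 + a == (a << 96) + ((a * p256[3]) << 192) for any 64-bit
word a; that is what the shl/shr-by-32 plus a single mulq compute. A
minimal check of that identity (illustrative Python, not part of the
patch):

    import random

    p  = 2**256 - 2**224 + 2**192 + 2**96 - 1   # the P-256 prime
    p3 = p >> 192                               # top word, 0xffffffff00000001

    for _ in range(1000):
        a = random.getrandbits(64)
        # a*p + a folds into a<<96 plus (a*p3)<<192
        assert a * p + a == (a << 96) + ((a * p3) << 192)
    print("identity holds")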


diff --git a/crypto/ec/asm/ecp_nistz256-x86_64.pl b/crypto/ec/asm/ecp_nistz256-x86_64.pl
index 4486a5e..f328b85 100755
--- a/crypto/ec/asm/ecp_nistz256-x86_64.pl
+++ b/crypto/ec/asm/ecp_nistz256-x86_64.pl
@@ -31,15 +31,15 @@
 # Further optimization by <[hidden email]>:
 #
 # this/original
-# Opteron +8-33%
-# Bulldozer +10-30%
-# P4 +14-38%
-# Westmere +8-23%
-# Sandy Bridge +8-24%
-# Ivy Bridge +7-25%
-# Haswell +5-25%
-# Atom +10-32%
-# VIA Nano +37-130%
+# Opteron +10-43%
+# Bulldozer +14-43%
+# P4 +18-50%
+# Westmere +12-36%
+# Sandy Bridge +9-36%
+# Ivy Bridge +9-36%
+# Haswell +8-37%
+# Atom +15-50%
+# VIA Nano +43-160%
 #
 # Ranges denote minimum and maximum improvement coefficients depending
 # on benchmark. Lower coefficients are for ECDSA sign, relatively
@@ -550,28 +550,20 @@ __ecp_nistz256_mul_montq:
  # and add the result to the acc.
  # Due to the special form of p256 we do some optimizations
  #
- # acc[0] x p256[0] = acc[0] x 2^64 - acc[0]
- # then we add acc[0] and get acc[0] x 2^64
-
- mulq $poly1
- xor $t0, $t0
- add $acc0, $acc1 # +=acc[0]*2^64
- adc \$0, %rdx
- add %rax, $acc1
- mov $acc0, %rax
-
- # acc[0] x p256[2] = 0
- adc %rdx, $acc2
- adc \$0, $t0
+ # acc[0] x p256[0..1] = acc[0] x 2^96 - acc[0]
+ # then we add acc[0] and get acc[0] x 2^96
 
+ mov $acc0, $t1
+ shl \$32, $acc0
  mulq $poly3
- xor $acc0, $acc0
- add $t0, $acc3
- adc \$0, %rdx
- add %rax, $acc3
+ shr \$32, $t1
+ add $acc0, $acc1 # +=acc[0]<<96
+ adc $t1, $acc2
+ adc %rax, $acc3
  mov 8*1($b_ptr), %rax
  adc %rdx, $acc4
  adc \$0, $acc5
+ xor $acc0, $acc0
 
  ########################################################################
  # Multiply by b[1]
@@ -608,23 +600,17 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Second reduction step
- mulq $poly1
- xor $t0, $t0
- add $acc1, $acc2
- adc \$0, %rdx
- add %rax, $acc2
- mov $acc1, %rax
- adc %rdx, $acc3
- adc \$0, $t0
-
+ mov $acc1, $t1
+ shl \$32, $acc1
  mulq $poly3
- xor $acc1, $acc1
- add $t0, $acc4
- adc \$0, %rdx
- add %rax, $acc4
+ shr \$32, $t1
+ add $acc1, $acc2
+ adc $t1, $acc3
+ adc %rax, $acc4
  mov 8*2($b_ptr), %rax
  adc %rdx, $acc5
  adc \$0, $acc0
+ xor $acc1, $acc1
 
  ########################################################################
  # Multiply by b[2]
@@ -661,23 +647,17 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Third reduction step
- mulq $poly1
- xor $t0, $t0
- add $acc2, $acc3
- adc \$0, %rdx
- add %rax, $acc3
- mov $acc2, %rax
- adc %rdx, $acc4
- adc \$0, $t0
-
+ mov $acc2, $t1
+ shl \$32, $acc2
  mulq $poly3
- xor $acc2, $acc2
- add $t0, $acc5
- adc \$0, %rdx
- add %rax, $acc5
+ shr \$32, $t1
+ add $acc2, $acc3
+ adc $t1, $acc4
+ adc %rax, $acc5
  mov 8*3($b_ptr), %rax
  adc %rdx, $acc0
  adc \$0, $acc1
+ xor $acc2, $acc2
 
  ########################################################################
  # Multiply by b[3]
@@ -714,20 +694,14 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Final reduction step
- mulq $poly1
- #xor $t0, $t0
- add $acc3, $acc4
- adc \$0, %rdx
- add %rax, $acc4
- mov $acc3, %rax
- adc %rdx, $acc5
- #adc \$0, $t0 # doesn't overflow
-
+ mov $acc3, $t1
+ shl \$32, $acc3
  mulq $poly3
- #add $t0, $acc0
- #adc \$0, %rdx
+ shr \$32, $t1
+ add $acc3, $acc4
+ adc $t1, $acc5
  mov $acc4, $t0
- add %rax, $acc0
+ adc %rax, $acc0
  adc %rdx, $acc1
  mov $acc5, $t1
  adc \$0, $acc2
@@ -897,82 +871,55 @@ __ecp_nistz256_sqr_montq:
  ##########################################
  # Now the reduction
  # First iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc0, $acc1
- adc \$0, %rdx
- add %rax, $acc1
- mov $acc0, %rax
- adc %rdx, $acc2 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc0, $t0
+ shl \$32, $acc0
  mulq $t1
- xor $acc0, $acc0
- #add $t0, $acc3
- #adc \$0, %rdx
- add %rax, $acc3
+ shr \$32, $t0
+ add $acc0, $acc1 # +=acc[0]<<96
+ adc $t0, $acc2
+ adc %rax, $acc3
  mov $acc1, %rax
  adc %rdx, $acc4
- adc \$0, $acc0
+ adc \$0, $acc5
+ xor $acc0, $acc0
 
  ##########################################
  # Second iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc1, $acc2
- adc \$0, %rdx
- add %rax, $acc2
- mov $acc1, %rax
- adc %rdx, $acc3 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc1, $t0
+ shl \$32, $acc1
  mulq $t1
- xor $acc1, $acc1
- #add $t0, $acc4
- #adc \$0, %rdx
- add %rax, $acc4
+ shr \$32, $t0
+ add $acc1, $acc2
+ adc $t0, $acc3
+ adc %rax, $acc4
  mov $acc2, %rax
  adc %rdx, $acc0
- adc \$0, $acc1
+ xor $acc1, $acc1
 
  ##########################################
  # Third iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc2, $acc3
- adc \$0, %rdx
- add %rax, $acc3
- mov $acc2, %rax
- adc %rdx, $acc4 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc2, $t0
+ shl \$32, $acc2
  mulq $t1
- xor $acc2, $acc2
- #add $t0, $acc0
- #adc \$0, %rdx
- add %rax, $acc0
+ shr \$32, $t0
+ add $acc2, $acc3
+ adc $t0, $acc4
+ adc %rax, $acc0
  mov $acc3, %rax
  adc %rdx, $acc1
- adc \$0, $acc2
+ xor $acc2, $acc2
 
  ###########################################
  # Last iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc3, $acc4
- adc \$0, %rdx
- add %rax, $acc4
- mov $acc3, %rax
- adc %rdx, $acc0 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc3, $t0
+ shl \$32, $acc3
  mulq $t1
- xor $acc3, $acc3
- #add $t0, $acc1
- #adc \$0, %rdx
- add %rax, $acc1
+ shr \$32, $t0
+ add $acc3, $acc4
+ adc $t0, $acc0
+ adc %rax, $acc1
  adc %rdx, $acc2
- adc \$0, $acc3
+ xor $acc3, $acc3
 
  ############################################
  # Add the rest of the acc

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>> (Affects 1.0.2 only.)
>>
>> In crypto/ec/asm/ecp_nistz256-x86_64.pl, __ecp_nistz256_sqr_montq,
>> under "Now the reduction" there are a number of comments saying
>> "doesn't overflow". Unfortunately, they aren't correct.
>
> Got math wrong:-( Attached is not only fixed version, but even faster
> one.

Please test the attached one instead. The squaring didn't cover one case,
and the AD*X path is optimized.

> On related note. It's possible to improve server-side DSA by ~5% by
> switching [back] to scatter-gather. [Change from scatter-gather was
> caused by concern about timing dependency, but I argue that concern is
> not valid in most cases.] There also are x86 and and ARM versions pending:
>
> #               with/without -DECP_NISTZ256_ASM
> # Pentium       +66-168%
> # PIII          +73-175%
> # P4            +68-140%
> # Core2         +90-215%
> # Sandy Bridge  +105-265% (contemporary i[57]-* are all close to this)
> # Atom          +66-160%
> # Opteron       +54-112%
> # Bulldozer     +99-240%
> # VIA Nano      +93-300%
>
> #                       with/without -DECP_NISTZ256_ASM
> # Cortex-A8             +53-173%
> # Cortex-A9             +76-205%
> # Cortex-A15            +100-316%
> # Snapdragon S4         +66-187%
>
> No, bug in question is not there. Nor is AD*X code path is affected.
>
>


diff --git a/crypto/ec/asm/ecp_nistz256-x86_64.pl b/crypto/ec/asm/ecp_nistz256-x86_64.pl
index 4486a5e..56f6c2b 100755
--- a/crypto/ec/asm/ecp_nistz256-x86_64.pl
+++ b/crypto/ec/asm/ecp_nistz256-x86_64.pl
@@ -31,15 +31,15 @@
 # Further optimization by <[hidden email]>:
 #
 # this/original
-# Opteron +8-33%
-# Bulldozer +10-30%
-# P4 +14-38%
-# Westmere +8-23%
-# Sandy Bridge +8-24%
-# Ivy Bridge +7-25%
-# Haswell +5-25%
-# Atom +10-32%
-# VIA Nano +37-130%
+# Opteron +10-43%
+# Bulldozer +14-43%
+# P4 +18-50%
+# Westmere +12-36%
+# Sandy Bridge +9-36%
+# Ivy Bridge +9-36%
+# Haswell +8-37%
+# Atom +15-50%
+# VIA Nano +43-160%
 #
 # Ranges denote minimum and maximum improvement coefficients depending
 # on benchmark. Lower coefficients are for ECDSA sign, relatively
@@ -550,28 +550,20 @@ __ecp_nistz256_mul_montq:
  # and add the result to the acc.
  # Due to the special form of p256 we do some optimizations
  #
- # acc[0] x p256[0] = acc[0] x 2^64 - acc[0]
- # then we add acc[0] and get acc[0] x 2^64
-
- mulq $poly1
- xor $t0, $t0
- add $acc0, $acc1 # +=acc[0]*2^64
- adc \$0, %rdx
- add %rax, $acc1
- mov $acc0, %rax
-
- # acc[0] x p256[2] = 0
- adc %rdx, $acc2
- adc \$0, $t0
+ # acc[0] x p256[0..1] = acc[0] x 2^96 - acc[0]
+ # then we add acc[0] and get acc[0] x 2^96
 
+ mov $acc0, $t1
+ shl \$32, $acc0
  mulq $poly3
- xor $acc0, $acc0
- add $t0, $acc3
- adc \$0, %rdx
- add %rax, $acc3
+ shr \$32, $t1
+ add $acc0, $acc1 # +=acc[0]<<96
+ adc $t1, $acc2
+ adc %rax, $acc3
  mov 8*1($b_ptr), %rax
  adc %rdx, $acc4
  adc \$0, $acc5
+ xor $acc0, $acc0
 
  ########################################################################
  # Multiply by b[1]
@@ -608,23 +600,17 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Second reduction step
- mulq $poly1
- xor $t0, $t0
- add $acc1, $acc2
- adc \$0, %rdx
- add %rax, $acc2
- mov $acc1, %rax
- adc %rdx, $acc3
- adc \$0, $t0
-
+ mov $acc1, $t1
+ shl \$32, $acc1
  mulq $poly3
- xor $acc1, $acc1
- add $t0, $acc4
- adc \$0, %rdx
- add %rax, $acc4
+ shr \$32, $t1
+ add $acc1, $acc2
+ adc $t1, $acc3
+ adc %rax, $acc4
  mov 8*2($b_ptr), %rax
  adc %rdx, $acc5
  adc \$0, $acc0
+ xor $acc1, $acc1
 
  ########################################################################
  # Multiply by b[2]
@@ -661,23 +647,17 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Third reduction step
- mulq $poly1
- xor $t0, $t0
- add $acc2, $acc3
- adc \$0, %rdx
- add %rax, $acc3
- mov $acc2, %rax
- adc %rdx, $acc4
- adc \$0, $t0
-
+ mov $acc2, $t1
+ shl \$32, $acc2
  mulq $poly3
- xor $acc2, $acc2
- add $t0, $acc5
- adc \$0, %rdx
- add %rax, $acc5
+ shr \$32, $t1
+ add $acc2, $acc3
+ adc $t1, $acc4
+ adc %rax, $acc5
  mov 8*3($b_ptr), %rax
  adc %rdx, $acc0
  adc \$0, $acc1
+ xor $acc2, $acc2
 
  ########################################################################
  # Multiply by b[3]
@@ -714,20 +694,14 @@ __ecp_nistz256_mul_montq:
 
  ########################################################################
  # Final reduction step
- mulq $poly1
- #xor $t0, $t0
- add $acc3, $acc4
- adc \$0, %rdx
- add %rax, $acc4
- mov $acc3, %rax
- adc %rdx, $acc5
- #adc \$0, $t0 # doesn't overflow
-
+ mov $acc3, $t1
+ shl \$32, $acc3
  mulq $poly3
- #add $t0, $acc0
- #adc \$0, %rdx
+ shr \$32, $t1
+ add $acc3, $acc4
+ adc $t1, $acc5
  mov $acc4, $t0
- add %rax, $acc0
+ adc %rax, $acc0
  adc %rdx, $acc1
  mov $acc5, $t1
  adc \$0, $acc2
@@ -897,89 +871,62 @@ __ecp_nistz256_sqr_montq:
  ##########################################
  # Now the reduction
  # First iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc0, $acc1
- adc \$0, %rdx
- add %rax, $acc1
- mov $acc0, %rax
- adc %rdx, $acc2 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc0, $t0
+ shl \$32, $acc0
  mulq $t1
- xor $acc0, $acc0
- #add $t0, $acc3
- #adc \$0, %rdx
- add %rax, $acc3
+ shr \$32, $t0
+ add $acc0, $acc1 # +=acc[0]<<96
+ adc $t0, $acc2
+ adc %rax, $acc3
  mov $acc1, %rax
- adc %rdx, $acc4
- adc \$0, $acc0
+ adc \$0, %rdx
 
  ##########################################
  # Second iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc1, $acc2
- adc \$0, %rdx
- add %rax, $acc2
- mov $acc1, %rax
- adc %rdx, $acc3 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc1, $t0
+ shl \$32, $acc1
+ mov %rdx, $acc0
  mulq $t1
- xor $acc1, $acc1
- #add $t0, $acc4
- #adc \$0, %rdx
- add %rax, $acc4
+ shr \$32, $t0
+ add $acc1, $acc2
+ adc $t0, $acc3
+ adc %rax, $acc0
  mov $acc2, %rax
- adc %rdx, $acc0
- adc \$0, $acc1
+ adc \$0, %rdx
 
  ##########################################
  # Third iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc2, $acc3
- adc \$0, %rdx
- add %rax, $acc3
- mov $acc2, %rax
- adc %rdx, $acc4 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc2, $t0
+ shl \$32, $acc2
+ mov %rdx, $acc1
  mulq $t1
- xor $acc2, $acc2
- #add $t0, $acc0
- #adc \$0, %rdx
- add %rax, $acc0
+ shr \$32, $t0
+ add $acc2, $acc3
+ adc $t0, $acc0
+ adc %rax, $acc1
  mov $acc3, %rax
- adc %rdx, $acc1
- adc \$0, $acc2
+ adc \$0, %rdx
 
  ###########################################
  # Last iteration
- mulq $a_ptr
- #xor $t0, $t0
- add $acc3, $acc4
- adc \$0, %rdx
- add %rax, $acc4
- mov $acc3, %rax
- adc %rdx, $acc0 # doesn't overflow
- #adc \$0, $t0
-
+ mov $acc3, $t0
+ shl \$32, $acc3
+ mov %rdx, $acc2
  mulq $t1
+ shr \$32, $t0
+ add $acc3, $acc0
+ adc $t0, $acc1
+ adc %rax, $acc2
+ adc \$0, %rdx
  xor $acc3, $acc3
- #add $t0, $acc1
- #adc \$0, %rdx
- add %rax, $acc1
- adc %rdx, $acc2
- adc \$0, $acc3
 
  ############################################
  # Add the rest of the acc
- add $acc0, $acc5
+ add $acc0, $acc4
+ adc $acc1, $acc5
  mov $acc4, $acc0
- adc $acc1, $acc6
- adc $acc2, $acc7
+ adc $acc2, $acc6
+ adc %rdx, $acc7
  mov $acc5, $acc1
  adc \$0, $acc3
 
@@ -1028,18 +975,15 @@ __ecp_nistz256_mul_montx:
 
  ########################################################################
  # First reduction step
- xor $acc0, $acc0 # $acc0=0,cf=0,of=0
- adox $t1, $acc1
- adox $t0, $acc2
+ add $t1, $acc1
+ adc $t0, $acc2
 
  mulx $poly3, $t0, $t1
  mov 8*1($b_ptr), %rdx
- adox $t0, $acc3
- adcx $t1, $acc4
-
- adox $acc0, $acc4
- adcx $acc0, $acc5 # cf=0
- adox $acc0, $acc5 # of=0
+ adc $t0, $acc3
+ adc $t1, $acc4
+ adc \$0, $acc5
+ xor $acc0, $acc0 # $acc0=0,cf=0,of=0
 
  ########################################################################
  # Multiply by b[1]
@@ -1068,18 +1012,15 @@ __ecp_nistz256_mul_montx:
 
  ########################################################################
  # Second reduction step
- xor $acc1 ,$acc1 # $acc1=0,cf=0,of=0
- adox $t0, $acc2
- adox $t1, $acc3
+ add $t0, $acc2
+ adc $t1, $acc3
 
  mulx $poly3, $t0, $t1
  mov 8*2($b_ptr), %rdx
- adox $t0, $acc4
- adcx $t1, $acc5
-
- adox $acc1, $acc5
- adcx $acc1, $acc0 # cf=0
- adox $acc1, $acc0 # of=0
+ adc $t0, $acc4
+ adc $t1, $acc5
+ adc \$0, $acc0
+ xor $acc1 ,$acc1 # $acc1=0,cf=0,of=0
 
  ########################################################################
  # Multiply by b[2]
@@ -1108,18 +1049,15 @@ __ecp_nistz256_mul_montx:
 
  ########################################################################
  # Third reduction step
- xor $acc2, $acc2 # $acc2=0,cf=0,of=0
- adox $t0, $acc3
- adox $t1, $acc4
+ add $t0, $acc3
+ adc $t1, $acc4
 
  mulx $poly3, $t0, $t1
  mov 8*3($b_ptr), %rdx
- adox $t0, $acc5
- adcx $t1, $acc0
-
- adox $acc2, $acc0
- adcx $acc2, $acc1 # cf=0
- adox $acc2, $acc1 # of=0
+ adc $t0, $acc5
+ adc $t1, $acc0
+ adc \$0, $acc1
+ xor $acc2, $acc2 # $acc2=0,cf=0,of=0
 
  ########################################################################
  # Multiply by b[3]
@@ -1148,25 +1086,21 @@ __ecp_nistz256_mul_montx:
 
  ########################################################################
  # Fourth reduction step
- xor $acc3, $acc3 # $acc3=0,cf=0,of=0
- adox $t0, $acc4
- adox $t1, $acc5
+ add $t0, $acc4
+ adc $t1, $acc5
 
  mulx $poly3, $t0, $t1
  mov $acc4, $t2
  mov .Lpoly+8*1(%rip), $poly1
- adcx $t0, $acc0
- adox $t1, $acc1
+ adc $t0, $acc0
  mov $acc5, $t3
-
- adcx $acc3, $acc1
- adox $acc3, $acc2
+ adc $t1, $acc1
  adc \$0, $acc2
- mov $acc0, $t0
 
  ########################################################################
  # Branch-less conditional subtraction of P
  xor %eax, %eax
+ mov $acc0, $t0
  sbb \$-1, $acc4 # .Lpoly[0]
  sbb $poly1, $acc5 # .Lpoly[1]
  sbb \$0, $acc0 # .Lpoly[2]
@@ -1247,52 +1181,44 @@ __ecp_nistz256_sqr_montx:
  mov .Lpoly+8*3(%rip), $t1
 
  # reduction step 1
- xor $acc0, $acc0
- adcx $t0, $acc1
- adcx $t4, $acc2
+ add $t0, $acc1
+ adc $t4, $acc2
 
- mulx $t1, $t0, $t4
+ mulx $t1, $t0, $acc0
  mov $acc1, %rdx
- adcx $t0, $acc3
+ adc $t0, $acc3
  shlx $a_ptr, $acc1, $t0
- adox $t4, $acc0
- shrx $a_ptr, $acc1, $t4
  adc \$0, $acc0
+ shrx $a_ptr, $acc1, $t4
 
  # reduction step 2
- xor $acc1, $acc1
- adcx $t0, $acc2
- adcx $t4, $acc3
+ add $t0, $acc2
+ adc $t4, $acc3
 
- mulx $t1, $t0, $t4
+ mulx $t1, $t0, $acc1
  mov $acc2, %rdx
- adcx $t0, $acc0
+ adc $t0, $acc0
  shlx $a_ptr, $acc2, $t0
- adox $t4, $acc1
- shrx $a_ptr, $acc2, $t4
  adc \$0, $acc1
+ shrx $a_ptr, $acc2, $t4
 
  # reduction step 3
- xor $acc2, $acc2
- adcx $t0, $acc3
- adcx $t4, $acc0
+ add $t0, $acc3
+ adc $t4, $acc0
 
- mulx $t1, $t0, $t4
+ mulx $t1, $t0, $acc2
  mov $acc3, %rdx
- adcx $t0, $acc1
+ adc $t0, $acc1
  shlx $a_ptr, $acc3, $t0
- adox $t4, $acc2
- shrx $a_ptr, $acc3, $t4
  adc \$0, $acc2
+ shrx $a_ptr, $acc3, $t4
 
  # reduction step 4
- xor $acc3, $acc3
- adcx $t0, $acc0
- adcx $t4, $acc1
+ add $t0, $acc0
+ adc $t4, $acc1
 
- mulx $t1, $t0, $t4
- adcx $t0, $acc2
- adox $t4, $acc3
+ mulx $t1, $t0, $acc3
+ adc $t0, $acc2
  adc \$0, $acc3
 
  xor $t3, $t3 # cf=0

Re: [openssl.org #3607] nistz256 is broken.

Andy Polyakov-2
In reply to this post by Billy Brumley
> 1. Where's the security analysis? Does https://eprint.iacr.org/2011/633 apply?

If the question is whether the referenced paper applies literally in this
case, then the answer is no: the algorithm is different. If the question
is whether the spirit of the paper applies, then the answer is that there
is no reason to believe it would be impossible to mount a similar attack.
Fortunately the code is not released yet.

> 2. When will RT2574 be integrated to protect our ECC keys in the
> inevitable presence of software defects like this?
> http://rt.openssl.org/Ticket/Display.html?id=2574&user=guest&pass=guest

It will be looked into. [It has been "starred" in my mailbox.] The
problem, of course, is that it takes an effort to understand and
evaluate, and it keeps falling to low priority because it protects
against something one doesn't believe exists, in the sense that no
programmer believes there are bugs, because of programmers' human
nature. This is not to devalue the suggestion; on the contrary, it's
great, thanks. It's just an apology for why it's taking so long. Thanks
again for the report and the reminder.

Re: [openssl.org #3607] nistz256 is broken.

Adam Langley-3
In reply to this post by Rich Salz via RT
On Mon, Dec 1, 2014 at 3:23 PM, Andy Polyakov via RT <[hidden email]> wrote:

>>> (Affects 1.0.2 only.)
>>>
>>> In crypto/ec/asm/ecp_nistz256-x86_64.pl, __ecp_nistz256_sqr_montq,
>>> under "Now the reduction" there are a number of comments saying
>>> "doesn't overflow". Unfortunately, they aren't correct.
>>
>> Got math wrong:-( Attached is not only fixed version, but even faster
>> one.
>
> Please test attached one instead. Squaring didn't cover one case, and
> AD*X is optimized.

Thanks! I was away last week and so didn't have a chance to try fixing this.

I'll patch that in and run the tests against it.


Cheers

AGL

>
>> On related note. It's possible to improve server-side DSA by ~5% by
>> switching [back] to scatter-gather. [Change from scatter-gather was
>> caused by concern about timing dependency, but I argue that concern is
>> not valid in most cases.] There also are x86 and and ARM versions pending:
>>
>> #               with/without -DECP_NISTZ256_ASM
>> # Pentium       +66-168%
>> # PIII          +73-175%
>> # P4            +68-140%
>> # Core2         +90-215%
>> # Sandy Bridge  +105-265% (contemporary i[57]-* are all close to this)
>> # Atom          +66-160%
>> # Opteron       +54-112%
>> # Bulldozer     +99-240%
>> # VIA Nano      +93-300%
>>
>> #                       with/without -DECP_NISTZ256_ASM
>> # Cortex-A8             +53-173%
>> # Cortex-A9             +76-205%
>> # Cortex-A15            +100-316%
>> # Snapdragon S4         +66-187%
>>
>> No, bug in question is not there. Nor is AD*X code path is affected.
>>
>>
>
>
>
> diff --git a/crypto/ec/asm/ecp_nistz256-x86_64.pl b/crypto/ec/asm/ecp_nistz256-x86_64.pl
> index 4486a5e..56f6c2b 100755
> --- a/crypto/ec/asm/ecp_nistz256-x86_64.pl
> +++ b/crypto/ec/asm/ecp_nistz256-x86_64.pl
> @@ -31,15 +31,15 @@
>  # Further optimization by <[hidden email]>:
>  #
>  #              this/original
> -# Opteron      +8-33%
> -# Bulldozer    +10-30%
> -# P4           +14-38%
> -# Westmere     +8-23%
> -# Sandy Bridge +8-24%
> -# Ivy Bridge   +7-25%
> -# Haswell      +5-25%
> -# Atom         +10-32%
> -# VIA Nano     +37-130%
> +# Opteron      +10-43%
> +# Bulldozer    +14-43%
> +# P4           +18-50%
> +# Westmere     +12-36%
> +# Sandy Bridge +9-36%
> +# Ivy Bridge   +9-36%
> +# Haswell      +8-37%
> +# Atom         +15-50%
> +# VIA Nano     +43-160%
>  #
>  # Ranges denote minimum and maximum improvement coefficients depending
>  # on benchmark. Lower coefficients are for ECDSA sign, relatively
> @@ -550,28 +550,20 @@ __ecp_nistz256_mul_montq:
>         # and add the result to the acc.
>         # Due to the special form of p256 we do some optimizations
>         #
> -       # acc[0] x p256[0] = acc[0] x 2^64 - acc[0]
> -       # then we add acc[0] and get acc[0] x 2^64
> -
> -       mulq    $poly1
> -       xor     $t0, $t0
> -       add     $acc0, $acc1            # +=acc[0]*2^64
> -       adc     \$0, %rdx
> -       add     %rax, $acc1
> -       mov     $acc0, %rax
> -
> -       # acc[0] x p256[2] = 0
> -       adc     %rdx, $acc2
> -       adc     \$0, $t0
> +       # acc[0] x p256[0..1] = acc[0] x 2^96 - acc[0]
> +       # then we add acc[0] and get acc[0] x 2^96
>
> +       mov     $acc0, $t1
> +       shl     \$32, $acc0
>         mulq    $poly3
> -       xor     $acc0, $acc0
> -       add     $t0, $acc3
> -       adc     \$0, %rdx
> -       add     %rax, $acc3
> +       shr     \$32, $t1
> +       add     $acc0, $acc1            # +=acc[0]<<96
> +       adc     $t1, $acc2
> +       adc     %rax, $acc3
>          mov    8*1($b_ptr), %rax
>         adc     %rdx, $acc4
>         adc     \$0, $acc5
> +       xor     $acc0, $acc0
>
>         ########################################################################
>         # Multiply by b[1]
> @@ -608,23 +600,17 @@ __ecp_nistz256_mul_montq:
>
>         ########################################################################
>         # Second reduction step
> -       mulq    $poly1
> -       xor     $t0, $t0
> -       add     $acc1, $acc2
> -       adc     \$0, %rdx
> -       add     %rax, $acc2
> -       mov     $acc1, %rax
> -       adc     %rdx, $acc3
> -       adc     \$0, $t0
> -
> +       mov     $acc1, $t1
> +       shl     \$32, $acc1
>         mulq    $poly3
> -       xor     $acc1, $acc1
> -       add     $t0, $acc4
> -       adc     \$0, %rdx
> -       add     %rax, $acc4
> +       shr     \$32, $t1
> +       add     $acc1, $acc2
> +       adc     $t1, $acc3
> +       adc     %rax, $acc4
>          mov    8*2($b_ptr), %rax
>         adc     %rdx, $acc5
>         adc     \$0, $acc0
> +       xor     $acc1, $acc1
>
>         ########################################################################
>         # Multiply by b[2]
> @@ -661,23 +647,17 @@ __ecp_nistz256_mul_montq:
>
>         ########################################################################
>         # Third reduction step
> -       mulq    $poly1
> -       xor     $t0, $t0
> -       add     $acc2, $acc3
> -       adc     \$0, %rdx
> -       add     %rax, $acc3
> -       mov     $acc2, %rax
> -       adc     %rdx, $acc4
> -       adc     \$0, $t0
> -
> +       mov     $acc2, $t1
> +       shl     \$32, $acc2
>         mulq    $poly3
> -       xor     $acc2, $acc2
> -       add     $t0, $acc5
> -       adc     \$0, %rdx
> -       add     %rax, $acc5
> +       shr     \$32, $t1
> +       add     $acc2, $acc3
> +       adc     $t1, $acc4
> +       adc     %rax, $acc5
>          mov    8*3($b_ptr), %rax
>         adc     %rdx, $acc0
>         adc     \$0, $acc1
> +       xor     $acc2, $acc2
>
>         ########################################################################
>         # Multiply by b[3]
> @@ -714,20 +694,14 @@ __ecp_nistz256_mul_montq:
>
>         ########################################################################
>         # Final reduction step
> -       mulq    $poly1
> -       #xor    $t0, $t0
> -       add     $acc3, $acc4
> -       adc     \$0, %rdx
> -       add     %rax, $acc4
> -       mov     $acc3, %rax
> -       adc     %rdx, $acc5
> -       #adc    \$0, $t0                # doesn't overflow
> -
> +       mov     $acc3, $t1
> +       shl     \$32, $acc3
>         mulq    $poly3
> -       #add    $t0, $acc0
> -       #adc    \$0, %rdx
> +       shr     \$32, $t1
> +       add     $acc3, $acc4
> +       adc     $t1, $acc5
>          mov    $acc4, $t0
> -       add     %rax, $acc0
> +       adc     %rax, $acc0
>         adc     %rdx, $acc1
>          mov    $acc5, $t1
>         adc     \$0, $acc2
> @@ -897,89 +871,62 @@ __ecp_nistz256_sqr_montq:
>         ##########################################
>         # Now the reduction
>         # First iteration
> -       mulq    $a_ptr
> -       #xor    $t0, $t0
> -       add     $acc0, $acc1
> -       adc     \$0, %rdx
> -       add     %rax, $acc1
> -       mov     $acc0, %rax
> -       adc     %rdx, $acc2     # doesn't overflow
> -       #adc    \$0, $t0
> -
> +       mov     $acc0, $t0
> +       shl     \$32, $acc0
>         mulq    $t1
> -       xor     $acc0, $acc0
> -       #add    $t0, $acc3
> -       #adc    \$0, %rdx
> -       add     %rax, $acc3
> +       shr     \$32, $t0
> +       add     $acc0, $acc1            # +=acc[0]<<96
> +       adc     $t0, $acc2
> +       adc     %rax, $acc3
>          mov    $acc1, %rax
> -       adc     %rdx, $acc4
> -       adc     \$0, $acc0
> +       adc     \$0, %rdx
>
>         ##########################################
>         # Second iteration
> -       mulq    $a_ptr
> -       #xor    $t0, $t0
> -       add     $acc1, $acc2
> -       adc     \$0, %rdx
> -       add     %rax, $acc2
> -       mov     $acc1, %rax
> -       adc     %rdx, $acc3     # doesn't overflow
> -       #adc    \$0, $t0
> -
> +       mov     $acc1, $t0
> +       shl     \$32, $acc1
> +       mov     %rdx, $acc0
>         mulq    $t1
> -       xor     $acc1, $acc1
> -       #add    $t0, $acc4
> -       #adc    \$0, %rdx
> -       add     %rax, $acc4
> +       shr     \$32, $t0
> +       add     $acc1, $acc2
> +       adc     $t0, $acc3
> +       adc     %rax, $acc0
>          mov    $acc2, %rax
> -       adc     %rdx, $acc0
> -       adc     \$0, $acc1
> +       adc     \$0, %rdx
>
>         ##########################################
>         # Third iteration
> -       mulq    $a_ptr
> -       #xor    $t0, $t0
> -       add     $acc2, $acc3
> -       adc     \$0, %rdx
> -       add     %rax, $acc3
> -       mov     $acc2, %rax
> -       adc     %rdx, $acc4     # doesn't overflow
> -       #adc    \$0, $t0
> -
> +       mov     $acc2, $t0
> +       shl     \$32, $acc2
> +       mov     %rdx, $acc1
>         mulq    $t1
> -       xor     $acc2, $acc2
> -       #add    $t0, $acc0
> -       #adc    \$0, %rdx
> -       add     %rax, $acc0
> +       shr     \$32, $t0
> +       add     $acc2, $acc3
> +       adc     $t0, $acc0
> +       adc     %rax, $acc1
>          mov    $acc3, %rax
> -       adc     %rdx, $acc1
> -       adc     \$0, $acc2
> +       adc     \$0, %rdx
>
>         ###########################################
>         # Last iteration
> -       mulq    $a_ptr
> -       #xor    $t0, $t0
> -       add     $acc3, $acc4
> -       adc     \$0, %rdx
> -       add     %rax, $acc4
> -       mov     $acc3, %rax
> -       adc     %rdx, $acc0     # doesn't overflow
> -       #adc    \$0, $t0
> -
> +       mov     $acc3, $t0
> +       shl     \$32, $acc3
> +       mov     %rdx, $acc2
>         mulq    $t1
> +       shr     \$32, $t0
> +       add     $acc3, $acc0
> +       adc     $t0, $acc1
> +       adc     %rax, $acc2
> +       adc     \$0, %rdx
>         xor     $acc3, $acc3
> -       #add    $t0, $acc1
> -       #adc    \$0, %rdx
> -       add     %rax, $acc1
> -       adc     %rdx, $acc2
> -       adc     \$0, $acc3
>
>         ############################################
>         # Add the rest of the acc
> -       add     $acc0, $acc5
> +       add     $acc0, $acc4
> +       adc     $acc1, $acc5
>          mov    $acc4, $acc0
> -       adc     $acc1, $acc6
> -       adc     $acc2, $acc7
> +       adc     $acc2, $acc6
> +       adc     %rdx, $acc7
>          mov    $acc5, $acc1
>         adc     \$0, $acc3
>
> @@ -1028,18 +975,15 @@ __ecp_nistz256_mul_montx:
>
>         ########################################################################
>         # First reduction step
> -       xor     $acc0, $acc0            # $acc0=0,cf=0,of=0
> -       adox    $t1, $acc1
> -       adox    $t0, $acc2
> +       add     $t1, $acc1
> +       adc     $t0, $acc2
>
>         mulx    $poly3, $t0, $t1
>          mov    8*1($b_ptr), %rdx
> -       adox    $t0, $acc3
> -       adcx    $t1, $acc4
> -
> -       adox    $acc0, $acc4
> -       adcx    $acc0, $acc5            # cf=0
> -       adox    $acc0, $acc5            # of=0
> +       adc     $t0, $acc3
> +       adc     $t1, $acc4
> +       adc     \$0, $acc5
> +       xor     $acc0, $acc0            # $acc0=0,cf=0,of=0
>
>         ########################################################################
>         # Multiply by b[1]
> @@ -1068,18 +1012,15 @@ __ecp_nistz256_mul_montx:
>
>         ########################################################################
>         # Second reduction step
> -       xor     $acc1 ,$acc1            # $acc1=0,cf=0,of=0
> -       adox    $t0, $acc2
> -       adox    $t1, $acc3
> +       add     $t0, $acc2
> +       adc     $t1, $acc3
>
>         mulx    $poly3, $t0, $t1
>          mov    8*2($b_ptr), %rdx
> -       adox    $t0, $acc4
> -       adcx    $t1, $acc5
> -
> -       adox    $acc1, $acc5
> -       adcx    $acc1, $acc0            # cf=0
> -       adox    $acc1, $acc0            # of=0
> +       adc     $t0, $acc4
> +       adc     $t1, $acc5
> +       adc     \$0, $acc0
> +       xor     $acc1 ,$acc1            # $acc1=0,cf=0,of=0
>
>         ########################################################################
>         # Multiply by b[2]
> @@ -1108,18 +1049,15 @@ __ecp_nistz256_mul_montx:
>
>         ########################################################################
>         # Third reduction step
> -       xor     $acc2, $acc2            # $acc2=0,cf=0,of=0
> -       adox    $t0, $acc3
> -       adox    $t1, $acc4
> +       add     $t0, $acc3
> +       adc     $t1, $acc4
>
>         mulx    $poly3, $t0, $t1
>          mov    8*3($b_ptr), %rdx
> -       adox    $t0, $acc5
> -       adcx    $t1, $acc0
> -
> -       adox    $acc2, $acc0
> -       adcx    $acc2, $acc1            # cf=0
> -       adox    $acc2, $acc1            # of=0
> +       adc     $t0, $acc5
> +       adc     $t1, $acc0
> +       adc     \$0, $acc1
> +       xor     $acc2, $acc2            # $acc2=0,cf=0,of=0
>
>         ########################################################################
>         # Multiply by b[3]
> @@ -1148,25 +1086,21 @@ __ecp_nistz256_mul_montx:
>
>         ########################################################################
>         # Fourth reduction step
> -       xor     $acc3, $acc3            # $acc3=0,cf=0,of=0
> -       adox    $t0, $acc4
> -       adox    $t1, $acc5
> +       add     $t0, $acc4
> +       adc     $t1, $acc5
>
>         mulx    $poly3, $t0, $t1
>          mov    $acc4, $t2
>         mov     .Lpoly+8*1(%rip), $poly1
> -       adcx    $t0, $acc0
> -       adox    $t1, $acc1
> +       adc     $t0, $acc0
>          mov    $acc5, $t3
> -
> -       adcx    $acc3, $acc1
> -       adox    $acc3, $acc2
> +       adc     $t1, $acc1
>         adc     \$0, $acc2
> -        mov    $acc0, $t0
>
>         ########################################################################
>         # Branch-less conditional subtraction of P
>         xor     %eax, %eax
> +        mov    $acc0, $t0
>         sbb     \$-1, $acc4             # .Lpoly[0]
>         sbb     $poly1, $acc5           # .Lpoly[1]
>         sbb     \$0, $acc0              # .Lpoly[2]
> @@ -1247,52 +1181,44 @@ __ecp_nistz256_sqr_montx:
>          mov    .Lpoly+8*3(%rip), $t1
>
>         # reduction step 1
> -       xor     $acc0, $acc0
> -       adcx    $t0, $acc1
> -       adcx    $t4, $acc2
> +       add     $t0, $acc1
> +       adc     $t4, $acc2
>
> -       mulx    $t1, $t0, $t4
> +       mulx    $t1, $t0, $acc0
>          mov    $acc1, %rdx
> -       adcx    $t0, $acc3
> +       adc     $t0, $acc3
>          shlx   $a_ptr, $acc1, $t0
> -       adox    $t4, $acc0
> -        shrx   $a_ptr, $acc1, $t4
>         adc     \$0, $acc0
> +        shrx   $a_ptr, $acc1, $t4
>
>         # reduction step 2
> -       xor     $acc1, $acc1
> -       adcx    $t0, $acc2
> -       adcx    $t4, $acc3
> +       add     $t0, $acc2
> +       adc     $t4, $acc3
>
> -       mulx    $t1, $t0, $t4
> +       mulx    $t1, $t0, $acc1
>          mov    $acc2, %rdx
> -       adcx    $t0, $acc0
> +       adc     $t0, $acc0
>          shlx   $a_ptr, $acc2, $t0
> -       adox    $t4, $acc1
> -        shrx   $a_ptr, $acc2, $t4
>         adc     \$0, $acc1
> +        shrx   $a_ptr, $acc2, $t4
>
>         # reduction step 3
> -       xor     $acc2, $acc2
> -       adcx    $t0, $acc3
> -       adcx    $t4, $acc0
> +       add     $t0, $acc3
> +       adc     $t4, $acc0
>
> -       mulx    $t1, $t0, $t4
> +       mulx    $t1, $t0, $acc2
>          mov    $acc3, %rdx
> -       adcx    $t0, $acc1
> +       adc     $t0, $acc1
>          shlx   $a_ptr, $acc3, $t0
> -       adox    $t4, $acc2
> -        shrx   $a_ptr, $acc3, $t4
>         adc     \$0, $acc2
> +        shrx   $a_ptr, $acc3, $t4
>
>         # reduction step 4
> -       xor     $acc3, $acc3
> -       adcx    $t0, $acc0
> -       adcx    $t4, $acc1
> +       add     $t0, $acc0
> +       adc     $t4, $acc1
>
> -       mulx    $t1, $t0, $t4
> -       adcx    $t0, $acc2
> -       adox    $t4, $acc3
> +       mulx    $t1, $t0, $acc3
> +       adc     $t0, $acc2
>         adc     \$0, $acc3
>
>         xor     $t3, $t3                # cf=0
>

Re: [openssl.org #3607] nistz256 is broken.

Adam Langley-3
In reply to this post by Adam Langley-3
On Tue, Dec 2, 2014 at 12:33 PM, Adam Langley <[hidden email]> wrote:
> thanks! Was away last week and so didn't have a chance to try fixing this.
>
> I'll patch that it and run the tests against it.

I've run out of time tracking this down for today, but I got to the
point where setting the Jacobian coordinates:

X: C4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
Y: 3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
Z: F97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631

and multiplying that point by
2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD
results in the affine point:

x: 4BBC2813F69EF6A4D3E69E2832E9A9E97FF59F8C136DCDBD9509BC685FF337FD
y: BDCB623715CE2D983CFC2776C6EED4375454BE2C88932D43856906C1DC7A0BD7

However, I believe that the result should be:

x: C2910AA0216D12DE30C5573CCFC4116546E3091DC1E9EC8604F634185CE40863
y: C9071E13D688C305CE179C6168DD9066657BC6CDC1639A44B68DF7F1E0A40EDF
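
A quick way to cross-check a vector like this outside of OpenSSL is plain
Python with textbook affine formulas (a sketch, not OpenSSL code; it assumes
the point really lies on P-256 and needs Python 3.8+ for pow(x, -1, p)):

p = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
a = p - 3    # NIST P-256 curve coefficient

X = 0xC4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
Y = 0x3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
Z = 0xF97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631
k = 0x2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD

zinv = pow(Z, -1, p)                 # Jacobian -> affine: x = X/Z^2, y = Y/Z^3
x0 = (X * zinv * zinv) % p
y0 = (Y * zinv * zinv * zinv) % p

def add(P, Q):                       # affine addition; None is the point at infinity
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

R = None                             # plain left-to-right double-and-add
for bit in bin(k)[2:]:
    R = add(R, R)
    if bit == '1':
        R = add(R, (x0, y0))

print("x: %064X" % R[0])
print("y: %064X" % R[1])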


Cheers

AGL
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>> thanks! Was away last week and so didn't have a chance to try fixing this.
>>
>> I'll patch that it and run the tests against it.
>
> I've run out of time tracking this down for today, but I got to the
> point where setting the Jacobian coordinates:
>
> X: C4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
> Y: 3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
> Z: F97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631
>
> and multiplying that point by
> 2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD
> results in the affine point:
>
> x: 4BBC2813F69EF6A4D3E69E2832E9A9E97FF59F8C136DCDBD9509BC685FF337FD
> y: BDCB623715CE2D983CFC2776C6EED4375454BE2C88932D43856906C1DC7A0BD7
>
> However, I believe that the result should be:
>
> x: C2910AA0216D12DE30C5573CCFC4116546E3091DC1E9EC8604F634185CE40863
> y: C9071E13D688C305CE179C6168DD9066657BC6CDC1639A44B68DF7F1E0A40EDF

I do get the latter...





______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>>> thanks! Was away last week and so didn't have a chance to try fixing this.
>>>
>>> I'll patch that it and run the tests against it.
>>
>> I've run out of time tracking this down for today, but I got to the
>> point where setting the Jacobian coordinates:
>>
>> X: C4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
>> Y: 3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
>> Z: F97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631
>>
>> and multiplying that point by
>> 2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD
>> results in the affine point:
>>
>> x: 4BBC2813F69EF6A4D3E69E2832E9A9E97FF59F8C136DCDBD9509BC685FF337FD
>> y: BDCB623715CE2D983CFC2776C6EED4375454BE2C88932D43856906C1DC7A0BD7
>>
>> However, I believe that the result should be:
>>
>> x: C2910AA0216D12DE30C5573CCFC4116546E3091DC1E9EC8604F634185CE40863
>> y: C9071E13D688C305CE179C6168DD9066657BC6CDC1639A44B68DF7F1E0A40EDF
>
> I do get the latter...

... in master, and I get the former in 1.0.2. Looking into it...



______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>>>> thanks! Was away last week and so didn't have a chance to try fixing this.

>>>>
>>>> I'll patch that it and run the tests against it.
>>>
>>> I've run out of time tracking this down for today, but I got to the
>>> point where setting the Jacobian coordinates:
>>>
>>> X: C4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
>>> Y: 3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
>>> Z: F97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631
>>>
>>> and multiplying that point by
>>> 2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD
>>> results in the affine point:
>>>
>>> x: 4BBC2813F69EF6A4D3E69E2832E9A9E97FF59F8C136DCDBD9509BC685FF337FD
>>> y: BDCB623715CE2D983CFC2776C6EED4375454BE2C88932D43856906C1DC7A0BD7
>>>
>>> However, I believe that the result should be:
>>>
>>> x: C2910AA0216D12DE30C5573CCFC4116546E3091DC1E9EC8604F634185CE40863
>>> y: C9071E13D688C305CE179C6168DD9066657BC6CDC1639A44B68DF7F1E0A40EDF
>>
>> I do get the latter...
>
> ... in master, and I get the former in 1.0.2. Looking into it...
Attached patch produces correct result in 1.0.2. Looking further for
explanation...



commit 9ca10dca3ca6bc5ea739a998258351dbfd261b42
Author: Andy Polyakov <[hidden email]>
Date:   Sun Oct 26 12:45:47 2014 +0100

    Add ARMv4 ECP_NISTZ256 implementation.

diff --git a/Configure b/Configure
index 2eda5e6..777e8cf 100755
--- a/Configure
+++ b/Configure
@@ -138,7 +138,7 @@ my $alpha_asm="alphacpuid.o:bn_asm.o alpha-mont.o::::::sha1-alpha.o:::::::ghash-
 my $mips64_asm=":bn-mips.o mips-mont.o:::aes_cbc.o aes-mips.o:::sha1-mips.o sha256-mips.o sha512-mips.o::::::::";
 my $mips32_asm=$mips64_asm; $mips32_asm =~ s/\s*sha512\-mips\.o//;
 my $s390x_asm="s390xcap.o s390xcpuid.o:bn-s390x.o s390x-mont.o s390x-gf2m.o:::aes-s390x.o aes-ctr.o aes-xts.o:::sha1-s390x.o sha256-s390x.o sha512-s390x.o::rc4-s390x.o:::::ghash-s390x.o:";
-my $armv4_asm="armcap.o armv4cpuid.o:bn_asm.o armv4-mont.o armv4-gf2m.o:::aes_cbc.o aes-armv4.o bsaes-armv7.o aesv8-armx.o:::sha1-armv4-large.o sha256-armv4.o sha512-armv4.o:::::::ghash-armv4.o ghashv8-armx.o::void";
+my $armv4_asm="armcap.o armv4cpuid.o:bn_asm.o armv4-mont.o armv4-gf2m.o:ecp_nistz256.o ecp_nistz256-armv4.o::aes_cbc.o aes-armv4.o bsaes-armv7.o aesv8-armx.o:::sha1-armv4-large.o sha256-armv4.o sha512-armv4.o:::::::ghash-armv4.o ghashv8-armx.o::void";
 my $aarch64_asm="armcap.o arm64cpuid.o mem_clr.o::::aes_core.o aes_cbc.o aesv8-armx.o:::sha1-armv8.o sha256-armv8.o sha512-armv8.o:::::::ghashv8-armx.o:";
 my $parisc11_asm="pariscid.o:bn_asm.o parisc-mont.o:::aes_core.o aes_cbc.o aes-parisc.o:::sha1-parisc.o sha256-parisc.o sha512-parisc.o::rc4-parisc.o:::::ghash-parisc.o::32";
 my $parisc20_asm="pariscid.o:pa-risc2W.o parisc-mont.o:::aes_core.o aes_cbc.o aes-parisc.o:::sha1-parisc.o sha256-parisc.o sha512-parisc.o::rc4-parisc.o:::::ghash-parisc.o::64";
diff --git a/TABLE b/TABLE
index d778dac..b1d4543 100644
--- a/TABLE
+++ b/TABLE
@@ -1132,7 +1132,7 @@ $lflags       = -ldl
 $bn_ops       = BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR
 $cpuid_obj    = armcap.o armv4cpuid.o
 $bn_obj       = bn_asm.o armv4-mont.o armv4-gf2m.o
-$ec_obj       =
+$ec_obj       = ecp_nistz256.o ecp_nistz256-armv4.o
 $des_obj      =
 $aes_obj      = aes_cbc.o aes-armv4.o bsaes-armv7.o aesv8-armx.o
 $bf_obj       =
@@ -4362,7 +4362,7 @@ $lflags       = -ldl
 $bn_ops       = BN_LLONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR
 $cpuid_obj    = armcap.o armv4cpuid.o
 $bn_obj       = bn_asm.o armv4-mont.o armv4-gf2m.o
-$ec_obj       =
+$ec_obj       = ecp_nistz256.o ecp_nistz256-armv4.o
 $des_obj      =
 $aes_obj      = aes_cbc.o aes-armv4.o bsaes-armv7.o aesv8-armx.o
 $bf_obj       =
diff --git a/crypto/ec/Makefile b/crypto/ec/Makefile
index 898e43d..e020c93 100644
--- a/crypto/ec/Makefile
+++ b/crypto/ec/Makefile
@@ -54,6 +54,9 @@ ecp_nistz256-x86_64.s: asm/ecp_nistz256-x86_64.pl
 ecp_nistz256-avx2.s:   asm/ecp_nistz256-avx2.pl
  $(PERL) asm/ecp_nistz256-avx2.pl $(PERLASM_SCHEME) > $@
 
+ecp_nistz256-%.S: asm/ecp_nistz256-%.pl;  $(PERL) $< $(PERLASM_SCHEME) $@
+ecp_nistz256-armv4.o: ecp_nistz256-armv4.S
+
 files:
  $(PERL) $(TOP)/util/files.pl Makefile >> $(TOP)/MINFO
 
diff --git a/crypto/ec/asm/ecp_nistz256-armv4.pl b/crypto/ec/asm/ecp_nistz256-armv4.pl
new file mode 100755
index 0000000..ad29948
--- /dev/null
+++ b/crypto/ec/asm/ecp_nistz256-armv4.pl
@@ -0,0 +1,1720 @@
+#!/usr/bin/env perl
+
+# ====================================================================
+# Written by Andy Polyakov <[hidden email]> for the OpenSSL
+# project. The module is, however, dual licensed under OpenSSL and
+# CRYPTOGAMS licenses depending on where you obtain it. For further
+# details see http://www.openssl.org/~appro/cryptogams/.
+# ====================================================================
+#
+# ECP_NISTZ256 module for ARMv4.
+#
+# October 2014.
+#
+# Original ECP_NISTZ256 submission targeting x86_64 is detailed in
+# http://eprint.iacr.org/2013/816. In the process of adaptation
+# original .c module was made 32-bit savvy in order to make this
+# implementation possible.
+#
+# with/without -DECP_NISTZ256_ASM
+# Cortex A5 +74-209%
+# Cortex A8 +53-173%
+# Snapdragon S4 +66-187%
+#
+# Ranges denote minimum and maximum improvement coefficients depending
+# on benchmark. Lower coefficients are for server-side operations.
+# Keep in mind that +200% means 3x improvement.
+
+while (($output=shift) && ($output!~/^\w[\w\-]*\.\w+$/)) {}
+open STDOUT,">$output";
+
+$code.=<<___;
+#include "arm_arch.h"
+
+.text
+.code 32
+___
+########################################################################
+# Convert ecp_nistz256_table.c to layout expected by ecp_nistz_gather_w7
+#
+$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
+open TABLE,"<ecp_nistz256_table.c" or
+open TABLE,"<${dir}../ecp_nistz256_table.c" or
+die "failed to open ecp_nistz256_table.c:",$!;
+
+use integer;
+
+foreach(<TABLE>) {
+ s/TOBN\(\s*(0x[0-9a-f]+),\s*(0x[0-9a-f]+)\s*\)/push @arr,hex($2),hex($1)/geo;
+}
+close TABLE;
+
+die "insane number of elements" if ($#arr != 64*16*37-1);
+
+$code.=<<___;
+.globl ecp_nistz256_precomputed
+.type ecp_nistz256_precomputed,%object
+.align 12
+ecp_nistz256_precomputed:
+___
+for(1..37) {
+ @tbl = splice(@arr,0,64*16);
+ for($i=0;$i<64;$i++) {
+ undef @line;
+ for($j=0;$j<64;$j++) {
+ push @line,(@tbl[$j*16+$i/4]>>(($i%4)*8))&0xff;
+ }
+ $code.=".byte\t";
+ $code.=join(',',map { sprintf "0x%02x",$_} @line);
+ $code.="\n";
+ }
+}
+$code.=<<___;
+.size ecp_nistz256_precomputed,.-ecp_nistz256_precomputed
+.align 5
+.LRR: @ 2^512 mod P precomputed for NIST P256 polynomial
+.long 0x00000003, 0x00000000, 0xffffffff, 0xfffffffb
+.long 0xfffffffe, 0xffffffff, 0xfffffffd, 0x00000004
+.Lone:
+.long 1,0,0,0,0,0,0,0
+.asciz "ECP_NISTZ256 for ARMv4, CRYPTOGAMS by <appro\@openssl.org>"
+.align 6
+
+.globl ecp_nistz256_to_mont
+.type ecp_nistz256_to_mont,%function
+ecp_nistz256_to_mont:
+ adr r2,.LRR
+ b .Lecp_nistz256_mul_mont
+.size ecp_nistz256_to_mont,.-ecp_nistz256_to_mont
+
+.globl ecp_nistz256_from_mont
+.type ecp_nistz256_from_mont,%function
+ecp_nistz256_from_mont:
+ adr r2,.Lone
+ b .Lecp_nistz256_mul_mont
+.size ecp_nistz256_from_mont,.-ecp_nistz256_from_mont
+___
+
+($r_ptr,$a_ptr,$b_ptr,$ff,$a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7,$t1,$t2)=
+ map("r$_",(0..12,14));
+($t0,$t3)=($ff,$a_ptr);
+
+$code.=<<___;
+.globl ecp_nistz256_mul_by_2
+.type ecp_nistz256_mul_by_2,%function
+.align 4
+ecp_nistz256_mul_by_2:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_mul_by_2
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_mul_by_2,.-ecp_nistz256_mul_by_2
+
+.type _ecp_nistz256_mul_by_2,%function
+.align 4
+_ecp_nistz256_mul_by_2:
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ adds $a0,$a0,$a0 @ a[0:7]+=a[0:7]
+ ldr $a3,[$a_ptr,#12]
+ adcs $a1,$a1,$a1
+ ldr $a4,[$a_ptr,#16]
+ adcs $a2,$a2,$a2
+ ldr $a5,[$a_ptr,#20]
+ adcs $a3,$a3,$a3
+ ldr $a6,[$a_ptr,#24]
+ adcs $a4,$a4,$a4
+ ldr $a7,[$a_ptr,#28]
+ adcs $a5,$a5,$a5
+ adcs $a6,$a6,$a6
+ mov $ff,#0
+ adcs $a7,$a7,$a7
+ movcs $ff,#-1
+
+ subs $a0,$a0,$ff
+ sbcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ sbcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ sbcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ sbcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ sbcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ sbcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ sbcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_mul_by_2,.-_ecp_nistz256_mul_by_2
+
+.globl ecp_nistz256_mul_by_3
+.type ecp_nistz256_mul_by_3,%function
+.align 4
+ecp_nistz256_mul_by_3:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_mul_by_3
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_mul_by_3,.-ecp_nistz256_mul_by_3
+
+.type _ecp_nistz256_mul_by_3,%function
+.align 4
+_ecp_nistz256_mul_by_3:
+ str lr,[sp,#-4]! @ push lr
+
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ adds $a0,$a0,$a0 @ a[0:7]+=a[0:7]
+ ldr $a3,[$a_ptr,#12]
+ adcs $a1,$a1,$a1
+ ldr $a4,[$a_ptr,#16]
+ adcs $a2,$a2,$a2
+ ldr $a5,[$a_ptr,#20]
+ adcs $a3,$a3,$a3
+ ldr $a6,[$a_ptr,#24]
+ adcs $a4,$a4,$a4
+ ldr $a7,[$a_ptr,#28]
+ adcs $a5,$a5,$a5
+ adcs $a6,$a6,$a6
+ mov $ff,#0
+ adcs $a7,$a7,$a7
+ movcs $ff,#-1
+
+ subs $a0,$a0,$ff
+ sbcs $a1,$a1,$ff
+ sbcs $a2,$a2,$ff
+ sbcs $a3,$a3,#0
+ sbcs $a4,$a4,#0
+ ldr $b_ptr,[$a_ptr,#0]
+ sbcs $a5,$a5,#0
+ ldr $t1,[$a_ptr,#4]
+ sbcs $a6,$a6,$ff,lsr#31
+ ldr $t2,[$a_ptr,#8]
+ sbcs $a7,$a7,$ff
+
+ ldr $t0,[$a_ptr,#12]
+ adds $a0,$b_ptr @ 2*a[0:7]+=a[0:7]
+ ldr $b_ptr,[$a_ptr,#16]
+ adcs $a1,$t1
+ ldr $t1,[$a_ptr,#20]
+ adcs $a2,$t2
+ ldr $t2,[$a_ptr,#24]
+ adcs $a3,$t0
+ ldr $t3,[$a_ptr,#28]
+ adcs $a4,$b_ptr
+ adcs $a5,$t1
+ adcs $a6,$t2
+ mov $ff,#0
+ adcs $a7,$t3
+ movcs $ff,#-1
+ ldr lr,[sp],#4 @ pop lr
+
+ subs $a0,$a0,$ff
+ sbcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ sbcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ sbcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ sbcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ sbcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ sbcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ sbcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_mul_by_3,.-_ecp_nistz256_mul_by_3
+
+.globl ecp_nistz256_div_by_2
+.type ecp_nistz256_div_by_2,%function
+.align 4
+ecp_nistz256_div_by_2:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_div_by_2
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_div_by_2,.-ecp_nistz256_div_by_2
+
+.type _ecp_nistz256_div_by_2,%function
+.align 4
+_ecp_nistz256_div_by_2:
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ mov $ff,$a0,lsl#31
+ ldr $a3,[$a_ptr,#12]
+ adds $a0,$a0,$ff,asr#31
+ ldr $a4,[$a_ptr,#16]
+ adcs $a1,$a1,$ff,asr#31
+ ldr $a5,[$a_ptr,#20]
+ adcs $a2,$a2,$ff,asr#31
+ ldr $a6,[$a_ptr,#24]
+ adcs $a3,$a3,#0
+ ldr $a7,[$a_ptr,#28]
+ adcs $a4,$a4,#0
+ mov $a0,$a0,lsr#1 @ a[0:7]>>=1
+ adcs $a5,$a5,#0
+ orr $a0,$a1,lsl#31
+ adcs $a6,$a6,$ff,lsr#31
+ mov $b_ptr,#0
+ adcs $a7,$a7,$ff,asr#31
+ mov $a1,$a1,lsr#1
+ adc $b_ptr,#0
+
+ orr $a1,$a2,lsl#31
+ mov $a2,$a2,lsr#1
+ str $a0,[$r_ptr,#0]
+ orr $a2,$a3,lsl#31
+ mov $a3,$a3,lsr#1
+ str $a1,[$r_ptr,#4]
+ orr $a3,$a4,lsl#31
+ mov $a4,$a4,lsr#1
+ str $a2,[$r_ptr,#8]
+ orr $a4,$a5,lsl#31
+ mov $a5,$a5,lsr#1
+ str $a3,[$r_ptr,#12]
+ orr $a5,$a6,lsl#31
+ mov $a6,$a6,lsr#1
+ str $a4,[$r_ptr,#16]
+ orr $a6,$a7,lsl#31
+ mov $a7,$a7,lsr#1
+ str $a5,[$r_ptr,#20]
+ orr $a7,$b_ptr,lsl#31
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_div_by_2,.-_ecp_nistz256_div_by_2
+
+.globl ecp_nistz256_add
+.type ecp_nistz256_add,%function
+.align 4
+ecp_nistz256_add:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_add
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_add,.-ecp_nistz256_add
+
+.type _ecp_nistz256_add,%function
+.align 4
+_ecp_nistz256_add:
+ str lr,[sp,#-4]! @ push lr
+
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ ldr $a3,[$a_ptr,#12]
+ ldr $a4,[$a_ptr,#16]
+ ldr $t0,[$b_ptr,#0]
+ ldr $a5,[$a_ptr,#20]
+ ldr $t1,[$b_ptr,#4]
+ ldr $a6,[$a_ptr,#24]
+ ldr $t2,[$b_ptr,#8]
+ ldr $a7,[$a_ptr,#28]
+ ldr $t3,[$b_ptr,#12]
+ adds $a0,$t0
+ ldr $t0,[$b_ptr,#16]
+ adcs $a1,$t1
+ ldr $t1,[$b_ptr,#20]
+ adcs $a2,$t2
+ ldr $t2,[$b_ptr,#24]
+ adcs $a3,$t3
+ ldr $t3,[$b_ptr,#28]
+ adcs $a4,$t0
+ adcs $a5,$t1
+ adcs $a6,$t2
+ mov $ff,#0
+ adcs $a7,$t3
+ movcs $ff,#-1
+ ldr lr,[sp],#4 @ pop lr
+
+ subs $a0,$a0,$ff
+ sbcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ sbcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ sbcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ sbcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ sbcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ sbcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ sbcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_add,.-_ecp_nistz256_add
+
+.globl ecp_nistz256_sub
+.type ecp_nistz256_sub,%function
+.align 4
+ecp_nistz256_sub:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_sub
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_sub,.-ecp_nistz256_sub
+
+.type _ecp_nistz256_sub,%function
+.align 4
+_ecp_nistz256_sub:
+ str lr,[sp,#-4]! @ push lr
+
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ ldr $a3,[$a_ptr,#12]
+ ldr $a4,[$a_ptr,#16]
+ ldr $t0,[$b_ptr,#0]
+ ldr $a5,[$a_ptr,#20]
+ ldr $t1,[$b_ptr,#4]
+ ldr $a6,[$a_ptr,#24]
+ ldr $t2,[$b_ptr,#8]
+ ldr $a7,[$a_ptr,#28]
+ ldr $t3,[$b_ptr,#12]
+ subs $a0,$t0
+ ldr $t0,[$b_ptr,#16]
+ sbcs $a1,$t1
+ ldr $t1,[$b_ptr,#20]
+ sbcs $a2,$t2
+ ldr $t2,[$b_ptr,#24]
+ sbcs $a3,$t3
+ ldr $t3,[$b_ptr,#28]
+ sbcs $a4,$t0
+ sbcs $a5,$t1
+ sbcs $a6,$t2
+ sbcs $a7,$t3
+ sbc $ff,$ff,$ff
+ ldr lr,[sp],#4 @ pop lr
+
+ adds $a0,$a0,$ff
+ adcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ adcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ adcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ adcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ adcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ adcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ adcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_sub,.-_ecp_nistz256_sub
+
+.globl ecp_nistz256_neg
+.type ecp_nistz256_neg,%function
+.align 4
+ecp_nistz256_neg:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_neg
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_neg,.-ecp_nistz256_neg
+
+.type _ecp_nistz256_neg,%function
+.align 4
+_ecp_nistz256_neg:
+ ldr $a0,[$a_ptr,#0]
+ ldr $a1,[$a_ptr,#4]
+ ldr $a2,[$a_ptr,#8]
+ rsbs $a0,$a0,#0
+ ldr $a3,[$a_ptr,#12]
+ rscs $a1,$a1,#0
+ ldr $a4,[$a_ptr,#16]
+ rscs $a2,$a2,#0
+ ldr $a5,[$a_ptr,#20]
+ rscs $a3,$a3,#0
+ ldr $a6,[$a_ptr,#24]
+ rscs $a4,$a4,#0
+ ldr $a7,[$a_ptr,#28]
+ rscs $a5,$a5,#0
+ rscs $a6,$a6,#0
+ rscs $a7,$a7,#0
+ sbc $ff,$ff,$ff
+
+ adds $a0,$a0,$ff
+ adcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ adcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ adcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ adcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ adcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ adcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ adcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_neg,.-_ecp_nistz256_neg
+___
+{
+my @acc=map("r$_",(3..11));
+my ($t0,$t1,$bj,$t2,$t3)=map("r$_",(0,1,2,12,14));
+
+$code.=<<___;
+.globl ecp_nistz256_sqr_mont
+.type ecp_nistz256_sqr_mont,%function
+.align 4
+ecp_nistz256_sqr_mont:
+ mov $b_ptr,$a_ptr
+ b .Lecp_nistz256_mul_mont
+.size ecp_nistz256_sqr_mont,.-ecp_nistz256_sqr_mont
+
+.globl ecp_nistz256_mul_mont
+.type ecp_nistz256_mul_mont,%function
+.align 4
+ecp_nistz256_mul_mont:
+.Lecp_nistz256_mul_mont:
+ stmdb sp!,{r4-r12,lr}
+ bl _ecp_nistz256_mul_mont
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_mul_mont,.-ecp_nistz256_mul_mont
+
+.type _ecp_nistz256_mul_mont,%function
+.align 4
+_ecp_nistz256_mul_mont:
+ stmdb sp!,{r0-r2,lr}
+
+ ldr $bj,[$b_ptr,#0] @ b[0]
+ ldmia $a_ptr,{@acc[1]-@acc[8]}
+
+ umull @acc[0],$t3,@acc[1],$bj @ r[0]=a[0]*b[0]
+ stmdb sp!,{$acc[1]-@acc[8]} @ copy a[0-7] to stack, so
+ @ that it can be addressed
+ @ without spending register
+ @ on address
+ umull @acc[1],$t0,@acc[2],$bj
+ umull @acc[2],$t1,@acc[3],$bj
+ adds @acc[1],@acc[1],$t3
+ umull @acc[3],$t2,@acc[4],$bj
+ adcs @acc[2],@acc[2],$t0
+ umull @acc[4],$t3,@acc[5],$bj
+ adcs @acc[3],@acc[3],$t1
+ umull @acc[5],$t0,@acc[6],$bj
+ adcs @acc[4],@acc[4],$t2
+ umull @acc[6],$t1,@acc[7],$bj
+ adcs @acc[5],@acc[5],$t3
+ umull @acc[7],$t2,@acc[8],$bj
+ adcs @acc[6],@acc[6],$t0
+ adcs @acc[7],@acc[7],$t1
+ eor $t3,$t3,$t3 @ first overflow bit is zero
+ adc @acc[8],$t2,#0
+___
+for(my $i=1;$i<8;$i++) {
+my $t4=@acc[0];
+$code.=<<___;
+ @ multiplication-less reduction $i
+ adds @acc[3],@acc[3],@acc[0]
+ ldr $bj,[sp,#40] @ restore b_ptr
+ adcs @acc[4],@acc[4],#0
+ adcs @acc[5],@acc[5],#0
+ adcs @acc[6],@acc[6],@acc[0]
+ ldr $t1,[sp,#0] @ a[0]
+ adcs @acc[7],@acc[7],#0
+ ldr $bj,[$bj,#4*$i] @ b[i]
+ adcs @acc[8],@acc[8],@acc[0]
+ eor $t0,$t0,$t0
+ adc $t3,$t3,#0
+ subs @acc[7],@acc[7],@acc[0]
+ ldr $t2,[sp,#4] @ a[1]
+ sbcs @acc[8],@acc[8],#0
+ umlal @acc[1],$t0,$t1,$bj @ r[i]+=a[0]*b[i]
+ eor $t1,$t1,$t1
+ sbc @acc[0],$t3,#0 @ overflow bit
+
+ ldr $t3,[sp,#8] @ a[2]
+ umlal @acc[2],$t1,$t2,$bj
+ str @acc[0],[sp,#36] @ temporarily offload overflow
+ eor $t2,$t2,$t2
+ ldr $t4,[sp,#12] @ a[3], $t4 is alias @acc[0]
+ umlal @acc[3],$t2,$t3,$bj
+ eor $t3,$t3,$t3
+ adds @acc[2],@acc[2],$t0
+ ldr $t0,[sp,#16] @ a[4]
+ umlal @acc[4],$t3,$t4,$bj
+ eor $t4,$t4,$t4
+ adcs @acc[3],@acc[3],$t1
+ ldr $t1,[sp,#20] @ a[5]
+ umlal @acc[5],$t4,$t0,$bj
+ eor $t0,$t0,$t0
+ adcs @acc[4],@acc[4],$t2
+ ldr $t2,[sp,#24] @ a[6]
+ umlal @acc[6],$t0,$t1,$bj
+ eor $t1,$t1,$t1
+ adcs @acc[5],@acc[5],$t3
+ ldr $t3,[sp,#28] @ a[7]
+ umlal @acc[7],$t1,$t2,$bj
+ eor $t2,$t2,$t2
+ adcs @acc[6],@acc[6],$t4
+ ldr @acc[0],[sp,#36] @ restore overflow bit
+ umlal @acc[8],$t2,$t3,$bj
+ eor $t3,$t3,$t3
+ adcs @acc[7],@acc[7],$t0
+ adcs @acc[8],@acc[8],$t1
+ adcs @acc[0],$acc[0],$t2
+ adc $t3,$t3,#0 @ new overflow bit
+___
+ push(@acc,shift(@acc));
+}
+$code.=<<___;
+ @ last multiplication-less reduction
+ adds @acc[3],@acc[3],@acc[0]
+ ldr $r_ptr,[sp,#32] @ restore r_ptr
+ adcs @acc[4],@acc[4],#0
+ adcs @acc[5],@acc[5],#0
+ adcs @acc[6],@acc[6],@acc[0]
+ adcs @acc[7],@acc[7],#0
+ adcs @acc[8],@acc[8],@acc[0]
+ adc $t3,$t3,#0
+ subs @acc[7],@acc[7],@acc[0]
+ sbcs @acc[8],@acc[8],#0
+ sbc @acc[0],$t3,#0 @ overflow bit
+
+ neg @acc[0],@acc[0] @ 1 -> 0xffffffff, 0 -> 0
+ ldr lr,[sp,#44]
+ add sp,sp,#48
+
+ subs @acc[1],@acc[1],@acc[0] @ "conditional" subtract
+ sbcs @acc[2],@acc[2],@acc[0]
+ str @acc[1],[$r_ptr,#0]
+ sbcs @acc[3],@acc[3],@acc[0]
+ str @acc[2],[$r_ptr,#4]
+ sbcs @acc[4],@acc[4],#0
+ str @acc[3],[$r_ptr,#8]
+ sbcs @acc[5],@acc[5],#0
+ str @acc[4],[$r_ptr,#12]
+ sbcs @acc[6],@acc[6],#0
+ str @acc[5],[$r_ptr,#16]
+ sbcs @acc[7],@acc[7],@acc[0],lsr#31
+ str @acc[6],[$r_ptr,#20]
+ sbc @acc[8],@acc[8],@acc[0]
+ str @acc[7],[$r_ptr,#24]
+ str @acc[8],[$r_ptr,#28]
+
+ mov pc,lr
+.size _ecp_nistz256_mul_mont,.-_ecp_nistz256_mul_mont
+___
+}
+
+{
+my ($out,$inp,$index,$mask)=map("r$_",(0..3));
+$code.=<<___;
+.globl ecp_nistz256_scatter_w5
+.type ecp_nistz256_scatter_w5,%function
+.align 5
+ecp_nistz256_scatter_w5:
+ stmdb sp!,{r4-r11}
+
+ add $out,$out,$index,lsl#2
+
+ ldmia $inp!,{r4-r11} @ X
+ str r4,[$out,#64*0-4]
+ str r5,[$out,#64*1-4]
+ str r6,[$out,#64*2-4]
+ str r7,[$out,#64*3-4]
+ str r8,[$out,#64*4-4]
+ str r9,[$out,#64*5-4]
+ str r10,[$out,#64*6-4]
+ str r11,[$out,#64*7-4]
+ add $out,$out,#64*8
+
+ ldmia $inp!,{r4-r11} @ Y
+ str r4,[$out,#64*0-4]
+ str r5,[$out,#64*1-4]
+ str r6,[$out,#64*2-4]
+ str r7,[$out,#64*3-4]
+ str r8,[$out,#64*4-4]
+ str r9,[$out,#64*5-4]
+ str r10,[$out,#64*6-4]
+ str r11,[$out,#64*7-4]
+ add $out,$out,#64*8
+
+ ldmia $inp,{r4-r11} @ Z
+ str r4,[$out,#64*0-4]
+ str r5,[$out,#64*1-4]
+ str r6,[$out,#64*2-4]
+ str r7,[$out,#64*3-4]
+ str r8,[$out,#64*4-4]
+ str r9,[$out,#64*5-4]
+ str r10,[$out,#64*6-4]
+ str r11,[$out,#64*7-4]
+
+ ldmia sp!,{r4-r11}
+#if __ARM_ARCH__>=5
+ ret
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_scatter_w5,.-ecp_nistz256_scatter_w5
+
+.globl ecp_nistz256_gather_w5
+.type ecp_nistz256_gather_w5,%function
+.align 5
+ecp_nistz256_gather_w5:
+ stmdb sp!,{r4-r11}
+
+ cmp $index,#0
+ mov $mask,#0
+ subne $index,$index,#1
+ movne $mask,#-1
+ add $inp,$inp,$index,lsl#2
+
+ ldr r4,[$inp,#64*0]
+ ldr r5,[$inp,#64*1]
+ ldr r6,[$inp,#64*2]
+ and r4,r4,$mask
+ ldr r7,[$inp,#64*3]
+ and r5,r5,$mask
+ ldr r8,[$inp,#64*4]
+ and r6,r6,$mask
+ ldr r9,[$inp,#64*5]
+ and r7,r7,$mask
+ ldr r10,[$inp,#64*6]
+ and r8,r8,$mask
+ ldr r11,[$inp,#64*7]
+ add $inp,$inp,#64*8
+ and r9,r9,$mask
+ and r10,r10,$mask
+ and r11,r11,$mask
+ stmia $out!,{r4-r11} @ X
+
+ ldr r4,[$inp,#64*0]
+ ldr r5,[$inp,#64*1]
+ ldr r6,[$inp,#64*2]
+ and r4,r4,$mask
+ ldr r7,[$inp,#64*3]
+ and r5,r5,$mask
+ ldr r8,[$inp,#64*4]
+ and r6,r6,$mask
+ ldr r9,[$inp,#64*5]
+ and r7,r7,$mask
+ ldr r10,[$inp,#64*6]
+ and r8,r8,$mask
+ ldr r11,[$inp,#64*7]
+ add $inp,$inp,#64*8
+ and r9,r9,$mask
+ and r10,r10,$mask
+ and r11,r11,$mask
+ stmia $out!,{r4-r11} @ Y
+
+ ldr r4,[$inp,#64*0]
+ ldr r5,[$inp,#64*1]
+ ldr r6,[$inp,#64*2]
+ and r4,r4,$mask
+ ldr r7,[$inp,#64*3]
+ and r5,r5,$mask
+ ldr r8,[$inp,#64*4]
+ and r6,r6,$mask
+ ldr r9,[$inp,#64*5]
+ and r7,r7,$mask
+ ldr r10,[$inp,#64*6]
+ and r8,r8,$mask
+ ldr r11,[$inp,#64*7]
+ and r9,r9,$mask
+ and r10,r10,$mask
+ and r11,r11,$mask
+ stmia $out,{r4-r11} @ Z
+
+ ldmia sp!,{r4-r11}
+#if __ARM_ARCH__>=5
+ ret
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_gather_w5,.-ecp_nistz256_gather_w5
+
+.globl ecp_nistz256_scatter_w7
+.type ecp_nistz256_scatter_w7,%function
+.align 5
+ecp_nistz256_scatter_w7:
+ add $out,$out,$index
+ mov $index,#64/4
+.Loop_scatter_w7:
+ ldr $mask,[$inp],#4
+ subs $index,$index,#1
+ strb $mask,[$out,#64*0-1]
+ lsr $mask,$mask,#8
+ strb $mask,[$out,#64*1-1]
+ lsr $mask,$mask,#8
+ strb $mask,[$out,#64*2-1]
+ lsr $mask,$mask,#8
+ strb $mask,[$out,#64*3-1]
+ add $out,$out,#64*4
+ bne .Loop_scatter_w7
+
+#if __ARM_ARCH__>=5
+ ret
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_scatter_w7,.-ecp_nistz256_scatter_w7
+
+.globl ecp_nistz256_gather_w7
+.type ecp_nistz256_gather_w7,%function
+.align 5
+ecp_nistz256_gather_w7:
+ stmdb sp!,{r4-r7}
+
+ cmp $index,#0
+ mov $mask,#0
+ subne $index,$index,#1
+ movne $mask,#-1
+ add $inp,$inp,$index
+ mov $index,#64/4
+ nop
+.Loop_gather_w7:
+ ldrb r4,[$inp,#64*0]
+ subs $index,$index,#1
+ ldrb r5,[$inp,#64*1]
+ ldrb r6,[$inp,#64*2]
+ ldrb r7,[$inp,#64*3]
+ add $inp,$inp,#64*4
+ orr r4,r4,r5,lsl#8
+ orr r4,r4,r6,lsl#16
+ orr r4,r4,r7,lsl#24
+ and r4,r4,$mask
+ str r4,[$out],#4
+ bne .Loop_gather_w7
+
+ ldmia sp!,{r4-r7}
+#if __ARM_ARCH__>=5
+ ret
+#else
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_gather_w7,.-ecp_nistz256_gather_w7
+___
+}
+if (0) {
+# In comparison to integer-only equivalent of below subroutine:
+#
+# Cortex-A8 +10%
+# Cortex-A9 -10%
+# Snapdragon S4 +5%
+#
+# As not all time is spent in multiplication, overall impact is deemed
+# too low to care about.
+
+my ($A0,$A1,$A2,$A3,$Bi,$zero,$temp)=map("d$_",(0..7));
+my $mask="q4";
+my $mult="q5";
+my @AxB=map("q$_",(8..15));
+
+my ($rptr,$aptr,$bptr,$toutptr)=map("r$_",(0..3));
+
+$code.=<<___;
+#if __ARM_ARCH__>=7
+.fpu neon
+
+.globl ecp_nistz256_mul_mont_neon
+.type ecp_nistz256_mul_mont_neon,%function
+.align 5
+ecp_nistz256_mul_mont_neon:
+ mov ip,sp
+ stmdb sp!,{r4-r9}
+ vstmdb sp!,{q4-q5} @ ABI specification says so
+
+ sub $toutptr,sp,#40
+ vld1.32 {${Bi}[0]},[$bptr,:32]!
+ veor $zero,$zero,$zero
+ vld1.32 {$A0-$A3}, [$aptr] @ can't specify :32 :-(
+ vzip.16 $Bi,$zero
+ mov sp,$toutptr @ alloca
+ vmov.i64 $mask,#0xffff
+
+ vmull.u32 @AxB[0],$Bi,${A0}[0]
+ vmull.u32 @AxB[1],$Bi,${A0}[1]
+ vmull.u32 @AxB[2],$Bi,${A1}[0]
+ vmull.u32 @AxB[3],$Bi,${A1}[1]
+ vshr.u64 $temp,@AxB[0]#lo,#16
+ vmull.u32 @AxB[4],$Bi,${A2}[0]
+ vadd.u64 @AxB[0]#hi,@AxB[0]#hi,$temp
+ vmull.u32 @AxB[5],$Bi,${A2}[1]
+ vshr.u64 $temp,@AxB[0]#hi,#16 @ upper 32 bits of a[0]*b[0]
+ vmull.u32 @AxB[6],$Bi,${A3}[0]
+ vand.u64 @AxB[0],@AxB[0],$mask @ lower 32 bits of a[0]*b[0]
+ vmull.u32 @AxB[7],$Bi,${A3}[1]
+___
+for($i=1;$i<8;$i++) {
+$code.=<<___;
+ vld1.32 {${Bi}[0]},[$bptr,:32]!
+ veor $zero,$zero,$zero
+ vadd.u64 @AxB[1]#lo,@AxB[1]#lo,$temp @ reduction
+ vshl.u64 $mult,@AxB[0],#32
+ vadd.u64 @AxB[3],@AxB[3],@AxB[0]
+ vsub.u64 $mult,$mult,@AxB[0]
+ vzip.16 $Bi,$zero
+ vadd.u64 @AxB[6],@AxB[6],@AxB[0]
+ vadd.u64 @AxB[7],@AxB[7],$mult
+___
+ push(@AxB,shift(@AxB));
+$code.=<<___;
+ vmlal.u32 @AxB[0],$Bi,${A0}[0]
+ vmlal.u32 @AxB[1],$Bi,${A0}[1]
+ vmlal.u32 @AxB[2],$Bi,${A1}[0]
+ vmlal.u32 @AxB[3],$Bi,${A1}[1]
+ vshr.u64 $temp,@AxB[0]#lo,#16
+ vmlal.u32 @AxB[4],$Bi,${A2}[0]
+ vadd.u64 @AxB[0]#hi,@AxB[0]#hi,$temp
+ vmlal.u32 @AxB[5],$Bi,${A2}[1]
+ vshr.u64 $temp,@AxB[0]#hi,#16 @ upper 33 bits of a[0]*b[i]+t[0]
+ vmlal.u32 @AxB[6],$Bi,${A3}[0]
+ vand.u64 @AxB[0],@AxB[0],$mask @ lower 32 bits of a[0]*b[0]
+ vmull.u32 @AxB[7],$Bi,${A3}[1]
+___
+}
+$code.=<<___;
+ vadd.u64 @AxB[1]#lo,@AxB[1]#lo,$temp @ last reduction
+ vshl.u64 $mult,@AxB[0],#32
+ vadd.u64 @AxB[3],@AxB[3],@AxB[0]
+ vsub.u64 $mult,$mult,@AxB[0]
+ vadd.u64 @AxB[6],@AxB[6],@AxB[0]
+ vadd.u64 @AxB[7],@AxB[7],$mult
+
+ vshr.u64 $temp,@AxB[1]#lo,#16 @ convert
+ vadd.u64 @AxB[1]#hi,@AxB[1]#hi,$temp
+ vshr.u64 $temp,@AxB[1]#hi,#16
+ vzip.16 @AxB[1]#lo,@AxB[1]#hi
+___
+foreach (2..7) {
+$code.=<<___;
+ vadd.u64 @AxB[$_]#lo,@AxB[$_]#lo,$temp
+ vst1.32 {@AxB[$_-1]#lo[0]},[$toutptr,:32]!
+ vshr.u64 $temp,@AxB[$_]#lo,#16
+ vadd.u64 @AxB[$_]#hi,@AxB[$_]#hi,$temp
+ vshr.u64 $temp,@AxB[$_]#hi,#16
+ vzip.16 @AxB[$_]#lo,@AxB[$_]#hi
+___
+}
+$code.=<<___;
+ vst1.32 {@AxB[7]#lo[0]},[$toutptr,:32]!
+ vst1.32 {$temp},[$toutptr] @ upper 33 bits
+
+ ldr r1,[sp,#0]
+ ldr r2,[sp,#4]
+ ldr r3,[sp,#8]
+ ldr r4,[sp,#12]
+ ldr r5,[sp,#16]
+ ldr r6,[sp,#20]
+ ldr r7,[sp,#24]
+ ldr r8,[sp,#28]
+ ldr r9,[sp,#32] @ top-most bit
+ sub sp,ip,#40+16
+ neg r9,r9
+        vldmia  sp!,{q4-q5}
+
+ subs r1,r1,r9
+ sbcs r2,r2,r9
+ str r1,[$rptr,#0]
+ sbcs r3,r3,r9
+ str r2,[$rptr,#4]
+ sbcs r4,r4,#0
+ str r3,[$rptr,#8]
+ sbcs r5,r5,#0
+ str r4,[$rptr,#12]
+ sbcs r6,r6,#0
+ str r5,[$rptr,#16]
+ sbcs r7,r7,r9,lsr#31
+ str r6,[$rptr,#20]
+ sbcs r8,r8,r9
+ str r7,[$rptr,#24]
+ str r8,[$rptr,#28]
+
+        ldmia   sp!,{r4-r9}
+ ret @ bx lr
+.size ecp_nistz256_mul_mont_neon,.-ecp_nistz256_mul_mont_neon
+#endif
+___
+}
+
+{{{
+my ($a0,$a1,$a2,$a3,$a4,$a5,$a6,$a7,
+    $t0,$t1,$t2,$t3)=map("r$_",(11,3..10,12,14,1));
+my $ff=$b_ptr;
+
+$code.=<<___;
+.type __ecp_nistz256_sub_from,%function
+.align 5
+__ecp_nistz256_sub_from:
+ str lr,[sp,#-4]! @ push lr
+
+ ldr $t0,[$b_ptr,#0]
+ ldr $t1,[$b_ptr,#4]
+ ldr $t2,[$b_ptr,#8]
+ ldr $t3,[$b_ptr,#12]
+ subs $a0,$t0
+ ldr $t0,[$b_ptr,#16]
+ sbcs $a1,$t1
+ ldr $t1,[$b_ptr,#20]
+ sbcs $a2,$t2
+ ldr $t2,[$b_ptr,#24]
+ sbcs $a3,$t3
+ ldr $t3,[$b_ptr,#28]
+ sbcs $a4,$t0
+ sbcs $a5,$t1
+ sbcs $a6,$t2
+ sbcs $a7,$t3
+ sbc $ff,$ff,$ff
+ ldr lr,[sp],#4 @ pop lr
+
+ adds $a0,$a0,$ff
+ adcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ adcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ adcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ adcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ adcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ adcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ adcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size __ecp_nistz256_sub_from,.-__ecp_nistz256_sub_from
+
+.type __ecp_nistz256_sub_morf,%function
+.align 5
+__ecp_nistz256_sub_morf:
+ str lr,[sp,#-4]! @ push lr
+
+ ldr $t0,[$b_ptr,#0]
+ ldr $t1,[$b_ptr,#4]
+ ldr $t2,[$b_ptr,#8]
+ ldr $t3,[$b_ptr,#12]
+ rsbs $a0,$t0
+ ldr $t0,[$b_ptr,#16]
+ rscs $a1,$t1
+ ldr $t1,[$b_ptr,#20]
+ rscs $a2,$t2
+ ldr $t2,[$b_ptr,#24]
+ rscs $a3,$t3
+ ldr $t3,[$b_ptr,#28]
+ rscs $a4,$t0
+ rscs $a5,$t1
+ rscs $a6,$t2
+ rscs $a7,$t3
+ sbc $ff,$ff,$ff
+ ldr lr,[sp],#4 @ pop lr
+
+ adds $a0,$a0,$ff
+ adcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ adcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ adcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ adcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ adcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ adcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ adcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size __ecp_nistz256_sub_morf,.-__ecp_nistz256_sub_morf
+
+.type __ecp_nistz256_mul_by_2,%function
+.align 4
+__ecp_nistz256_mul_by_2:
+ adds $a0,$a0,$a0 @ a[0:7]+=a[0:7]
+ adcs $a1,$a1,$a1
+ adcs $a2,$a2,$a2
+ adcs $a3,$a3,$a3
+ adcs $a4,$a4,$a4
+ adcs $a5,$a5,$a5
+ adcs $a6,$a6,$a6
+ mov $ff,#0
+ adcs $a7,$a7,$a7
+ movcs $ff,#-1
+
+ subs $a0,$a0,$ff
+ sbcs $a1,$a1,$ff
+ str $a0,[$r_ptr,#0]
+ sbcs $a2,$a2,$ff
+ str $a1,[$r_ptr,#4]
+ sbcs $a3,$a3,#0
+ str $a2,[$r_ptr,#8]
+ sbcs $a4,$a4,#0
+ str $a3,[$r_ptr,#12]
+ sbcs $a5,$a5,#0
+ str $a4,[$r_ptr,#16]
+ sbcs $a6,$a6,$ff,lsr#31
+ str $a5,[$r_ptr,#20]
+ sbcs $a7,$a7,$ff
+ str $a6,[$r_ptr,#24]
+ str $a7,[$r_ptr,#28]
+
+ mov pc,lr
+.size __ecp_nistz256_mul_by_2,.-__ecp_nistz256_mul_by_2
+
+___
+
+{
+my ($S,$M,$Zsqr,$in_x,$tmp0)=map(32*$_,(0..4));
+
+$code.=<<___;
+.globl ecp_nistz256_point_double
+.type ecp_nistz256_point_double,%function
+.align 5
+ecp_nistz256_point_double:
+ stmdb sp!,{r0-r12,lr}
+ sub sp,sp,#32*5
+
+ add r3,sp,#$in_x
+ ldmia $a_ptr!,{r4-r11} @ copy in_x
+ stmia r3,{r4-r11}
+
+ add $r_ptr,sp,#$S
+ bl _ecp_nistz256_mul_by_2 @ p256_mul_by_2(S, in_y);
+
+ add $b_ptr,$a_ptr,#32
+ add $a_ptr,$a_ptr,#32
+ add $r_ptr,sp,#$Zsqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Zsqr, in_z);
+
+ add $a_ptr,sp,#$S
+ add $b_ptr,sp,#$S
+ add $r_ptr,sp,#$S
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(S, S);
+
+ ldr $b_ptr,[sp,#32*5+4]
+ add $a_ptr,$b_ptr,#32
+ add $b_ptr,$b_ptr,#64
+ add $r_ptr,sp,#$tmp0
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(tmp0, in_z, in_y);
+
+ ldr $r_ptr,[sp,#32*5]
+ add $r_ptr,$r_ptr,#64
+ bl __ecp_nistz256_mul_by_2 @ p256_mul_by_2(res_z, tmp0);
+
+ add $a_ptr,sp,#$in_x
+ add $b_ptr,sp,#$Zsqr
+ add $r_ptr,sp,#$M
+ bl _ecp_nistz256_add @ p256_add(M, in_x, Zsqr);
+
+ add $a_ptr,sp,#$in_x
+ add $b_ptr,sp,#$Zsqr
+ add $r_ptr,sp,#$Zsqr
+ bl _ecp_nistz256_sub @ p256_sub(Zsqr, in_x, Zsqr);
+
+ add $a_ptr,sp,#$S
+ add $b_ptr,sp,#$S
+ add $r_ptr,sp,#$tmp0
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(tmp0, S);
+
+ add $a_ptr,sp,#$Zsqr
+ add $b_ptr,sp,#$M
+ add $r_ptr,sp,#$M
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(M, M, Zsqr);
+
+ ldr $r_ptr,[sp,#32*5]
+ add $a_ptr,sp,#$tmp0
+ add $r_ptr,$r_ptr,#32
+ bl _ecp_nistz256_div_by_2 @ p256_div_by_2(res_y, tmp0);
+
+ add $a_ptr,sp,#$M
+ add $r_ptr,sp,#$M
+ bl _ecp_nistz256_mul_by_3 @ p256_mul_by_3(M, M);
+
+ add $a_ptr,sp,#$in_x
+ add $b_ptr,sp,#$S
+ add $r_ptr,sp,#$S
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S, S, in_x);
+
+ add $r_ptr,sp,#$tmp0
+ bl __ecp_nistz256_mul_by_2 @ p256_mul_by_2(tmp0, S);
+
+ ldr $r_ptr,[sp,#32*5]
+ add $a_ptr,sp,#$M
+ add $b_ptr,sp,#$M
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(res_x, M);
+
+ add $b_ptr,sp,#$tmp0
+ bl __ecp_nistz256_sub_from @ p256_sub(res_x, res_x, tmp0);
+
+ add $b_ptr,sp,#$S
+ add $r_ptr,sp,#$S
+ bl __ecp_nistz256_sub_morf @ p256_sub(S, S, res_x);
+
+ add $a_ptr,sp,#$M
+ add $b_ptr,sp,#$S
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S, S, M);
+
+ ldr $r_ptr,[sp,#32*5]
+ add $b_ptr,$r_ptr,#32
+ add $r_ptr,$r_ptr,#32
+ bl __ecp_nistz256_sub_from @ p256_sub(res_y, S, res_y);
+
+ add sp,sp,#32*5+16
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_point_double,.-ecp_nistz256_point_double
+___
+}
+{
+my ($res_x,$res_y,$res_z,
+    $in1_x,$in1_y,$in1_z,
+    $in2_x,$in2_y,$in2_z,
+    $H,$Hsqr,$R,$Rsqr,$Hcub,
+    $U1,$U2,$S1,$S2)=map(32*$_,(0..17));
+my ($Z1sqr, $Z2sqr) = ($Hsqr, $Rsqr);
+
+$code.=<<___;
+.globl ecp_nistz256_point_add
+.type ecp_nistz256_point_add,%function
+.align 5
+ecp_nistz256_point_add:
+ stmdb sp!,{r0-r12,lr}
+ sub sp,sp,#32*18
+
+ ldmia $b_ptr!,{r4-r11} @ copy in2
+ add r3,sp,#$in2_x
+ orr r12,r4,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $b_ptr!,{r4-r11}
+ orr r12,r12,r4
+ orr r12,r12,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $b_ptr,{r4-r11}
+ cmp r12,#0
+ movne r12,#-1
+ stmia r3,{r4-r11}
+ str r12,[sp,#32*18+8] @ !in2infty
+
+ ldmia $a_ptr!,{r4-r11} @ copy in1
+ add r3,sp,#$in1_x
+ orr r12,r4,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $a_ptr!,{r4-r11}
+ orr r12,r12,r4
+ orr r12,r12,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $a_ptr,{r4-r11}
+ cmp r12,#0
+ movne r12,#-1
+ stmia r3,{r4-r11}
+ str r12,[sp,#32*18+4] @ !in1infty
+
+ add $a_ptr,sp,#$in2_z
+ add $b_ptr,sp,#$in2_z
+ add $r_ptr,sp,#$Z2sqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Z2sqr, in2_z);
+
+ add $a_ptr,sp,#$in1_z
+ add $b_ptr,sp,#$in1_z
+ add $r_ptr,sp,#$Z1sqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Z1sqr, in1_z);
+
+ add $a_ptr,sp,#$in2_z
+ add $b_ptr,sp,#$Z2sqr
+ add $r_ptr,sp,#$S1
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S1, Z2sqr, in2_z);
+
+ add $a_ptr,sp,#$in1_z
+ add $b_ptr,sp,#$Z1sqr
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, Z1sqr, in1_z);
+
+ add $a_ptr,sp,#$in1_y
+ add $b_ptr,sp,#$S1
+ add $r_ptr,sp,#$S1
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S1, S1, in1_y);
+
+ add $a_ptr,sp,#$in2_y
+ add $b_ptr,sp,#$S2
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, S2, in2_y);
+
+ add $b_ptr,sp,#$S1
+ add $r_ptr,sp,#$R
+ bl __ecp_nistz256_sub_from @ p256_sub(R, S2, S1);
+
+ orr $a0,$a0,$a1 @ see if result is zero
+ orr $a2,$a2,$a3
+ orr $a4,$a4,$a5
+ orr $a0,$a0,$a2
+ orr $a4,$a4,$a6
+ orr $a0,$a0,$a7
+ add $a_ptr,sp,#$in1_x
+ orr $a0,$a0,$a4
+ add $b_ptr,sp,#$Z2sqr
+ str $a0,[sp,#32*18+12]
+
+ add $r_ptr,sp,#$U1
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(U1, in1_x, Z2sqr);
+
+ add $a_ptr,sp,#$in2_x
+ add $b_ptr,sp,#$Z1sqr
+ add $r_ptr,sp,#$U2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(U2, in2_x, Z1sqr);
+
+ add $b_ptr,sp,#$U1
+ add $r_ptr,sp,#$H
+ bl __ecp_nistz256_sub_from @ p256_sub(H, U2, U1);
+
+ orr $a0,$a0,$a1 @ see if result is zero
+ orr $a2,$a2,$a3
+ orr $a4,$a4,$a5
+ orr $a0,$a0,$a2
+ orr $a4,$a4,$a6
+ orr $a0,$a0,$a7
+ orrs $a0,$a0,$a4
+
+ bne .Ladd_proceed @ is_equal(U1,U2)?
+
+ ldr $t0,[sp,#32*18+4]
+ ldr $t1,[sp,#32*18+8]
+ ldr $t2,[sp,#32*18+12]
+ tst $t0,$t1
+ beq .Ladd_proceed @ (in1infty || in2infty)?
+ tst $t2,$t2
+ beq .Ladd_proceed @ is_equal(S1,S2)?
+
+ ldr $r_ptr,[sp,#32*18]
+ eor r4,r4,r4
+ eor r5,r5,r5
+ eor r6,r6,r6
+ eor r7,r7,r7
+ eor r8,r8,r8
+ eor r9,r9,r9
+ eor r10,r10,r10
+ eor r11,r11,r11
+ stmia $r_ptr!,{r4-r11}
+ stmia $r_ptr!,{r4-r11}
+ stmia $r_ptr!,{r4-r11}
+ b .Ladd_done
+
+.align 4
+.Ladd_proceed:
+ add $a_ptr,sp,#$R
+ add $b_ptr,sp,#$R
+ add $r_ptr,sp,#$Rsqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Rsqr, R);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$in1_z
+ add $r_ptr,sp,#$res_z
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(res_z, H, in1_z);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$H
+ add $r_ptr,sp,#$Hsqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Hsqr, H);
+
+ add $a_ptr,sp,#$in2_z
+ add $b_ptr,sp,#$res_z
+ add $r_ptr,sp,#$res_z
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(res_z, res_z, in2_z);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$Hsqr
+ add $r_ptr,sp,#$Hcub
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(Hcub, Hsqr, H);
+
+ add $a_ptr,sp,#$Hsqr
+ add $b_ptr,sp,#$U1
+ add $r_ptr,sp,#$U2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(U2, U1, Hsqr);
+
+ add $r_ptr,sp,#$Hsqr
+ bl __ecp_nistz256_mul_by_2 @ p256_mul_by_2(Hsqr, U2);
+
+ add $b_ptr,sp,#$Rsqr
+ add $r_ptr,sp,#$res_x
+ bl __ecp_nistz256_sub_morf @ p256_sub(res_x, Rsqr, Hsqr);
+
+ add $b_ptr,sp,#$Hcub
+ bl __ecp_nistz256_sub_from @  p256_sub(res_x, res_x, Hcub);
+
+ add $b_ptr,sp,#$U2
+ add $r_ptr,sp,#$res_y
+ bl __ecp_nistz256_sub_morf @ p256_sub(res_y, U2, res_x);
+
+ add $a_ptr,sp,#$Hcub
+ add $b_ptr,sp,#$S1
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, S1, Hcub);
+
+ add $a_ptr,sp,#$R
+ add $b_ptr,sp,#$res_y
+ add $r_ptr,sp,#$res_y
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(res_y, res_y, R);
+
+ add $b_ptr,sp,#$S2
+ bl __ecp_nistz256_sub_from @ p256_sub(res_y, res_y, S2);
+
+ ldr r11,[sp,#32*18+4] @ !in1infty
+ ldr r12,[sp,#32*18+8] @ !in2infty
+ add r1,sp,#$res_x
+ add r2,sp,#$in2_x
+ and r10,r11,r12
+ mvn r11,r11
+ add r3,sp,#$in1_x
+ and r11,r11,r12
+ mvn r12,r12
+ ldr $r_ptr,[sp,#32*18]
+___
+for($i=0;$i<96;$i+=8) { # conditional moves
+$code.=<<___;
+ ldmia r1!,{r4-r5} @ res_x
+ ldmia r2!,{r6-r7} @ in2_x
+ ldmia r3!,{r8-r9} @ in1_x
+ and r4,r4,r10
+ and r5,r5,r10
+ and r6,r6,r11
+ and r7,r7,r11
+ and r8,r8,r12
+ and r9,r9,r12
+ orr r4,r4,r6
+ orr r5,r5,r7
+ orr r4,r4,r8
+ orr r5,r5,r9
+ stmia $r_ptr!,{r4-r5}
+___
+}
+$code.=<<___;
+.Ladd_done:
+ add sp,sp,#32*18+16
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_point_add,.-ecp_nistz256_point_add
+___
+}
+{
+my ($res_x,$res_y,$res_z,
+    $in1_x,$in1_y,$in1_z,
+    $in2_x,$in2_y,
+    $U2,$S2,$H,$R,$Hsqr,$Hcub,$Rsqr)=map(32*$_,(0..14));
+my $Z1sqr = $S2;
+my @ONE_mont=(1,0,0,-1,-1,-1,-2,0);
+
+$code.=<<___;
+.globl ecp_nistz256_point_add_affine
+.type ecp_nistz256_point_add_affine,%function
+.align 5
+ecp_nistz256_point_add_affine:
+ stmdb sp!,{r0-r12,lr}
+ sub sp,sp,#32*15
+
+ ldmia $a_ptr!,{r4-r11} @ copy in1
+ add r3,sp,#$in1_x
+ orr r12,r4,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $a_ptr!,{r4-r11}
+ orr r12,r12,r4
+ orr r12,r12,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $a_ptr,{r4-r11}
+ cmp r12,#0
+ movne r12,#-1
+ stmia r3,{r4-r11}
+ str r12,[sp,#32*15+4] @ !in1infty
+
+ ldmia $b_ptr!,{r4-r11} @ copy in2
+ add r3,sp,#$in2_x
+ orr r12,r4,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ ldmia $b_ptr!,{r4-r11}
+ orr r12,r12,r4
+ orr r12,r12,r5
+ orr r12,r12,r6
+ orr r12,r12,r7
+ orr r12,r12,r8
+ orr r12,r12,r9
+ orr r12,r12,r10
+ orr r12,r12,r11
+ stmia r3!,{r4-r11}
+ cmp r12,#0
+ movne r12,#-1
+ str r12,[sp,#32*15+8] @ !in2infty
+
+ add $a_ptr,sp,#$in1_z
+ add $b_ptr,sp,#$in1_z
+ add $r_ptr,sp,#$Z1sqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Z1sqr, in1_z);
+
+ add $a_ptr,sp,#$Z1sqr
+ add $b_ptr,sp,#$in2_x
+ add $r_ptr,sp,#$U2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(U2, Z1sqr, in2_x);
+
+ add $b_ptr,sp,#$in1_x
+ add $r_ptr,sp,#$H
+ bl __ecp_nistz256_sub_from @ p256_sub(H, U2, in1_x);
+
+ add $a_ptr,sp,#$Z1sqr
+ add $b_ptr,sp,#$in1_z
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, Z1sqr, in1_z);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$in1_z
+ add $r_ptr,sp,#$res_z
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(res_z, H, in1_z);
+
+ add $a_ptr,sp,#$in2_y
+ add $b_ptr,sp,#$S2
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, S2, in2_y);
+
+ add $b_ptr,sp,#$in1_y
+ add $r_ptr,sp,#$R
+ bl __ecp_nistz256_sub_from @ p256_sub(R, S2, in1_y);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$H
+ add $r_ptr,sp,#$Hsqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Hsqr, H);
+
+ add $a_ptr,sp,#$R
+ add $b_ptr,sp,#$R
+ add $r_ptr,sp,#$Rsqr
+ bl _ecp_nistz256_mul_mont @ p256_sqr_mont(Rsqr, R);
+
+ add $a_ptr,sp,#$H
+ add $b_ptr,sp,#$Hsqr
+ add $r_ptr,sp,#$Hcub
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(Hcub, Hsqr, H);
+
+ add $a_ptr,sp,#$Hsqr
+ add $b_ptr,sp,#$in1_x
+ add $r_ptr,sp,#$U2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(U2, in1_x, Hsqr);
+
+ add $r_ptr,sp,#$Hsqr
+ bl __ecp_nistz256_mul_by_2 @ p256_mul_by_2(Hsqr, U2);
+
+ add $b_ptr,sp,#$Rsqr
+ add $r_ptr,sp,#$res_x
+ bl __ecp_nistz256_sub_morf @ p256_sub(res_x, Rsqr, Hsqr);
+
+ add $b_ptr,sp,#$Hcub
+ bl __ecp_nistz256_sub_from @  p256_sub(res_x, res_x, Hcub);
+
+ add $b_ptr,sp,#$U2
+ add $r_ptr,sp,#$res_y
+ bl __ecp_nistz256_sub_morf @ p256_sub(res_y, U2, res_x);
+
+ add $a_ptr,sp,#$Hcub
+ add $b_ptr,sp,#$in1_y
+ add $r_ptr,sp,#$S2
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(S2, in1_y, Hcub);
+
+ add $a_ptr,sp,#$R
+ add $b_ptr,sp,#$res_y
+ add $r_ptr,sp,#$res_y
+ bl _ecp_nistz256_mul_mont @ p256_mul_mont(res_y, res_y, R);
+
+ add $b_ptr,sp,#$S2
+ bl __ecp_nistz256_sub_from @ p256_sub(res_y, res_y, S2);
+
+ ldr r11,[sp,#32*15+4] @ !in1infty
+ ldr r12,[sp,#32*15+8] @ !in2infty
+ add r1,sp,#$res_x
+ add r2,sp,#$in2_x
+ and r10,r11,r12
+ mvn r11,r11
+ add r3,sp,#$in1_x
+ and r11,r11,r12
+ mvn r12,r12
+ ldr $r_ptr,[sp,#32*15]
+___
+for($i=0;$i<64;$i+=8) { # conditional moves
+$code.=<<___;
+ ldmia r1!,{r4-r5} @ res_x
+ ldmia r2!,{r6-r7} @ in2_x
+ ldmia r3!,{r8-r9} @ in1_x
+ and r4,r4,r10
+ and r5,r5,r10
+ and r6,r6,r11
+ and r7,r7,r11
+ and r8,r8,r12
+ and r9,r9,r12
+ orr r4,r4,r6
+ orr r5,r5,r7
+ orr r4,r4,r8
+ orr r5,r5,r9
+ stmia $r_ptr!,{r4-r5}
+___
+}
+for(;$i<96;$i+=8) {
+my $j=($i-64)/4;
+$code.=<<___;
+ ldmia r1!,{r4-r5} @ res_z
+ @ ldmia r2!,{r6-r7} @ in2_z
+ ldmia r3!,{r8-r9} @ in1_z
+ and r4,r4,r10
+ and r5,r5,r10
+ and r6,r11,#@ONE_mont[$j]
+ and r7,r11,#@ONE_mont[$j+1]
+ and r8,r8,r12
+ and r9,r9,r12
+ orr r4,r4,r6
+ orr r5,r5,r7
+ orr r4,r4,r8
+ orr r5,r5,r9
+ stmia $r_ptr!,{r4-r5}
+___
+}
+$code.=<<___;
+ add sp,sp,#32*15+16
+#if __ARM_ARCH__>=5
+ ldmia sp!,{r4-r12,pc}
+#else
+ ldmia sp!,{r4-r12,lr}
+ tst lr,#1
+ moveq pc,lr @ be binary compatible with V4, yet
+ bx lr @ interoperable with Thumb ISA:-)
+#endif
+.size ecp_nistz256_point_add_affine,.-ecp_nistz256_point_add_affine
+___
+} }}}
+
+foreach (split("\n",$code)) {
+ s/\`([^\`]*)\`/eval $1/geo;
+
+ s/\bq([0-9]+)#(lo|hi)/sprintf "d%d",2*$1+($2 eq "hi")/geo or
+ s/\bret\b/bx lr/go or
+ s/\bbx\s+lr\b/.word\t0xe12fff1e/go; # make it possible to compile with -march=armv4
+
+ print $_,"\n";
+}
+close STDOUT; # enforce flush

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>>>>> thanks! Was away last week and so didn't have a chance to try fixing this.

>>>>>
>>>>> I'll patch that it and run the tests against it.
>>>> I've run out of time tracking this down for today, but I got to the
>>>> point where setting the Jacobian coordinates:
>>>>
>>>> X: C4EB2994C09557B400FF6A543CFB257F945E86FE3DF1D32A8128F32927666A8F
>>>> Y: 3D5283F8F10F559AE5310005005F321B28D2D699F3E01F179F91AC6660013328
>>>> Z: F97FD7E6757991A2C7E0C2488FF3C54E58030BCACF3FB95954FD3EF211C24631
>>>>
>>>> and multiplying that point by
>>>> 2269520AFB46450398DE95AE59DDBDC1D42B8B7030F81BCFEF12D819C1D678DD
>>>> results in the affine point:
>>>>
>>>> x: 4BBC2813F69EF6A4D3E69E2832E9A9E97FF59F8C136DCDBD9509BC685FF337FD
>>>> y: BDCB623715CE2D983CFC2776C6EED4375454BE2C88932D43856906C1DC7A0BD7
>>>>
>>>> However, I believe that the result should be:
>>>>
>>>> x: C2910AA0216D12DE30C5573CCFC4116546E3091DC1E9EC8604F634185CE40863
>>>> y: C9071E13D688C305CE179C6168DD9066657BC6CDC1639A44B68DF7F1E0A40EDF
>>> I do get the latter...
>> ... in master, and I get the former in 1.0.2. Looking into it...
>
> Attached patch produces correct result in 1.0.2. Looking further for
> explanation...
Oops! Wrong patch! Correct one attached. If you feel like testing the
wrong one, go ahead, but there are some later non-essential adjustments.



diff --git a/crypto/ec/ecp_nistz256.c b/crypto/ec/ecp_nistz256.c
index bf3fcc6..33b07ce 100644
--- a/crypto/ec/ecp_nistz256.c
+++ b/crypto/ec/ecp_nistz256.c
@@ -637,7 +637,7 @@ static void ecp_nistz256_windowed_mul(const EC_GROUP * group,
         ecp_nistz256_point_double(&row[10 - 1], &row[ 5 - 1]);
         ecp_nistz256_point_add   (&row[15 - 1], &row[14 - 1], &row[1 - 1]);
         ecp_nistz256_point_add   (&row[11 - 1], &row[10 - 1], &row[1 - 1]);
-        ecp_nistz256_point_add   (&row[16 - 1], &row[15 - 1], &row[1 - 1]);
+        ecp_nistz256_point_double(&row[16 - 1], &row[ 8 - 1]);
     }
 
     index = 255;
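
For context, the table being filled here holds row[i - 1] = i*P for
i = 1..16, so the two right-hand sides describe the same point: 16*P is
either 15*P + P or 2*(8*P). A tiny sanity check of that index arithmetic,
reusing the affine add() helper and the point (x0, y0) from the sketch
earlier in the thread (so it only runs together with that sketch):

P0 = (x0, y0)                  # any valid P-256 point
row = [None] * 16
row[1 - 1] = P0
for i in range(2, 17):         # establish row[i - 1] = i*P0
    if i % 2 == 0:
        row[i - 1] = add(row[i // 2 - 1], row[i // 2 - 1])
    else:
        row[i - 1] = add(row[i - 1 - 1], P0)

assert row[16 - 1] == add(row[15 - 1], row[1 - 1])   # 16*P == 15*P + P
assert row[16 - 1] == add(row[8 - 1], row[8 - 1])    # 16*P == 2*(8*P)

Whether the assembly produces the same 256-bit representation along both
paths is the separate question discussed later in the thread.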

Re: [openssl.org #3607] nistz256 is broken.

Bodo Moeller
In reply to this post by Billy Brumley
2. When will RT2574 be integrated to protect our ECC keys in the
inevitable presence of software defects like this?
http://rt.openssl.org/Ticket/Display.html?id=2574&user=guest&pass=guest

Reportedly, Cryptography Research (i.e., Rambus) claims to hold broad patents on techniques like this (and they might not be the only ones). I'm not going to look for specific patents and can't assess the validity of that rumor; the only thing I know for certain is that Cryptography Research and Rambus are famous, above all else, for starting patent lawsuits (see, e.g., http://www.sec.gov/Archives/edgar/data/1403161/000119312507270394/d10k.htm).

Unfortunately, this means that the OpenSSL project may not be willing to incorporate coordinate-blinding techniques at this time.

Bodo



Re: [openssl.org #3607] nistz256 is broken.

Adam Langley-3
In reply to this post by Rich Salz via RT
On Wed, Dec 3, 2014 at 10:12 AM, Andy Polyakov via RT <[hidden email]> wrote:

> Oops! Wrong patch! Correct one attached. If you feel like testing the
> wrong one, go ahead, but there are some later non-essential adjustments.
>
> diff --git a/crypto/ec/ecp_nistz256.c b/crypto/ec/ecp_nistz256.c
> index bf3fcc6..33b07ce 100644
> --- a/crypto/ec/ecp_nistz256.c
> +++ b/crypto/ec/ecp_nistz256.c
> @@ -637,7 +637,7 @@ static void ecp_nistz256_windowed_mul(const EC_GROUP * group,
>          ecp_nistz256_point_double(&row[10 - 1], &row[ 5 - 1]);
>          ecp_nistz256_point_add   (&row[15 - 1], &row[14 - 1], &row[1 - 1]);
>          ecp_nistz256_point_add   (&row[11 - 1], &row[10 - 1], &row[1 - 1]);
> -        ecp_nistz256_point_add   (&row[16 - 1], &row[15 - 1], &row[1 - 1]);
> +        ecp_nistz256_point_double(&row[16 - 1], &row[ 8 - 1]);
>      }
>
>      index = 255;

I can believe that this fixes the issue, but it's just masking it, no?
I'll see if I can track it down more precisely tomorrow.


Cheers

AGL
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>> Oops! Wrong patch! Correct one attached. If you feel like testing the
>> wrong one, go ahead, but there are some later non-essential adjustments.
>>
>> diff --git a/crypto/ec/ecp_nistz256.c b/crypto/ec/ecp_nistz256.c
>> index bf3fcc6..33b07ce 100644
>> --- a/crypto/ec/ecp_nistz256.c
>> +++ b/crypto/ec/ecp_nistz256.c
>> @@ -637,7 +637,7 @@ static void ecp_nistz256_windowed_mul(const EC_GROUP * group,
>>          ecp_nistz256_point_double(&row[10 - 1], &row[ 5 - 1]);
>>          ecp_nistz256_point_add   (&row[15 - 1], &row[14 - 1], &row[1 - 1]);
>>          ecp_nistz256_point_add   (&row[11 - 1], &row[10 - 1], &row[1 - 1]);
>> -        ecp_nistz256_point_add   (&row[16 - 1], &row[15 - 1], &row[1 - 1]);
>> +        ecp_nistz256_point_double(&row[16 - 1], &row[ 8 - 1]);
>>      }
>>
>>      index = 255;
>
> I can believe that this fixes the issue, but it's just masking it, no?

It's not a coincidence that I didn't say "fixes the issue" or "solves
the problem", but "produces correct result". BTW, it seems to be
unrelated to the original problem with carry handling in the assembly.


______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]

Re: [openssl.org #3607] nistz256 is broken.

Rich Salz via RT
>>> Oops! Wrong patch! Correct one attached. If you feel like testing the
>>> wrong one, go ahead, but there are some later non-essential adjustments.
>>>
>>> diff --git a/crypto/ec/ecp_nistz256.c b/crypto/ec/ecp_nistz256.c
>>> index bf3fcc6..33b07ce 100644
>>> --- a/crypto/ec/ecp_nistz256.c
>>> +++ b/crypto/ec/ecp_nistz256.c
>>> @@ -637,7 +637,7 @@ static void ecp_nistz256_windowed_mul(const EC_GROUP * group,
>>>          ecp_nistz256_point_double(&row[10 - 1], &row[ 5 - 1]);
>>>          ecp_nistz256_point_add   (&row[15 - 1], &row[14 - 1], &row[1 - 1]);
>>>          ecp_nistz256_point_add   (&row[11 - 1], &row[10 - 1], &row[1 - 1]);
>>> -        ecp_nistz256_point_add   (&row[16 - 1], &row[15 - 1], &row[1 - 1]);
>>> +        ecp_nistz256_point_double(&row[16 - 1], &row[ 8 - 1]);
>>>      }
>>>
>>>      index = 255;
>> I can believe that this fixes the issue, but it's just masking it, no?

The underlying problem is that the assembly routines return "partially
reduced" results. "Partially reduced" means that a routine may return
result + modulus, as long as that still fits in 256 bits. The rationale
is that ((x+m)*y)%m = (x*y+m*y)%m = x*y%m + m*y%m, and the last term is
0. While this does hold for a series of multiplications, I failed to
recognize that there are corner cases in the non-multiplication
operations. I'm preparing an update...
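
A minimal sketch of that argument in plain Python (my own illustration, not
OpenSSL code): a "lazy" multiplier that only promises a result congruent to
the product mod p and smaller than 2^256 still composes correctly with
further multiplications:

p = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF

def lazy_mul(a, b):
    # stand-in for the assembly multiplier: correct modulo p, but it may
    # return r + p instead of r whenever that still fits in 256 bits
    r = (a * b) % p
    return r + p if r + p < 2**256 else r

x, y, z = 0x1234, 0x5678, 0x9ABC
t = lazy_mul(x, y)                             # either (x*y) % p or that value + p
assert t % p == (x * y) % p                    # same residue either way
assert lazy_mul(t, z) % p == (x * y * z) % p   # chains of multiplications stay correct

The corner cases are when such an r + p representative is later fed to one
of the non-multiplication helpers, which is what the pending update is
about.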


______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [hidden email]
Automated List Manager                           [hidden email]