0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Tomas Svensson
Hi,

I have some questions/observations:

1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
(ovec,ovec+num,8);" and since ovec and ovec+num will overlap sometimes,
this function relies on undocumented/undefined behavior of memcpy?

If I use the Intel C++ Compiler 9.0 for EM64T with /O2 or higher, it
replaces the above memcpy with the optimized function __intel_fast_memcpy,
which breaks DES in OpenSSL.

It seems like memcpy should be replaced with memmove here?
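
For illustration, here is a minimal sketch of the pattern in question (the
function name is made up; this is not the actual cfb_enc.c code). The two
regions overlap whenever num < 8, which memcpy is not guaranteed to handle
but memmove is:

#include <string.h>

/* Illustrative sketch only: the 8 bytes starting at offset num must be
 * moved to the front of the buffer.  Source and destination overlap
 * whenever num < 8, so the copy has to be overlap-safe. */
static void shift_ovec(unsigned char *ovec, int num)
{
    /* memcpy(ovec, ovec + num, 8);    undefined when the regions overlap */
    memmove(ovec, ovec + num, 8);   /* well-defined for overlapping regions */
}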

2) On Win64 platforms, a socket is now a 64-bit pointer but SSL_set_fd and
BIO_set_fd accept only 32-bit integers. Can't this cause problems if the
pointer points higher than the lowest 4 gig address space?

3) Is AES really a lot faster on Win64/x64 compared to the i586 asm
version or am I doing something wrong? I get:

Microsoft VC6 + MASM:

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      55420.65k    57962.40k    58913.94k    58436.84k    58513.27k
aes-192 cbc      46281.98k    49366.53k    48806.45k    47196.61k    47934.90k
aes-256 cbc      43472.74k    43560.21k    43515.02k    43427.73k    43515.02k

Microsoft AMD64 compiler:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      91404.06k    98055.03k    98951.44k    99420.54k   100342.20k
aes-192 cbc      81049.35k    86236.01k    88023.17k    87290.41k    88370.90k
aes-256 cbc      72801.98k    76695.84k    77243.17k    77798.36k    78088.04k

Intel C++ Compiler 9.0 for EM64T, OpenSSL compiled with SSE3
optimizations:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc     106319.49k   116085.22k   119304.65k   120634.31k   120655.99k
aes-192 cbc      95869.81k   103739.16k   105783.20k   106319.49k   104238.68k
aes-256 cbc      87301.76k    93976.84k    95651.17k    95664.81k    96075.68k

All tests using OpenSSL 0.9.8 on a P4 3.6 GHz with EM64T and 2MB L2-cache.

-Tomas

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
> 1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
> (ovec,ovec+num,8);" and since ovec and ovec+num will overlap sometimes,
> this function relies on undocumented/undefined behavior of memcpy?

The original reason for choosing memcpy was a) it's commonly inlined
by compilers [most notably gcc], but not memmove, b) I fail to imagine
how it can fail with overlapping regions if num is guaranteed to be
positive, even if the routine is super-optimized, inlined, whatever. Can
you?

> If I use the Intel C++ Compiler 9.0 for EM64T with /O2 or higher, it
> replaces the above memcpy with the optimized function __intel_fast_memcpy,
> which breaks DES in OpenSSL.

For reference, note that the Linux version avoids __intel_fast_memcpy with
-Dmemcpy=__builtin_memcpy, because libirc.a caused grief when linked
into a shared library. __intel_fast_memcpy feels like overkill in the
OpenSSL context, and inlined code [movs or an unrolled loop] should do a
better job. Can you try to compile with -Dmemcpy=__builtin_memcpy
-Dmemset=__builtin_memset?

> It seems like memcpy should be replaced with memmove here?

Does it mean that you've tried to replace it with memmove and can
confirm that DES works if compiled with ICC /O2 or higher? It actually
smells more like compiler bug than memcpy vs. memmove issue...

> 2) On Win64 platforms, a socket is now a 64-bit pointer but SSL_set_fd and
> BIO_set_fd accept only 32-bit integers. Can't this cause problems if the
> pointer points higher than the lowest 4 gig address space?

There is an explicit comment about this in e_os.h. The socket value
constitutes an offset into a table [a per-process kernel-side table] of
limited size, less than 2GB, and therefore it's safe to use an int to
accommodate the value.

> 3) Is AES really a lot faster on Win64/x64 compared to the i586 asm
> version or am I doing something wrong?

1. Who says that AES is assembler empowered on Win32? It's not, not yet:-)
2. What's wrong with 64-bit code being faster than 32-bit code? 64-bit
code has access to a wider register bank, 8 extra registers, and in the AES
case there is no need to spill any registers to the stack in every loop
spin. Fewer instructions, no wasted bus bandwidth -> better performance.

A.

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Jack Lloyd
On Thu, Jul 07, 2005 at 07:42:37PM +0200, Andy Polyakov wrote:
> >1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
> >(ovec,ovec+num,8);" and since ovec and ovec+num will overlap sometimes,
> >this function relies on undocumented/undefined behavior of memcpy?
>
> The original reason for choosing memcpy was a) it's commonly inlined
> by compilers [most notably gcc], but not memmove, b) I fail to imagine
> how it can fail with overlapping regions if num is guaranteed to be
> positive, even if the routine is super-optimized, inlined, whatever. Can
> you?

This doesn't make any sense - if memcpy can handle overlapping regions
without any slowdown, then wouldn't it make sense to implement
memmove as a #define (or inline call to) memcpy? Either memcpy does
not handle overlaps while memmove does, or memcpy and memmove work at
the same speed, because the ability to handle overlapping memory
regions is the only difference between the two. The only other
alternative is that memcpy and memmove do the exact same thing, but
memmove is slower. That seems quite unlikely.

-Jack

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Brian Hurt-4


On Thu, 7 Jul 2005, Jack Lloyd wrote:

> On Thu, Jul 07, 2005 at 07:42:37PM +0200, Andy Polyakov wrote:
>>> 1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
>>> (ovec,ovec+num,8);" and since ovec and ovec+num will overlap sometimes,
>>> this function relies on undocumented/undefined behavior of memcpy?
>>
>> The original reason for choosing memcpy was a) it's commonly inlined
>> by compilers [most notably gcc], but not memmove, b) I fail to imagine
>> how it can fail with overlapping regions if num is guaranteed to be
>> positive, even if the routine is super-optimized, inlined, whatever. Can
>> you?
>
> This doesn't make any sense - if memcpy can handle overlapping regions
> without any slowdown, then wouldn't it make sense to implement
> memmove as a #define (or inline call to) memcpy? Either memcpy does
> not handle overlaps while memmove does, or memcpy and memmove work at
> the same speed, because the ability to handle overlapping memory
> regions is the only difference between the two. The only other
> alternative is that memcpy and memmove do the exact same thing, but
> memmove is slower. That seems quite unlikely.

If the regions overlap, the behavior is undefined according to the
standard, which means that the compiler or the produced code can do something
odd, segfault, or whistle Dixie and explode, and still be conformant.

And it can fail with overlapping arguments.  Consider the "normal"
implementation (which is in no way guaranteed) of memcpy:

void * memcpy(void * dst, const void * src, size_t len) {
     char * d = (char *) dst;
     const char * s = (const char *) src;

     while (len-- > 0) {
         *d++ = *s++;
     }

     return dst;
}

Now, call the above code the following way:
     {
         char mem[] = "Hello, world!";
         memcpy(mem+1, mem, sizeof(mem)-1);
     }

Instead of doing what was intended, moving the string up one place, the
code has different behavior.

One other comment I will make: I can write a faster memcpy for aligned or
alignable medium-to-large copies (which is where, generally, performance
is important) than "rep movsb" on the x86, which, for those who don't
know x86 assembly, is the hardware equivalent of my implementation above.

Brian

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
In reply to this post by Jack Lloyd
>>>1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
>>>(ovec,ovec+num,8);" and since ovec and ovec+num will overlap sometimes,
>>>this function relies on undocumented/undefined behavior of memcpy?
>>
>>The original reason for choosing memcpy was a) it's commonly inlined
>>by compilers [most notably gcc], but not memmove, b) I fail to imagine
>>how it can fail with overlapping regions if num is guaranteed to be
>>positive, even if the routine is super-optimized, inlined, whatever. Can
>>you?
>
>
> This doesn't make any sense - if memcpy can handle overlapping regions
> without any slowdown, then wouldn't it make sense to implement
> memmove as a #define (or inline call to) memcpy?

Do note "[when] num [as in memcpy(ovec,ovec+num,8)] is guaranteed to be
positive." Question was can you imagine memcpy implementation that would
fail to handle overlapping regions when source address is *larger* than
destination? Question was *not* if you can imagine memcpy implementation
that would fail to handle arbitrary overlapping regions.

> Either memcpy does
> not handle overlaps while memmove does, or memcpy and memmove work at
> the same speed, because the ability to handle overlapping memory
> regions is the only difference between the two.

See a). Inlining is believed/expected to be faster than a call to a
function. In the case at hand, for example, an inlined memcpy can be compiled
as two 32-bit loads followed by two stores on IA-32. A memmove call, on the
other hand, involves register spills, pushing arguments and the return
address onto the stack, a branch, pulling the arguments back, comparing
them, etc., etc., etc. A.
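
The two-loads/two-stores inlining mentioned above amounts to roughly the
following in C (illustrative sketch only; it assumes 4-byte alignment, and
the real inlining of course happens at the instruction level):

#include <stdint.h>

/* What an inlined memcpy(dst, src, 8) boils down to on IA-32: two 32-bit
 * loads followed by two 32-bit stores.  Both loads complete before either
 * store, so the overlapping case with src above dst still comes out right. */
static void copy8(uint32_t *dst, const uint32_t *src)
{
    uint32_t lo = src[0];
    uint32_t hi = src[1];
    dst[0] = lo;
    dst[1] = hi;
}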

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Brian Hurt-4


On Thu, 7 Jul 2005, Andy Polyakov wrote:

>>>> 1) In openssl-0.9.8/crypto/des/cfb_enc.c line 170 there is "memcpy
>>>> (ovec,ovec+num,8);" and since ovec and ovec+num will overlap
>>>> sometimes,
>>>> this function relies on undocumented/undefined behavior of memcpy?
>>>
>>> The original reason for choosing memcpy was a) it's commonly inlined
>>> by compilers [most notably gcc], but not memmove, b) I fail to imagine
>>> how it can fail with overlapping regions if num is guaranteed to be
>>> positive, even if the routine is super-optimized, inlined, whatever. Can
>>> you?
>>
>>
>> This doesn't make any sense - if memcpy can handle overlapping regions
>> without any slowdown, then wouldn't it make sense to implement
>> memmove as a #define (or inline call to) memcpy?
>
> Do note "[when] num [as in memcpy(ovec,ovec+num,8)] is guaranteed to be
> positive." Question was can you imagine memcpy implementation that would fail
> to handle overlapping regions when source address is *larger* than
> destination? Question was *not* if you can imagine memcpy implementation that
> would fail to handle arbitrary overlapping regions.

Yes.

void * memcpy(void * dst, const void * src, size_t len) {
     char * d = ((char *) dst) + len;
     const char * s = ((const char *) src) + len;

     while (len-- > 0) {
         *--d = *--s;
     }
     return dst;
}

This is a fully conformant implementation of memcpy.  Not sure why you'd
implement it this way, but it's legal.

>
>> Either memcpy does
>> not handle overlaps while memmove does, or memcpy and memmove work at
>> the same speed, because the ability to handle overlapping memory
>> regions is the only difference between the two.
>
> See a). Inlining is believed/expected to be faster than a call to a function.

This is not always true.  If the inlining causes the code size to bloat
and no longer fit into cache, for example.  Also, shared copies of the
function can share branch prediction information.

It is true in this case, I should mention, at least on the x86.

But this is an example of programming by coincidence.

Brian

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Tomas Svensson
In reply to this post by Andy Polyakov
>> If I use the Intel C++ Compiler 9.0 for EM64T with /O2 or higher, it
>> replaces the above memcpy with the optimized function
>> __intel_fast_memcpy,
>> which breaks DES in OpenSSL.
>
> For reference, note that the Linux version avoids __intel_fast_memcpy with
> -Dmemcpy=__builtin_memcpy, because libirc.a caused grief when linked
> into a shared library. __intel_fast_memcpy feels like overkill in the
> OpenSSL context, and inlined code [movs or an unrolled loop] should do a
> better job. Can you try to compile with -Dmemcpy=__builtin_memcpy
> -Dmemset=__builtin_memset?

I get

unresolved external symbol __builtin_memcpy
unresolved external symbol __builtin_memset

but /Oi- disables inlining of all intrinsic functions, and it works (as
far as destest is concerned) if I compile cfb_enc.c with that.

>> It seems like memcpy should be replaced with memmove here?
>
> Does it mean that you've tried to replace it with memmove and can
> confirm that DES works if compiled with ICC /O2 or higher? It actually
> smells more like compiler bug than memcpy vs. memmove issue...
>

Yes, DES works with memmove and breaks with memcpy for /O2 and higher.

>> 3) Is AES really a lot faster on Win64/x64 compared to the i586 asm
>> version or am I doing something wrong?
>
> 1. Who says that AES is assembler empowered on Win32? It's not, not yet:-)
> 2. What's wrong with 64-bit code being faster than 32-bit code? 64-bit
> code has access to a wider register bank, 8 extra registers, and in the AES
> case there is no need to spill any registers to the stack in every loop
> spin. Fewer instructions, no wasted bus bandwidth -> better performance.

I assumed that the asm file was used since it was included... Some
OpenSSL algorithms are slower on x64, like RSA. SHA1 and RC4 seem to be
faster, but the speed command breaks for all but the first test:

Doing rc4 209715200 times on 16 size blocks: 209715200 rc4's in 20.84s
Doing rc4 -14680064 times on 64 size blocks: 0 rc4's in 0.00s
etc...

-Tomas

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Marc Bevand
Tomas Svensson wrote:
|
| Some OpenSSL algorithms are slower on x64, like RSA. SHA1 and RC4 seem to
| be faster [...]

Cases where an OpenSSL algorithm is slower on AMD64 than on i386 are
almost always due to a substandard AMD64 implementation. For example,
some algorithms are written using hand-coded assembly when compiled
for i386, and a generic (slower) C language implementation when
compiled for AMD64.

Fortunately, such cases are becoming less and less common, as more and
more algorithms are being optimized for AMD64 (for instance I contributed
an assembly version of MD5 for AMD64, new in 0.9.8). But I think that
more work is needed, as about 50% of the algorithms are still faster on
i386.

You said that RSA is slower on AMD64? This is unmistakably wrong, or
at least, something must have gone terribly wrong during your
benchmark. RSA heavily benefits from 64-bit arithmetic, and consequently
AMD64 has a clear advantage over i386 (it is usually up to 3x faster).

--
Marc Bevand                              http://epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

RE: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

David C. Partridge
In reply to this post by Brian Hurt-4
> Instead of doing what was intended, moving the string up one place, the
> code has different behaviour.

Yes, it will fill the buffer with "H", which is what I would expect to
happen: not immediately obvious, but sensible (any 370 assembler guys will
recognise MVC as doing this).

If you want to copy from one memory location to another even if they overlap
*and* preserve the contents, then you should use memmove and pay the
overhead of the temporary buffer it probably allocates.

Dave


Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Richard Levitte - VMS Whacker
David C. Partridge writes:

> If you want to copy from one mem location to another even if they overlap
> *and* preserve the contents, then you should use memmove and pay the
> overhead of the temporary buffer it probably allocates.

Just a note: memmove doesn't need any temporary storage.  It just has to
decide whether to copy the bytes front-to-back or back-to-front,
depending on the relative position of the two pointers.
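
A minimal sketch of that idea (not any particular libc's implementation):

#include <stddef.h>

/* Sketch of a memmove that needs no temporary buffer: it picks the copy
 * direction based on the relative position of the two pointers. */
static void *my_memmove(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (d < s) {
        while (len-- > 0)           /* destination below source:   */
            *d++ = *s++;            /* copy front-to-back          */
    } else {
        d += len;
        s += len;
        while (len-- > 0)           /* destination above source:   */
            *--d = *--s;            /* copy back-to-front          */
    }
    return dst;
}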

Cheers,
Richard

 -----
Please consider sponsoring my work on free software.
See http://www.free.lp.se/sponsoring.html for details.

--
Richard Levitte                         [hidden email]
                                       http://richard.levitte.org/ 

"When I became a man I put away childish things, including
the fear of childishness and the desire to be very grown up."
                                               -- C.S. Lewis

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
In reply to this post by Marc Bevand
> | Some OpenSSL algorithms are slower on x64, like RSA. SHA1 and RC4 seem to
> | be faster [...]
>
> You said that RSA is slower on AMD64?

This is not what he said. The claim was that RSA is slower on an *EM64T*
core under Win64, and that is not surprising. First of all, note that the
64-bit C implementation uses four 32x32-bit multiplications to produce a
128-bit result, which is exactly as much as the 32-bit assembler spends to
produce the same number of result bits. Now, benchmarks show that even if
you deploy the 64x64-bit multiplication instruction [this is what happens
on Linux thanks to GCC inline assembler support, which is *not* an option
on Win64!], it doesn't get any faster on an *Intel* core, presumably
because 64x64-bit multiplication is implemented with four 32x32-bit
multiplications in microcode anyway.
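
To make the multiplication count concrete, here is a rough sketch (not
OpenSSL's BN code) of how a 128-bit product is assembled from four
32x32->64-bit partial products:

#include <stdint.h>

/* Illustrative only: a 128-bit product of two 64-bit operands built from
 * four 32x32->64-bit multiplications, i.e. the same multiplier work a
 * 32-bit build spends to produce the same number of result bits. */
static void mul_64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;         /* the four 32x32-bit partial products */
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    uint64_t mid = p01 + (p00 >> 32) + (uint32_t)p10;   /* cannot overflow */

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (mid >> 32) + (p10 >> 32);
}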

> This is unmistakably wrong, or
> at least, something must have gone terribly wrong during your
> benchmark. RSA heavily benefits from 64-bit arithmetic, and consequently
> AMD64 has a clear advantage over i386 (it is usually up to
> 3x faster).

On AMD64 cores, yes, but not on EM64T, at least not on currently
available cores. A.

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
In reply to this post by Brian Hurt-4
>> Do note "[when] num [as in memcpy(ovec,ovec+num,8)] is guaranteed to
>> be positive." Question was can you imagine memcpy implementation that
>> would fail to handle overlapping regions when source address is
>> *larger* than destination? Question was *not* if you can imagine
>> memcpy implementation that would fail to handle arbitrary overlapping
>> regions.
>
>
> Yes.
>
> void * memcpy(void * dst, const void * src, size_t len) {
>     char * d = ((char *) dst) + len;
>     const char * s = ((const char *) src) + len;
>
>     while (len-- > 0) {
>         *--d = *--s;
>     }
>     return dst;
> }
>
> This is a fully conformant implementation of memcpy.  Not sure why you'd
> implement it this way, but it's legal.

The question is not how I would implement it, but why ICC would. What would
be the performance reason to implement it in a way similar to this... But
whatever, memmove it is...

>> See a). Inlining is believed/expected to be faster than a call to a
>> function.
>
> This is not always true.  If the inlining causes the code size to bloat
> and no longer fit into cache, for example.  Also, shared copies of the
> function can share branch prediction information.

Well, if one uses a designated intrinsic function, the compiler has a chance
to evaluate the trade-off and "decide" when it's appropriate to inline
or to call a function, while in the case of memmove you're bound to call...

> It is true in this case, I should mention, at least on the x86.

"This case?" Two 32-bit loads + two 32-bit stores [both gcc and icc 8
manage to inline it like this] vs. a call to a function to copy 8 bytes?
But as said, whatever, memmove for cfb_enc it is... A.

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Brian Hurt-4


On Fri, 8 Jul 2005, Andy Polyakov wrote:

>>> Do note "[when] num [as in memcpy(ovec,ovec+num,8)] is guaranteed to be
>>> positive." Question was can you imagine memcpy implementation that would
>>> fail to handle overlapping regions when source address is *larger* than
>>> destination? Question was *not* if you can imagine memcpy implementation
>>> that would fail to handle arbitrary overlapping regions.
>>
>>
>> Yes.
>>
>> void * memcpy(void * dst, const void * src, size_t len) {
>>     char * d = ((char *) dst) + len;
>>     const char * s = ((const char *) src) + len;
>>
>>     while (len-- > 0) {
>>         *--d = *--s;
>>     }
>>     return dst;
>> }
>>
>> This is a fully conformant implementation of memcpy.  Not sure why you'd
>> implement it this way, but it's legal.
>
> The question is not how I would implement it, but why ICC would. What would
> be the performance reason to implement it in a way similar to this... But
> whatever, memmove it is...

Hmm.  I am sort of jumping into the middle of things here.  The question
is how portable the code needs to be.  If it's using inline assembly, and
as such isn't very portable by its nature (gcc and icc and that's about
it), then this isn't that big of a problem: make sure it works correctly
on both compilers, and you're fine.  If it's generic C code that needs to
work on a large number of platforms, then this might be a problem.

My argument isn't so much to be against non-portable code; it's to be
aware when you're writing non-portable code.

Although, if it's to be x86-specific, I'd be tempted to replace it with:
     ((unsigned int *) ptr)[0] = ((unsigned int *) (ptr+off))[0];
     ((unsigned int *) ptr)[1] = ((unsigned int *) (ptr+off))[1];

Note that this bit of code contains both gcc-isms (void * arith) and
x86-specific things (not checking the alignment of 32-bit loads and
stores, assuming ints are 32 bits).  But it has the advantage of being
what you really meant to do.

>
>>> See a). Inlining is believed/expected to be faster than a call to a
>>> function.
>>
>> This is not always true.  If the inlining causes the code size to bloat
>> and no longer fit into cache, for example.  Also, shared copies of the
>> function can share branch prediction information.
>
> Well, if one uses a designated intrinsic function, the compiler has a chance
> to evaluate the trade-off and "decide" when it's appropriate to inline
> or to call a function, while in the case of memmove you're bound to call...

Actually, the argument in favor of inlining memcpy also applies to
memmove, especially when the length is short enough that the data can
first be copied to a temporary place (registers) and then copied back
to its destination.  I don't know why gcc doesn't do this.
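
For the 8-byte case in question, that transformation would look roughly like
this (sketch only; the function name is made up and the length is fixed at 8):

#include <string.h>
#include <stdint.h>

/* Sketch of an inlined fixed-size memmove: pull the 8 bytes into a
 * temporary (a register, once the optimizer is done with it) and write
 * them back out.  The load completes before the store, so overlapping
 * source and destination are harmless. */
static void move8(unsigned char *dst, const unsigned char *src)
{
    uint64_t tmp;
    memcpy(&tmp, src, 8);
    memcpy(dst, &tmp, 8);
}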

Brian

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
> Hmm.  I am sort of jumping into the middle of things here.  The question
> is how portable the code needs to be.  If it's using inline assembly,

It was *not* using inline assembly, but a function which is *known* to be
implemented as an intrinsic by *a number* of compilers. So it was portable,
yet left compilers the choice to inline a few instructions instead.
Note "was," as it's memmove now:-)

> Although, if it's to be x86-specific, I'd be tempted to replace it with:
>     ((unsigned int *) ptr)[0] = ((unsigned int *) (ptr+off))[0];
>     ((unsigned int *) ptr)[1] = ((unsigned int *) (ptr+off))[1];

Once again, both gcc and icc 8 were doing this *all by themselves*. No
need for gcc-isms or x86-specifics: just show them a perfectly portable
memcpy(a,b,8) and they'll inline it as above, which is why memcpy was
[consciously] chosen. Again, note "were," as it's memmove now and it's a
call:-) But enough of the sad part:-) Cheers. A.

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Andy Polyakov
In reply to this post by Tomas Svensson
>>For reference, note that the Linux version avoids __intel_fast_memcpy with
>>-Dmemcpy=__builtin_memcpy, because libirc.a caused grief when linked
>>into a shared library. __intel_fast_memcpy feels like overkill in the
>>OpenSSL context, and inlined code [movs or an unrolled loop] should do a
>>better job. Can you try to compile with -Dmemcpy=__builtin_memcpy
>>-Dmemset=__builtin_memset?
>
> I get
>
> unresolved external symbol __builtin_memcpy
> unresolved external symbol __builtin_memset

Ouch! Something to look into when 9 is available for Linux...

>>>It seems like memcpy should be replaced with memmove here?
>>
>>Does it mean that you've tried to replace it with memmove and can
>>confirm that DES works if compiled with ICC /O2 or higher? It actually
>>smells more like compiler bug than memcpy vs. memmove issue...
>
> Yes, DES works with memmove and breaks with memcpy for /O2 and higher.

Fix committed.

>>>3) Is AES really a lot faster on Win64/x64 compared to the i586 asm
>>>version or am I doing something wrong?
>>
>>1. Who says that AES is assembler empowered on Win32? It's not, not yet:-)
>
> I assumed that the asm file was used since it was included...

There are a couple of other assembler code paths which are not engaged on
VC-WIN32, most notably the SSE2 code in BN and SHA512 and the P4-specific
RC4. A.

Re: 0.9.8: cfb_enc.c bug? and AES speed on Win64/x64

Tomas Svensson
>>>For reference, note that the Linux version avoids __intel_fast_memcpy with
>>>-Dmemcpy=__builtin_memcpy, because libirc.a caused grief when linked
>>>into a shared library. __intel_fast_memcpy feels like overkill in the
>>>OpenSSL context, and inlined code [movs or an unrolled loop] should do a
>>>better job. Can you try to compile with -Dmemcpy=__builtin_memcpy
>>>-Dmemset=__builtin_memset?
>>
>> I get
>>
>> unresolved external symbol __builtin_memcpy
>> unresolved external symbol __builtin_memset
>
> Ouch! Something to look into when 9 is available for Linux...

It works with the Linux versions of 9.0 for IA32 and IA64, at least.

-Tomas