ssl3_get_record:decryption failed on some machines


Fernando Gutierrez Mendez
Hi,

I wrote an application that uses OpenSSL (1.1.1) and for the past couple of weeks I have been unable to solve a very strange issue.

I use non-blocking I/O with an SSL BIO, so a call to BIO_read eventually returns -1. When this happens I call BIO_should_retry to test whether this is due to an error or to the underlying non-blocking transport.

This code works correctly, but after transferring between 1Mb and 5Mb (it varies every time) BIO_should_retry returns false and SSL_get_error returns SSL_ERROR_SSL. The error is "139964546914112:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../ssl/record/ssl3_record.c:677"

The very strange thing is that this code has been working correctly and transferring several Gb without any issues on a couple of machines. I started getting the error in a virtual machine from a popular VPS provider that uses AMD CPUs and in one physical machine using an older Intel CPU.

Works correctly on:
Intel Celeron CPU J1800
Virtual Machine on Intel Core i7-5820K
Virtual Machine on Intel Xeon E5-2697

Fails every time on:
Intel Pentium G2020T
Virtual Machine on AMD EPYC 7601

All machines are using "OpenSSL 1.1.1  11 Sep 2018" on "Ubuntu 18.04.3 LTS"

Things I tried:

- Playing with OPENSSL_ia32cap to force disable PCLMULQDQ/AES-NI, this makes no difference
- Running my app under valgrind. It does not report any errors, but the problem does not reproduce under it either
- Instead of using the distro provided build I downloaded and compiled from https://github.com/openssl/openssl/archive/OpenSSL_1_1_1d.tar.gz, it also made no difference

I understand this could be a bug in my code, but I can't figure out why it only happens on some machines.

Any help is appreciated.

Thanks


Re: ssl3_get_record:decryption failed on some machines

Viktor Dukhovni
> On Nov 18, 2019, at 1:44 PM, Fernando Gutierrez Mendez <[hidden email]> wrote:
>
> I use non-blocking I/O with an SSL BIO, so a call to BIO_read eventually returns -1. When this happens I call BIO_should_retry to test whether this is due to an error or to the underlying non-blocking transport.

Is the writer side also non-blocking?  Is it your own code?

> This code works correctly, but after transferring between 1Mb and 5Mb (it varies every time) BIO_should_retry returns false and SSL_get_error returns SSL_ERROR_SSL. The error is "139964546914112:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../ssl/record/ssl3_record.c:677"

One way to get a decryption integrity failure is for a non-blocking
writer to mishandle partial writes. If, on an incomplete write, the
writer resends the whole buffer rather than only what it failed to
send last time, the TCP stream ends up stuttering ciphertext, and
the reader sees data integrity errors.
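
To make that failure mode concrete, here is a minimal sketch in plain C. It does not use OpenSSL or real sockets: `mock_sock`, `mock_send`, `send_all_correct` and `send_all_buggy` are hypothetical names invented for illustration, and the mock simply accepts a short write on its first call. It only shows how restarting from the beginning of the buffer duplicates bytes on the wire, which a TLS reader would then reject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a non-blocking socket. The first call to
 * mock_send() accepts at most first_limit bytes (a partial write);
 * later calls accept everything they are given. */
struct mock_sock {
    unsigned char wire[64]; /* bytes that actually went "on the wire" */
    size_t used;
    size_t first_limit;
    int calls;
};

/* Returns the number of bytes accepted, possibly fewer than len. */
static size_t mock_send(struct mock_sock *s, const unsigned char *p, size_t len)
{
    size_t n = len;
    if (s->calls++ == 0 && n > s->first_limit)
        n = s->first_limit;
    assert(s->used + n <= sizeof(s->wire));
    memcpy(s->wire + s->used, p, n);
    s->used += n;
    return n;
}

/* Correct: advance past the accepted bytes and send only the remainder. */
static void send_all_correct(struct mock_sock *s, const unsigned char *p, size_t len)
{
    while (len > 0) {
        size_t n = mock_send(s, p, len);
        p += n;
        len -= n;
    }
}

/* Buggy: after a short write, resends the whole buffer from the start,
 * so the first n bytes appear twice in the stream. */
static void send_all_buggy(struct mock_sock *s, const unsigned char *p, size_t len)
{
    size_t n = mock_send(s, p, len);
    if (n < len)
        (void)mock_send(s, p, len); /* BUG: should be p + n, len - n */
}
```

With an 8-byte message "ABCDEFGH" and a first write that accepts only 3 bytes, the correct loop leaves "ABCDEFGH" on the wire, while the buggy one leaves "ABCABCDEFGH".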

This can be seen by looking for unexpected runs of repeated
ciphertext in a PCAP capture of the data.
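
A crude way to look for such runs, assuming the TCP payload bytes of one direction have already been extracted from the capture into a single buffer (the extraction itself is not shown here). `find_repeat` is a hypothetical helper, not part of any capture library. It does a brute-force search for a window of bytes that occurs again later in the buffer. With real ciphertext a window of 16 bytes or more should essentially never repeat by chance, so any hit is worth a closer look.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scans buf for a win-byte window that appears again, without overlap,
 * later in the buffer. Windows are tried in order of their starting
 * offset; for the first one that repeats, returns the offset of its
 * earliest later copy. Returns -1 if no window repeats.
 * Brute force, O(len^2) comparisons: fine for a short capture only. */
static long find_repeat(const unsigned char *buf, size_t len, size_t win)
{
    assert(win > 0);
    for (size_t i = 0; i + 2 * win <= len; i++) {
        for (size_t j = i + win; j + win <= len; j++) {
            if (memcmp(buf + i, buf + j, win) == 0)
                return (long)j;
        }
    }
    return -1;
}
```

For example, on the 11-byte buffer "ABCABCDEFGH" with a 3-byte window the function reports the second "ABC" at offset 3, and on "ABCDEFGH" it reports no repeat.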

Whether the data sent to a particular reader ever ends up
blocked at the TCP layer for a given writer can depend on
various network-layer issues making some machines more
prone to problems than others.

--
        Viktor.


Re: ssl3_get_record:decryption failed on some machines

Fernando Gutierrez Mendez
The writer is my own code, but I can also reproduce the problem when the server is nginx and the client is my app.

In my code I do not use OpenSSL socket BIOs; instead I do reads/writes through a BIO pair:

  pairBase = BIO_new(BIO_s_bio());
  pairInt  = BIO_new(BIO_s_bio());

  [...]

  BIO_make_bio_pair(pairBase, pairInt);

  [...]

  sslBIO = BIO_new_ssl(ssl_ctx, 1 /* Client */);

  [...]

  BIO_push(sslBIO, pairInt);

After each BIO_read/BIO_write on sslBIO I transfer any available data between the network and pairBase.

I think I'm handling partial writes correctly:

  SSL_CTX_set_mode(ssl_ctx, SSL_MODE_AUTO_RETRY | SSL_MODE_ENABLE_PARTIAL_WRITE | SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER);

  [...]

  ret = BIO_write(sslBIO, buf, (int)length);

  if (ret <= 0 && !BIO_should_retry(sslBIO))
  {
      /* Handle error */
      return;
  }

  if (ret > 0)
  {
      /* Partial write: skip the bytes that were accepted */
      buf = ((uint8_t *)buf) + (size_t)ret;
      length -= (size_t)ret;
  }

but again the problem reproduces even if the writer is nginx.

Thanks


RE: ssl3_get_record:decryption failed on some machines

Fernando Gutierrez Mendez
Sorry to bring this up again, but I really don't know how to fix it. I already
rewrote my code to use SSL_read/SSL_write instead of an SSL filter BIO, but I
still get the same error.

I can reproduce it when the sender is nginx, socat openssl-listen, or openssl
s_server. Both the server and the client are running on the same machine.

The SSL object is not using a socket BIO; instead I use a BIO pair. I may be
using the BIO pair incorrectly, but I haven't found any complete examples of
how to use them.

It works perfectly if I use a debug build of OpenSSL.

Thanks


Re: ssl3_get_record:decryption failed on some machines

Matt Caswell-2


On 25/11/2019 08:45, [hidden email] wrote:

> Sorry to bring this up again, but I really don't know how to fix it. I already
> rewrote my code to use SSL_read/SSL_write instead of an SSL filter BIO, but I
> still get the same error.
>
> I can reproduce it when the sender is nginx, socat openssl-listen, or openssl
> s_server. Both the server and the client are running on the same machine.
>
> The SSL object is not using a socket BIO; instead I use a BIO pair. I may be
> using the BIO pair incorrectly, but I haven't found any complete examples of
> how to use them.
>
> It works perfectly if I use a debug build of OpenSSL.

This suggests it *could* be a compiler bug. You might want to experiment
with different optimization levels to see if that makes a difference.

Matt

