RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
This is what we do:

Create a non-blocking TCP socket.
Call SSL_new(), SSL_set_fd(), SSL_connect()
Thereafter call SSL_read().
Renegotiations are handled by OpenSSL.

We have only seen the error very occasionally; the vast majority of calls return SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE.
The following is traced when we get the failure. Originally we then forced a disconnect, but now we handle it as discussed:

LDAP: session=7F38AE82F840 socket read error SSL Error Syscall: LastError 11

Note that because our code automatically reconnects and reissues failed and aborted operations, customers would not normally see the error. In this case, though, we disconnected while a modify operation was running in the LDAP server. After reconnecting, the modify was reissued and failed because the original modify had already deleted the attribute. Investigating that failure is what exposed the error. The customer was running a soak test, and thousands (maybe millions) of reads worked fine before the failing one.
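
For readers less familiar with the value in the trace: the thread subject identifies errno 11 as EAGAIN, which on a non-blocking descriptor normally means "nothing to read yet, try again later" rather than a broken connection. The following is a minimal POSIX sketch of that behaviour, using a plain pipe instead of a TLS connection. The helper names (is_transient_errno, try_read, errno_from_empty_nonblocking_read) are made up for illustration and are not part of OpenSSL or of the code described above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if errno value e only means "try again later". */
int is_transient_errno(int e)
{
    return e == EAGAIN || e == EWOULDBLOCK || e == EINTR;
}

/* Hypothetical helper: one read() attempt.
 * Stores errno in *saved_errno on failure, 0 otherwise. */
ssize_t try_read(int fd, void *buf, size_t len, int *saved_errno)
{
    assert(saved_errno != NULL);
    ssize_t n = read(fd, buf, len);
    *saved_errno = (n < 0) ? errno : 0;
    return n;
}

/* Reads from an empty non-blocking pipe and returns the resulting errno
 * (EAGAIN is expected), or -1 if the pipe could not be set up. */
int errno_from_empty_nonblocking_read(void)
{
    int p[2];
    int e = -1;
    char byte;

    if (pipe(p) != 0)
        return -1;
    if (fcntl(p[0], F_SETFL, O_NONBLOCK) == 0)
        (void)try_read(p[0], &byte, 1, &e);
    close(p[0]);
    close(p[1]);
    return e;
}
```

A caller that treats a transient errno as fatal would drop a connection that is probably still healthy, which matches the disconnect-and-reissue sequence described above.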

Regards,
John.

-----Original Message-----
From: openssl-users <[hidden email]> On Behalf Of Erik Forsberg
Sent: 01 May 2019 03:05
To: [hidden email]
Subject: Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

CAUTION: This email originated from outside of Synchronoss.


>-- Original Message --
>
>
>>-- Original Message --
>>
>>On Tue, Apr 30, 2019 at 03:23:23PM -0700, Erik Forsberg wrote:
>>
>>> >Is the handshake explicit, or does the application just call
>>> >SSL_read(), with OpenSSL performing the handshake as needed?
>>>
>>> I occasionally (somewhat rarely) see the issue mentioned by the OP.
>>> Ignoring the error, or mapping it to do what WANT_READ/WANT_WRITE
>>> does, effectively hides the issue and the connection works fine. I
>>> predominantly run on Solaris 11. In my case, I open the socket
>>> myself, set non-blocking mode and associate it with an SSL object using SSL_set_fd().
>>> The initial handshake is done explicitly.
>>
>>Recoverable errors should not result in SSL_ERROR_SYSCALL.  This feels
>>like a bug.  I'd like to hear from Matt Caswell on this one.
>>Perhaps someone should open an issue on Github...
>>
>I will scan my logs later this evening and see if this is still an issue.
>Last time I remember seeing it was quite a long time ago (a couple of
>years)
>
>

OK, I checked my logs (3+ years' worth of them) and I have not seen this error in that timeframe.
So it must have been a much older OpenSSL version that I used way back when I remember adding this workaround.
It doesn't seem to be needed for me anymore.



Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

Viktor Dukhovni
> On May 1, 2019, at 9:47 AM, John Unsworth <[hidden email]> wrote:
>
> Create a non-blocking TCP socket.
> Call SSL_new(), SSL_set_fd(), SSL_connect()
> Thereafter call SSL_read().
> Renegotiations are handled by OpenSSL.

Can you be more specific about "Create a non-blocking TCP socket"?
Does that fully set up the TCP connection?

Also, with the non-blocking connection, how do you decide when to read?
Are you using poll()? select()? epoll()?  And did they report available
data?

In this particular case, was the client trying to read the initial
bytes of the server's reply having received nothing yet in response
to its query?  Or was it in the middle of reading a data stream?

When reading TLS records OpenSSL first reads the record layer
header which indicates the payload length, and then tries to read
that many bytes.  Does the server send the record layer header in
the same TCP segment as the payload, or in separate segments?

Do you know what protocol version was negotiated?  Are both ends
using OpenSSL?  What version on the server side? ...

Can you reproduce the problem after sufficiently many client
server interactions?  Can you get PCAP files of any problem
cases?

--
        Viktor.


RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
Create a non-blocking TCP socket
        socket() for a SOCK_STREAM socket.
        connect().
        SSL_new(), SSL_set_fd(), SSL_connect().

The application sends LDAP operations from many threads. We have just one thread that reads LDAP results. If an operation is outstanding then the result thread does (simplified):

SSL_read()
If > 0 return data.
Else if SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE then poll(); back to SSL_read() when data available.
Else return error and disconnect.
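
A rough sketch of the poll-and-retry shape described above, assuming POSIX poll(). Plain read() stands in for SSL_read() so that the example has no OpenSSL dependency; with TLS, the for_write flag would be set when SSL_get_error() reports SSL_ERROR_WANT_WRITE. The names wait_fd and read_retrying are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable (for_write == 0) or
 * writable (for_write != 0).
 * Returns 1 if ready, 0 on timeout, -1 on error. */
int wait_fd(int fd, int for_write, int timeout_ms)
{
    assert(fd >= 0);
    struct pollfd pfd = { .fd = fd, .events = for_write ? POLLOUT : POLLIN, .revents = 0 };

    for (;;) {
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc > 0)
            return 1;   /* ready, or an error/hangup the next I/O call will report */
        if (rc == 0)
            return 0;   /* timed out */
        if (errno != EINTR)
            return -1;
        /* Interrupted by a signal: retry (the timeout restarts in this sketch). */
    }
}

/* Stand-in for the result thread's loop, using read() instead of SSL_read().
 * Returns bytes read, 0 at end of file, or -1 with errno set. */
ssize_t read_retrying(int fd, void *buf, size_t len, int timeout_ms)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;              /* genuine error: caller may disconnect */

        int rc = wait_fd(fd, 0, timeout_ms);
        if (rc < 0)
            return -1;
        if (rc == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        /* Descriptor is readable: go back and read again. */
    }
}
```

The design point is that "would block" is handled by waiting and retrying, and only other failures reach the disconnect path.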

Don't know what protocol was negotiated or what the server does in terms of returned data. TCP/OpenSSL handle that.
Both ends OpenSSL 1.1.0h.
Problem seems to occur at random - only reproducible on customer site and after a long time running their soak test.

Regards,
John.


Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

Viktor Dukhovni
> On May 2, 2019, at 5:56 AM, John Unsworth <[hidden email]> wrote:
>
> Create a non-blocking TCP socket
> socket() for a SOCK_STREAM socket.
> connect().

Do you wait for the non-blocking connect to complete at this point?

> SSL_new(), SSL_set_fd(), SSL_connect().
>
> The application sends LDAP operations from many threads.

Are multiple threads writing to the same SSL connection?  How do
you ensure orderly use of the SSL connection?  Sharing connections
across threads without application level synchronization is not
supported in OpenSSL.

> We have just one thread that reads LDAP results.

How are further requests locked out when you're performing reads?
What is the granularity of the relevant locks?

> If an operation is outstanding then the result thread does (simplified):
>
> SSL_read()
> If > 0 return data.

At this point you'd be calling SSL_get_error(). Is there a lock that
prevents writes between SSL_read() and SSL_get_error()?

> Else if SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE then poll(); back to SSL_read() when data available.
> Else return error and disconnect.
>
> Don't know what protocol was negotiated or what the server does in terms of returned data. TCP/OpenSSL handle that.
> Both ends OpenSSL 1.1.0h.
> Problem seems to occur at random - only reproducible on customer site and after a long time running their soak test.

It would be helpful if the customer could gather more diagnostic information
from that "soak test".  With 1.1.0, presumably they negotiate TLS 1.2, since
TLS 1.3 is not available, while 1.2 is available on both ends.

IIRC OpenSSL will normally send the record layer header in the same segment
as the payload, so running into EAGAIN is unlikely after the initial 5 bytes
of record header, unless the TCP receive window was nearly full.

I gather the protocol is full-duplex and multiple outstanding requests can be
written before the corresponding replies are read?  Or is it strict half-duplex
request-response?

--
        Viktor.


RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
Please note that the connection has been made successfully and many operations and responses have occurred before the failure.

> Do you wait for the non-blocking connect to complete at this point?
We connect in blocking mode then switch to non-blocking.
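
The "connect in blocking mode, then switch" step could look roughly like this on a POSIX system. set_nonblocking is a hypothetical helper name; it would be called on the socket once the connection is established.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helper: add O_NONBLOCK to fd's status flags, keeping the others.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int set_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;                   /* already non-blocking */
    return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}
```

Reading the flags first and OR-ing in O_NONBLOCK avoids clearing any other status flags already set on the descriptor.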

> Are multiple threads writing to the same SSL connection?  How do you ensure orderly use of the SSL connection?  Sharing connections across threads without application level synchronization is not supported in OpenSSL.

We use mutexes to synchronize of course.

> How are further requests locked out when you're performing reads?
> What is the granularity of the relevant locks?

The mutex only allows one SSL call at a time.

> At this point you'd be calling SSL_get_error(), is there a lock that prevents writes between SSL_read() and SSL_get_error()?

The mutex does not protect SSL_get_error() calls.

> I gather the protocol is full-duplex and multiple outstanding requests can be written before the corresponding replies are read?  Or is it strict half-duplex request-response?

It is full duplex and there can be multiple operations in progress.

Regards,
John.
 

Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

Viktor Dukhovni
On Thu, May 02, 2019 at 04:10:31PM +0000, John Unsworth wrote:

> > Do you wait for the non-blocking connect to complete at this point?
> We connect in blocking mode then switch to non-blocking.

Thanks, that rules connection setup out of the picture.

> > Are multiple threads writing to the same SSL connection?  How do you
> > ensure orderly use of the SSL connection?  Sharing connections across
> > threads without application level synchronization is not supported in
> > OpenSSL.
>
> We use mutexes to synchronize of course.

OK.

> > How are further requests locked out when you're performing reads?
> > What is the granularity of the relevant locks?
>
> The mutex only allows one SSL call at a time.

So a serialized mix of reads and writes with possibly multiple
outstanding writes interleaved with the reader?  Right?

> > At this point you'd be calling SSL_get_error(), is there a lock that
> > prevents writes between SSL_read() and SSL_get_error()?
>
> The mutex does not protect SSL_get_error() calls.

I think that's an application bug.  The SSL_get_error() is using
the same SSL handle as the SSL_read(), which can be materially
altered by concurrent writes.  (Matt, if you're still reading this
thread, do you agree?)

I would not release the mutex until after the call to SSL_get_error().
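
To make the hazard concrete, here is a deliberately simplified toy model, not OpenSSL code. It assumes only what is said above: that the error lookup consults per-connection state which an intervening write can change. Every name in it (toy_session, toy_read, toy_write, toy_get_error, toy_locked_read, toy_locked_write, and the rwstate field) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy stand-ins, NOT OpenSSL: all names below are hypothetical. */
enum toy_state { TOY_NOTHING, TOY_READING };
enum toy_error { TOY_ERR_NONE, TOY_ERR_WANT_READ, TOY_ERR_SYSCALL };

struct toy_session {
    pthread_mutex_t lock;
    enum toy_state  rwstate;    /* shared state updated by every I/O call */
    int             rx_avail;   /* bytes waiting to be read */
};

void toy_init(struct toy_session *s)
{
    pthread_mutex_init(&s->lock, NULL);
    s->rwstate = TOY_NOTHING;
    s->rx_avail = 0;
}

/* Like a non-blocking read: returns -1 and records "want read" when empty. */
int toy_read(struct toy_session *s)
{
    if (s->rx_avail == 0) {
        s->rwstate = TOY_READING;
        return -1;
    }
    int n = s->rx_avail;
    s->rx_avail = 0;
    s->rwstate = TOY_NOTHING;
    return n;
}

/* A successful write resets the shared state. */
int toy_write(struct toy_session *s, int len)
{
    s->rwstate = TOY_NOTHING;
    return len;
}

/* The error lookup depends on the shared state, not only on ret. */
enum toy_error toy_get_error(const struct toy_session *s, int ret)
{
    if (ret > 0)
        return TOY_ERR_NONE;
    if (s->rwstate == TOY_READING)
        return TOY_ERR_WANT_READ;
    return TOY_ERR_SYSCALL;
}

/* Correct pattern: the I/O call and the error lookup are one critical section. */
int toy_locked_read(struct toy_session *s, enum toy_error *err)
{
    assert(s != NULL && err != NULL);
    pthread_mutex_lock(&s->lock);
    int ret = toy_read(s);
    *err = toy_get_error(s, ret);
    pthread_mutex_unlock(&s->lock);
    return ret;
}

int toy_locked_write(struct toy_session *s, int len, enum toy_error *err)
{
    assert(s != NULL && err != NULL);
    pthread_mutex_lock(&s->lock);
    int ret = toy_write(s, len);
    *err = toy_get_error(s, ret);
    pthread_mutex_unlock(&s->lock);
    return ret;
}
```

In this model, calling toy_read(), then toy_write() (as another thread might), then toy_get_error() reports TOY_ERR_SYSCALL for what was really a harmless "want read". toy_locked_read() cannot be interleaved that way and reports TOY_ERR_WANT_READ.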

> > I gather the protocol is full-duplex and multiple outstanding requests
> > can be written before the corresponding replies are read?  Or is it strict
> > half-duplex request-response?
>
> It is full duplex and there can be multiple operations in progress.

Please retest with both the SSL_read() and SSL_get_error() running
under a single lock.  And similarly on the write side.

Do keep in mind that a server may close the socket for further
writes after replying to a number of requests (ideally sending an
SSL close notify, i.e. SSL_shutdown() as it does so), at which point
the SSL connection is half-closed.  With a full-duplex protocol,
you may also need to handle reading outstanding replies on a
connection that no longer supports writes.

--
        Viktor.

RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
>> I think that's an application bug.  
Thanks.
I thought you might say that. I will change the code and get the customer to retest.

Regards,
John


Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

Matt Caswell-2
In reply to this post by Viktor Dukhovni


On 02/05/2019 18:23, Viktor Dukhovni wrote:

>>> At this point you'd be calling SSL_get_error(), is there a lock that
>>> prevents writes between SSL_read() and SSL_get_error()?
>>
>> The mutex does not protect SSL_get_error() calls.
>
> I think that's an application bug.  The SSL_get_error() is using
> the same SSL handle as the SSL_read(), which can be materially
> altered by concurrent writes.  (Matt, if you're still reading this
> thread, do you agree?)
>
> I would not release the mutex until after the call to SSL_get_error().

An SSL object should not be used in multiple threads at the same time no matter
what the API call. This applies to SSL_get_error() as well. If you are doing
that then that could most definitely cause the behaviour you are seeing.

Matt


Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
Testing changed code.

Regards
John



Re: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

Viktor Dukhovni
On Fri, May 03, 2019 at 09:34:14AM +0000, John Unsworth wrote:

> Testing changed code.

For the record, though I think you realise this, *both* the SSL_read()
or SSL_write() and the following SSL_get_error() need to be protected
as a unit by the *same* instance of the locked mutex.  It would not
be enough to lock these separately.

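    acquire_lock();                 /* the per-connection mutex */
        if (reading)
            ret = SSL_read(ssl, ...);
        else
            ret = SSL_write(ssl, ...);
        /* Query the error before any other thread can touch ssl. */
        err = (ret <= 0) ? SSL_get_error(ssl, ret) : SSL_ERROR_NONE;
    release_lock();

    /* Handle EOF and errors using ret and err */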

--
        Viktor.

RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
Thanks, the mutex is tied to the SSL session and used for all calls (now!).

The good news is that moving SSL_get_error() into the same mutex unit as SSL_read() has solved the problem.

Thank you for all your help and advice.

Regards,
John.


RE: SSL_read() returning SSL_ERROR_SYSCALL with errno 11 EAGAIN

John Unsworth-3
Just a thought. Would it not be possible for the SSL session to create a mutex and lock it where required?
Error details could be stored in thread-local storage to obviate the need to call SSL_get_error() within the mutex block.
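
As an application-level approximation of this idea (not an OpenSSL feature), a wrapper could take the lock, perform the I/O call, capture the error code while still holding the lock, and park it in thread-local storage for the caller to read afterwards. Everything in the sketch below is hypothetical: the callback types, guarded_conn, guarded_call, guarded_last_error, the DEMO_ERR_* codes, and the stub functions that stand in for real TLS calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical callback types. In a real program these would be thin
 * wrappers around SSL_read()/SSL_write() and SSL_get_error(). */
typedef int (*io_op)(void *conn, void *buf, int len);
typedef int (*err_op)(void *conn, int ret);

enum { DEMO_ERR_NONE = 0, DEMO_ERR_RETRY = 1 };   /* made-up error codes */

struct guarded_conn {
    pthread_mutex_t lock;       /* one mutex per connection */
    void           *conn;       /* opaque connection handle */
    err_op          get_error;
};

/* Error from the calling thread's most recent guarded_call(). */
static _Thread_local int last_error = DEMO_ERR_NONE;

void guarded_init(struct guarded_conn *g, void *conn, err_op get_error)
{
    assert(g != NULL && get_error != NULL);
    pthread_mutex_init(&g->lock, NULL);
    g->conn = conn;
    g->get_error = get_error;
}

/* Runs op and the error lookup as one critical section. */
int guarded_call(struct guarded_conn *g, io_op op, void *buf, int len)
{
    assert(g != NULL && op != NULL);
    pthread_mutex_lock(&g->lock);
    int ret = op(g->conn, buf, len);
    last_error = (ret <= 0) ? g->get_error(g->conn, ret) : DEMO_ERR_NONE;
    pthread_mutex_unlock(&g->lock);
    return ret;
}

/* Safe to call without the lock: the value is private to this thread. */
int guarded_last_error(void)
{
    return last_error;
}

/* Stubs so the sketch can be exercised without a TLS library. */
int stub_read_empty(void *conn, void *buf, int len)
{
    (void)conn; (void)buf; (void)len;
    return -1;                  /* pretend the read would block */
}

int stub_read_one(void *conn, void *buf, int len)
{
    (void)conn;
    if (len < 1)
        return 0;
    ((char *)buf)[0] = 'x';
    return 1;
}

int stub_get_error(void *conn, int ret)
{
    (void)conn;
    return (ret < 0) ? DEMO_ERR_RETRY : DEMO_ERR_NONE;
}
```

The error lookup still happens inside the lock; thread-local storage only moves where the result is kept, so the caller no longer needs the connection object to interpret the failure.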

Regards,
John
