SSL_ERROR_WANT_TIME: Pause SSL_connect to fetch intermediate certificates

Alex Rousskov
Hello,

    TLDR: How can we pause the SSL_connect() progress and return to its
caller after the origin certificate is fetched/decrypted, but before
OpenSSL starts validating it (so that we can fetch the missing
intermediate certificates without threads or blocking I/O)?
ASYNC_pause_job() does not seem to be the right answer.


My team is working on Squid, an HTTP proxy. Squid does not have the
luxury of knowing what secure servers it will be talking to (on behalf
of its clients). Thus, it cannot simply preload _intermediate_
certificates for servers that do not supply them in their TLS handshakes
(e.g. https://incomplete-chain.badssl.com/ )

The standard solution for the missing intermediate certificate problem
is to fetch the missing intermediate certificates upon their discovery.
AIA (RFC 5280) is a mechanism that applications can use to learn about
the location of the missing intermediate certificates. Popular browsers
fetch what they miss: If you go to the above URL, your browser will
probably be happy!

As you know, OpenSSL provides the certificate verification callback that
can discover that the origin certificate chain is incomplete. An
application using threads or blocking I/O can probably "pause" its
verification callback execution, fetch the intermediate certificates,
and then complete validation before happily returning to the
SSL_connect() caller. Life is easy when you can use threads or block
thousands of concurrent transactions!

Unfortunately, Squid can use neither threads nor blocking I/O. Upon
discovery of the missing intermediate certificates, Squid has to return
to the SSL_connect() caller, fetch the certificates, and then resume
SSL_connect() with the fetched certificates available to OpenSSL.
However, the certificate verification callback does not have a "please
call me later" return code. We can only return "yes, valid" or "no,
invalid" results.


We found an ugly workaround for the above problem. Here is a simplified
description: Using OpenSSL BIO API, Squid parses the server certificate
during the TLS handshake, discovers the missing intermediate
certificates, pauses TLS I/O (this results in SSL_connect() returning
back to the caller with SSL_ERROR_WANT_READ), fetches the missing
certificates, supplies them to OpenSSL, and then resumes SSL_connect().
This hack has worked for many years.


Now comes TLS v1.3. Squid code can no longer parse the certificates
before OpenSSL because they are contained in the _encrypted_ part of the
server handshake. Thus, Squid cannot discover what is missing and fetch
that for OpenSSL before certificate validation starts.

What can we do? How can we pause the SSL_connect() progress after the
origin certificate is fetched but before it is validated?


I am aware of the ASYNC_pause_job() and related async APIs in OpenSSL.
If I interpret related documentation, discussions, and our test results
correctly, that API is not meant as the correct answer for our problem.
Today, abusing that API will probably work. Tomorrow,
internal/unpredictable OpenSSL changes might break our Squid
enhancements beyond repair as detailed below.

Somewhat counter-intuitively, the OpenSSL async API is meant for
activities that can work correctly _without_ becoming asynchronous (i.e.
without being paused to temporarily give way to other activities). Squid
cannot fetch the missing intermediate certificates without pausing TLS
negotiations with the server...

The async API was added to support custom OpenSSL engines, not
application callbacks. The API does not guarantee that an
ASYNC_pause_job() will actually pause processing and return to the
SSL_connect() caller! That will only happen if OpenSSL internal code
does not call ASYNC_block_pause(), effectively converting all subsequent
ASYNC_pause_job() calls into a no-op. That pause-nullification was added
to work around deadlocks, but it effectively places the API off limits
to user-level code that cannot control the timing of those
ASYNC_block_pause() calls.


Squid could kill the current TLS session (and its TCP connection), fetch
the missing certificates, and then retry from scratch, but that is a
very ugly (unreliable, wasteful, and noisy) solution.


Can you think of another trick?


Thank you,

Alex.
P.S. Squid does not support BoringSSL, but BoringSSL's
SSL_ERROR_WANT_CERTIFICATE_VERIFY result of the certificate validation
callback seemingly addresses our use case. I do not know whether OpenSSL
decision makers would be open to adding something along those lines and
decided to ask for existing solutions here before proposing adding
SSL_ERROR_WANT_TIME :-).

Re: SSL_ERROR_WANT_TIME: Pause SSL_connect to fetch intermediate certificates

Matt Caswell


On 18/08/2020 22:31, Alex Rousskov wrote:
> As you know, OpenSSL provides the certificate verification callback that
> can discover that the origin certificate chain is incomplete. An
> application using threads or blocking I/O can probably "pause" its
> verification callback execution, fetch the intermediate certificates,
> and then complete validation before happily returning to the
> SSL_connect() caller. Life is easy when you can use threads or block
> thousands of concurrent transactions!

I suspect this is the way most people do it.

> What can we do? How can we pause the SSL_connect() progress after the
> origin certificate is fetched but before it is validated?

We should really have a proper callback for this purpose. PRs welcome!
(Doesn't help you right now though).


> I am aware of the ASYNC_pause_job() and related async APIs in OpenSSL.
> If I interpret related documentation, discussions, and our test results
> correctly, that API is not meant as the correct answer for our problem.
> Today, abusing that API will probably work. Tomorrow,
> internal/unpredictable OpenSSL changes might break our Squid
> enhancements beyond repair as detailed below.
>
> Somewhat counter-intuitively, the OpenSSL async API is meant for
> activities that can work correctly _without_ becoming asynchronous (i.e.
> without being paused to temporarily give way to other activities). Squid
> cannot fetch the missing intermediate certificates without pausing TLS
> negotiations with the server...
>
> The async API was added to support custom OpenSSL engines, not
> application callbacks. The API does not guarantee that an
> ASYNC_pause_job() will actually pause processing and return to the
> SSL_connect() caller! That will only happen if OpenSSL internal code
> does not call ASYNC_block_pause(), effectively converting all subsequent
> ASYNC_pause_job() calls into a no-op. That pause-nullification was added
> to work around deadlocks, but it effectively places the API off limits
> to user-level code that cannot control the timing of those
> ASYNC_block_pause() calls.

The async API is meant for any scenario where user code may want to
perform async processing. Its design is NOT restricted to engines -
although that is certainly where it is normally used. However there are
no assumptions made anywhere that it will be exclusively restricted to
engines.

ASYNC_block_pause() is intended as a user level API, and a quick search
of the codebase reveals that the only place we use it internally is in
our tests - it does not appear in the library code. The intention is
that you should be able to rely on being inside a job in any callbacks,
if you've started the connection inside one.

"Somewhat counter-intuitively, the OpenSSL async API is meant for
activities that can work correctly _without_ becoming asynchronous (i.e.
without being paused to temporarily give way to other activities)"

I have no idea what you mean by this. The whole point of
ASYNC_pause_job() is to temporarily give way to other activities.

One issue you might encounter with the ASYNC APIs is that they are not
available on some less-common platforms. Basically anything without
setcontext/swapcontext support (e.g. IIRC Android may fall into
this category).

> Can you think of another trick?

One possibility that springs to mind (which is also an ugly hack) is to
defer the validation of the certificates. So, you have a verify callback
that always says "ok". But any further reads on the underlying BIO
always return with "retry" until such time as any intermediate
certificates have been fetched and the chain has been verified "for
real". The main problem I can see with this approach is there is no easy
way to send the right alert back to the server in the event of failure.


> P.S. Squid does not support BoringSSL, but BoringSSL's
> SSL_ERROR_WANT_CERTIFICATE_VERIFY result of the certificate validation
> callback seemingly addresses our use case. I do not know whether OpenSSL
> decision makers would be open to adding something along those lines and
> decided to ask for existing solutions here before proposing adding
> SSL_ERROR_WANT_TIME :-).

I'd definitely be open to adding it - although it wouldn't be backported
to a stable branch.

Matt




Re: SSL_ERROR_WANT_TIME: Pause SSL_connect to fetch intermediate certificates

Alex Rousskov
On 8/19/20 5:29 AM, Matt Caswell wrote:

> We should really have a proper callback for this purpose. PRs welcome!
> (Doesn't help you right now though).

Thank you for a prompt, thoughtful, and useful response. I believe that
we are on the same page as far as async API overall intentions, and I am
also very glad to hear that the OpenSSL team may welcome an addition of
a proper callback to address Squid's use case. I know Squid needs are
not unique. I do not yet know whether I can contribute (or facilitate
contribution of) such an enhancement, but this green light is meaningful
progress already!


>> "Somewhat counter-intuitively, the OpenSSL async API is meant for
>> activities that can work correctly _without_ becoming asynchronous (i.e.
>> without being paused to temporarily give way to other activities)"

> I have no idea what you mean by this.

Sorry for not detailing this accusation. I was worried that my email was
already too long/verbose... I will detail it below.


> The whole point of ASYNC_pause_job() is to temporarily give way to
> other activities.

Yes, giving way to other activities is the whole point of the async API
optimization. Unfortunately, being only an optimization, the API is not
enough when the callback MUST "give way to other activities". AFAICT,
OpenSSL does not guarantee that ASYNC_pause_job() in a callback will
actually "give way to other activities" because OpenSSL does not guarantee
that some engine or OpenSSL native code does not hold an
ASYNC_block_pause() "lock" while calling the callback.

The engine code that async API supports well may look like this[1]:

   myengine()
   {
       while (!something_happened())
           ASYNC_pause_job(); // application MAY get control here

       ... use something ...
   }

The callback code that async API does not support looks like this:

    mycallback()
    {
        if (!something_happened())
            ASYNC_pause_job(); // application MUST get control here

        assert(something_happened());
        ... use something ...
    }

Please note that replacing "if" with "while" in mycallback() would make
the compiled code identical with myengine() but would not solve the
problem: Instead of the failed assertion, the callback would get into an
infinite loop...

The callback _relies_ on the application making progress (e.g., fetching
the missing intermediate certificates or declaring a fetch failure
before resuming SSL_connect()). This callback cannot work correctly
without the application actually getting control. That is why the
pausing call comments are different: MAY vs. MUST.

Does this clarify what I meant? Do you agree that OpenSSL async API is
not suitable for callbacks that _require_ ASYNC_pause_job() to return
control to the application?

[1] This myengine() example is inspired by your explanation at
https://mta.openssl.org/pipermail/openssl-dev/2015-October/003031.html


> ASYNC_block_pause() ... does not appear in the library code

True, but it did appear there in the past, right? I am looking at commit
625146d as an example. Those calls were removed more than a year later
in 75e2c87 AFAICT, but I see no guarantee that they will not reappear.

And even if OpenSSL now has a policy against using ASYNC_block_pause()
internally, or a policy against holding an ASYNC_block_pause() "lock"
while calling any callback, some custom engine might do that at the
"wrong" moment for the above mycallback(), right?

If you think that fears about something inside OpenSSL/engines
preventing our callback from returning control to the application are
unfounded, then using async API may be the best long-term solution for
Squid. Short-term, it does not work "as is" because OpenSSL STACKSIZE
appears to be too small (leading to weird crashes that disappear if we
increase STACKSIZE from 32768 to 524288 bytes), but perhaps we can
somehow hack around that.


> One possibility that springs to mind (which is also an ugly hack) is to
> defer the validation of the certificates. So, you have a verify callback
> that always says "ok". But any further reads on the underlying BIO
> always return with "retry" until such time as any intermediate
> certificates have been fetched and the chain has been verified "for
> real". The main problem I can see with this approach is there is no easy
> way to send the right alert back to the server in the event of failure.

We were also concerned that X509_verify_cert() is not enough to fully
mimic the existing OpenSSL certificate validation procedure because the
internal OpenSSL ssl_verify_cert_chain() does not just call
X509_verify_cert(). It also does some DANE-related manipulations, for
example. Are those fears unfounded? In other words, is calling
X509_verify_cert() directly always enough to make the right certificate
validation decision?


Thanks a lot,

Alex.

Re: SSL_ERROR_WANT_TIME: Pause SSL_connect to fetch intermediate certificates

Matt Caswell


On 19/08/2020 20:35, Alex Rousskov wrote:
> Does this clarify what I meant? Do you agree that OpenSSL async API is
> not suitable for callbacks that _require_ ASYNC_pause_job() to return
> control to the application?

Yes, it clarifies what you meant. And, yes, it's true that strictly
speaking that *could* happen. ASYNC_block_pause() was introduced to
handle the problem where we are holding a lock and therefore must not
return control to the user without releasing that lock. As a general
rule we want to keep the sections of code that perform work under a lock
to an absolute minimum. It would not seem like a great idea to me to
call user callbacks from libssl while holding such a lock. We have no
idea what those callbacks are going to do, and which APIs they will
call. The chances of a deadlock occurring seem very high under those
circumstances, unless restrictions are placed on what the callback can
do, and those restrictions are very clearly documented.

So, yes you are right. But in practice I'm not sure how much I'd really
worry about this theoretical restriction. That's of course for you to
decide.

> If you think that fears about something inside OpenSSL/engines
> preventing our callback from returning control to the application are
> unfounded, then using async API may be the best long-term solution for
> Squid. Short-term, it does not work "as is" because OpenSSL STACKSIZE
> appears to be too small (leading to weird crashes that disappear if we
> increase STACKSIZE from 32768 to 524288 bytes), but perhaps we can
> somehow hack around that.

Hmm. Yes this is a problem with the current implementation. The
selection of STACKSIZE is somewhat arbitrary. It would be nice if the
stack size grew as required, but I'm not sure if that's even technically
possible. A workaround might be for us to expose some API to set it -
but exposing such internal details is also quite horrible.


>
>
>> One possibility that springs to mind (which is also an ugly hack) is to
>> defer the validation of the certificates. So, you have a verify callback
>> that always says "ok". But any further reads on the underlying BIO
>> always return with "retry" until such time as any intermediate
>> certificates have been fetched and the chain has been verified "for
>> real". The main problem I can see with this approach is there is no easy
>> way to send the right alert back to the server in the event of failure.
>
> We were also concerned that X509_verify_cert() is not enough to fully
> mimic the existing OpenSSL certificate validation procedure because the
> internal OpenSSL ssl_verify_cert_chain() does not just call
> X509_verify_cert(). It also does some DANE-related manipulations, for
> example. Are those fears unfounded? In other words, is calling
> X509_verify_cert() directly always enough to make the right certificate
> validation decision?
>

Does Squid use the DANE APIs? If not I'm not sure it makes much
difference. In any case the "manipulation" seems limited to setting DANE
information in the X509_STORE_CTX which presumably could be replicated by:

X509_STORE_CTX_set0_dane(ctx, SSL_get0_dane(ssl));

However, I'm not really the person to ask about the DANE implementation.
Maybe Viktor Dukhovni will chip in with his thoughts.

Matt