Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

Ian Sinclair
I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On one site we're getting a crash from dtls1_do_write and as far as I can tell it's from the assertion coded:

  /* bad if this assert fails, only part of the handshake
   * message got sent.  but why would this happen? */
  OPENSSL_assert(len == (unsigned int)ret);

My question is the same as some previous author - why would this happen?

Is there any meaningful way I can debug this? Some flag I can set that will show the DTLS packets to try to find a cause? Some way to get rid of the assertion so that the failure doesn't take down the whole system, because currently the assertion causes a reboot? It's happening on an end customer site so building a debug load isn't particularly viable, but if that's the only option tell me how.

Is this a known problem that is only fixed as a non-security fix in a later release? We are current for the release, I believe, with v1.0.1e 58.el6_10. If the solution is only in later releases how compatible are those with Centos 6? I really don't want to have to go to another stream.

I'm completely new to Asterisk, openSSL, core files, and pretty much everything else, so please be clear and complete in suggestions.

Thanks,
Ian

Reply | Threaded
Open this post in threaded view
|

Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

Matt Caswell-2


On 24/09/2019 14:11, Ian Sinclair wrote:
> I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On
> one site we're getting a crash from dtls1_do_write and as far as I can tell it's
> from the assertion coded:
>
>   /* bad if this assert fails, only part of the handshake
>    * message got sent.  but why would this happen? */
>   OPENSSL_assert(len == (unsigned int)ret);
>
> My question is the same as some previous author - why would this happen?

I've taken a look at the code in this area to try and figure out a reason.

I note that the code before this looks like this:

        ret = dtls1_write_bytes(s, type, &s->init_buf->data[s->init_off],
                                len);
        if (ret < 0) {
                ....omitted...
        } else {

            /*
             * bad if this assert fails, only part of the handshake message
             * got sent.  but why would this happen?
             */
            OPENSSL_assert(len == (unsigned int)ret);

Most significantly I notice that this doesn't seem to handle the ret == 0 case -
which is normally used to indicate an error. So looking at dtls1_write_bytes we
should check whether there are any cases where it could return 0. Well that
function doesn't actually do very much except call do_dtls1_write() and return
the result of that.

There are a few possible points where that function could return a value:

1) Returns the result of ssl3_write_pending()
2) Returns the result of ssl_dispatch_alert(). Following the logic of that we
find that it ends up returning the result of a recursive do_dtls1_write() call -
so we can ignore this one.
3) If len == 0 and !create_empty_fragment then we return 0. If that is the case
then len == 0 and ret == 0 so the assertion would not be hit, so we can ignore this.
4) if create_empty_fragment is true we return wr->length. But there are no calls
of do_dtls1_write where create_empty_fragment is true anywhere in the code. This
is actually dead code (we should remove this). We can ignore this.
5) On error we return -1, which we can ignore.

So it seems do_dtls1_write() will only return 0 if ssl3_write_pending() does.

Looking at that function there are 3 possible points where it can return:

1) On error it returns -1, so we can ignore that.
2) In some circumstances it can return s->wpend_ret. wpend_return is set in
do_dtls1_write to the value of len, so the only way it could be 0 is if len is,
which would not cause the assertion to fail so we can ignore this.
3) In other circumstances it returns the result of a BIO_write() call

So the only way I can see for a 0 return to come back is if BIO_write() returns
that. Looking at the implementation of BIO_write() it will return 0 if:

- the BIO is NULL. But there is a check for this in ssl3_write_pending() so we
know this will never be the case.
- If a BIO callback has been set (via BIO_set_callback()) then it will return
the result from that
- Otherwise the return value is the result of a write call from the underlying
BIO implementation.

In DTLS the underlying BIO implementation is usually a BIO_s_datagram(). That
write function will return the value from a system call to send or sendto. The
man pages indicate this returns -1 on error, or otherwise the number of bytes
sent - so I don't see how we could ever get 0 here.

So, the summary of all of that is, I think we could get a situation where the
assertion is triggered if:

1) You are using BIO_set_callback() and the callback you set returns 0
2) You are using some source/sink BIO other than BIO_s_datagram() and it
sometimes returns 0 from its write call.
3) You are using a filter BIO and that returns 0 during a write call

Other possibilities that spring to mind are if you are using a custom BIO that
doesn't behave well and returns a positive value from its write call that is not
equal to the number of bytes it was asked to write.

>
> Is there any meaningful way I can debug this? Some flag I can set that will show
> the DTLS packets to try to find a cause? Some way to get rid of the assertion so
> that the failure doesn't take down the whole system, because currently the
> assertion causes a reboot?

In later releases I notice that this "hard" assertion has been converted to a
soft assertion - i.e. it only crashes in a debug build. Otherwise it just
returns -1 to indicate an error. So the equivalent of changing the code to look
like this in a production build:

            /*
             * bad if this assert fails, only part of the handshake message
             * got sent.  but why would this happen?
             */
            if (len != (unsigned int)ret)
                return -1;


> It's happening on an end customer site so building a
> debug load isn't particularly viable, but if that's the only option tell me how.
>
> Is this a known problem that is only fixed as a non-security fix in a later
> release?

It's not a know problem as far as I can remember. But on the other hand there
have been a lot of DTLS bug fixes that have gone in between 1.0.2 and 1.1.1.

Note that 1.0.2 is approaching EOL (at the end of this year), so is only
receiving security fixes. Since any fix we would apply doesn't sound like a
security fix it wouldn't get backported to 1.0.2.

> We are current for the release, I believe, with v1.0.1e 58.el6_10.

I'm slightly confused here you're talking about 1.0.1e but above you said 1.02
above.

Matt
Reply | Threaded
Open this post in threaded view
|

Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

Ian Sinclair
Thanks for the detailed investigation. I don't know if we have a BIO callback or modified any BIO code. I'll have to dig into our version of Asterisk to see if I can tell.

The version confusion is mine. We really are running 1.0.1e 58 from Centos6. I got crossed up because when I checked the documentation it only goes back to 1.0.2, so that got into my head.

If we have to rebuild the code, even just to debug the external cause (I expect there is something funny being fed to openSSL) is it going to be compatible for me to use the 1.0.2 source code or are there a bunch of related packages that I would need to upgrade in parallel? 

Thanks.

From: Matt Caswell <[hidden email]>
Sent: September 25, 2019 9:49 AM
To: Ian Sinclair <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);
 


On 24/09/2019 14:11, Ian Sinclair wrote:
> I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On
> one site we're getting a crash from dtls1_do_write and as far as I can tell it's
> from the assertion coded:
>
>   /* bad if this assert fails, only part of the handshake
>    * message got sent.  but why would this happen? */
>   OPENSSL_assert(len == (unsigned int)ret);
>
> My question is the same as some previous author - why would this happen?

I've taken a look at the code in this area to try and figure out a reason.

I note that the code before this looks like this:

        ret = dtls1_write_bytes(s, type, &s->init_buf->data[s->init_off],
                                len);
        if (ret < 0) {
                ....omitted...
        } else {

            /*
             * bad if this assert fails, only part of the handshake message
             * got sent.  but why would this happen?
             */
            OPENSSL_assert(len == (unsigned int)ret);

Most significantly I notice that this doesn't seem to handle the ret == 0 case -
which is normally used to indicate an error. So looking at dtls1_write_bytes we
should check whether there are any cases where it could return 0. Well that
function doesn't actually do very much except call do_dtls1_write() and return
the result of that.

There are a few possible points where that function could return a value:

1) Returns the result of ssl3_write_pending()
2) Returns the result of ssl_dispatch_alert(). Following the logic of that we
find that it ends up returning the result of a recursive do_dtls1_write() call -
so we can ignore this one.
3) If len == 0 and !create_empty_fragment then we return 0. If that is the case
then len == 0 and ret == 0 so the assertion would not be hit, so we can ignore this.
4) if create_empty_fragment is true we return wr->length. But there are no calls
of do_dtls1_write where create_empty_fragment is true anywhere in the code. This
is actually dead code (we should remove this). We can ignore this.
5) On error we return -1, which we can ignore.

So it seems do_dtls1_write() will only return 0 if ssl3_write_pending() does.

Looking at that function there are 3 possible points where it can return:

1) On error it returns -1, so we can ignore that.
2) In some circumstances it can return s->wpend_ret. wpend_return is set in
do_dtls1_write to the value of len, so the only way it could be 0 is if len is,
which would not cause the assertion to fail so we can ignore this.
3) In other circumstances it returns the result of a BIO_write() call

So the only way I can see for a 0 return to come back is if BIO_write() returns
that. Looking at the implementation of BIO_write() it will return 0 if:

- the BIO is NULL. But there is a check for this in ssl3_write_pending() so we
know this will never be the case.
- If a BIO callback has been set (via BIO_set_callback()) then it will return
the result from that
- Otherwise the return value is the result of a write call from the underlying
BIO implementation.

In DTLS the underlying BIO implementation is usually a BIO_s_datagram(). That
write function will return the value from a system call to send or sendto. The
man pages indicate this returns -1 on error, or otherwise the number of bytes
sent - so I don't see how we could ever get 0 here.

So, the summary of all of that is, I think we could get a situation where the
assertion is triggered if:

1) You are using BIO_set_callback() and the callback you set returns 0
2) You are using some source/sink BIO other than BIO_s_datagram() and it
sometimes returns 0 from its write call.
3) You are using a filter BIO and that returns 0 during a write call

Other possibilities that spring to mind are if you are using a custom BIO that
doesn't behave well and returns a positive value from its write call that is not
equal to the number of bytes it was asked to write.

>
> Is there any meaningful way I can debug this? Some flag I can set that will show
> the DTLS packets to try to find a cause? Some way to get rid of the assertion so
> that the failure doesn't take down the whole system, because currently the
> assertion causes a reboot?

In later releases I notice that this "hard" assertion has been converted to a
soft assertion - i.e. it only crashes in a debug build. Otherwise it just
returns -1 to indicate an error. So the equivalent of changing the code to look
like this in a production build:

            /*
             * bad if this assert fails, only part of the handshake message
             * got sent.  but why would this happen?
             */
            if (len != (unsigned int)ret)
                return -1;


> It's happening on an end customer site so building a
> debug load isn't particularly viable, but if that's the only option tell me how.
>
> Is this a known problem that is only fixed as a non-security fix in a later
> release?

It's not a know problem as far as I can remember. But on the other hand there
have been a lot of DTLS bug fixes that have gone in between 1.0.2 and 1.1.1.

Note that 1.0.2 is approaching EOL (at the end of this year), so is only
receiving security fixes. Since any fix we would apply doesn't sound like a
security fix it wouldn't get backported to 1.0.2.

> We are current for the release, I believe, with v1.0.1e 58.el6_10.

I'm slightly confused here you're talking about 1.0.1e but above you said 1.02
above.

Matt
Reply | Threaded
Open this post in threaded view
|

Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len == (unsigned int)ret);

Matt Caswell-2


On 25/09/2019 18:41, Ian Sinclair wrote:

> Thanks for the detailed investigation. I don't know if we have a BIO callback or
> modified any BIO code. I'll have to dig into our version of Asterisk to see if I
> can tell.
>
> The version confusion is mine. We really are running 1.0.1e 58 from Centos6. I
> got crossed up because when I checked the documentation it only goes back to
> 1.0.2, so that got into my head.
>
> If we have to rebuild the code, even just to debug the external cause (I expect
> there is something funny being fed to openSSL) is it going to be compatible for
> me to use the 1.0.2 source code or are there a bunch of related packages that I
> would need to upgrade in parallel?

From an OpenSSL perspective 1.0.2 should be a drop in replacement for 1.0.1.
However if you are using the system OpenSSL package, then upgrading that could
have other impacts - but that's not something I can really advise on.

Matt


>
> Thanks.
> --------------------------------------------------------------------------------
> *From:* Matt Caswell <[hidden email]>
> *Sent:* September 25, 2019 9:49 AM
> *To:* Ian Sinclair <[hidden email]>; [hidden email]
> <[hidden email]>
> *Subject:* Re: Crash in OpenSSL v1.0.1 from dtls1_do_write OPENSSL_assert(len ==
> (unsigned int)ret);
>  
>
>
> On 24/09/2019 14:11, Ian Sinclair wrote:
>> I'm working with Asterisk PBX code which uses openSSL v1.0.2 (from Centos6). On
>> one site we're getting a crash from dtls1_do_write and as far as I can tell it's
>> from the assertion coded:
>>
>>   /* bad if this assert fails, only part of the handshake
>>    * message got sent.  but why would this happen? */
>>   OPENSSL_assert(len == (unsigned int)ret);
>>
>> My question is the same as some previous author - why would this happen?
>
> I've taken a look at the code in this area to try and figure out a reason.
>
> I note that the code before this looks like this:
>
>         ret = dtls1_write_bytes(s, type, &s->init_buf->data[s->init_off],
>                                 len);
>         if (ret < 0) {
>                 ....omitted...
>         } else {
>
>             /*
>              * bad if this assert fails, only part of the handshake message
>              * got sent.  but why would this happen?
>              */
>             OPENSSL_assert(len == (unsigned int)ret);
>
> Most significantly I notice that this doesn't seem to handle the ret == 0 case -
> which is normally used to indicate an error. So looking at dtls1_write_bytes we
> should check whether there are any cases where it could return 0. Well that
> function doesn't actually do very much except call do_dtls1_write() and return
> the result of that.
>
> There are a few possible points where that function could return a value:
>
> 1) Returns the result of ssl3_write_pending()
> 2) Returns the result of ssl_dispatch_alert(). Following the logic of that we
> find that it ends up returning the result of a recursive do_dtls1_write() call -
> so we can ignore this one.
> 3) If len == 0 and !create_empty_fragment then we return 0. If that is the case
> then len == 0 and ret == 0 so the assertion would not be hit, so we can ignore this.
> 4) if create_empty_fragment is true we return wr->length. But there are no calls
> of do_dtls1_write where create_empty_fragment is true anywhere in the code. This
> is actually dead code (we should remove this). We can ignore this.
> 5) On error we return -1, which we can ignore.
>
> So it seems do_dtls1_write() will only return 0 if ssl3_write_pending() does.
>
> Looking at that function there are 3 possible points where it can return:
>
> 1) On error it returns -1, so we can ignore that.
> 2) In some circumstances it can return s->wpend_ret. wpend_return is set in
> do_dtls1_write to the value of len, so the only way it could be 0 is if len is,
> which would not cause the assertion to fail so we can ignore this.
> 3) In other circumstances it returns the result of a BIO_write() call
>
> So the only way I can see for a 0 return to come back is if BIO_write() returns
> that. Looking at the implementation of BIO_write() it will return 0 if:
>
> - the BIO is NULL. But there is a check for this in ssl3_write_pending() so we
> know this will never be the case.
> - If a BIO callback has been set (via BIO_set_callback()) then it will return
> the result from that
> - Otherwise the return value is the result of a write call from the underlying
> BIO implementation.
>
> In DTLS the underlying BIO implementation is usually a BIO_s_datagram(). That
> write function will return the value from a system call to send or sendto. The
> man pages indicate this returns -1 on error, or otherwise the number of bytes
> sent - so I don't see how we could ever get 0 here.
>
> So, the summary of all of that is, I think we could get a situation where the
> assertion is triggered if:
>
> 1) You are using BIO_set_callback() and the callback you set returns 0
> 2) You are using some source/sink BIO other than BIO_s_datagram() and it
> sometimes returns 0 from its write call.
> 3) You are using a filter BIO and that returns 0 during a write call
>
> Other possibilities that spring to mind are if you are using a custom BIO that
> doesn't behave well and returns a positive value from its write call that is not
> equal to the number of bytes it was asked to write.
>
>>
>> Is there any meaningful way I can debug this? Some flag I can set that will show
>> the DTLS packets to try to find a cause? Some way to get rid of the assertion so
>> that the failure doesn't take down the whole system, because currently the
>> assertion causes a reboot?
>
> In later releases I notice that this "hard" assertion has been converted to a
> soft assertion - i.e. it only crashes in a debug build. Otherwise it just
> returns -1 to indicate an error. So the equivalent of changing the code to look
> like this in a production build:
>
>             /*
>              * bad if this assert fails, only part of the handshake message
>              * got sent.  but why would this happen?
>              */
>             if (len != (unsigned int)ret)
>                 return -1;
>
>
>> It's happening on an end customer site so building a
>> debug load isn't particularly viable, but if that's the only option tell me how.
>>
>> Is this a known problem that is only fixed as a non-security fix in a later
>> release?
>
> It's not a know problem as far as I can remember. But on the other hand there
> have been a lot of DTLS bug fixes that have gone in between 1.0.2 and 1.1.1.
>
> Note that 1.0.2 is approaching EOL (at the end of this year), so is only
> receiving security fixes. Since any fix we would apply doesn't sound like a
> security fix it wouldn't get backported to 1.0.2.
>
>> We are current for the release, I believe, with v1.0.1e 58.el6_10.
>
> I'm slightly confused here you're talking about 1.0.1e but above you said 1.02
> above.
>
> Matt