CMAC timings

Hal Murray
In general, things have slowed down.

The new EVP_MAC code takes 3 times as long as the old CMAC code on 1.1.1.
The new PKEY code takes twice as long as the old CMAC code on 1.1.1.

The one ray of hope is that the API for EVP_MAC has split the part of the
setup that uses the key out of the init routine.  If we can hang on to a ctx
for each key, we can cut the time in half - that's new EVP_MAC vs CMAC on 1.1.1.

So how much memory does a ctx take?  How can I find out?

Even if I can't allocate a ctx per key, I can at least keep one around from
recv to send.  That makes the slowdown a factor of 1.7 instead of 3.
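
Roughly what that per-key reuse would look like with EVP_MAC (a sketch against
the documented 3.0 calls; the alpha parameter handling differs in detail, the
helper names cmac_setup()/cmac_packet() are invented, and re-initialising with
a NULL key to reuse the one already set is what the documentation describes,
not something measured here):

    #include <openssl/evp.h>
    #include <openssl/params.h>

    static EVP_MAC *mac;          /* fetched once at startup */
    static EVP_MAC_CTX *key_ctx;  /* one long-lived ctx per key */

    int cmac_setup(const unsigned char *key, size_t keylen)
    {
        OSSL_PARAM params[2];

        mac = EVP_MAC_fetch(NULL, "CMAC", NULL);
        if (mac == NULL)
            return 0;
        key_ctx = EVP_MAC_CTX_new(mac);
        if (key_ctx == NULL)
            return 0;
        /* The cipher is named via a string parameter; build it once. */
        params[0] = OSSL_PARAM_construct_utf8_string("cipher",
                                                     "AES-128-CBC", 0);
        params[1] = OSSL_PARAM_construct_end();
        /* The key setup happens here, once per key. */
        return EVP_MAC_init(key_ctx, key, keylen, params);
    }

    int cmac_packet(const unsigned char *pkt, size_t pktlen,
                    unsigned char *out, size_t outsize, size_t *outlen)
    {
        /* NULL key: start a fresh MAC, reusing the cached key. */
        return EVP_MAC_init(key_ctx, NULL, 0, NULL)
            && EVP_MAC_update(key_ctx, pkt, pktlen)
            && EVP_MAC_final(key_ctx, out, outlen, outsize);
    }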

----------

Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz

# KL=key length, PL=packet length, CL=CMAC length

# OpenSSL 1.1.1g FIPS  21 Apr 2020

# CMAC        KL PL CL  ns/op sec/run
     AES-128  16 48 16    366   0.366  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16    381   0.381  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16    407   0.407  991f4017858de97515260dd9ae440b06

# PKEY        KL PL CL  ns/op sec/run
     AES-128  16 48 16    436   0.436  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16    448   0.448  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16    461   0.461  991f4017858de97515260dd9ae440b06

---------

# OpenSSL 3.0.0-alpha3 4 Jun 2020

# CMAC        KL PL CL  ns/op sec/run
     AES-128  16 48 16    973   0.973  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16    987   0.987  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16   1011   1.011  991f4017858de97515260dd9ae440b06

# PKEY        KL PL CL  ns/op sec/run
     AES-128  16 48 16    817   0.817  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16    824   0.824  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16    842   0.842  991f4017858de97515260dd9ae440b06

# EVP_MAC     KL PL CL  ns/op sec/run
     AES-128  16 48 16   1136   1.136  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16   1153   1.153  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16   1181   1.181  991f4017858de97515260dd9ae440b06

Preload cipher and key.
     AES-128  16 48 16    170   0.170  475ac1c053379e7dbd4ce80b87d2178e
     AES-192  24 48 16    182   0.182  c906422bfe0963de6df50e022b4aa7d4
     AES-256  32 48 16    196   0.196  991f4017858de97515260dd9ae440b06



--
These are my opinions.  I hate spam.




Re: CMAC timings

Richard Levitte - VMS Whacker-2
Quick first answer: EVP_MAC_CTX is a typedef of struct evp_mac_ctx_st,
which you find in crypto/evp/evp_local.h.  It's quite small (smaller
than EVP_MD_CTX and EVP_PKEY_CTX):

    struct evp_mac_ctx_st {
        EVP_MAC *meth;               /* Method structure */
        void *data;                  /* Individual method data */
    } /* EVP_MAC_CTX */;

The slowdown isn't entirely surprising...  in pre-3.0, all back-ends
(including engines, with the help of API calls) could reach right into
the EVP_PKEY_CTX that was used by libcrypto code, i.e. central code
and back-end code shared intimate knowledge.  With providers, the
boundary between central code and provider code is much stronger,
which means a certain amount of messaging between the two.

What does surprise me, though, is that direct EVP_MAC calls would be
slower than going through the PKEY bridge.  I would very much like to
see your code to see what's going on.

Regarding preloaded cipher and key, that tells me that the actual
computation of a MAC is quick enough, that most of the slowdown is
parameter overhead.  That was expected.

Cheers,
Richard

--
Richard Levitte         [hidden email]
OpenSSL Project         http://www.openssl.org/~levitte/

Re: CMAC timings

Hal Murray
In reply to this post by Hal Murray
[hidden email] said:
> What does surprise me, though, is that direct EVP_MAC calls would be slower
> than going through the PKEY bridge.  I would very much like to see your code
> to see what's going on.

Over on an ntpsec list, Kurt Roeckx reported that he was still waiting...

Richard's message said "I", so I sent him a copy off list.  Correcting that
by attaching it here.


--
These are my opinions.  I hate spam.


Attachment: cmac-timing.c (14K)

Re: CMAC timings

Hal Murray
In reply to this post by Hal Murray
Thanks.

[hidden email] said:
> Quick first answer: EVP_MAC_CTX is a typedef of struct evp_mac_ctx_st, which
> you find in crypto/evp/evp_local.h.  It's quite small (smaller than
> EVP_MD_CTX and EVP_PKEY_CTX):

How much space does the crypto stuff take?  The idea is to do all of the setup
calculations ahead of time.  I expect there are some tables in there.

> Regarding preloaded cipher and key, that tells me that the actual computation
> of a MAC is quick enough, that most of the slowdown is parameter overhead.
> That was expected.

There are 2 sorts of overhead.  One is turning the key into a table, or
something like that.  The other is collecting the parameters and turning them
into something that can be processed.  Using strings as keys in the params
tables seems like an invitation for not-fast.  It's probably not significant
if it is being used from deep inside SSL processing, but the total processing
time for an NTP packet is in the ballpark of 10 microseconds, so differences
on that scale become interesting.


--
These are my opinions.  I hate spam.




Re: CMAC timings

Kurt Roeckx
In reply to this post by Hal Murray
On Wed, Jun 17, 2020 at 03:50:05AM -0700, Hal Murray wrote:
> [hidden email] said:
> > What does surprise me, though, is that direct EVP_MAC calls would be slower
> > than going through the PKEY bridge.  I would very much like to see your code
> > to see what's going on.
>
> Over on an ntpsec list, Kurt Roeckx reported that he was still waiting...
>
> Richard's message said "I", so I sent him a copy off list.  Correcting that...

So I took a look at the EVP_PKEY case, and it seems we spend most
of our time doing:
- alloc/free. 12 alloc and 16 free calls per signature
- OPENSSL_cleanse: 10 calls per signature
- EVP_CIPHER_CTX_reset: 6 calls per signature

Most of the time is spent in those functions.

The manpage documents:
    The call to EVP_DigestSignFinal() internally finalizes a
    copy of the digest context. This means that calls to
    EVP_DigestSignUpdate() and EVP_DigestSignFinal() can be called
    later to digest and sign additional data.

And:
       EVP_MD_CTX_FLAG_FINALISE
           Some functions such as EVP_DigestSign only finalise
           copies of internal contexts so additional data can be
           included after the finalisation call. This is
           inefficient if this functionality is not required, and
           can be disabled with this flag.

(A reference to the EVP_MD_CTX_set_flags manpage would have been
useful.)

So after the EVP_MD_CTX_new(), I added an:
        EVP_MD_CTX_set_flags(ctx, EVP_MD_CTX_FLAG_FINALISE);

For me it changed things with 3.0 from:
     AES-128  16 48 16   1696   1.696  475ac1c053379e7dbd4ce80b87d2178e
to:
     AES-128  16 48 16    754   0.754  475ac1c053379e7dbd4ce80b87d2178e

While 1.1 gives me this without the change:
     AES-128  16 48 16    739   0.739  475ac1c053379e7dbd4ce80b87d2178e
and with the change:
     AES-128  16 48 16    291   0.291  475ac1c053379e7dbd4ce80b87d2178e
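
In code that is one extra call right after creating the context.  A sketch of
the whole CMAC-over-PKEY path with it applied (invented helper name; *maclen
must hold the size of the output buffer on entry; passing NULL for the digest
argument is, as far as I can tell, how a CMAC EVP_PKEY is driven):

    #include <openssl/evp.h>

    int pkey_cmac_once(const unsigned char *key, size_t keylen,
                       const unsigned char *pkt, size_t pktlen,
                       unsigned char *mac, size_t *maclen)
    {
        int ok = 0;
        EVP_PKEY *pkey = EVP_PKEY_new_CMAC_key(NULL, key, keylen,
                                               EVP_aes_128_cbc());
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();

        if (pkey == NULL || ctx == NULL)
            goto done;
        /* We never add data after finalising, so skip the
         * finalise-a-copy behaviour. */
        EVP_MD_CTX_set_flags(ctx, EVP_MD_CTX_FLAG_FINALISE);
        /* CMAC keys carry their own cipher; no digest is involved. */
        if (EVP_DigestSignInit(ctx, NULL, NULL, NULL, pkey) <= 0)
            goto done;
        ok = EVP_DigestSignUpdate(ctx, pkt, pktlen) > 0
             && EVP_DigestSignFinal(ctx, mac, maclen) > 0;
    done:
        EVP_MD_CTX_free(ctx);
        EVP_PKEY_free(pkey);
        return ok;
    }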

I question the default behaviour, I think most people don't need
that support.


Kurt


Re: CMAC timings

Hal Murray
In reply to this post by Hal Murray

Thanks.

> The manpage documents: The call to EVP_DigestSignFinal() internally finalizes
> a copy of the digest context. This means that calls to EVP_DigestSignUpdate()
> and EVP_DigestSignFinal() can be called later to digest and sign additional
> data.

I saw that, but couldn't figure out what it meant.  "additional" suggests that
it would keep going and sign the current data plus new data.  That didn't seem
very relevant for my use case.  "another" might be a better word.

Is the idea that it leaves the internal state the same as it was after
EVP_DigestSignInit()?  If so, that would allow significant CPU savings in the
request/response case where we sign twice with the same key.


> - alloc/free. 12 alloc and 16 free calls per signature

I'm curious.  12 seems like a huge number.  I'd expect 1.  What's going on?


> And:
>        EVP_MD_CTX_FLAG_FINALISE
>            Some functions such as EVP_DigestSign only finalise
>            copies of internal contexts so additional data can be
>            included after the finalisation call. This is
>            inefficient if this functionality is not required, and
>            can be disabled with this flag.

Now that you have explained it, I can try to read between the lines.

That "only" seems misleading.

"inefficient if this functionality is not required" doesn't make much sense
(too me) in the context of finalize.  The inefficiency is that you will have
to repeat the setup if you want to do another run with the same key.


--
These are my opinions.  I hate spam.




Re: CMAC timings

Tomas Mraz-2
In reply to this post by Kurt Roeckx
On Wed, 2020-06-17 at 23:02 +0200, Kurt Roeckx wrote:

> I question the default behaviour, I think most people don't need
> that support.

Unfortunately that would be an API break that could be very hard to
discover, so I do not think we can change this even in 3.0.

--
Tomáš Mráz
No matter how far down the wrong road you've gone, turn back.
                                              Turkish proverb
[You'll know whether the road is wrong if you carefully listen to your
conscience.]



Re: CMAC timings

Blumenthal, Uri - 0553 - MITLL
I think that the default behavior should change for 3.0, and the API change should be described in the Release Notes. I find that alternative less impactful than this silent, sudden performance deterioration.


On 6/18/20, 04:42, "openssl-users on behalf of Tomas Mraz" <[hidden email] on behalf of [hidden email]> wrote:

    On Wed, 2020-06-17 at 23:02 +0200, Kurt Roeckx wrote:
    > I question the default behaviour, I think most people don't need
    > that support.

    Unfortunately that would be an API break that could be very hard to
    discover, so I do not think we can change this even in 3.0.




Re: CMAC timings

Kurt Roeckx
In reply to this post by Tomas Mraz-2
On Thu, Jun 18, 2020 at 10:41:40AM +0200, Tomas Mraz wrote:
> > I question the default behaviour, I think most people don't need
> > that support.
>
> Unfortunately that would be an API break that could be very hard to
> discover, so I do not think we can change this even in 3.0.

But I think the old CMAC API didn't support that, and so we can
change the internal calls to use the flag now (if needed). The
EVP_MAC API probably supports this too, and we can consider
changing the default there.
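
For reference, the old API in question, roughly (a sketch with an invented
helper name; note there is no way to keep feeding data after CMAC_Final()):

    #include <openssl/cmac.h>
    #include <openssl/evp.h>

    int old_cmac(const unsigned char *key, size_t keylen,
                 const unsigned char *pkt, size_t pktlen,
                 unsigned char *mac, size_t *maclen)
    {
        int ok = 0;
        CMAC_CTX *ctx = CMAC_CTX_new();

        if (ctx == NULL)
            return 0;
        ok = CMAC_Init(ctx, key, keylen, EVP_aes_128_cbc(), NULL)
             && CMAC_Update(ctx, pkt, pktlen)
             && CMAC_Final(ctx, mac, maclen);
        CMAC_CTX_free(ctx);
        return ok;
    }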


Kurt


Re: CMAC timings

Blumenthal, Uri - 0553 - MITLL
On 6/18/20, 12:46, "openssl-users on behalf of Kurt Roeckx" <[hidden email] on behalf of [hidden email]> wrote:

    > But I think the old CMAC API didn't support that, and so we can
    > change the internal calls to use the flag now (if needed). The
    > EVP_MAC API probably supports this too, and we can consider
    > changing the default there.

Yes please!



Re: CMAC timings

Kurt Roeckx
In reply to this post by Blumenthal, Uri - 0553 - MITLL
On Thu, Jun 18, 2020 at 02:12:56PM +0000, Blumenthal, Uri - 0553 - MITLL wrote:
> I think that the default behavior should change for 3.0, and the API change described in the Release Notes. I find that alternative less impacting that this silent sudden performance deterioration.

Note that I was just looking at why the EVP_PKEY API was slow, and
the first thing I found was the impact of EVP_MD_CTX_FLAG_FINALISE.
This also has a big impact in the 1.1.1 version:
CMAC API:
     AES-128  16 48 16    410   0.410  475ac1c053379e7dbd4ce80b87d2178e
EVP_PKEY:
     AES-128  16 48 16    739   0.739  475ac1c053379e7dbd4ce80b87d2178e
EVP_PKEY adding EVP_MD_CTX_FLAG_FINALISE:
     AES-128  16 48 16    291   0.291  475ac1c053379e7dbd4ce80b87d2178e

The same with AESNI disabled:
CMAC API:
     AES-128  16 48 16    584   0.584  475ac1c053379e7dbd4ce80b87d2178e
EVP_PKEY:
     AES-128  16 48 16    823   0.823  475ac1c053379e7dbd4ce80b87d2178e
EVP_PKEY adding EVP_MD_CTX_FLAG_FINALISE:
     AES-128  16 48 16    387   0.387  475ac1c053379e7dbd4ce80b87d2178e

Now that a large fraction of the cost has been found, I can look
again to see where the biggest cost in 3.0 comes from now and if we
can do something about it.

I think changing the default is going to break applications, which
is what we want to avoid.  I think we should document that this
overhead can be large if you process small packets and that the
behaviour can be changed with that option.


Kurt


Re: CMAC timings

Kurt Roeckx
On Thu, Jun 18, 2020 at 07:24:39PM +0200, Kurt Roeckx wrote:
>
> Now that a large fraction of the cost has been found, I can look
> again to see where the biggest cost in 3.0 comes from now and if we
> can do something about it.

So a code path that I've noticed before when looking at the
profile is:
    /* TODO(3.0): Remove this eventually when no more legacy */
    if (ctx->op.sig.sigprovctx == NULL)
        return EVP_PKEY_CTX_ctrl(ctx, -1, EVP_PKEY_OP_TYPE_SIG,
                                 EVP_PKEY_CTRL_MD, 0, (void *)(md));

I think that is now actually causing most of the CPU usage.

This currently ends up doing an EVP_MAC_dup_ctx(), and I'm
currently not sure why, and what the effect is going to be
when sigprovctx != NULL. I think it might be better to wait until
someone fixes that before I look at that again.


Kurt


Re: CMAC timings

Matt Caswell-2


On 18/06/2020 20:18, Kurt Roeckx wrote:

> On Thu, Jun 18, 2020 at 07:24:39PM +0200, Kurt Roeckx wrote:
>>
>> Now that a large fraction of the cost has been found, I can look
>> again to see where the biggest cost in 3.0 comes from now and if we
>> can do something about it.
>
> So a code path that I've noticed before when looking at the
> profile is:
>     /* TODO(3.0): Remove this eventually when no more legacy */
>     if (ctx->op.sig.sigprovctx == NULL)
>         return EVP_PKEY_CTX_ctrl(ctx, -1, EVP_PKEY_OP_TYPE_SIG,
>                                  EVP_PKEY_CTRL_MD, 0, (void *)(md));
>
> I think that is now actually causing most of the CPU usage.
>
> This currently ends up doing an EVP_MAC_dup_ctx(), and I'm
> currently not sure why, and what the effect is going to be
> when sigprovctx != NULL. I think it might be better to wait until
> someone fixes that before I look at that again.

I looked into what was going on here.

The EVP_PKEY -> EVP_MAC bridge is implemented as a *legacy*
EVP_PKEY_METHOD, i.e. the conversion from EVP_PKEY -> EVP_MAC happens in
libcrypto *before* it hits any provider. So in the above code
"ctx->op.sig.signprovctx" will *always* be NULL because we are using the
bridge.

The answer to why we have the EVP_MAC_dup_ctx() lies in the
implementation of EVP_PKEY_new_CMAC_key(). In EVP_MAC terms the cipher
and key to be used are parameters set on an EVP_MAC_CTX - there is no
long term "key" object to store these in. By contrast an EVP_PKEY
considers these part of the long term "key" that can be reused in
multiple EVP_PKEY_CTX operations. To resolve this difference in approach
the EVP_PKEY -> MAC bridge creates an EVP_MAC_CTX during construction of
the EVP_PKEY and sets the cipher and key parameters on it. Then, every
time we do an EVP_DigestSignInit() call we create a new EVP_PKEY_CTX and
"dup" the EVP_MAC_CTX from the EVP_PKEY. In this way we can reuse the
same EVP_PKEY in many EVP_DigestSign*() operations.
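
Caller-side, that reuse pattern looks something like this (a sketch with an
invented name; one EVP_PKEY created up front with EVP_PKEY_new_CMAC_key() and
passed in for every packet, each call below being where the dup happens;
*maclen holds the output buffer size on entry):

    #include <openssl/evp.h>

    int sign_one_packet(EVP_PKEY *pkey,
                        const unsigned char *pkt, size_t pktlen,
                        unsigned char *mac, size_t *maclen)
    {
        int ok = 0;
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();

        if (ctx != NULL
                && EVP_DigestSignInit(ctx, NULL, NULL, NULL, pkey) > 0
                && EVP_DigestSign(ctx, mac, maclen, pkt, pktlen) > 0)
            ok = 1;
        EVP_MD_CTX_free(ctx);
        return ok;
    }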

That all seems to work but has the impact that if you only ever create
the EVP_PKEY, use it once and then throw it away, then we have to create
the underlying EVP_MAC_CTX and then dup it (never actually using the
original EVP_MAC_CTX for anything other than a template for the
subsequent dup).

I find it slightly surprising that EVP_MAC_dup_ctx() is quite so
expensive. It mainly seems to end up doing a CMAC_CTX_copy() so I guess
this is where the time is going?

Matt