There’s a crack in the foundation of Internet of Things (IoT) security, one that affects 35 billion devices worldwide.
Basically, every IoT device with a hardware random number generator
(RNG) contains a serious vulnerability whereby it fails to properly
generate random numbers, which undermines security for any upstream use.
See Dan and Allan discuss their research at DEF CON 29:
FIGURE 1: Alice and Bob trying to have a private conversation using RNG
In order for Alice and Bob to communicate secretly, away from the
prying eyes of Eve, they need to produce a shared secret by using an
RNG. The fact that Eve does not know this number is the only thing
keeping her from compromising the secrecy of the communications. This
same story plays out for other aspects of security: whether it’s SSH
keys for authentication or session tokens for authorization, random
numbers are one of the bedrock foundations of computer security.
But it turns out that these “randomly” chosen numbers aren’t always
as random as you’d like when it comes to IoT devices. In fact, in many
cases, devices are choosing encryption keys of 0
or worse. This can lead to a catastrophic collapse of security for any upstream use.
As of 2021, most new IoT systems-on-a-chip (SoCs) have a dedicated
hardware RNG peripheral that’s designed to solve exactly this problem.
But unfortunately, it’s not that simple. How you use the peripheral is
critically important, and the current state of the art in IoT can only
be aptly described as “doing it wrong.”
One of the more glaring pitfalls happens when developers fail to
check error code responses, which often results in numbers that are
decidedly less random than required for a security-relevant use.
When an IoT device requires a random
number, it makes a call to the dedicated hardware RNG either through the
device’s SDK or increasingly through an IoT operating system. What the
function call is named varies, of course, but it takes place in the
hardware abstraction layer (HAL). This is an API created by the device
manufacturer and is designed so you can more easily interface with the
hardware through C code and not need to mess around with setting and
checking specific registers unique to the device. The HAL function looks
something like this:
u8 hal_get_random_number(u32 *out_number);
The output parameter is out_number. This is where the function will put the random number; it’s a pointer to an unsigned 32-bit integer. The u8 return value reports whether the call succeeded or failed.
So, the first question you might be asking is, “How many people out
there in the wild actually check this error code?” Unfortunately, the
answer is almost nobody.
For instance, just look at GitHub results for use of the MediaTek 7697 SoC HAL function:
https://github.com/search?p=1&q=hal_trng_get_generated_random_number&type=Code
Or even FreeRTOS’s (a popular IoT operating system) abstraction layer:
https://github.com/search?q=HAL_RNG_GenerateRandomNumber&type=code
Notice that the return code is pervasively not checked – though this
isn’t unique to these two examples. This is just how the IoT industry
does it. You’ll find this behavior across basically every SDK and IoT
OS.
Okay, so devices aren’t checking the error code of the RNG HAL
function. But how bad is it really? It depends on the specific device,
but potentially bad. Very bad. Let’s take a look.
The HAL function to the RNG peripheral can fail for a variety of
reasons, but by far the most common (and exploitable) is that the device
has run out of entropy. Hardware RNG peripherals pull entropy out of
the universe through a variety of means (such as analog sensors or EMF
readings) but don’t have it in infinite supply. They’re only capable of
producing so many random bits per second. If you try calling the RNG HAL
function when it doesn’t have any random numbers to give you, it will
fail and return an error code. Thus, if the device tries to get too many
random numbers too quickly, the calls will begin to fail.
But that’s the thing about random numbers; it’s not enough to just
have one. When a device needs to generate a new 2048-bit private key, as
a conservative example, it will call the RNG HAL function over and over
in a loop. This starts to seriously tax the hardware’s ability to keep
up, and in practice, they often can’t. The first few calls may succeed,
but they will typically start to cause errors quickly.
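To put numbers on that loop: a 2048-bit key takes 2048 / 32 = 64 successful 32-bit reads. Below is a minimal sketch of the failure pattern, run against a mock rather than real hardware. The mock_rng_t type, mock_hal_get_random_number, and the fixed "entropy budget" are all hypothetical stand-ins invented for illustration; they are not any vendor's API, and the toy generator inside is not a source of real randomness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock of an RNG HAL: it can serve a fixed number of 32-bit
 * words, then fails (returns nonzero and leaves *out untouched). */
typedef struct {
    uint32_t words_available; /* how many reads will still succeed */
    uint32_t state;           /* toy generator state, NOT real entropy */
} mock_rng_t;

uint8_t mock_hal_get_random_number(mock_rng_t *rng, uint32_t *out)
{
    assert(rng != NULL && out != NULL);
    if (rng->words_available == 0) {
        return 1; /* error: out of entropy */
    }
    rng->words_available--;
    rng->state = rng->state * 1664525u + 1013904223u; /* placeholder values */
    *out = rng->state;
    return 0;
}

/* Fill a key buffer one HAL call per word. Failures are only counted,
 * never acted on, which mirrors code that ignores the return value.
 * Returns the number of failed calls. */
size_t fill_key_counting_failures(mock_rng_t *rng, uint32_t *key, size_t words)
{
    assert(rng != NULL && key != NULL);
    size_t failures = 0;
    for (size_t i = 0; i < words; i++) {
        key[i] = 0; /* what a zeroed buffer holds when the call fails */
        if (mock_hal_get_random_number(rng, &key[i]) != 0) {
            failures++;
        }
    }
    return failures;
}
```

With a budget of 10 words and a 64-word key, 54 of the 64 calls fail and those 54 words of the "key" stay at 0.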
So… what does the HAL function actually give you for a random number
when it fails? Depending on the hardware, the possibilities include the number 0 and uninitialized memory.
FIGURE 2: That shouldn’t be there.
None of those are very good, but uninitialized memory?! How does that
happen? Well, recall that the random number is an output pointer. Then
consider the following pseudocode (which you can find many examples of
on GitHub if you care to look):
u32 random_number;
hal_get_random_number(&random_number);
// Sends over the network
packet_send(random_number);
The random_number variable is declared and lives on the stack but is never initialized.
If the HAL function behaves such that it never overwrites the output
variable in the event of an error (which is common behavior), then the
variable will still contain uninitialized RAM, which we then send
out over the network to someone else. Not great.
These are not contrived or unrealistic scenarios. Devices are out there right now using crypto keys of 0 or worse.
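Here is a small sketch of that leak and of the minimum a caller can do about it. Everything in it is hypothetical: failing_hal_get_random_number stands in for a HAL call that has run out of entropy, and the stale argument simulates leftover stack contents with a known value, since reading a truly uninitialized variable is undefined behavior in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signature modeled on the generic HAL prototype used in this article. */
typedef uint8_t (*hal_fn_t)(uint32_t *out_number);

/* Hypothetical HAL stand-in that always fails and never writes to the
 * output pointer on error. */
uint8_t failing_hal_get_random_number(uint32_t *out_number)
{
    (void)out_number;
    return 1; /* nonzero means error in this sketch */
}

/* The vulnerable pattern: return value ignored. `stale` simulates whatever
 * happened to be on the stack before the call. */
uint32_t value_after_ignored_failure(hal_fn_t hal, uint32_t stale)
{
    uint32_t random_number = stale;
    hal(&random_number);  /* error code dropped on the floor */
    return random_number; /* still the stale value if the call failed */
}

/* The checked pattern: *out is written only when the HAL reports success.
 * Returns 0 on success, -1 on failure. */
int get_random_checked(hal_fn_t hal, uint32_t *out)
{
    assert(hal != NULL && out != NULL);
    uint32_t tmp = 0;
    if (hal(&tmp) != 0) {
        return -1; /* caller must not use *out */
    }
    *out = tmp;
    return 0;
}
```

The checked version only makes the failure visible. It does not answer the harder question of what the caller should do next.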
It’s easy to look at the current state of affairs and conclude that
it’s simply user error, but this is not the case. The users have been
put in a no-win situation. You see, random numbers are incredibly
critical when you need them. Oftentimes you can’t just “handle” the
error in an elegant way and move forward.
For example, the MediaTek documentation contains the following example code for developers of the MT7697 SoC:
if(HAL_TRNG_STATUS_OK != hal_trng_get_generated_random_number(&random_number)) {
    //error handle
}
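But “error handle” leaves only two real options: block in a spin loop until the RNG recovers, or abandon the operation entirely. Neither is an acceptable solution. This is why
developers ignore the error condition: the alternatives are terrible
and the ecosystem around RNG hardware has done them no favors.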
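For a sense of what an error handler could look like if it blocks without hanging forever, here is a sketch of a capped retry loop. The function-pointer signature, the context argument, and flaky_hal (a mock that fails a set number of times before succeeding) are assumptions made for illustration, not part of any real SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical HAL signature with a context pointer so the mock can keep state. */
typedef uint8_t (*hal_ctx_fn_t)(void *ctx, uint32_t *out_number);

/* Mock HAL: ctx points to an int holding how many more calls should fail. */
uint8_t flaky_hal(void *ctx, uint32_t *out_number)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return 1; /* error: not ready */
    }
    *out_number = 0x12345678u; /* placeholder value, not random */
    return 0;
}

/* Retry up to max_attempts times. Returns 0 and writes *out on success;
 * returns -1 and leaves *out untouched if every attempt failed. */
int get_random_with_retry(hal_ctx_fn_t hal, void *ctx, uint32_t *out,
                          unsigned max_attempts)
{
    assert(hal != NULL && out != NULL);
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        uint32_t tmp = 0;
        if (hal(ctx, &tmp) == 0) {
            *out = tmp;
            return 0;
        }
        /* A real port would sleep or yield here so the peripheral can refill. */
    }
    return -1;
}
```

The cap only moves the problem. When the loop gives up, the caller is back to choosing between stalling the device and going without a random number.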
Things aren’t much better even when developers have the benefit of
time on their side. Some devices, like the STM32, have sizable
documentation and even vendor-provided proof of randomness whitepapers,
but these are an exception. Few devices have even a basic description of
how the hardware RNG is supposed to work, and fewer still have any kind
of documentation about basic things like expected operating speed, safe
operating temperature ranges, and statistical evidence of randomness.
Anecdotally speaking, we attempted to follow the STM32 documentation
carefully and still managed to create code that incorrectly handled
error responses. It took multiple attempts and substantial code to block
additional calls to the RNG and spin loop properly when there were
error responses — and even then, we observed questionable results that
made us doubt our code. It’s no wonder developers are doing IoT RNG,
well, wrong, but more on that below.
So you might be wondering, “What makes this unique to the IoT? Does
this issue affect laptops and servers too?” The answer is that it’s a
unique issue to the IoT world because this sort of low-level device
management is usually handled by an operating system that’s notably
missing from typical IoT devices.
FIGURE 3: CSPRNG subsystem components
When an application needs a cryptographically secure random number on
a Linux server, it doesn’t read from the hardware RNG directly or make a
call to some HAL function and fight with error codes. No, it just reads
from /dev/urandom.
This is a cryptographically secure pseudo-random number generator
(CSPRNG) subsystem made available to applications as an API. There are
similar subsystems on every major operating system, too: Windows, iOS,
macOS, Android, BSD, you name it.
Importantly, calls to /dev/urandom
never fail and never block program execution. The CSPRNG subsystem can
produce an endless sequence of strong random numbers immediately. This
squarely solves the problem of HAL functions that either block program
execution or fail.
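For comparison, this is roughly all an application on a Linux-like system has to do. It is a minimal sketch using only standard C file I/O; the helper name read_urandom is ours. The checks are for a missing device node or a short read, a different kind of failure from a hardware RNG running dry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fill buf with len bytes from the kernel CSPRNG.
 * Returns 0 on success, -1 if the device cannot be opened or read fully. */
int read_urandom(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL) {
        return -1; /* e.g. no /dev/urandom in this environment */
    }
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}
```

A caller asking for a 256-bit key would pass a 32-byte buffer and check for a return value of 0.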
Another critical design feature of a CSPRNG subsystem is the entropy
pool. This is designed to take in entropy from a variety of sources,
hardware RNGs (HWRNGs) included. Due to the magic of the xor
operation,
all of these individually weak sources of entropy can be combined into
one strong one. The entropy pool also removes any single points of
failure among the entropy sources: in order to break the RNG, you’d need
to predict every entropy source simultaneously.
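The xor idea fits in a few lines. This sketch (pool_mix_xor is a name made up here) folds a sample into a pool byte by byte. The property it relies on is that the result is unpredictable as long as at least one input is unpredictable and independent of the others. Production CSPRNGs generally mix with a cryptographic hash or cipher rather than bare xor, so treat this as an illustration of the principle only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one entropy sample into the pool with xor. Both buffers are len bytes. */
void pool_mix_xor(uint8_t *pool, const uint8_t *sample, size_t len)
{
    assert(pool != NULL && sample != NULL);
    for (size_t i = 0; i < len; i++) {
        pool[i] ^= sample[i];
    }
}
```

Mixing in a completely broken source (all zeros) leaves the pool no worse off. Mixing the same sample in twice cancels it out, which is why the sources need to be independent.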
Designing an entire CSPRNG subsystem sounds really hard, especially
when your gadget isn’t using one of those new IoT operating systems.
Maybe it’s enough to just bite the bullet and spin loop on the RNG HAL
function. That way you’re always getting good random numbers, right?
Nobody writes source code entirely from scratch, especially in the
world of IoT devices. There’s always some reference code or example doc
that developers start from. Interfacing with hardware is tricky to get
right for any device, let alone one as finicky as a hardware RNG
peripheral.
PRNGs such as libc rand()
are
wildly insecure since the numbers they produce reveal information about
the internal state of the RNG. They’re fine for non-security-relevant
contexts because they’re fast and easy to implement. But using them for
things like encryption keys leads to catastrophic collapse of the
device’s security, as all of the numbers are predictable.
Unfortunately, many SDKs and operating systems that support hardware
RNGs use an insecure PRNG under the hood. The Contiki-ng IoT operating
system for Nordic Semiconductor’s nrf52840 SoC does precisely this by
seeding the insecure libc rand()
function with the hardware RNG:
void
random_init(unsigned short seed)
{
  (void)seed;
  unsigned short hwrng = 0;
  NRF_RNG->TASKS_START = 1;

  NRF_RNG->EVENTS_VALRDY = 0;
  while(!NRF_RNG->EVENTS_VALRDY);
  hwrng = (NRF_RNG->VALUE & 0xFF);

  NRF_RNG->EVENTS_VALRDY = 0;
  while(!NRF_RNG->EVENTS_VALRDY);
  hwrng |= ((NRF_RNG->VALUE & 0xFF) << 8);

  NRF_RNG->TASKS_STOP = 1;

  srand(hwrng);
}
FIGURE 4: Seeding libc rand() with the hardware RNG entropy
unsigned short random_rand(void) { return (unsigned short)rand(); }
FIGURE 5: Future calls to random_rand() call the insecure libc rand()
libc rand() should never be used to
generate secure values. The fact that the seed is created using the
hardware is irrelevant since an attacker can just derive or enumerate it
using untwister. What’s important is that when the user calls random_rand(),
the output will come from the insecure libc rand() call.
You can also see identical vulnerable behavior in the MediaTek Arduino code:
/**
 * @brief This function is to get random seed.
 * @param[in] None.
 * @return None.
 */
static void _main_sys_random_init(void)
{
#if defined(HAL_TRNG_MODULE_ENABLED)
    uint32_t seed;
    hal_trng_status_t s;

    s = hal_trng_init();
    if (s == HAL_TRNG_STATUS_OK) {
        s = hal_trng_get_generated_random_number(&seed);
    }
    if (s == HAL_TRNG_STATUS_OK) {
        srand((unsigned int)seed);
    }
FIGURE 6: Insecure libc rand() usage in MediaTek SDK
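To see why a hardware-derived seed doesn't rescue libc rand(), here is a sketch of seed enumeration. It assumes the attacker has observed a few consecutive raw rand() outputs and that the seed fits in 16 bits, as in the Contiki-ng snippet above. It is a brute-force stand-in for what a tool like untwister does, not a reimplementation of it, and recover_seed16 is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Try every 16-bit seed until rand() reproduces the observed outputs.
 * Returns a matching seed, or -1 if none of the 65,536 candidates fits. */
long recover_seed16(const int *observed, size_t n)
{
    assert(observed != NULL && n > 0);
    for (unsigned int seed = 0; seed <= 0xFFFFu; seed++) {
        srand(seed);
        size_t i = 0;
        while (i < n && rand() == observed[i]) {
            i++;
        }
        if (i == n) {
            return (long)seed;
        }
    }
    return -1;
}
```

Once a matching seed is found, the attacker calls srand() with it and replays the generator to predict every later output, including any keys derived from it.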
Sometimes, devices work in a very quirky way that is not at all
immediately obvious. Failing to properly account for these quirks can
lead to a catastrophic collapse of a device’s security. One example of
this is the LPC 54628.
When testing the LPC 54628, we noticed that we were getting extremely
poor-quality random numbers from the hardware RNG — so bad that we
suspected there might be something wrong with our tooling. Turns out we
were right. If you read the user manual carefully, on page 1,106 (of
1,152), you’ll notice the following instructions:
"The
quality of randomness (entropy) of the numbers generated by the Random
Number Generator relies on the initial states of internal logic.
If a 128 bit or 256 bit random number is required, it is not
recommended to concatenate several words of 32 bits to form the number. For example, if two 128 bit words are concatenated, the hardware RNG will not provide 2 times 128 bits of entropy.
…omitted for brevity…
To constitute one 128 bit number, a 32 bit random number is read, then the next 32 numbers are read but not used.
The next 32 bit number is read and used and so on. Thus 32 32-bit
random numbers are skipped between two 32-bit numbers that are used."
Clearly, it’s unacceptable for security to rely on such behavior.
Even if some developers get it right some of the time, having this kind
of an API is guaranteed to produce vulnerable devices given a large
enough audience.
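Read literally, the quoted procedure could be sketched like this. The reader callback, its context pointer, and counting_reader (a fake that returns 0, 1, 2, ... so the skipping is visible) are hypothetical; a real port would read the LPC's RNG register and handle its status, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that returns the next 32-bit word from the RNG. */
typedef uint32_t (*rng_read_fn)(void *ctx);

/* Fake reader for demonstration: returns 0, 1, 2, ... */
uint32_t counting_reader(void *ctx)
{
    uint32_t *next = ctx;
    return (*next)++;
}

/* Build a 128-bit value as the manual describes: keep one 32-bit word,
 * then read and discard the next 32 words, and repeat. Skipping after the
 * last kept word too keeps the spacing intact across consecutive calls. */
void read128_with_skips(rng_read_fn read_word, void *ctx, uint32_t out[4])
{
    assert(read_word != NULL && out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i] = read_word(ctx);
        for (int skip = 0; skip < 32; skip++) {
            (void)read_word(ctx);
        }
    }
}
```

That is 132 reads for 128 usable bits, which gives a sense of how easy it is to skip this step when the instruction sits on page 1,106.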
Okay, so suppose that you’ve gone through the trouble of auditing your device’s code to make sure that it’s actually using the hardware RNG, and you made sure to check any error conditions and spin until the device was ready, and you laboriously read through your device’s 1,000-page user manual to
make sure you caught any quirks…. Surely you must be safe, right? Not
even close.
One of the things we tested for when analyzing entropy is the
relative distribution of bytes produced by the RNG. In a perfect world,
we should expect each byte to be equally likely, so the distribution of
bytes ought to be a flat line (modulo any minor random blips, of course).
But that’s not what we see for the MT7697:
FIGURE 7: Histogram of the frequency of each byte 0 to 255 on the MediaTek 7697 SoC
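A byte-frequency check like this is easy to reproduce on a capture from your own device. The sketch below counts how often each byte value appears and adds a chi-square statistic as one way to boil the histogram down to a number. The chi-square step is our addition for illustration, not necessarily what was used to produce the figure. For a uniform source it should land near 255, the number of degrees of freedom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count occurrences of each byte value 0..255 in data. */
void byte_histogram(const uint8_t *data, size_t len, uint32_t counts[256])
{
    assert(data != NULL && counts != NULL);
    for (size_t v = 0; v < 256; v++) {
        counts[v] = 0;
    }
    for (size_t i = 0; i < len; i++) {
        counts[data[i]]++;
    }
}

/* Chi-square statistic against a uniform distribution over 256 values.
 * len is the total number of bytes that were counted. */
double chi_square_uniform(const uint32_t counts[256], size_t len)
{
    assert(counts != NULL && len > 0);
    double expected = (double)len / 256.0;
    double chi = 0.0;
    for (size_t v = 0; v < 256; v++) {
        double d = (double)counts[v] - expected;
        chi += d * d / expected;
    }
    return chi;
}
```

A perfectly flat histogram scores 0, and a stuck-at-zero RNG scores in the tens of thousands on just 256 bytes. Real captures fall in between, and a value far above 255 on a large sample is a red flag.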
The Nordic Semiconductor nrf52840 SoC’s hardware RNG exhibited a repeating 12-bit pattern of 0x000, occurring every 0x50 bytes:
FIGURE 8: Repeating 0x000 in the nrf52840 SoC
Unfortunately, we don’t have a nice diagram to show you for the STM32-L432KC,
but here are the results from dieharder, an industry-standard
statistical randomness testing tool:
#=============================================================================#
# dieharder version 3.31.1 Copyright 2003 Robert G. Brown #
#=============================================================================#
rng_name | filename |rands/second|
file_input_raw|STM32L432randData.4gb.32768trans.fullblocking.bin| 1.72e+07 |
#=============================================================================#
test_name |ntup| tsamples |psamples| p-value |Assessment
#=============================================================================#
rgb_minimum_distance| 0| 10000| 1000|0.00000000| FAILED
FIGURE 9: Sample of dieharder statistical analysis results for the STM32-L432KC SoC
One of the hard parts about this vulnerability is that it’s not a
simple case of “you zigged where you should have zagged” that can be
patched easily. In order to remediate this issue, a substantial and
complex feature has to be engineered into the IoT device.
Deprecate and/or disable any direct use of the RNG HAL function in
your SDK. Instead, include a CSPRNG API that is seeded using robust and
diverse entropy sources with proper hardware RNG handling. The Linux
kernel’s implementation of /dev/urandom
can serve as an excellent reference.