Hertzbleed: Turning Power Side-Channel Attacks Into Remote Timing Attacks on x86
zeynep at keemail.me
zeynep at keemail.me
Wed Jun 15 16:55:07 PDT 2022
Hertzbleed: Turning Power Side-Channel AttacksInto Remote Timing Attacks on x86
AbstractPower side-channel attacks exploit data-dependent varia-tions in a CPU’s power consumption to leak secrets. In thispaper, we show that on modern Intel (and AMD) x86 CPUs,power side-channel attacks can be turned into timing attacksthat can be mounted without access to any power measure-ment interface. Our discovery is enabled by dynamic voltageand frequency scaling (DVFS). We find that, under certaincircumstances, DVFS-induced variations in CPU frequencydepend on the current power consumption (and hence, data)at the granularity of milliseconds. Making matters worse,these variations can be observed by a remote attacker, sincefrequency differences translate to wall time differences!The frequency side channel is theoretically more powerfulthan the software side channels considered in cryptographicengineering practice today, but it is difficult to exploit becauseit has a coarse granularity. Yet, we show that this new channelis a real threat to the security of cryptographic software. First,we reverse engineer the dependency between data, power,and frequency on a modern x86 CPU—finding, among otherthings, that differences as seemingly minute as a set bit’sposition in a word can be distinguished through frequencychanges. Second, we describe a novel chosen-ciphertext at-tack against (constant-time implementations of) SIKE, a post-quantum key encapsulation mechanism, that amplifies a sin-gle key-bit guess into many thousands of high- or low-poweroperations, allowing full key extraction via remote timing.1 IntroductionPower-analysis attacks have been known for decades to bea powerful source of side channel information leakage. His-torically, these attacks were used to leak cryptographic se-crets from embedded devices like smart cards using physicalprobes [3,39,59,68,74,75]. Recently, however, power-analysisattacks have been shown to be exploitable also via softwarepower measurement interfaces. Such interfaces, available∗These authors contributed equally to this work.on many of today’s general-purpose processors, have beenabused to fingerprint websites [95], recover RSA keys [70],break KASLR [63], and even recover AES-NI keys [64].Fortunately, software-based power-analysis attacks can bemitigated and easily detected by blocking (or restricting [10])access to power measurement interfaces. Up until today, sucha mitigation strategy would effectively reduce the attack sur-face to physical power analysis, a significantly smaller threatin the context of modern general-purpose x86 processors.In this paper, we show that, on modern Intel (and AMD)x86 CPUs, power-analysis attacks can be turned into timingattacks—effectively lifting the need for any power measure-ment interface. Our discovery is enabled by the aggressive dy-namic voltage and frequency scaling (DVFS) of these CPUs.DVFS is a commonly-used technique that consists of dynami-cally adjusting CPU frequency to reduce power consumption(during low CPU loads) and to ensure that the system staysbelow power and thermal limits (during high CPU loads). Wefind that, under certain circumstances, DVFS-induced CPUfrequency adjustments depend on the current power consump-tion at the granularity of milliseconds. Therefore, since thepower consumption is data dependent, it follows transitivelythat CPU frequency adjustments are data dependent too.Making matters worse, we show that data-dependent fre-quency adjustments can be observed without the need for anyspecial privileges and even by a remote attacker. The reason isthat CPU frequency differences directly translate to executiontime differences (as 1 hertz = 1 cycle per second). The securityimplications of this finding are significant. For example, theyfundamentally undermine constant-time programming, whichhas been the bedrock defense against timing attacks since theirdiscovery in 1996 [58]. The premise behind constant-timeprogramming is that by writing a program to only use “safe”instructions, whose latency is invariant to the data values, theprogram’s execution time will be data-independent. With thefrequency channel, however, timing becomes a function ofdata—even when only safe instructions are used.Despite its theoretical power, it is not obvious how to con-struct practical exploits through the frequency side channel.This is because DVFS updates depend on the aggregate powerconsumption over millions of CPU cycles and only reflectcoarse-grained program behavior. Yet, we show that the fre-quency side channel is a real threat to the security of crypto-graphic software, by (i) reverse engineering a precise leakagemodel for this channel on modern x86 CPUs, and (ii) showingthat some cryptographic primitives admit amplification ofsingle key bit guesses into thousands of high- or low-poweroperations, enough to induce a measurable timing difference.To construct a leakage model, we reverse engineer the de-pendency between data being computed on and power con-sumption / frequency on modern x86 Intel CPUs. Our resultsreveal that power consumption and CPU frequency depend onboth the Hamming weight (HW) of data being processed andthe Hamming distance (HD) of data across computations. Weshow, for the first time, that these two effects are distinct andadditive on modern Intel CPUs. Further, the HW effect is nonuniform. That is, computing on data with the same HW resultsin differences in power consumption / frequency dependingon the position of individual 1s within data values. The take-away is that computing on data with different bit patternsdepending on a secret can result in different power consump-tions and frequencies depending on that secret. We expect thatthis information will also be useful towards building future,Intel-specific power leakage emulators [11,60,72,87,89]. Wefind that AMD x86 CPUs also feature a similar leakage model,but leave reverse engineering its details to future work.We then describe a novel attack, including new cryptana-lytic techniques, on two production-ready, constant-time im-plementations of SIKE (Supersingular Isogeny Key Encap-sulation [52]). SIKE is a decade old, widely studied key en-capsulation mechanism. Unlike other finalists in NIST’s Post-Quantum Cryptography competition, SIKE has both shortciphertexts and short public keys — and a “well-understood”side channel posture [20]. In our attack, we show that, whenprovided with a specially-crafted input, SIKE’s decapsula-tion algorithm produces anomalous 0 values that depend onsingle bits of the key. Worse so, these values cause the algo-rithm to get stuck and operate on intermediate values that arealso 0 for the remainder of the decapsulation. When this hap-pens, the processor consumes less power and runs at a higherfrequency than usual, and therefore decapsulation takes ashorter wall time. This timing signal is so robust that keyextraction is possible across a network, as we demonstratefor the SIKE implementations in both Cloudflare’s Interop-erable Reusable Cryptographic Library (CIRCL) [28] andMicrosoft’s PQCrypto-SIDH [66]. Our unoptimized versionof the attack recovers the full key from these libraries in 36and 89 hours, respectively. Finally, we show that the frequencyside channel can also be used to mount timing attacks withouta timer, such as a KASLR break and a covert channel.Disclosure We disclosed our findings, together with proof-of-concept code, to Intel, Cloudflare and Microsoft in Q3 2021and to AMD in Q1 2022. The attack was assigned CVE-2022-23823 and CVE-2022-24436 and held under embargo untilJune 14, 2022. Intel committed to awarding us a bug bounty.Cloudflare and Microsoft deployed a mitigation to CIRCLand PQCrypto-SIDH, respectively.2 Background and Related WorkIntel P-States In Intel processors, dynamic voltage and fre-quency scaling (DVFS) works at the granularity of P-states.P-states correspond to different operating points (voltage-frequency pairs) in 100 MHz frequency increments [49]. Thenumber of P-states varies across different CPU models. Mod-ern Intel processors offer two mechanisms to control P-states,namely SpeedStep and Speed Shift / Hardware Controlled Per-formance States (HWP). With SpeedStep, P-states are man-aged by the operating system (OS) using hardware coordina-tion feedback registers. With HWP, P-states are managed en-tirely by the processor, increasing the overall responsiveness.HWP was introduced with the Skylake microarchitecture [78].When HWP is enabled, the OS can only give hints to the pro-cessor’s internal P-state selection logic, including restrictingthe range of available P-states [91]. Otherwise, the availablerange of P-states depends only on the number of active coresand on whether “Turbo Boost” is enabled [55]. Our P-statenaming convention follows the one used in Linux [91].1 Thelowest P-state corresponds to the lowest supported CPU fre-quency. The highest P-state corresponds to the “max turbo”frequency for the processor. However, when Turbo Boost isdisabled, the highest available P-state is the base frequency.We use the term P-state and frequency interchangeably.P-state management is also related to power management.Each Intel processor has a Thermal Design Point (TDP), indi-cating the expected power consumption at steady state undera sustained workload [22, 40]. While in the max turbo mode,the processor can exceed its nominal TDP [47]. However, ifthe CPU hits a certain power and thermal limit while in maxturbo mode, the hardware will automatically downclock thefrequency to stay at TDP for the duration of the workload.Data-Dependent Power Consumption It is well-knownthat a processor’s power consumption depends on the databeing processed [46, 68]. The precise dependency betweendata and power consumption depends on the processor’s im-plementation, but can be approximated using leakage mod-els. Two commonly-used leakage models are the Hammingdistance (HD) [9, 61, 68, 71, 77] and the Hamming weight(HW) [56, 61, 67, 71, 73, 74, 88] models. In the HD model,power consumption depends to the number of 1 → 0 and0 → 1 bit transitions occurring in the data during a computa-tion. In the HW model, power consumption just depends onthe number of bits that are 1 in the data being processed.1However, Intel refers to higher frequencies as lower P-states [48, 50].Table 1: CPUs tested in our experimental setups.CPU Model Microarchitecture Cores BaseFrequencyMax TurboFrequencyi7-8700 Coffee Lake 6 3.20 GHz 4.60 GHzi7-9700 Coffee Lake Refresh 8 3.00 GHz 4.70 GHzi9-10900K Comet Lake 10 3.70 GHz 5.30 GHzi7-11700 Rocket Lake 8 2.50 GHz 4.90 GHzi7-10850H Ice Lake (mobile) 6 2.70 GHz 5.10 GHzi7-1185G7 Tiger Lake (mobile) 4 3.00 GHz 4.80 GHzPower Side-Channel Attacks Power side-channel attacksagainst cryptosystems were first publicly discussed by Kocherin 1998 [59]. His work introduced analytical techniques thatexploit the data dependency of power consumption to revealsecret keys. Following works demonstrated power-analysisattacks against several cryptographic algorithms includingAES [14, 67], DES [74], RSA [30, 75, 80, 94], and ElGa-mal [16,30].2 However, all these attacks were targeted againstsmart cards and required physical access to the device. Morerecently, power side-channel attacks have been applied also tomore complex devices such as smartphones [15,35,76,92,93]and PCs [36, 63, 64, 70, 95]. Some of these attacks rely onlyon software power measurement interfaces, meaning that theydo not need proximity to the device. However, while someof these works use the HW and HD leakage models [64, 70],none of them presents a systematic reverse engineering of thedependency between power consumption and data on modernIntel x86 CPUs. Further, all these attacks can be blocked byrestricting access to such power measurement interfaces.3 CPU Frequency Leakage ChannelIn this section, we analyze the leakage from CPU frequencyvariations on modern Intel processors. We show that, un-der certain circumstances, the distribution of a processor’sfrequencies leaks information about the instructions beingexecuted as well as the data being processed.Experimental Setup We run our experiments on severaldifferent machines. The characteristics of the CPU of eachmachine are reported in Table 1. All our machines run Ubuntuwith versions either 18.04 or 20.04, kernel either 4.15 or 5.4,and the latest microcode patches installed. Unless otherwisenoted, we use the default system configuration, without re-stricting the P-states. To monitor CPU frequency, we use theMSR_IA32_MPERF and MSR_IA32_APERF registers, as done inthe Linux kernel [62]. To monitor power consumption, weuse the MSRs of the RAPL interface, following Weaver [90].3.1 Distinguishing InstructionsAs a first step for our analysis, we set out to understand howrunning different workloads affects the P-state selection logic2For a comprehensive survey of these attacks, we refer to prior work [68].3.94.04.14.24.34.44.5Frequency (GHz)0 5 10 15Time (s)60708090100110120 Power (W)(a) Run of the int32-float test3.94.04.14.24.34.44.5Frequency (GHz)0 5 10 15Time (s)60708090100110120 Power (W)(b) Run of the int32 testFigure 1: Example of distinguishing workloads using fre-quency traces on our i7-9700 CPU. The lighter workload(int32) allows for longer runtimes at higher frequencies thanthe heavier workload (int32-float).of our CPUs. We pick two workloads from the stress-ngbenchmark suite [57]. The first workload consists of 32-bitinteger and floating-point operations (int32float method),while the second workload consists of only 32-bit integer oper-ations (int32 method). We run both benchmarks on all coresand starting from an idle machine. We sample the CPU fre-quency and the (package domain) power consumption every5 ms during the benchmark’s execution.Figure 1a shows the results for the int32float test on ouri7-9700 CPU. The frequency starts at 4.5 GHz, the highestP-state available when all cores are active on our CPU. Thisfrequency is sustained for about 8 seconds, during which thepower consumption is allowed to exceed the TDP. Then, theCPU drops to a lower P-state, bringing the power consumptiondown to TDP (65 W on our CPU). From there onwards, theCPU remains in steady state and power stays around the TDPlevel for the duration of the workload. In our example, atsteady state the frequency oscillates between two P-states,corresponding to the frequencies of 3.9 GHz and 4.0 GHz.Figure 1b shows the results for the int32 stress test. Heretoo, the frequency starts at 4.5 GHz and later drops to a lowerP-state. However, compared to Figure 1a, (i) the drop occurslater, after 10 seconds, and (ii) the P-states used after the dropare higher, corresponding to 4.0 GHz and 4.1 GHz. This isbecause the power consumption of the int32 test is lower. Asa consequence, not only can the processor sustain the highestavailable P-state for longer, but it can also use higher P-statesin steady state without exceeding the TDP.The key takeaway from the above results is that both (i)the time that a processor can spend at the maximum availableP-state and (ii) the distribution of P-states at steady state de-pend on the CPU power consumption. Since the CPU powerconsumption depends on the workload, by the transitive prop-erty it follows that P-states depend on the workload too. Thisimplies that dynamic scaling of P-states leaks informationabout the current workload running on the processor.4.3 4.4Frequency (GHz)0.00.20.40.60.8Probabilityhw=16hw=32hw=48(a) Frequency at steady state19.5 20.0 20.5Seconds before steady state0246Probability densityhw=16hw=32hw=48(b) Seconds before steady stateFigure 2: Distinguishing data (in the source register to a shlxinstruction) using frequency traces on our i7-9700 CPU. Fig-ure 2a is over 30,000 samples. Figure 2b is over 100 traces.3.2 Distinguishing DataWe saw that P-state information leaks information about theinstructions being executed (i.e., the workload). We now ex-plore if the frequency leakage channel can leak informationabout the data being processed by instructions. Our questionis motivated by the fact that power consumption on x86 pro-cessors is known to be data dependent [64]. It is thus naturalto ask: do data-dependent differences in power consumptionshow in the distribution of P-states?To answer this question, we monitor the CPU frequencywhile executing the same instructions and only changing thecontent of the input registers. For example, we use the shlxinstruction to continuously shift left the bits of a source regis-ter and write the result into different destination registers in aloop, while only varying the content of the source register. Werun this experiment on all cores and compare the distributionof the P-states in steady state. Figure 2a shows the resultswhen we set the content of the source register to have 16, 32or 48 ones. In all cases the P-state oscillated between 4.3 GHzand 4.4 GHz. However, the larger the Hamming weight, themore the frequency stayed at the lower P-state. We also sawa data-dependent difference in terms of when the frequencydrops to steady state if we start from idle (cf. Figure 2b). Thelarger the Hamming weight, the quicker the frequency dropsto steady state. This is because, as we show in Section 4, pro-cessing data with larger Hamming weights consumes morepower than processing data with lower Hamming weights.We get similar results with other instructions too. For ex-ample, we observed data-dependent effects when running or,xor, and, imul, add, sub, as well as when computing on dataloaded from memory. The only caveat is that, for some in-structions, the power consumption of just running the targetinstruction in a loop on all cores was not large enough tocause the P-state to ever drop to steady state. In these cases,we ran an additional, fixed workload in the background topush the total power consumption up.The key takeaway of the above results is that dynamicscaling of P-states leaks information about the data beingprocessed. In the following sections, we use the distributionof P-states at steady state as our leakage channel.4 CPU Frequency Leakage ModelWe saw that the power consumption and the distribution ofP-states in Intel CPUs depend on the data being processed.The goal of this section is to construct a leakage model of thisbehavior. To this end, we reverse engineer the dependency be-tween power consumption/frequency and data on the ALU ofmodern Intel CPUs. As we show in Section 5, this informationcan help an attacker construct side-channel attacks.Scope Precisely understanding where power is dissipatedas a function of data on general-purpose x86 processors is achallenging task. The reason is that the microarchitecture ofmodern x86 processors is (i) highly complex and (ii) largelyundocumented. Fortunately, studying the power consumptionacross all microarchitectural units is not necessary to buildattacks. This is because the vast majority of computationsperformed by modern, constant-time cryptographic softwareoccurs in the arithmetic logic unit (ALU). Since our primarygoal is to build a model that is useful to leak secrets fromconstant-time cryptographic code, the analysis in this sectionfocuses specifically on the ALU component.Methodology We use the experimental setup of Section 3.In each experiment, we run a fixed set of ALU instructions(the sender) in a loop on all cores, while varying the inputcontents. We carefully design our senders to target specificbehaviors and minimize side effects. First, to reduce powerconsumption from other core units such as the cache, wealways use register to register instructions without any mem-ory access. Second, to avoid any datapath contamination ef-fects caused by incrementing the loop counter variable andevaluating loop conditions, we run our sender in an infiniteloop that we manually terminate at the end of the experiment.Third, to avoid introducing unintended HD effects, we inter-leave different instructions in such a way that encourages fullthroughput on all available ports [1, 2]. Finally, we run eachsender in two setups. In the first setup, we use the defaultsystem configuration, warm up the machine until it enterssteady state, and monitor the frequency. In the second setup,we disable SpeedStep / HWP (this way, our processor staysat the base frequency for the duration of the workload) andmonitor the (core domain) power consumption. We samplepower/frequency every 1 ms, collect 30,000 data points foreach experiment and use their mean for our analyses.4.1 Hamming Distance (HD) EffectTo start, we set out to understand if the number of 1 → 0and 0 → 1 transitions affects power consumption / frequency.Recall that these transitions depend on the number of bitsthat differ (also known as the HD) between consecutive datavalues being processed. To study the dependency between HDand power consumption / frequency, we then need a senderthat offers fine-grained control over the number of transitions,rax = COUNTrbx = 0x0000FFFFFFFF0000loop: shlx %rax,%rbx,%rcx // rcx = rbx << rax shlx %rax,%rbx,%rdx // rdx = rbx << rax shrx %rax,%rbx,%rsi // rsi = rbx >> rax shrx %rax,%rbx,%rdi // rdi = rbx >> rax shlx %rax,%rbx,%r8 // r8 = rbx << rax shlx %rax,%rbx,%r9 // r9 = rbx << rax shrx %rax,%rbx,%r10 // r10 = rbx >> rax shrx %rax,%rbx,%r11 // r11 = rbx >> raxjmp loop(a) Sender for our HD experiments.rax = LEFTrcx = … = r11 = RIGHTloop: or %rax,%rcx // rcx = rax | rcx or %rax,%rdx // rdx = rax | rdx or %rax,%rsi // rsi = rax | rsi or %rax,%rdi // rdi = rax | rdi or %rax,%r8 // r8 = rax | r8 or %rax,%r9 // r9 = rax | r9 or %rax,%r10 // r10 = rax | r10 or %rax,%r11 // r11 = rax | r11jmp loop(b) Sender for our HW experiments.rax = rcx = rdx = rsi = rdi = FIRSTrbx = r8 = r9 = r10 = r11 = SECONDloop: or %rax,%rcx // rcx = rax | rcx or %rax,%rdx // rdx = rax | rdx or %rax,%rsi // rsi = rax | rsi or %rax,%rdi // rdi = rax | rdi or %rbx,%r8 // r8 = rbx | r8 or %rbx,%r9 // r9 = rbx | r9 or %rbx,%r10 // r10 = rbx | r10 or %rbx,%r11 // r11 = rbx | r11jmp loop(c) Sender for our HW+HD experiments.Figure 3: Different sets of instructions (senders) used to reverse engineer the dependency between data and power consumption /frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.0 5 10 15COUNT4.264.274.284.29Frequency (GHz)(a) Frequency vs COUNT0 5 10 15COUNT23.223.423.623.8Power (W)(b) Power vs COUNTFigure 4: Effect of increasing COUNT in Figure 3a’s senderon our i7-9700 CPU. Higher COUNT values cause higher HDsin the ALU output. As the HD increases, the mean power con-sumption grows and the mean steady-state frequency drops.while avoiding other potential side effects. For example, test-ing different HDs should not require changing the number of1s in the input (which, as we show below, is a separate effect).3We design our sender to use interleaved shlx and shrxinstructions, as shown in Figure 3a. These instructions shiftthe bits of the second source register to the left or right by aCOUNT value stored in the first source register. The result iswritten to a separate destination register. Since on our CPUsshlx and shrx execute on port 0 and port 6 [1], we interleavethem in groups of two. We fix the content of the second sourceregister to 0x0000ffffffff0000, corresponding to 16 zeros,followed by 32 ones, followed by 16 zeros. We then shift thisregister left and right by COUNT (with 0 ≤ COUNT ≤ 16).By construction, the HD in the ALU output between a shlxand a shrx is 4×COUNT. For example, when COUNT = 8,the output of each shlx is 0x00ffffffff000000, and theoutput of each shrx is 0x000000ffffffff00, translating to4×8 bit transitions in the ALU output. Yet, the ALU inputremains unchanged and the number of 1s in the source andthe destination registers is fixed.43This requirement implies that approaches such as using a xor instructionto cause bit transitions are not suitable, because triggering different numbersof transitions would also require using different numbers of 1s in the input.4The only other variable is the number of 1s in the COUNT register itself,Figure 4 shows the results when we vary the COUNT value.We see that the power consumption grows and the frequencydrops when COUNT grows, confirming that the number of bittransitions directly affects power consumption and frequency.In Appendix A.1, we corroborate this observation with an ad-ditional experiment where transitions occur in the ALU input.These results are consistent on all the CPUs of Table 1.1. Larger Hamming distances between data values beingprocessed contribute to larger power consumptions andlower steady-state frequencies.4.2 Hamming Weight (HW) EffectWe now set out to understand if the HW of the data values be-ing processed affects power consumption / frequency. Recallthat the idea behind the HW model is that power consumptiondepends on the number of 1s in the data being processed. Tostudy the dependency between HW and power consumption /frequency, we need a sender that offers fine-grained controlover the number of 1s, while avoiding other potential sideeffects. For example, testing different HWs should not requirebit transitions in the data (i.e., the HD effect).To satisfy the above requirements, we design a sender thatuses or logic instructions, as shown in Figure 3b. These in-structions perform a bitwise inclusive or operation betweenthe source register and the destination register, and store theresult in the destination register. We always use the sameinput and output registers for all the or instructions in theloop. We fix the content of the source register to LEFT, andset the initial content of the output register to RIGHT.By construction, the number of bit transitions occurring onthe ALU input and output during the execution of the abovesender is zero. The reason is that all or instructions take thesame inputs and produce the same output during an experi-ment. Hence, we can test different HW in the source registerswithout introducing any HD effects. An added benefit of us-ing or instructions is that they allow us to study the effectswhich varies between 1 and 4. However, this effect is negligible.0 20 40 60Hamming weight4.124.144.16Frequency (GHz)From LSBFrom MSB(a) Frequency vs HW0 20 40 60Hamming weight26.2526.5026.7527.0027.25Power (W)From LSBFrom MSB(b) Power vs HWFigure 5: Effect of varying the number of consecutive 1s inthe LEFT = RIGHT input to Figure 3b’s sender on our i7-9700CPU. As we increase the number of 1s, the mean power con-sumption grows and the mean steady-state frequency drops.of changing some bits of the input register (LEFT) withoutaffecting the contents of the output register (RIGHT). We usethis sender to perform multiple experiments.Consecutive 1s We start our analysis of the HW effect bychecking if the number of leading or trailing 1s in the data af-fects power consumption / frequency. We set LEFT = RIGHTsuch that the inputs and outputs of all or instructions are al-ways the same. We then run the sender with a varying HW inthe LEFT = RIGHT values. Figure 5 shows the results whenthe HW grows from 0 to 64, both when the 1s start from theleast significant bit (LSB) and when they start from the mostsignificant bit (MSB). In both cases, the power consumptiongrows and the frequency drops when the HW grows.2. A larger number of leading or trailing 1s in the datavalues being processed contributes to larger power con-sumptions and lower steady-state frequencies.We also see that the changes in power consumption andfrequency appear to be nonlinear. That is, the plots of Figure 5have a “bow” shape, suggesting that the HW effect is strongerfor the most significant 32 bits than for the least significant 32bits. For example, when the input is 0xffffffff00000000(HW=32, orange line), the HW effect is larger than when itis 0x00000000ffffffff (HW=32, blue line). This suggeststhat given data values with the same HW, their contributionpower / frequency may also depend on the position of 1s. Wethoroughly examine this observation later in this subsection.Non-consecutive 1s The above experiment shows thatpower consumption and frequency can depend on the HW ofthe data being processed. However, it only focuses on a bitpattern of consecutive 1s and 0s. In reality, 1s and 0s mightoccur in anywhere in the data. For our model to be useful, weneed to test if the HW effect applies to arbitrary bit patterns.To analyze the HW effect in the presence of non-consecutive 1s, we run a variant of our previous experiment,where we increase the HW at byte granularity. That is, webreak the 64-bit registers LEFT = RIGHT into 8 bytes and0 2 4 6 8Hamming weight4.124.144.16Frequency (GHz)(a) Frequency vs HW0 2 4 6 8Hamming weight26.2526.5026.7527.0027.25Power (W)(b) Power vs HWFigure 6: Effect of varying the number of non-consecutive 1sin the LEFT = RIGHT input to Figure 3b’s sender on our i7-9700 CPU. The results confirm that larger HWs cause higherpower consumptions and lower steady-state frequencies.0 1 2 3 4 5 6 7Byte index0.0080.0060.004 Frequency (GHz)(a) Effect of 0xFF to frequency0 1 2 3 4 5 6 7Byte index0.100.150.20 Power (W)(b) Effect of 0xFF to powerFigure 7: Effect of setting single bytes to 0xff in the LEFT =RIGHT input to Figure 3b’s sender on our i7-9700 CPU. Theeffect varies depending on the position of 1s within the inputs.HW differences in the MSBs have the strongest effect; HWdifferences in the bits right below 32 have the weakest effect.vary the HW within each byte. Increasing the HW withineach byte allows us to measure the impact of different num-bers of non-consecutive 1s. For example, when the HW foreach byte is 2, we set 2 bits of each byte to 1, for a total HWof 2×8 = 16. Figure 6 shows the results, clearly indicatingthat a larger number of non-consecutive 1s contributes to alarger power consumption and lower CPU frequency.3. A larger Hamming weight (number of 1s) in the datavalues being processed contributes to larger power con-sumptions and lower steady-state frequencies regardlessof whether the 1s are consecutive or not.Non-uniformity of the HW Effect To analyze the impactof the position of 1s within the data, we run another variantof our previous experiment. We break the 64-bit registersLEFT = RIGHT into 8 bytes. Each byte can be set to 0x00 (all0s) or 0xff (all 1s). When we target byte i, we fix the value ofthe other 7 bytes and compute the delta of power consumption/ frequency between setting byte i to 0xff and 0x00. For eachbyte, we repeat this test with all the 27combinations of theother 7 bytes. We compute the average and standard deviationof the deltas for each byte and show the result in Figure 7.We immediately see that the HW effect is non-uniformacross different bytes. At a high level, the 4 most significantbytes have a stronger HW effect than the 4 least significantbytes, and bytes closer to the 32nd bit have a weaker HWeffect than bytes farther from the 32nd bit. This is consis-tent with our previous observation that an input where themost significant 32 bits are 1 consumes more power than aninput where the least significant 32 bits are 1, even if theirHWs are the same. Further, the standard deviations are rel-atively small, suggesting that the HW effect of each byteis independent of the values of other bytes. For example,the power/frequency deltas between 0x0000ff0000000000and 0x000000000000000 are the same as the ones between0xff00ffff00ffffff and 0xff0000ff00ffffff. We sus-pect that these properties also hold a bit granularity, but areunable to confirm because it would require collecting data for264 bit combinations for a runtime of more than 1013 years.Note that the difference in the HW effect due to the positionof 1s is relatively small (e.g., ≤ 0.12 W in Figure 7b) com-pared to the difference in the HW effect due to the number of1s (e.g., ≤ 1.11 W in Figures 5b and 6b) and the HD effectdue to bit transitions (e.g., ≤ 0.75 W in Figure 4b).4. The HW effect is non-uniform. 1s in the most signifi-cant bytes affect power and frequency more than 1s inthe least significant bytes. Additionally, the HW effectat each byte is independent of the values of other bytes.The above experiments show that power consumption andfrequency depend both on the number and the positions of1s in the data being processed. However, both experimentswere designed using LEFT = RIGHT, meaning that all thesource and destination registers used by the sender duringan experiment were the same. It is then natural to ask: doesthe HW effect occur even when LEFT 6= RIGHT? To answerthis question, we repeated the above two experiments, butthis time set LEFT = 0 and only varied the HW of RIGHT.5Both experiments yielded results similar to the ones whereLEFT = RIGHT, albeit with smaller increments/decrements inpower/frequency. This result shows that the HW effect on anoperand is independent of the contents of other operands.5. The HW effect occurs on each operand independently.To sum up, the HW effect may be approximated as a linearcombination of two vectors. The first vector is the number of1s per byte, and the second vector is the non-uniform powerconsumption / frequency “cost” of 1s in that byte (based onthe deltas of Figure 7). In Appendix A.1 we discuss additionalexperiments in support of this model. We verified that thismodel applies to all the CPUs of Table 1. However, the non-uniform “costs” per byte of the HW effect can be differentacross CPU models. For example, in the 11th gen CPUs, theHW effect is more uniform compared to Figure 7.5Whether LEFT = 0 or LEFT = RIGHT, the result of the or is still RIGHT.0 20 40 60HW of SECOND3.984.004.024.04Frequency (GHz)ABCD(a) Frequency vs HW0 20 40 60HW of SECOND29.530.030.531.0Power (W)ABCD(b) Power vs HWFigure 8: Effect of increasing the HW of SECOND in Fig-ure 3c’s sender, while fixing FIRST to different values on ouri7-9700 CPU. Power consumption grows and steady-statefrequency drops when both HW and HD increase at the sametime (net effect of HW+HD). However, power consumptiondrops and steady-state frequency grows when HW incrementscorrespond to HD decrements (net effect of HW−HD).4.3 Additivity of the HW and HD EffectsFinally, we set out to understand if the HD and HW effects areadditive. To this end, we design our sender to use or instruc-tions with interleaved operand contents, as shown in Figure 3c.In this sender, half of the instructions computes FIRST|FIRSTand the other half computes SECOND|SECOND. We in-terleave these instructions in groups of four, since onour CPUs or instructions use four ports [1]. We thentest setting FIRST to be A = 0x000000000000ffff, B =0xffff000000000000, C = 0x00000000ffffffff, or D =0xffffffff00000000, and increase the HW of SECONDfrom 0 to 64, starting from the least significant bit.Figure 8 shows the results. Consider the case when FIRST= C. As the HW of SECOND increases from 0 to 32, the HDbetween FIRST and SECOND decreases, causing the powerconsumption to drop and the frequency to grow. However, asHW of SECOND increases from 32 to 64, the HD betweenFIRST and SECOND increases, causing the opposite effect.The slope between 0 and 32 is smaller than the one between32 and 64. This is because the former is a net effect of HWminus HD whereas the latter is a net effect of HD plus HW.For the other values of FIRST, we see analogous effects butwith different constant offsets. This result (consistent acrossthe CPUs of Table 1) shows that the HW and the HD effectscan simultaneously contribute to power and frequency.6. The HD and HW effects are additive and can simulta-neously contribute to differences in power consumptionand steady-state frequency.5 Remote Timing Attack on SIKEThe previous sections have shown that carefully crafted in-struction sequences can trigger data-dependent power con-sumption and frequency differences. In this section, we showthat the frequency side channel threat extends to in-the-wildsoftware. Specifically, we show how to use the frequency sidechannel, combined with novel cryptanalysis, for a full keyrecovery attack through remote timing on two production-ready, side-channel hardened implementations of Supersingu-lar Isogeny Key Encapsulation (SIKE) [52], a post-quantumkey encapsulation mechanism based on the SupersingularIsogeny Diffie-Hellman (SIDH) [53] key exchange protocol.Attack Model We assume a chosen-ciphertext attack model(CCA). The goal of the attacker (client) is to recover thestatic secret key used by the victim (server) to decapsulateciphertexts. The attacker can send many ciphertexts to thevictim, which always tries to compute the shared secret withthe decapsulation procedure using its static secret key.Attack Idea The server’s static secret key is an integer mwith bit expansion m = (m`−1,...,m0)2, where ` = 378 (forSIKE-751, the parameter selection we target in our experi-ments). During decapsulation, the server computes P+[m]Qfor elliptic curve points P and Q included in the ciphertext;the SIKE standard prescribes a particularly efficient algorithmfor evaluating this expression, the Montgomery three-pointladder [29]. We show that an attacker who knows the i leastsignificant bits of m can construct points P and Q such that:• If mi 6= mi−1, then the (i+1)st round of the Montgomerythree-point ladder produces an anomalous 0 value. Oncethat anomalous 0 value appears, the decapsulation algo-rithm gets stuck: every intermediate value produced for theremainder of the ladder is 0. Additionally, every intermedi-ate value produced for the function (isogeny computation)following the ladder is also 0.• If, however, mi = mi−1, or if the attacker was wrong aboutthe i least significant bits of m when constructing the chal-lenge ciphertext, then the (i+1)st round generates a non-0value. Heuristically, the remainder of the computation pro-ceeds without producing an anomalous 0 value except withnegligible probability.This observation is new, and it represents a core contributionof our work. Because SIKE is built on somewhat abstrusemath, we defer the details of how to construct points P and Qthat trigger an anomalous 0 value, and why a 0 value causesthe decapsulation algorithm to get stuck, to Section 5.3.The values operated on by SIKE decapsulation are large(a single element of the field underlying SIKE-751 takes188 bytes to express) and the operations themselves are com-plex: the inner loop of the Montgomery ladder comprisesthousands of lines of hand-optimized assembly. Nevertheless,in Section 5.1, we show that SIKE decapsulation behaveslike the much simpler, synthetic senders of Section 4. Whenmi 6= mi−1 and the decapsulation algorithm gets stuck, repeat-edly producing and operating on 0 values, the processor con-sumes less power and runs at a higher steady-state frequency(and therefore decapsulation takes a shorter wall time).Taken together, our findings mean that the server’s secretkey can be recovered by an adaptive chosen-ciphertext attack,using execution time as a side channel. Having extracted thefirst i bits of m, the adversary repeatedly queries the serverwith ciphertexts that should cause decapsulation to get stuckin the (i + 1)st round. If the server responds faster than abaseline (established through profiling), the adversary con-cludes that bit miis the opposite of bit mi−1; otherwise bitmiis the same. The attacker then proceeds to the next bit. InSection 5.2, we show that the timing signal is so robust thatkey extraction is possible across a network. We demonstratefull recovery of the (378-bit) private key from the SIKE-751implementations in two popular, production-ready crypto-graphic libraries: Cloudflare’s Interoperable Reusable Crypto-graphic Library (CIRCL) [28], written in Go, and Microsoft’sPQCrypto-SIDH [66], written in C. Both of libraries are hard-ened against previously known software side channels andmeant to run in constant time. Our attack is practical; an un-optimized version recovers the full key from a CIRCL serverin 36 hours and from a PQCrypto-SIDH server in 89 hours.5.1 P-State and SIKE implementationWe start by verifying that a correct key-bit guess in our chosen-ciphertext attack—one that causes the Montgomery ladderand the remainder of SIKE decapsulation to repeatedly pro-duce 0 values—causes the processor to execute at a higherfrequency than an incorrect key-bit guess does. Our local ex-periment uses 10 randomly generated SIKE-751 server keys.For each key m = (m`−1,...,m0)2, we target 4 out of the 378bit positions. We choose the target bit positions uniformly atrandom, to validate that the frequency difference is observableeven for bits accessed late in the Montgomery ladder loop.Suppose we target bit i in a secret key m. Provided thatmi 6= mi−1, we can craft a challenge ciphertext that will triggeran anomalous 0 value in the Montgomery ladder iteration thataccesses bit i. However, if mi = mi−1, then there is no chal-lenge ciphertext that can trigger the anomalous 0 value. Tomake sure we are measuring the effect of anomalous 0 values,and not some other unknown effect, we set up our experimentas follows. For each key m and each target bit index i, wecreate a variant key m0that agrees with m at every bit posi-tion except index i, where it has the opposite bit value.6Inother words, m0 =3.8 3.9 4.0Frequency (GHz)0.00.20.40.60.8Probabilitymi = mi 1 mi mi 130 35 40Power consumption (W)0.00.10.2Probability densitymi = mi 1 mi mi 1(a) CIRCL data3.6 3.7Frequency (GHz)0.000.250.500.751.00Probabilitymi = mi 1 mi mi 140 45Power consumption (W)0.00.10.20.3Probability densitymi = mi 1 mi mi 1(b) PQCrypto-SIDH dataFigure 9: Distribution of the power consumption and the fre-quency when the challenge ciphertext introduces an anoma-lous 0 (mi 6= mi−1) or not (mi = mi−1), using the setups fromSection 4 on our i7-9700 CPU. The results are over 10 ran-domly generated keys, where, for each key, we target 4 out ofthe 378 bit positions. For each key and each bit, we launch theserver with 300 goroutines (CIRCL) or pthreads (PQCrypto-SIDH), each of which handles a single decapsulation request.have elapsed. As in Section 4, we run each experiment intwo setups. In the first setup, we use the default system con-figuration, and monitor the steady-state CPU frequency. Inthe second setup, we disable SpeedStep/HWP (this way, ourCPU stays at the base frequency during the experiment) andmonitor (core domain) power consumption. We sample boththe CPU frequency and the power consumption every 1 ms.We group the measured data points according to whetherwe expect the challenge ciphertext to induce an anomalous 0or not. For each key m and target bit position i, exactly one ofm and m0contributes to the anomalous-0 grouping.The results, shown in Figures 9a and 9b, confirm thatthe steady-state frequency is higher and the power consump-tion is lower when an anomalous 0 is triggered (mi 6= mi−1)than when it is not (mi = mi−1), for both the CIRCL andthe PQCrypto-SIDH decapsulation servers. As noted above,both these libraries are hardened against previously knownsoftware side channels and meant to run in constant time.The signal we obtain from PQCrypto-SIDH is fainter thanthe one we obtain from CIRCL, because PQCrypto-SIDHuses a different strategy for Montgomery reduction that causesthe value 0 to be represented in memory sometimes as 0 andsometimes as a prime number of size 751 bits.5.2 SIKE Key Remote RecoveryWe now show that the secret-dependent power consumptionand frequency differences observed in Section 5.1 translate toa remotely observable secret-dependent timing difference.We configure a SIKE target server with a randomly gen-erated static 378-bit key for SIKE-7517, revealed for com-parison only after the attack completes. Our server accepts aclient decapsulation request over HTTP (Go) or TCP (C) andspawns a goroutine (Go) or pthread (C) to handle the request.The thread reads in the ciphertext and performs the decapsula-tion computation, after which it sends a message back to theclient indicating the establishment of a shared secret but noother information. The target server and the attacker are bothconnected to the same network, and we measure an averageround-trip time of 688 µs between the two machines.The attacker simultaneously sends n requests with a chal-lenge ciphertext meant to trigger an anomalous 0 and mea-sures the time t it takes to receive responses for all n re-quests. When an anomalous 0 is triggered, power decreases,frequency increases, SIKE decapsulation executes faster, and tshould be smaller. Based on the observed t and the previouslyrecovered secret key bits, the attacker can infer the value ofthe target bit, then repeat the attack for the next bit.For the attack to be successful, we must overcome a numberof practical difficulties. First, we must set a value for n, thenumber of requests, that allows us to observe a clear timingsignal when we trigger the anomalous 0s. We experimentallyfind an n big enough that the frequency increase is remotelyobservable, but not so big that we induce thrashing.Second, we must set a time cutoff to distinguish whenanomalous 0s are triggered and when they are not. To thisend, we collect the decapsulation times when querying theserver with a random ciphertext, and use these times to seta cutoff for queries not triggering anomalous 0s. We thenquery the server with the challenge ciphertexts for the firstfew bits of the key until we see a speedup compared to therandom ciphertext, and use these times to set a cutoff forqueries triggering anomalous 0s.Third, we must detect and recover from mistakes causedby random variations in the server’s decapsulation time. Re-call that a challenge ciphertext constructed using a wrongvalue for the i least significant bits of m will never triggeranomalous 0s regardless of the relationship of mi and mi−1.Measuring no timing reduction in many consecutive roundsis evidence either that many consecutive key bits all havethe same value (unlikely since key bits are independent anduniformly distributed), or that the value we are using for theleast significant bits of the key is wrong (cf. Appendix A.4).In our experiments, we backtrack when experiments for 40consecutive bit positions show no timing reduction.Finally, there is a chance that a challenge ciphertext con-structed as in Section 5.3.2 will accidentally trigger an anoma-lous 0 later in the decapsulation process even if it does not atthe target bit index i of the Montgomery ladder. This will hap-7The SIKE standard and the implementations we examined place thelong-term keypair in the 3-torsion and the ephemeral key used for forminga ciphertext in the 2-torsion, so this is the case we studied. A variant of ourattack applies also if the roles are swapped.650 660 670Time (ms)0.000.050.100.15Probability densitymi = mi 1 mi mi 1(a) CIRCL histogram1550 1560 1570 1580Time (ms)0.000.050.10Probability densitymi = mi 1 mi mi 1(b) PQCrypto-SIDH histogramFigure 10: Distribution of the timings measured by the at-tacker during the remote key extraction attack, with the serverrunning on an i7-9700 CPU. The attacker makes 300 (CIRCL)and 1000 (PQCrypto-SIDH) connections (all with the samechallenge ciphertext, constructed as per Section 5.3.2) andmeasures the time until the last connection completes. Wegroup the execution time (filtered) of each key bit extractionbased on whether it should have triggered an anomalous 0 inthe Montgomery ladder (i.e., whether mi = 1−mi−1 or not).pen with exponentially small probability for most bit indices,but larger probability for the last few bit indices. We defera detailed explanation to Appendix A.3. It may be possibleto avoid triggering this misbehavior with a different way ofconstructing the challenge key. We instead sidestep it by stop-ping our interaction with the server after extracting all but thelast 14 bits; we recover these last bits by brute-force search.Attack Setup We run the SIKE target server on our i7-9700CPU using the default system configuration. In the attackon CIRCL, the server is an HTTP server written using Go’snet.http library, which handles each request in a goroutine.In the attack on PQCrypto-SIDH, the server is a TCP serverwritten in C, which handles each request in a pthread.We configure the attacker to send n = 300 concurrent re-quests in the CIRCL case, and n = 1000 requests in thePQCrypto-SIDH case. In both cases, concurrent requests aresent all with the same challenge ciphertext (constructed asdescribed in Section 5.3.2), and the attacker measures thetime until the last connection completes. We experimentallydetermine the expected timings when the CPU frequency in-creases because of anomalous 0s and when it does not: forCIRCL, at most 660.2 ms and at least 662.5 ms, respectively;for PQCrypto-SIDH at most 1556 ms and at least 1558 ms,respectively. We repeat the measurement 400 times, excludeoutliers (CIRCL: below 650 ms or above 675 ms; PQCrypto-SIDH: below 1500 ms or above 1580 ms), compute the me-dian of the remaining values, and compare to the cutoffs. Ifthe result is inconclusive for a bit, we repeat the attack on thatbit. We use our side channel to extract the key up to bit 364and recover the last 14 bits by brute force search.Results In Figure 10a and Figure 10b, we show the timingdistribution of the 300-connection runs (CIRCL) and 1000-connection runs (PQCrypto-SIDH) respectively, grouped ac-0 3 6 9 12 15 18Secret key bit index660661662663Time (ms)mi mi 1mi = mi 1(a) CIRCL first 20 bits345 348 351 354 357 360 363Secret key bit index660661662663Time (ms)mi mi 1mi = mi 1(b) CIRCL last 20 bitsFigure 11: Median times used to extract the first 20 bits (0to 19) and the last 20 bits (345 to 364) of the key for thesame attack against CIRCL SIKE-751 as in Figure 10a. Thetimings depend on whether the challenge ciphertext triggeredan anomalous 0 (mi 6= mi−1) or not (mi = mi−1).0 3 6 9 12 15 18Secret key bit index1554155615581560Time (ms)mi mi 1mi = mi 1(a) PQCrypto-SIDH first 20 bits345 348 351 354 357 360 363Secret key bit index15561558Time (ms)mi mi 1mi = mi 1(b) PQCrypto-SIDH last 20 bitsFigure 12: Median times used to extract the first 20 bits (0 to19) and the last 20 bits (345 to 364) of the key for the sameattack against PQCrypto-SIDH SIKE-751 as in Figure 10b.The timings depend on whether the challenge ciphertext trig-gered an anomalous 0 (mi 6= mi−1) or not (mi = mi−1).cording to whether the challenge ciphertext of that run trig-gered an anomalous 0 (mi 6= mi−1) or not (mi = mi−1).For the first and the last 20 bit positions of the key that weextract by interacting with the server (bits 0–19 and 345–364,respectively), we plot, in Figure 11 (CIRCL) and Figure 12(PQCrypto-SIDH), the median time among the 400 measure-ments for that bit and whether the run triggered an anoma-lous 0 (mi 6= mi−1) or not (mi = mi−1) at that bit position. Thesignal is strong for both the top bits and the bottom bits.Both attacks successfully recovered the full secret key. Theattack on CIRCL completed in 36 hours, while the attack onPQCrypto-SIDH completed in 89 hours. We expect that the at-tack running time could be reduced with careful optimization.Unlike our attack on CIRCL, our attack on PQCrypto-SIDHmade 1 mistake and needed to backtrack; see Appendix A.4for our error correction strategy.5.3 Anomalous 0s in SIKE DecapsulationWe now explain how an attacker can construct SIKE cipher-texts that trigger an anomalous 0 in the (i + 1)st iterationof the Montgomery ladder when mi 6= mi−1, and why thatanomalous 0, once generated, causes the remainder of thedecapsulation algorithm to also produce 0s repeatedly.We briefly recall some relevant mathematical backgroundin Appendix A.2. We recommend that readers review a longerintroduction to the math behind elliptic curves, isogenies, andSIKE; Costello’s tutorial expositions of elliptic curves [18]and isogenies [19] are especially good choices.The first subroutine in the SIKE decapsulation algorithmrecovers (the Montgomery coefficient A of) the curve E00onwhich the points P, Q, and Q−P, included in the ciphertextprovided by the attacker, lie. This subroutine is fast and inde-pendent of the secret key; we do not consider it further.The second subroutine uses the Montgomery three-pointladder to compute P + [m]Q on the curve E00recovered bythe first subroutine. This is the subroutine in which a correctkey-bit guess (mi 6= mi−1) can trigger the generation of ananomalous 0 value. We explain how in Section 5.3.2.The third subroutine evaluates the isogeny correspondingto the point P + [m]Q, computing (the Montgomery coeffi-cient of) the curve E0e3that is the image of E00under thatisogeny. The fourth subroutine computes the j-invariant of thecurve E0e3; this j-invariant is the shared SIDH secret. In Sec-tion 5.3.3 and Appendix A.3, we explain how an anomalous 0value output by the Montgomery ladder causes the isogenyevaluation (third subroutine) and the j-invariant computation(fourth subroutine) to produce additional anomalous 0s.The final step in SIKE decapsulation is a Fujisaki–Okamotoconsistency check [31, 44] that checks that the ciphertext wasproperly generated. If the check fails, the recipient generatesa random session key instead of the one prescribed by the(invalid) ciphertext. The Fujisaki–Okamoto check immunizesSIKE against attacks, such as that due to Galbraith et al. [32],that require partial information about the j-invariant computedwhen decapsulating (invalid) ciphertexts.We do not claim to invalidate SIKE’s proof of security.None of the ciphertexts we construct in our attack passes theFujisaki–Okamoto check. Nevertheless, our attack recoversthe server’s secret key, because we obtain the informationwe need from the running time of the subroutines performedbefore the Fujisaki–Okamoto check.While our paper was under embargo (cf. Section 1), ourchosen-ciphertext attack triggering anomalous 0s in SIKEdecapsulation, described in this subsection, was independentlyrediscovered by De Feo et al. [25].5.3.1 Affine and Projective X-Coordinate Point Repre-sentations on Montgomery CurvesA Montgomery curve is defined by the equation EA,B : By2 =x3 + Ax2 + x, with parameters A,B ∈ Fp2 such that B(A2 −4) 6= 0. Montgomery curves have properties that make themsuitable for efficient, side-channel resistant implementations.In particular, many operations needed in cryptography canbe computed using just the x-coordinate of a point (ignoringthe y-coordinate) and just the curve parameter A (ignoringthe curve parameter B). The point with x- and y-coordinateAlgorithm 1: Three point ladder ( [52], Appendix A)1 function Ladder3ptInput: m = (m`−1,...,m0)2 ∈ Z, (xP, xQ, xQ−P),and (A : 1)Output:
Consider the algorithm xADD that, given points U,V, and Win projective x-coordinate form where W = U −V, computesthe point U +V in projective x-coordinate form, as:X ← ZW-(XU −ZU )(XV +ZV ) + (XU +ZU )(XV −ZV )2Z ← XW-(XU −ZU )(XV +ZV )−(XU +ZU )(XV −ZV )2.When U −V is any point except O or T, xADD(U,V,U −V)correctly returns U +V. However, when U −V is O or T,xADD(U,V,U −V) misbehaves and returns the invalid projec-tive representation (0 : 0) instead of U +V [21].9Worse, xADD(U,V,W) will also return (0 : 0) if called withany of U, V, or W equal to (0 : 0), regardless of the valueof the other two inputs.10 Repeated applications of xADD canthus get stuck at (0 : 0). We use exactly this fact for our attack.Suppose that we can arrange that, at the beginning of iter-ation k in Ladder3pt, (X2 : Z2) ∼ T, i.e., that X2 = 0 and Z2is nonzero. There are 2 cases to consider:• if mk = 1, then T will be passed into the third argument ofxDBLADD, triggering the misbehavior in xADD and causing(X1 : Z1) to be set to (0 : 0).• otherwise, if mk = 0, then T will instead be passed intothe second argument of xDBLADD. This will not triggerthe misbehavior in xADD and not produce (0 : 0) as anoutput. The point (X2 : Z2), which was equal to T, will beoverwritten with whatever xADD returns.In the first case, xADD will get stuck; the second elementof the tuple returned by xDBLADD will be (0 : 0) in everysubsequent iteration of Ladder3pt’s loop, and Ladder3ptwill eventually return (0 : 0). In the second case, it is likelythat 0 values will not recur during the ladder computation.It remains to show how the attacker can arrange for(X2 : Z2)to equal T at loop iteration k. Let 𝜇i = (mi−1,...,m0)2 rep-resent the least significant i bits of m. Algorithm Ladder3ptmaintains the invariant that, at the beginning of iteration i ofthe loop, the points (X0 : Z0), (X1 : Z1), and (X2 : Z2) satisfy(X0 : Z0) ∼ [2i]Q(X1 : Z1) ∼ P+[𝜇i]Q(X2 : Z2) ∼ (X0 : Z0)−(X1 : Z1) .Suppose that that the attacker, proceeding bit-by-bit, has ex-tracted 𝜇k. The attacker picks an arbitrary curve and sets Q tobe an arbitrary point on the curve.If mk−1 = 0, the attacker setsP ←-2k − 𝜇kQ−T . (1)9If U −V = O then U = V and therefore (XU : ZU ) ∼ (XV : ZV ). IfU −V = T then U = 𝜏T (V) where 𝜏T is the translation-by-T map; by aproperty of Montgomery curves, it follows that (XU : ZU ) ∼ (ZV : XV ).10In this case it does not matter— indeed, does not make sense to ask—whether W = U −V.Then, at iteration k of the Ladder3pt loop, we will have (X2 :Z2) ∼ T. If mk = 1, T will be passed as the third argumentto xDBLADD, triggering the misbehavior as described above.If mk−1 = 1, the attacker instead setsP ← T −-𝜇kQ . (2)Then, at iteration k of the Ladder3pt loop, we will have (X1 :Z1) ∼ T. If mk = 0, T will be passed as the third argumentto xDBLADD, triggering the misbehavior.To summarize, if mk 6= mk−1, the crafted input ciphertextwill trigger the anomalous 0 misbehavior.When generated according to the SIKE specification, Pand Q are always linearly independent points of order 3e3 andnever produce T or O during the execution of Ladder3pt.When generated according to our algorithm above but withan incorrect key-bit guess, we expect that T or O will beproduced only with negligible probability.11 This conjectureis supported by our experiments.5.3.3 Anomalous 0s in Isogeny Evaluation and j-Invariant CalculationThe next task in SIKE decapsulation, isogeny evaluation, iscarried out by algorithm 3_e_iso, which takes as input thepoint P + [m]Q (in projective x-coordinate form) as outputby Ladder3pt, expecting it to be a point of exact order 3e3 .In Appendix A.3, we show that, when invoked on the invalidinput (0 : 0), 3_e_iso and its subroutines repeatedly operateon and produce 0 values. Isogeny evaluation in 3_e_iso thusacts as an amplifier for the signal produced by the ladderevaluation in Ladder3pt, making it possible to observe evenan anomalous 0 produced in a late Ladder3pt loop iteration.After isogeny evaluation, the next task in SIKE decapsula-tion is j-invariant calculation, using algorithm jInvariant.When 3_e_iso returns (0 : 0), jInvariant is invoked withinput (0 : 0), every intermediate value it computes is 0, andits return value (the SIDH shared secret) is 0.125.4 MitigationsWe now describe the mitigation that Cloudflare and Microsoftdeployed after we disclosed our attack on SIKE.The mitigation, which was originally proposed by De Feoet al. [25], consists of validating that the ciphertext (publickey) consists of a pair of linearly independent points of thecorrect order 3e3 . This check is performed before running thethree-point ladder and prevents attack ciphertexts from beingfurther processed, thus hindering the attack. When runningdecapsulation on a single thread on our i7-9700 CPU, we11This fact allows us not only to distinguish a correct from an incorrect bitguess for bit mk but also to detect and recover from mistakes in determiningthe earlier bits 𝜇k; see Appendix A.4.12Note that this output depends on the result of inverting 0 in Fp2 in step 15of jInvariant. The Montgomery inversion algorithms in the implementa-tions we examined have 1/0 = 0 (see Savas and Koç [83]).found that the mitigation adds a performance overhead of 5%for CIRCL and of 11% for PQCrypto-SIDH.6 Timer-free AttacksWe now show that not only can we use the frequency sidechannel to turn power attacks into remote timing attacks (aswe saw in Section 5), but we can also use it to mount timingattacks without a timer. To this end, we use the frequency sidechannel to mount a KASLR break and a covert channel.KASLR Break Like prior work [12,13,37,43,45,51,63,64],the goal of the (unprivileged) attacker is to de-randomize thekernel base address. Knowledge of the kernel base address isuseful to mount memory corruption exploits.In Linux, the kernel text is placed at a 2 MB boundary in the0xffffffff80000000 – 0xffffffffc0000000 range [13].Hence, the kernel can be placed at one of 512 possible offsets.Prior work has shown that, on Intel and AMD processors,there is a timing and power consumption difference when ex-ecuting prefetch instructions on a memory address dependingon whether that address is mapped or not [43, 63]. This dif-ference can be used to infer the location of the kernel withinits predefined region. We show that this power consumptiondifference manifests also as a CPU frequency difference.To this end, we build a sender process similar to the onesof Figure 3, but using only prefetcht0 instructions. Whilethe sender runs, a separate thread measures the current CPUfrequency using the unprivileged scaling_cur_freq inter-face from the cpufreq driver. We ran the sender with allthe 512 possible kernel base addresses, for 10 different ran-domizations (i.e., repeating across 10 reboots) on our Inteli7-9700 CPU. In all 10 cases, we were able to identify thebase address successfully (as verified by checking the privi-leged /proc/kallsyms interface). We measured an averagesteady-state CPU frequency of 4.04 GHz when repeatedlyprefetching mapped addresses, and 4.24 GHz when repeat-edly prefetching unmapped addresses. The runtime of our un-optimized, proof-of-concept implementation is of 2 minutes.This runtime is larger than state-of-the-art KASLR breaks,but could be reduced with additional engineering effort.Covert Channel Like prior work, our covert channel usesa sender and a receiver. To transmit a 0, the sender executesa loop of or instructions with high HD and HW in their dataflow. This loop increases the power consumption and resultsin lower CPU frequency values. To transmit a 1, the senderexecutes a loop of shlx instructions with low HD and HW intheir data flow. This loop decreases the power consumptionand results in higher CPU frequency values. The receivermeasures the current CPU frequency using the unprivilegedscaling_cur_freq interface from the cpufreq driver.We evaluated our covert channel by transmitting 1 kB ofrandom data on our i7-9700 CPU. Our unoptimized, proof-of-concept implementation achieved a bandwidth of 30 bpswith an error rate of 0.03% (average across 10 runs). Thisbandwidth is similar to the one of prior covert channels relyingon software-based power measurement interfaces [63, 64].7 DiscussionAffected CPUs We successfully reproduced our attack onIntel CPUs from the 8th to the 11th generation of the Coremicroarchitecture (reported in Table 1). We also tested twodesktop CPUs from older generations, namely the i7-6700K(Skylake) and i7-7700K (Kaby Lake), and we found that bothmodels only support Turbo frequencies on single core work-loads: as soon as more than 1 core is active, the P-state iscapped at the base frequency. In our experiments, we werenot able to force the frequency into steady state (i.e., below themax turbo frequency) with single-core workloads, and weretherefore unable to reproduce our attack on these models.Besides CPUs from the (client-class) Core microarchitec-ture, our attack should also work on Intel Xeon CPUs (server-class) since they also use similar P-state management tech-niques. Additionally, other CPU vendors implement similarDVFS mechanisms and are likely vulnerable. For example,we verified that the AMD Ryzen processors are also vulnera-ble to our attack, featuring a similar HW/HD leakage modeland enabling the same SIKE vulnerability that we describedin Section 5. We leave reverse engineering the specific char-acteristics of the AMD leakage model to future work.Mitigating Leakage via the Frequency Channel Our at-tack is enabled by data-dependent frequency adjustments atsteady state. As we showed, the affected CPUs enter this statewhen certain power and thermal limits are hit during a work-load’s execution. Thus, one approach to mitigate the attack isto reduce the likelihood that the CPU hits these limits. Oneworkload-independent way to do so is to either disable TurboBoost, or to disable SpeedStep and HWP from the BIOS. Weverified that, with otherwise standard system configurations,both methods cause the frequency to stay fixed at the basefrequency during workload execution and never enter steadystate, preventing leakage via the frequency side channel. How-ever, this approach significantly reduces system performance.Moreover, this approach may not be sufficient on system con-figurations with custom power limits. Indeed, in concurrentwork, Liu et al. show that a privileged adversary can extractAES-NI keys using the frequency side channel after reducingthe power limits to fractions of their default values [65].Mitigating Leakage in Ciphers Another mitigation strat-egy consists of removing secret-dependent leakage in crypto-graphic software. For example, SIKE’s mitigation discussedin Section 5.4 hinders our attack by preventing attack cipher-texts from triggering secret-dependent computations on 0s.For cryptographic software in general, mitigating the powerleakage itself would naturally close the frequency channel.True decoupling would require that all operands have no sta-tistical correlation with secrets, which is only feasible withtechniques like fully homomorphic encryption. A more realis-tic approach takes advantage of the fact that it is not the powerusage of each operand that is leaked, but an average of thepower usage across all operands in a time period. This goalmay be achieved using masking/blinding techniques. Priorworks have introduced protocol-specific masking techniquesfor ciphers such as AES [8,38,82,86] and blinding techniquesfor elliptic-curve cryptography [54]. Automatic masking tech-niques have also been proposed either in software [7, 17, 27]or leveraging additional hardware support [26, 33, 41, 42, 79].However, masked/blinded implementations may still leak inpractice via power side channels [4, 5, 34, 69, 81, 84, 85].Future defenses could also examine the potential of fus-ing unrelated loops, vectorizing operations, or other meth-ods of interleaving different computations. These approachescould be done by combining multiple, normally sequential,computations in the program or by introducing an additionalcomplementary kernel. Effective blinding will require thatthe combined computation’s power trace is not related to anysecret computation. For example, if we can construct a bit-inverted version of a cryptographic kernel, we can interleavethe real kernel and the blinding kernel. Our model of HW andHD provides a starting point for future work on blinding.8 ConclusionWe discovered that in modern Intel (and AMD) x86 CPUs,DVFS-induced frequency variations depend on the currentpower consumption, and hence on the data being processed.We showed, for the first time, that the HD and HW of dataindividually and non-uniformly contribute to power consump-tion and frequency on modern x86 CPUs. We described anovel chosen-ciphertext attack against SIKE, which uses thisknowledge to leak full cryptographic keys via remote timing.The security implications of our findings are significant.Not only do they expand the attack surface of power side-channel attacks by removing the need for power measurementinterfaces, but they also show that, even when implementedas constant time, cryptographic code can still leak via remotetiming analysis. The takeaway is that current cryptographicengineering practices for how to write constant-time code areno longer sufficient to guarantee constant time execution ofsoftware on modern, variable-frequency processors.AcknowledgmentsThis work was funded in part through NSF grants 1942888and 1954521, and gifts from Google, Mozilla, and Qualcomm.Wang was partly supported by a Packard Fellowship (viaBrent Waters). We thank our shepherd Michael Schwarz andthe anonymous reviewers for their valuable feedback.AvailabilityWe have open sourced the code of all the experiments of thispaper at https://github.com/FPSG-UIUC/hertzbleed.References[1] Andreas Abel and Jan Reineke. uops.info: Characterizing latency,throughput, and port usage of instructions on Intel microarchitectures.In ASPLOS, 2019.[2] Andreas Abel and Jan Reineke. uiCA: Accurate throughput predictionof basic blocks on recent Intel microarchitectures. In ICS, 2022.[3] Ross Anderson. Security Engineering: A Guide to Building DependableDistributed Systems, 3rd Edition. John Wiley & Sons, 2020.[4] Josep Balasch, Benedikt Gierlichs, Vincent Grosso, Oscar Reparaz, andFrançois-Xavier Standaert. On the cost of lazy engineering for maskedsoftware implementations. In CARDIS, 2014.[5] Sven Bauer. Attacking exponent blinding in RSA without CRT. InCOSADE, 2012.[6] Daniel J. Bernstein and Tanja Lange. Montgomery curves and theMontgomery ladder. In Topics in Computational Number TheoryInspired by Peter L. Montgomery. Cambridge University Press, 2017.[7] Alex Biryukov, Daniel Dinu, Yann Le Corre, and Aleksei Udovenko.Optimal first-order boolean masking for embedded iot devices. InCARDIS, 2017.[8] Johannes Blömer, Jorge Guajardo, and Volker Krummel. Provablysecure masking of AES. In SAC, 2004.[9] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation poweranalysis with a leakage model. In CHES, 2004.[10] Len Brown. powercap: restrict energy meter to root access. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=949dd0104c496fa7c14991a23c03c62e44637e71, 2020. Accessed on Jun 7, 2022.[11] Ileana Buhan, Lejla Batina, Yuval Yarom, and Patrick Schaumont. SoK:Design tools for side-channel-aware implementions. In ASIACCS,2022.[12] Claudio Canella, Daniel Genkin, Lukas Giner, Daniel Gruss, MoritzLipp, Marina Minkin, Daniel Moghimi, Frank Piessens, MichaelSchwarz, Berk Sunar, Jo Van Bulck, and Yuval Yarom. Fallout: Leakingdata on Meltdown-resistant CPUs. In CCS, 2019.[13] Claudio Canella, Michael Schwarz, Martin Haubenwallner, MartinSchwarzl, and Daniel Gruss. KASLR: Break it, fix it, repeat. InASIACCS, 2020.[14] Suresh Chari, Charanjit Jutla, Josyula R Rao, and Pankaj Rohatgi. Acautionary note regarding evaluation of AES candidates on smart-cards.In AES2, 1999.[15] Yimin Chen, Xiaocong Jin, Jingchao Sun, Rui Zhang, and YanchaoZhang. POWERFUL: Mobile app fingerprinting via power analysis.In INFOCOM, 2017.[16] Jean-Sébastien Coron. Resistance against differential power analysisfor elliptic curve cryptosystems. In CHES, 1999.[17] Jean-Sébastien Coron, Johann Großschädl, Mehdi Tibouchi, andPraveen Kumar Vadnala. Conversion from arithmetic to boolean mask-ing with logarithmic complexity. In FSE, 2015.[18] Craig Costello. Pairings for beginners. Online: https://www.craigcostello.com.au/s/PairingsForBeginners.pdf, 2012.[19] Craig Costello. Supersingular isogeny key exchange for beginners. InSAC, 2019.[20] Craig Costello. The case for SIKE: A decade of the supersingularisogeny problem. Cryptology ePrint Archive, Report 2021/543, 2021.[21] Craig Costello and Benjamin Smith. Montgomery curves and theirarithmetic - the case of large characteristic fields. J. Cryptogr. Eng.,8(3), 2018.[22] Ian Cutress. Why Intel processors draw more power than expected:TDP and Turbo explained. https://www.anandtech.com/show/13544/why-intel-processors-draw-more-power-than-expected-tdp-turbo, 2018. Accessed on Jun 7, 2022.[23] Luca De Feo. Mathematics of isogeny based cryptography. Preprint,arXiv:1711.04062 [cs.CR], 2017.[24] Luca De Feo. Exploring isogeny graphs. Habilitation thesis, Universitéde Versailles Saint-Quentin-en-Yvelines, 2018.[25] Luca De Feo, Nadia El Mrabet, Aymeric Genêt, Novak Kaluderovi ¯ c,´Natacha Linard de Guertechin, Simon Pontié, and Élise Tasso. SIKEchannels. Cryptology ePrint Archive, Report 2022/054, 2022.[26] Elke De Mulder, Samatha Gummalla, and Michael Hutter. ProtectingRISC-V against side-channel attacks. In DAC. IEEE, 2019.[27] Hassan Eldib and Chao Wang. Synthesis of masking countermeasuresagainst side channel attacks. In CAV, 2014.[28] Armando Faz-Hernández and Kris Kwiatkowski. Introducing CIRCL:An Advanced Cryptographic Library. Cloudflare, 2019. https://github.com/cloudflare/circl. Accessed on Jun 7, 2022.[29] Armando Faz-Hernández, Julio López, Eduardo Ochoa-Jiménez, andFrancisco Rodríguez-Henríquez. A faster software implementation ofthe supersingular isogeny Diffie-Hellman key exchange protocol. IEEETransactions on Computers, 67(11), 2018.[30] Pierre-Alain Fouque and Frédéric Valette. The doubling attack–whyupwards is better than downwards. In CHES, 2003.[31] Eiichiro Fujisaki and Tatsuaki Okamoto. Secure integration of asym-metric and symmetric encryption schemes. Journal of Cryptology,26(1), 2013.[32] Steven D. Galbraith, Christophe Petit, Barak Shani, and Yan Bo Ti. Onthe security of supersingular isogeny cryptosystems. In ASIACRYPT,2016.[33] Si Gao, Johann Großschädl, Ben Marshall, Dan Page, Thinh Pham,and Francesco Regazzoni. An instruction set extension to supportsoftware-based masking. Cryptology ePrint Archive, Report 2020/773,2020.[34] Si Gao, Ben Marshall, Dan Page, and Elisabeth Oswald. Share-slicing:Friend or foe? TCHES, 2020.[35] Daniel Genkin, Lev Pachmanov, Itamar Pipman, Eran Tromer, andYuval Yarom. ECDSA key extraction from mobile devices via nonin-trusive physical side channels. In CCS, 2016.[36] Daniel Genkin, Itamar Pipman, and Eran Tromer. Get your hands offmy laptop: Physical side-channel key-extraction attacks on PCs. InCHES, 2014.[37] Enes Göktas, Kaveh Razavi, Georgios Portokalidis, Herbert Bos, andCristiano Giuffrida. Speculative probing: Hacking blind in the Spectreera. In CCS, 2020.[38] Jovan D Golic and Christophe Tymen. Multiplicative masking and ´power analysis of AES. In CHES, 2002.[39] Louis Goubin and Jacques Patarin. DES and differential power analysisthe “duplication” method. In CHES, 1999.[40] Corey Gough, Ian Steiner, and Winston Saunders. Energy EfficientServers: Blueprints for Data Center Optimization. Apress, 2015.[41] Hannes Groß, Manuel Jelinek, Stefan Mangard, Thomas Unterluggauer,and Mario Werner. Concealing secrets in embedded processors designs.In CARDIS, 2016.[42] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-orientedmasking: Compact masked hardware implementations with arbitraryprotection order. In TIS, 2016.[43] Daniel Gruss, Clémentine Maurice, Anders Fogh, Moritz Lipp, andStefan Mangard. Prefetch side-channel attacks: Bypassing SMAP andkernel ASLR. In CCS, 2016.[44] Dennis Hofheinz, Kathrin Hövelmanns, and Eike Kiltz. A modularanalysis of the Fujisaki-Okamoto transformation. In TCC, 2017.[45] Ralf Hund, Carsten Willems, and Thorsten Holz. Practical timing sidechannel attacks against kernel space ASLR. In S&P, 2013.[46] Intel. Running average power limit energy reporting / cve-2020-8694 ,cve-2020-8695 / intel-sa-00389. https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/running-average-power-limit-energy-reporting.html. Accessed on Jun 7, 2021.[47] Intel. Thermal design power (TDP) in Intel processors. https://www.intel.com/content/www/us/en/support/articles/000055611/processors.html. Accessed on Jun 7, 2022.[48] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual,June 2021.[49] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual,June 2021.[50] Intel. Power management - technology overview. https://builders.intel.com/docs/networkbuilders/power-management-technology-overview-technology-guide.pdf, 2021. Accessed on Jun7, 2022.[51] Yeongjin Jang, Sangho Lee, and Taesoo Kim. Breaking kernel addressspace layout randomization with Intel TSX. In CCS, 2016.[52] David Jao, Reza Azarderakhsh, Matthew Campagna, Craig Costello,Luca De Feo, Basil Hess, Amir Jalali, Brian Koziel, Brian LaMacchia,Patrick Longa, Michael Naehrig, Joost Renes, Vladimir Soukharev,David Urbanik, Geovandro Pereira, Koray Karabina, and AaronHutchinson. SIKE. Technical report, National Institute of Standardsand Technology, 2020.[53] David Jao and Luca De Feo. Towards quantum-resistant cryptosystemsfrom supersingular elliptic curve isogenies. In PQCrypto, 2011.[54] Marc Joye and Christophe Tymen. Protections against differentialanalysis for elliptic curve cryptography. In CHES, 2001.[55] Manuel Kalmbach, Mathias Gottschlag, Tim Schmidt, and Frank Bel-losa. TurboCC: A practical frequency-based covert channel with IntelTurbo Boost. Preprint, arXiv:2007.07046 [cs.CR], 2020.[56] Nikolaos Kavvadias, Periklis Neofotistos, Spiridon Nikolaidis, CA Kos-matopoulos, and Theodore Laopoulos. Measurements analysis of thesoftware-related power consumption in microprocessors. IEEE Trans-actions on Instrumentation and Measurement, 53(4), 2004.[57] Colin Ian King. stress-ng. https://github.com/ColinIanKing/stress-ng, 2022. Accessed on Jun 7, 2022.[58] Paul Kocher. Timing attacks on implementations of Diffie-Hellman,RSA, DSS, and other systems. In CRYPTO, 1996.[59] Paul Kocher, Joshua Jaffe, and Benjamin Jun. Differential power anal-ysis. In CRYPTO, 1999.[60] Yann Le Corre, Johann Großschädl, and Daniel Dinu. Micro-architectural power simulator for leakage assessment of cryptographicsoftware on ARM Cortex-M3 processors. In COSADE, 2018.[61] Sheayun Lee, Andreas Ermedahl, Sang Lyul Min, and Naehyuck Chang.An accurate instruction-level energy consumption model for embeddedRISC processors. ACM SIGPLAN Notices, 36(8), 2001.[62] Linux. aperfmperf.c. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/aperfmperf.c. Accessed on Jun 7, 2022.[63] Moritz Lipp, Daniel Gruss, and Michael Schwarz. AMD prefetchattacks through power and time. In USENIX Security, 2022.[64] Moritz Lipp, Andreas Kogler, David Oswald, Michael Schwarz, Cather-ine Easdon, Claudio Canella, and Daniel Gruss. PLATYPUS: Software-based power side-channel attacks on x86. In S&P, 2021.[65] Chen Liu, Abhishek Chakraborty, Nikhil Chawla, and Neer Roggel.Frequency throttling side-channel attack. Preprint, arXiv:2206.07012[cs.CR], 2022.[66] Patrick Longa. Post-quantum Cryptography. Microsoft, 2019. Avail-able at https://github.com/microsoft/PQCrypto-SIDH. Ac-cessed on Jun 7, 2022.[67] Stefan Mangard. A simple power-analysis (SPA) attack on implemen-tations of the AES key expansion. In ICISC, 2002.[68] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power AnalysisAttacks: Revealing the Secrets of Smart Cards, volume 31. SpringerScience & Business Media, 2008.[69] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Success-fully attacking masked AES hardware implementations. In CHES,2005.[70] Heiko Mantel, Johannes Schickel, Alexandra Weber, and FriedrichWeber. How secure is green IT? the case of software-based energy sidechannels. In ESORICS, 2018.[71] Rita Mayer-Sommer. Smartly analyzing the simplicity and the powerof simple power analysis on smartcards. In CHES, 2000.[72] David McCann, Elisabeth Oswald, and Carolyn Whitnall. Towardspractical tools for side channel aware software engineering:’greybox’modelling for instruction leakages. In USENIX Security, 2017.[73] Thomas Messerges. Using second-order power analysis to attack DPAresistant software. In CHES, 2000.[74] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Investigations ofpower analysis attacks on smartcards. In USENIX Smartcard, 1999.[75] Thomas Messerges, Ezzy Dabbish, and Robert Sloan. Power analysisattacks of modular exponentiation in smartcards. In CHES, 1999.[76] Yan Michalevsky, Aaron Schulman, Gunaa Arumugam Veerapandian,Dan Boneh, and Gabi Nakibly. PowerSpy: Location tracking usingmobile device power analysis. In USENIX Security, 2015.[77] Jeremy Morse, Steve Kerrison, and Kerstin Eder. On the limitations ofanalyzing worst-case dynamic energy of processing. ACM Transactionson Embedded Computing Systems (TECS), 17(3):1–22, 2018.[78] Hassan Mujtaba. [IDF15]Intel’s 6th gen Skylake unwrapped - CPUmicroarchitecture, Gen9 graphics core and Speed Shift hardware P-state. https://wccftech.com/idf15-intel-skylake-analysis-cpu-gpu-microarchitecture-ddr4-memory-impact/4/, 2015.Accessed on Jun 7, 2022.[79] Svetla Nikova, Christian Rechberger, and Vincent Rijmen. Thresholdimplementations against side-channel attacks and glitches. In ICICS,2006.[80] Roman Novak. SPA-based adaptive chosen-ciphertext attack on RSAimplementation. In PKC, 2002.[81] Kostas Papagiannopoulos and Nikita Veshchikov. Mind the gap: To-wards secure 1st-order masking in software. In COSADE, 2017.[82] Matthieu Rivain and Emmanuel Prouff. Provably secure higher-ordermasking of AES. In CHES, 2010.[83] Erkay Savas and Çetin Kaya Koç. Montgomery inversion. J. Cryptogr.Eng., 8(3), 2018.[84] Werner Schindler and Andreas Wiemers. Power attacks in the presenceof exponent blinding. J. Cryptogr. Eng., 4(4), 2014.[85] Werner Schindler and Andreas Wiemers. Generic power attacks onRSA with CRT and exponent blinding: new results. J. Cryptogr. Eng.,7(4), 2017.[86] Kai Schramm and Christof Paar. Higher order masking of the AES. InCT-RSA, 2006.[87] Madura A Shelton, Niels Samwel, Lejla Batina, Francesco Regazzoni,Markus Wagner, and Yuval Yarom. Rosita: Towards automatic elimina-tion of power-analysis leakage in ciphers. NDSS, 2021.[88] Ankush Varma, Eric Debes, Igor Kozintsev, and Bruce Jacob.Instruction-level power dissipation in the Intel XScale embedded mi-croprocessor. In Embedded Processors for Multimedia and Communi-cations II, 2005.[89] Nikita Veshchikov. SILK: high level of abstraction leakage simulatorfor side channel analysis. In PPREW, 2014.[90] Vince Weaver. Reading RAPL energy measurements from linux. http://web.eece.maine.edu/~vweaver/projects/rapl/. Accessedon Jun 7, 2022.[91] Rafael J. Wysocki. intel_pstate CPU performance scaling driver.https://www.kernel.org/doc/html/v4.19/admin-guide/pm/intel_pstate.html. Accessed on Jun 7, 2022.[92] Lin Yan, Yao Guo, Xiangqun Chen, and Hong Mei. A study on powerside channels on mobile devices. In Internetware, 2015.[93] Qing Yang, Paolo Gasti, Gang Zhou, Aydin Farajidavar, and KiranBalagani. On inferring browsing activity on smartphones via USBpower analysis side-channel. IEEE Trans. Inf. Forensics Secur., 12(5),2016.[94] Sung-Ming Yen, Wei-Chih Lien, SangJae Moon, and JaeCheol Ha.Power analysis by exploiting chosen message and internal collisions -vulnerability of checking mechanism for RSA-decryption. In Mycrypt,2005.[95] Zhenkai Zhang, Sisheng Liang, Fan Yao, and Xing Gao. Red alertfor power leakage: Exploiting Intel RAPL-induced side channels. InASIACCS, 2021.A AppendixA.1 Leakage Model—Additional ExperimentsHD in the ALU Input In Section 4.1, we saw that increas-ing the number of bit transitions in the ALU output causes anincrease in power consumption and a decrease in frequency.Here, we set out to understand if the same effect happens whenbit transitions occur in the ALU input. We need a sender thatoffers fine-grained control over the number of transitions inthe ALU input, while avoiding potential side-effects such asthe HW effect or bit transitions in the ALU output.We design a sender that is symmetric to the one of Figure 3a.Our sender still uses shlx and shrx instructions, as shownin Figure 13a. However, it is designed such that the outputof all shlx and shrx instructions is always the same, andonly their input varies as a function of COUNT. Hence, anyHD effect is caused by bit transitions on the ALU input only.For example, when COUNT = 8, the source register to eachshlx contains 0x000000ffffffff00, and the source registerto each shrx contains 0x00ffffffff000000, the alternationof which translates to a HD of 4×8 in the ALU input.Figure 14 shows the results for increasing COUNT values.We see that the power consumption grows and the frequencydrops when COUNT grows, confirming that the number ofbit transitions (i.e., the HD) in the ALU input directly affectspower consumption and CPU frequency. We also see that thechanges in power / frequency become more significant whenrax = COUNTrbx = 0x0000FFFFFFFF0000 >> COUNTrcx = 0x0000FFFFFFFF0000 << COUNTloop: shlx %rax,%rbx,%rdx // rdx = rbx << rax shlx %rax,%rbx,%rsi // rsi = rbx << rax shrx %rax,%rcx,%rdi // rdi = rcx >> rax shrx %rax,%rcx,%r8 // r8 = rcx >> rax shlx %rax,%rbx,%r9 // r9 = rbx << rax shlx %rax,%rbx,%r10 // r10 = rbx << rax shrx %rax,%rcx,%r11 // r11 = rcx >> rax shrx %rax,%rcx,%r12 // r12 = rcx >> raxjmp loop(a) Variant of sender for the HD experiments.rax = 1rsp = pointer_to_memoryrbx = … = r15 = INPUTloop: mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memory mov %rax,(%rsp) // store rax to memoryjmp loop(b) Sender for the HW at rest experiments.Figure 13: Additional sets of instructions (senders) used to reverse engineer the dependency between data and power consumption/ frequency on our CPUs. Different senders are designed to target different effects. Each sender can be run with variable inputs.0 5 10 15COUNT4.2754.2804.2854.2904.295Frequency (GHz)(a) Mean frequencies.0 5 10 15COUNT22.822.923.023.1Power (W)(b) Mean power consumptions.Figure 14: Effect of increasing COUNT in Figure 13a’s senderon our i7-9700 CPU. Higher COUNT values cause higher HDsin the ALU output. As the HD increases, the mean power con-sumption grows and the mean steady-state frequency drops.the COUNT > 8, as a result of the non-uniform HW cost ofhaving 1s closer to the MSB in the fixed source register rcx.Non-uniform HW In Section 4.2, we saw that the HW ef-fect it depends on the position of 1s in the data (i.e., it isnon-uniform). We now discuss two experiments that provideadditional evidence that the HW effect is non-uniform. Werefer to these experiments as shift0and shift1. Both experi-ments use the same sender of Section 4.2, shown in Figure 3b.In shift1, we fix the number of consecutive 1s and measurethe impact of changing the position of these consecutive 1sin the LEFT = RIGHT input, when all surrounding bits are0s. In shift0, we do the opposite: we fix the number of con-secutive 0s and measure the impact of changing the positionof these consecutive 0s in the LEFT = RIGHT input, whenall surrounding bits are 1s. By construction, since the HW isfixed and the sender does not introduce any HD effect, anydifferences in the results depend only on the position of 1s.We label different positions of the consecutive bit patternsbased on their “shift offset” starting from the LSB. For exam-ple, when the number of consecutive 1s in shift1is 32, a shiftoffset of 0 refers to input value 0x00000000ffffffff and ashift offset of 16 refers to input value 0x0000ffffffff0000.20 40 60Shift Offset4.124.134.144.15Frequency (GHz)8 ones16 ones32 ones48 ones(a) Frequency vs shift offset20 40 60Shift Offset26.426.626.827.027.2Power (W)8 ones16 ones32 ones48 ones(b) Power vs shift offsetFigure 15: Effect of shifting consecutive 1s in the LEFT =RIGHT input to Figure 3b’s sender on our i7-9700 CPU. Aswe shift the 1s towards the MSB, the mean power consump-tion grows and the mean steady-state frequency drops.Similarly, when the number of consecutive 0s in shift0is 32, ashift offset of 16 refers to input value 0xffff00000000ffff.Figure 15 shows the results for the shift1experiment whenwe fix the number of 1s to 8, 16, 32, or 48. Consider thecase when the number of 1s is 16. When the shift offset isbetween 0 to 16, we see almost no variation in the mean power/ frequency. This is because as we shift in this range, 1s arestill all the low 32 bits, and we know from Figure 7 that thereis little difference in the HW effect for 1s that are in the low32 bits. However, when the shift offset increases from 16 to48, the power consumption grows and the frequency drops.This is because we start gaining 1s in the high 32 bits andapproaching the MSB. This is consistent with what we sawin Figure 7, where 1s closer to the MSB have a stronger HWeffect than 1s closer to the 32nd bit. The results are similarwhen the number of 1s is 8. When the number of 1s is 32 or 48,the HW effect increases every time the shift offset increases.This is because, in these cases, shifting means that we lose 1sin the low 32 bits and gain 1s in the high 32 bits, and we knowfrom Figure 7 that 1s in the high 32 bits have a stronger HWeffect than 1s in the low 32 bits. The HW increments in thesecases are also more significant, because the delta between theHW effect of the bits we gain and the bits we lose is larger.20 40 60Shift Offset4.124.134.144.15Frequency (GHz)8 zeros16 zeros32 zeros48 zeros(a) Frequency vs shift offset20 40 60Shift Offset26.626.827.027.2Power (W)8 zeros16 zeros32 zeros48 zeros(b) Power vs shift offsetFigure 16: Effect of shifting consecutive 0s in the LEFT =RIGHT input to Figure 3b’s sender on our i7-9700 CPU. Aswe shift the 0s towards the MSB, the mean power consump-tion drops and the mean steady-state frequency grows.Figure 16 shows the results for the shift0experiment. Theseresults are symmetrical to the shift1 ones and can be explainedby the same reasons described for the shift1experiment.In summary, the shift0and shift1experiments support ourobservation that the HW effect is non-uniform.HW Root Cause In Section 4.3, we saw that the HD effectand the HW effect are additive. Recall that the HD effectis due to 1 → 0 and 0 → 1 bit transitions in the data beingprocessed. This is a well-understood effect in the literature,and can be attributed to the fact that when more bits flip duringa computation, more transistors are switched in the datapath,which causes dynamic power consumption to grow [46, 68].However, it is difficult to pinpoint the root cause of the HWeffect on x86 Intel CPUs. For example, it is unclear if the HWeffect occurs only when data is actively computed on, or if itis due to any data-dependent power cost of simply keepingdata stored inside registers. Our sender from Figure 3b cannotdistinguish between these two cases because it is designed tocontinuously compute on and overwrite identical data values.Here, we design a new sender to test if the HW effect occursalso when data values with different HWs are simply storedinto registers (at rest), but not actively computed on.Our sender, shown in Figure 13b, is designed as follows.First, it sets the content of rax to 1, rsp to a memory location,and all other architectural registers to a fixed INPUT value.Then, it enters an infinite loop of stores that write the contentof rax into the memory location pointed to by rsp.13By construction, the store operations in the loop are alwaysthe same and independent of the value of INPUT. Changingthe value of INPUT only affects the content of registers thatare initialized, but not actively computed on by the sender.Any difference in power consumption due to different INPUTvalues would then be due to HW effect at rest.Figure 17 shows the results when we increase the HW ofINPUT from 0 to 64. We see no differences in the mean powerconsumption and mean steady-state frequency when the HW13We use a store so that the register file is constantly being read from, inthe offchance an inactive register file could be powered down.0 20 40 60HW of INPUT4.334.344.354.364.37Frequency (GHz)(a) Frequency vs HW0 20 40 60HW of INPUT21.221.421.621.822.0Power (W)(b) Power vs HWFigure 17: Effect of increasing the HW of INPUT (at rest)in Figure 13b’s sender on our i7-9700 CPU. As we increaseHW from 0 to 64, the mean power consumption and the meansteady-state frequency do not change.grows. This result suggests that the HW effect does not occurwhen simply keeping data stored inside registers.A.2 Mathematical Preliminaries for SIKESIKE is an isogeny-based key encapsulation method which in-volves arithmetic operations of elliptic curves over finite fields.In particular, SIKE uses Montgomery elliptic curves. Its se-curity relies on the hardness of finding a specific isogeny be-tween two such elliptic curves. Here, we provide an overviewof the details of SIKE that are relevant to our attack.14Let p be a prime of the form 2e2 3e3 − 1. SIKE works inthe field Fp2 = Fp(i) with i2 = −1 (mod p) and uses the su-persingular elliptic curves over Fp2 that have (2e2 3e3 )2 points.The set of points P ∈ E(Fp) that satisfy [n]P = O is calledthe n-torsion of E. The curves of interest were chosen so thatthe entire (2e2 3e3 )-torsion is already defined over Fp2 , andwe have E[2e2 3e3 ] ∼= Z/(2e2 3e3 )Z×Z/(2e2 3e3 )Z; as a result,for each curve of interest, E[2e2 ] can be generated by linearcombinations of two points P2 and Q2 with coefficients in Fp2 ;and likewise E[3e3 ] can be generated by linear combinationsof two points P3 and Q3 with coefficients in Fp2 .An isogeny 𝜙: E1(Fp2 ) → E2(Fp2 ) is a group homomor-phism from E1(Fp2 ) to E2(Fp2 ) and a non-constant rationalmap defined over Fp2 that preserves the point at infinity O.The kernel of an isogeny is ker𝜙 = {P ∈ E1 : 𝜙(P) = O}.Every finite subgroup H of a curve E(Fp2 ) defines anisogeny 𝜙: E → E/H, unique up to isomorphism, such thatker𝜙 = H. The cardinality of H is also the degree of the ra-tional map 𝜙. Given H, Vélu’s algorithm allows the rationalmap for the isogeny corresponding to H to be computed; thecomputation is tractable when |H| is small.An `-isogeny is defined as 𝜙`: E → E/hPi, where P hasexact order `. The order of 𝜙(Q) in E/hPi is the same asthe order of Q in E unless Q lies above ker𝜙 (meaning that14For more information on SIKE, we refer to the SIKE tutorial byCostello [19] and to the SIKE specification [52]. For more information onelliptic curves and isogenies, we refer to the pairings tutorial by Costello [18]and to De Feo’s lecture notes [23] and habilitation thesis [24]. For moreinformation on Montgomery ladders, we refer to Bernstein and Lange [6].Algorithm 2: Computing and evaluating a 3e-isogeny,simple version ( [52], Appendix A)1 function 3_e_isoStatic parameters: Integer e3 from the publicparametersInput: Constants (A+24 : A−24) corresponding to acurve EA/C, (XS : ZS) where S has exactorder 3e3 on EA/COutput: (A+240: A−240) coresponding to the curveEA0/C0 = E/hSi1 for e = e3 −1 downto 0 by −1 do2 (XT : ZT ) ← xTPLe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/html
Size: 130459 bytes
Desc: not available
URL: <https://lists.cpunks.org/pipermail/cypherpunks/attachments/20220616/b9093ec2/attachment.txt>
More information about the cypherpunks
mailing list