A Deeper Look at Machine Learning-Based Cryptanalysis

Adrien Benamira1, David Gerault1,2, Thomas Peyrin1, and Quan Quan Tan1

1 Nanyang Technological University, Singapore
2 University of Surrey, UK
adrien002@e.ntu.edu.sg, dagerault@gmail.com, thomas.peyrin@ntu.edu.sg, quanquan001@e.ntu.edu.sg

Keywords: Differential Cryptanalysis, SPECK, Machine Learning, Deep Neural Networks, Interpretability

Abstract. At CRYPTO’19, Gohr proposed a new cryptanalysis strategy based on the utilisation of machine learning algorithms. Using deep neural networks, he managed to build a neural based distinguisher that surprisingly surpassed state-of-the-art cryptanalysis efforts on one of the versions of the well studied NSA block cipher SPECK (this distinguisher could in turn be placed in a larger key recovery attack). While this work opens new possibilities for machine learning-aided cryptanalysis, it remains unclear how this distinguisher actually works and what information the machine learning algorithm is deducing. The attacker is left with a black box that does not tell much about the nature of the possible weaknesses of the algorithm tested, while hope is thin as interpretability of deep neural networks is a well-known difficult task. In this article, we propose a detailed analysis and thorough explanations of the inherent workings of this new neural distinguisher. First, we studied the classified sets and tried to find some patterns that could guide us to better understand Gohr’s results. We show with experiments that the neural distinguisher generally relies on the differential distribution on the ciphertext pairs, but also on the differential distribution in penultimate and antepenultimate rounds.
In order to validate our findings, we construct a distinguisher for the SPECK cipher based on pure cryptanalysis, without using any neural network, that achieves basically the same accuracy as Gohr’s neural distinguisher and with the same efficiency (therefore improving over previous non-neural based distinguishers). Moreover, as another approach, we provide a machine learning-based distinguisher that strips down Gohr’s deep neural network to a bare minimum. We are able to remain very close to the accuracy of Gohr’s distinguishers using simple standard machine learning tools. In particular, we show that Gohr’s neural distinguisher is in fact inherently building a very good approximation of the Difference Distribution Table (DDT) of the cipher during the learning phase, and using that information to directly classify ciphertext pairs. This result allows a full interpretability of the distinguisher and represents on its own an interesting contribution towards interpretability of deep neural networks. Finally, we propose some methods to improve over Gohr’s work and possible new neural distinguisher settings. All our results are confirmed with experiments we have conducted on the SPECK block cipher (source code available online).

1 Introduction

While modern symmetric-key cryptography designs rely heavily on security by construction with strong security arguments (resistance against simple differential/linear attacks, study of algebraic properties, etc.), cryptanalysis remains a crucial part of a cipher’s validation process. Only a primitive that went through active and thorough scrutiny of third-party cryptanalysts should gain enough trust by the community to be considered as secure. However, there have been more and more cipher proposals in the past decade (especially with the recent rise of lightweight cryptography) and the cryptanalysis effort could not really keep up the pace: conducting cryptanalysis remains a very tough and low-rewarding task.
In order to partially overcome this shortage in cryptanalyst manpower, a recent trend arose of automating as much as possible the various tasks of an attacker. Typically, searching for differential and linear characteristics can now be modeled as Satisfiability/Satisfiability Modulo Theories [17] (SAT/SMT), Mixed Integer Linear Programming [18] (MILP) or Constraint Programming [25] (CP) problems, which can in turn simply be handled by an appropriate solver. The task of the cryptanalyst is therefore reduced to only providing an efficient modeling of the problem to be studied. Due to the impressive results considering the simplicity of the process, a lot of advances have been made in the past decade in this very active research field and this even improved the cipher designs themselves (how to choose better cryptographic bricks and how to assemble them has been made much easier thanks to these new automated tools).

One is then naturally tempted to push this idea further by even getting rid of the modeling part. More generally, can a tool recognize possible weaknesses/patterns in a cipher by just interacting with it, with as little input as possible from the cryptanalysts? One does not expect such a tool to replace a cryptanalyst’s job, but it might come in handy for easily pre-checking a cipher (or reduced versions of it) for possible weaknesses.

Machine learning and particularly deep learning have recently attracted a lot of attention, due to impressive advances in important research areas such as computer vision, speech recognition, etc. Some possible connections between cryptography and machine learning were already identified in [21] and we have seen many applications of machine learning for side-channel analysis [16]. However, machine learning for black-box cryptanalysis remained mostly unexplored until Gohr’s article presented at CRYPTO’19 [11].
In his work, Gohr trained a deep neural network on labeled data composed of ciphertext pairs: half the data coming from ciphering plaintext pairs with a fixed input difference with the cipher studied, half from random values. He then checks if the trained neural network is able to accurately separate real ciphertext pairs from random ones. Quite surprisingly, when applying his framework to the block cipher SPECK-32/64 (the 32-bit block 64-bit key version of SPECK [2]), he managed to obtain a good accuracy for a non-negligible number of rounds. He even managed to mount a key recovery process on top of his neural distinguisher, eventually leading to the current best known key recovery attack for this number of rounds (improving over works on SPECK-32/64 such as [6, 24]). Even if his distinguisher/key recovery attack had not improved over the state-of-the-art, the prospect of a generic tool that could pre-scan for vulnerabilities in a cryptographic primitive (while reaching an accuracy close to existing cryptanalysis) would have been very attractive anyway.

Yet, Gohr’s paper actually opened many questions. The most important, listed by the author as an open problem, is the interpretability of the distinguisher. An obvious issue with a neural distinguisher is that its black-box nature does not tell us much about the actual weakness of the cipher analyzed. More generally, interpretability for deep neural networks has been known to be a very complex problem and represents a key challenge for the machine learning community. At first sight, it therefore seems very difficult to make any advances in this direction.

Another interesting aspect to explore is to try to match Gohr’s neural distinguisher/key recovery attack with classical cryptanalysis tools. It remains very surprising that a trained deep neural network can perform better than the scrutiny of experienced cryptanalysts.
As remarked by Gohr, his neural distinguisher is mostly differential in nature (on the ciphertext pairs), but some unknown extra property is exploited. Indeed, as demonstrated by one of his experiments, the neural distinguisher can still distinguish between a real and a random set that have the exact same differential distribution on the ciphertext pairs. Since we know there is some property that researchers have not seen or exploited, what is it?

Finally, a last natural question is: can we do better? Are there some better settings that could improve the accuracy of Gohr’s distinguishers?

Our Contributions. In this article, we analyze the behavior of Gohr’s neural distinguishers when working on the SPECK-32/64 cipher. We first study in detail the classified sets of real/random ciphertext pairs in order to get some hints on what criterion the neural network is actually basing its decisions on. Looking for patterns, we observe that the neural distinguisher is very probably deducing some differential conditions not on the ciphertext pairs directly, but on the penultimate or antepenultimate rounds. We then conduct some experiments to validate our hypothesis.

In order to further confirm our findings, we construct for 5, 6 and 7-round reduced SPECK-32/64 a new distinguisher purely based on cryptanalysis, without any neural network or machine learning algorithm, that matches the accuracy of Gohr’s neural distinguisher while actually being faster and using the same amount of precomputation/training data. In short, our distinguisher relies on selective partial decryption: in order to attack nr rounds, some hypothesis is made on some bits of the last round subkey and partial decryption is performed, eventually filtered by a precomputed approximated DDT on nr − 1 rounds.

We then take a different approach by tackling the problem not from the cryptanalysis side, but the machine learning side.
More precisely, as a deep learning model learns high-level features by itself, in order to reach full interpretability we need to discover what these features are. By analyzing the components of Gohr’s neural network, we managed to identify a procedure to model these features, while retaining almost the same accuracy as Gohr’s neural distinguishers. Moreover, we also show that our method performs similarly on other primitives by applying it on the SIMON block cipher. This result is interesting from a cryptography perspective, but also from a machine learning perspective, showing an example of interpretability by transformation of a deep neural network.

Finally, we explore possible improvements over Gohr’s neural distinguishers. By using batches of ciphertexts instead of pairs, we are able to significantly improve the accuracy of the distinguisher, while maintaining identical experimental conditions.

Outline. In Section 2, we introduce notations as well as basic cryptanalysis and machine learning concepts that will be used in the rest of the paper. In Section 3, we describe in more detail the various experiments conducted by Gohr and the corresponding results. We provide in Section 4 an explanation of his neural distinguishers as well as the description of an actual cryptanalysis-only distinguisher that matches Gohr’s accuracy. We propose in Section 5 a machine learning approach to enable interpretability of the neural distinguishers. Finally, we study possible improvements in Section 6.

2 Preliminaries

Basic notations. In the rest of this article, ⊕, ∧ and ⊞ will denote the eXclusive-OR operation, the bitwise AND operation and the modular addition (see footnote 3) respectively. A right/left bit rotation will be denoted as ≫ and ≪ respectively, while a||b will represent the concatenation of two bit strings a and b.
2.1 A Brief Description of SPECK

The lightweight family of ARX block ciphers SPECK was proposed by the US National Security Agency (NSA) [2] in 2013, mainly targeting good performance on microcontrollers. Several versions of the cipher have been proposed within its family, but in this article (and in Gohr’s work [11]) we will focus mainly on SPECK-32/64, the 32-bit block 64-bit key version of SPECK, which is composed of 22 rounds (for simplicity, SPECK-32/64 will be referred to as SPECK in the rest of the article).

The 32-bit internal state is divided into a 16-bit left and a 16-bit right part, that we will generally denote l_i and r_i at round i respectively, and is initialised with the plaintext (l_0||r_0) ← P. The round function of the cipher is then a very simple Feistel structure combining bitwise XOR operation and 16-bit modular addition. See Figure 1, where k_i represents the 16-bit subkey at round i and where α = 7, β = 2. The final ciphertext C is then obtained as C ← (l_22||r_22). The subkeys are generated with a key schedule that is very similar to the round function (we refer to [2] for a complete description, as we do not make use of the details of the key schedule in this article).

\[ l_{i+1} = ((l_i \gg \alpha) \boxplus r_i) \oplus k_i, \qquad r_{i+1} = (r_i \ll \beta) \oplus l_{i+1} \]

Fig. 1: The SPECK-32/64 round function.

Footnote 3: The modulo will be stated explicitly if it is not clear from the context.

2.2 Differential Cryptanalysis

Differential cryptanalysis studies the propagation of a difference through a cipher. Let f : F_2^b → F_2^b be a function and let x, x′ be two different inputs for f with a difference ∆x = x ⊕ x′. Let y = f(x) and y′ = f(x′), with a difference ∆y = y ⊕ y′.
Then, we are interested in the transition probability from ∆x to ∆y through f:

\[ P(\Delta x \xrightarrow{f} \Delta y) := \frac{\#\{x \mid f(x) \oplus f(x \oplus \Delta x) = \Delta y\}}{2^{b}} \]

One classical tool for differential cryptanalysis is the Difference Distribution Table (DDT), which simply lists the differential transition probabilities for each possible input/output difference pair (∆x, ∆y). The studied function f is usually some Sbox, or some small cipher sub-component, as the DDT of an entire 64-bit or 128-bit cipher would obviously be too large to store.

Since SPECK is internally composed of a left and right part, for a ciphertext C we will denote by C_l and C_r its 16-bit left and right parts respectively. Then, for two ciphertexts C and C′, we will denote by ∆L the XOR difference C_l ⊕ C′_l between the left parts of the two ciphertexts (respectively ∆R = C_r ⊕ C′_r for the right parts). Moreover, for a round i, we will denote by V_i the difference between the two parts of the internal state, V_i = l_i ⊕ r_i.

2.3 Deep Neural Networks

Deep Neural Networks (DNN) are a family of non-linear machine learning classifiers that have gained popularity since their success in addressing a variety of data-driven tasks, such as computer vision, speech recognition, etc.

The main problem tackled by DNN is, given a dataset D = {(x_0, y_0) ... (x_n, y_n)}, with x_i ∈ X being samples and y_i ∈ [0, . . . , l] being labels, to find the optimal parameters θ* for the DNN_θ model, with the parameters θ such that:

\[ \theta^{*} = \operatorname*{argmin}_{\theta} \sum_{i=0}^{n} L(y_i, DNN_{\theta}(x_i)) \qquad (1) \]

with L being the loss function. As there is no closed-form expression of θ*, the approximate solution will depend on the chosen optimization algorithm, such as stochastic gradient descent. Moreover, hyper-parameters of the problem (parameters whose value is used to control the learning process) need to be adjusted as they play an important role in the final quality of the solution.
DNN are powerful enough to derive accurate non-linear features from the training data, but these features are not robust. Indeed, adding a small amount of noise at the input can cause these features to deviate and confuse the model. In other words, the DNN is a classifier with very low bias, but high variance.

Different blocks can be used to implement these complex models. However, in this paper, we will be using four types of blocks: the linear neural network, the one-dimensional convolutional neural network, the activation functions (ReLU and sigmoid) and the batch normalization.

Linear neural network. Linear neural networks apply a linear transformation to the incoming data: out = in · A^T + b. Here we have θ = (A, b). The linear neural network is also commonly named perceptron layer or dense layer.

One-dimensional convolutional neural network. The 1D-CNN applies a convolution over a fixed (multi-)temporal input signal. The 1D-CNN operation can be seen as multiple linear neural networks (one per filter) where each one is applied to a sub-part of the input. This sub-part is sliding: its size is the kernel size, its pitch is the stride and its start and end points depend on the padding.

Activation functions. The three activation functions that we discuss here are the Rectified Linear Unit (ReLU), the sigmoid and the Heaviside step function, defined as:

\[ \mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{Sigmoid}(x) = \sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad H(x) = \frac{1}{2} + \frac{\mathrm{sgn}(x)}{2} \]

This block, added between each layer of the DNN, introduces the non-linear part of the model.

Batch normalization. Training samples are typically randomly collected in batches to speed up the training process. It is thus usual to normalize the overall tensor according to the batch dimension.
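As a compact illustration of Sections 2.1 and 2.2, the sketch below implements the round function of Figure 1 and estimates a transition probability by sampling. It is a minimal sketch and not the authors' code: the helper names (`enc_round`, `dec_round`, `transition_prob`) are hypothetical, and the subkeys are drawn independently at random instead of coming from the real key schedule, which is an assumption made for brevity.

```python
import random

WORD = 16                      # SPECK-32/64 uses 16-bit words
MASK = (1 << WORD) - 1
ALPHA, BETA = 7, 2             # rotation constants given in Section 2.1


def ror(x, r):
    """Rotate a 16-bit word right by r bits."""
    return ((x >> r) | (x << (WORD - r))) & MASK


def rol(x, r):
    """Rotate a 16-bit word left by r bits."""
    return ((x << r) | (x >> (WORD - r))) & MASK


def enc_round(l, r, k):
    """One round: l' = ((l >>> alpha) + r) xor k, r' = (r <<< beta) xor l'."""
    l = ((ror(l, ALPHA) + r) & MASK) ^ k
    r = rol(r, BETA) ^ l
    return l, r


def dec_round(l, r, k):
    """Inverse of enc_round for the same subkey."""
    r = ror(r ^ l, BETA)
    l = rol(((l ^ k) - r) & MASK, ALPHA)
    return l, r


def encrypt(l, r, subkeys):
    """Apply one round per subkey (subkeys are assumed independent here)."""
    for k in subkeys:
        l, r = enc_round(l, r, k)
    return l, r


def transition_prob(delta_in, delta_out, rounds, samples=2000, seed=0):
    """Sampled estimate of P(delta_in -> delta_out) over random inputs and keys."""
    rng = random.Random(seed)
    dl, dr = delta_in
    hits = 0
    for _ in range(samples):
        keys = [rng.getrandbits(WORD) for _ in range(rounds)]
        l, r = rng.getrandbits(WORD), rng.getrandbits(WORD)
        c = encrypt(l, r, keys)
        c2 = encrypt(l ^ dl, r ^ dr, keys)
        if (c[0] ^ c2[0], c[1] ^ c2[1]) == delta_out:
            hits += 1
    return hits / samples
```

For example, `transition_prob((0x0040, 0x0000), (0x8000, 0x8000), rounds=1)` returns 1.0, since a difference located only in the most significant bit of an addend passes through the modular addition deterministically.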
3 A First Look at Gohr’s CRYPTO 2019 Results

Since its release, the lightweight block cipher SPECK attracted a lot of external cryptanalysis, together with its sibling SIMON (this was amplified by the fact that no cryptanalysis was reported in the original specifications document [2]). Many different aspects of SPECK have been covered by these efforts, but the works from Dinur [6] and Song et al. [24] are the most successful advances on its differential cryptanalysis aspect so far. Dinur [6] studied all versions of SPECK, improving the best known differential characteristics (from [1, 3]) as well as describing a new key recovery strategy for this cipher. In particular, he devised a 4-round attack for 11 rounds of SPECK-32/64 using a 7-round differential characteristic, that has a time complexity of 2^46 and data complexity of 2^22 (chosen plaintexts).

Later, at CRYPTO’19, Gohr published a cryptanalysis work on SPECK-32/64 that is based on deep learning [11]. Gohr proposed a key-recovery attack on 11-round SPECK-32/64 with estimated time complexity 2^38, improving the previous best attack [6] in 2^46, albeit with a slightly higher data complexity: 2^14.5 ciphertext pairs required. In this section, we will briefly review Gohr’s results [11].

Overview. In his article, Gohr proposes several differential cryptanalyses of SPECK, focusing on the input difference ∆in = 0x0040/0000. In this setting, the aim is to distinguish real pairs, i.e., encryptions of plaintext pairs P, P′ such that P ⊕ P′ = ∆in, from random pairs, which are the encryptions of random pairs of plaintexts with no fixed input difference. Gohr compares a traditional (pure) differential distinguisher with a distinguisher based on a DNN for 5 to 8 rounds of SPECK-32/64 and shows that the DNN performs better.

Pure differential distinguishers. Gohr computed the full DDT for the input difference ∆in, using the Markov assumption.
Then, to classify a ciphertext pair (C, C′), the probability p of the output difference C ⊕ C′ is read from the DDT and compared to the uniform probability. Let ∆out = C ⊕ C′, then

\[ \text{Classification} = \begin{cases} \text{Real} & \text{if } DDT(\Delta_{in} \to \Delta_{out}) > \frac{1}{2^{32}-1} \\ \text{Random} & \text{otherwise} \end{cases} \]

These distinguishers for reduced-round SPECK-32/64 are denoted D_nr, where nr ∈ {5, 6, 7, 8} represents the number of rounds. The neural distinguishers are denoted as N_nr.

Gohr’s neural distinguisher. We provide in Figure 2 a representation of Gohr’s neural distinguisher. It is a deep neural network, whose main components are:

1. Block 1: a 1D-CNN with kernel size of 1, a batch normalization and a ReLU activation function.
2. Blocks 2-i: one to ten layers with each layer consisting of two 1D-CNN with kernel size of 3, each followed by batch normalization and a ReLU activation function.
3. Block 3: a non-linear final classification block, composed of three perceptron layers separated by two batch normalization and ReLU functions, and finished with a sigmoid function.

...

Fig. 2: The whole pipeline of Gohr’s deep neural network. Block 1 refers to the initial convolution block, Block 2-1 to 2-10 refer to the residual block and Block 3 refers to the classification block.

The input to the initial convolution block (Block 1) is a 4 × 16 matrix, where each row corresponds to one 16-bit value in this order: C_l, C_r, C′_l, C′_r. A convolution layer with 32 filters is then applied. The kernel size of this 1D-CNN is 1; thus, it maps (C_l, C_r, C′_l, C′_r) to (filter_1, filter_2, ..., filter_32). Each filter is a non-linear combination of the features (C_l, C_r, C′_l, C′_r) after the ReLU activation function, depending on the value of the inputs and weights learned by the 1D-CNN. The output of the first block is connected to the input and added to the output of the subsequent layer in the residual block (see Figure 3).
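To make the shape of Block 1 concrete, the following sketch encodes a ciphertext pair as the 4 × 16 bit matrix described above and applies a kernel-size-1 convolution followed by ReLU. It is only an illustration under stated assumptions: the function names are hypothetical, bits are ordered most significant first (an assumption), batch normalization is omitted, and the weights are hand-picked rather than learned.

```python
def to_bit_rows(cl, cr, cl2, cr2, width=16):
    """Return a 4 x 16 bit matrix with rows C_l, C_r, C'_l, C'_r.

    Bits are listed most significant first (ordering is an assumption)."""
    return [[(w >> (width - 1 - j)) & 1 for j in range(width)]
            for w in (cl, cr, cl2, cr2)]


def pointwise_conv_relu(rows, weights, bias):
    """Kernel-size-1 convolution followed by ReLU.

    Each filter combines the 4 input rows at one bit position at a time,
    so the output for position j depends only on column j of the input."""
    n_pos = len(rows[0])
    out = []
    for w, b in zip(weights, bias):
        out.append([
            max(0.0, b + sum(w[c] * rows[c][j] for c in range(len(rows))))
            for j in range(n_pos)
        ])
    return out


# Hand-picked filter (not learned): fires where C_l has a 1 and C'_l has a 0.
rows = to_bit_rows(0x8000, 0x0000, 0x0000, 0xFFFF)
features = pointwise_conv_relu(rows, [[1.0, 0.0, -1.0, 0.0]], [0.0])
```

With 32 such filters, the output would be a 32 × 16 feature tensor, which matches the tensor shape quoted for the residual blocks.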
In the residual blocks (Blocks 2-i), both 1D-CNNs have a kernel of size 3, a padding of size 1 and a stride of size 1, which makes the temporal dimension invariant across layers. At the end of each layer, the output is connected to the input and added to the output of the subsequent layer to prevent the relevant input signal from being wiped out across layers. The output of a residual block is a (32 × 16) feature tensor (see Figure 4).

...

Fig. 3: Initial convolution block (Block 1).

Fig. 4: The residual block (Blocks 2-i).

The final classification block takes as input the flattened output tensor of the residual block. This 512 × 1 vector is passed into three perceptron layers (Multi-Layer Perceptron or MLP) with batch normalization and ReLU activation functions for the first two layers, and a final sigmoid activation function performs the binary classification (see Figure 5).

...

Fig. 5: The classification block (Block 3).

Accuracy and efficiency of the neural distinguishers. For each pair, the neural distinguisher outputs a real-valued score between 0 and 1. If this score is greater than or equal to 0.5, the sample is classified as a real pair, and as a random pair otherwise. The results given by Gohr are presented in Table 1. Note that N7 and N8 are trained using some sophisticated methods (we refer to [11] for more details on the training). We remark that Gohr’s neural distinguisher has about 100,000 floating-point parameters, which is size efficient considering the accuracies obtained.

Table 1: Accuracies of neural distinguishers for 5, 6, 7 and 8 rounds (taken from Table 2 of [11]). TPR and TNR denote true positive and true negative rates respectively.

...

Real differences experiment. The neural distinguishers performed better than the distinguishers using the full DDT, indicating that the neural distinguishers may learn something more than pure differential cryptanalysis. Gohr explores this effect with the real differences experiment.
In this experiment, instead of distinguishing a real pair from a random pair, the challenge is to distinguish real pairs from masked real pairs, computed as (C ⊕ M, C′ ⊕ M), where M is a random 32-bit value. These experiments use the N_nr distinguishers directly, without retraining them for this new task. Table 2 shows the accuracies of these distinguishers. Notice that this operation does not affect ∆out = C ⊕ C′ = (C ⊕ M) ⊕ (C′ ⊕ M) and thus the output difference distribution. However, the neural distinguishers are still able to distinguish real pairs from masked pairs even without re-training for this particular purpose, which shows that they do not just rely on the difference distribution.

Table 2: Accuracies of various neural distinguishers in the real differences experiment.

...

4 Interpretation of Gohr’s Neural Network: a Cryptanalysis Perspective

Interpretability of neural networks remains a highly researched area in machine learning, but the focus has always been on improving the model and computational efficiency. We will discuss more about the interpretability in a machine learning sense in Section 5. In this section, we want to find out why and how the neural distinguishers work in a cryptanalysis sense. In essence, we want to answer the following question:

What type of cryptanalysis is Gohr’s neural distinguisher learning?

If the neural distinguisher is learning some currently-unknown form of cryptanalysis, then we would like to extrapolate the additional statistics that it exploits. If not, then we want to find out what causes Gohr’s neural distinguishers to perform better than pure differential attacks, and even improve state-of-the-art attacks. With this question in mind, we perform a series of experiments and analyses in order to come up with a reasonable guess, later validated by the creation of a pure cryptanalysis-based distinguisher that matches the accuracy of Gohr’s.
Gohr’s neural distinguishers are able to correctly identify approximately 90.4%, 68.0% and 54.3% of the real ciphertext pairs (given by the true positive rates) for 5, 6 and 7 rounds of SPECK-32/64 respectively (see Table 1). We will try to find out what these ciphertext pairs are, whether they share any common patterns, and whether we are able to identify and isolate them.

Choice of input difference. As a start, we looked into Gohr’s choice of input difference: 0x0040/0000. This difference is part of a 9-round differential characteristic from Table 7 of [1]. The reason given by Gohr is that this difference deterministically transits to a difference with low Hamming weight after one round. Using constraint programming and techniques similar to [10], we found that the best differential characteristic with a fixed input difference of 0x0040/0000 for 5 rounds is 0x0040/0000 → 0x802a/d4a8, with probability of 2^-13. In contrast, when we do not restrict the input difference, the best differential characteristic for 5 rounds is 0x2800/0010 → 0x850a/9520, with probability of 2^-9. However, when we trained the neural distinguishers to recognize ciphertext pairs with the input difference of 0x2800/0010, the neural distinguishers performed worse (an accuracy of 75.85% for 5 rounds). This is surprising as it is generally natural for a cryptanalyst to maximize the differential probability when choosing a differential characteristic. We believe this is explained by the fact that 0x0040/0000 is the input difference maximizing the differential probability for 3 or 4 rounds of SPECK-32/64 (verified with constraint programming), which has the most chances to provide a biased distribution one or two rounds later. Generally, we believe that when using such a neural distinguisher, a good method to choose an input difference is to simply use the input difference leading to the highest differential probability for nr − 1 or nr − 2 rounds.

Changing the inputs to the neural network.
Gohr’s neural distinguishers are trained using the actual ciphertext pairs (C, C′) whereas the pure differential distinguishers are only using the difference between the two ciphertexts C ⊕ C′. Thus, it is unfair to compare the two as they are not exploiting the same amount of information. To have a fair comparison of the capability of neural distinguishers and pure differential distinguishers, we trained new neural distinguishers using C ⊕ C′, instead of (C, C′). The results are an accuracy of 90.6% for 5 rounds, 75.4% for 6 rounds and 58.3% for 7 rounds. This shows us that when the neural distinguishers are restricted to only have access to the difference distribution, they do not perform as well as their respective N_nr and perform similarly to D_nr (see footnote 4), as can be seen in Table 1. Therefore, this is another confirmation (on top of the real differences experiment conducted in [11]) that Gohr’s neural distinguishers are learning more than just the distribution of the differences on the ciphertext. With that information, we therefore naturally looked beyond just the difference distribution at round nr.

4.1 Analyzing Ciphertext Pairs

In this section, we limit and focus the discussions and results mostly to 5 rounds of SPECK-32/64. We recall that the last layer of the neural distinguisher is a sigmoid activation function. Thus, its output is a value between 0 and 1. When the score is 0.5 or more, the neural distinguisher predicts a real pair; otherwise, it predicts a random pair. The closer a score is to 0.5, the less certain the neural distinguisher is on the classification. In order to know what traits the neural distinguisher is looking for, we segregate the ciphertext pairs that yield extreme scores, i.e. scores that are either less than 0.1 (bad score) or more than 0.9 (good score).
For the rest of this section, we label the ciphertext pairs as “bad” and “good” ciphertext pairs and refer to the sets as B and G respectively. As we were experimenting with them, we kept the keys (unique to each pair) that are used to generate the ciphertext pairs. The goal now is to find similarities and differences in these two groups separately. As we believe that most of the features the neural distinguishers learned are differential in nature, we focus on the differentials of these ciphertext pairs.

Footnote 4: Note that the new neural distinguishers are trained with 10^7 pairs, the same number as in [11].

To start, we did the following experiment (Experiment A):

1. Using 10^5 real 5-round SPECK-32/64 ciphertext pairs, extract the set G.
2. Obtain the differences of the ciphertext pairs and sort them by frequency.
3. For each of the differences δ:
   (a) Generate 10^4 random 32-bit numbers and apply the difference δ to get 10^4 different ciphertext pairs.
   (b) Feed the pairs to the neural distinguisher N5 to obtain the scores.
   (c) Note down the number of pairs that yield a score ≥ 0.5.

In Table 3, we show the top 25 differences for 5 rounds of SPECK-32/64 with their respective score from the above experiment. For each of the first 1000 differences, about 75% of the pairs score more than 0.5. Also, there exist multiple pairs of differences such that one is more probable than the other, and yet, it has a lower number of pairs classified as real (e.g. No. 21 in Table 3). Thus, there is little evidence showing that if a difference is more probable, then the neural distinguisher is necessarily more likely to recognize it.

Table 3: The top 25 differences (5 rounds of SPECK-32/64) in G with their respective results for Experiment A, as a percentage of how many pairs have a score of ≥ 0.5 out of 10^4 pairs. Cnt refers to the number of differences obtained in G.

...
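The steps of Experiment A can be sketched as follows. This is a hedged illustration, not the code used for Table 3: `experiment_a` is a hypothetical name, ciphertexts are represented as 32-bit integers, and `score` is a placeholder callable standing in for the trained distinguisher N5, which is not reproduced here.

```python
import random
from collections import Counter


def experiment_a(good_pairs, score, n_random=10**4, seed=0):
    """For each difference seen in G (most frequent first), build n_random
    random pairs with that difference and record the fraction scored >= 0.5.

    good_pairs: iterable of (C, C') 32-bit integers from the set G.
    score:      callable (C, C') -> float in [0, 1]; stands in for N5.
    Returns a list of (difference, count_in_G, fraction_scored_real)."""
    rng = random.Random(seed)
    counts = Counter(c ^ c2 for c, c2 in good_pairs)
    results = []
    for delta, cnt in counts.most_common():
        hits = 0
        for _ in range(n_random):
            c = rng.getrandbits(32)          # random 32-bit value
            if score(c, c ^ delta) >= 0.5:   # apply the difference, then score
                hits += 1
        results.append((delta, cnt, hits / n_random))
    return results


# Toy stand-in scorer (an assumption for illustration only): it accepts a
# single fixed difference, whereas the real N5 sees both ciphertexts.
toy_score = lambda c, c2: 1.0 if (c ^ c2) == 0x1 else 0.0
toy_result = experiment_a([(0, 1), (2, 3), (0, 4)], toy_score, n_random=100)
```

Because the pairs in step 3(a) are random values with an imposed difference, any score variation between pairs sharing the same δ must come from information other than the ciphertext difference.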
Since the neural distinguishers outperform the ones with just the XOR input, we started to look beyond just the differences at 5 rounds. We decided to partially decrypt the ciphertext pairs from G for a few rounds and re-run Experiment A on these partially decrypted pairs: for each pair, we compute the difference, and for each difference, we create 10^4 random plaintext pairs with this difference and encrypt them to round nr using random keys. The results are very intriguing compared to those of Table 3: almost all of the (top 1000) unique differences obtained in this experiment achieved 99% or 100% of ciphertext pairs having a score of ≥ 0.5. We can see that the differences at rounds 3 and 4 (after decrypting 2 and 1 round respectively) start to show some strong biases. In fact, for all of the top 1000 differences at rounds 3 and 4, all 10^4 pairs × 1000 differences returned a score of ≥ 0.5 (see footnote 5). With that, we conduct yet another experiment (Experiment B):

1. For all the ciphertext pairs in G, decrypt i rounds with their respective keys and compute the corresponding difference. Denote the set of differences as Diff5-i.
2. Generate 10^5 plaintext pairs with a difference of 0x0040/0000 with random keys, encrypt to 4 rounds.
3. If the pair’s difference is in Diff5-i, keep the pair. Otherwise, discard.
4. Encrypt the remaining pairs to 5 rounds and evaluate them using N5.

When i = 2, we obtain 1669 unique differences with a dataset size of 89,969. 97.86% of these ciphertext pairs yielded a score ≥ 0.5 (i.e. by this method, we can isolate 88.04% of the true positive ciphertext pairs). Using i = 1, we have 128,039 unique differences and the size of the dataset is 74,077. While we could get a cleaner set with 99.98% of these ciphertext pairs obtaining a score of ≥ 0.5, we only managed to isolate 74.06% of the true positive pairs. Comparing with the true positive rate of N5 from Table 1, which is 0.904 ± 8.33 × 10^-4, the case when i = 2 seems to be closer.
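The percentages quoted for Experiment B can be cross-checked with a short calculation. The sketch below assumes that the "isolated" proportion is the number of kept pairs scored ≥ 0.5 divided by the 10^5 pairs generated in step 2; this reading is an assumption, but it reproduces the quoted figures.

```python
def isolated_fraction(dataset_size, rate_scored_real, n_generated=10**5):
    """Kept pairs scored >= 0.5, as a fraction of all generated real pairs."""
    return dataset_size * rate_scored_real / n_generated


# i = 2: 89,969 pairs kept, 97.86% of them scored >= 0.5
print(round(100 * isolated_fraction(89_969, 0.9786), 2))  # 88.04
# i = 1: 74,077 pairs kept, 99.98% of them scored >= 0.5
print(round(100 * isolated_fraction(74_077, 0.9998), 2))  # 74.06
```

Both values can then be compared directly with the true positive rate of N5 (about 90.4%), which is why the i = 2 filter is the closer match.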
We also looked into the bias of the difference bits (the jth difference bit refers to the jth bit index of $C_{5-2} \oplus C'_{5-2}$, where $C_{nr-i}$ refers to the nr-round ciphertext decrypted by i rounds). Table 4 shows the difference bit biases of the first 1000 (most common) unique differences of ciphertext pairs in G and B after decrypting two rounds. We assume that the neural distinguisher is able to identify some bits at these rounds because they are significantly more biased, though both sets B and G are from the real distribution. Now, we state the assumption required for our conjecture, which we will verify experimentally in Section 4.3.

Assumption 1. Given a 5-round SPECK-32/64 ciphertext pair, N5 is able to determine the difference of certain bits at rounds 3 and 4 with high accuracy.

Conjecture 1. Given a 5-round SPECK-32/64 ciphertext pair, N5 finds the difference of certain bits at round 3 and decides if the ciphertext pair is real or random.

Interestingly, the difference bit biases after decrypting 1 and 2 rounds are very similar (in their positions). We will provide an explanation in Section 4.2.

Footnote 5: The differences were obtained experimentally.

Table 4: Difference bit bias of ciphertext pairs in G and B after decrypting 2 rounds. A negative (resp. positive) value indicates a bias towards ‘0’ (resp. ‘1’).
bit position 31 ...
0.476 -0.454 -0.142 -0.006 0.025 0.084 -0.009 0.487 -0.473 -0.426 0.165 0.094 -0.006 0.019 -0.500 -0.500 0.031 -0.009 -0.015 -0.007 -0.014 -0.024 0.025 0.026 0.034 -0.005 -0.018 -0.021 0.006 0.009 0.079 -0.065

The exact truncated differentials are (∗ denotes no specific constraint, while 0 or 1 denotes the expected bit difference):
3 rounds: 10∗∗∗∗∗00∗∗∗∗∗00 10∗∗∗∗∗00∗∗∗∗∗10
4 rounds: 10∗∗∗∗∗10∗∗∗∗∗10 10∗∗∗∗∗10∗∗∗∗∗00

We refer to these particular truncated differential masks as TD3 and TD4 for the following discussion.
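To make the two truncated differentials concrete, the sketch below turns each 16-bit pattern into an AND mask and an expected value, so that checking a difference against TD3 or TD4 becomes a single masked comparison. The helper names are hypothetical, and `*` stands for the unconstrained positions written ∗ above.

```python
def parse_pattern(pattern):
    """Turn a 16-character string over {'0', '1', '*'} (MSB first) into (mask, value)."""
    mask = value = 0
    for ch in pattern:
        mask <<= 1
        value <<= 1
        if ch != "*":
            mask |= 1          # this bit position is constrained
            value |= int(ch)   # and must equal the given bit
    return mask, value


def make_truncated_differential(left, right):
    """Combine the left and right 16-bit patterns into 32-bit (mask, value)."""
    lm, lv = parse_pattern(left)
    rm, rv = parse_pattern(right)
    return (lm << 16) | rm, (lv << 16) | rv


TD3 = make_truncated_differential("10*****00*****00", "10*****00*****10")
TD4 = make_truncated_differential("10*****10*****10", "10*****10*****00")


def satisfies(diff_left, diff_right, td):
    """True if the difference agrees with the truncated differential on every fixed bit."""
    mask, value = td
    diff = (diff_left << 16) | diff_right
    return (diff & mask) == value


# The round-4 difference 0x850a/9520 matches TD4 but not TD3.
print(satisfies(0x850A, 0x9520, TD4), satisfies(0x850A, 0x9520, TD3))
```

Both patterns constrain the same six bit positions in each half, which corresponds to the mask 0xc183/c183.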
Using constraint programming, we evaluate that the probabilities for these truncated differentials are 87.86% and 49.87% respectively. In order to verify how much the neural distinguisher is relying on these bits, we perform the following experiment (Experiment C):
1. Generate $10^6$ plaintext pairs with initial difference 0x0040/0000 and $10^6$ random keys.
2. Encrypt all $10^6$ plaintext pairs to 5 − i rounds. If a plaintext pair satisfies TD5−i, then we keep it. Otherwise, it will be discarded.
3. Encrypt the remaining pairs to 5 rounds and evaluate them using N5.

Table 5: Results of Experiment C with TD3 and TD4. Proport. refers to the number of true positive ciphertext pairs captured by the experiment. ...

Table 5 shows the statistics of the above experiment with 5 rounds of SPECK-32/64. The true positive rates for ciphertext pairs that follow these truncated differentials are closer to that of Gohr’s neural distinguisher. Now, there remains about 3% of the ciphertext pairs yet to be explained (comparing the results of TD5−2 with N5). The important point to note here is that the pairs we have identified are exactly the ones verified by the neural distinguisher as well, by the nature of these experiments. In other words, we managed to find what the neural distinguisher is looking for and not just another distinguisher that would achieve a good accuracy by identifying a different set of ciphertext pairs.

4.2 Deriving TD3 and TD4

With an input difference of 0x0040/0000, which has a deterministic transition to 0x8000/8000 in round 1, the difference will only start to spread after round 1 due to the modular addition in the SPECK-32/64 round function. The inputs to the modular addition at round 2 are 0x0100 and 0x8000 (cf. Figure 6). While there are two active bits, only one of them will propagate the carry (as the other is the MSB), resulting in multiple differences. Assuming a uniform distribution, the carry has a probability 1/2 of propagating to the left.
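The effect of the carry can be checked empirically. The sketch below is an illustration only, based on random sampling with a fixed seed and hypothetical names. It estimates the distribution of output differences of a 16-bit modular addition whose inputs carry the differences 0x0100 and 0x8000.

```python
import random
from collections import Counter


def addition_output_differences(alpha, beta, samples=20000, seed=0):
    """Estimate the output-difference distribution of x + y mod 2^16."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(samples):
        x, y = rng.getrandbits(16), rng.getrandbits(16)
        s1 = (x + y) & 0xFFFF
        s2 = ((x ^ alpha) + (y ^ beta)) & 0xFFFF
        counts[s1 ^ s2] += 1
    # most_common() sorts by decreasing frequency, and dicts keep that order
    return {d: c / samples for d, c in counts.most_common()}


dist = addition_output_differences(0x0100, 0x8000)
for diff, prob in list(dist.items())[:4]:
    print(f"{diff:#06x}: {prob:.3f}")
```

The most frequent output difference is 0x8100, seen in about half of the samples; each further step of the carry (0x8300, 0x8700, and so on) is about half as frequent as the previous one.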
As the carry propagates, the probability of each successive differential is halved, until b31 (bit position 31) is reached, beyond which any further carry is removed by the modular addition.

Fig. 6: The distribution of the possible output differences after passing through the modular addition operation. ...

In Figure 7 and Figure 8, we show how the bits evolve along the most probable differential path from round 1 (0x8000/8000) to round 4 (0x850a/9520). As it passes through the modular addition operation, we highlight the bits that have a relatively higher probability of being different from the most probable differential. The darker the color, the higher the probability of the difference being toggled.

Figure 7 and Figure 8 show us why TD3 is important at round 3, and how the active bits shift in SPECK-32/64 when we start with the input difference of 0x0040/0000. In every round, b31 (the leftmost bit) has a high probability of staying active. This bit is then rotated to b24 before it goes into the modular addition operation. In each round, b26 has a 1/2 chance of switching from 1 → 0 or the other way round. b27 and b28 have a 1/4 and 1/8 chance respectively of switching. This makes them highly volatile and therefore unreliable. On the other hand, the right part of SPECK-32/64 rotates by 2 to the left at the end of each round. Because of the high rotation value in the left part of SPECK-32/64, the low rotation value of the right part, and the fact that the left part is added into the right part after the rotation, it takes about 3 to 4 rounds for the volatile and unreliable bits to spread. ...

Fig. 7: The left (resp. right) part shows how the active bit from difference 0x8000/8000 (resp. 0x8100/8102) propagates to difference 0x8100/8102 (resp. 0x8000/820a). The darker the color, the higher the probability (≥ 1/4) that a carry has propagated to it. ...

Fig. 8: Showing how the active bit from difference 0x8000/820a propagates to difference 0x850a/9520. The darker the color, the higher the probability (≥ 1/4) that a carry has propagated to it.

4.3 Verifying Assumption 1

To verify if Gohr’s neural distinguisher is able to recognize the truncated differential, we retrain the neural distinguisher with a slight difference (Experiment D):
1. Generate $10^7$ plaintext pairs such that about 1/2 of the pairs satisfy TD3 (these are the positive pairs).
2. Encrypt the plaintext pairs for two rounds.
3. Train the neural network to distinguish the two distributions, and validate with the same hyper-parameters as in [11], with a depth of 1 in the residual block.

After retraining, the neural distinguisher has an accuracy of 96.57% (TPR: 99.95%, TNR: 93.19%). This shows that the neural distinguisher has the capability to actually recognize the truncated differential with an outstanding accuracy.

4.4 SPECK-32/64 Reduced to 6 Rounds

We perform Experiment C for 6 rounds of SPECK-32/64 as well. Table 6 shows the comparison of the true positive results of rounds 5 and 6. While the results are not as obvious as for the case of 5 rounds, we can still observe a similar trend for 6 rounds. ...

Table 6: Results of SPECK-32/64 reduced to six rounds for Experiment C. Proportion refers to the number of true positive ciphertext pairs captured by the experiment.

4.5 Average Key Rank Differential Distinguisher

Taking into consideration the observations we presented in this section, we introduce a new average key rank distinguisher that is not based on machine learning and almost matches the accuracy of Gohr’s neural network for 5, 6 and 7 rounds of SPECK-32/64. Here are the key considerations used in our distinguisher:
– The training set of Gohr’s neural network consists of $10^7$ ciphertext pairs. Thus, we restrict our distinguisher to only use $10^7$ ciphertext pairs as well.
– If we do an exhaustive key search for two rounds, the time complexity will be extremely high. Instead, we may need to limit ourselves to only one round to match the complexity of the neural distinguishers.
– If we know the difference at round i, the i − 1 round difference for the right part is known as well, since $r_{i-1} = (l_i \oplus r_i) \ggg 2$ (a right rotation by 2).

With those pointers in mind, we created a distinguisher that uses an approximated DDT (aDDT); that is, a truncated DDT that is experimentally constructed based on n ciphertext pairs. In this distinguisher, we use n = $10^7$ to ensure that both our distinguisher and the neural distinguishers have the same amount of information. The idea of the distinguisher is to decrypt the last round, nr, using all possible subkey bits that are relevant to the bits we are interested in. Then, we compute the average of the probabilities of all partial decryptions for a given pair, read from aDDT (nr − 1), to get a score. If the score is greater than that of the random distribution, the distinguisher will return 1 (Real), and 0 (Random) otherwise.

The bits we are interested in can be represented as an AND mask, that is, a mask that has ‘1’ in the bit positions that we want to consider and ‘0’ in those we want to ignore. The mask value we have chosen is 0xff8f/ff8f rather than the expected 0xc183/c183, as we believe the truncated differential they are detecting is at nr − 2 rounds. Thus, other than the bits that are identified earlier in this section, we decided to include more bits to improve the accuracy. With the look-up table to the aDDT, we not only match the data complexity (of the offline training) of Gohr’s neural distinguishers, but also include the correlations between bits.

The pseudocode for creating the aDDT and for the average key rank distinguisher can be found in Algorithm 1 and Algorithm 2, given in the Supplementary Materials.
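As a rough illustration of the idea (Algorithms 1 and 2 in the Supplementary Materials remain the reference), the sketch below builds a masked aDDT from simulated pairs and scores a ciphertext pair by averaging aDDT look-ups over partial last-round key guesses. Several points are assumptions made for this sketch: independent random round keys replace the key schedule, only the 12 low subkey bits are enumerated, the decision threshold is taken to be $2^{-HW(mask)}$ (the average table value for a uniformly random masked difference), and all names are hypothetical.

```python
import random
from collections import Counter

MASK16 = 0xFFFF


def ror(x, r):
    return ((x >> r) | (x << (16 - r))) & MASK16


def rol(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK16


def enc_round(l, r, k):
    l = ((ror(l, 7) + r) & MASK16) ^ k
    return l, rol(r, 2) ^ l


def dec_round(l, r, k):
    r = ror(l ^ r, 2)
    return rol(((l ^ k) - r) & MASK16, 7), r


def masked_difference(a, b, mask):
    return (((a[0] ^ b[0]) << 16) | (a[1] ^ b[1])) & mask


def build_addt(n_rounds, n_pairs, mask, rng, delta=0x0040):
    """Empirical distribution of masked differences after n_rounds rounds."""
    counts = Counter()
    for _ in range(n_pairs):
        keys = [rng.getrandbits(16) for _ in range(n_rounds)]  # assumption: no key schedule
        a = (rng.getrandbits(16), rng.getrandbits(16))
        b = (a[0] ^ delta, a[1])
        for k in keys:
            a, b = enc_round(*a, k), enc_round(*b, k)
        counts[masked_difference(a, b, mask)] += 1
    return {d: c / n_pairs for d, c in counts.items()}


def average_key_score(c, c2, addt, mask, key_bits=12):
    """Average aDDT probability over all guesses of the low key_bits subkey bits."""
    total = 0.0
    for k in range(1 << key_bits):
        diff = masked_difference(dec_round(*c, k), dec_round(*c2, k), mask)
        total += addt.get(diff, 0.0)
    return total / (1 << key_bits)


def classify(c, c2, addt, mask, key_bits=12):
    """Return 1 (Real) if the score beats the assumed random baseline, else 0."""
    random_score = 2.0 ** -bin(mask).count("1")
    return int(average_key_score(c, c2, addt, mask, key_bits) > random_score)
```

For an nr-round pair, the table passed to `classify` would be built with `n_rounds = nr - 1` and the mask `0xFF8FFF8F`, using a much larger `n_pairs` than is practical in pure Python.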
We applied the distinguisher to 5, 6 and 7 rounds of SPECK-32/64 and the results are given in Table 7. It shows that our distinguisher closely matches the accuracies of Gohr’s neural distinguishers.

Degree of closeness. We now study the similarity between our distinguishers and Gohr’s neural distinguishers. In particular, we are interested in whether the classifications of the ciphertext pairs are the same for both distinguishers. To verify this, we gave a set of $10^5$ 5-round ciphertext pairs (approx. 50,000 each from the real and random distributions) to both our average key rank distinguisher and N5, and measured how often they produced the same output. The results for nr = 5 are shown in Table 8. We can see that about 97.6% of the ciphertext pairs tested have the same classification in both distinguishers. For nr = 6, we achieved 94.98% of the pairs with the same classification.

Complexity comparison. In our average key rank distinguisher, for each pair, we perform the partial decryption of two ciphertexts, and a table lookup in aDDT. In the partial decryption, we enumerate the $2^{12}$ keys affecting the rightmost 13 bits of δl_{nr−1} covered by our mask. Therefore, the complexity of our distinguisher is $2^{13}$ one-round SPECK-32/64 decryptions, and $2^{13}$ table lookups. Comparing its complexity with Gohr’s distinguishers is not trivial, as the operations involved are different. Gohr evaluates the complexity of his neural key recovery by its runtime and an estimation of the number of SPECK encryptions that could be performed at the same time on a GPU implementation. We propose to use the number of floating point multiplications performed by the neural network instead. Let I and O respectively denote the number of inputs and outputs to one layer. The computational cost of going through a dense layer is I · O multiplications.
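The proposed multiplication count can be tallied layer by layer, charging I · O per dense layer and I · ks · F · T per 1D convolution (I input channels, kernel size ks, F filters, length T). The sketch below is illustrative only. The layer shapes are an assumption about the depth-1 variant of Gohr's network (4 input words of 16 bits, 32 filters, one residual block of two kernel-size-3 convolutions, then dense layers 512 → 64 → 64 → 1), and the helper names are hypothetical.

```python
from math import log2


def dense_cost(n_in, n_out):
    """Multiplications in a dense layer: one per (input, output) pair."""
    return n_in * n_out


def conv1d_cost(channels_in, filters, length, kernel_size=1):
    """Multiplications in a stride-1, same-length 1D convolution."""
    return channels_in * kernel_size * filters * length


# Assumed layer shapes of the depth-1 network (see the lead-in above).
layers = [
    conv1d_cost(4, 32, 16, kernel_size=1),   # initial kernel-size-1 convolution
    conv1d_cost(32, 32, 16, kernel_size=3),  # residual block, first convolution
    conv1d_cost(32, 32, 16, kernel_size=3),  # residual block, second convolution
    dense_cost(32 * 16, 64),                 # flattened 512 features -> 64
    dense_cost(64, 64),
    dense_cost(64, 1),
]

total = sum(layers)
print(total, round(log2(total), 2))  # 137280 17.07
```

Under these assumed shapes, the two residual convolutions account for 98304 of the 137280 multiplications, so the convolutional part dominates the cost.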
For a 1D-CNN with kernel size ks = 1, a null padding, a stride equal to 1 and F filters, with input size (I, T), the cost is computed as I · F · T multiplications. With the same input but with kernel size ks = 3 and a padding equal to 1, the cost is I · ks · F · T. Applying these formulas to Gohr’s neural network, we obtain a total of 137280 ≈ $2^{17.07}$ multiplications. Note that we do not account for batch normalizations and additions, which are dominated by the cost of the multiplications. Using this estimation, it seems that our distinguisher is slightly better in terms of complexity.

Table 7: Accuracy of the average key rank distinguisher with a mask value of 0xff8f/ff8f. ...

Table 8: Closeness of the outputs of N5 and average key rank distinguisher. ...

4.6 Discussion

Even though Gohr trained a neural distinguisher with a fixed input difference, it is unfair to compare the accuracy of the neural distinguisher to that of a pure differential cryptanalysis (with the use of DDT), since there are alternative cryptanalysis methods that the neural distinguisher may have learned. The experiments performed indicate that while Gohr’s neural distinguishers did not rely much on the difference at round nr, they rely strongly on the differences at round nr − 1 and even more strongly at round nr − 2. These results support the hypothesis that the neural distinguisher may learn differential-linear cryptanalysis [13] in the case of SPECK. While we did not present any attacks here, using the MILP model shown in [9], we verified that there are indeed many linear relations with large biases for 2 to 3 rounds.

Unlike traditional linear cryptanalysis, which usually uses independent characteristics or a linear hull involving the same plaintext and ciphertext bits, a well-trained neural network is able to learn and exploit several linear characteristics while taking into account their dependencies and correlations.
We believe that neural networks find the easiest way to achieve the best accuracy. In the case of SPECK, it seems that differential-linear cryptanalysis would be a good fit since it requires less data and the truncated differential has a very high probability. Thus, we think that neural networks have the ability to efficiently learn short but strong differential, linear or differential-linear characteristics for small block ciphers for a small number of rounds.

4.7 Application to AES-2-2-4 [7]

We are also interested in the capabilities of the neural distinguishers on a Substitution-Permutation Network (SPN) cipher. We chose a small scale variant of AES from [7] with the parameters r = 2, c = 2, e = 4. We chose this cipher as it has a small state size, which could be exhaustively searched through. AES-2-2-8 would be a good choice as it also has a state size of 32 bits; however, our distinguishers are not able to learn anything significant on it. We trained on AES-2-2-4 with $2^{15}$ pairs, starting with an input difference of (1, 0, 0, 1). This input difference was chosen such that all S-boxes become active only after two rounds. We trained the neural distinguishers for 3 rounds and obtained an accuracy of 61.0%. For comparison, using the same number of pairs, we trained an aDDT distinguisher and obtained an accuracy of 62.3%.

To show the possibilities of relying purely on differences, we perform an experiment similar to Experiment A. With the trained neural distinguisher, we exhaust all possible 16-bit differences and generate 100 random pairs for each difference. Next, we feed the pairs to the neural distinguisher and count the number of pairs in each score bucket: [0.0 − 0.1), [0.1 − 0.2), ..., [0.9 − 1.0]. Our result shows that for each differential, the 100 random pairs form a cluster about a center, similar to a Gaussian distribution.
These results seem to suggest that the nature of the neural distinguisher for AES-2-2-4 is one that relies fully on differentials: giving a confidence interval based on just the difference.

5 Interpretation of Gohr’s Neural Network: a Machine Learning Perspective

In this section, we explore the following practical question: Can Gohr’s neural network be replaced by a strategy inspired by both differential cryptanalysis and machine learning?

We will demonstrate here that this is possible. First of all, it should be emphasized that DNNs often outperform mathematical modeling or standard machine learning approaches in supervised data-driven settings, especially on high-dimensional data. It seems to be the case because correlations found between input and output pairs during DNN training lead to more relevant characteristics than those found by experts. In other words, Gohr’s neural distinguisher seems to be capable of finding a property P currently unknown by cryptanalysts. One may ask if we could experimentally approach this unknown property P that encodes the neural distinguisher behavior, using both machine learning and cryptanalysis expertise. With this question in mind, we propose our best estimate with a focus on 5 and 6 SPECK-32/64 rounds, where the DNN achieves accuracies of 92.9% and 78.8% in a real/random distinction setting and where the full DDT approach can achieve accuracies of 91.1% and 75.8%. In our best setting, we reach accuracy values of 92.3% and 77.9%.

Section 3 discusses in detail how Gohr’s neural distinguisher is modeled in three blocks. Our objective here is to replace each of these individual blocks by a more interpretable one, coming either from machine learning or from the cryptanalysts’ point of view. This work is thus the result of the collaboration between two worlds addressing the open question of deep learning interpretability.
In the course of the study, we set forth and challenged four conjectures to estimate the property P learned by the DNN, as detailed below.

5.1 Four Conjectures

Conjectures 2 & 3 aim to uncover Block 3 behavior. Conjecture 4 tackles Block 1 while Conjecture 5 concerns Block 2-i.

The DNN can not be entirely replaced by another machine learning model. Ensemble-based machine learning models such as random forests [4] and gradient boosting decision trees [8] are accurate and easier to interpret than DNNs [14]. Nevertheless, DNNs outperform ensemble-based machine learning models for most tasks on high-dimensional data such as images. However, with only 64 bits of input, we could legitimately wonder whether the DNN could be replaced by another ensemble-based machine learning model. Despite the small size of our problem, our experiments reveal that other models significantly decrease the accuracy.

Conjecture 2. Gohr’s neural network outperforms other non-neural network machine learning models.

Experiment. To challenge this conjecture, we tested multiple machine learning models, such as Random Forest (RF), Light Gradient Boosting Machine (LGBM), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM) and Linear Regression (LR). They all performed equally. For the rest of this paper, we will only consider LGBM [12] as an alternative ensemble classifier for DNN and MLP. LGBM is an extension of Gradient Boosting Decision Tree (GBDT) [8] and we fixed our choice on it because it is accurate, interpretable and faster to train than RF or GBDT. In support of our conjecture, we established that the accuracy of the LGBM model is significantly lower than that of the DNN when the inputs are (Cl, Cr, C′l, C′r); see the third column of Table 9.

Table 9: A comparison of the neural distinguisher and LGBM model for 5 rounds, for $10^6$ samples generated of type (Cl, Cr, C′ ...

The final MLP block is not essential.
As described above, we can not replace the entire DNN with another non-neural machine learning model that is easier to interpret. However, we may be able to replace the last block (Block 3) of the neural distinguisher, which performs the final classification, by an ensemble model.

Conjecture 3. The MLP block of Gohr’s neural network can be replaced by another ensemble classifier.

Experiment. We successfully exchanged the final MLP block for an LGBM model. The reasons for choosing LGBM as a non-linear classifier were detailed in the previous experiment paragraph. The first attempt is a complete substitution of Block 3, taking the 512-dimension output of Block 2-10 as input. In the fourth column of Table 9, we observe that this experiment leads to much better results than the one from Conjecture 2, and even better results than the classical DDT method D5 (+0.39%). To further improve the accuracy, we implemented a partial substitution, taking only the 64-dimension output of the first layer of the MLP as input. As can be seen in the fifth column of Table 9, the accuracy with those inputs is now much closer to the DNN accuracy. In both cases, the accuracy is close to the neural distinguisher, supporting our conjecture. At this point, in order to grasp the unknown property P, one needs to understand the feature vector at the residuals’ output.

The linear transformation on the inputs. We saw in Section 3 that Block 1 performs a linear transformation on the input. By looking at the weights of the DNN’s first convolution, we observe that it contains many opposite values. This indicates that the DNN is looking for differences between the input features. Consequently, we propose the following conjecture.

Conjecture 4. The first convolution layer of Gohr’s neural network transforms the input (Cl, Cr, C′l, C′r) into (∆L, ∆V, V0, V1) and a linear combination of those terms.

Experiments.
As the inputs of the first convolution are binary, we could formally verify our conjecture. By forcing to one all non-zero values of the output of this layer, we calculated the truth table of the first convolution. We thus obtained the boolean expression of the first layer for the 32 filters. We observed that eight filters were empty and the remaining twenty-four filters were simple. The filter expressions are provided in Table 14 in the Supplementary Materials.

However, one may argue that setting all non-zero values to one is an oversimplified approach. Therefore, we replaced the first ReLU activation function by the Heaviside activation function, and then we retrained the DNN. Since the Heaviside function binarizes the intermediate value (as in [28]), we can establish the formal expression of the first layer of the retrained DNN. This second DNN had the same accuracy as the first one and almost the same filter boolean expressions.

Finally, we trained the same DNN with the following entries: (∆L, ∆V, V0, V1). Using the same method as before, we established the filters’ boolean expressions. This time, we obtained twenty-five null filters and seven non-null filters, with the following expressions: ∆L, V0 ∧ V1, ∆L, ∆L, V0 ∧ V1, ∆L ∧ ∆V, ∆L ∧ ∆V. These observations support Conjecture 4. Therefore, we kept only (∆L, ∆V, V0, V1) as inputs for our pipeline.

The masked output distribution table. With regard to the remaining residual block replacement, our first assumption is that the DNN calculates a shape close to the DDT in that residual block. However, two major properties of the neural distinguisher prevent us from assuming that it is a DDT in the classical sense of the term. The first property, as explained in Section 3, is that the neural distinguisher does not only rely on the difference distribution to distinguish real pairs, as presented in Table 2.
The second specificity is that the DNN has only approximately 100,000 floating parameters to perform classification, which can be considered size efficient. Our second assumption is therefore that the DNN is able to compress the distribution table. We introduce the following definitions.

Output Distribution Table (ODT). We propose to compute a distribution table on the values (∆L, ∆V, V0, V1) directly, instead of doing so on the difference of the ciphertext pair (Cl ⊕ C′l, Cr ⊕ C′r). We call this new table an Output Distribution Table (ODT) and it can be seen as a generalization of the DDT. The entries of the ODT are 64 bits, which is not tractable for $10^7$ samples. Also, the DNN has only 100,000 parameters. The DNN is therefore able to compress the ODT.

Masked Output Distribution Table (M-ODT). A compressed ODT means that the input is not 64 bits, but instead hw bits, where hw represents the Hamming weight of the mask. Let us consider a mask $M \in \mathcal{M}_{hw}$, with $\mathcal{M}_{hw}$ the set of 64-bit masks with Hamming weight hw and $M = (M_1, M_2, M_3, M_4)$, with $M_i$ a 16-bit mask. Compressing the ODT therefore means applying the mask M to all inputs. In our case, with I = (∆L, ∆V, V0, V1), we get $I_M = (∆L \wedge M_1, ∆V \wedge M_2, V_0 \wedge M_3, V_1 \wedge M_4) = I \wedge M$ before computing the ODT. By calculating that way, the number of ODT entries per mask decreases. It becomes a function that depends only on hw and on the bit positions in the masks. It is therefore a more compact representation of the complete ODT.

However, it turns out that if we consider only one mask, we get only one value per sample to perform the classification, $P(\text{Real} \mid I_M)$, while the DNN has a final vector size of 512. We therefore considered several masks. By defining $R_M \subseteq \mathcal{M}_{hw}$, the set of relevant masks of $\mathcal{M}_{hw}$, we can calculate for a specific input I = (∆L, ∆V, V0, V1) the probability $P(\text{Real} \mid I_M)$ for all $M \in R_M$. Then, we concatenate all the probabilities into a feature vector of size $m = |R_M|$. We get the feature vector F for the input I:

$$F = \big(P(\text{Real} \mid I_{M_1}) \;\; P(\text{Real} \mid I_{M_2}) \;\; \cdots \;\; P(\text{Real} \mid I_{M_m})\big)^T.$$

We are now able to propose the final conjecture.

Conjecture 5. The neural distinguisher internal data processing of Block 2-i can be approached by:
1. Computing a distribution table for input (∆L, ∆V, V0, V1).
2. Finding several relevant masks and applying them to the input in order to compress the output distribution table.

We abbreviate this Masked Output Distribution Table as M-ODT. Thus, the feature vector of the DNN can be replaced by a vector where each value represents the probability stored in the M-ODT for each mask. This approach enables us to replace Block 2-i of the DNN. However, we still need to clarify how to get the set $R_M$.

Extracting masks. Based on local interpretation methods, we can extract these masks from the DNN. Indeed, these methods consist of highlighting the most important bits of the entries for classification. Thus, by sorting the entries according to their score and by applying these local interpretation methods, we can obtain the relevant masks.

5.2 Approximating the Expression of the Property P

From our conjectures, we hypothesized that we can approximate the unknown property P that encodes the neural distinguisher behavior by the following:
– Changing (C, C′) into I = (∆L, ∆V, V0, V1).
– Changing the 512-feature vector of the DNN by the feature vector of probabilities $F = \big(P(\text{Real} \mid I_{M_1}) \;\; P(\text{Real} \mid I_{M_2}) \;\; \cdots \;\; P(\text{Real} \mid I_{M_m})\big)^T$.
– Changing the final MLP block by the ensemble machine learning model LGBM.

These points stand respectively for Block 1, Block 2-i and Block 3.

5.3 Implementation

In this section, based on the verified conjectures, we describe the stepwise implementation of our method. We consider that we have a DNN trained with $10^7$ samples of type (∆L, ∆V, V0, V1) for 5 and 6 rounds of SPECK-32/64. We developed a three-step approach:
1. Extraction of the masks from the DNN with a first dataset.
2. Construction of the M-ODT with a second dataset.
3. Training of the final classifier from the probabilities stored in the M-ODT with a third dataset.

Mask extraction from the DNN. We first ranked $10^4$ real samples according to DNN score, as described in Section 4.1, in order to estimate the masks from these entries. We used multiple local interpretation methods: Integrated Gradients [26], DeepLift [22], Gradient Shap [15], Saliency maps [23], Shapley Value [5], and Occlusion [27]. These methods score each bit according to its importance for the classification. Following averaging by batch and by method, there were two possible ways to move forward. We could either assign a Hamming weight or else set a threshold above which all bits would be set to one. After a wide range of experiments, we chose the first option and set the Hamming weight to sixteen and eighteen (which turned out to be the best values in our testing). This approach allowed us to build the set $R_M$ of the relevant masks.

Implementation details. We used the captum library (footnote 6), which brings together multiple local interpretation methods. The dataset is divided into batches of size about 2,500 and grouped by scores. The categories we used were: scores from 1 to 0.9 (about 2,000 samples), scores from 0.9 to 0.5 (about 500 samples), scores from 1 to 0.8 (about 2,100 samples) and scores from 1 to 0.5 (about 2,500 samples). This way, one score per method could be derived for each bit of each sample. We then proposed several methods to average these importance scores per bit within each category: the sum of absolute values, the median of absolute values and the average of absolute values. Then, we took the sixteen and eighteen best values and we obtained a mask. There is one mask per score category, one per local interpretation method and one per averaging method. On average, for 5,000 samples we generate about 100 relevant masks.
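The last step of the mask extraction, going from per-bit importance scores to a mask of fixed Hamming weight, can be sketched as follows. This is a simplified illustration, not the project's code. It assumes the attributions are already available as an array with one row per sample and one column per input bit (for instance as returned by a local interpretation method), that column 0 corresponds to the most significant bit, and that the function name is hypothetical.

```python
import numpy as np


def mask_from_scores(attributions, hamming_weight, reduce="mean"):
    """Build a 64-bit mask from an (n_samples, 64) array of importance scores.

    The scores are aggregated per bit over the samples (sum, median or mean
    of absolute values), and the hamming_weight best bits are set to one.
    """
    magnitudes = np.abs(np.asarray(attributions, dtype=float))
    if reduce == "sum":
        per_bit = magnitudes.sum(axis=0)
    elif reduce == "median":
        per_bit = np.median(magnitudes, axis=0)
    elif reduce == "mean":
        per_bit = magnitudes.mean(axis=0)
    else:
        raise ValueError(f"unknown reduction: {reduce}")

    # Indices of the highest aggregated scores, best first.
    top = np.argsort(per_bit)[::-1][:hamming_weight]
    mask = 0
    for index in top:
        mask |= 1 << (63 - int(index))  # assumption: column 0 is the MSB
    return mask
```

Calling this once per score category, interpretation method and reduction yields one candidate mask for each combination, in line with the counting described above.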
Finally, with the methods available in scikit-learn [20], we ranked the features, and so the masks, according to their performance. After repeating mask generation and selection multiple times, we obtained 50 masks that are effective: they are provided in Table 15 in the Supplementary Materials. The final set of masks is the union of those 50 effective masks and the generated relevant masks.

Footnote 6: https://github.com/pytorch/captum

Constructing the M-ODT. Once the set $R_M$ of relevant masks is determined, we compute the M-ODT. Algorithm D (in Supplementary Materials) describes our construction method, which is similar to that of the DDT. The inputs of the algorithm include a second dataset D composed of n = $10^7$ real samples of type I = (∆L, ∆V, V0, V1), and the set of relevant masks $R_M$. The output is the M-ODT dictionary with the mask as first key, the masked input as second key, and $P(\text{Real} \mid I \wedge M) = P(\text{Real} \mid I_M)$ as value. The M-ODT dictionary is constructed as follows: first, for each mask M in $R_M$, we compute the corresponding masked dataset $D_M$, which is simply the operation $I_M = I \wedge M$ for all I in D. Secondly, we compute a dictionary U whose keys are the elements of $D_M$ and whose values are the numbers of occurrences of those elements in $D_M$. Then, we compute for every element $I_M$ in $D_M$ the probability:

$$P(\text{Real} \mid I_M) = \frac{P(I_M \mid \text{Real})\,P(\text{Real})}{P(I_M \mid \text{Real})\,P(\text{Real}) + P(I_M \mid \text{Random})\,P(\text{Random})}$$

with $P(\text{Real}) = P(\text{Random}) = 0.5$, $P(I_M \mid \text{Random}) = 2^{-HW(M)}$, $HW(M)$ being the Hamming weight of M, and $P(I_M \mid \text{Real}) = \frac{1}{n} \times U[I_M]$. Finally, we update the M-ODT as follows: M-ODT[M][$I_M$] = $P(\text{Real} \mid I_M)$.

Training the classifier on probabilities. Upon building the M-ODT, we can start training the classifier. Given a third dataset D = {(input_0, y_0), ..., (input_n, y_n)}, with input_j a sample of type (C, C′) transformed into (∆L, ∆V, V0, V1), the label $y_j \in \{0, 1\}$, and n = $10^6$, we first compute the feature vector

$$F_j = \big(P(\text{Real} \mid I_j \wedge M_1) \;\; P(\text{Real} \mid I_j \wedge M_2) \;\; \cdots \;\; P(\text{Real} \mid I_j \wedge M_m)\big)^T$$

for all inputs and for $m = |R_M|$. Next, we determined the optimal parameters θ for the model $g_θ$ according to Equation 1, with L being the square loss. Here, the classifier $g_θ$ is Light Gradient Boosting Machine (LGBM) [12].

Implementation details. Feature vectors are standardized. Fine-tuning of the model hyper-parameters was achieved by grid search. Results were obtained by cross-validation on 20% of the train set, and the test set had $10^5$ samples. Finally, results are obtained on the complete pipeline for three different seeds, five times for every seed.

5.4 Results

The M-ODT pipeline was implemented with numpy, scikit-learn [20] and pytorch [19]. The project code can be found at the URL given in footnote 7. Our workstation consists of an Nvidia GeForce GTX 970 GPU with 4043 MiB memory and four Intel Core i5-4460 processors clocked at 3.20 GHz.

General results. Table 10 shows accuracies of the DDT, the DNN and our M-ODT pipeline on 5 and 6-round reduced SPECK-32/64 for $1.1 \times 10^7$ generated samples. When compared to DNN and DDT, our M-ODT pipeline reached an intermediate performance right below the DNN. The main difference is the true positive rate, which is higher in our pipeline (this can be explained by the fact that our M-ODT preprocessing only considers real samples). All in all, our M-ODT pipeline successfully models the property P.

Matching. Table 11 summarizes the results of the quantitative correspondence studies for the prediction between the two models. We compared the DNN trained on samples of type (∆L, ∆V, V0, V1) to our M-ODT pipeline. On 5 rounds, we obtained a rate of 97.5% identical predictions. In addition, 91.3% were both identical and equal to the label. On 6 rounds, the matching prediction rate drops to 93.1%.
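The M-ODT construction and the feature vector described in Section 5.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: each sample (∆L, ∆V, V0, V1) is assumed to be packed into one 64-bit integer, the function names are hypothetical, and masked inputs never seen among the real samples are given probability 0, which is what the formula yields for a zero count.

```python
from collections import Counter


def build_modt(real_samples, masks):
    """Return {mask: {masked_input: P(Real | masked_input)}}.

    real_samples: iterable of 64-bit integers drawn from the real distribution.
    masks: iterable of 64-bit integer masks (the relevant masks).
    """
    real_samples = list(real_samples)
    n = len(real_samples)
    modt = {}
    for mask in masks:
        # Dictionary U: number of occurrences of each masked input.
        occurrences = Counter(sample & mask for sample in real_samples)
        p_random = 2.0 ** -bin(mask).count("1")  # P(I_M | Random) = 2^-HW(M)
        table = {}
        for masked, count in occurrences.items():
            p_real = count / n                   # P(I_M | Real) = U[I_M] / n
            # Bayes' rule with P(Real) = P(Random) = 0.5
            table[masked] = 0.5 * p_real / (0.5 * p_real + 0.5 * p_random)
        modt[mask] = table
    return modt


def feature_vector(sample, modt, masks):
    """One probability per mask, in the order given by masks."""
    return [modt[mask].get(sample & mask, 0.0) for mask in masks]


masks = [0b0011, 0b1000]
modt = build_modt([0b1011] * 4, masks)
print(feature_vector(0b1011, modt, masks))
```

The resulting vectors, one per sample, are what a gradient boosting classifier such as LGBM would then be trained on.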
We thus demonstrated that our method successfully approximates the performance of the neural distinguisher. With an initial linear transformation on the inputs, computing an M-ODT for a set of masks extracted from the DNN and then classifying the resulting feature vector with LGBM, we achieved an approach that is efficient yet more easily interpretable than Gohr's distinguishers. Indeed, the obscure features of the DNN are simply approximated in our pipeline by $F = \left(P(\mathrm{Real} \mid I \wedge M_1),\ P(\mathrm{Real} \mid I \wedge M_2),\ \ldots,\ P(\mathrm{Real} \mid I \wedge M_m)\right)^T$. Finally, we interpret the performance of the classifier globally (i.e. retrieving the decision tree) and locally (i.e. deducing which feature played the greatest role in the classification for each sample) as in [14]. Those results are not displayed as they are beyond the scope of the present work, but they can be found in the project code.

7 https://github.com/AnonymousSubmissionEuroCrypt2021/A-Deeper-Look-at-Machine-Learning-Based-Cryptanalysis

Table 10: A comparison of the accuracies of Gohr's neural network, the DDT and our M-ODT pipeline for around 150 masks generated each time, with input (∆L, ∆V, V0, V1), LGBM as classifier and $1.1 \times 10^7$ samples generated in total. TPR and TNR refer to the true positive rate and the true negative rate respectively.
...

Table 11: A comparison of the predictions of Gohr's neural network and of our M-ODT pipeline for around 150 masks generated each time, with input (∆L, ∆V, V0, V1), LGBM as classifier and $1.1 \times 10^7$ samples generated in total.
...

5.5 Application to SIMON Cipher

In order to check whether our approach could be generalized to other cryptographic primitives, we evaluated our M-ODT method on 8 rounds of the SIMON-32/64 block cipher. Implementing the same pipeline, we obtained an accuracy of 82.2% for the classification, whereas the neural distinguisher achieves an accuracy of 83.4%. In addition, the matching rate between the two models was up to
92.4%. The slight deterioration in the results of our pipeline for SIMON can be explained by the lack of efficient masks such as those introduced in Section 5.3 for SPECK.

5.6 Discussions

From the cryptanalyst's standpoint, one important aspect of using the neural distinguisher is to uncover the property P learned by the DNN. Unfortunately, while being powerful and easy to use, Gohr's neural network remains opaque. Our main conjecture is that the 10-layer residual blocks, considered as the core of the model, act as a compressed DDT applied on the whole input space. We model this idea with a Masked Output Distribution Table (M-ODT). The M-ODT can be seen as a distribution table applied on masked outputs, in our case (∆L, ∆V, V0, V1), instead of only the difference $(C_l \oplus C'_l,\ C_r \oplus C'_r)$. By doing so, features are no longer abstract as in the neural distinguisher. In our pipeline, each feature is the probability for the sample to be real, knowing the mask and the input. In the end, with our M-ODT pipeline, we obtained a model whose accuracy is only 0.6% below that of the DNN, with a matching of 97.3% on 5 rounds of SPECK-32/64. Additional analyses of our pipeline (e.g. masks independence, inputs influence, classifiers influence) are available in the project code.

To the best of our knowledge, this work is the first successful attempt to exhibit the underlying mechanism of the neural distinguisher. However, we note that a minor limitation of our method is that it still requires the DNN to extract the relevant masks during the preparation of the distinguisher. Since the DNN is only needed during preparation, this does not detract from the interpretability of the distinguisher. Future work will aim at computing these masks without the DNN. All in all, our findings represent an opportunity to guide the development of a novel, easy-to-use and interpretable cryptanalysis method.
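To make the notion of a distribution table on masked outputs more concrete, the following is a minimal sketch of how such a table and the resulting probability features could be computed. It is not Algorithm D from the Supplementary Materials: samples and masks are assumed to be plain Python integers used as bit-vectors, the names `build_modt` and `feature_vector` are hypothetical, the 4-bit toy data is invented, and assigning probability 0.0 to masked values never observed among real samples is an assumption of this sketch.

```python
from collections import Counter


def build_modt(real_samples, masks):
    """Build {mask: {masked_input: P(Real | masked_input)}} from real samples only."""
    n = len(real_samples)
    modt = {}
    for mask in masks:
        # Dictionary U: number of occurrences of each masked value I_M = I & M.
        counts = Counter(sample & mask for sample in real_samples)
        # P(I_M | Random) = 2^(-HW(M)), with HW(M) the Hamming weight of the mask.
        p_random = 2.0 ** (-bin(mask).count("1"))
        table = {}
        for masked, count in counts.items():
            p_real = count / n  # P(I_M | Real) = U[I_M] / n
            # Bayes' rule with P(Real) = P(Random) = 0.5, so the priors cancel.
            table[masked] = p_real / (p_real + p_random)
        modt[mask] = table
    return modt


def feature_vector(sample, masks, modt):
    """One probability per mask; unseen masked values map to 0.0 (sketch assumption)."""
    return [modt[m].get(sample & m, 0.0) for m in masks]


# Toy 4-bit example, invented for illustration.
real_samples = [0b1010, 0b1010, 0b1000, 0b0110]
masks = [0b1100, 0b0011]

modt = build_modt(real_samples, masks)
features = feature_vector(0b1001, masks, modt)
print(modt)
print(features)
```

In this toy run, each feature is directly readable as "the probability that the sample is real, given what is visible through this mask", which is the interpretability property the discussion above relies on.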
6 Improved Training Models

While the two previous sections focused on understanding how the neural distinguisher works, here we explain how one can outperform Gohr's results. The main idea is to create batches of ciphertext inputs instead of pairs. We call a batch input of size B a group of B ciphertexts that are constructed from the same key. Here, we can distinguish two ways to train and evaluate the neural distinguisher pipeline with batch input. The straightforward one is to evaluate the neural distinguisher score for each element of the batch and then to take the median of the results. The second is to consider the whole batch as a single input for a neural distinguisher. In order to do so, we used a 2-dimensional CNN (2D-CNN) where the channel dimension holds the features (∆L, ∆V, V0, V1). We should point out that, for the sake of comparability with Gohr's work, we kept the product of the training set size and the batch size equal to $10^7$. Both batch-based methods yielded similar accuracy values (see Table 12). Notably, in both cases, we obtained 100% accuracy on 5 and 6 rounds with batch sizes 10 and 50 respectively.

Table 12: Study of the batch size methods on the accuracies with (∆L, ∆V, V0, V1) as input for 5 and 6 rounds.
...

Considering these encouraging outcomes, we extended the method to 7 rounds. As the 7-round training is more sophisticated and the two previous methods are equivalent, we decided to apply only the first method (the averaging one), because it requires training only one neural distinguisher. Results given in Table 13 confirm our previous findings: with a batch size of 100, we obtain 99.7% accuracy on 7 rounds. This remarkable outcome demonstrates the major improvement of our batch strategy over Gohr's earlier work.

Table 13: Study of the averaging batch size method on the 7-round accuracies with (∆L, ∆V, V0, V1) as input.
...
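The first batch method of this section (score each element of the batch with a single distinguisher, then aggregate) can be sketched as follows. This is only an illustration under stated assumptions: the per-element scores are hypothetical stand-ins for the outputs of a trained neural distinguisher, the function name `classify_batch` is invented, the 0.5 decision threshold is an assumption, and the median is used as the aggregation since that is the statistic named in the description of the method (a mean could be substituted by replacing `median` with `statistics.mean`).

```python
from statistics import median


def classify_batch(scores, threshold=0.5):
    """Label a batch as real (1) if the median per-element score exceeds the threshold.

    `scores` is a list of distinguisher outputs in [0, 1], one per batch element,
    all assumed to come from ciphertexts produced under the same key.
    """
    if not scores:
        raise ValueError("batch must contain at least one score")
    return 1 if median(scores) > threshold else 0


# Hypothetical per-element scores, invented for illustration.
real_like_batch = [0.62, 0.48, 0.71, 0.55, 0.39]
random_like_batch = [0.45, 0.52, 0.31, 0.49, 0.58]

real_decision = classify_batch(real_like_batch)
random_decision = classify_batch(random_like_batch)
print(real_decision, random_decision)
```

The sketch shows why batching helps: individual scores on either side of the threshold are tolerated as long as the aggregate falls on the correct side, so a weak per-element advantage is amplified as the batch grows.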
7 Conclusion

In this article, we proposed a thorough analysis of Gohr's deep neural network distinguishers of SPECK-32/64 from CRYPTO'19. By carefully studying the classified sets, we managed to uncover that these distinguishers base their decisions not only on the ciphertext pair difference, but also on the internal state difference in the penultimate and antepenultimate rounds. We confirmed our findings by proposing pure cryptanalysis-based distinguishers on SPECK-32/64 that match Gohr's accuracy. Moreover, we also proposed a new simplified pipeline for Gohr's distinguishers that could reach the same accuracy while allowing a complete interpretability of the decision process. We finally gave possible directions to further improve over Gohr's accuracy.

Our results indicate that Gohr's neural distinguishers are not really producing novel cryptanalysis attacks, but rather optimizing the information extraction under low-data constraints. Many more distinguisher settings, machine learning pipelines and types of ciphers should be studied to have a better understanding of what machine learning-based cryptanalysis might be capable of. Yet, we foresee that such tools could become of interest for cryptanalysts and designers to easily and generically pre-test a primitive for simple weaknesses. Our work also opens interesting directions with regard to the interpretability of deep neural networks, and we believe our simplified pipeline might lead to better interpretability in areas other than cryptography.

Acknowledgements

The authors are grateful to the anonymous reviewers for their insightful comments that improved the quality of the paper. The authors are supported by the Temasek Laboratories NTU grant DSOCL17101. We would like to thank Aron Gohr for pointing out that the differential characteristics mentioned in the attacks of Dinur [6] have already been extended by one free round; thus, our previous suggestion of extending Dinur's attack by one round is invalid.
References

1. Abed, F., List, E., Lucks, S., Wenzel, J.: Differential cryptanalysis of round-reduced simon and speck. In: Fast Software Encryption - FSE 2014. LNCS, vol. 8540, pp. 525–545. Springer (2014)
2. Beaulieu, R., Shors, D., Smith, J., Treatman-Clark, S., Weeks, B., Wingers, L.: The SIMON and SPECK families of lightweight block ciphers. IACR Cryptol. ePrint Arch. 2013, 404 (2013), http://eprint.iacr.org/2013/404
3. Biryukov, A., Roy, A., Velichkov, V.: Differential analysis of block ciphers SIMON and SPECK. In: Cid, C., Rechberger, C. (eds.) Fast Software Encryption - FSE 2014. LNCS, vol. 8540, pp. 546–570. Springer (2014)
4. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
5. Castro, J., Gómez, D., Tejada, J.: Polynomial calculation of the shapley value based on sampling. Computers & Operations Research 36(5), 1726–1730 (2009)
6. Dinur, I.: Improved differential cryptanalysis of round-reduced speck. In: Selected Areas in Cryptography - SAC 2014. pp. 147–164 (2014)
7. Duan, X., Yue, C., Liu, H., Guo, H., Zhang, F.: Attitude tracking control of small-scale unmanned helicopters using quaternion-based adaptive dynamic surface control. IEEE Access 9, 10153–10165 (2021), https://doi.org/10.1109/ACCESS.2020.3043363
8. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Annals of statistics pp. 1189–1232 (2001)
9. Fu, K., Wang, M., Guo, Y., Sun, S., Hu, L.: Milp-based automatic search algorithms for differential and linear trails for speck. IACR Cryptol. ePrint Arch. 2016, 407 (2016)
10. Gerault, D., Minier, M., Solnon, C.: Constraint programming models for chosen key differential cryptanalysis. In: Principles and Practice of Constraint Programming. pp. 584–601. Springer (2016)
11. Gohr, A.: Improving attacks on round-reduced speck32/64 using deep learning. In: Advances in Cryptology - CRYPTO 2019. LNCS, vol. 11693, pp. 150–179. Springer (2019)
12. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.: Lightgbm: A highly efficient gradient boosting decision tree. In: Advances in neural information processing systems. pp. 3146–3154 (2017)
13. Langford, S.K., Hellman, M.E.: Differential-linear cryptanalysis. In: Advances in Cryptology - CRYPTO '94. LNCS, vol. 839, pp. 17–25. Springer (1994)
14. Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.I.: Explainable ai for trees: From local explanations to global understanding. arXiv preprint arXiv:1905.04610 (2019)
15. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in neural information processing systems. pp. 4765–4774 (2017)
16. Maghrebi, H., Portigliatti, T., Prouff, E.: Breaking cryptographic implementations using deep learning techniques. In: Security, Privacy, and Applied Cryptography Engineering - SPACE 2016. pp. 3–26 (2016)
17. Mouha, N., Preneel, B.: A proof that the ARX cipher salsa20 is secure against differential cryptanalysis. IACR Cryptol. ePrint Arch. 2013, 328 (2013), http://eprint.iacr.org/2013/328
18. Mouha, N., Wang, Q., Gu, D., Preneel, B.: Differential and linear cryptanalysis using mixed-integer linear programming. In: Information Security and Cryptology - Inscrypt 2011. pp. 57–76 (2011)
19. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems. pp. 8026–8037 (2019)
20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. the Journal of machine Learning research 12, 2825–2830 (2011)
21. Rivest, R.L.: Cryptography and machine learning. In: Advances in Cryptology - ASIACRYPT '91. pp. 427–439 (1991)
22. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685 (2017)
23. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
24. Song, L., Huang, Z., Yang, Q.: Automatic differential analysis of ARX block ciphers with application to SPECK and LEA. In: Information Security and Privacy - ACISP 2016. pp. 379–394 (2016)
25. Sun, S., Gérault, D., Lafourcade, P., Yang, Q., Todo, Y., Qiao, K., Hu, L.: Analysis of aes, skinny, and others with constraint programming. IACR Trans. Symmetric Cryptol. 2017(1), 281–306 (2017)
26. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365 (2017)
27. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European conference on computer vision. pp. 818–833. Springer (2014)
28. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016), http://arxiv.org/abs/1606.06160

Supplementary Materials
...