Re: Scientists create tool to kill cyber attacks in ‘less than a second’

Undiscussed Horrific Abuse, One Victim of Many gmkarl at
Thu May 19 14:58:59 PDT 2022

> Abstract
> Perimeter-based detection is no longer sufficient for mitigating the
> threat posed by malicious software. This is evident as antivirus (AV)
> products are replaced by endpoint detection and response (EDR) products,
> the latter allowing visibility into live machine activity rather than
> relying on the AV to filter out malicious artefacts. This paper argues that
> detecting malware in real-time on an endpoint necessitates an automated
> response due to the rapid and destructive nature of some malware. The
> proposed model uses statistical filtering on top of a machine learning
> dynamic behavioural malware detection model in order to detect individual
> malicious processes on the fly and kill those which are deemed malicious.
> In an experiment to measure the tangible impact of this system, we find
> that fast-acting ransomware is prevented from corrupting 92% of files with
> a false positive rate of 14%. Whilst the false-positive rate currently
> remains too high to adopt this approach as-is, these initial results
> demonstrate the need for a detection model that is able to act within
> seconds of the malware execution beginning; a timescale that has not been
> addressed by previous work.
> 1. Introduction
> Our increasingly digitised world broadens both the opportunities and
> motivations for cyberattacks, which can have devastating social and
> financial consequences [1]. Malicious
> software (malware) is one of the most commonly used vectors to propagate
> malicious activity and exploit code vulnerabilities.
> Due to the huge numbers of new malware appearing each day, the detection
> of malware samples needs to be automated [2].
> Signature-matching methods are not resilient enough to handle obfuscation
> techniques or to catch unseen malware types and as such, automated methods
> of generating detection rules, such as machine learning, have been widely
> proposed [3–6]. These
> approaches typically analyse samples when the file is first ingested,
> either using static code-based methods or by observing dynamic behaviours
> in a virtual environment.
> This paper argues that both of these approaches are vulnerable to evasion
> from the attacker. Static methods may be thwarted by simple
> code-obfuscation techniques whether rules are hand-generated [7] or
> created using machine learning [8]. Dynamic detection in a sandboxed
> environment cannot continue forever: either it is time-limited [9] or it
> ends after some period of inactivity [10]. This fixed
> period allows attackers to inject benign activity during analysis and wait
> to carry out malicious activity once the sample has been deemed harmless
> and passed on to the victim’s environment. The pre-execution filtering of
> malware is the model used by antivirus but this is insufficient to keep up
> with the ever-evolving malware landscape and has led to the creation of
> endpoint detection and response (EDR) products which allow security
> professionals to monitor and respond to malicious activity on the victim
> machine. Real-time malware detection also monitors malware live on the
> machine thus capturing any malicious activity on the victim machine even if
> it was not evident during initial analysis. This paper proposes that once a
> threat is detected, due to the fast-acting nature of some destructive
> malware, it is vital to have automated actions to support these detections.
> In this paper, we investigate automated detection and killing of malicious
> processes for endpoint protection.
> There are several key challenges to address in detecting malware
> on-the-fly on a machine in use by comparison with detecting malicious
> applications that are detonated in isolation in a virtual machine. These
> are summarised below:
> (1) *Signal Separation:* Detection in real time requires that the
> malicious and benign activities are separated in order that automated
> actions can be taken on only the malicious processes.
> (2) *Use of Partial Traces:* In order to try and mitigate damage, malware
> needs to be detected as early as possible, but, as shown in previous work
> [11], there is a trade-off between the amount of data collected and
> classification accuracy in the first few seconds of an application
> launching and the same may be true for individual processes.
> (3) *Quick Classification:* The inference itself should be as fast as
> possible in order to further limit the chance of malicious damage once the
> process is deemed malicious.
> (4) *Impact of Automated Killing in Supervised Learning:* Supervised
> learning averages the error rate across the entire training set but when
> the classification results in an action, this smoothing out of errors
> across the temporal dataset is not possible.
> This paper seeks to address these key challenges and provides preliminary
> results including a measure of “damage prevented” in a live environment for
> fast-acting destructiveware. As well as the results from these experiments,
> this paper contributes an analysis of the computational resources against
> detection accuracy for many of the most popular machine-learning algorithms
> used for malware detection.
> The key contributions of this paper are as follows:
> (i) The first general malware detection model to demonstrate damage
> mitigation in real-time using process detection and killing
> (ii) Benchmarking of commonly used ML algorithm implementations with
> respect to computational resource consumption
> (iii) Presentation of real-time malware detection against more user
> background applications than have previously been investigated; increasing
> from 5 to 36 (up to 95 simultaneous processes)
> The next section outlines related work, followed by a report of the three
> methodologies that were tested to try to address these challenges
> (Section 3), in which the method for evaluating these models is also
> explained (Section 6.5). The experimental setup is described in Section
> 5.2.1, followed by results in Section 6.
> 2. Related Work
> 2.1. Malware Detection with Static or Post-collection Behavioural Traces
> 2.1.1. Static Sources
> Machine learning models trained on static data have shown good detection
> accuracy. Chen et al. [5] achieved 96%
> detection accuracy using statically extracted sequences of API calls to
> train a Random Forest model. However, static data have been demonstrated to
> be quite vulnerable to concept drift [12, 13]. Adversarial
> samples present an additional emerging concern; Grosse et al. [14] and
> Kolosnjaji et al. [8]
> demonstrated that static malware detection models achieving over 90%
> detection accuracy could be thwarted by injecting code or simply altering
> the padded code at the end of a compiled binary, respectively.
> 2.1.2. Post-Collection Dynamic Data
> Dynamic behavioural data are generated by the malware carrying out its
> functionality. Again machine learning models have been used to draw out
> patterns between malicious and benign software using dynamic data. Various
> dynamic data can be collected to describe malware behaviour. The most
> commonly used data are API calls made to the operating system, typically
> recorded in short sequences or by frequency of occurrence. Huang and
> Stokes’s research [3] reports the highest accuracy in recent malware
> detection literature, using a very large dataset of more than 6 million
> samples to achieve a detection accuracy of 99.64% with a neural network
> trained on the input parameters passed to API calls, their return values,
> and the co-occurrence of API calls. Other dynamic data sources include
> dynamic opcode sequences (e.g., Carlin et al. [9] achieved 99% using a
> Random Forest), hardware performance counters (e.g., Sayadi et al. [15]
> achieved 94% on Linux/Ubuntu malware using a decision tree), network
> activity and file system activity (e.g., Usman et al. [16] achieved 93%
> using a decision tree in combination with threat intelligence feeds and
> these data sources), and machine activity metrics (e.g., Burnap et al.
> [17] achieved 94% using a self-organising map). Previous work [18]
> demonstrated the robustness of machine activity metrics over API calls in
> detecting malware collected from different sources.
> Dynamic detection is more difficult to obfuscate but typically the time
> taken to collect data is several minutes, making it less attractive for
> endpoint detection systems. Some progress has been made on early detection
> of malware. Previous work [11] was able to
> detect malware with 94% accuracy within 5 seconds of execution beginning.
> However, as a sandbox-based method, malware which is inactive for the first
> 5 seconds is unlikely to be detected with this approach. Moreover, the
> majority of dynamic malware detection papers use virtualised environments
> to collect data.
> 2.2. Real-Time Malware Detection with Partial Behavioural Traces
> OS = operating system; HPCs = Hardware performance counters; DT = Decision
> Tree; MLP = Multi-layer perceptron; NN = Neural Network; RF = Random Forest.
> Previous work has begun to address the four challenges set out in the
> introduction. Table 1 summarises the
> related literature and the problems considered by the researchers.
> Table 1: Real-time malware detection literature problems considered.
> To the best of our knowledge, challenge *(1) signal separation* has only
> previously been addressed by Sun et al. [23] using
> sequential API call data. The authors execute up to 5 benign and malicious
> programs simultaneously achieving 87% detection accuracy after 5 minutes of
> execution and 91% accuracy after 10 minutes of execution.
> Challenge *(2) to detect malware using partial traces as early as
> possible* has not been directly addressed. Some work has looked at early
> run-time detection; Das et al. [20] used an FPGA as
> part of a hybrid hardware-software approach to detect malicious Linux
> applications using system API calls which are then classified using a
> multilayer perceptron. Their model was able to detect 46% of malware within
> the first 30% of its execution with a false-positive rate of 2% in offline
> testing. These findings however were not tested with multiple benign and
> malicious programs running simultaneously and do not explain the impact of
> detecting 46% of malware within 30% of its execution trace in terms of
> benefits to a user or the endpoint being protected. How long does it take
> for 30% of the malware to execute? What has occurred in that time?
> Greater attention has been paid to challenge *(3) quick classification*,
> insofar as this problem also encompasses the need for lightweight
> detection. Some previous work has proposed hardware-based detection for
> lightweight monitoring. Sayadi et al. [15] use hardware performance
> counters (HPCs) as features to train ensemble learning algorithms and
> scored 0.94 AUC using a dataset of 100 malicious and 100 benign Linux
> software samples. Ozsoy et al. [21] use low-level architectural events to
> train a multilayer perceptron on the more widely used [25] (and
> attacked) Windows operating system. The model was able to detect 94% of
> malware with a false-positive rate of 7% using partial execution traces of
> 10,000 committed instructions. The hardware-based detection models,
> however, are less portable than software-based systems due to the ability
> of the same operating system to run on a variety of hardware configurations.
> Both Sun et al. [23] and Yuan [22] propose
> two-stage models to address the need for lightweight computation. The first
> stage comprises a lightweight ML model such as a Random Forest to alert
> suspicious processes, the second being a deep learning model which is more
> accurate but more computationally intensive to run. Two-stage models, as
> Sun et al. [23]
> note, can get stuck in an infinite loop of analysis in which the first
> model flags a process as suspicious but the second model deems it benign
> and this labelling cycle continues repeatedly. Furthermore, if the first
> model is prone to false negatives, malware will never be passed to the
> second model for deeper analysis.
> Challenge *(4) the impact of automated actions* has been discussed by Sun
> et al. [23]. The
> authors also propose the two-stage approach as a solution to this problem.
> The authors apply restrictions to the process whilst the deeper NN analysis
> takes place followed by the killing of malicious-labelled processes. The
> authors found that the delaying strategy impacted benignware more than
> malware and used this two-stage process to account for the irreversibility
> of the decision to kill a process. The authors did not assess the impact on
> the endpoint with respect to the time at which the correctly classified
> malware was terminated.
> 3. Methodology-Three Approaches
> As noted above, supervised learning models average errors across the
> training set but in the case of real-time detection and process killing, a
> *single* false positive on a benign process amongst 300 true-negatives
> would cause disruption to the user. The time at which malware is
> detected is also important: the earlier the better. Therefore, the
> supervised learning model needs to be adapted to take account of these new
> requirements.
> Three different ways of tackling this issue were attempted, and all three
> are reported here in the interests of reporting negative results as well as
> the one which performed the best. These were:
> (1) Statistical methods to smooth the alert surface and filter out single
> false-positives
> (2) Reinforcement learning, which is capable of incorporating the
> consequences of model actions into learning
> (3) A regression model based on the feedback of a reinforcement learning
> model made possible by having the ground-truth labels
> Figure 1 gives a high-level depiction of the three approaches tested in
> this paper.
> Figure 1: High-level depiction of three approaches taken.
> 3.1. Statistical Approach: Alert Filtering
> It is expected that transitioning from a supervised learning model to a
> real-time model will see a rise in false-positives since one single alert
> means benign processes (and all child processes) are terminated, which
> effectively renders all future data points as false positives. Filtering
> the output of the models, just as the human brain filters out transient
> electrical impulses in order to separate background noise from relevant
> data [26], may
> be sufficient to make supervised models into suitable agents. This is
> attractive because supervised learning models are already known to perform
> well for malware detection, as confirmed by the previous paper and other
> related work [11, 20, 27, 28]. A disadvantage
> of this approach is that it introduces additional memory and computational
> requirements both in order to calculate the filtered results and to track
> current and historic scores; therefore, a model which integrates the
> expected consequences of an action into learning is also tested:
> reinforcement learning.
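
A minimal sketch of the alert-filtering idea, in Python. The quoted text does
not say which statistical filter the authors use, so the rolling mean, the
window of 3 and the 0.5 threshold are all assumptions chosen for illustration,
as is the class name AlertFilter.

```python
from collections import deque


class AlertFilter:
    """Rolling-mean filter over per-process maliciousness scores.

    Hypothetical helper: window and threshold are illustrative assumptions,
    not values taken from the paper.
    """

    def __init__(self, window=3, threshold=0.5):
        self.window = window
        self.threshold = threshold
        self.history = {}  # pid -> deque holding the most recent scores

    def update(self, pid, score):
        """Record a new score; return True when the smoothed score says kill."""
        scores = self.history.setdefault(pid, deque(maxlen=self.window))
        scores.append(score)
        # Require a full window so one early spike cannot trigger a kill.
        if len(scores) < self.window:
            return False
        return sum(scores) / len(scores) >= self.threshold
```

The per-process history dictionary is the extra memory cost the paragraph
above mentions: the filter has to keep recent scores for every live process.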
> 3.2. Reinforcement Learning: Q-Learning with Deep *Q* Networks
> The proposed automated killing model may be better suited to a
> reinforcement learning strategy than to supervised learning. Reinforcement
> learning uses rewards and penalties from the model’s environment. The
> problem that this paper is seeking to solve is essentially a supervised
> learning problem, but one for which it is not possible to average
> predictions. There are no opportunities to classify the latter stages of a
> process if the agent kills the process, and this can be reflected by the
> reward mechanism of the reinforcement learning model (see Figure 1).
> Therefore,
> reinforcement learning seems like a good candidate for this problem space.
> Two limitations of this approach are that (1) reinforcement learning
> models can struggle to converge on a balanced solution, and the models must
> learn to balance the exploration of new actions with the re-use of known
> high-reward actions, commonly known as the exploration-exploitation
> trade-off [29]; and
> (2) in these experiments, the reward is based on the malware/benignware
> label at the application level rather than being linked to the actual
> damage being caused; therefore, the signal is a proxy for what the model
> should be learning. This is used because, as discussed above, the damage
> caused by different malware is subjective.
> For reinforcement learning, loss functions are replaced by reward
> functions which update the neural network weights to reinforce actions (in
> context) that lead to higher rewards and discourage actions (in context)
> that lead to lower rewards; these contexts and actions are known as
> state-action pairs. Typically, the reward is calculated from the perceived
> value of the new state that the action leads to, e.g., points scored in a
> game. Often this cannot be pre-labelled by a researcher since there are so
> many (maybe infinite) state-action pairs. However, in this case, all
> possible state-action pairs can be enumerated, which is the third approach
> tested (regression model, outlined in the next section).
> The reinforcement model was still tested. Here the reward is +*n* for a
> correct prediction and −*n* for an incorrect prediction, where *n* is the
> total number of processes impacted by the prediction. For example, if there
> is only one process in a process tree but 5 more will appear over the
> course of execution, a correct prediction gives a reward of +6, and an
> incorrect prediction gives a reward of −6.
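
The reward symbols did not survive in the quoted text. A sketch of the reward
under the assumption that its magnitude equals the number of processes
impacted (the current process plus those that will appear later), positive
when the prediction is correct and negative otherwise. The function and
parameter names are hypothetical.

```python
def reward(predicted_malicious, is_malicious, processes_now, processes_later=0):
    """Reward scaled by how many processes the prediction affects.

    Assumption: magnitude = processes already in the tree + processes that
    will appear over the rest of execution.
    """
    impacted = processes_now + processes_later
    return impacted if predicted_malicious == is_malicious else -impacted
```

With one process in the tree and 5 more to come, as in the example above, this
gives 6 for a correct prediction and -6 for an incorrect one.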
> There are a number of reinforcement learning algorithms to choose from.
> This paper explores Q-learning [30–33] to approximate
> the value or “quality” (*q*) of a given action in a given situation.
> Q-learning approximates q-tables, which are look-up tables of every
> state-action pair and their associated rewards. A state-action pair is a
> particular state in the environment coupled with a particular action, i.e.,
> the machine metrics of the process at a given point in time with the action
> to leave the process running. When the number of state-action pairs becomes
> quite large, it is easier to approximate the value using an algorithm. Deep
> *Q* networks (DQN) are neural networks that implement q-learning and have
> been used in state-of-the-art reinforcement learning arcade game playing,
> see Mnih et al. [34]. A DQN was the reinforcement algorithm trialled here.
> It did not perform well by comparison with the other methods, and a
> different RL algorithm may perform better [35], but the results are still
> included in the interests of future work. The
> following paragraphs will explain some of the key features of the DQN.
> The DQN tries out some actions; stores the states, actions, resulting
> states, and rewards in memory; and uses these to learn the expected rewards
> of each available action, with the highest expected reward being the one
> that is chosen. Neural networks are well-suited to this problem since their
> parameters can easily be updated; tree-based algorithms like random forests
> and decision trees can be adapted to this end but not as easily. Future
> rewards can be built into the reward function and are usually discounted
> according to a tuned parameter usually signified by γ.
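
For reference, the tabular form of the Q-learning update that a DQN
approximates can be sketched as below. This is the textbook rule, not the
authors' network; the learning rate alpha, the discount gamma and the state
and action names are illustrative assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, terminal=False):
    """One Q-learning step: move Q(s, a) towards reward + gamma * max Q(s', .)."""
    # After a terminal action (e.g. killing the process) there is no future.
    best_next = 0.0 if terminal else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Usage: the q-table maps (state, action) pairs to estimated values.
q_table = defaultdict(float)
```

The terminal flag reflects the point made earlier: once the agent kills a
process there are no later states to learn from, so the discounted future
term is dropped.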
> In Mnih et al.’s [34] formulation, in
> order to address the exploration-exploitation trade-off, DQNs either
> exploit a known action or explore a new one, with the chance of choosing
> exploration falling over time. When retraining the model based on new
> experiences, there is a risk that previous useful learned behaviours are
> lost; this problem is known as catastrophic forgetting [36]. Mnih et al.’s
> [34] DQNs use two tools to combat this problem. The first is experience
> replay, by which past state-action pairs are shuffled before being used for
> retraining so that the model does not catastrophically forget. The second
> is a second network, which updates at infrequent intervals in order to
> stabilise the learning.
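
A sketch of two of the mechanisms just described: exploration whose
probability falls over time, and an experience-replay memory sampled at
random. The linear decay schedule, the default constants and the names
ReplayMemory, epsilon and choose_action are assumptions for illustration; the
second (target) network is not shown.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past transitions, sampled in random order."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Random sampling breaks the temporal order of the stored experiences.
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def epsilon(step, start=1.0, end=0.05, decay_steps=1000):
    """Exploration probability, decaying linearly from start to end."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)


def choose_action(q_values, step, rng=random):
    """Epsilon-greedy: explore with probability epsilon(step), else exploit."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Passing a seeded random.Random instance as rng makes a run repeatable, which
helps when comparing training configurations.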
> Q-learning may enable a model to learn when it is confident enough to kill
> a process, using the discounted future rewards. For example, choosing not
> to kill some malware at time *t* may have some benefit as it allows the model
> to see more behaviour at *t* + 1 which gives the model greater confidence
> that the process is in fact malicious.
> Q-learning approximates rewards from experience, but in this case, all
> rewards from state-action pairs can actually be pre-calculated. Since one
> of the actions will kill the process and thus end the “experience” of the
> DQN, it could be difficult for this model to gain enough experience. Thus
> pre-calculation of rewards may improve the breadth of experience of the
> model. For this reason, a regression model is proposed to predict the *Q*-value
> of a given action.
> 3.3. Regression Using *Q*-Values
> Unlike classification problems, regression problems can predict a
> continuous value rather than discrete (or probabilistic) values relating to
> a set of output classes. Regression algorithms are proposed here to predict
> the q-value of killing a process. If this value is positive, the process is
> killed.
> *Q*-values estimate the value of a particular action based on the
> “experience” of the agent. Since the optimal action for the agent is always
> known, it is possible to precompute the “(q-) value” of killing a process
> and train various ML models to learn this value. It would typically be
> quicker to train a regression model which tries to learn the value of
> killing a process than to train a DQN which explores the state-action space
> and calculates rewards between learning, since the interaction and
> calculation of rewards is no longer necessary. The regression approach can
> be used with any machine learning algorithm capable of learning a
> regression problem, regardless of whether it is capable of partial training.
> There are two primary differences between this regression approach and the
> reinforcement learning DQN approach detailed in the previous section.
> Firstly, the datasets are likely to differ. Since the DQN
> generates training data through interacting with its environment, it may
> never see certain parts of the state-action space, e.g., if a particular
> process *p* is always killed during training before time *t*, the model is
> not able to learn from the data of process *p* after *t*.
> Secondly, *only* the expected value of killing is modelled by the
> regressor, whereas the DQN tries to predict the value of both killing and
> of not killing the process. This means that the equation used to model the
> value of process killing is only an approximation of the reward function
> used by the DQN.
> The equation used to calculate the value of killing is positive for
> malware and negative for benignware; in both cases, it is scaled by the
> number of child processes impacted and in the case of malware, early
> detection increases the value of process killing (with an exponential
> decay). Let *y* be the true label of the process (0 = benign, 1 =
> malicious), *c* the number of child processes, and *t* the time in seconds
> at which the process is killed; then, the value of killing a process is:
> The equation above negatively scores the killing of benignware in
> proportion to the number of subprocesses and scores the killing of malware
> positively in proportion to the number of subprocesses. A bonus reward is
> scored for killing malware early, with an exponential decay over time.
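
The equation itself did not survive in the quoted text. The sketch below is
one function consistent with the verbal description (negative for benignware,
positive for malware, scaled by the processes affected, with an exponentially
decaying bonus for early kills). It is not the paper's formula: the 1 + c
scaling, the form of the bonus and the decay rate are assumptions, and the
names are hypothetical.

```python
import math


def kill_value(is_malicious, num_children, seconds_running, decay=1.0):
    """Illustrative value of killing a process (not the paper's equation)."""
    scale = 1 + num_children  # assumption: the process itself plus its children
    if not is_malicious:
        return -float(scale)  # killing benignware is penalised
    # Early kills earn a bonus that decays exponentially with elapsed time.
    return scale * (1.0 + math.exp(-decay * seconds_running))


def should_kill(predicted_value):
    """Decision rule from Section 3.3: kill when the predicted value is positive."""
    return predicted_value > 0
```

A regressor trained on targets of this kind only has to learn one number per
observation, and the kill decision reduces to checking its sign.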
> 4. Evaluation Methodology: Ransomware Detection
> Previous research has not addressed the extent to which damage is
> mitigated by process killing: Sun et al. [23] presented the only previous
> work to test process killing, and it does not compare damage with and
> without process killing. To this end, this paper uses ransomware as
> a proxy to detect malicious damage, inspired by Scaife et al.’s approach
> [24]. A brief
> overview of Scaife et al.’s damage measurement is outlined below:
> Early detection is particularly useful for types of malware from which
> recovery is difficult and/or costly. Cryptographic ransomware encrypts user
> files and withholds the decryption key until a ransom is paid to the
> attackers. This type of attack is typically costly to remedy, even if the
> victim is able to carry out data recovery [37]. Scaife et
> al.’s work [24]
> on ransomware detection uses features from file system data, such as
> whether the contents appear to have been encrypted, and number of changes
> made to the file type. The authors were able to detect and block all of the
> 492 ransomware samples tested with less than 33% of user data being lost in
> each instance. Continella et al. [38] propose a
> self-healing system, which detects malware using file system machine
> activity (such as read/write file counts); the authors were able to detect
> all 305 ransomware samples tested, with a very low false-positive rate.
> These two approaches use features selected specifically for their ability
> to detect ransomware, but this requires knowledge of how the malware
> operates, whereas the approach taken here seeks to use features which can
> be used to detect malware *in general*. The key purpose of this final
> experiment (Section 6.5) is to show
> that our general model of malware detection is able to detect general types
> of malware as well as time-critical samples such as ransomware.
> 5. Experimental Setup
> This section outlines the data capture process and dataset statistics.
> 5.1. Features
> The same features as were used in previous work [11] are used here
> for process detection, with some additional features to measure
> process-specific data. Despite their popularity, API calls were not
> considered as features to train the model, owing to the findings in Ref.
> [18] and to Sun et al.’s [23] difficulties hooking this data in real-time.
> At the process-level, 26 machine metric features are collected; these were
> dictated by the attributes available using the Psutil [39] Python library.
> It is also possible to include the “global” machine activity metrics that
> were used in the previous papers. Although global metrics will not provide
> process-level granularity, they may give muffled indications of the
> activity of a wider process tree. The 9 global metrics are: system-level
> CPU use, user-level CPU use, memory use, swap memory use, number of packets
> received and sent, number of bytes received and sent, and the total number
> of processes running.
> The process-level machine activity metrics collected are: CPU use at the
> user level, CPU use at the system level, physical memory use, swap memory
> use, total memory use, number of child processes, number of threads, maximum
> process ID from a child process, disk read, write and other I/O count,
> bytes read, written and used in other I/O processes, process priority, I/O
> process priority, number of command line arguments passed to process,
> number of handles being used by process, time since the process began, TCP
> packet count, UDP packet count, number of connections currently open, and 4
> port statuses of those opened by the process (see Table 2).
> Table 2: 26 process-level features: 22 features + 4 port status values.
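
A sketch of how a few of these process-level metrics could be read from a
psutil.Process object. Only a subset of the 26 features is shown, and the
mapping from the paper's feature names to psutil calls, along with the
dictionary keys, is an assumption rather than the authors' collection code.

```python
import time


def process_features(proc, now=None):
    """Collect a subset of process-level metrics from a psutil.Process-like object."""
    now = time.time() if now is None else now
    cpu = proc.cpu_times()        # named tuple with .user and .system
    mem = proc.memory_info()      # named tuple with .rss (physical memory)
    children = proc.children(recursive=True)
    return {
        "cpu_user": cpu.user,
        "cpu_system": cpu.system,
        "mem_physical": mem.rss,
        "num_children": len(children),
        "max_child_pid": max((c.pid for c in children), default=0),
        "num_threads": proc.num_threads(),
        "num_cmdline_args": len(proc.cmdline()),
        "age_seconds": now - proc.create_time(),
    }


# Usage, with psutil installed:
#   import psutil
#   features = process_features(psutil.Process())
```

The function takes the process object as an argument rather than importing
psutil itself, so it can also be exercised against recorded or simulated
process data.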
> 5.1.1. Preprocessing
> Feature normalisation is necessary for NNs to avoid over-weighting
> features with higher absolute values. The test, train, and validation sets
> (*X*) are all normalised by subtracting the mean (μ) and dividing by the
> standard deviation (σ) of each feature in the training set: (*X* − μ)/σ. This sets
> the range of input values largely between −1 and 1 for all input features,
> avoiding the potential for some features to be weighted more important than
> others during training purely due to the scalar values of those features.
> This requires additional computational resources but is not necessary for
> all ML algorithms; this is another reason why the supervised RNN used in
> Ref. [11] may
> not be well-suited for real-time detection.
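
A sketch of the normalisation step using only the Python standard library:
the mean and standard deviation are computed on the training set and then
reused unchanged for the validation and test sets. The function names are
hypothetical, and falling back to a divisor of 1.0 for constant features is
an added assumption to avoid dividing by zero.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, from the training set only."""
    columns = list(zip(*train_rows))
    mus = [mean(col) for col in columns]
    # Assumption: a constant feature (zero deviation) is divided by 1.0.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return mus, sigmas


def transform(rows, mus, sigmas):
    """Apply (x - mean) / std with the statistics learned from training data."""
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]
```

Values in the test set can fall outside the range seen in training, which is
why the text says the inputs lie largely, rather than strictly, between -1
and 1.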
> 5.2. Data Capture
> During data capture, this research sought to improve upon previous work
> and emulate real machine use to a greater extent than has previously been
> trialled. The implementation details of the VM, simultaneous process
> execution, and RL simulation are outlined below:
> 5.2.1. Environment: Machine Setup
> The following experiments were conducted using a virtual machine (VM)
> running with Cuckoo Sandbox [40] for ease of
> collecting data and restarting between experiments and because the Cuckoo
> Sandbox emulates human interaction with programs to some extent to promote
> software activity. In order to emulate the capabilities of a typical
> machine, the modal hardware attributes of the top 10 “best seller” laptops
> according to a popular Internet vendor [41] were used, and
> these attributes were the basis of the VM configuration. This resulted in a
> VM with 4GB RAM, 128GB storage, and dual-core processing running Windows 7
> 64 bit. Windows 7 was the most prevalent computer operating system (OS)
> globally at the time of designing the experiment [25]. Although
> Windows 10 is now the most popular OS, the findings in this research should
> still be relevant.
> 5.2.2. Simultaneous Applications
> In typical machine use, multiple applications run simultaneously. This is
> not reflected by behavioural malware analysis research in which samples are
> injected individually to a virtual machine for observation. The environment
> used for the following experiments launches multiple applications on the
> same machine at slightly staggered intervals as if a user were opening
> them. Each malware sample is launched with a small number (1–3) and a larger
> number (3–35) of applications. It was not possible to find up-to-date user
> data on the number of simultaneous applications running on a typical
> desktop, so here it was elected to launch up to 36 applications (35 benign
> + 1 malicious) at once, which is the largest number of simultaneous apps
> for real-time data collection to date. From the existing real-time analysis
> literature, only Sun et al. [23] run multiple
> applications at the same time, with a maximum of 5 running simultaneously.
> Each application may in turn launch multiple processes, causing more than
> 35 processes to run at once; 95 is the largest number of simultaneous
> processes recorded; this excludes background OS processes.
> 5.2.3. Reinforcement Learning Simulation
> For reinforcement learning, the DQN requires an observation of the
> resulting state following an action. To train the model, a simulated
> environment is created from the pre-collected training data whereby the
> impact of killing or not killing a process is returned as the next state.
> For process-level elements, this reduces all features to zero. A caveat
> here is that in reality, killing the process may not occur immediately and
> therefore memory, processing power, etc., may still be being consumed at
> the next data observation. For global metrics, the process-level values for
> the killed processes (includes child processes of the killed process) are
> subtracted from the global metrics. There is a risk again that this
> calculation may not correlate perfectly with what would be observed in a
> live machine environment.
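
A sketch of the simulated "kill" transition described above: the killed
process and its children have their process-level features set to zero, and
their values are subtracted from the matching global metrics. The data layout
(dictionaries keyed by pid and by metric name), the num_processes key and the
function name are assumptions made for illustration.

```python
def simulate_kill(process_rows, global_metrics, killed_pid, descendants):
    """Return the next (process_rows, global_metrics) after a simulated kill.

    process_rows: pid -> {metric name: value}
    global_metrics: {metric name: value}, including "num_processes"
    descendants: pid -> iterable of child pids (assumed to list all descendants)
    """
    doomed = {killed_pid, *descendants.get(killed_pid, ())}
    next_rows = {}
    next_global = dict(global_metrics)  # leave the caller's data untouched
    for pid, row in process_rows.items():
        if pid in doomed:
            # Subtract this process's share from matching global metrics.
            for name, value in row.items():
                if name in next_global:
                    next_global[name] -= value
            next_rows[pid] = dict.fromkeys(row, 0)
            next_global["num_processes"] -= 1
        else:
            next_rows[pid] = dict(row)
    return next_rows, next_global
```

As the text cautions, a real kill may not take effect immediately, so the
zeroed features here are an idealisation of what a live machine would report.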
> In order to observe the model performance, a visualisation was developed
> to accompany the simulated environment. Figures 2 and 3 show
> screenshots of the environment visualisation for one malicious and one
> benign process.
> Figure 2: Benignware sample, normalised process-level metrics, 6
> observations made without process being killed.
> Figure 3: Malware sample, normalised process-level metrics, no observations
> made yet.
> 5.3. Dataset
> The dataset comprises 3,604 benign executables and 2,792 malicious
> applications (each containing at least one executable), with 2,877 for
> training and validation and 3,519 for testing. These dataset sizes are
> consistent with previous real-time detection dataset sizes (Das et al. [20]
> use 168 malicious, 370 benign; Sayadi et al. [15] use over 100 each benign
> and malicious; Ozsoy et al. [21] use 1,087 malicious and 467 benign; Sun et
> al. [23] use 9,115 malicious, 877 benign). With multiple samples running
> concurrently to
> simulate real endpoint use, there are 24K processes in the training set and
> 34K in the test set. Overall, there are 58K behavioural traces of processes
> in the training and testing datasets. The benign samples comprise files
> from VirusTotal [42], from free software websites (later verified as
> benign with VirusTotal), and from a fresh Microsoft Windows 7 installation.
> The malicious samples were collected from two different VirusShare [43]
> repositories.
> In Pendlebury et al.’s analysis [13], the authors
> estimate that in the wild between 6% and 22% of applications are malicious,
> normalising to 10% for their experiments. Using this estimation of Android
> malware, a similar ratio was used in the test set in which 13.5% were
> malicious.
> 5.3.1. Malware Families
> PUA = potentially unwanted application, RAT = remote access trojan.
> This paper is not concerned with distinguishing particular malware
> families, but rather with identifying malware in general. However, a
> dataset consisting of just one malware family would present an unrealistic
> and easier problem than is found in the real world. The malware families
> included in this dataset are reported in Table 3. The malware
> family labels are derived from the output of around 60 antivirus engines
> used by VirusTotal [42].
> Table 3
> Malware families with more than 10 samples in the dataset. 315 families
> were represented in the dataset, with 27 being represented more than
> 10 times. The basic descriptions provided do not cover the wide range of
> behaviours carried out by some malware families but are intended to indicate
> the range of behaviours in the top 27 families included in the dataset.
> Ascribing family labels to malware is nontrivial since antivirus vendors
> do not follow standardised naming conventions and many malware families
> have multiple aliases. Sebastián et al. [44] have developed
> an open source tool, AVClass, to extract meaningful labels and correlate
> aliases between different antivirus outputs. AVClass was used to label the
> malware in this dataset. Sometimes there is no consensus amongst the
> antivirus’ output or the sample is not recognised as a member of an
> existing family. AVClass also excludes malware that belongs to very broad
> classes of malware (e.g., “agent,” “eldorado,” and “artemis”) as these are
> likely to comprise a wide range of behaviours and so may be applied as a
> default label in cases for which antivirus engines are unsure. In the
> dataset established in this research, 2,121 of the 2,792 samples were
> assigned to a malware family. Table 3 gives the
> number of samples in each family for which more than 10 instances were
> found in the dataset. 315 families were detected overall, with 27 families
> being represented more than 10 times. These better-represented families
> persist in the train and test sets, but the other families have little
> overlap. 104 of the 154 other families seen in the test set are not
> identified by AVClass as being in the training set.
> 5.3.2. Malicious Vs. Benign Behaviour
> Statistical inspection of the training set reveals that benign
> applications have fewer sub-processes than malicious applications, with 1.17
> processes in the average benign process tree and 2.33 processes in the
> average malicious process tree. Malware was also more likely to spawn
> processes outside of the process tree of the root process, often using the
> names of legitimate Windows processes. In some cases, malware launches
> legitimate applications, such as Microsoft Excel in order to carry out a
> macro-based exploit. Although Excel is not a malicious application in
> itself, it is malicious in this context, which is why malicious labels are
> assigned if a malware sample has caused that process to come into being. It
> is therefore possible to argue that some processes launched by malware are
> not malicious, because they do not individually cause harm to the endpoint
> or user, but without the malware they would not be running and so can be
> considered at least undesirable even if only in the interests of conserving
> computational resources.
> 5.3.3. Train-Test Split
> The dataset is split in half with the malicious samples in the test set
> coming from the more recent VirusShare repository, and those in the
> training set from the earlier repository. This is to increase the chances
> of simulating a real deployment scenario in which the malware tested
> contains new functionality by comparison with those in the training set.
> Ideally, the benignware should also be split by date across the training
> and test set; however, it is not a trivial task to calculate the date at
> which benignware was compiled. It is possible to extract the compile time
> from the PE header, but the PE author can manually set this date, which
> had clearly happened in some instances where the compile date was
> 1970-01-01 or in one instance 1970-01-16. In the latter case
> (1970-01-16), the file is first mentioned online in 2016, perhaps
> indicating a typographic error [45]. Using Internet
> sources such as VirusTotal [42] can give an
> indication when software was first seen, but if the file is not very
> suspicious, i.e., from a reputable source, it may not have been uploaded
> until years after it was first seen “in the wild.” Due to the difficulty in
> dating benignware in the dataset collected for this research, samples were
> assigned to the training or test set randomly.
> For training, an equal number of benign and malicious processes are
> selected, so that the model does not bias towards one class. 10% of these
> are held out for validation. In most ML model evaluations, the validation
> set would be drawn from the same distribution as the test set. However,
> because it is important not to leak any information about the malware in
> the test set, since it is split by date, the validation set here is drawn
> from the training distribution.
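
A minimal sketch of the balanced sampling and 10% hold-out described above, using only the standard library. The function name, the fixed seed, and the integer stand-ins for process traces are assumptions for illustration; the quoted paper does not say how the sampling was implemented.

```python
import random


def balanced_split(benign, malicious, val_fraction=0.1, seed=0):
    """Sample equal numbers of benign (label 0) and malicious (label 1)
    items, then hold out a fraction of the combined set for validation."""
    rng = random.Random(seed)
    n = min(len(benign), len(malicious))
    labelled = [(x, 0) for x in rng.sample(benign, n)]
    labelled += [(x, 1) for x in rng.sample(malicious, n)]
    rng.shuffle(labelled)
    n_val = round(len(labelled) * val_fraction)
    # The validation set comes from the training distribution only,
    # so nothing from the (later-dated) test malware leaks in.
    return labelled[n_val:], labelled[:n_val]


# Hypothetical stand-ins: 100 benign and 30 malicious process traces.
train, val = balanced_split(list(range(100)), list(range(1000, 1030)))
```

Note that this hold-out is not stratified, so the validation slice itself may be slightly unbalanced.
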
> 5.3.4. Implementation Tools
> Data collection used the Psutil [39] Python library
> to collect machine activity data for running processes and to kill those
> processes deemed malicious. The RNN and Random Forests were implemented
> using the PyTorch [46] and Scikit-Learn [47]
> Python libraries, respectively. The model runs with high priority and
> administrator rights to make sure the polling is maintained when compute
> resources are scarce.
> 6. Experimental Results
> 6.1. Supervised Learning for Process Killing
> First, we demonstrate the unsuitability of a full-trace supervised
> learning malware detection model, which achieved more than 96% detection
> accuracy in Ref. [11]. The model used
> is a gated recurrent unit recurrent neural network since this algorithm is
> designed to process time-series data. The hyperparameters of
> this model were selected using a random search (see
> Table 4 in the Appendix for details).
> Table 4
> Hyperparameter search space and the hyperparameters of the model giving
> the lowest mean false-positive and false-negative rates.
> It is expected that supervised malware detection models will not adapt
> well to process-killing due to the averaging of loss metrics as described
> earlier. Initially, this is verified by using supervised learning models to
> kill processes that are deemed malicious. For supervised classification,
> the model makes a prediction every time a data measurement is taken from a
> process. This approach is compared with one taking average predictions
> across all measurements for a process and for a process tree as well as the
> result of process killing. The models with the highest validation accuracy
> for classification and killing are compared.
> Figure 4 illustrates
> the difference in validation set and test set F1-score, true-positive rate,
> and false-positive rate for these 4 levels of classification: each
> measurement, each process, each process tree, and finally showing process
> killing; see Figure 5 for a
> diagrammatic representation of the first 3 levels. Table 5 reports the F1,
> TPR, and TNR for classification (each measurement of each process) and for
> process killing.
> Figure 4
> F1 scores, true positive rates (TPR), and true negative rates (TNR) for
> partial-trace detection (process measurements), full-trace detection (whole
> process), whole application (process tree), and with process-level
> measurements + process killing (process killing) for validation set (left
> column) and test set (right column).
> Figure 5
> Three levels of data collection: each measurement, each process, each
> process tree.
> Table 5
> F1-score, true positive rate (TPR), and true negative rate (TNR)
> (all × 100) on test and validation sets for classification and process
> killing.
> The highest F1-score on the validation set is achieved by an RNN using
> process data only. When process killing is applied, there is a drop of less
> than 5 percentage points in the F1-score, but more than 15 percentage
> points are lost from the TNR.
> On the unseen test set, the highest F1-score is achieved by an RNN using
> process data + global metrics, but the improvement over the process data +
> total number of processes is negligible. Overall, there is a reduction in
> F1-score from (97.44, 94.61) to (74.91, 77.66), highlighting the initial
> challenge of learning to classify individual processes rather than
> entire applications, especially when accounting for concept drift. Despite
> the low accuracy, these initial results indicate that the model is
> discriminating some of the samples correctly and may form a baseline from
> which to improve.
> The test set TNR and TPR for classification on the best-performing model
> (process data only) are 79.70 and 82.91, respectively, but when process
> killing is applied, although the F1-score drops by 10 percentage points,
> the TNR and TPR move in opposite directions with the TNR falling to 59.63
> and TPR increasing to 90.24. This is not surprising since a single
> malicious classification results in a process being classed as malicious.
> This is true for the best-performing models using either of the two feature
> sets (see Figure 4).
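
The opposite movement of TPR and TNR under process killing can be reproduced on toy data. The traces below are invented purely for illustration (they are not from the paper's dataset); a process counts as killed as soon as any single measurement is classified malicious.

```python
def rates(labels, predictions):
    """Return (TPR, TNR) for binary labels and predictions (1 = malicious)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, tn / negatives


# Hypothetical traces: name -> (true label, per-measurement predictions).
traces = {
    "benign_a": (0, [0, 0, 0, 0]),
    "benign_b": (0, [0, 1, 0, 0]),   # one isolated false alarm
    "malware_a": (1, [0, 1, 1, 1]),
    "malware_b": (1, [0, 0, 0, 1]),
}

# Measurement level: every prediction is scored on its own.
m_labels = [label for label, preds in traces.values() for _ in preds]
m_preds = [p for _, preds in traces.values() for p in preds]

# Kill level: one malicious prediction anywhere in the trace kills the process.
k_labels = [label for label, _ in traces.values()]
k_preds = [int(any(preds)) for _, preds in traces.values()]

measurement_tpr, measurement_tnr = rates(m_labels, m_preds)
kill_tpr, kill_tnr = rates(k_labels, k_preds)
```

In this toy case the single false alarm in `benign_b` is a small error at the measurement level but costs a whole benign process at the kill level, while late detections of malware still count as kills.
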
> 6.2. Accuracy Vs. Resource Consumption
> Previous work on real-time detection has highlighted the requirement for a
> lightweight model (speed and computational resources). In the previous
> paper, RNNs were the best performing algorithm in classifying
> malware/benignware, but RNNs have many parameters and therefore may consume
> significant RAM and/or CPU. They also require preprocessing of the data to
> scale the values, which other ML algorithms such as tree-based algorithms
> do not. Whilst RAM and CPU should be minimised, taking model accuracy into
> account, inference duration is also an important metric.
> Although the models in this paper have not been coded for performance and
> use common Python libraries, comparing these metrics helps to decide
> whether certain models are vastly preferable to others with respect to
> computational resource consumption. The PyRAPL library [49] is used to
> measure the CPU, RAM, and duration used by each model. This library uses Intel
> processor “Running Average Power Limit” (RAPL) metrics. Only data
> preprocessing and inference are measured as training may be conducted
> centrally in a resource-rich environment. Batch sizes of 1, 10, 100, and
> 1000 samples are tested with 26 and 37 features, respectively, since there
> are 26 process-level features and 37 when global metrics are included. Each
> model is run 100 times for each of the different batch sizes.
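
The benchmarking loop described above might look something like the sketch below. It measures wall-clock time only, using the standard library, and is not a substitute for the RAPL-based measurements in the paper; `dummy_predict` is a hypothetical stand-in for a trained model, and `time_inference` is an invented helper name.

```python
import time
from statistics import mean


def time_inference(predict, n_features, batch_sizes=(1, 10, 100, 1000), repeats=100):
    """Return the mean wall-clock seconds per call for each batch size."""
    results = {}
    for batch_size in batch_sizes:
        batch = [[0.0] * n_features for _ in range(batch_size)]
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict(batch)
            durations.append(time.perf_counter() - start)
        results[batch_size] = mean(durations)
    return results


def dummy_predict(batch):
    # Stand-in for a real model: flags a row when its feature sum is positive.
    return [int(sum(row) > 0) for row in batch]


timings_26 = time_inference(dummy_predict, n_features=26)  # process-level only
timings_37 = time_inference(dummy_predict, n_features=37)  # with global metrics
```
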
> For the RNN, a “large” and a “small” model are included. The large model
> has the highest number of parameters tested in the random search (981
> hidden neurons, 3 hidden layers, sequence length of 17) and the small model
> the lowest (41 neurons, 1 hidden layer, sequence length of 13). These two RNN
> configurations are compared against other machine learning models which
> have been used for malware detection: Multi-Layer Perceptron (feed-forward
> neural network), Support Vector Machine, Naive Bayes Classifier, Decision
> Tree Classifier, Gradient Boosted Decision Tree Classifier (GBDTs), Random
> Forest, and AdaBoost.
> 26 features = process-level only, 37 features = machine and process level
> features
> Table 6 reports
> the computational resource consumption and accuracy metrics together.
> The decision tree with 37 features has the lowest cost to run. The RNN performs
> best at supervised learning classification on the validation set but only just
> outperforms the decision tree with 26 features, which is the best
> performing model at process killing on the validation set at 92.97
> F1-score. The highest test set F1-score for process killing uses a Random Forest
> with 37 features, scoring 77.85 F1, which is 2 percentage points higher
> than the RF with 26 features (75.97). The models all perform at least 10
> percentage points better on the validation set, indicating the importance
> of taking concept drift into account when validating models.
> Table 6
> Average resource consumption over 100 iterations for a batch size of 100
> vs. F1-scores on validation and test set for classification and process
> killing across 14 models.
> 6.3. How to Solve a Problem like Process Killing?
> From the results above, it is clear that supervised learning models see a
> significant drop in classification accuracy when processes are killed as
> the result of a malicious label. This confirmation of the initial
> hypothesis presented here justifies the need to examine alternative
> methods. In the interests of future work and negative result reporting,
> this paper reports all of the methods attempted and finds that simple
> statistical manipulations on the supervised learning models perform better
> than using alternative training methods. This section briefly describes the
> logic of each method and provides a textual summary of the results with a
> formula where appropriate. This is followed by a table of the numerical
> results for each method. In the following section, let $P$ be a set of
> processes $p$ in a process tree, $t$ be the time at which a prediction is
> made, and $y_{p,t}$ be the prediction for process $p$ at time $t$, where a
> prediction equal to or greater than 1 classifies malware.
> 6.3.1. Mean Predictions
> Reasoning: Taking the average prediction across the whole process will
> smooth out those process killing results.
> *Not tested.* This was not attempted for two reasons: (1) Taking the mean
> at the end of the process means the damage is done. (2) This method can
> easily be manipulated by an attacker: 50 seconds of injected benign
> activity requires 50 seconds of malicious activity to achieve a true
> positive.
> 6.3.2. Rolling Mean Predictions
> Reasoning: Taking the average over a few measurements will eliminate those
> false positives that are caused by a single malicious prediction within a
> subset of the execution trace. Window sizes $w$ of 2 to 5 are tested.
> *Summary of results:* A small but universal increase in F1-score using a
> rolling window over 2 measurements on the validation set. Using a rolling
> window of size 2 on the test set saw a 10 to 20 percentage point increase
> in true negative rate (to a maximum of 80.77) with 3 percentage points lost
> from the true-positive rate. This was one of the most promising approaches.
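
A minimal sketch of a rolling-mean filter of this kind. The paper's exact formula is not reproduced in this quote, so the details here are assumptions: predictions are treated as scores in [0, 1], a kill is triggered when the mean of the last `window` scores reaches `threshold` (0.5 is a placeholder), and no decision is made until the window is full.

```python
def rolling_mean_kill(scores, window=2, threshold=0.5):
    """Return the index of the measurement at which the process would be
    killed, or None if the rolling mean never reaches the threshold."""
    for t in range(window - 1, len(scores)):
        recent = scores[t - window + 1 : t + 1]
        if sum(recent) / window >= threshold:
            return t
    return None


# Hypothetical traces: one isolated spike vs. sustained high scores.
benign_trace = [0.1, 0.8, 0.1, 0.2]
malware_trace = [0.2, 0.7, 0.9, 0.95]
benign_kill = rolling_mean_kill(benign_trace)
malware_kill = rolling_mean_kill(malware_trace)
```

Here the single spike in the benign trace is averaged away, while the malicious trace is still killed one measurement after its scores rise, which matches the trade-off reported above (higher TNR, slightly lower TPR).
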
> 6.3.3. Alert Threshold
> Reasoning: Like the rolling mean, single false positives will be
> eliminated but unlike the rolling mean, the alerts are cumulative over the
> entire trace such that an alert at the start and another 30 seconds into the
> process will cause the process to be killed rather than requiring that both
> alerts are within a window of time. Between 2 and 5 minimum alerts are
> tested.
> *Summary of results:* Again, a small increase across all models, with an
> optimal minimum number of alerts being 2 for maximum F1-score, competitive
> with the rolling mean approach.
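
The alert threshold can be sketched as a simple cumulative counter. The function name and the binary encoding of predictions (1 = malicious) are assumptions for illustration.

```python
def alert_threshold_kill(predictions, min_alerts=2):
    """Return the index at which the cumulative number of malicious
    predictions first reaches min_alerts, or None if it never does."""
    alerts = 0
    for t, prediction in enumerate(predictions):
        if prediction == 1:
            alerts += 1
        if alerts >= min_alerts:
            return t
    return None


# Two alerts far apart still trigger a kill; one isolated alert does not.
spread_out = alert_threshold_kill([1, 0, 0, 0, 0, 1])
single = alert_threshold_kill([0, 1, 0, 0])
```

Unlike a rolling window, the counter never forgets an earlier alert, so two alerts separated by any amount of benign-looking activity are enough to kill the process.
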
> 6.3.4. Process-Tree Averaging
> Reasoning: The data are labelled at the application level; therefore, the
> average predictions across the process tree should be considered for
> classification.
> *Summary of results:* Negligible performance increase on validation and
> test set data (less than 1 percentage point). This is likely because few
> samples have more than one process executing simultaneously.
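
As a rough illustration of process-tree averaging, the sketch below averages the per-process scores of one tree at a single point in time. The score scale, the 0.5 threshold, and the function names are assumptions rather than details from the paper.

```python
def tree_mean_score(scores_by_pid):
    """Average the per-process scores of one process tree."""
    return sum(scores_by_pid.values()) / len(scores_by_pid)


def tree_is_malicious(scores_by_pid, threshold=0.5):
    """Classify the whole tree from the averaged score (assumed threshold)."""
    return tree_mean_score(scores_by_pid) >= threshold


# Hypothetical trees: pid -> score at one measurement.
multi_high = tree_is_malicious({101: 0.9, 102: 0.7})
multi_diluted = tree_is_malicious({201: 0.9, 202: 0.0, 203: 0.0})
single_process = tree_is_malicious({301: 0.6})
```

For a tree containing a single process the average equals that process's own score, which is consistent with the negligible change reported above when few samples run more than one process at a time.
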
> 6.3.5. Process-Tree Training
> Reasoning: The data are labelled at the application level; therefore, the
> sum of resources of each process tree should be classified at each
> measurement, not the individual processes.
> *Summary of results:* Somewhat surprisingly, there was a slight reduction
> in classification accuracy when using process tree data. One explanation
> for this may be that the process tree creates noise around the
> differentiating characteristics that are visible at the process level.
> 6.3.6. DQN
> Reasoning: Reinforcement learning is designed for state-action space
> learning. Both pre-training the model with a supervised learning approach
> and not pre-training the model were tested.
> *Summary of results:* Poor performance. Models typically converged to either
> killing or not killing everything. The few models that did not converge to a
> single dominant action did not distinguish malware from benignware well,
> indicating that they may not have learned anything. Reinforcement learning
> may help the problem of real-time malware detection and process killing,
> but this initial implementation of a DQN did not converge to a better or
> even competitive solution to supervised learning. Perhaps a better
> formulation of rewards (e.g., damage prevented) would help the agent learn.
> 6.3.7. Regression on Predicted Kill Value
> Reasoning: The DQN explores and exploits different state-action
> pairs and their associated rewards. However, when the reward from each
> action is known in the first place and the training set is limited, as it is
> here, Q-learning can be framed as a regression problem in which the model
> tries to learn the return (rewards + future rewards); training is then
> faster and can use any regression-capable algorithm. Let $c_{p,t}$ be the
> number of current and future child processes for process $p$ at time $t$.
> *Summary of results:* Improved performance on true negative rate,
> although this is not perceptible for the highest-scoring F1 models. Since
> F1-scores reward true positives more than true negatives, this metric can
> struggle to reflect a balance between the true-positive and true-negative
> rates. The highest true-negative rate models are all regression models.
> Table 7 lists
> the F1, TPR, and TNR on the validation and test set for each of the methods
> described above. The best-performing model on the test and validation sets
> is reported and the full results can be found in Appendix Tables 8–10. Small
> improvements are made by some models on the validation F1-score, but the
> test set F1-score improves by 4 percentage points in the best instance.
> Table 7
> Summary of the best process killing models by model training methodology.
> F1, TNR, and TPR for validation and test datasets (full results in Appendix
> Tables 8–10).
> Table 8
> Summary of process killing models, validation, and test set score metrics
> [Table 1 of 3].
> Table 9
> Summary of process killing models, validation, and test set score metrics
> [Table 2 of 3].
> Table 10
> Summary of process killing models, validation, and test set score metrics
> [Table 3 of 3].
> In most cases, the models with the highest F1-score on the validation and
> test sets are not the same. The highest test set F1-score is 81.50, from an
> RF using a minimum alert threshold of 2 and both process-level and global
> metrics.
> 6.4. Further Experiment: Favouring High TNR
> Although the proposed model is motivated by the desire to prevent malware
> from executing, the best TNR reported amongst the models above is 81.50%.
> 20% of benign processes being killed would not be acceptable to a user.
> Whilst this research is a novel attempt at very early-stage real-time
> malware detection and process killing, one might consider the usability and
> prefer a model with a very high TNR, even if this is at the expense of the
> TPR.
> Considering this, the AdaBoost regression algorithm achieves a 100% TNR
> with a 39.50% TPR on the validation set. The high TNR is retained in the
> test set, standing at 97.92%, but the TPR drops even further to just 8.40%.
> The GBDT also uses regression to estimate the value of process killing and,
> coupled with a minimum of 4 alerts, performs well on the test set but does
> not stand out in the validation set; see Table 11.
> Table 11
> Two models’ F1-score, TNR, TPR for the validation and test set scoring the
> highest TNR on the validation and test sets.
> Although less than 10% of the test set malicious processes are killed by
> the AdaBoost regressor, this model may be the most viable despite the low
> TPR. Future work may examine the precise behaviour and harm caused by
> malware that is/is not detected. To summarise results, the most-detected
> families were Ekstak (180), Mikey (80), Prepscram (53 processes), and Zusy
> (49 processes) of the 745 total samples.
> 6.5. Measuring Damage Prevention in Real Time
> Although a high percentage of processes are correctly identified as
> malicious by the best performing model (RF with 2 alerts and 37 features),
> it may be that the model detects the malware after it has already caused
> damage to the endpoint. Therefore, instead of looking at the time at which
> the malware is correctly detected, a live test was carried out with
> ransomware to measure the percentage of files corrupted with and without
> the process killing model working. This real-time test also assesses
> whether malware can indeed be detected in the early stages of execution or
> whether the data recording, model inference, and process killing is too
> slow in practice to prevent damage.
> Ransomware is the broad term given to malware that prevents access to user
> data (often by encrypting files) and holds the means for restoring the data
> (usually a decryption key) from the user until a ransom is paid. It is
> possible to quantify the damage caused by ransomware using the proportion
> of modified files as Scaife et al. [24] have done in
> developing a real-time ransomware (only) detection system. The damage of
> some malware types is more difficult to quantify owing to its dependence
> on factors outside the control of the malware. For example, the damage
> caused by spyware will depend on what information it is able to obtain, so
> it is difficult to quantify the benefit of killing spyware 5 seconds after
> execution compared with 5 minutes into execution. Ransomware offers a clear
> metric for the benefits of early detection and process killing.
> Although the RF with a minimum of 2 alerts using both process and global
> data gave the highest F1-score on the test set (81.50), earlier experiments
> showed that RFs are not one of the most computationally efficient models by
> comparison with those tested. Therefore, a decision tree trained on
> process-only data (26 features) is also used in this test, in case the
> time-to-classification is important for damage reduction despite the lower
> F1-score. The DT also has a very slightly higher TPR (see Table 12), so a
> higher damage prevention rate may be partially due to the model itself
> rather than just the fewer features being collected and model
> classification speed.
> Table 12
> Random Forest and Decision Tree each with a minimum requirement of two
> alerts (“malicious classifications”) to kill a process. F1, TNR, and TPR
> reported on validation and test set.
> 22 fast-acting ransomware files were identified from a separate VirusShare
> [43] repository
> which *(i)* do not require Internet connection and *(ii)* begin
> encrypting files within the first few seconds of execution. The former
> condition is set because the malicious server may no longer exist and, for
> safety, it is not desirable to connect to it if it does exist. Some malware
> is able to cause significant damage in seconds, a timeframe too short for a
> human to see, process, and react to an alert.
> The 22 samples were executed for 30 seconds each without the process
> killing model and the number of files modified was recorded. The process
> was repeated with 4 process killing models: DT with min. 2 alerts and 26
> features, RF with min. 2 alerts and 37 features, AdaBoost regressor with 26
> features, and GBDT regressor with min. 4 alerts and 26 features.
> It was necessary to run the killing model with administrator privileges
> and to write an exception for the Cuckoo sandbox agent process which
> enables the host machine to read data from the guest machine since the
> models killed this process. The need for this exception highlights that
> there are benign applications with malicious-like behaviours, perhaps
> especially those used for networking and security.
> Figure 6 and Table 13 give
> the total number of corrupted files across the 22 samples. The damage
> prevention column is a proxy metric denoting how many files were not
> corrupted using a given process killing model by comparison with no model
> being in place. The 22 samples on average each corrupt 910 files within 30
> seconds.
> Figure 6
> Total number of files corrupted by ransomware with no process killing and
> with three process killing models within the first 30 seconds of execution.
> Table 13
> Total number of files corrupted by ransomware with no process killing and
> with three process killing models within the first 30 seconds of execution.
> Damage reduction is the percentage of files spared relative to when no
> killing is implemented.
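
The damage prevention proxy can be written as a one-line calculation. The file counts below are hypothetical placeholders, not values from Table 13, and the function name is invented for illustration.

```python
def damage_prevention(corrupted_without, corrupted_with):
    """Percentage of files spared relative to the run with no killing model.

    Negative values are possible when more files are corrupted with the
    model running than without it (e.g. due to run-to-run variance).
    """
    if corrupted_without <= 0:
        raise ValueError("baseline must corrupt at least one file")
    return 100.0 * (corrupted_without - corrupted_with) / corrupted_without


# Hypothetical counts: 20,000 files corrupted with no model, 1,500 with one.
spared = damage_prevention(20000, 1500)
worse = damage_prevention(20000, 21000)
```
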
> The DT model almost entirely eliminates any file corruption with only
> three being corrupted. The RF saves 92.68% of files. The ordinal ranking of
> “damage prevention” is the same as the TPR on the test set, but the
> relationship is not proportional. The same ordinal relationship indicates
> that the simulated impact of process killing on the collected test set was
> perhaps a reasonable approximation of measuring at least fast-acting
> ransomware damage, despite the TPR test set metrics being based on other
> malware families, too.
> The DT demonstrates that this architecture is capable of preventing
> damage, but the TNR on the test set of the DT model is so low (66.19) that
> this model cannot be preferred to the RF (81.53 TNR), which still prevents
> over 90% of file damage.
> The GBDT prevents some damage and detects a comparable number of
> ransomware samples (1 in 5). The AdaBoost regressor detected 2 ransomware
> samples of the 22, and in these two cases more than 64% and 45% of files
> were saved, respectively. Perhaps, with more execution time, more samples
> would be detected, but the key benefit of process killing is to stop
> damaging software like these ransomware samples. Overall, this algorithm
> actually saw more files encrypted than when no killing model was used; this
> is because there will be a slight variance in the ransomware behaviour and
> execution time each time it runs. The Random Forest is the most plausible
> model, balancing damage prevention and TNR; however, the delay in
> classification may be a result of the requirement to collect more features
> and/or the run time of the model itself.
> 7. Discussion: Measuring Execution Time in a Live Environment
> Although algorithm execution duration was measured above, due to batch
> processing used by the models, the number of processes being classified can
> be increased by an order of magnitude with a negligible impact on execution
> time. The data collection and process killing both have linear, $O(n)$,
> complexity, where $n$ is the number of processes; therefore it is expected
> that the number of processes impacts classification time. The complexity of
> the RF with statistical filters depends on $n$, on the number of trees in
> the forest, and on the number of alerts considered by the filter; efficient
> library implementations of matrix operations mean that the execution time
> does not scale linearly with $n$ for the RF inference. Given this, a further experiment
> was carried out with the RF to measure in a live environment how long the
> data collection, model inference, and process killing takes as the number
> of processes increases. This was tested by executing more than 1000
> processes in the virtual machine whilst the process killing RF runs.
> Some processes demand more computational resources than others, and some
> malware in our test set locked pages in memory [50], which
> prevented the model from having sufficient resources to collect data,
> leading to tens of seconds during which no data were captured and many
> processes were launched. With better software engineering practices, the
> model may be more robust against this kind of malicious activity.
> These differences in behaviour can cause the evaluation time to lag as
> demonstrated by the outlier points visible in Figure 7. The data show
> a broadly linear positive correlation between the number of processes
> (being monitored or killed) and the time taken for the data collection and
> process killing; this confirms the hypothesis that more processes equate
> to slower processing time. The slowest total processing time was 0.81
> seconds (seen with both 17 and 40 simultaneous processes running), but the
> mean processing time is just under 0.3 seconds with 65 simultaneous
> processes, fitting comfortably within the 1-second goal time. Additional
> code optimisation could greatly improve on these initial results which
> indicate that the processing, even using standard libraries and a
> high-level programming language, can execute reasonably quickly.
> Figure 7
> Mean time to collect data, analyse data with Random Forest, and kill
> varying numbers of processes.
> 8. Implications and Analysis
> The experiments in this paper address a largely unexplored area of malware
> detection, by comparison with post-trace classification. Real-time
> processing and response has a number of benefits outlined above and the
> results presented here give tentative indications of the advantages and
> challenges of such an approach.
> The initial experiments (Section 6.1) demonstrate
> that a high-accuracy RNN (as used in [11]) does not
> maintain high accuracy when used in real-time with an automated response to
> classify individual processes rather than full application traces, since a
> single false positive classification of sequential data cannot be
> outweighed by later correct predictions.
> The next set of experiments (Section 6.2) showed that
> whilst the RNN achieves one of the highest classification accuracies of a
> set of algorithms tested, it is not one of the best in terms of
> computational resource consumption or latency. However, a clear
> best-algorithm was not evident either since the low-resource consuming
> algorithms (like decision tree) did not always achieve high accuracy.
> Furthermore, all of the supervised learning algorithms were clearly
> unsuited to process killing with the highest F1 score from any algorithm
> being 77.85 on the test set compared with 85.55 for process-level
> classification alone. This 85.55 F1 score is lower than is seen in many
> dynamic malware detection research publications that use full-application
> behavioural traces, indicating the challenges of classification at the
> process level, where malware and benignware may share functionality.
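
For anyone wanting to relate the F1 figures above to the TPR/TNR figures quoted later, here is a small sketch of the standard definitions. The `Confusion` class and the counts in the example are illustrative only and are not taken from the paper's data. Note that F1 ignores true negatives, which is why a true negative rate has to be reported separately when the cost of killing benign processes matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for a binary malicious/benign decision (hypothetical container)."""

    tp: int  # malicious, flagged or killed
    fn: int  # malicious, missed
    tn: int  # benign, left alone
    fp: int  # benign, wrongly flagged or killed

    @staticmethod
    def _ratio(num: int, den: int) -> float:
        # Return 0.0 rather than raising when a class is empty.
        return num / den if den else 0.0

    @property
    def tpr(self) -> float:
        """True positive rate (recall on malware)."""
        return self._ratio(self.tp, self.tp + self.fn)

    @property
    def tnr(self) -> float:
        """True negative rate (recall on benignware)."""
        return self._ratio(self.tn, self.tn + self.fp)

    @property
    def precision(self) -> float:
        return self._ratio(self.tp, self.tp + self.fp)

    @property
    def f1(self) -> float:
        """Harmonic mean of precision and recall, written in count form."""
        return self._ratio(2 * self.tp, 2 * self.tp + self.fp + self.fn)


if __name__ == "__main__":
    c = Confusion(tp=8, fn=2, tn=9, fp=1)  # made-up counts
    # The paper quotes F1 on a 0-100 scale, so scale by 100 for comparison.
    print(f"TPR={c.tpr:.2f} TNR={c.tnr:.2f} F1={100 * c.f1:.2f}")
```
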
> Attempting to improve detection accuracy, three approaches were tested:
> statistical filtering, reinforcement learning, and a regression model
> estimating the utility (q-value) of killing a process. Statistical filters
> using rolling mean or alert thresholds were the only approach to improve on
> the supervised learning model F1 score. Reinforcement learning tended to
> kill processes too early and therefore not explore enough scenarios (and
> thus receive the requisite reinforcement) to allow benign processes to
> continue; this does not mean that future models could not improve upon this
> result. This may be supported by the success of the regression models in
> maintaining a high true-negative rate, given that these models ascribed a
> similar utility to killing processes as the reinforcement learning models.
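
A minimal sketch of what the two statistical filters mentioned above (rolling mean and alert threshold) might look like when placed on top of per-process classifier scores. This is one guess at a reasonable implementation, not the authors' code: the class names, the 0.5 cut-offs, the window size and the assumption that alerts are counted cumulatively per process are all mine.

```python
from collections import deque
from typing import Deque, Dict


class RollingMeanFilter:
    """Kill only when the mean of the last `window` scores is high enough."""

    def __init__(self, window: int = 3, threshold: float = 0.5) -> None:
        self.window = window
        self.threshold = threshold
        self._scores: Dict[int, Deque[float]] = {}

    def update(self, pid: int, score: float) -> bool:
        """Record a new maliciousness score for pid; return True to kill."""
        buf = self._scores.setdefault(pid, deque(maxlen=self.window))
        buf.append(score)
        # Wait for a full window so one early spike cannot trigger a kill.
        return len(buf) == self.window and sum(buf) / len(buf) >= self.threshold


class MinAlertsFilter:
    """Kill once a process has raised at least `min_alerts` alerts."""

    def __init__(self, min_alerts: int = 2, alert_score: float = 0.5) -> None:
        self.min_alerts = min_alerts
        self.alert_score = alert_score
        self._alerts: Dict[int, int] = {}

    def update(self, pid: int, score: float) -> bool:
        """Count an alert if the score is high; return True to kill."""
        if score >= self.alert_score:
            self._alerts[pid] = self._alerts.get(pid, 0) + 1
        return self._alerts.get(pid, 0) >= self.min_alerts


if __name__ == "__main__":
    rolling = RollingMeanFilter(window=3, threshold=0.5)
    alerts = MinAlertsFilter(min_alerts=2)
    for score in (0.9, 0.1, 0.2, 0.9, 0.9):  # made-up scores for one process
        print(score, rolling.update(1, score), alerts.update(1, score))
```

Both filters trade a little detection delay for fewer kills caused by a single false positive, which is the failure mode the paper describes for the unfiltered classifier.
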
> The accuracy metrics tested thus far simply indicate whether a process was
> ever killed, but do not address whether damage was actually prevented by
> process killing. If damage was not prevented, there is little point to
> process killing and a database of alerts for analysis would be a better
> solution since the risk of killing benignware is eliminated. This is why
> the final set of experiments in Section 6.5 were
> conducted to test the detection models in real time and see if damage could
> be prevented by looking at the number of files corrupted by ransomware
> before and after infection. Here, we found that it is possible to prevent
> 92% of files from being encrypted whilst maintaining a true negative rate
> of 82%. This result does not indicate that the system is ready for
> real-world deployment, but suggests that further model analysis, probably
> including anomaly detection, could raise the true negative rate to a
> usable point.
> This work also demonstrates the damage that certain malware can carry out
> in a short space of time and reinforces the need for further research in
> this area, since previous work has either focused solely on ransomware [24]
> or waited minutes to begin classification [23], by which time it is too
> late.
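
One way the "files corrupted before and after infection" measurement could be approximated is by hashing a directory tree before and after a run and counting the files whose contents are unchanged. The sketch below assumes that approach. The paper excerpt does not say how corruption was measured, and `snapshot` and `fraction_intact` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def fraction_intact(before: Dict[str, str], after: Dict[str, str]) -> float:
    """Share of the original files still present with unchanged contents."""
    if not before:
        return 1.0
    # A renamed or deleted file has no matching path and counts as damaged.
    intact = sum(1 for path, digest in before.items() if after.get(path) == digest)
    return intact / len(before)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(4):
            (root / f"doc{i}.txt").write_text(f"contents {i}")
        before = snapshot(root)
        # Simulate damage to one file (no real malware involved).
        (root / "doc0.txt").write_bytes(b"\x00scrambled")
        after = snapshot(root)
        print(fraction_intact(before, after))
```
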
> 9. Future Work
> Real-time attack detection has wider applications than endpoint detection,
> as Alazab et al. [51] argue that Internet of Things networks in particular
> could benefit from real-time attack detection using heterogeneous data
> feeds from different sensors combined using federated learning approaches.
> However, some challenges remain to be solved; behavioural malware analysis
> research using machine learning regularly reports 95% classification
> accuracy. Although useful for analysts, behavioural detection should be
> deployed as part of endpoint defensive systems to leverage the full
> benefits of a detection model. Dynamic analysis is not typically used for
> endpoint protection, perhaps because it takes too long in data collection
> to deliver the quick verdicts required for good user experience. Real-time
> detection on the endpoint allows for observation of the full trace without
> the user having to wait. However, real-time detection also introduces the
> risk that malware will cause damage to the endpoint. This risk requires
> that processes detected as malicious are automatically killed as early as
> possible to avoid harm.
> There are some key challenges to implementation, which have been outlined
> in this paper:
> (i) The need for signal separation drives the use of individual processes
> and only partial traces can be used.
> (ii) The significant drop in accuracy on the unseen test set, even without
> process killing, demonstrates that additional features may be necessary to
> improve detection accuracy.
> (iii) With the introduction of process killing, the poor performance of
> the models on either benignware classification (RF min 2 alerts: TNR 81%
> with an 88% TPR on the test set) or on malware classification (GBDT
> regressor min 4 alerts: 56% TPR with a 94% TNR on the test set) means that
> considerable further work is needed before very early stage real-time
> detection can be considered for real-world use.
> (iv) Real-time detection using full execution traces of processes,
> however, may be viable. This is useful to handle VM-aware malware, which
> may only reveal its true behaviour in the target environment. Although the
> more complex approach using DQN algorithms did not outperform the
> supervised models with some additional statistical thresholds, the
> regression models had better performance in correctly classifying
> benignware. Reinforcement learning could still be useful for real-time
> detection and automated cyber defense models, but the DQN in these
> experiments did not perform well.
> (v) Despite the theoretical unsuitability of supervised learning models to
> state-action problems, these experiments demonstrate how powerful
> supervised learning can be for classification problems, even if the
> problem is not quite the one that the model is attempting to solve.
> (vi) Future work may require a more comprehensive manual labelling effort
> at the process level and perhaps labelling sub-sections of processes as
> malicious or benign.
> An additional consideration for real-time detection with automated actions
> is whether this introduces an additional denial-of-service vector, for
> example using process injection to trigger process killing. However, this
> may also indicate that an attacker is present and therefore aid the user.
> 10. Conclusions
> This paper has built on previous work in real-time detection to address
> some of the key challenges: signal separation, detection with partial
> execution traces, and computational resource consumption with a focus on
> preventing harm to the user, since real-time detection introduces this risk.
> Behavioural malware detection using virtual machines is a well-established
> research field yielding high detection accuracy in recent literature [3, 6,
> 11, 20]. However, as is
> shown here, fixed-time execution in a sandbox may not reveal malicious
> functionality. Real-time malware analysis addresses this issue but risks
> executing malware on the endpoint and requires detection to take place at
> the process level, which is more challenging as the definition of a
> malicious process can be unclear. These two reasons may account for the
> limited literature on real-time detection. Looking forward, real-time
> detection may become more popular if static data manipulation and
> VM-evasion continue to be used and the costs of malicious execution
> continue to rise. Real-time detection does not need to be an alternative to
> these approaches, but could hold complementary value as part of a
> defense-in-depth endpoint security.
> To the best of our knowledge, previous real-time detection work has used
> up to 5 simultaneous applications, whereas other users may use far more.
> This paper has demonstrated that up to 35 simultaneous applications (and
> nearly 100 simultaneous processes) can be constantly monitored, where
> previous work [23]
> had tested a maximum of 5. Moreover, these results demonstrated that data
> collection presented a greater limiting factor than machine-learning
> algorithms, which can easily process 1000 samples with negligible impact on
> performance. This result is not too surprising since batch processing
> allows algorithms to achieve O(1) complexity by comparison with O(n) for
> data collection.
> Automatic actions are necessary in response to detection if the goal is to
> prevent harm. Otherwise, this is equivalent to letting the malware fully
> execute and simply monitoring its behaviour, since human response times are
> unlikely to be quick enough for fast-acting malware. From a user
> perspective, the question is not “What percentage of malware was executed?”
> or “Was the malware detected in 5 or 10 minutes?” but “How much damage has
> been done?”.
> This paper found that by using simple statistical filters on top of
> supervised learning models, it was possible to prevent 92% of files from
> being corrupted by fast-acting ransomware thus reducing the requirements on
> the user or organisation to remediate the damage, since it was prevented in
> the first instance (the rest of the attack vector would remain a concern).
> This approach does not achieve the detection accuracies of
> state-of-the-art offline behavioural analysis models but, as stated in the
> introduction,
> these models typically use the full post-execution trace of malicious
> behaviour. Delaying classification until post-execution negates the
> principal advantages of real-time detection. However, the proposed model
> presents an initial step towards a fully automated endpoint protection
> model, which becomes increasingly necessary as adversaries become more and
> more motivated to evade offline automated detection tools.
> Data Availability
> Information on the data underpinning the results presented here, including
> how to access them, can be found in the Cardiff University data catalogue
> at 10.17035/d.2021.0148229014.
> Conflicts of Interest
> The authors declare that they have no conflicts of interest.
> Acknowledgments
> This research was partly funded by the Engineering and Physical Sciences
> Research Council (EPSRC)-grant references EP/P510452/1 and EP/S035362/1.
> The research was also partly funded by Aureirbus Operations Ltd.
> References
>    1. U. Tatar, B. Nussbaum, Y. Gokce, and O. F. Keskin, “Digital force
>    majeure: the mondelez case, insurance, and the (un) certainty of
>    attribution in cyberattacks,” *Business Horizons*, vol. 64, 2021.
>    2. S. K. Sahay, A. Sharma, and H. Rathore, “Evolution of malware and
>    its detection techniques,” in *Information and Communication
>    Technology for Sustainable Development*, pp. 139–150, Springer,
>    Singapore, 2020.
>    3. W. Huang and J. W. Stokes, “Mtnet: a multi-task neural network for
>    dynamic malware classification,” in *Proceedings of the 13th
>    International Conference on Detection of Intrusions and Malware, and
>    Vulnerability Assessment-Volume 9721, DIMVA 2016*, pp. 399–418,
>    Springer-Verlag New York, Inc., New York, NY, USA, June 2016.
>    4. W. Hu and Y. Tan, “Black-box attacks against rnn based malware
>    detection algorithms,” in *Proceedings of the Workshops at the
>    Thirty-Second AAAI Conference on Artificial Intelligence*, Peking
>    University, Beijing, China, February 2018.
>    5. Y. Chen, Z. Shan, F. Liu et al., “A gene-inspired malware detection
>    approach,” in *Journal of Physics: Conference Series*, vol. 1168, IOP
>    Publishing, 2019.
>    6. M. Ijaz, M. H. Durad, and M. Ismail, “Static and dynamic malware
>    analysis using machine learning,” in *Proceedings of the 2019 16th
>    International Bhurban Conference on Applied Sciences and Technology
>    (IBCAST)*, pp. 687–691, IEEE, Islamabad, Pakistan, January 2019.
>    7. I. You and K. Yim, “Malware obfuscation techniques: a brief
>    survey,” in *Proceedings of the 2010 International Conference on
>    Broadband, Wireless Computing, Communication and Applications*, pp.
>    297–300, IEEE, Fukuoka, Japan, November 2010.
>    8. B. Kolosnjaji, A. Demontis, B. Biggio et al., “Adversarial malware
>    binaries: evading deep learning for malware detection in executables,”
>    in *Proceedings of the 2018 26th European Signal Processing Conference
>    (EUSIPCO)*, pp. 533–537, IEEE, Rome, Italy, September 2018.
>    9. D. Carlin, P. O’Kane, and S. Sezer, “A cost analysis of machine
>    learning using dynamic runtime opcodes for malware detection,” *Computers
>    & Security*, vol. 85, pp. 138–155, 2019.
>    10. T. Shibahara, T. Yagi, M. Akiyama, D. Chiba, and T. Yada,
>    “Efficient dynamic malware analysis based on network behavior using deep
>    learning,” in *Proceedings of the 2016 IEEE Global Communications
>    Conference (GLOBECOM)*, pp. 1–7, IEEE, Washington, DC, USA, December
>    2016.
>    11. M. Rhode, P. Burnap, and K. Jones, “Early-stage malware prediction
>    using recurrent neural networks,” *Computers & Security*, vol. 77, pp.
>    578–594, 2018.
>    12. J. Saxe and K. Berlin, “Deep neural network based malware
>    detection using two dimensional binary program features,” in *Proceedings
>    of the 2015 10th International Conference on Malicious and Unwanted
>    Software (MALWARE)*, pp. 11–20, IEEE, Fajardo, PR, USA, October 2015.
>    13. F. Pendlebury, F. Pierazzi, R. Jordaney, J. Kinder, and L.
>    Cavallaro, “TESSERACT: eliminating experimental bias in malware
>    classification across space and time,” in *Proceedings of the 28th
>    USENIX Security Symposium (USENIX Security 19), USENIX Association*,
>    pp. 729–746, Santa Clara, CA, USA, September 2019.
>    14. K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. McDaniel,
>    “Adversarial perturbations against deep neural networks for malware
>    classification,” 2016, arXiv preprint arXiv:1606.04435.
>    15. H. Sayadi, N. Patel, S. M. Pd, A. Sasan, S. Rafatirad, and H.
>    Homayoun, “Ensemble learning for effective run-time hardware-based malware
>    detection: a comprehensive analysis and classification,” in *Proceedings
>    of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)*,
>    pp. 1–6, IEEE, San Francisco, CA, USA, June 2018.
>    16. N. Usman, S. Usman, F. Khan et al., “Intelligent dynamic malware
>    detection using machine learning in ip reputation for forensics data
>    analytics,” *Future Generation Computer Systems*, vol. 118, pp.
>    124–141, 2021.
>    17. P. Burnap, R. French, F. Turner, and K. Jones, “Malware
>    classification using self organising feature maps and machine activity
>    data,” *Computers & Security*, vol. 73, pp. 399–410, 2018.
>    18. M. Rhode, L. Tuson, P. Burnap, and K. Jones, “Lab to soc: robust
>    features for dynamic malware detection,” in *Proceedings of the 2019
>    49th Annual IEEE/IFIP International Conference on Dependable Systems and
>    Networks–Industry Track*, pp. 13–16, IEEE, Portland, OR, USA, June
>    2019.
>    19. H. Sayadi, A. Houmansadr, S. Rafatirad, H. Homayoun, and P. D. Sai
>    Manoj, “Comprehensive assessment of run-time hardware-supported malware
>    detection using general and ensemble learning,” in *Proceedings of the
>    15th ACM International Conference on Computing Frontiers*, pp.
>    212–215, ACM, Ischia, Italy, May 2018.
>    20. S. Das, Y. Liu, W. Zhang, and M. Chandramohan, “Semantics-based
>    online malware detection: towards efficient real-time protection against
>    malware,” *IEEE Transactions on Information Forensics and Security*,
>    vol. 11, no. 2, pp. 289–302, 2016.
>    21. M. Ozsoy, C. Donovick, I. Gorelik, N. Abu-Ghazaleh, and D.
>    Ponomarev, “Malware-aware processors: a framework for efficient online
>    malware detection,” in *Proceedings of the 2015 IEEE 21st
>    International Symposium on High Performance Computer Architecture (HPCA)*,
>    pp. 651–661, IEEE, Burlingame, CA, USA, February 2015.
>    22. X. Yuan, “Phd forum: deep learning-based real-time malware
>    detection with multi-stage analysis,” in *Proceedings of the 2017 IEEE
>    International Conference on Smart Computing (SMARTCOMP)*, pp. 1-2,
>    IEEE, Hong Kong, China, May 2017.
>    23. R. Sun, X. Yuan, P. He et al., “Learning fast and slow:
>    PROPEDEUTICA for real-time malware detection,” *CoRR*, vol. abs/1712,
>    Article ID 01145, 2017.
>    24. N. Scaife, H. Carter, P. Traynor, and K. R. Butler, “Cryptolock
>    (and drop it): stopping ransomware attacks on user data,” in *Proceedings
>    of the 2016 IEEE 36th International Conference on Distributed Computing
>    Systems (ICDCS)*, pp. 303–312, IEEE, Nara, Japan, June 2016.
>    25. GlobalStats, “Market share of windows operating system versions,”
>    2018.
>    26. J. Benda, A. Longtin, and L. Maler, “Spike-frequency adaptation
>    separates transient communication signals from background oscillations,”
>    *Journal of Neuroscience*, vol. 25, no. 9, pp. 2312–2321, 2005.
>    27. H.-D. Huang, C.-S. Lee, H.-Y. Kao, Y.-L. Tsai, and J.-G. Chang,
>    “Malware behavioral analysis system: Twman,” in *Proceedings of the
>    2011 IEEE Symposium on Intelligent Agent (IA)*, pp. 1–8, IEEE, Paris,
>    France, April 2011.
>    28. T. Kim, B. Kang, and E. G. Im, “Runtime detection framework for
>    android malware,” *Mobile Information Systems*, vol. 2018, 2018.
>    29. R. S. Sutton and A. G. Barto, *Introduction to Reinforcement
>    Learning*, vol. 135, MIT Press Cambridge, Cambridge, MA, USA, 1998.
>    30. C. J. C. H. Watkins, “Learning from delayed rewards,” King’s
>    University, London, UK, 1989, PhD thesis.
>    31. C. J. Watkins and P. Dayan, “Q-learning,” *Machine Learning*, vol.
>    8, no. 3-4, pp. 279–292, 1992.
>    32. R. S. Sutton, “Integrated architectures for learning, planning,
>    and reacting based on approximating dynamic programming,” *Machine
>    Learning Proceedings 1990*, vol. 1990, pp. 216–224, 1990.
>    33. L.-J. Lin, “Self-improving reactive agents based on reinforcement
>    learning, planning and teaching,” *Machine Learning*, vol. 8, no. 3-4,
>    pp. 293–321, 1992.
>    34. V. Mnih, K. Kavukcuoglu, D. Silver et al., “Playing atari with
>    deep reinforcement learning,” 2013, arXiv preprint arXiv:1312.5602.
>    35. V. Mnih, A. P. Badia, M. Mirza et al., “Asynchronous methods for
>    deep reinforcement learning,” in *Proceedings of the International
>    Conference on Machine Learning*, pp. 1928–1937, PMLR, New York City,
>    NY, USA, June 2016.
>    36. J. Kirkpatrick, R. Pascanu, N. Rabinowitz et al., “Overcoming
>    catastrophic forgetting in neural networks,” *Proceedings of the
>    National Academy of Sciences*, vol. 114, no. 13, pp. 3521–3526, 2017.
>    37. sophos, 2020.
>    38. A. Continella, A. Guagnelli, G. Zingaro et al., “Shieldfs: a
>    self-healing, ransomware-aware filesystem,” in *Proceedings of the
>    32nd Annual Conference on Computer Security Applications*, pp.
>    336–347, ACM, Los Angeles, CA, USA, December 2016.
>    39. P. S. Foundation, “Psutil python library,” 2017.
>    40. C. Guarnieri, A. Tanasi, J. Bremer, and M. Schloesser, “The cuckoo
>    sandbox,” 2012.
>    41., “Amazon laptops,” 2018.
>    42. B. Quintero, E. Martínez, V. Manuel Álvarezv, K. Hiramoto, J.
>    Canto, and A. Bermúdez, “Virustotal,” 2004.
>    43., “,” 2017.
>    44. M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero, “Avclass: a
>    tool for massive malware labeling,” in *Proceedings of the
>    International Symposium on Research in Attacks, Intrusions, and Defenses*,
>    pp. 230–253, Springer, New York, NY, USA, September 2016.
>    45. SoftAntenna, “Msys2 installer,” 2019.
>    46. A. Paszke, S. Gross, S. Chintala et al., in *Automatic
>    Differentiation in Pytorch*, NIPS-W, Long Beach, CA, USA, 2017.
>    47. F. Pedregosa, G. Varoquaux, A. Gramfort et al., “Scikit-learn:
>    machine learning in Python,” *Journal of Machine Learning Research*,
>    vol. 12, pp. 2825–2830, 2011.
>    48. D. P. Kingma and J. Ba, “Adam: a method for stochastic
>    optimization,” *CoRR*, vol. abs/1412, p. 6980, 2014.
>    49. C. Belgaid, A. d’Azémar, G. Fieni, and R. Rouvoy, “Pyrapl,” 2019,
>    Software version.
>    50. M. Corporation, “Lock pages in memory,” 2017.
>    51. M. Alazab, S. P. Rm, M. Parimala, P. Reddy, T. R. Gadekallu, and
>    Q.-V. Pham, “Federated learning for cybersecurity: concepts, challenges and
>    future directions,” *IEEE Transactions on Industrial Informatics*,
>    2021.
> Copyright
> Copyright © 2021 Matilda Rhode et al. This is an open access article
> distributed under the Creative Commons Attribution License, which permits
> unrestricted
> use, distribution, and reproduction in any medium, provided the original
> work is properly cited.