Machine learning for malware detection
Machine learning is a subfield of computer science that aims to give computers the ability to learn from data instead of being explicitly programmed. It leverages the petabytes of data that exist on the internet today to make decisions and carry out tasks that are either impossible or simply too complicated and time-consuming for humans.
Malware is one of the most imminent threats that companies and users face every day. Whether it is a phishing email or an exploit delivered through the browser, coupled with multiple evasion techniques and other security vulnerabilities, it is now clear that traditional defense systems struggle to keep up. Frameworks such as Veil, Shellter, and others are used by professionals during penetration testing engagements and are known to be quite effective.
Today I am going to show you that machine learning can indeed be used to detect malware without relying on signature detection or behavioral analysis.
P.S.: Many products such as CylanceProtect, SentinelOne, and Carbon Black are known to leverage these capabilities. The framework we are going to develop throughout this session is in no way capable of doing what these products do, and I will explain shortly why.
Machine learning: a brief introduction
Machine learning mixes many domains of mathematics, mainly statistics, probability, and linear algebra, with computation (algorithms, data processing, numerical calculations) to gain insight from data. It is used to detect fraud and spam and to recommend movies, meals, and products to buy; Amazon, Facebook, and Google are a few of the hundreds of companies that use machine learning to improve their products.
Machine learning can be split into two major methods: supervised learning and unsupervised learning. The first means that the data we work with is labeled, the second that it is unlabeled. Malware detection can be attacked with both methods, but we will focus on the first one, since our goal is to classify files.
Classification is a subdomain of supervised learning. It can be either binary (malware / not malware) or multi-class (cat / dog / pig / llama...), so malware detection falls under binary classification.
Explaining machine learning in depth is beyond this article, and nowadays you can find a large amount of resources to learn more about it; check the Sources section at the end for some of them.
The problem set
Machine learning works by defining a problem, collecting the data, processing the data to make it usable, and then feeding it to the algorithms. These minimal steps are called the machine learning workflow, and the extensive resources they require are what makes machine learning hard to apply to everything.
In our case let's define our workflow:
- First, we need to collect malware samples and clean samples. We cannot work with fewer than 10k samples of each, and it is advisable to use even more.
- We need to extract meaningful features from our samples; these features will be the basis of our study. Features are what describe something. For example, the features of a house are:
- number of rooms
- square footage of the house
- price
- After extracting these features, we need to process all our samples and build a dataset; it can be a database file or a CSV file. This makes it easier to turn the data into vectors, since the algorithms work by performing computations on vectors.
- Lastly, we need metrics. In binary classification there is a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, confusion matrix...). We will use a confusion matrix, since it represents the rates of true positives and true negatives as well as false positives and false negatives.
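To make the feature idea concrete, here is a toy sketch (the house example above, with made-up numbers) of how described objects become the numeric vectors the algorithms consume:

import numpy as np

# each sample becomes a fixed-length vector: [number of rooms, square footage, price]
house_a = np.array([3, 1500, 250000])
house_b = np.array([5, 2600, 410000])

# a dataset is just a matrix: one row per sample, one column per feature
X = np.vstack([house_a, house_b])
print(X.shape)  # (2, 3): 2 samples, 3 features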
Collecting samples and feature extraction
I assume the reader knows about the PE file format; if you do not, you can read about it here. Collecting samples is quite easy: you can either use a paid service like VirusTotal or one of the links here.
Okay, let's start by discussing our model.
For our algorithm to learn from the data we feed it, we need to make that data understandable and clean. In our case, we will use 12 features to teach our algorithm; these features will be extracted from each binary and organized into a CSV file.
Feature extraction
To extract features, we will be using pefile. The first step is to install pefile; I assume you know some Python and how to use pip.
From your terminal run:
pip install pefile
Now that you have the necessary tools, let's write some code. But first, let's discuss what kind of information we want to extract. We are interested in the following fields of a PE file:
- Major Image Version: used to indicate the major version number of the application; for Microsoft Excel version 4.0, it would be 4
- Virtual Address and Size of the IMAGE_DATA_DIRECTORY
- OS Version
- Import Address Table Address
- Resources Size
- Number of Sections
- Linker Version
- Size of Stack Reserve
- DLL Characteristics
- Export Table Size and Address
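Most of these fields live in the optional header's DATA_DIRECTORY array and are addressed by numeric index. pefile ships a lookup table (pefile.DIRECTORY_ENTRY) mapping the PE spec's entry names to those indices, so you can double-check the magic numbers (0, 2, 6, 12) used in the class below instead of memorizing them; a quick sketch:

import pefile

print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_EXPORT'])    # 0, export table
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_RESOURCE'])  # 2, resources
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_DEBUG'])     # 6, debug directory
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_IAT'])       # 12, import address table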
To keep our code organized, let's start by creating a class that represents the information of a PE file as one object:
import os
import pefile

class PEFile:
    """
    This class is constructed by parsing a PE file for the interesting features.
    Each PE file is an object by itself, and we extract the needed information
    into a dictionary.
    """
    def __init__(self, filename):
        self.pe = pefile.PE(filename, fast_load=True)
        self.filename = filename
        self.DebugSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[6].Size
        self.DebugRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[6].VirtualAddress
        self.ImageVersion = self.pe.OPTIONAL_HEADER.MajorImageVersion
        self.OSVersion = self.pe.OPTIONAL_HEADER.MajorOperatingSystemVersion
        self.ExportRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[0].VirtualAddress
        self.ExportSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[0].Size
        self.IATRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[12].VirtualAddress
        self.ResSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[2].Size
        self.LinkerVersion = self.pe.OPTIONAL_HEADER.MajorLinkerVersion
        self.NumberOfSections = self.pe.FILE_HEADER.NumberOfSections
        self.StackReserveSize = self.pe.OPTIONAL_HEADER.SizeOfStackReserve
        self.Dll = self.pe.OPTIONAL_HEADER.DllCharacteristics
Now we move on to writing a small method that constructs a dictionary for each PE file. Each sample will thus be represented as a Python dictionary where the keys are the features and the values are the parsed fields.
    def Construct(self):
        sample = {}
        # keep every parsed attribute except the raw pefile object itself
        for attr, k in self.__dict__.items():
            if attr != "pe":
                sample[attr] = k
        return sample
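As a quick sanity check on a single binary (the path here is purely hypothetical):

pe = PEFile("samples/putty.exe")  # hypothetical sample path
features = pe.Construct()
# features is a dict mapping each feature name (DebugSize, IATRVA, ...) to its parsed value
print(features)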
Since we can now parse one file, let's write a script that loops through all samples in a folder, processes each one of them, and then dumps all those dictionaries into one CSV file that we will use.
def pe2vec(direct):
    """
    Dirty function (it swallows all exceptions): for each sample in the
    directory `direct` it constructs a dictionary of dictionaries in the format:
    sample x : pe information
    """
    dataset = {}
    for subdir, dirs, files in os.walk(direct):
        for f in files:
            file_path = os.path.join(subdir, f)
            try:
                pe = PEFile(file_path)
                dataset[str(f)] = pe.Construct()
            except Exception as e:
                print(e)
    return dataset
# now that we have a dictionary let's put it in a clean csv file
import pandas as pd

def vec2csv(dataset):
    df = pd.DataFrame(dataset)
    infected = df.transpose()  # transpose to have the features as columns and samples as rows
    # utf-8 is preferred
    infected.to_csv('dataset.csv', sep=',', encoding='utf-8')
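Putting the two together, a minimal driver would look like this (the folder name is hypothetical):

# walk the samples folder, extract features from every binary, dump them to dataset.csv
dataset = pe2vec('samples/')  # hypothetical folder holding the binaries
vec2csv(dataset)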
Okay, now we are ready to process some data. I advise you to use the code from my GitHub.
Exploring the data
This step is not strictly needed, but it can be quite an eye-opening experience and gives a more intuitive idea about the data as a whole.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

malicious = pd.read_csv("bucket-set.csv")
clean = pd.read_csv("clean-set.csv")
In [3]:
print("Clean Files Statistics")
clean.describe()

Clean Files Statistics
Out[3]:
                         count          mean           std           min           25%           50%           75%            max
DebugRVA          2.467000e+03  1.009835e+05  5.217597e+05  0.000000e+00  4.416000e+03  4.816000e+03  2.099400e+04   1.769935e+07
DebugSize          2467.000000     33.970004     14.873702      0.000000     28.000000     28.000000     56.000000      84.000000
Dll                2467.000000   6305.958654  12392.766981      0.000000    320.000000    320.000000   1344.000000   49472.000000
ExportRVA         2.467000e+03  1.473796e+05  5.148365e+05  0.000000e+00  4.304000e+03  1.472000e+04  8.676000e+04   1.019821e+07
ExportSize         2467.000000   1619.046210   9275.796269      0.000000     74.000000    147.000000    287.000000  205292.000000
IATRVA            2.467000e+03  4.863884e+04  4.835382e+05  0.000000e+00  4.096000e+03  4.096000e+03  4.096000e+03   1.786675e+07
ImageVersion       2467.000000    302.233077   2484.761684      0.000000      6.000000      6.000000      6.000000   21315.000000
LinkerVersion      2467.000000      9.051885      0.651705      2.000000      9.000000      9.000000      9.000000      14.000000
NumberOfSections   2467.000000      3.978111      1.165679      1.000000      4.000000      4.000000      4.000000      22.000000
OSVersion          2467.000000      5.942440      0.390389      0.000000      6.000000      6.000000      6.000000      10.000000
ResSize           2.467000e+03  1.690548e+05  9.364935e+05  9.040000e+02  1.056000e+03  2.040000e+03  2.190800e+04   2.026722e+07
StackReserveSize  2.467000e+03  3.025229e+05  1.871939e+05  2.621440e+05  2.621440e+05  2.621440e+05  2.621440e+05   4.194304e+06
clean                   2467.0           1.0           0.0           1.0           1.0           1.0           1.0            1.0
In [4]:
print("Malicious Files Statistics")
malicious.describe()

Malicious Files Statistics
Out[4]:
                         count          mean           std           min           25%           50%           75%            max
DebugRVA           2004.000000  15453.085828  50630.027056      0.000000      0.000000      0.000000      0.000000  396224.000000
DebugSize          2004.000000      5.182136     12.926161      0.000000      0.000000      0.000000      0.000000     213.000000
Dll                2004.000000  16616.363772  16693.869293      0.000000      0.000000   1024.000000  33088.000000   59669.000000
ExportRVA         2.004000e+03  1.933029e+04  2.049653e+05  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   8.273884e+06
ExportSize        2.004000e+03  3.183463e+05  1.283018e+07  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   5.704256e+08
IATRVA            2.004000e+03  6.372132e+04  9.307602e+04  0.000000e+00  8.192000e+03  2.867200e+04  1.187840e+05   1.327168e+06
ImageVersion       2004.000000     19.202096    755.237241      0.000000      0.000000      0.000000      5.000000   33795.000000
LinkerVersion      2004.000000      7.705589      8.081842      0.000000      6.000000      7.000000      9.000000     248.000000
NumberOfSections   2004.000000      4.477545      1.524306      2.000000      3.000000      4.000000      5.000000      18.000000
OSVersion          2004.000000     36.024451   1225.262134      1.000000      4.000000      4.000000      5.000000   54034.000000
ResSize           2.004000e+03  4.882199e+04  7.545737e+05  0.000000e+00  1.104000e+03  2.880000e+03  3.173800e+04   3.356242e+07
StackReserveSize  2.004000e+03  1.078599e+06  1.011342e+06  0.000000e+00  1.048576e+06  1.048576e+06  1.048576e+06   3.355443e+07
We can see discrepancies between the two sets, especially in the first two features. Let's plot some of these features to get a visual idea of those differences.
In [6]:
# let's plot
# first, let's label our dataframes
malicious['clean'] = 0
clean['clean'] = 1
import seaborn
%matplotlib inline
fig, ax = plt.subplots()
x = malicious['IATRVA']
y = malicious['clean']
ax.scatter(x, y, color='r', label='Malicious')
x1 = clean['IATRVA']
y1 = clean['clean']
ax.scatter(x1, y1, color='b', label='Cleanfiles')
ax.legend(loc="right")
Out[6]:
<matplotlib.legend.Legend at 0x7f7f1e5f83d0>
We can notice the "clustering" of the malicious samples in a tight centroid, while the clean files are sparse along the x-axis. Let's now plot other features as well, to get an overall understanding of what we have here.
In [13]:
%matplotlib inline
fig,ax = plt.subplots()
x = malicious['DebugRVA']
y = malicious['clean']
ax.scatter(x,y,color='r',label='Malicious')
x1 = clean['DebugRVA']
y1 = clean['clean']
ax.scatter(x1,y1,color='b',label='Cleanfiles')
ax.legend(loc="right")
Out[13]:
<matplotlib.legend.Legend at 0x7f7f1f570390>
In [14]:
%matplotlib inline
fig,ax = plt.subplots()
x = malicious['ExportSize']
y = malicious['clean']
ax.scatter(x,y,color='r',label='Malicious')
x1 = clean['ExportSize']
y1 = clean['clean']
ax.scatter(x1,y1,color='b',label='Cleanfiles')
ax.legend(loc="right")
Out[14]:
<matplotlib.legend.Legend at 0x7f7f1b402190>
The more we plot and analyze the data, the more we understand it and get a sense of the overall distribution. Of course, a problem arises: what do you do with a high-dimensional dataset? What we have here is fairly low-dimensional, but many techniques can be used to reduce the dimensions to the more "important" features; algorithms like PCA and t-SNE can be used to visualize the data in 3D or even 2D plots.
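As a taste of that, here is a minimal sketch (reusing the labeled malicious and clean dataframes from above) that projects our 12 features onto two principal components:

from sklearn.decomposition import PCA

# stack both labeled sets and keep only the numeric feature columns
df = pd.concat([malicious, clean], ignore_index=True)
features = df.select_dtypes(include=[np.number]).drop('clean', axis=1)

# project the 12 features down to 2 principal components
# (scaling the features first with StandardScaler usually gives a cleaner picture)
proj = PCA(n_components=2).fit_transform(features)
plt.scatter(proj[:, 0], proj[:, 1], c=df['clean'], cmap='coolwarm', s=4)
plt.xlabel('PC 1')
plt.ylabel('PC 2')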
Machine learning application
Enough statistics, let's do some work. Until now we have not done any actual machine learning; what we did is part of the overall workflow: we took some data, cleaned it, and prepared it. To start experimenting with machine learning, we have to do a few more things:
- First, we need to merge our datasets (malicious and clean) into one DataFrame (see the sketch below)
- We need to split our DataFrame into two parts: one for training and one held out for testing
- We will then proceed to apply a few algorithms and see what happens
Dataset preparation
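The malware-dataset.csv file read in the next cell was produced offline by merging the two sets; a minimal sketch of that merge step (reusing the labeled dataframes from the plotting section) could be:

# merge the two labeled dataframes into the single file used below
dataset = pd.concat([malicious, clean], ignore_index=True)
dataset.to_csv('malware-dataset.csv', index=False)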
In [22]:
dataset = pd.read_csv('malware-dataset.csv')
"""
At this point dataset holds our data.
Great, let's split it into train/test sets and fix a random seed to keep our predictions constant
"""
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# let's import the two algorithms we would like to test
# neural networks
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
# random forests
from sklearn.ensemble import RandomForestClassifier
"""
Let's prepare our data
"""
state = np.random.randint(100)
X = dataset.drop('clean', axis=1)
y = dataset['clean']
X = np.asarray(X)
y = np.asarray(y)
X = X[:, 1:]  # drop the first column, which holds the sample's filename
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
Now we have four matrices, and quite big ones. X_train and y_train will be used to train our different classifiers, X_test will be used to predict labels, and y_test will be used for metrics; in fact, we are going to compare the predictions on X_test against y_test to see how we performed. We start with random forests, an ensemble version of decision trees: they work by creating many decision trees at training time and outputting the class that is the mode of the individual trees' classes. They perform quite well on binary classification problems.
In [25]:
# let's start with random forests
# we initiate the classifier
clf1 = RandomForestClassifier()
# training
clf1.fit(X_train, y_train)
# predicting labels for X_test
y_pred = clf1.predict(X_test)
# metrics evaluation
"""
tn = True Negative: a correct prediction, clean predicted as clean
fp = False Positive: a false alarm, clean predicted as malicious
tp = True Positive: a correct prediction, malicious predicted as malicious
fn = False Negative: a malicious file predicted as clean
"""
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN = ", tn)
print("TP = ", tp)
print("FP = ", fp)
print("FN = ", fn)
TN = 697
TP = 745
FP = 6
FN = 4
Notice anything? Only 6 false positives and 4 false negatives, with no parameter tuning and no modifications, is quite good. We were able to correctly detect 697 clean files and 745 malicious ones. It seems our small antivirus is working :D
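If you want single summary numbers, the usual rates fall straight out of these four counts; a quick sketch using the run above:

# summary metrics derived from the confusion matrix above
accuracy = (tp + tn) / float(tp + tn + fp + fn)  # fraction of all files classified correctly
precision = tp / float(tp + fp)                  # of the files flagged malicious, how many really were
recall = tp / float(tp + fn)                     # of the truly malicious files, how many we caught
print("accuracy = %.4f, precision = %.4f, recall = %.4f" % (accuracy, precision, recall))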
Let's try another classifier this time: we will build a simple neural network and test it on another randomized split.
According to Wikipedia
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training the network. MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable.
A multilayer perceptron is the generalized version of the perceptron, the basic model of the neuron; perceptrons are the fundamental building blocks of deep learning methods, where we meet larger and deeper networks.
In [26]:
# our usual split, this time 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# feature scaling: we transform our data to the same scale for better predictions
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# here we build a multilayer perceptron with six hidden layers of 12 neurons each
# (one neuron per feature); you can use more, but it will turn into a complex zoo
mlp = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12, 12, 12))
# training the MLP on our data
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)
# evaluating our classifier
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("TN = ", tn)
print("TP = ", tp)
print("FP = ", fp)
print("FN = ", fn)
TN = 695
TP = 731
FP = 8
FN = 18
The almighty neural network failed to detect eighteen threats, and worse, it classified them as clean files, which is a very bad failure mode; imagine your antivirus labeling a ransomware as a clean file. This sounds like AV evasion against AI, but let's not be pessimistic: our neural network is very primitive. We can actually make it more accurate, but that is beyond the scope of this article.
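For the curious, one possible starting point for that tuning is scikit-learn's grid search; the parameter grid below is purely illustrative:

from sklearn.model_selection import GridSearchCV

# try a few architectures and regularization strengths with 3-fold cross-validation
params = {'hidden_layer_sizes': [(12, 12), (64, 32), (128, 64, 32)],
          'alpha': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(MLPClassifier(max_iter=500), params, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)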
Conclusion
This is just the beginning. I wanted to show that malware classification is indeed a solvable problem if we accept 99% as a good accuracy rate. Of course, building and deploying something like this in reality is time-consuming and requires more knowledge and more data. This was merely a preview of the infinite possibilities that machine learning, and AI in general, offer us. I hope this was educational, fun, and insightful.
Sources
- Machine Learning course by Andrew Ng
- A course that will make you a deep learning practitioner in seven weeks; the only requirement is Python
- The Elements of Statistical Learning (Hastie et al.), a more theoretical but quite insightful book
- Selecting features to classify malware