Machine learning for malware detection
Machine learning is a subfield of computer science that aims to give computers the ability to learn from data instead of being explicitly programmed. It leverages the petabytes of data that exist on the internet today to make decisions and carry out tasks that are either impossible or simply too complicated and time-consuming for humans.
Malware is one of the most imminent threats that companies and users face every day. Whether it is a phishing email or an exploit delivered through the browser, coupled with multiple evasion techniques and other security vulnerabilities, it is now clear that traditional defense systems struggle to keep up. Frameworks such as Veil, Shellter, and others are used by professionals during penetration testing engagements and are known to be quite effective.
Today I am going to show you that machine learning can indeed be used to detect malware without relying on signature detection or behavioral analysis.
P.S.: Many products such as CylanceProtect, SentinelOne, and Carbon Black are known to leverage these capabilities. The framework we are going to develop throughout this session is in no way capable of doing what these products do, and I will explain shortly why.
Machine learning: a brief introduction
Machine learning mixes many domains of mathematics, mainly statistics, probability, and linear algebra, with computation (algorithms, data processing, numerical calculations) to gain insight from data. It is used to detect fraud and spam and to recommend movies, meals, and products to buy; Amazon, Facebook, and Google are a few of the hundreds of companies that use machine learning to improve their products.
Machine learning can be split into two major methods: supervised learning and unsupervised learning. The first means that the data we work with is labeled, the second that it is unlabeled. Malware detection can be attacked with both methods, but we will focus on the first one, since our goal is to classify files.
Classification is a subdomain of supervised learning. It can be either binary (malware / not malware) or multi-class (cat / dog / pig / llama...), so malware detection falls under binary classification.
Explaining machine learning in depth is beyond this article, and nowadays you can find a large amount of resources to learn more about it; check the Sources section at the end for some of them.
The problem set
Machine learning works by defining a problem, collecting the data, processing the data to make it usable, and then feeding it to the algorithms. These minimal steps are called the machine learning workflow, and the extensive resources they require are what makes machine learning hard to apply to everything.
In our case let's define our workflow:
- First, we need to collect malware samples and clean samples. We cannot work with fewer than 10k samples of each, and it is advisable to use even more.
- We need to extract meaningful features from our samples; these features will be the basis of our study. Features are what describe something. For example, the features of a house are:
- number of rooms
- square footage of the house
- price
- After extracting these features, we need to process all our samples and build a dataset; it can be a database file or a CSV file. This makes it easier to turn the data into vectors, since the algorithms work by performing computations on vectors.
- Lastly, we need metrics. In binary classification there is a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, confusion matrix...). We will use a confusion matrix, since it represents the rates of true positives and true negatives as well as false positives and false negatives.
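To make the feature idea concrete, here is a toy sketch (the house example above, with made-up numbers) of how described objects become the numeric vectors the algorithms consume:

import numpy as np

# each sample becomes a fixed-length vector: [number of rooms, square footage, price]
house_a = np.array([3, 1500, 250000])
house_b = np.array([5, 2600, 410000])

# a dataset is just a matrix: one row per sample, one column per feature
X = np.vstack([house_a, house_b])
print(X.shape)  # (2, 3): 2 samples, 3 features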
Collecting samples and feature extraction
I assume the reader knows about the PE file format; if you do not, you can read about it here. Collecting samples is quite easy: you can either use a paid service like VirusTotal or one of the links here.
Okay, let's start by discussing our model.
For our algorithm to learn from the data we feed it, we need to make that data understandable and clean. In our case, we will use 12 features to teach our algorithm; these features will be extracted from each binary and organized into a CSV file.
Feature extraction
To extract features, we will be using pefile. The first step is to install pefile; I assume you know some Python and how to use pip.
From your terminal run:
pip install pefile
Now that you have the necessary tools, let's write some code. But first, let's discuss what kind of information we want to extract. We are interested in the following fields of a PE file:
- Major Image Version: used to indicate the major version number of the application; for Microsoft Excel version 4.0, it would be 4
- Virtual Address and Size of the IMAGE_DATA_DIRECTORY
- OS Version
- Import Address Table Address
- Resources Size
- Number of Sections
- Linker Version
- Size of Stack Reserve
- DLL Characteristics
- Export Table Size and Address
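Most of these fields live in the optional header's DATA_DIRECTORY array and are addressed by numeric index. pefile ships a lookup table (pefile.DIRECTORY_ENTRY) mapping the PE spec's entry names to those indices, so you can double-check the magic numbers (0, 2, 6, 12) used in the class below instead of memorizing them; a quick sketch:

import pefile

print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_EXPORT'])    # 0, export table
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_RESOURCE'])  # 2, resources
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_DEBUG'])     # 6, debug directory
print(pefile.DIRECTORY_ENTRY['IMAGE_DIRECTORY_ENTRY_IAT'])       # 12, import address table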
To keep our code organized, let's start by creating a class that represents the information of a PE file as one object:
import os
import pefile

class PEFile:
    """
    This class is constructed by parsing a PE file for the interesting features.
    Each PE file is an object by itself, and we extract the needed information
    into a dictionary.
    """
    def __init__(self, filename):
        self.pe = pefile.PE(filename, fast_load=True)
        self.filename = filename
        self.DebugSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[6].Size
        self.DebugRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[6].VirtualAddress
        self.ImageVersion = self.pe.OPTIONAL_HEADER.MajorImageVersion
        self.OSVersion = self.pe.OPTIONAL_HEADER.MajorOperatingSystemVersion
        self.ExportRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[0].VirtualAddress
        self.ExportSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[0].Size
        self.IATRVA = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[12].VirtualAddress
        self.ResSize = self.pe.OPTIONAL_HEADER.DATA_DIRECTORY[2].Size
        self.LinkerVersion = self.pe.OPTIONAL_HEADER.MajorLinkerVersion
        self.NumberOfSections = self.pe.FILE_HEADER.NumberOfSections
        self.StackReserveSize = self.pe.OPTIONAL_HEADER.SizeOfStackReserve
        self.Dll = self.pe.OPTIONAL_HEADER.DllCharacteristics
Now we move on to writing a small method that constructs a dictionary for each PE file. Each sample will thus be represented as a Python dictionary where the keys are the features and the values are the parsed fields.
    def Construct(self):
        sample = {}
        # keep every parsed attribute except the raw pefile object itself
        for attr, k in self.__dict__.items():
            if attr != "pe":
                sample[attr] = k
        return sample
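As a quick sanity check on a single binary (the path here is purely hypothetical):

pe = PEFile("samples/putty.exe")  # hypothetical sample path
features = pe.Construct()
# features is a dict mapping each feature name (DebugSize, IATRVA, ...) to its parsed value
print(features)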
Since we can now parse one file, let's write a script that loops through all samples in a folder, processes each one of them, and then dumps all those dictionaries into one CSV file that we will use.
def pe2vec(direct):
    """
    Dirty function (it swallows all exceptions): for each sample in the
    directory `direct` it constructs a dictionary of dictionaries in the format:
    sample x : pe information
    """
    dataset = {}
    for subdir, dirs, files in os.walk(direct):
        for f in files:
            file_path = os.path.join(subdir, f)
            try:
                pe = PEFile(file_path)
                dataset[str(f)] = pe.Construct()
            except Exception as e:
                print(e)
    return dataset
# now that we have a dictionary let's put it in a clean csv file
import pandas as pd

def vec2csv(dataset):
    df = pd.DataFrame(dataset)
    infected = df.transpose()  # transpose to have the features as columns and samples as rows
    # utf-8 is preferred
    infected.to_csv('dataset.csv', sep=',', encoding='utf-8')
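Putting the two together, a minimal driver would look like this (the folder name is hypothetical):

# walk the samples folder, extract features from every binary, dump them to dataset.csv
dataset = pe2vec('samples/')  # hypothetical folder holding the binaries
vec2csv(dataset)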
Okay, now we are ready to process some data. I advise you to use the code from my GitHub.
Exploring the data
This step is not strictly needed, but it can be quite an eye-opening experience and gives a more intuitive idea about the data as a whole.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

malicious = pd.read_csv("bucket-set.csv")
clean = pd.read_csv("clean-set.csv")
In [3]:
print("Clean Files Statistics")
clean.describe()

Clean Files Statistics
Out[3]:
                         count          mean           std           min           25%           50%           75%            max
DebugRVA          2.467000e+03  1.009835e+05  5.217597e+05  0.000000e+00  4.416000e+03  4.816000e+03  2.099400e+04   1.769935e+07
DebugSize          2467.000000     33.970004     14.873702      0.000000     28.000000     28.000000     56.000000      84.000000
Dll                2467.000000   6305.958654  12392.766981      0.000000    320.000000    320.000000   1344.000000   49472.000000
ExportRVA         2.467000e+03  1.473796e+05  5.148365e+05  0.000000e+00  4.304000e+03  1.472000e+04  8.676000e+04   1.019821e+07
ExportSize         2467.000000   1619.046210   9275.796269      0.000000     74.000000    147.000000    287.000000  205292.000000
IATRVA            2.467000e+03  4.863884e+04  4.835382e+05  0.000000e+00  4.096000e+03  4.096000e+03  4.096000e+03   1.786675e+07
ImageVersion       2467.000000    302.233077   2484.761684      0.000000      6.000000      6.000000      6.000000   21315.000000
LinkerVersion      2467.000000      9.051885      0.651705      2.000000      9.000000      9.000000      9.000000      14.000000
NumberOfSections   2467.000000      3.978111      1.165679      1.000000      4.000000      4.000000      4.000000      22.000000
OSVersion          2467.000000      5.942440      0.390389      0.000000      6.000000      6.000000      6.000000      10.000000
ResSize           2.467000e+03  1.690548e+05  9.364935e+05  9.040000e+02  1.056000e+03  2.040000e+03  2.190800e+04   2.026722e+07
StackReserveSize  2.467000e+03  3.025229e+05  1.871939e+05  2.621440e+05  2.621440e+05  2.621440e+05  2.621440e+05   4.194304e+06
clean                   2467.0           1.0           0.0           1.0           1.0           1.0           1.0            1.0
In [4]:
print("Malicious Files Statistics")
malicious.describe()

Malicious Files Statistics
Out[4]:
                         count          mean           std           min           25%           50%           75%            max
DebugRVA           2004.000000  15453.085828  50630.027056      0.000000      0.000000      0.000000      0.000000  396224.000000
DebugSize          2004.000000      5.182136     12.926161      0.000000      0.000000      0.000000      0.000000     213.000000
Dll                2004.000000  16616.363772  16693.869293      0.000000      0.000000   1024.000000  33088.000000   59669.000000
ExportRVA         2.004000e+03  1.933029e+04  2.049653e+05  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   8.273884e+06
ExportSize        2.004000e+03  3.183463e+05  1.283018e+07  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   5.704256e+08
IATRVA            2.004000e+03  6.372132e+04  9.307602e+04  0.000000e+00  8.192000e+03  2.867200e+04  1.187840e+05   1.327168e+06
ImageVersion       2004.000000     19.202096    755.237241      0.000000      0.000000      0.000000      5.000000   33795.000000
LinkerVersion      2004.000000      7.705589      8.081842      0.000000      6.000000      7.000000      9.000000     248.000000
NumberOfSections   2004.000000      4.477545      1.524306      2.000000      3.000000      4.000000      5.000000      18.000000
OSVersion          2004.000000     36.024451   1225.262134      1.000000      4.000000      4.000000      5.000000   54034.000000
ResSize           2.004000e+03  4.882199e+04  7.545737e+05  0.000000e+00  1.104000e+03  2.880000e+03  3.173800e+04   3.356242e+07
StackReserveSize  2.004000e+03  1.078599e+06  1.011342e+06  0.000000e+00  1.048576e+06  1.048576e+06  1.048576e+06   3.355443e+07
We can see discrepancies between the two sets, especially in the first two features. Let's plot some of these features to get a visual idea of those differences.
In [6]:
# let's plot
# first, let's label our dataframes
malicious['clean'] = 0
clean['clean'] = 1
import seaborn
%matplotlib inline
fig, ax = plt.subplots()
x = malicious['IATRVA']
y = malicious['clean']
ax.scatter(x, y, color='r', label='Malicious')
x1 = clean['IATRVA']
y1 = clean['clean']
ax.scatter(x1, y1, color='b', label='Cleanfiles')
ax.legend(loc="right")
Out[6]:
<matplotlib.legend.Legend at 0x7f7f1e5f83d0>
We can notice the "clustering" of the malicious samples in a tight centroid, while the clean files are sparse along the x-axis. Let's now plot other features as well, to get an overall understanding of what we have here.
In [13]:
%matplotlib inline
fig,ax = plt.subplots()
x = malicious['DebugRVA']
y = malicious['clean']
ax.scatter(x,y,color='r',label='Malicious')
x1 = clean['DebugRVA']
y1 = clean['clean']
ax.scatter(x1,y1,color='b',label='Cleanfiles')
ax.legend(loc="right")
Out[13]:
<matplotlib.legend.Legend at 0x7f7f1f570390>
In [14]:
%matplotlib inline
fig,ax = plt.subplots()
x = malicious['ExportSize']
y = malicious['clean']
ax.scatter(x,y,color='r',label='Malicious')
x1 = clean['ExportSize']
y1 = clean['clean']
ax.scatter(x1,y1,color='b',label='Cleanfiles')
ax.legend(loc="right")
Out[14]:
<matplotlib.legend.Legend at 0x7f7f1b402190>
The more we plot and analyze the data, the more we understand it and get a sense of the overall distribution. Of course, a problem arises: what do you do with a high-dimensional dataset? What we have here is fairly low-dimensional, but many techniques can be used to reduce the dimensions to the more "important" features; algorithms like PCA and t-SNE can be used to visualize the data in 3D or even 2D plots.
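As a taste of that, here is a minimal sketch (reusing the labeled malicious and clean dataframes from above) that projects our 12 features onto two principal components:

from sklearn.decomposition import PCA

# stack both labeled sets and keep only the numeric feature columns
df = pd.concat([malicious, clean], ignore_index=True)
features = df.select_dtypes(include=[np.number]).drop('clean', axis=1)

# project the 12 features down to 2 principal components
# (scaling the features first with StandardScaler usually gives a cleaner picture)
proj = PCA(n_components=2).fit_transform(features)
plt.scatter(proj[:, 0], proj[:, 1], c=df['clean'], cmap='coolwarm', s=4)
plt.xlabel('PC 1')
plt.ylabel('PC 2')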
Machine learning application
Enough statistics, let's do some work. Until now we have not done any actual machine learning; what we did is part of the overall workflow: we took some data, cleaned it, and prepared it. To start experimenting with machine learning, we have to do a few more things:
- First, we need to merge our datasets (malicious and clean) into one DataFrame (see the sketch below)
- We need to split our DataFrame into two parts: one for training and one held out for testing
- We will then proceed to apply a few algorithms and see what happens
Dataset preparation
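The malware-dataset.csv file read in the next cell was produced offline by merging the two sets; a minimal sketch of that merge step (reusing the labeled dataframes from the plotting section) could be:

# merge the two labeled dataframes into the single file used below
dataset = pd.concat([malicious, clean], ignore_index=True)
dataset.to_csv('malware-dataset.csv', index=False)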
In [22]:
dataset = pd.read_csv('malware-dataset.csv')
"""
At this point dataset holds our data.
Great, let's split it into train/test sets and fix a random seed to keep our predictions constant
"""
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# let's import the two algorithms we would like to test
# neural networks
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
# random forests
from sklearn.ensemble import RandomForestClassifier
"""
Let's prepare our data
"""
state = np.random.randint(100)
X = dataset.drop('clean', axis=1)
y = dataset['clean']
X = np.asarray(X)
y = np.asarray(y)
X = X[:, 1:]  # drop the first column, which holds the sample's filename
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
Now we have four matrices, and quite big ones. X_train and y_train will be used to train our different classifiers, X_test will be used to predict labels, and y_test will be used for metrics; in fact, we are going to compare the predictions on X_test against y_test to see how we performed. We start with random forests, an ensemble version of decision trees: they work by creating many decision trees at training time and outputting the class that is the mode of the individual trees' classes. They perform quite well on binary classification problems.
In [25]:
# let's start with random forests
# we initiate the classifier
clf1 = RandomForestClassifier()
# training
clf1.fit(X_train, y_train)
# predicting labels for X_test
y_pred = clf1.predict(X_test)
# metrics evaluation
"""
tn = True Negative: a correct prediction, clean predicted as clean
fp = False Positive: a false alarm, clean predicted as malicious
tp = True Positive: a correct prediction, malicious predicted as malicious
fn = False Negative: a malicious file predicted as clean
"""
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TN = ", tn)
print("TP = ", tp)
print("FP = ", fp)
print("FN = ", fn)
TN = 697
TP = 745
FP = 6
FN = 4
Notice anything? Only 6 false positives and 4 false negatives, with no parameter tuning and no modifications, is quite good. We were able to correctly detect 697 clean files and 745 malicious ones. It seems our small antivirus is working :D
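If you want single summary numbers, the usual rates fall straight out of these four counts; a quick sketch using the run above:

# summary metrics derived from the confusion matrix above
accuracy = (tp + tn) / float(tp + tn + fp + fn)  # fraction of all files classified correctly
precision = tp / float(tp + fp)                  # of the files flagged malicious, how many really were
recall = tp / float(tp + fn)                     # of the truly malicious files, how many we caught
print("accuracy = %.4f, precision = %.4f, recall = %.4f" % (accuracy, precision, recall))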
Let's try another classifier this time: we will build a simple neural network and test it on another randomized split.
According to Wikipedia
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (or processing element) with a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training the network. MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable.
A multilayer perceptron is the generalized version of the perceptron, the basic model of the neuron; perceptrons are the fundamental building blocks of deep learning methods, where we meet larger and deeper networks.
In [26]:
# our usual split, this time 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# feature scaling: we transform our data to the same scale for better predictions
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# here we build a multilayer perceptron with six hidden layers of 12 neurons each
# (one neuron per feature); you can use more, but it will turn into a complex zoo
mlp = MLPClassifier(hidden_layer_sizes=(12, 12, 12, 12, 12, 12))
# training the MLP on our data
mlp.fit(X_train, y_train)
predictions = mlp.predict(X_test)
# evaluating our classifier
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("TN = ", tn)
print("TP = ", tp)
print("FP = ", fp)
print("FN = ", fn)
TN = 695
TP = 731
FP = 8
FN = 18
The almighty neural network failed to detect eighteen threats, and worse, it classified them as clean files, which is a very bad failure mode; imagine your antivirus labeling a ransomware as a clean file. This sounds like AV evasion against AI, but let's not be pessimistic: our neural network is very primitive. We can actually make it more accurate, but that is beyond the scope of this article.
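For the curious, one possible starting point for that tuning is scikit-learn's grid search; the parameter grid below is purely illustrative:

from sklearn.model_selection import GridSearchCV

# try a few architectures and regularization strengths with 3-fold cross-validation
params = {'hidden_layer_sizes': [(12, 12), (64, 32), (128, 64, 32)],
          'alpha': [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(MLPClassifier(max_iter=500), params, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)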
Conclusion
This is just the beginning. I wanted to show that malware classification is indeed a solvable problem if we accept 99% as a good accuracy rate. Of course, building and deploying something like this in reality is time-consuming and requires more knowledge and more data. This was merely a preview of the infinite possibilities that machine learning, and AI in general, offer us. I hope this was educational, fun, and insightful.
Sources
- Machine Learning course by Andrew Ng
- A course that will make you a deep learning practitioner in seven weeks; the only requirement is Python
- The Elements of Statistical Learning (Hastie et al.), a more theoretical but quite insightful book
- Selecting features to classify malware