Intrusion Detection by Different Machine Learning Techniques

Zhepei Wang, Yiqing Cai


Project Introduction

Intrusion detection is an important part of modern computer security systems. An intrusion detection system should be able to decide whether a certain user operation is abnormal, and then whether such abnormal behavior is an intrusion attempt, based on features such as the protocol used and the connection length.

Our project objective is to compare and contrast the performance (speed and accuracy) of different machine learning algorithms for detecting intrusion attempts. We also want to explore optimization options for those algorithms. The algorithms we chose for this project are: Logistic Regression, Linear SVM, rbf kernel SVM, and MLP. For each of the algorithms, we run both binary classification and multi-class classification.

Data Description

The data we used was the KDD Cup 1999 dataset, released by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) in 1999. The dataset contains 41 features and 22 types of attack. We used the 10 percent subset of the training set, for a total of about 500,000 samples, and split it into 75% training data and 25% testing data.
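As a rough illustration of this preprocessing step, the data could be loaded and split with pandas and scikit-learn as sketched below. The file name, the assumption that the last column holds the label, and the integer encoding of the three symbolic features are ours, not necessarily the original pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Load the KDD Cup 1999 "10 percent" subset; the file has no header row and the
# last column is the label ("normal." or one of the 22 attack types).
df = pd.read_csv("kddcup.data_10_percent", header=None)

X = df.iloc[:, :-1].copy()   # 41 features
y = df.iloc[:, -1]           # label

# Columns 1-3 (protocol_type, service, flag) are symbolic; integer-encode them so
# the feature matrix stays 41-dimensional. One-hot encoding is another option.
X[[1, 2, 3]] = OrdinalEncoder().fit_transform(X[[1, 2, 3]])

# 75% / 25% train-test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```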

Binary Classification

For binary classification, all we cared about was whether the action was an intrusion attempt or not. The algorithms only tell whether an action was safe; they do not decide what kind of intrusion attempt it was. Our results are mainly represented by confusion matrices. Class 0 means normal activity and class 1 means an intrusion attempt.
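A minimal sketch of how the binary labels and the confusion-matrix summaries could be produced with scikit-learn, reusing the split from the sketch above; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Collapse the 22 attack types into a single class:
# class 0 = normal activity, class 1 = intrusion attempt.
y_train_bin = (y_train != "normal.").astype(int)
y_test_bin = (y_test != "normal.").astype(int)

# After fitting a classifier, its predictions are summarised the same way each time:
# cm = confusion_matrix(y_test_bin, y_pred)
# print(classification_report(y_test_bin, y_pred, digits=4))
```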

Logistic Regression
Fig 1. - Logistic Regression, l2-norm-reg = 0.2

Linear SVM
Fig 2. - Linear SVM, C = 1.0

rbf SVM
Fig 3. - rbf Kernel SVM, C = 1.0

MLP
Fig 4. - MLP, l2-norm-reg=0.01, 3 hidden layers each with 100 perceptrons

We can see that all methods have very high accuracy. We also list other interesting results for discussion below.

Algorithm                LR                  lin-SVM    rbf-SVM    MLP
Training Accuracy        0.9859              0.9885     1.0000     0.9938
Test Accuracy            0.9858              0.9885     0.9960     0.9935
Precision                0.9984              0.9945     1.0000     0.9994
Recall                   0.9839              0.9911     0.9951     0.9925
f1-score                 0.9911              0.9928     0.9975     0.9960
Time for training (s)    6.46 (100 itrs)     34.87      2491.05    26.65 (14 itrs)

As we can see from the table, all algorithms did a great job on binary classification, which suggests that the dataset is likely to be close to linearly separable. Nonlinear classifiers perform consistently better than linear classifiers but also have much longer training times, especially rbf-SVM.
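For reference, a hedged sketch of how the four classifiers could be set up with scikit-learn. The hyperparameters are copied from the figure captions where possible; how the reported "l2-norm-reg" values map onto scikit-learn's C and alpha parameters is our assumption, as is the use of a StandardScaler.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

models = {
    "LR":      LogisticRegression(penalty="l2", max_iter=100),
    "lin-SVM": LinearSVC(C=1.0),
    "rbf-SVM": SVC(kernel="rbf", C=1.0),
    "MLP":     MLPClassifier(hidden_layer_sizes=(100, 100, 100), alpha=0.01),
}

for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)  # scaling helps the SVMs and MLP converge
    pipe.fit(X_train, y_train_bin)
    print(name, "binary test accuracy: %.4f" % pipe.score(X_test, y_test_bin))
```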

Multi-Class Classification

In multi-class classification, the algorithms try to identify the actual intrusion type instead of just telling whether the behavior is an intrusion. The accuracy in this section is calculated according to whether the algorithm returns the correct intrusion type. A confusion matrix is not as helpful for multi-class classification and is thus not provided.

Algorithm                LR                  lin-SVM              rbf-SVM    MLP
Training Accuracy        0.9862              0.9858               0.9998     0.9873
Test Accuracy            0.9872              0.9854               0.9950     0.9882
Time for training (s)    149.57 (100 itrs)   172.81 (1000 itrs)   5681.33    31.56 (11 itrs)

As we can see from the table, the algorithms behaved on multi-class classification much as they did on binary classification: high accuracy overall, with somewhat better performance from the nonlinear algorithms. Performance was not noticeably affected by the need to distinguish between different types of attacks.
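The same setup can be reused for the multi-class case by training on the raw attack labels; scikit-learn handles the multi-class reduction internally for each of these estimators. This sketch reuses the models dictionary and data split from the earlier sketches.

```python
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)                  # y_train holds the raw attack labels
    y_pred = pipe.predict(X_test)
    print(name, "multi-class test accuracy: %.4f" % accuracy_score(y_test, y_pred))
```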

PCA for Binary Classification

Because our dataset has a rather large feature space of 41 dimensions, we decided to use Principal Component Analysis (PCA) to reduce the dimensionality. The goal of PCA is to find a k-dimensional representation that preserves maximal variance. This lets us speed up the training process without losing too much accuracy.

We set our target variance capture rate to 99% and managed to reduce the dimension from 41 to 17 while retaining 99% of the variance.
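A sketch of this reduction with scikit-learn's PCA, where passing a fraction as n_components keeps just enough components to reach the target explained variance; the standardization step and variable names are our assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then keep enough principal components to capture 99% of the variance.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.99).fit(scaler.transform(X_train))

X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))

print("reduced dimension:", pca.n_components_)   # 17 in our experiments
```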

PCA
Fig 5. - PCA eigenvalue vs variance capture
Algorithm                LR                  lin-SVM    rbf-SVM    MLP
Training Accuracy        0.99922             0.9878     0.9992     0.9990
Test Accuracy            0.9729              0.9810     0.9891     0.9891
Precision                0.9814              0.9914     0.9996     0.9945
Recall                   0.9848              0.9848     0.9868     0.9918
f1-score                 0.9831              0.9881     0.9931     0.9932
Time for training (s)    3.87 (39 itrs)      39.00      12.65      42.41 (22 itrs)

From the table, we can see a couple of interesting changes and improvements. First of all, the algorithms' accuracy was affected by PCA, but the change was mostly nominal. Second, rbf-SVM still has the best accuracy, and PCA greatly reduced its training time, from about 2491 s to under 13 s. This shows promise for PCA as an optimization, and we discuss PCA's effectiveness on multi-class classification in the next section.
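As an illustration of where the speedup is measured, a simple timing sketch for rbf-SVM with and without the PCA-reduced features, reusing scaler, pca, and the binary labels from the sketches above.

```python
import time
from sklearn.svm import SVC

for tag, Xtr in {"41-dim (scaled)": scaler.transform(X_train),
                 "17-dim (PCA)":    X_train_pca}.items():
    clf = SVC(kernel="rbf", C=1.0)
    t0 = time.perf_counter()
    clf.fit(Xtr, y_train_bin)                    # fit time is what the tables report
    print(tag, "fit time: %.2f s" % (time.perf_counter() - t0))
```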

PCA on Multi-Class Classification

Algorithm                LR                  lin-SVM              rbf-SVM    MLP
Training Accuracy        0.9919              0.9888               0.9991     0.9985
Test Accuracy            0.9153              0.7643               0.9817     0.9792
Time for training (s)    152.36 (100 itrs)   590.53 (1000 itrs)   21.05      85.10 (28 itrs)

The table shows results similar to the binary classification case: slightly lower accuracy and faster training (rbf-SVM dropped from about 5681 s to 21 s). Another interesting point is that linear SVM had much lower accuracy and a much higher training time. This is likely due to the reduction in dimension: after PCA, the linear SVM was unable to separate the data efficiently.

Discussion and Future Work

We found that all four models we used yield very high accuracy, with nonlinear models yielding slightly higher accuracy than linear ones. The rbf kernel SVM had the highest accuracy but the slowest training time. We also found that PCA can help reduce training time with minimal negative effect on accuracy.

The algorithms and dataset can also be used for binary classification based on attack types (i.e., identifying a specific kind of intrusion attempt). We could also remove the labels from the dataset and use it for unsupervised learning such as clustering (e.g., k-means).

References and links

Application of Machine Learning Algorithms to KDD Intrusion Detection within Misuse Detection Context (Sabhnani & Serpen)

Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Dataset (Sabhnani & Serpen)

Intrusion detection using neural networks and support vector machines (Mukkamala et al.)

KDD Cup 1999
Scikit-learn
Pandas
For suggestions and improvements, please visit our GitHub repository.