The Machine Learning based Intrusion Detection of SPHINX Toolkit: PART III
Continued from PART II
To identify the most suitable and efficient approach for the component’s function, three AI pipelines for intrusion detection have been designed, developed and evaluated in an extensive comparative analysis that includes multiple variants of each pipeline with numerous machine-learning (ML) and deep-learning (DL) models.
Pipeline C: Unsupervised approach for anomaly detection
In contrast to the well-known classification setup, where training data is used to train a classifier and test data is then used to measure its performance, several setups are possible for anomaly detection. Which one applies depends mainly on the labels available in the dataset. Unsupervised anomaly detection is the most flexible setup, as it does not require any labels. The idea is that an unsupervised anomaly detection algorithm scores the data solely on the basis of intrinsic properties of the dataset. Typically, distances or densities are used to estimate what is normal and what is an outlier.
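As an illustration of distance-based scoring, the following minimal sketch assigns each point the mean distance to its k nearest neighbours, so that isolated points receive high scores. It is not part of the SPHINX pipeline; the function name `knn_outlier_scores`, the value of `k` and the synthetic data are assumptions made only for this example.

```python
import numpy as np

def knn_outlier_scores(X, k=3):
    """Score each row by its mean distance to its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each point's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))  # synthetic "normal" points
outlier = np.array([[8.0, 8.0]])                        # one point far from the cluster
X = np.vstack([cluster, outlier])

scores = knn_outlier_scores(X, k=3)
most_anomalous = int(np.argmax(scores))  # index of the highest-scoring point
```

No labels are used anywhere: the score depends only on how far a point lies from the rest of the data.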
To enable label-free intrusion detection, an unsupervised pipeline has also been designed, relying on the latest advances in deep learning and especially on autoencoders. The proposed pipeline is organized in the steps described below.
The NSL-KDD Train 20 data file (comprising 25,192 data points) has been used for training and the NSL-KDD Test+ data file (comprising 22,544 data points) has been used for testing. Moreover, simple data-encoding techniques have been employed to convert the textual (categorical) values into numeric representations that machine-learning algorithms can process.
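The text does not state which encoding was used, so the sketch below assumes one-hot encoding as one common option. The categories are learned from the training column only and reused for the test column; the helper names and the toy values are hypothetical.

```python
import numpy as np

def fit_categories(values):
    """Collect the sorted distinct values seen in the training column."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode values as one-hot rows; unseen values become all-zero rows."""
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=float)
    for row, value in enumerate(values):
        col = index.get(value)
        if col is not None:
            encoded[row, col] = 1.0
    return encoded

# Toy values for a categorical column such as the protocol type.
train_protocol = ["tcp", "udp", "tcp", "icmp"]
test_protocol = ["udp", "tcp", "sctp"]  # "sctp" does not occur in training

categories = fit_categories(train_protocol)  # ['icmp', 'tcp', 'udp']
train_encoded = one_hot(train_protocol, categories)
test_encoded = one_hot(test_protocol, categories)
```

Mapping unseen test values to an all-zero row is a design choice of this sketch; it keeps the training and testing matrices the same width.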
Z-score normalization has also been used here to bring the features to a common scale (a mean of 0 and a standard deviation of 1).
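A minimal sketch of z-score normalization is given below. It assumes that the mean and standard deviation are estimated on the training set and then reused for the test set, which is common practice but is not stated in the text; the function names and the toy arrays are illustrative.

```python
import numpy as np

def fit_zscore(X_train):
    """Estimate per-feature mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_zscore(X, mean, std):
    """Scale X with previously estimated statistics."""
    return (X - mean) / std

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

mean, std = fit_zscore(X_train)
X_train_scaled = apply_zscore(X_train, mean, std)
X_test_scaled = apply_zscore(X_test, mean, std)  # reuses the training statistics
```

After scaling, both training features have zero mean and unit standard deviation even though their original ranges differ by two orders of magnitude.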
The hybrid feature selection (FS) technique (as described in Section 3.2.1 B) has been applied to rank the available features with respect to their expected discrimination capacity. The output of this FS processing step is a series of training and testing feature subsets of increasing dimensionality. The autoencoder (step D) has been applied to each of the generated feature subsets, and the optimal feature subset has been identified.
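The search over subsets of increasing dimensionality can be sketched as follows. The hybrid FS technique itself is not reproduced here: `ranking` is a hypothetical feature ranking, and `toy_evaluate` is a placeholder for training and scoring the autoencoder on a given subset.

```python
def nested_subsets(ranking):
    """Yield the top-1, top-2, ..., top-n features of a ranking."""
    for size in range(1, len(ranking) + 1):
        yield ranking[:size]

def select_best_subset(ranking, evaluate):
    """Return the nested subset with the highest evaluation score."""
    best_subset, best_score = None, float("-inf")
    for subset in nested_subsets(ranking):
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score

# Hypothetical ranking of five feature indices, best first.
ranking = [3, 0, 4, 1, 2]

def toy_evaluate(subset):
    # Placeholder score: features 3, 0 and 4 are treated as informative,
    # and every feature adds a small cost. In the pipeline this would be
    # the performance obtained with the autoencoder on that subset.
    informative = {3, 0, 4}
    gain = sum(1.0 for feature in subset if feature in informative)
    return gain - 0.1 * len(subset)

best_subset, best_score = select_best_subset(ranking, toy_evaluate)
```

With this placeholder score the search stops improving after the third feature, so the top-3 subset is selected.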
An autoencoder has been applied to the selected features using only those data points of the training set that belong to class 1 (normal activity). The main idea is to use the autoencoder to learn the normal behavior of users and then to use it to detect abnormal states (intrusions). Once this “normal” data set (class 1) has been obtained, the training of the ML model proceeds in an unsupervised fashion, without the need for labels. A critical advantage of this method is that it can identify abnormal conditions even if these have not been encountered during the training phase. With this method there is no need to inject anomalies (class 2) during the training phase, and neither intrusion logs nor changes to the users’ standard behavior are required. The learning procedure can be organized as follows:
- The autoencoder is trained using only data from class 1 (and specifically using the features selected in step C).
- Once trained, the model is used to reconstruct data from both class 1 and class 2.
- The differences D between the autoencoder’s input and output are calculated for each data point.
- The generated D values (which are in fact reconstruction errors) for the class 1 training data are grouped together, forming the group D1. The same process is repeated for the class 2 training data points, forming the group D2.
- The higher the reconstruction error, the higher the likelihood that the data point is an anomaly. Based on this, a classifier is finally applied to separate groups D1 and D2 and to decide whether an input is an anomaly (possibly an intrusion) or not.
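The steps above can be sketched end to end on synthetic data. Two simplifications are assumed and should not be read as the pipeline's actual design: the deep autoencoder is replaced by a linear autoencoder fitted in closed form with an SVD (equivalent to PCA), and the final classifier is reduced to a single threshold on the reconstruction error. All names, sizes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: class 1 ("normal") points lie near a 2-D plane
# inside a 5-D feature space; class 2 ("anomalous") points do not.
basis = rng.normal(size=(2, 5))

def make_normal(n):
    return rng.normal(size=(n, 2)) @ basis + 0.05 * rng.normal(size=(n, 5))

def make_anomalous(n):
    return 3.0 * rng.normal(size=(n, 5))

X1_train, X2_train = make_normal(200), make_anomalous(40)

# Train the (linear) autoencoder on class 1 only.
mean = X1_train.mean(axis=0)
_, _, vt = np.linalg.svd(X1_train - mean, full_matrices=False)
components = vt[:2]  # bottleneck of size 2, an arbitrary choice

def reconstruct(X):
    code = (X - mean) @ components.T  # encoder
    return code @ components + mean   # decoder

def reconstruction_error(X):
    """Mean squared difference D between input and output, per data point."""
    return np.mean((X - reconstruct(X)) ** 2, axis=1)

# Reconstruct both classes and group the errors into D1 and D2.
D1 = reconstruction_error(X1_train)
D2 = reconstruction_error(X2_train)

def fit_threshold(D1, D2):
    """Pick the error threshold that best separates D1 (below) from D2 (above)."""
    candidates = np.sort(np.concatenate([D1, D2]))
    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        acc = (np.sum(D1 <= t) + np.sum(D2 > t)) / (len(D1) + len(D2))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

threshold, train_accuracy = fit_threshold(D1, D2)

def is_anomaly(X):
    """Flag points whose reconstruction error exceeds the learned threshold."""
    return reconstruction_error(X) > threshold
```

Because the model only ever sees class 1 data, it reconstructs normal points well and anomalous points poorly, which is what makes the error usable as an anomaly score.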
The hyperparameters of the autoencoder (number of layers and nodes per layer) have been selected by trial and error.
The testing dataset has been supplied to the trained autoencoder model. The obtained reconstruction errors have been used as input to the trained classifier that classifies them as anomalies or not. The classification accuracy on the testing set has been considered as the final evaluation criterion of the proposed methodology.
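A minimal sketch of this evaluation is shown below, assuming for simplicity that the trained classifier is a single threshold on the reconstruction error. The error values, labels and threshold are toy numbers, not results from the SPHINX evaluation; labels follow the class numbering used above (1 = normal, 2 = anomaly).

```python
import numpy as np

def evaluate(errors, labels, threshold):
    """Classify by thresholding reconstruction errors and summarize the outcome."""
    predicted = np.where(errors > threshold, 2, 1)
    return {
        "accuracy": float(np.mean(predicted == labels)),
        "true_positives": int(np.sum((predicted == 2) & (labels == 2))),
        "false_positives": int(np.sum((predicted == 2) & (labels == 1))),
        "true_negatives": int(np.sum((predicted == 1) & (labels == 1))),
        "false_negatives": int(np.sum((predicted == 1) & (labels == 2))),
    }

# Toy reconstruction errors for six test points and their true classes.
test_errors = np.array([0.01, 0.02, 0.50, 0.03, 0.40, 0.04])
test_labels = np.array([1, 1, 2, 1, 2, 2])

report = evaluate(test_errors, test_labels, threshold=0.1)
```

Reporting the four confusion counts next to the accuracy makes it visible whether errors come from missed intrusions or from false alarms.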
More information about the tested pipelines and the Machine Learning based Intrusion Detection component can be found in Deliverable 3.4, which is publicly available.