
The Machine Learning based Intrusion Detection of SPHINX Toolkit: PART II

Continued from PART I

In order to identify the most suitable and efficient approach for the component's function, three AI pipelines for intrusion detection have been designed, developed and evaluated in an extensive comparative analysis that includes multiple variants of each pipeline with numerous machine-learning (ML) and deep-learning (DL) models.

Pipeline B: Novel feature dimensionality reduction empowered by deep learning towards the extraction of informative risk indicators

The proposed pipeline B includes four processing steps: (i) data pre-processing making use of a fuzzy allocation scheme to convert raw data into fuzzy values, (ii) a data transformation technique that generates images composed of fuzzy memberships, (iii) a novel feature extraction algorithm employing Siamese convolutional neural networks, and finally (iv) a learning process for training and evaluation of the results, as illustrated in the following depiction.

Data preparation: Specifically, a 20% subset of the NSL-KDD training data (the NSL-KDD Train 20 variant) has been used, comprising 25,192 data points, while the NSL-KDD Test+ data file (comprising 22,544 data points) has been utilized for testing. Moreover, simple data encoding techniques have been employed to convert the unstructured, textual (categorical) values into numeric representations that machine learning algorithms can process.
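As an illustration of this encoding step, the minimal sketch below maps NSL-KDD's three symbolic attributes (protocol_type, service, flag) to integer codes. The use of pandas and scikit-learn here is an assumption for illustration; the deliverable does not specify the exact tooling.

```python
# Minimal sketch of the categorical-encoding step, assuming pandas/scikit-learn.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# NSL-KDD's three symbolic attributes; the remaining features are numeric.
CATEGORICAL = ["protocol_type", "service", "flag"]

def encode_categoricals(train: pd.DataFrame, test: pd.DataFrame):
    """Map each symbolic attribute to integer codes, fitting on train + test
    so that symbol values appearing only in the test file do not break the mapping."""
    for col in CATEGORICAL:
        enc = LabelEncoder().fit(pd.concat([train[col], test[col]]))
        train[col] = enc.transform(train[col])
        test[col] = enc.transform(test[col])
    return train, test
```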

Fuzzification and image formulation (Vec2im): First, the non-numeric attributes of the dataset have been converted into numeric values. For the efficient training of machine learning algorithms, input data is typically transformed by a number of pre-processing routines, with data normalization being the gold standard. Different algorithms could be used to normalize the input data (such as min-max normalization or normalization with respect to the standard deviation); however, in this pipeline a fuzzy allocation scheme has been employed, as described below.

Fuzzification: To normalize as well as evaluate the classification capabilities of each feature, a simple fuzzy allocation scheme has been applied that assigns varying degrees of membership of patterns to every class. For feature j, the fuzzy membership u_i(x_{k,j}), indicating the degree to which x_{k,j} belongs to class i, is determined by:

u_i(x_{k,j}) = \left[ \sum_{c=1}^{C} \left( \frac{|x_{k,j} - \bar{u}_{i,j}|}{|x_{k,j} - \bar{u}_{c,j}|} \right)^{2/(b-1)} \right]^{-1}

where \bar{u}_{i,j} = \frac{1}{N_i} \sum_{k \in A_i} x_{k,j} is the class-i mean along the x_{k,j} component, A_i is the set of indexes of the training examples belonging to class i, N_i is the number of class-i patterns, and b is a fuzzification factor (b = 2 in our experiments). In testing, every feature component x_{k,j} has been converted to u_1(x_{k,j}), which denotes its fuzzy membership to class 1 (the normal / non-intrusion class). Values of u_1(x_{k,j}) close to 1 indicate strong membership to the non-intrusion class, whereas values close to 0 are representative of examples belonging to the malicious class.
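A compact sketch of how this fuzzification could be computed per feature column, under the two-class, b = 2 setting described above. The function name and the NumPy implementation are illustrative, not the authors' code:

```python
import numpy as np

def fuzzify_feature(x, x_train, y_train, b=2.0, eps=1e-12):
    """Fuzzy membership of each value in feature column x to each class,
    based on the distance to the per-class feature means (b = 2 here)."""
    classes = np.unique(y_train)
    # Class means of this feature over the training examples (u_bar_{i,j}).
    means = np.array([x_train[y_train == c].mean() for c in classes])
    # Distance of every sample to every class mean: shape (n_samples, n_classes).
    d = np.abs(x[:, None] - means[None, :]) + eps
    # FCM-style memberships: u_i = 1 / sum_c (d_i / d_c)^(2/(b-1)).
    ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (b - 1.0))
    u = 1.0 / ratios.sum(axis=2)   # rows sum to 1 across classes
    return u  # u[:, 0] is the membership to class 0 (e.g. normal traffic)
```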

Vec2im: In the second phase of processing, the generated memberships have been placed in a matrix format, resulting in one grey-scale image per example. Specifically, the 41 features x_{k,j}, j = 1, …, 41 have been transformed into 41 fuzzy memberships u_1(x_{k,j}), j = 1, …, 41, and finally a 7×7 image has been created per sample by placing the fuzzy memberships in a matrix, as presented in the figure below. Zero values have also been included in random cells of the matrix to fill the eight gaps (given that the dimensionality of the initial feature set was 41, with a total of 49 cells to be filled in the matrix). Fuzzy memberships and zero values have been ordered randomly, since it has been concluded that their order does not have any significant impact on the final performance of the proposed methodology.
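A minimal sketch of the Vec2im placement, assuming one random ordering drawn once and reused for every sample (names and the fixed seed are illustrative):

```python
import numpy as np

# One random placement drawn once and reused for every sample, since the
# authors report the ordering has no significant effect on performance.
ORDER = np.random.default_rng(0).permutation(49)

def vec2im(memberships: np.ndarray) -> np.ndarray:
    """41 fuzzy memberships + 8 zero fillers -> one 7x7 grey-scale image."""
    padded = np.concatenate([memberships, np.zeros(49 - memberships.size)])
    return padded[ORDER].reshape(7, 7)
```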

Dimensionality reduction with Siamese deep learning networks: The deep Siamese convolutional neural network (SCNN) architecture is a variant of neural networks originally designed to solve the signature-verification problem of image matching. It has also been used for one-shot image classification, for face verification where the categories are not known in advance, as well as for dimensionality reduction. An SCNN consists of two identical, symmetric CNN subnetworks that share the same weights. In the pipeline testing, each identical CNN has been built using one convolutional layer followed by three fully connected layers. The rectified linear unit (ReLU) nonlinearity has been applied as the activation function for all layers, and the adaptive moment estimation (ADAM) optimizer has been utilized to control the learning rate. The similarity between images has been calculated by the Euclidean distance, and the contrastive loss has been used to define the loss function as follows:

\mathcal{L}(W, L, I_1, I_2) = (1 - L) \, \tfrac{1}{2} D^2 + L \, \tfrac{1}{2} \left( \max(0, \, m - D) \right)^2, \qquad D = \lVert f(I_1) - f(I_2) \rVert_2

I_1 and I_2 are a pair of the generated images fed into the two identical CNNs. L = 1(·) is an indicator function showing whether the two images have the same label: L = 0 represents images with the same label and L = 1 represents the opposite. m is a margin that keeps dissimilar pairs apart, W is the shared parameter vector comprising the weights that the two subnetworks share, f(I_1) and f(I_2) are the latent representation vectors of inputs I_1 and I_2, respectively, and D is the Euclidean distance between them. The selected SCNN architecture, as depicted in the following picture, reduces the dimensionality of the 41-dimensional feature space to a single 1-d space.
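The sketch below illustrates a Siamese branch and a contrastive loss of this kind in PyTorch. The layer widths, kernel size and margin m are assumptions for illustration, as the text only specifies one convolutional layer, three fully connected layers, ReLU activations and the ADAM optimizer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One of the two weight-sharing subnetworks: a single convolutional layer
    followed by three fully connected layers (sizes are assumptions), ending
    in the 1-d latent representation f(I)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # 7x7 input image
        self.fc = nn.Sequential(
            nn.Linear(8 * 7 * 7, 64), nn.ReLU(),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),            # 41-d feature space -> 1-d space
        )

    def forward(self, x):
        x = F.relu(self.conv(x))
        return self.fc(x.flatten(1))

def contrastive_loss(f1, f2, label, margin=1.0):
    """label = 0 for same-label pairs, 1 otherwise; the margin is an assumption."""
    d = F.pairwise_distance(f1, f2)
    return torch.mean((1 - label) * 0.5 * d ** 2
                      + label * 0.5 * torch.clamp(margin - d, min=0) ** 2)

branch = Branch()
optimizer = torch.optim.Adam(branch.parameters())
```

Weight sharing is obtained simply by passing both images of a pair through the same Branch instance, so a single set of parameters is updated.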

Decision making on the reduced feature space: To evaluate the discrimination capacity of the extracted features, various machine learning models have been employed and trained to perform the binary classification task on the resulting 1-d space. The evaluation entailed linear discriminant analysis (LDA) and Naïve Bayes to provide a baseline for comparison with more advanced models. It also covered decision trees (DT) driven by Gini's diversity index, k-nearest neighbours (KNN), as well as non-linear support vector machines (SVM) with a Gaussian kernel, which can deal with the overfitting problems that appear in high-dimensional spaces. The ensemble techniques AdaBoost and Random Forest have also been evaluated, using DT models as weak learners.
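A hedged sketch of this model comparison, assuming scikit-learn and placeholder arrays X_train_1d / X_test_1d of shape (n_samples, 1) produced by the SCNN step; the hyperparameters shown are library defaults, not the tuned values:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# The models named above; DT stumps serve as the AdaBoost weak learner.
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(criterion="gini"),
    "KNN": KNeighborsClassifier(),
    "SVM-RBF": SVC(kernel="rbf"),   # Gaussian kernel
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)),
    "RandomForest": RandomForestClassifier(),
}

# X_train_1d / X_test_1d: the 1-d SCNN embeddings; y_train / y_test: labels.
for name, model in models.items():
    model.fit(X_train_1d, y_train)
    print(name, model.score(X_test_1d, y_test))
```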

Validation: To achieve a fair comparison between the different approaches, hyperparameter selection has been performed for each of the investigated machine learning algorithms. A validation subset (a randomly selected 10%) has been held out from the training set as a criterion for: (i) selecting the optimum hyperparameters by means of a grid-search process, as well as (ii) deciding the termination of the SCNN learning.
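A minimal sketch of this protocol, assuming scikit-learn and the same placeholder arrays as above; the KNN parameter grid is purely illustrative:

```python
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.neighbors import KNeighborsClassifier

# Hold out a randomly selected 10% of the training set for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_1d, y_train, test_size=0.10, random_state=0)

# Grid search against the held-out subset (illustrative grid for one model).
param_grid = {"n_neighbors": [1, 3, 5, 7]}
best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    model = KNeighborsClassifier(**params).fit(X_tr, y_tr)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_params = score, params
```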

More information about the tested pipelines and the Machine Learning based Intrusion Detection component can be found in Deliverable 3.4, which is publicly available here.