
The Machine Learning based Intrusion Detection of SPHINX Toolkit: PART I

The Machine Learning based Intrusion Detection (MLID) component is part of the Automated Cyber Security Risk Assessment building block of the SPHINX Toolkit.

The component draws data from the SPHINX Honeypot and classifies it into different categories (classes). MLID then sends the low-level classification outputs (i) back to the honeypot, which takes immediate action if needed, and (ii) to the Decision Support System (DSS) of SPHINX, which processes the received information further and transforms it into actionable rules. Finally, the outcomes of MLID are communicated to the SPHINX knowledge base.

To identify the most suitable and efficient approach for the component's function, three AI pipelines for intrusion detection have been designed, developed and evaluated in an extensive comparative analysis that includes multiple variants of each pipeline with numerous machine-learning (ML) and deep-learning (DL) models.

Pipeline A: Machine learning pipeline for intrusion detection

The first machine learning pipeline has been designed and developed to identify intrusion patterns in the selected data classification problem. The proposed methodology is depicted in the following diagram and comprises five (5) processing phases:

Data preparation and handling: The NSL-KDDTrain20 data file (comprising 25,192 data points) has been used for training and the NSL-KDD Test+ data file (comprising 22,544 data points) has been used for testing. Moreover, simple data encoding techniques have been employed to convert textual (categorical) feature values into numeric representations that machine learning algorithms can process.
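The encoding step can be illustrated with a minimal sketch. The text does not say which encoding was used, so integer (label) encoding is an assumption here, as are the helper names `fit_label_encoder` and `encode` and the toy protocol values. The mapping is learned from the training values only, and any category seen only at test time is mapped to a placeholder value.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories become `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# Toy values for one categorical feature (illustrative, not taken from the data files).
train_protocols = ["tcp", "udp", "tcp", "icmp"]
test_protocols = ["udp", "tcp", "gre"]  # "gre" does not occur in the training values

mapping = fit_label_encoder(train_protocols)
encoded_train = encode(train_protocols, mapping)
encoded_test = encode(test_protocols, mapping)

print(mapping)
print(encoded_train, encoded_test)
```

Integer codes impose an artificial order on the categories, which tree-based models tolerate well; for distance-based models such as nearest neighbor classifiers, one-hot encoding is a common alternative.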

Normalization: Z-score has been used to normalize features to a common scale (having a mean of zero and a standard deviation of 1). Specifically, the data points $x_{i,j}$ were standardized with the following formula:

$$z_{i,j} = \frac{x_{i,j} - \bar{x}_j}{s_j} \qquad (4)$$

where:

– $x_{i,j}$ denotes the feature j of data sample $x_i$

– $\bar{x}_j$ denotes the mean value of feature j

– $s_j$ is the standard deviation of feature j

– $z_{i,j}$ denotes the normalized version of $x_{i,j}$
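The standardization formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's code: the function names are hypothetical, the sample standard deviation is assumed for the per-feature spread, constant features are mapped to zero, and the statistics are estimated on the training data and then reused for the test data.

```python
from statistics import mean, stdev


def fit_zscore(rows):
    """Estimate the per-feature mean and sample standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    stds = [stdev(column) for column in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Compute z = (x - mean) / std per feature; constant features map to 0.0."""
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Toy data: three training samples and one test sample with two features each.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 20.0]]

means, stds = fit_zscore(train)
z_train = apply_zscore(train, means, stds)
z_test = apply_zscore(test, means, stds)

print(z_train)
print(z_test)
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.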

Feature Selection: A hybrid feature selection (FS) technique has been employed to identify the most informative features and rank them in order of significance. The technique combines the outcomes of multiple well-known FS models to avoid bias.
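One simple way to combine several feature rankings is mean-rank aggregation, sketched below. The text does not describe how the hybrid technique merges its constituent models, so the aggregation rule, the function names and the scores of the three hypothetical selectors are all assumptions made for illustration.

```python
def ranks_from_scores(scores):
    """Rank features by descending score; rank 1 is the most informative."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feature: position for position, feature in enumerate(ordered, start=1)}


def aggregate_rankings(score_tables):
    """Order features by their mean rank across all selectors (ties by name)."""
    rankings = [ranks_from_scores(table) for table in score_tables]
    features = list(rankings[0])
    mean_rank = {
        feature: sum(ranking[feature] for ranking in rankings) / len(rankings)
        for feature in features
    }
    return sorted(features, key=lambda feature: (mean_rank[feature], feature))


# Made-up importance scores from three hypothetical selectors.
selector_a = {"src_bytes": 0.9, "dst_bytes": 0.7, "duration": 0.2, "count": 0.5}
selector_b = {"src_bytes": 0.6, "dst_bytes": 0.8, "duration": 0.1, "count": 0.4}
selector_c = {"src_bytes": 0.8, "dst_bytes": 0.3, "duration": 0.2, "count": 0.7}

ranking = aggregate_rankings([selector_a, selector_b, selector_c])
top_features = ranking[:2]

print(ranking)
```

Averaging ranks rather than raw scores keeps selectors with different score scales comparable, so no single method dominates the final ordering.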

Machine learning: Six (6) machine learning techniques have been investigated for their suitability in identifying intrusion patterns using the selected features as generated in the previous phase:

– AdaBoost

– Random Forest

– Support Vector Machines

– Nearest neighbor classifier

– Decision Trees

– Discriminant analysis

Hyperparameter selection has been performed to optimize the performance of the ML models. A validation subset has been held out from the training set (a randomly selected 10%) as a criterion for selecting the optimum hyperparameters by means of a grid search process.
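The hold-out grid search described above can be sketched as follows. To stay self-contained, the example uses a small hand-written nearest neighbor classifier on synthetic two-class data; the candidate grid, the random seeds and all function names are assumptions, and the real pipeline would apply the same procedure to each of the six model families on the selected features.

```python
import random
from collections import Counter


def holdout_split(samples, labels, fraction=0.1, seed=0):
    """Randomly hold out `fraction` of the training data for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(fraction * len(indices)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (
        [samples[i] for i in train_idx], [labels[i] for i in train_idx],
        [samples[i] for i in val_idx], [labels[i] for i in val_idx],
    )


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    def squared_distance(i):
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    nearest = sorted(range(len(train_x)), key=squared_distance)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def validation_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation samples classified correctly for a given k."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(val_x, val_y))
    return hits / len(val_y)


def grid_search(train_x, train_y, val_x, val_y, grid):
    """Score every candidate on the validation subset and keep the best one."""
    scores = {k: validation_accuracy(train_x, train_y, val_x, val_y, k) for k in grid}
    best_k = max(scores, key=scores.get)  # the first maximum wins on ties
    return best_k, scores


# Synthetic two-class data, purely illustrative (not NSL-KDD).
rng = random.Random(42)
samples = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
samples += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(100)]
labels = ["normal"] * 100 + ["attack"] * 100

train_x, train_y, val_x, val_y = holdout_split(samples, labels, fraction=0.1)
best_k, scores = grid_search(train_x, train_y, val_x, val_y, grid=[1, 3, 5, 7])

print(best_k, scores)
```

The test set plays no part in this loop: only the held-out 10% of the training data decides which hyperparameter value is kept.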

Validation: The discrimination capacity of the proposed pipeline has been evaluated on the testing dataset.
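As an illustration of how discrimination capacity might be quantified on a test set, the sketch below computes accuracy, precision, recall and F1 for a two-class case. The text does not name the metrics that were used, so this choice, the binary "attack" versus "normal" labelling and the toy predictions are assumptions.

```python
def binary_metrics(y_true, y_pred, positive="attack"):
    """Confusion-matrix based metrics for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy ground truth and predictions (illustrative only).
y_true = ["attack", "attack", "normal", "normal", "attack", "normal"]
y_pred = ["attack", "normal", "normal", "attack", "attack", "normal"]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a multi-class setting, the same function can be called once per class with that class as `positive` and the results averaged.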

More information about the tested pipelines and the Machine Learning based Intrusion Detection component can be found in Deliverable 3.4, which is publicly available.