Oliwia NOWICKA and Adrian P. WOŹNIAK
Abstract:
This study investigates how effectively supervised machine learning algorithms detect network security incidents, using the CIC-IDS-2017 dataset. It compares four algorithms, namely Support Vector Machine (SVM), Naive Bayes (Gaussian and Bernoulli variants), Random Forest, and XGBoost, on both binary and multiclass classification tasks. The dataset underwent extensive preprocessing: cleaning, dimensionality reduction with principal component analysis (PCA), and oversampling to address class imbalance. Models were evaluated on accuracy, precision, recall, F1-score (macro and weighted averages), and computational efficiency. The ensemble methods, Random Forest and XGBoost in particular, significantly outperform the other models in detection accuracy: after hyperparameter tuning, their F1-scores exceed 98% in binary classification and 85% (macro average) in multiclass classification. SVM performed solidly but at a higher computational cost, while the Naive Bayes models trained quickly but detected incidents less effectively. The findings confirm the suitability of tree-based ensemble models for intrusion detection systems (IDS), highlighting their robustness, scalability, and accuracy in identifying both known and novel threats.
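Because the multiclass result is reported as a macro average while weighted averages are also reported, it may help to see how the two differ on imbalanced data. The following is a minimal sketch using only the Python standard library; the function name `f1_report` and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return (per_class_f1, macro_f1, weighted_f1) for two label lists.

    Hypothetical helper for illustration; not taken from the study.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true samples per class
    pairs = list(zip(y_true, y_pred))

    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        per_class[label] = 2 * tp / denom if denom else 0.0

    # Macro: every class counts equally, regardless of its size.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: each class counts in proportion to its true sample count.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, macro, weighted


# Made-up toy labels: 8 benign flows, 2 attack flows, one attack missed.
y_true = ["BENIGN"] * 8 + ["ATTACK"] * 2
y_pred = ["BENIGN"] * 8 + ["BENIGN", "ATTACK"]

per_class, macro, weighted = f1_report(y_true, y_pred)
print(f"per-class: {per_class}")
print(f"macro F1: {macro:.4f}, weighted F1: {weighted:.4f}")
```

In this toy case the missed minority-class attack lowers the macro average more than the weighted average, which is why a macro F1-score is the stricter figure for datasets with rare attack classes.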