Authors:
Jeremiah Pitts, Betim Hodza, Ilhan Gelle, and Abinash Bastola
The University of Texas at Arlington, Team Bytewise
The global connectivity of the Internet demands robust network security to protect systems, making Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) crucial. Traditional IDS/IPS often struggle with real-time threat detection due to reliance on predefined rules and high false positives. Machine learning (ML) offers a promising solution by enabling real-time detection and classification of malicious traffic. This project evaluates ML models using the UNSW-NB15 dataset, which contains diverse real-world traffic characteristics and multiple attack categories. Data preprocessing techniques—such as feature selection, normalization, and handling class imbalances—are applied to improve model performance. The goal is to assess how well various classification algorithms can differentiate between normal and malicious traffic.
This repository contains a Python-based machine learning pipeline for detecting DDoS attacks. The code preprocesses the UNSW-NB15 dataset, balances classes with SMOTE, and trains four models (Decision Tree, Random Forest, Extra Trees, and XGBoost). A stacking classifier is built from these base models, and performance is evaluated using accuracy and weighted F1 scores. In addition, live network traffic can be monitored with PyShark, and, if enabled, suspicious IPs are automatically blocked using OS-specific commands.
Furthermore, the pipeline exports several CSV files (feature distributions, class distributions, hyperparameter tuning results, classifier performance metrics, and confusion matrix data) for external visualization using SAS.
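The class-balancing step mentioned above can be illustrated without any third-party packages. The sketch below is a simplified, from-scratch version of SMOTE's core idea: each synthetic sample is a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The function name `smote_oversample` and its parameters are hypothetical and exist only for this illustration. The pipeline itself presumably relies on the imbalanced-learn package listed in the requirements, whose implementation is more complete.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (lists or tuples).
    Returns a list of new feature vectors (lists of floats).
    Hypothetical helper for illustration; not taken from main.py.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class with two features per sample.
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for point in smote_oversample(minority_samples, n_new=4, k=2):
    print(point)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies, which is why SMOTE is often preferred over plain duplication of rare samples.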
- main.py: The main pipeline code. It contains functions for data preprocessing, model training/evaluation, live DDoS detection/prevention, and CSV export routines for SAS visualization.
- trained_feature_names.pkl: Pickle file storing the list of features used by the model (exported from the training pipeline).
- trained_stacking_model.pkl: Pickle file for the final stacking classifier.
- trained_scaler.pkl: Pickle file for the RobustScaler used to scale the features.
- trained_label_encoder.pkl: Pickle file for the LabelEncoder used for the target column.
- CSV files for SAS visualization, generated when running the training pipeline with the CSV export functions:
  - feature_distribution.csv
  - class_distribution.csv
  - hyperparameter_tuning.csv
  - classifier_performance.csv
  - confusion_matrix.csv
- generate_figures.sas: A SAS script (using relative paths) that imports the above CSV files to generate figures (histograms, bar charts, heatmaps, etc.) and save them as an HTML file.
- Python 3.x
- Packages:
  - pandas
  - numpy
  - scikit-learn
  - xgboost
  - imbalanced-learn
  - pyshark
  - joblib
  - matplotlib
  - seaborn
- A SAS environment (SAS Studio, Enterprise Guide, or similar) to run the SAS visualization script.
- Clone the repository or download the files.
- Install the required Python packages using pip:
  pip install pandas numpy scikit-learn xgboost imbalanced-learn pyshark joblib matplotlib seaborn
- Ensure the UNSW-NB15 dataset CSV files (UNSW_NB15_training-set.csv and UNSW_NB15_testing-set.csv) are placed in the working directory, or in the location you pass via --train_file and --test_file (the training command below uses ./datasets/).
To train the model and generate the CSV files for SAS visualization, run:
python main.py --action train --train_file ./datasets/UNSW_NB15_training-set.csv --test_file ./datasets/UNSW_NB15_testing-set.csv
This command will:
- Preprocess the data.
- Train the base models and a stacking classifier.
- Evaluate the models (printing accuracy, weighted F1 scores, and confusion matrices).
- Export the following CSV files to the working directory:
  - feature_distribution.csv
  - class_distribution.csv
  - hyperparameter_tuning.csv
  - classifier_performance.csv
  - confusion_matrix.csv
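To make the reported metrics concrete, the following dependency-free sketch shows how accuracy, a confusion matrix, and a weighted F1 score are defined. It is an illustration of the definitions only: the function names are hypothetical, the labels are toy data rather than UNSW-NB15 results, and the pipeline itself most likely computes these values with scikit-learn.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += count * f1
    return total / len(y_true)


# Toy labels for illustration only (not results from the dataset).
y_true = ["normal", "normal", "normal", "attack", "attack"]
y_pred = ["normal", "normal", "attack", "attack", "attack"]

print(confusion_matrix(y_true, y_pred, labels=["normal", "attack"]))
print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

Weighting each class's F1 by its number of true samples keeps a rare attack category from being averaged on equal footing with the dominant classes, which matters when class sizes differ as much as they do in intrusion-detection data.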
To run the live network monitoring (detection or prevention mode), use the following command (replace Wi-Fi with your network interface name if needed):
python main.py --action monitor --ddos_mode prevention --interface "Wi-Fi" --duration 30 --port 8000
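In prevention mode, suspicious IPs are blocked with OS-specific commands. The sketch below shows one way such a command could be chosen per platform. It is an assumption-laden illustration, not the code in main.py: the function name `build_block_command` is hypothetical, and the `iptables` and `netsh advfirewall` rules are common choices that may differ from what the pipeline actually runs. The function only builds the argument list and does not execute anything.

```python
import ipaddress
import platform


def build_block_command(ip, system=None):
    """Return the argv list for a firewall rule dropping inbound traffic from ip.

    Hypothetical helper for illustration. Nothing is executed here; running
    the returned command would require administrator/root privileges.
    """
    # Validate the address first so malformed input never reaches a shell.
    # IPv4Address raises a ValueError subclass on invalid input.
    ip = str(ipaddress.IPv4Address(ip))
    system = system or platform.system()
    if system == "Linux":
        return ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]
    if system == "Windows":
        return [
            "netsh", "advfirewall", "firewall", "add", "rule",
            f"name=Block_{ip}", "dir=in", "action=block", f"remoteip={ip}",
        ]
    raise NotImplementedError(f"no blocking command defined for {system!r}")


# Show the command for each supported platform without running it.
print(build_block_command("10.0.0.5", system="Linux"))
print(build_block_command("10.0.0.5", system="Windows"))
```

Returning an argument list rather than a shell string lets the caller pass it to `subprocess.run` without `shell=True`, which avoids shell injection if an address ever comes from untrusted packet data.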
After the CSV files are generated by the training pipeline, run the SAS script to generate figures:
- Ensure the CSV files and generate_figures.sas are in the same directory.
- Open your SAS environment (e.g., SAS Studio or Enterprise Guide).
- Open the generate_figures.sas script.
- Run the script. It will produce an HTML file (Figures.html) with all the generated figures.
This code supports the research paper titled:
Comparative Analysis of Machine Learning Models for DDoS Attack Detection
Jeremiah Pitts, Betim Hodza, Ilhan Gelle, and Abinash Bastola
The University of Texas at Arlington, Team Bytewise
The paper details the challenges of traditional IDS/IPS, the methodology used (including data preprocessing, model training, and hyperparameter tuning), and the experimental results comparing multiple ML models using the UNSW-NB15 dataset. Please refer to the paper for detailed analysis, tables, and figures that summarize our findings.
- Jeremiah Pitts: jnp2934@mavs.uta.edu
- Ilhan Gelle: ilhan.gelle@mavs.uta.edu
- Betim Hodza: bxh8702@mavs.uta.edu
- Abinash Bastola: axb9775@mavs.uta.edu
Additional contact details are available upon request.
This project is licensed under the MIT License.