dc.description.abstract |
Every day, more people connect to and use the internet, a global network. The growing number of users and applications also brings network security risks. Although firewalls block unauthorized access, once a service inside a private network is exposed to the internet it becomes reachable by both malicious and benign users. Intrusion detection systems are used successfully to detect attacks that arise from the access firewalls permit. With the growth of artificial intelligence applications, intrusion detection systems also achieve better results by using machine learning algorithms, which apply mathematical and statistical methods to draw inferences from existing data and make predictions about unknowns. This study uses the InSDN dataset, created in 2020 for Software-Defined Networks. The dataset contains normal traffic and 7 attack types: web attacks, DoS, DDoS, information gathering, botnet, escalation of a user account to an administrator account, and brute force attacks. Data cleaning, data transformation, data normalization, and feature selection methods were applied to make the dataset more meaningful. Python and its libraries were used for the work on the dataset. The Pandas, NumPy, Seaborn, and Matplotlib libraries were used to make the data more meaningful. Sklearn was used for the machine learning algorithms applied to intrusion detection, and the Keras library for the deep learning algorithms. 
For the experiments, the features obtained in the feature selection stage by the Chi-Square, Spearman, Mutual Information, Kendall, ANOVA, Principal Component Analysis, and wrapper methods were used to train nine different machine learning algorithms: Logistic Regression, Decision Trees, K-Nearest Neighbors, AdaBoost, XGBoost, Random Forest, Convolutional Neural Networks, Recurrent Neural Networks, and Long Short-Term Memory. During the experiments, the accuracy of the deep learning algorithms was observed to fall as the number of features decreased. In this study, in which the models were trained on features from different feature selection methods, XGBoost was found to be the best-performing algorithm in terms of accuracy and the other metrics. The experimental results show that XGBoost achieves better accuracy on the InSDN dataset than the other proposed methods. |
|
dc.description.abstract |
Every day, more people connect to and use the internet, a global network. The growing number of users and applications brings network security risks. Although firewalls prevent unauthorized access, a service inside a private network that is exposed to the internet becomes reachable by both malicious and benign users. Intrusion detection systems are used successfully to detect threats that arise from the access firewalls permit. With the growth of artificial intelligence applications, intrusion detection systems also achieve better results by using machine learning algorithms, which apply mathematical and statistical methods to draw inferences from existing data and make predictions about unknowns. Software-defined networks separate the forwarding of network traffic from the control mechanisms, and the two are managed at different layers. This separation makes control of network traffic more manageable and extensible. A software-defined network comprises infrastructure, control, and application layers. This study uses the InSDN dataset, created in 2020 for Software-Defined Networks. The dataset contains normal traffic and 7 attack types: web attacks, DoS, DDoS, information gathering, botnet activity, escalation of a user account to an administrator account, and brute force attacks. It holds 275,515 attack records and 68,424 normal traffic records, with a total of 84 features. Data cleaning, data transformation, data normalization, and feature selection methods were applied to make the dataset more meaningful. 
During the data cleaning stage, the features that contain only zero values (Fwd PSH Flags, Fwd URG Flags, CWE Flag Count, ECE Flag Cnt, Fwd Byts/b Avg, Fwd Pkts/b Avg, Fwd Blk Rate Avg, Bwd Byts/b Avg, Bwd Pkts/b Avg, Bwd Blk Rate Avg, and Fwd Seg Size Min) were removed from the dataset. The variables Flow ID, Src IP, Dst IP, Timestamp, Src Port, Dst Port, and Protocol were also excluded because of their variability, since including them could lead the model to overfit. During the data transformation stage, every record was labeled as either attack or normal, so the study performs binary classification. Label encoding was performed in Python with the Pandas library. The performance of machine learning algorithms depends on the quality and scale of the data, so the values were normalized to a consistent scale with the StandardScaler function from the Python sklearn library. Feature selection methods were then applied to find the smallest set of features that yields the same result, keeping the features that contribute most to the outcome. The methods used were Chi-Square, Spearman, Mutual Information, Kendall, ANOVA, Principal Component Analysis, and Recursive Feature Elimination. The number of features produced by each method is as follows: Chi-Square 39, Spearman 40, Mutual Information 39, Kendall 27, ANOVA 40, Principal Component Analysis 30, Recursive Forward Selection 50, and Recursive Backward Selection 59. The output of each feature selection method was used to train different machine learning algorithms, and their performance rates were compared. The work on the dataset was carried out with Python and its libraries. The Pandas, NumPy, Seaborn, and Matplotlib libraries were used to make the data more meaningful. 
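The cleaning, labeling, normalization, and feature selection steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up toy table, not the study's code: the `preprocess` helper, the `Label` column name, the `Normal` class value, the toy rows, and the choice of `k` are all assumptions, and only the ANOVA filter is shown. In practice the scaler would normally be fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler


def preprocess(df, id_columns, label_column="Label", normal_value="Normal", k=2):
    """Hypothetical helper: clean, label, scale, and select features."""
    # Drop identifier-like columns (IP addresses, ports, timestamps, ...).
    df = df.drop(columns=[c for c in id_columns if c in df.columns])
    # Binary target: 0 for normal traffic, 1 for any attack class.
    y = (df[label_column] != normal_value).astype(int).to_numpy()
    X = df.drop(columns=[label_column])
    # Remove features whose values are all zero.
    X = X.loc[:, (X != 0).any(axis=0)]
    # Standardize each feature to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)
    # Keep the k features with the highest ANOVA F-score.
    selector = SelectKBest(score_func=f_classif, k=min(k, X.shape[1]))
    X_selected = selector.fit_transform(X_scaled, y)
    kept = list(X.columns[selector.get_support()])
    return X_selected, y, kept


# Toy rows (invented values) that mimic the shape of flow records.
toy = pd.DataFrame({
    "Src IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5", "10.0.0.6"],
    "Flow Duration": [10.0, 12.0, 11.0, 90.0, 95.0, 99.0],
    "Tot Fwd Pkts": [3.0, 4.0, 3.0, 40.0, 42.0, 45.0],
    "Fwd PSH Flags": [0, 0, 0, 0, 0, 0],
    "Noise": [1.0, 5.0, 2.0, 4.0, 1.5, 4.5],
    "Label": ["Normal", "Normal", "Normal", "DoS", "DDoS", "Probe"],
})

X_sel, y, kept = preprocess(toy, id_columns=["Src IP"])
print(kept)
```

Swapping `f_classif` for `chi2` or `mutual_info_classif` would give the corresponding filter variants, with the caveat that `chi2` expects non-negative inputs and therefore does not accept standardized values directly.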
The Scikit-learn (Sklearn) library was used for the machine learning algorithms applied to intrusion detection, and the Keras library for the deep learning algorithms. In the experimental study, the parameters of nine different machine learning algorithms were tuned: Logistic Regression, Decision Trees, K-Nearest Neighbors, AdaBoost, XGBoost, Random Forest, Convolutional Neural Networks, Recurrent Neural Networks, and Long Short-Term Memory. To observe the models' performance, the dataset was divided into training and testing sets, and each trained model was evaluated on the held-out test data. The highest test accuracy achieved by each algorithm, together with the feature selection method that produced it, is as follows: Logistic Regression with Sequential Forward Selection, 0.9967; Decision Trees with Sequential Forward Selection, 0.9999; K-Nearest Neighbors with Sequential Backward Selection or Chi-Square, 0.9998; AdaBoost with Sequential Backward Selection, 0.9990; XGBoost with ANOVA, Chi-Square, Mutual Information, Sequential Backward Selection, or Sequential Forward Selection, 0.9999; Random Forest with Chi-Square or Sequential Forward Selection, 0.9999; Convolutional Neural Networks with Mutual Information, 0.8743; Recurrent Neural Networks with ANOVA or Chi-Square, 0.9935; and Long Short-Term Memory with Mutual Information, 0.9927. 
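The split-train-evaluate procedure described above might look like the following sketch. It runs on synthetic data from `make_classification` rather than InSDN, so the scores it prints say nothing about the reported results. The 75/25 split, the hyperparameters, the random seeds, and the roughly 80/20 class weighting (chosen to loosely mirror the attack-to-normal ratio) are assumptions. XGBoost and the Keras models are left out to keep the example dependent on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix and binary labels.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.2, 0.8], random_state=0,
)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A subset of the classical algorithms named in the abstract.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Train each model and record its accuracy on the held-out test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.4f}")
```

With a class imbalance like the one in the dataset, accuracy alone can be misleading, so precision, recall, and F1 from `sklearn.metrics` are worth reporting alongside it.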
During the experimental study, the performance of the deep learning algorithms was observed to fall as the number of features decreased. Convolutional Neural Networks showed the lowest performance of all the algorithms on this dataset, and Recurrent Neural Networks and Long Short-Term Memory achieved lower performance rates than the machine learning algorithms. Considering both run time and performance, the Decision Trees algorithm gave the highest performance with the fewest features when combined with the Kendall feature selection algorithm. Kendall showed that a comparable result could be achieved with 27 features: with those 27 features, Decision Trees reached a performance rate of 0.9878, making it the most successful algorithm in this context. This study aims to create an intelligent system for software-defined networks, and to that end it investigated the performance of various machine learning algorithms. Across variations in dataset size, dataset content, and algorithm parameter settings, deep learning algorithms generally achieved lower performance rates, while machine learning algorithms tended to achieve higher ones. For the design of an intrusion detection system in software-defined networks specifically, the XGBoost algorithm was identified as a potential choice because of its relatively higher performance. |
|