dc.description.abstract |
With advancing technology, data traffic on the internet grows every day. Security systems are needed to keep this growing traffic under control. Many security systems exist, both hardware and software. One of them is the firewall. A firewall is a system through which all internet traffic passes and which assigns an outcome to that traffic according to rules defined for the organization's needs. At TDİ, the institution from which the datasets in this thesis were obtained, all internet traffic passes through the firewall, and the firewall assigns an outcome to each traffic flow. Every firewall comes with default rules, but system administrators can adjust them as needed. Firewalls classify most internet traffic into categories in an orderly way. Accordingly, when traffic leaves the local network for the internet, web pages or applications in the relevant category are expected to be blocked. Likewise, harmful websites or applications that try to reach the local network from the internet are blocked because such access is not wanted. In this study, five datasets were created from log records collected from the firewall on five different days and at different times. After data preprocessing, the datasets were converted into datasets that produce an outcome from 26 parameters. The study then determined which of the 26 variables in the final datasets had the greatest effect on the outcome. To identify the log fields with the greatest effect, Multiple Linear Regression and Principal Component Analysis (PCA) were used, and new six-variable datasets were created. 
Whether the variables eliminated in building the six-variable datasets were the right ones to eliminate was assessed with an Artificial Neural Networks algorithm. Both the 26-variable datasets obtained from the firewall and the six-variable datasets derived with Multiple Linear Regression and PCA were classified, and the results were compared on accuracy, precision, recall, f1-score, and specificity. The comparison confirmed that the variables removed using Multiple Linear Regression and PCA were the right ones to remove. The accuracy, f1-score, recall, precision, and specificity values show that the outcome of firewall traffic can be predicted with a high success rate by the Artificial Neural Networks algorithm. |
|
dc.description.abstract |
As technology evolves, data traffic over the internet keeps growing. Machine learning, big data, and many other fields draw on this data in their studies. The growth in data volume both expands the research carried out on data and improves the results of that research. At the same time, cyber threats are rising sharply and putting data privacy and integrity at risk, so protecting data has become imperative. Today, cybersecurity is a necessity for businesses rather than an option. Firewalls, security solutions, and cybersecurity practices are fundamental components of information systems. They provide effective protection against evolving threats and are vital for preserving data privacy and integrity. Growing data traffic requires security systems to keep it under control, and using such systems has become unavoidable. These solutions secure systems and networks in small, medium, and large-scale operations against a variety of system and network attacks. They are designed across different layers and with different methods to counter cyber threats. Security systems consist of many hardware and software components. Modern security systems also use technologies such as artificial intelligence and behavioral analysis to detect and prevent cyber attacks. Security systems such as firewalls, antivirus programs, antispam filters, and mail gateways protect against malicious software and generate alerts. Among them, the firewall is one of the most critical. Firewalls primarily monitor network traffic and prevent or isolate unwanted and harmful content. They act as a barrier against unauthorized access and malicious software. 
Unlike security systems that only prevent malware, firewalls do not aim to detect a specific piece of malicious software; they protect systems by blocking suspicious traffic observed within the existing internet traffic. Firewalls analyze internet traffic and decide, based on defined rules, which traffic is permitted and which is blocked. Firewall systems include security functions such as IPS (Intrusion Prevention System) and IDS (Intrusion Detection System). In older systems IPS and IDS were separate, whereas modern firewall systems integrate them, an approach termed Unified Threat Management. Firewalls operate on specific rules. Rules can be defined for web, application, DNS, VPN, antivirus, and other traffic types, and traffic is governed accordingly: some is blocked and some is permitted. Both monitoring and blocking take place within the firewall. As data traffic grows and threats become more complex, traditional firewalls are increasingly likely to prove inadequate. This is where machine learning comes in: it classifies and analyzes data, detects abnormal activity, and identifies new threats through learning. Machine learning thus forms a critical link between firewalls, security solutions, and cybersecurity. Algorithmic data analysis makes these systems more effective and enables robust protection against increasingly sophisticated threats. At TDİ, the organization from which the datasets were obtained, all internet traffic passes through the firewall. As a public institution, TDİ differs from ordinary companies in that it carries certain traffic that occurs only within public institutions. For instance, KamuNet is a dedicated circuit exclusive to public institutions that is not accessible from the internet. Services such as electronic signature verification and e-government services are provided over the KamuNet circuit. 
As a result, the traffic observed on the firewall includes data not found at other companies. The log records also contain IPSec traffic between remote locations and the center. The datasets used in the thesis consist of logs of firewall traffic. The firewall assigns an outcome to all internet traffic that passes through it. Every firewall has default rules, but system administrators can modify them as needed. Firewalls classify most internet traffic into categories. Accordingly, when traffic leaves the local network for the internet, web pages or applications in the relevant category are expected to be blocked. Malicious websites or applications that try to reach the local network from the internet are blocked as well, because such access is not wanted. Blocking is not limited to web and application traffic; DNS, URL, and other restrictions are also possible. In this study, logs collected from the firewall on five different days and at different times were used to create five datasets. Collecting on different days was intended to cover different users and accessed addresses. The times were chosen to capture peak hours, when data traffic is heaviest. The datasets then underwent data preprocessing, which involved cleaning, organizing, and preparing them for machine learning models. During preprocessing, certain log entries were removed to obtain the base form of the datasets, namely system logs and entries in which some variables were empty. Because machine learning models typically work with numerical data, textual values have to be converted into numerical form. This conversion was done programmatically, representing each textual value with a number. Data preprocessing and numerical conversion strongly affect the performance of machine learning models and allow more accurate predictions. 
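The cleaning and numerical conversion described above can be illustrated with a minimal sketch. It is not the thesis code. The field names (`proto`, `action`, `sent_bytes`), the sample rows, and the helper names `drop_incomplete` and `encode_columns` are all hypothetical, and integer codes assigned in order of first appearance are only one of several possible encodings.

```python
def drop_incomplete(rows, required):
    """Keep only log rows in which every required field is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(field) not in (None, "") for field in required)
    ]


def encode_columns(rows, columns):
    """Replace each distinct text value in `columns` with an integer code.

    Codes are assigned per column in order of first appearance.
    Returns the encoded rows and the value-to-code mappings.
    """
    mappings = {col: {} for col in columns}
    encoded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for col in columns:
            codes = mappings[col]
            value = row[col]
            if value not in codes:
                codes[value] = len(codes)
            new_row[col] = codes[value]
        encoded.append(new_row)
    return encoded, mappings


# Hypothetical log rows; real firewall logs have many more fields.
logs = [
    {"proto": "tcp", "action": "allow", "sent_bytes": 120},
    {"proto": "udp", "action": "deny", "sent_bytes": 60},
    {"proto": "", "action": "allow", "sent_bytes": 10},  # empty field, dropped
    {"proto": "tcp", "action": "deny", "sent_bytes": 5},
]

clean = drop_incomplete(logs, ["proto", "action"])
encoded, mappings = encode_columns(clean, ["proto", "action"])
```

Keeping the mappings alongside the encoded rows makes it possible to apply the same codes to later log batches and to translate predictions back into the original text values.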
In this way the datasets were converted entirely into numerical form. After preprocessing and numerical conversion, each dataset produces an outcome based on 26 variables. The firewall logs contain a total of 219 variables for each traffic record, many of them empty. After these variables were examined, some were eliminated, leaving 26. The next step was to determine which of these variables had the greatest influence on the outcome, using Multiple Linear Regression and Principal Component Analysis (PCA). Multiple Linear Regression identified the variables with the greatest impact on the outcome. PCA was used to analyze the relationships among the 26 variables and reduce their dimensionality, and it also identified the variables with the greatest impact. Six variables ranked as high-impact by both methods were selected, and new six-variable datasets were created. An Artificial Neural Networks algorithm was then used to assess how effective the selected variables were. The Artificial Neural Networks classifier was applied both to the 26-variable datasets obtained from the firewall and to the new six-variable datasets derived with Multiple Linear Regression and PCA. This analysis tested whether the six variables selected with Multiple Linear Regression and PCA were the right ones to keep. The results for the 26-variable and six-variable datasets were compared on accuracy, precision, recall, f1-score, and specificity. 
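For reference, the five comparison metrics can be computed from the counts of a confusion matrix, as in the sketch below. It assumes a binary outcome with 1 as the positive class, which the abstract does not state; the function name `binary_metrics` and the sample labels are hypothetical and are not results from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, f1 and specificity for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class never occurs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical labels purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Specificity is the recall of the negative class, so reporting it next to recall shows how well a classifier handles both outcomes rather than only the positive one.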
The comparison showed that results from the six-variable datasets were mostly better than or close to those from the 26-variable datasets. It was therefore concluded that the variables removed from the dataset using Multiple Linear Regression and PCA were the right ones to remove. The accuracy, f1-score, recall, precision, and specificity values showed that the Artificial Neural Networks algorithm can predict the outcome of firewall traffic with a high success rate. It was consequently observed that, where no firewall system is in place, the continuity of the system could be maintained through machine learning algorithms. |
|