İklimlendirme sistemleri üzerinde makine öğrenmesi ile anomali tespiti = Anomaly detection with machine learning on air conditioning systems

Kibar, Refik

dc.contributor.advisor	Asst. Prof. Dr. Muhammed Fatih Adak ; Asst. Prof. Dr. Kevser Ovaz Akpınar
dc.date.accessioned	2024-01-26T12:22:57Z
dc.date.available	2024-01-26T12:22:57Z
dc.date.issued	2023
dc.identifier.citation	Kibar, Refik. (2023). İklimlendirme sistemleri üzerinde makine öğrenmesi ile anomali tespiti = Anomaly detection with machine learning on air conditioning systems. (Unpublished master's thesis). Sakarya Üniversitesi Fen Bilimleri Enstitüsü
dc.identifier.uri	https://hdl.handle.net/20.500.12619/101763
dc.description	Made available in full text pursuant to the "Law on Amendments to the Higher Education Law and Certain Laws and Decree-Laws", published in Official Gazette No. 30352 dated 06.03.2018, and the "Directive on the Electronic Collection, Organization, and Opening to Access of Graduate Theses" dated 18.06.2018.
dc.description.abstract	Anomalies can occur in data for various reasons, such as credit card fraud, cyberattacks, terrorist activities, or malicious activities such as the disruption of a system, but what all these causes have in common is that they are of interest for analysis. The interestingness of anomalies, or their relevance to real life, is an important aspect of anomaly detection. Many methods have been tried for anomaly detection. In this thesis, machine learning methods were used to detect cyberattacks in a multivariate time series from a heating, ventilation, and air conditioning (HVAC) system, and their performance was compared. Linear regression, decision trees, k-nearest neighbors, random forest, and gradient boosting models were used for anomaly detection. Because no publicly available time series from an HVAC system exists, the models were trained on a data set obtained from simulation. The trained models were then tested on a data set from the same system containing 16 different cyberattacks. In the compared results, the linear regression and gradient boosting models were observed to give good results but not to be very good at detecting anomalies. The overload that can arise in big data has a negative effect on the models. Feature selection for the data set is important. The selected features must fully represent the data set. Principal component analysis or various deep learning techniques have been used for feature selection. Correlation analysis is also one of these methods. The relationship between the output parameter and the input parameters can be examined, and in multivariate time series the relationship among the input parameters can be examined as well. In this study, the correlation of the input parameters with one another was examined. Variables showing a correlation stronger than +0.95 or -0.95 were reduced and the models were retrained.
The results of the algorithms tested on the data set containing cyberattacks were compared using the confusion matrix. As a result, the accuracy of the GBR model was seen to rise considerably, and the model made successful detections. Accuracy rose from 0.64 to 0.996, and the error in the model's predictions decreased. This approach suggests that, given the characteristics of big data, accuracy can be improved by reducing the number of parameters in time series. In future work, studies on data sets from different HVAC systems could lead to a more general conclusion, and the performance of machine learning methods could be compared after reducing the parameters with different methods. In addition, a web page developed with the Django web framework was designed as a model for real-time applications. This study can serve as an example for performance evaluation studies of web-based anomaly detection with Django. Analysis results and processing times on different data sets can be compared. This will help evaluate the effectiveness and scalability of anomaly detection systems.
dc.description.abstract	Anomalies can occur in data for a variety of reasons, such as credit card fraud, cyberattacks, terrorist activities, or malicious activities such as the disruption of a system, but what all these causes have in common is that they are of interest for analysis. The interestingness of anomalies, or their relevance to real life, is an important aspect of anomaly detection. An anomaly is a situation that is unexpected or out of the ordinary. For example, abnormal traffic in a network, abnormal production data in a factory, or suspicious transactions in a financial system can be considered anomalies. Anomaly detection is useful in many sectors, including network security, financial services, manufacturing, and healthcare. It offers the following important advantages. Early diagnosis: Anomaly detection can identify problems in advance by detecting deviations from normal behavior at an early stage. For example, detecting abnormal activity in a network can make a cyberattack noticeable before it starts. Efficiency and cost savings: Anomaly detection can make data-driven processes more efficient and productive. Manually detecting abnormal situations is often time-consuming and costly, whereas machine learning models can quickly identify abnormal data and focus attention on problem areas. Fraud detection: Anomaly detection plays a significant role in financial services and in the fight against fraud. For example, machine learning models can be used to detect credit card fraud or attempted fraud. Improved security: Anomaly detection plays an important role in network security and the fight against cyber threats. Detecting abnormal network traffic or malware activity makes it possible to find vulnerabilities quickly and strengthen protection measures.
Machine learning for anomaly detection involves a variety of techniques that learn normal behavior and then evaluate how abnormal a particular data point is. Learning-based methods allow the model to improve over time and produce more precise results. Many methods have been tried for detecting anomalies. Accurate and fast anomaly detection is very important for preventing the negative outcomes that would otherwise occur in the system and process. It matters even more when the anomalies affect critical infrastructures or when those infrastructures are targeted by cyberattacks. Critical infrastructures are large systems containing subsystems that may cause loss of life and property, great and irreparable economic damage, personal and national security vulnerabilities, and deterioration of public order when the confidentiality, integrity, or availability of the information they process is impaired. With developments in the technologies used, the security of building management systems (BMS) has also become more important. Since the heating, ventilation, and air conditioning (HVAC) system in buildings accounts for about 40% of total energy consumption, threats targeting the HVAC system can be quite serious and costly. Therefore, this thesis uses machine learning methods to detect cyberattacks in a multivariate time series from an HVAC system and compares their performance. Because access to a real HVAC system is limited and general labeled data sets for investigating the cybersecurity of HVAC systems are lacking, the study used a data set of a 12-zone HVAC system collected from a transient system simulation model. Training was carried out in Jupyter Notebook version 6.5.2 using the gradient boosting model. The model created using Visual Studio Code 17.5 was imported into the project.
The system used has an Intel Core i7-10750H CPU (2.60 GHz) and an NVIDIA GeForce GTX 1650 Ti graphics card (4 GB GDDR6, 128-bit, DX12). The model was implemented with Python version 3.9.15. SQLiteStudio 3.4.4 was used. Linear regression, decision trees, k-nearest neighbors, random forest, and gradient boosting models were used for anomaly detection. Detailed comparisons and results of the models are given in Chapter 4. The trained models were tested on a data set from the same system containing 16 different cyberattacks. These 16 attacks fall into four groups: changing the set points of the control system; falsifying sensor measurements by freezing their values or adding a bias; falsifying control signals by freezing their values or adding a bias; and changing command signals to components. In the compared results, the linear regression and gradient boosting models were observed to give good results but not to be very good at detecting the anomalies. Learning the normal behavior of the system is of great importance in this regard. The overload that may occur in big data has a negative effect on the models. Feature selection for the data set is important. The selected features must fully represent the data set. Principal component analysis or various deep learning techniques have been used for feature selection. Correlation analysis is also one of these methods. Correlation refers to the relationship between variables in the data set. If a data set contains a large number of input variables, high correlations can be found among them. In this case, reducing the input variables through feature selection provides significant advantages. Computational efficiency: Fewer input variables can reduce the computation time of analysis and modeling operations.
If there are high correlations between the variables in the data set, most of these variables carry the same information, and there is no need to process repeated information in the analysis. Model simplicity: Working with a small number of input variables makes the model easier to understand and reduces the risk of overfitting. Complex models often underperform because of unnecessary or mutually related variables. Feature selection makes the model simpler and more generalizable. Reduced noise effect: Removing redundant correlated variables can reduce the effect of noise in the input variables. Variables that are uncorrelated or only weakly correlated with the output can adversely affect the performance of the model and cause misleading results. Feature selection helps achieve more reliable results by reducing the impact of noise. Generalizability: A small number of input variables increases the generalizability of the model. When the model is trained on a limited number of variables, it can adapt better to different data samples or new data sets, which gives it a wider range of application. Especially in large data sets, removing unnecessary or highly correlated input variables can improve the performance of the model and make analyses more efficient. In light of this, the relationship between the output parameter and the input parameters can be examined, and in multivariate time series the relationship among the input parameters can be examined as well. In this study, the correlation of the input parameters with one another was examined. Variables showing a correlation stronger than +0.95 or -0.95 were reduced and the models were retrained. After removing input variables that had no effect on the output value or had similar effects, 30 input parameters were used. The results of the algorithms tested on the data set containing cyberattacks were compared using the confusion matrix.
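The correlation-based reduction described above can be sketched as follows. This is a minimal illustration and not the thesis code: the column names and values are hypothetical, the helper names `pearson` and `drop_correlated` are invented for this example, and the rule for which member of a correlated pair is dropped (the later one) is an assumption. Only the 0.95 cutoff comes from the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is <= threshold.

    Assumption: when two columns are highly correlated, the one that
    appears later in the mapping is the one that is dropped.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical input variables (not from the thesis data set).
columns = {
    "supply_temp": [18.0, 18.5, 19.1, 19.4, 20.2],
    "supply_temp_copy": [18.1, 18.6, 19.2, 19.5, 20.3],       # r close to +1
    "supply_temp_inverted": [-18.0, -18.5, -19.1, -19.4, -20.2],  # r close to -1
    "fan_speed": [0.3, 0.9, 0.2, 0.8, 0.4],                   # weakly correlated
}

selected = drop_correlated(columns)
print(selected)
```

Using the absolute value of the coefficient covers both the +0.95 and the -0.95 cases with a single comparison.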
The confusion matrix is a table used to evaluate the performance of a model in classification problems. It is a commonly used tool in machine learning and data mining. By comparing the actual class labels with the class labels predicted by the model, the confusion matrix makes it possible to calculate performance measures such as accuracy, precision, recall, and F1 score. As a result, the accuracy of the GBR model was observed to increase significantly, and the model made successful detections. A total of 591 data points in a time series were labeled, 351 as normal and 240 as anomalies. All 351 normal data points were predicted correctly, and 238 of the 240 anomalies were detected. With these results, accuracy rose from 0.64 to 0.996, and the error in the model's predictions decreased. This approach suggests that, given the characteristics of big data, accuracy can be improved by reducing the number of parameters in time series. Machine learning and anomaly detection provide significant benefits in many areas, such as security, efficiency, and fraud detection. They are effective tools for detecting abnormal situations early, responding quickly to problems, and providing a safer environment. In future studies, a more general conclusion could be reached by studying data sets from different HVAC systems, and the performance of machine learning methods could be compared after reducing the parameters with different methods. In addition, a web page developed with the Django web framework was designed as a model for real-time applications. Django provides a powerful infrastructure for web applications. Anomaly detection is a task that usually requires analyzing large amounts of data and processing them in real time.
Django's rapid development features and flexible structure were used to manage the data flow and processing logic required for anomaly detection. Django also offers a scalable infrastructure for managing high-traffic web applications, which can help the system handle large data flows effectively. The data in the time series are displayed according to the entered index value and passed to the developed GBR* model. Depending on the threshold value, the data points of the time series are labeled as normal or anomaly and displayed. With this design, the model can be run on data read from sensors and nodes. Detecting an anomaly as soon as possible is very important for preventing the negative situations that may follow. The developed models can serve as a basis for wider studies, with the threshold value adjusted to the tolerance of the critical infrastructure in which they are applied. This study can serve as an example for performance evaluation studies of web-based anomaly detection with Django. Analysis results and processing times on different data sets can be compared. This will help evaluate the effectiveness and scalability of anomaly detection systems.
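The threshold-based labeling and the confusion matrix evaluation described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the abstract does not say how the threshold is applied, so the rule used here (absolute difference between the measured and predicted value) is an assumption, as are the function names and the sample readings. The counts in the last step (351 normal points all predicted correctly, 238 of 240 anomalies detected) are the ones reported in the abstract.

```python
def label_by_threshold(actual, predicted, threshold):
    """Label a point 'anomaly' when |actual - predicted| exceeds the threshold.

    Assumption: the residual between the measurement and the regression
    model's prediction is what gets compared with the threshold.
    """
    return [
        "anomaly" if abs(a - p) > threshold else "normal"
        for a, p in zip(actual, predicted)
    ]


def confusion_counts(true_labels, predicted_labels, positive="anomaly"):
    """Return (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all points whose predicted label matches the true label."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical readings and model predictions (not from the thesis data set).
actual = [21.0, 21.2, 25.0, 21.1]
predicted = [21.1, 21.0, 21.3, 21.2]
labels = label_by_threshold(actual, predicted, threshold=1.0)
counts = confusion_counts(["normal", "normal", "anomaly", "normal"], labels)

# Counts reported in the abstract: 351 true negatives, 0 false positives,
# 238 true positives, 2 false negatives (591 points in total).
reported_accuracy = accuracy(tp=238, fp=0, tn=351, fn=2)
print(labels, counts, reported_accuracy)
```

With the reported counts, accuracy is 589/591, which is consistent with the value of about 0.996 given in the abstract. Raising or lowering the threshold trades missed anomalies against false alarms, which is why the abstract ties it to the tolerance of the infrastructure being monitored.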
dc.format.extent	xxii, 74 leaves : figures, tables ; 30 cm.
dc.language	Turkish
dc.language.iso	tur
dc.publisher	Sakarya Üniversitesi
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.rights.uri	info:eu-repo/semantics/openAccess
dc.subject	Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol,
dc.subject	Computer Engineering and Computer Science and Control
dc.title	İklimlendirme sistemleri üzerinde makine öğrenmesi ile anomali tespiti = Anomaly detection with machine learning on air conditioning systems
dc.type	masterThesis
dc.contributor.department	Sakarya Üniversitesi, Fen Bilimleri Enstitüsü (Institute of Science and Technology), Department of Computer and Information Engineering
dc.contributor.author	Kibar, Refik
dc.relation.publicationcategory	TEZ