dc.description.abstract |
With developing technology, the volume of data and data traffic grows day by day, and data analysis has therefore become important in many different sectors and functions. When done correctly, data analysis provides significant benefits in almost every field: it enables businesses to make better decisions and thereby yields competitive advantage, customer satisfaction, operational efficiency, and better management of risks. However, data that is not collected and interpreted correctly only causes confusion, so data collection is as important as the analysis itself. The required data cannot always be obtained directly; in such cases, web scraping methods can be used to gather the needed data from relevant sources. Websites contain large amounts of data, and instead of collecting it manually, web scraping automates the collection of large volumes of data and makes it more useful to the user. Identifying needs is an important first step in web scraping: it must be determined which data will be collected from which website. Different tools and libraries are available for web scraping; Python libraries such as BeautifulSoup and Scrapy are popular and frequently used, and research can identify the most suitable method for a given task. After the data are collected, some preprocessing steps may be required before the analysis begins: the dataset is examined, missing or incorrect data are corrected or removed, features on different scales are rescaled, and redundant or repetitive records are removed to obtain a clean dataset. Appropriate methods must then be chosen for analyzing the resulting dataset; these vary with the characteristics of the dataset and the intended output, and may include machine learning, statistical analysis, data mining, or other analytical methods.

This study aims to analyze used car sales data collected with web scraping techniques using machine learning algorithms and to build a price prediction model. The data were collected from a used car sales website with code written in the Python programming language and saved in a database. Selenium and BeautifulSoup were used for web scraping: the URLs of the requested listings were first retrieved from the target website via Selenium, the source code of each of these pages was then downloaded, and the desired parts were parsed with BeautifulSoup. The collected data were manipulated with NumPy and Pandas, converted into an appropriate format, and tabulated, and the resulting table was written to the database; PyMySQL and SQLAlchemy were used for the database connection. The collected fields are price, brand, model, series, year, kilometers, vehicle type, seller (from whom), fuel, gear, engine power, and engine capacity. In addition, the following fields were collected from the template showing the damage status: right rear fender, rear hood, left rear fender, right rear door, right front door, roof, left rear door, left front door, right front fender, engine hood, left front fender, front bumper, and rear bumper. When filling these fields, a value of 0 is assigned if the relevant part of the vehicle is original, 1 if it is painted, 2 if it is replaced, and 3 if it is not specified.
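As a rough, illustrative sketch of this collection step (not the actual code of the study), the snippet below shows how listing URLs could be gathered with Selenium, each page parsed with BeautifulSoup, the records tabulated with Pandas, and the table written to MySQL through SQLAlchemy with the PyMySQL driver. The URL, CSS selectors, column names, and connection credentials are placeholders assumed for illustration.

    # Minimal sketch of the scraping pipeline described above.
    # URL, CSS selectors, column names, and credentials are placeholders.
    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from sqlalchemy import create_engine

    driver = webdriver.Chrome()
    driver.get("https://example-used-car-site.com/listings")  # placeholder URL

    # Step 1: collect the URLs of the requested listings from the result page.
    urls = [a.get_attribute("href")
            for a in driver.find_elements(By.CSS_SELECTOR, "a.listing-link")]

    rows = []
    for url in urls:
        # Step 2: load each listing and parse the desired parts from its source.
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        rows.append({
            "price": soup.select_one(".price").get_text(strip=True),
            "brand": soup.select_one(".brand").get_text(strip=True),
            "year": soup.select_one(".year").get_text(strip=True),
            # ... remaining fields (model, series, km, fuel, damage template, etc.)
        })
    driver.quit()

    # Step 3: tabulate the records with pandas and write them to MySQL.
    df = pd.DataFrame(rows)
    engine = create_engine("mysql+pymysql://user:password@localhost/cars")
    df.to_sql("listings", engine, if_exists="append", index=False)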
Within the scope of the study, two datasets of different sizes were obtained so that comparisons could be made during the analysis. Both datasets consist of 25 features, collected between 10.02.2023 and 31.03.2023; Dataset 1 contains 5557 rows and Dataset 2 contains 11688 rows. The collected data were examined and, first, null values in the columns reflecting the damage status were filled with the default value (unspecified = 3). Categorical data were then converted into numerical data with an encoder to make them ready for analysis. The price variable was set as the target variable, and Lasso regression was used to determine which of the remaining 24 features would be used in the analysis. Lasso regression is a regression technique used for feature selection and model parameter estimation; unlike traditional regression analysis, it allows the coefficients of unimportant variables to be reduced to zero, which helps reduce overfitting and makes the model simpler and more interpretable. After applying Lasso, 11 features were selected for Dataset 1 and 9 features for Dataset 2. Afterward, Principal Component Analysis (PCA) was applied for dimensionality reduction and the variables to be included in the analysis were determined. PCA is a statistical technique for dimension reduction in multivariate datasets; it analyzes the relationships between the variables and creates new variables that best explain these relationships. As a result of the PCA analysis, the number of features was reduced to 7 for both datasets. Training and test data were separated for use in the machine learning algorithms with K-Fold cross-validation. In K-Fold cross-validation, the dataset is divided into k subsets, and each subset is used as both training and test data, so that errors arising from how the data are split are minimized; in this study, the analyses were carried out with k = 10. Six different machine learning algorithms were used, and both the smaller and the larger dataset were run with each of them, with GridSearchCV used to determine the best parameters. Random Forest Regression, K-Nearest Neighbor Regression, Gradient Boosting Regression, AdaBoost Regression, Support Vector Regression, and XGBoost Regression were used in the analysis, and the results were evaluated with Mean Squared Error (MSE), Root Mean Squared Error (RMSE), the coefficient of determination (R²), and learning curve graphs. For Dataset 1, the model that gave the best results was XGBoost Regression, with R² = 0.973, MSE = 0.026, and RMSE = 0.161, followed by the Gradient Boosting Regression model. For Dataset 2, the model that gave the best results was K-Nearest Neighbor Regression, with R² = 0.978, MSE = 0.021, and RMSE = 0.145, followed by the XGBoost Regression model. For both datasets, the algorithm that gave the worst results was AdaBoost Regression.
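The analysis pipeline summarized above can likewise be sketched, under assumptions, with scikit-learn and XGBoost: unspecified damage fields are filled with 3, categorical columns are encoded, Lasso selects features, PCA reduces them to 7 components, and the six regressors are compared with 10-fold cross-validation, GridSearchCV tuning, and MSE/RMSE/R² scores. The column names, the Lasso alpha, and the parameter grids shown are illustrative assumptions, not the values used in the study.

    # Illustrative sketch of the analysis pipeline; column names, the Lasso
    # alpha, and the parameter grids below are assumptions, not study values.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                                  RandomForestRegressor)
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.svm import SVR
    from sqlalchemy import create_engine
    from xgboost import XGBRegressor

    # Load the scraped table (placeholder connection string).
    engine = create_engine("mysql+pymysql://user:password@localhost/cars")
    df = pd.read_sql("SELECT * FROM listings", engine)

    # Null damage-status fields get the default value 3 (= not specified).
    damage_cols = ["right_rear_fender", "rear_hood", "left_rear_fender"]  # etc.
    df[damage_cols] = df[damage_cols].fillna(3)

    # Price is the target; the remaining categorical columns are encoded.
    y = df["price"]
    X = df.drop(columns=["price"])
    for col in X.select_dtypes(include="object").columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))
    X_scaled = StandardScaler().fit_transform(X)

    # Lasso shrinks the coefficients of unimportant features to exactly zero.
    lasso = Lasso(alpha=0.01).fit(X_scaled, y)   # alpha chosen for illustration
    mask = lasso.coef_ != 0
    print("Features kept by Lasso:", list(X.columns[mask]))

    # PCA reduces the selected features to 7 principal components.
    X_pca = PCA(n_components=7).fit_transform(X_scaled[:, mask])

    # Six regressors compared with 10-fold CV; GridSearchCV tunes each one.
    models = {
        "Random Forest": (RandomForestRegressor(), {"n_estimators": [100, 300]}),
        "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}),
        "Gradient Boosting": (GradientBoostingRegressor(), {"learning_rate": [0.05, 0.1]}),
        "AdaBoost": (AdaBoostRegressor(), {"n_estimators": [50, 100]}),
        "SVR": (SVR(), {"C": [1, 10]}),
        "XGBoost": (XGBRegressor(), {"max_depth": [3, 6]}),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    for name, (estimator, grid) in models.items():
        tuned = GridSearchCV(estimator, grid, cv=cv, scoring="r2")
        y_pred = cross_val_predict(tuned, X_pca, y, cv=cv)
        mse = mean_squared_error(y, y_pred)
        print(f"{name}: R2={r2_score(y, y_pred):.3f}  "
              f"MSE={mse:.3f}  RMSE={np.sqrt(mse):.3f}")
|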
|