dc.description.abstract |
With developing technology, the volume of data and data traffic grows day by day, and data analysis has therefore become important in many different sectors and functions. When done correctly, data analysis provides significant benefits in almost every field: it enables businesses to make better decisions and thereby yields competitive advantage, customer satisfaction, operational efficiency, and better management of risks. However, data that is not collected and interpreted correctly only causes confusion, so data collection is as important as the analysis itself. The required data cannot always be obtained directly; in such cases, web scraping methods can be used to gather the needed data from relevant sources. Websites contain large amounts of data, and instead of collecting it manually, web scraping automates the collection of large volumes of data and makes it more useful to the user. Identifying needs is an important first step in web scraping: it must be determined which data will be collected from which website. Different tools and libraries are available for web scraping; Python libraries such as BeautifulSoup and Scrapy are popular and frequently used, and research can identify the most suitable method for a given task. After the data are collected, some preprocessing steps may be required before the analysis begins: the dataset is examined, missing or incorrect data are corrected or removed, features on different scales are rescaled, and redundant or repetitive records are removed to obtain a clean dataset. Appropriate methods must then be chosen for analyzing the resulting dataset; these vary with the characteristics of the dataset and the intended output, and may include machine learning, statistical analysis, data mining, or other analytical methods.

This study aims to analyze used car sales data collected with web scraping techniques using machine learning algorithms and to build a price prediction model. The data were collected from a used car sales website with code written in the Python programming language and saved in a database. Selenium and BeautifulSoup were used for web scraping: the URLs of the requested listings were first retrieved from the target website via Selenium, the source code of each of these pages was then downloaded, and the desired parts were parsed with BeautifulSoup. The collected data were manipulated with NumPy and Pandas, converted into an appropriate format, and tabulated, and the resulting table was written to the database; PyMySQL and SQLAlchemy were used for the database connection. The collected fields are price, brand, model, series, year, kilometers, vehicle type, seller (from whom), fuel, gear, engine power, and engine capacity. In addition, the following fields were collected from the template showing the damage status: right rear fender, rear hood, left rear fender, right rear door, right front door, roof, left rear door, left front door, right front fender, engine hood, left front fender, front bumper, and rear bumper. When filling these fields, a value of 0 is assigned if the relevant part of the vehicle is original, 1 if it is painted, 2 if it is replaced, and 3 if it is not specified.
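As a rough, illustrative sketch of this collection step (not the actual code of the study), the snippet below shows how listing URLs could be gathered with Selenium, each page parsed with BeautifulSoup, the records tabulated with Pandas, and the table written to MySQL through SQLAlchemy with the PyMySQL driver. The URL, CSS selectors, column names, and connection credentials are placeholders assumed for illustration.

    # Minimal sketch of the scraping pipeline described above.
    # URL, CSS selectors, column names, and credentials are placeholders.
    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from sqlalchemy import create_engine

    driver = webdriver.Chrome()
    driver.get("https://example-used-car-site.com/listings")  # placeholder URL

    # Step 1: collect the URLs of the requested listings from the result page.
    urls = [a.get_attribute("href")
            for a in driver.find_elements(By.CSS_SELECTOR, "a.listing-link")]

    rows = []
    for url in urls:
        # Step 2: load each listing and parse the desired parts from its source.
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        rows.append({
            "price": soup.select_one(".price").get_text(strip=True),
            "brand": soup.select_one(".brand").get_text(strip=True),
            "year": soup.select_one(".year").get_text(strip=True),
            # ... remaining fields (model, series, km, fuel, damage template, etc.)
        })
    driver.quit()

    # Step 3: tabulate the records with pandas and write them to MySQL.
    df = pd.DataFrame(rows)
    engine = create_engine("mysql+pymysql://user:password@localhost/cars")
    df.to_sql("listings", engine, if_exists="append", index=False)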
Within the scope of the study, two datasets of different sizes were obtained so that comparisons could be made during the analysis. Both datasets consist of 25 features, collected between 10.02.2023 and 31.03.2023; Dataset 1 contains 5557 rows and Dataset 2 contains 11688 rows. The collected data were examined and, first, null values in the columns reflecting the damage status were filled with the default value (unspecified = 3). Categorical data were then converted into numerical data with an encoder to make them ready for analysis. The price variable was set as the target variable, and Lasso regression was used to determine which of the remaining 24 features would be used in the analysis. Lasso regression is a regression technique used for feature selection and model parameter estimation; unlike traditional regression analysis, it allows the coefficients of unimportant variables to be reduced to zero, which helps reduce overfitting and makes the model simpler and more interpretable. After applying Lasso, 11 features were selected for Dataset 1 and 9 features for Dataset 2. Afterward, Principal Component Analysis (PCA) was applied for dimensionality reduction and the variables to be included in the analysis were determined. PCA is a statistical technique for dimension reduction in multivariate datasets; it analyzes the relationships between the variables and creates new variables that best explain these relationships. As a result of the PCA analysis, the number of features was reduced to 7 for both datasets. Training and test data were separated for use in the machine learning algorithms with K-Fold cross-validation. In K-Fold cross-validation, the dataset is divided into k subsets, and each subset is used as both training and test data, so that errors arising from how the data are split are minimized; in this study, the analyses were carried out with k = 10. Six different machine learning algorithms were used, and both the smaller and the larger dataset were run with each of them, with GridSearchCV used to determine the best parameters. Random Forest Regression, K-Nearest Neighbor Regression, Gradient Boosting Regression, AdaBoost Regression, Support Vector Regression, and XGBoost Regression were used in the analysis, and the results were evaluated with Mean Squared Error (MSE), Root Mean Squared Error (RMSE), the coefficient of determination (R²), and learning curve graphs. For Dataset 1, the model that gave the best results was XGBoost Regression, with R² = 0.973, MSE = 0.026, and RMSE = 0.161, followed by the Gradient Boosting Regression model. For Dataset 2, the model that gave the best results was K-Nearest Neighbor Regression, with R² = 0.978, MSE = 0.021, and RMSE = 0.145, followed by the XGBoost Regression model. For both datasets, the algorithm that gave the worst results was AdaBoost Regression.
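The analysis pipeline summarized above can likewise be sketched, under assumptions, with scikit-learn and XGBoost: unspecified damage fields are filled with 3, categorical columns are encoded, Lasso selects features, PCA reduces them to 7 components, and the six regressors are compared with 10-fold cross-validation, GridSearchCV tuning, and MSE/RMSE/R² scores. The column names, the Lasso alpha, and the parameter grids shown are illustrative assumptions, not the values used in the study.

    # Illustrative sketch of the analysis pipeline; column names, the Lasso
    # alpha, and the parameter grids below are assumptions, not study values.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                                  RandomForestRegressor)
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error, r2_score
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_predict
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.svm import SVR
    from sqlalchemy import create_engine
    from xgboost import XGBRegressor

    # Load the scraped table (placeholder connection string).
    engine = create_engine("mysql+pymysql://user:password@localhost/cars")
    df = pd.read_sql("SELECT * FROM listings", engine)

    # Null damage-status fields get the default value 3 (= not specified).
    damage_cols = ["right_rear_fender", "rear_hood", "left_rear_fender"]  # etc.
    df[damage_cols] = df[damage_cols].fillna(3)

    # Price is the target; the remaining categorical columns are encoded.
    y = df["price"]
    X = df.drop(columns=["price"])
    for col in X.select_dtypes(include="object").columns:
        X[col] = LabelEncoder().fit_transform(X[col].astype(str))
    X_scaled = StandardScaler().fit_transform(X)

    # Lasso shrinks the coefficients of unimportant features to exactly zero.
    lasso = Lasso(alpha=0.01).fit(X_scaled, y)   # alpha chosen for illustration
    mask = lasso.coef_ != 0
    print("Features kept by Lasso:", list(X.columns[mask]))

    # PCA reduces the selected features to 7 principal components.
    X_pca = PCA(n_components=7).fit_transform(X_scaled[:, mask])

    # Six regressors compared with 10-fold CV; GridSearchCV tunes each one.
    models = {
        "Random Forest": (RandomForestRegressor(), {"n_estimators": [100, 300]}),
        "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}),
        "Gradient Boosting": (GradientBoostingRegressor(), {"learning_rate": [0.05, 0.1]}),
        "AdaBoost": (AdaBoostRegressor(), {"n_estimators": [50, 100]}),
        "SVR": (SVR(), {"C": [1, 10]}),
        "XGBoost": (XGBRegressor(), {"max_depth": [3, 6]}),
    }
    cv = KFold(n_splits=10, shuffle=True, random_state=42)
    for name, (estimator, grid) in models.items():
        tuned = GridSearchCV(estimator, grid, cv=cv, scoring="r2")
        y_pred = cross_val_predict(tuned, X_pca, y, cv=cv)
        mse = mean_squared_error(y, y_pred)
        print(f"{name}: R2={r2_score(y, y_pred):.3f}  "
              f"MSE={mse:.3f}  RMSE={np.sqrt(mse):.3f}")
|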
|