Machine Learning Models for Predicting Used Car Prices Explained: A Beginner’s Guide
How do you estimate the price of a used car? Learn more about machine learning models such as Random Forests, Support Vector Machines and XGBoost.
You want to sell your used car or buy one, but you need to know how much the car is worth — or what price is justified. To do this, you need a price estimator that gives you the most accurate price possible. Price estimation is not only relevant for car trading but also plays an important role for other goods such as real estate, shares and works of art to ensure fair and transparent transactions for buyers and sellers.
Traditionally, price estimates are based on statistical and heuristic approaches such as the comparative value method, the cost approach or expert valuation. However, these methods are often subjective, and their accuracy depends on the knowledge available; they rely heavily on the experience of the individual estimator.
Machine learning and deep learning offer modern alternatives for efficiently processing large volumes of data and recognizing complex patterns in the data. In addition, the models can be continuously updated with current data.
To build a price estimator for used cars with machine learning and deep learning models, I researched various current studies and compiled how the most suitable models for this task work. This introduction to the basics of price estimation with machine learning and deep learning should help you as much as writing it has expanded my own knowledge in this area. As a beginner in this field, I initially found it difficult to understand how to even approach such a task.
Basics of Price Estimation
Imagine you want to sell your 2016 BMW 3 Series with 78,000 miles on the clock. To determine a fair price, you probably start by scouring the market. You look at the price of similar vehicles that have recently been sold or are currently for sale. You might also calculate the price of the car based on the original purchase cost minus depreciation. Or you can find an expert who can estimate the value of the vehicle based on their expertise and knowledge of the market.
In an online price estimate I ran for this car (selling in New York), its selling price was estimated at $13,628. But how accurate are such estimates, and what methods are behind them?
Traditional Methods of Estimating the Price of Used Cars
Traditional methods of price estimation use different approaches based on historical data and expert knowledge.
Comparative value method: This method compares the used car with similar vehicles, taking into account factors such as brand, model, year of manufacture, mileage, condition and equipment. This method is widely used.
Cost approach: This method calculates the current value of the car based on the original purchase cost less depreciation. This takes into account the depreciation of the vehicle over the years.
Expert valuation or valuation tools: Experts estimate the price of the used car based on their experience and knowledge of the market. There are also price lists (e.g. Schwacke in Germany or Kelley Blue Book in the USA), which provide regularly updated values for various vehicle types and models.
Limitations of Traditional Methods and Advantages of Machine Learning and Deep Learning
However, these traditional approaches have their limitations. They are often subjective and heavily dependent on the individual experience and opinion of the estimator. They are often based on a limited amount of comparative data, which can lead to inaccurate estimates. Especially with complex products, it is difficult to take all relevant factors and their interactions fully into account. In addition, these methods do not react quickly enough to rapid market changes, which limits their effectiveness in dynamic markets.
Machine Learning & Deep Learning models can address precisely these challenges. Such models can efficiently process large amounts of data and recognize (complex) patterns in the data that are difficult for humans to grasp. Machine learning and deep learning models offer more objective and consistent results as they can integrate extensive data from different sources and are based on algorithms. Such models thus reduce subjectivity and increase the accuracy of the price estimate. In addition, these models can be continuously updated with new data, allowing them to react more quickly to market changes.
Step-by-step guide to Price Estimation with Machine Learning and Deep Learning
Data collection and data cleansing
1.1 — Data set: Collect historical data from various sources such as online databases, market analyses or CSV files.
1.2 — Data types and variables: Determine the relevant characteristics of the data set such as brand, model, year of manufacture, mileage, engine power, fuel type, etc.
1.3 — Data cleansing: Decide whether to remove missing values, replace them with average values or use other methods. Identify outliers and decide whether and how to deal with them. Consider whether you need to normalize or scale the data to make it more usable for certain algorithms. For example, this is not necessary for decision trees, whereas normalization and scaling are required for support vector regression.
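As an illustration, here is a minimal cleaning sketch with pandas and Scikit-Learn. The file name and the column names (price, mileage, fuel_type, engine_power) are placeholders for whatever your data set actually contains:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data set; adjust the file and column names to your own data.
df = pd.read_csv("used_cars.csv")

# Replace missing numeric values with the column mean
# and missing categorical values with the most frequent value.
df["mileage"] = df["mileage"].fillna(df["mileage"].mean())
df["fuel_type"] = df["fuel_type"].fillna(df["fuel_type"].mode()[0])

# Remove crude outliers, e.g. implausible prices.
df = df[(df["price"] > 500) & (df["price"] < 200_000)]

# Scaling is needed for distance-based models such as SVR,
# but not for tree-based models such as decision trees.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[["mileage", "engine_power"]])
```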
Exploratory data analysis (EDA)
2.1 — Data visualization: Create box plots, histograms and other charts to identify the distribution of the data and the key features that influence the price (e.g. calculate correlations and visualize them with a heat map, or use a pair plot).
2.2 — Statistical analysis: Identify the most important statistical properties of the data and check the relationships between the variables (e.g. with correlation matrices).
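A short EDA sketch, again assuming the hypothetical df from above and using seaborn for the heat map:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of the target variable.
df["price"].plot.hist(bins=50, title="Price distribution")
plt.show()

# Correlation matrix of the numeric features, visualized as a heat map.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```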
Feature engineering
3.1 — Selection of relevant features: Select the features that improve the model accuracy and remove irrelevant and redundant variables. Pay attention to the feature importance, for example the impurity-based importance of a Random Forest or the reduction of the loss function for XGBoost.
3.2 — Creation of new features: Create new, meaningful features from the existing data that can increase the prediction accuracy (e.g. by combining existing features or calculating new metrics).
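For example, a car’s age is often more informative than the raw year of manufacture. A small sketch, again assuming the hypothetical df (with a year column) from above:

```python
from datetime import date

# Derive the car's age from the year of manufacture.
df["age"] = date.today().year - df["year"]

# Average mileage per year: a new metric combining two existing features.
df["miles_per_year"] = df["mileage"] / df["age"].clip(lower=1)

# One-hot encode categorical features such as the fuel type.
df = pd.get_dummies(df, columns=["fuel_type"], drop_first=True)
```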
Model selection and implementation
4.1 — Selection: Select suitable machine learning models such as linear regression, random forest or gradient boosting machines, or use deep learning models such as multilayer perceptrons or long short-term memory networks to model complex patterns and non-linear relationships in the data.
4.2 — Implementation: Implement the models using libraries such as Scikit-Learn, PyTorch or TensorFlow.
In one of my previous articles, you can find an introduction to PyTorch: ‘How I started with PyTorch — Essential Tips and Steps for Beginners’.
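A minimal Scikit-Learn sketch of steps 4.1 and 4.2, continuing with the hypothetical df from above and using a random forest as an example; every Scikit-Learn regressor follows the same fit/predict interface:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Separate the features from the target variable.
X = df.drop(columns=["price"])
y = df["price"]

# Hold out 20% of the data for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
```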
Training & evaluation of the models
5.1 — Training: Train the models with the collected data from step 1 and optimize the hyperparameters to obtain the most accurate price estimates.
5.2 — Evaluation: Evaluate the model performance using metrics such as the mean squared error (MSE) and the R² score to check the accuracy and reliability of the predictions. Also use cross-validation to ensure model stability.
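Continuing the sketch above, the evaluation with MSE, R² and cross-validation could look like this:

```python
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))

# 5-fold cross-validation to check the stability of the model.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Cross-validated R²:", scores.mean(), "+/-", scores.std())
```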
Possible Models for Price Estimation
The task for the various models is to estimate the price of used cars (second-hand cars) as accurately as possible based on the available data. Possible characteristics are brand, model of the car, year of manufacture, mileage, engine power, fuel type, etc. This task is a regression problem, as the value to be estimated (price of the car) is continuous.
Which machine learning and deep learning models are suitable for this task?
Machine Learning Models for the Prediction of Prices
Linear Regression
Linear regression is often used as a basic model. Its great advantage is that it is easy to understand. It tries to find a line that describes the relationship between the input data (e.g. kilometers already driven in a used car) and the target value (e.g. car price). Imagine you have many points in a diagram and try to draw a line that comes as close as possible to these points. This line then helps you to predict the price of the car.
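As a sketch, a one-feature linear regression in Scikit-Learn, reusing the train/test split from the step-by-step guide above (mileage is an assumed column name):

```python
from sklearn.linear_model import LinearRegression

# Fit a straight line: price = intercept + slope * mileage.
lin_reg = LinearRegression()
lin_reg.fit(X_train[["mileage"]], y_train)

print("Slope:", lin_reg.coef_[0])        # price change per additional mile
print("Intercept:", lin_reg.intercept_)  # estimated price at zero mileage
```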
Decision Trees
The decision tree divides the data recursively into smaller subsets. Recursive means that this process of splitting is repeated again and again. Each decision node in the tree represents a condition on a feature, while the leaves represent the predicted values.
Example of a decision node: Is the year of manufacture of the car before 2015?
Example of a leaf: Estimated price of the car
It is important that optimal features are selected and pre-processed accordingly. For classification trees, the reduction in Gini impurity or the increase in information gain is often used to choose the splits; for a regression task such as price estimation, splits are typically chosen to minimize the variance (mean squared error) within the resulting groups. These criteria measure how well the data is divided into homogeneous groups.
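For reference, the common split criteria can be written as follows, where p_k is the share of class k in a node and ȳ is the mean price in a node:

$$\text{Gini} = 1 - \sum_k p_k^2, \qquad \text{Entropy} = -\sum_k p_k \log_2 p_k, \qquad \text{Variance} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$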
The decision tree requires sufficient data to enable meaningful splits. To prevent the tree from fitting the training data too closely (overfitting), you can use pruning. This involves removing unimportant parts of the tree to improve its ability to generalize.
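A minimal regression tree in Scikit-Learn, again reusing the train/test split from above; limiting max_depth acts as a simple form of pre-pruning:

```python
from sklearn.tree import DecisionTreeRegressor

# Limiting the depth pre-prunes the tree and reduces overfitting.
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("Test R²:", tree.score(X_test, y_test))
```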
Advantages of Decision Trees
Simple visualization and interpretation.
Little pre-processing of the data is required (no normalization or scaling necessary).
Disadvantages of Decision Trees
Decision trees are prone to overfitting.
Instability: small changes in the data can lead to completely different trees.
Example of the instability of decision trees: imagine you have a data set with 4 cars.
A decision tree now selects, for example, the year of manufacture as the first feature and splits the data accordingly.
We then add another car to the data set.
This small change can lead to the decision tree choosing a completely different structure and, for example, selecting the mileage as the first feature. This changes the entire structure of the tree.
Random Forest
This model is basically a combination of many decision trees that work together to make more accurate predictions. The individual trees are trained on different random subsets of the data. When it comes to making a prediction (for example, the price of a car), all the trees make a prediction, and the random forest combines these predictions to get a final result. In a regression, such as predicting a price, the average of the predictions of all trees is taken. In a classification task, on the other hand, you would take the majority of the predictions.
You need sufficient computing power and memory to apply this model. Training can take a lot of time and memory, especially if the data set is large or many trees are used. It is important to have enough decision trees in your random forest to achieve good results, which means you need hyperparameter tuning to optimize the number of trees and other parameters of the model for the best performance. Feature importance allows you to recognize which features matter most for the predictions: the Random Forest model calculates the importance of each feature by measuring how much using it improves the accuracy of the predictions.
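A sketch of a random forest with feature importances, reusing the train/test split from above; the number of trees is an illustrative value:

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=500, random_state=42)
forest.fit(X_train, y_train)

# The predicted price is the average over all individual trees.
print("Predicted price:", forest.predict(X_test.iloc[:1])[0])

# Which features drive the predictions the most?
for name, importance in zip(X.columns, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```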
Advantages of Random Forest Models
Achieve higher accuracy by aggregating multiple trees.
Reduce overfitting as each decision tree is trained on a different random part of the dataset and thus learns different aspects of the data.
Are robust to noise and outliers as the prediction is based on the average of the predictions of many decision trees.
Disadvantages of Random Forest Models
Higher complexity and longer training time.
More difficult to interpret than a single decision tree.
Gradient Boosting Machines (GBM)
The model works in several steps to make accurate predictions. GBM builds new decision trees step by step, and each new tree attempts to correct the errors of the previous trees, making the overall result more accurate.
It is important that the hyperparameters are chosen well. There are parameters such as the learning rate, the number of trees and the depth of the trees. By keeping the learning rate low and monitoring the number of iterations (= number of trees), you can avoid overfitting. Before you train the model, the data must be cleaned: missing values need to be filled in, outliers dealt with and irrelevant features removed.
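A sketch with Scikit-Learn’s GradientBoostingRegressor, reusing the split from above; the hyperparameter values are illustrative starting points, not recommendations:

```python
from sklearn.ensemble import GradientBoostingRegressor

# A low learning rate combined with enough trees helps against overfitting.
gbm = GradientBoostingRegressor(
    learning_rate=0.05,  # how strongly each new tree corrects the previous ones
    n_estimators=500,    # number of boosting iterations (= number of trees)
    max_depth=3,         # depth of each individual tree
    random_state=42,
)
gbm.fit(X_train, y_train)
print("Test R²:", gbm.score(X_test, y_test))
```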
Advantages of GBM
High accuracy and performance.
Can model complex non-linear relationships. This sets it apart from simpler models such as linear regression, which can only model linear relationships.
The model is flexible and customizable. You have many hyperparameters (learning rate, number of trees, depth of trees) to adjust the model to your data.
Disadvantages of GBM
Long training times and high computational requirements — especially with large data sets.
More complex and more difficult to interpret than a decision tree, for example. The many trees and parameters make it difficult to understand how the model arrives at its predictions.
Extreme Gradient Boosting (XGBoost)
This model is an enhancement of the Gradient Boosting Machine: XGBoost is an optimized version of gradient boosting designed to work faster and more efficiently. It integrates techniques such as regularization, sparsity-aware learning and parallel training. These adaptations make XGBoost faster and more accurate than conventional GBM models.
To apply the model, you need substantial computational resources, and the hyperparameters need to be tuned extensively for the model to achieve optimal results. The most important hyperparameters include the learning rate, the number of trees and the depth of the trees. Advanced optimization techniques such as sparsity-aware split finding and the weighted quantile sketch make the calculations more efficient.
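A sketch with the xgboost library (installed separately, e.g. via pip install xgboost), reusing the split from above; the regularization values are illustrative:

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    learning_rate=0.05,
    n_estimators=500,
    max_depth=4,
    reg_lambda=1.0,  # L2 regularization against overfitting
    reg_alpha=0.0,   # L1 regularization
    random_state=42,
)
xgb.fit(X_train, y_train)
print("Test R²:", xgb.score(X_test, y_test))
```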
Advantages of XGBoost
Shows improved speed and efficiency compared to traditional gradient boosting machines.
Shows high accuracy and robustness of predictions.
Has built-in regularization to avoid overfitting.
Disadvantages of XGBoost
Especially for beginners, the model is rather complex and more difficult to implement. It contains many advanced techniques, such as sparsity-aware split finding and other optimizations, that are difficult for beginners to understand and apply.
Many hyperparameters need to be fine-tuned.
Support Vector Regression (SVR)
This model is used to predict continuous values. SVR searches for the optimal hyperplane that keeps the deviations of the data points within a tolerance margin. In other words, the model tries to find a line or surface that describes the data points as well as possible.
For the implementation, it is important that you choose the right kernel. A kernel helps to transform the data into a higher-dimensional space where it can be better separated or fitted. A linear kernel uses a straight line if the relationship between the variables is linear. A polynomial kernel uses curved lines for complex relationships. A radial basis function (RBF) kernel is suitable for very complex and non-linear relationships between variables. You must clean the data before using it. It is also important to scale the data because SVR is based on the distances between the data points. This means that you have to transform the features so that they have similar scales. For example, you can standardize them (mean = 0, standard deviation = 1) or adjust them with min-max scaling (values between 0 and 1).
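A sketch that combines scaling and SVR in a Scikit-Learn pipeline, reusing the split from above; the kernel and hyperparameter values are illustrative:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scaling is essential for SVR because the model works with distances.
svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.1),
)
svr.fit(X_train, y_train)
print("Test R²:", svr.score(X_test, y_test))
```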
Advantages of SVR
Is well suited to high-dimensional data — when the data set has many features.
Is robust against overfitting in high-dimensional spaces. SVR can avoid overfitting, especially if the data has many dimensions, because the model only penalizes deviations beyond a tolerance margin and keeps the fitted function as simple as possible.
Disadvantages of SVR
Is computationally intensive and requires long training times for large data sets.
Especially for beginners, it is difficult to choose the right kernel and parameters. It can be challenging to find the right kernel (linear, polynomial, RBF) and the best hyperparameters (e.g. C, gamma for the RBF kernel). These need to be fine-tuned to achieve the best model performance, which requires either a lot of trial and error or a good understanding of the underlying data and models.
Deep Learning Models for the Prediction of Prices
Various deep learning models such as Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTM) and Autoencoders are used for the price estimation of used cars. An explanation of these models follows in the next article.
Conclusion
If you want to sell your used car or buy one, it is important to know the fair value of the vehicle. Price estimation with machine learning or deep learning models helps to minimize the subjectivity and inaccuracy of traditional methods (e.g. comparative value method, cost approach, expert valuation) by efficiently processing large amounts of data and recognizing complex patterns. In the next article, I will present how different deep learning models work, as well as their advantages and disadvantages, for price estimation.
Where is the best place to continue learning?
Study — Price Prediction of Used Cars Using Machine Learning
Study — Price Prediction for Pre-Owned Cars Using Ensemble Machine Learning Techniques
Study — Used Car Price Prediction using Machine Learning: A Case Study
Subscribe to Data Science Espresso to receive weekly machine learning and deep learning guides. Let me know if a specific topic would help you.
DataCamp XGBoost-Tutorial (free)
DataCamp Tree-Based Models in Python-Course (free and paid parts)