Flight Price Prediction using ML Techniques Written From Scratch
This project explores the relationship between airline ticket prices and various factors such as airline, destination, flight times, booking period, and flight class. We developed custom-built machine learning models using fundamental Python libraries (numpy, pandas) to predict ticket prices based on these features.
Dataset Description
The dataset contains the price of plane tickets from six Indian airline companies for flights between six major cities in India. The features include:
- Airline: Categorical values for airline names.
- Flight: Categorical flight codes (removed during preprocessing).
- Source City: Categorical values for the departure cities.
- Departure Time: Categorical time labels for departure time.
- Stops: Categorical values for the number of stops.
- Arrival Time: Categorical time labels for arrival time.
- Destination City: Categorical values for the destination cities.
- Class: Categorical values indicating if the flight is “economy” or “business” class.
- Duration: Continuous feature representing flight duration.
- Days Left: Continuous feature representing how many days before the flight the ticket was bought.
- Price: Continuous variable and target for prediction.
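A minimal preprocessing sketch for the features above, assuming a pandas DataFrame. The column names (`airline`, `flight`, `class`, `duration`, `days_left`, `price`) and the toy rows are hypothetical, and one-hot encoding is an assumed choice; the project's actual encoding may differ.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; names and values are assumptions, not project data.
df = pd.DataFrame({
    "airline": ["A", "B", "A", "C"],
    "flight": ["A-101", "B-202", "A-303", "C-404"],
    "class": ["economy", "business", "economy", "economy"],
    "duration": [2.0, 2.5, 3.0, 1.5],
    "days_left": [10, 3, 25, 7],
    "price": [5000.0, 42000.0, 4200.0, 6100.0],
})

def preprocess(frame, categorical, target="price", drop=("flight",)):
    """Drop unused columns, one-hot encode categoricals, and split X from y."""
    frame = frame.drop(columns=list(drop))
    y = frame[target].to_numpy(dtype=float)
    features = frame.drop(columns=[target])
    # Each categorical column becomes one 0/1 column per category.
    features = pd.get_dummies(features, columns=list(categorical), dtype=float)
    return features.to_numpy(dtype=float), y, list(features.columns)

X, y, feature_names = preprocess(df, categorical=["airline", "class"])
```

Here `X` is a purely numeric matrix that numpy-based models can consume directly.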
Data Analysis
- Correlation Matrix: Analyzed relationships between features using a heatmap.
- Box Plots: Examined how flight prices vary with airline, the number of days between booking and departure, and ticket class.
- Violin Plots: Compared ticket prices for different flight classes.
- Line Plots: Visualized how ticket prices change as the departure date approaches.
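The numbers behind a correlation heatmap and a price-versus-days-left line plot can be computed with pandas alone. The sketch below uses hypothetical column names and made-up toy values, and leaves out the plotting itself.

```python
import pandas as pd

# Toy values for illustration only; not taken from the project dataset.
toy = pd.DataFrame({
    "duration": [2.0, 2.5, 3.0, 1.5, 2.0, 2.5],
    "days_left": [1, 1, 10, 10, 30, 30],
    "price": [9000.0, 11000.0, 6000.0, 5000.0, 4000.0, 4400.0],
})

# Pairwise Pearson correlations: the matrix a heatmap would display.
corr = toy.corr()

# Mean price for each value of days_left: the series a line plot would show.
mean_price_by_days = toy.groupby("days_left")["price"].mean()
```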
Applied Methods
Linear Regression
Implemented linear regression in numpy using three fitting methods:
- Ordinary Least Squares (OLS): Basic linear regression method.
- Ridge Regression: Added L2 regularization to reduce overfitting.
- Lasso Regression: Added L1 regularization to produce a sparse model.
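The three fitting methods above can be sketched in plain numpy as follows. This is an illustrative version, not necessarily the project's implementation: the function names, the unpenalized intercept, and the use of coordinate descent for the lasso are all assumptions.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_ols(X, y):
    """Ordinary least squares via numpy's least-squares solver."""
    w, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return w

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge: solve (X^T X + lam * I) w = X^T y."""
    Xb = add_intercept(X)
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # assumption: the intercept is not regularized
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def soft_threshold(z, t):
    """Shrink z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_lasso(X, y, lam=1.0, n_iter=200):
    """Coordinate descent on 0.5 * ||y - Xb w||^2 + lam * ||w[1:]||_1."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    col_sq = (Xb ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(Xb.shape[1]):
            # Residual with feature j's current contribution removed.
            residual = y - Xb @ w + Xb[:, j] * w[j]
            rho = Xb[:, j] @ residual
            if j == 0:
                w[j] = rho / col_sq[j]  # intercept: plain least-squares update
            else:
                w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w
```

Each function returns a weight vector whose first entry is the intercept. The soft-thresholding step is what sets some lasso coefficients exactly to zero, which is the source of the sparsity mentioned above.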
Performance metrics:
- RMSE: Root Mean Squared Error
- R²: Coefficient of Determination
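Both metrics follow directly from their standard definitions: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal numpy sketch, with illustrative function names:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

A model that always predicts the mean of the targets scores R² = 0, so values near zero indicate that the model explains little of the variance in price.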
K-Nearest Neighbors (KNN)
Used KNN to predict the price of a new data point as the average response value of its k nearest training data points.
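A minimal sketch of KNN regression in numpy. Euclidean distance and the function name `knn_predict` are assumptions, and the source does not state the value of k that was used.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict each query point as the mean target of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k smallest distances per query row (order within the k is arbitrary).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)
```

The full distance matrix grows with the product of query and training set sizes, so on a large dataset the queries would need to be processed in chunks.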
Decision Trees
Constructed decision trees to segment the predictor space into finite regions, each associated with a prediction value. Applied bagging to improve performance by averaging over multiple trees.
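The idea can be sketched as a small regression tree with bootstrap aggregation. This version assumes greedy splits that minimize the summed squared error and an exhaustive threshold search; the function names and defaults are illustrative rather than the project's own.

```python
import numpy as np

def best_split(X, y, min_leaf):
    """Return the (feature, threshold) that most reduces squared error, or None."""
    best, best_sse = None, np.sum((y - y.mean()) ** 2)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # skip the max so the right side is non-empty
            left = X[:, j] <= t
            n_left = left.sum()
            if n_left < min_leaf or len(y) - n_left < min_leaf:
                continue
            yl, yr = y[left], y[~left]
            sse = np.sum((yl - yl.mean()) ** 2) + np.sum((yr - yr.mean()) ** 2)
            if sse < best_sse - 1e-12:
                best, best_sse = (j, t), sse
    return best

def build_tree(X, y, depth=0, max_depth=5, min_leaf=1):
    """Recursively partition the data; a leaf is the mean target of its region."""
    split = best_split(X, y, min_leaf) if depth < max_depth else None
    if split is None:
        return float(y.mean())
    j, t = split
    left = X[:, j] <= t
    return (j, t,
            build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf))

def predict_tree(tree, X):
    """Walk each row down the tree until it reaches a leaf value."""
    out = []
    for x in X:
        node = tree
        while isinstance(node, tuple):
            j, t, left, right = node
            node = left if x[j] <= t else right
        out.append(node)
    return np.array(out)

def fit_bagging(X, y, n_trees=10, seed=0, **tree_kwargs):
    """Fit one tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))
        trees.append(build_tree(X[idx], y[idx], **tree_kwargs))
    return trees

def predict_bagging(trees, X):
    """Average the predictions of all trees."""
    return np.mean([predict_tree(t, X) for t in trees], axis=0)
```

Averaging over bootstrap-trained trees reduces the variance of a single deep tree. The exhaustive threshold search is simple but slow on large datasets, where candidate thresholds are often subsampled.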
Neural Networks
Built and trained neural networks to capture nonlinear relationships in the data. Implemented stochastic gradient descent for training.
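A minimal sketch of a one-hidden-layer regression network trained with mini-batch stochastic gradient descent in numpy. The ReLU activation, mean squared error loss, initialization scheme, and all hyperparameter defaults are assumptions for illustration; the source does not specify them.

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(1.0 / n_hidden), (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X):
    """Return predictions and the intermediates needed for backpropagation."""
    z1 = X @ params["W1"] + params["b1"]
    h = np.maximum(z1, 0.0)  # ReLU
    pred = (h @ params["W2"] + params["b2"]).ravel()
    return pred, (z1, h)

def sgd_step(params, X, y, lr):
    """One gradient step on the mean squared error of a mini-batch."""
    pred, (z1, h) = forward(params, X)
    d_out = (2.0 / len(y)) * (pred - y)[:, None]  # dLoss/dpred
    d_z1 = (d_out @ params["W2"].T) * (z1 > 0)    # backprop through ReLU
    grads = {
        "W2": h.T @ d_out,
        "b2": d_out.sum(axis=0),
        "W1": X.T @ d_z1,
        "b1": d_z1.sum(axis=0),
    }
    for name in params:
        params[name] -= lr * grads[name]

def train(X, y, n_hidden=16, lr=0.01, epochs=200, batch_size=32, seed=0):
    """Shuffle the data each epoch and update on successive mini-batches."""
    rng = np.random.default_rng(seed)
    params = init_params(X.shape[1], n_hidden, rng)
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            sgd_step(params, X[batch], y[batch], lr)
    return params
```

In practice the inputs and the price target would usually be standardized before training, since unscaled prices in the tens of thousands make the gradient steps unstable at typical learning rates.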
Results
- Linear Regression:
  - OLS: RMSE = 21973.28, R² = 0.061
  - Ridge Regression: RMSE = 21961.37, R² = 0.0609
  - Lasso Regression: RMSE = 21965.48, R² = 0.0605
- KNN: RMSE = 3780.50
- Decision Trees:
  - Simple Decision Tree: RMSE = 4494.55
  - Bagging Decision Tree: RMSE = 4341.66
  - Random Forest: RMSE = 22661.89
- Neural Networks:
  - Single Perceptron: RMSE = 13893.22
  - Neural Network without Hidden Layer: RMSE = 4422.97
  - Neural Network with 1 Hidden Layer: RMSE = 4636.13
Challenges Faced
- Underfitting: With a large number of data points and relatively few predictors, the simpler models underfit the data; the linear regression variants reached an R² of only about 0.06.
- Training Time: The more complex models required long training times.
Conclusion
Among the methods tested, KNN (RMSE = 3780.50) and the simple and bagged decision trees (RMSE = 4494.55 and 4341.66) performed best. The linear regression methods (R² ≈ 0.06) were limited by their inability to capture nonlinear relationships in the data, while the neural networks showed promise but required extensive training time to optimize.
Authors
- Kemal Enes Akyüz - 22003521
- Efe Tarhan - 22002840
References
- S. Bathwal, “Flight Price Prediction,” Kaggle. Accessed: 17 Nov 2023. [Online]. Available: Kaggle
- E. Dougherty, “Deciphering the Digits in Your Flight Number,” Blue Sky PIT News Site, 09 Mar, 2020. Accessed: 18 Nov 2023. [Online]. Available: Blue Sky PIT
- Newcastle University, “Coefficient of Determination (R squared),” 2023. Accessed: 19 Nov 2023. [Online]. Available: Newcastle University
- G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. Springer, 2013.