Team
Yihe Chen
Harry Huang
Junting Chen
Kehan Yu
Mentor
Cantay Caliskan
Abstract
Predictive Analytics for Demand Responsive Para-transportation
Vision & Goal
● Create a productive schedule for Demand Responsive Para-transportation by predicting customer cancellations.
● Provide executable Python code and a classification model.
● Identify the most suitable performance metrics.
● Generate well-organized supporting
Data Overview
Internal data
● We acquired the internal data from our sponsor.
● Our original dataset contains 102,754 observations and 21 explanatory variables, covering May 17, 2021 to December 5, 2021.
External Data
● We acquired the external data from NOAA (National Oceanic and Atmospheric Administration)
● Acquired daily weather information
Data Visualization
Feature engineering
● Created a label for the cancellation (1 for canceled, 0 for performed)
● Transformed ‘date’ variables into informative variables (e.g., month, day, weekday)
● Encoded categorical variables
● Aggregated passengers by type (with children, need lift)
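The four steps above can be sketched on a single trip record. This is a minimal illustration only: the field names (`date`, `status`, `purpose`, `passenger_types`) and the purpose categories are hypothetical and are not taken from the sponsor's dataset.

```python
from collections import Counter
from datetime import date

# Hypothetical category list for one-hot encoding; not the sponsor's schema.
TRIP_PURPOSES = ["medical", "work", "other"]

def engineer_features(trip):
    """Turn one raw trip record (a dict with assumed field names) into features."""
    d = date.fromisoformat(trip["date"])
    by_type = Counter(trip["passenger_types"])  # aggregate passengers by type
    features = {
        "canceled": 1 if trip["status"] == "canceled" else 0,  # 1 = canceled, 0 = performed
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),  # Monday = 0, Sunday = 6
        "n_with_children": by_type["with_children"],
        "n_need_lift": by_type["need_lift"],
    }
    # One-hot encode the categorical trip purpose.
    for purpose in TRIP_PURPOSES:
        features[f"purpose_{purpose}"] = int(trip["purpose"] == purpose)
    return features

example = engineer_features({
    "date": "2021-05-17",
    "status": "canceled",
    "purpose": "medical",
    "passenger_types": ["need_lift", "with_children", "need_lift"],
})
print(example)
```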
Modeling
Handle the Class Imbalance
Random Forest with SMOTE
Accuracy: 81.5% -> 84.8%
Precision: 32.2% -> 62.1%
Recall: 53.9% -> 57.7%
Precision improves substantially (a relative gain of 92.8%), while recall and accuracy improve slightly
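The idea behind SMOTE is to add synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below illustrates that idea in plain Python; it is not the team's code, and the function name `smote_sketch` and its parameters are hypothetical. In practice a library implementation would normally be used, and oversampling is usually applied to the training split only.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Toy minority class: four numeric feature vectors (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
print(len(new_points), "synthetic samples generated")
```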
Weighted Random Forest Classifier
Compared with the standard Random Forest Classifier, the Weighted Random Forest Classifier penalizes misclassification of the minority class (canceled trips) more heavily
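The poster does not state which class weights were used. One common choice, and the one scikit-learn applies for `class_weight="balanced"`, is n_samples / (n_classes × class count). The sketch below computes those weights, assuming for illustration the class counts implied by the confusion matrix in this section (4015 canceled, 16536 not canceled).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels only: 1 = canceled (minority), 0 = performed (majority).
labels = [1] * 4015 + [0] * 16536
weights = balanced_class_weights(labels)
print(weights)  # the minority class receives the larger weight
```

With these counts, an error on a canceled trip is weighted roughly four times as heavily as an error on a performed trip.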
Confusion matrix
| | Actually Canceled | Actually Uncanceled |
| Predicted Canceled | TP = 1808 | FP = 598 |
| Predicted Uncanceled | FN = 2207 | TN = 15938 |
XGBoost Classifier
Accuracy: 81.5% -> 86.1%
Precision: 32.2% -> 65.2%
Recall: 53.9% -> 62.3%
Confusion matrix
| | Actually Canceled | Actually Uncanceled |
| Predicted Canceled | TP = 2503 | FP = 1336 |
| Predicted Uncanceled | FN = 1512 | TN = 15200 |
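Accuracy, precision, and recall can be recomputed directly from the two confusion matrices above. The helper name `metrics` is ours; the counts are copied from the matrices.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted cancellations that were real
        "recall": tp / (tp + fn),     # share of real cancellations that were caught
    }

weighted_rf = metrics(tp=1808, fp=598, fn=2207, tn=15938)
xgboost = metrics(tp=2503, fp=1336, fn=1512, tn=15200)

for name, m in [("Weighted Random Forest", weighted_rf), ("XGBoost", xgboost)]:
    print(name, {k: f"{v:.1%}" for k, v in m.items()})
```

The XGBoost counts reproduce the reported 86.1% accuracy, 65.2% precision, and 62.3% recall. The Weighted Random Forest counts work out to about 75.1% precision and 45.0% recall, so it is more precise than XGBoost but catches fewer cancellations.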
Key Insights
● Our sponsor (RTS) keeps an extra bus on standby to cover any missed cases.
● During busy hours (from 8 am to 3 pm):
○ Excessively running the extra bus is costly when the prediction is not precise
○ It’s better to use the Weighted Random Forest Classifier, which gives the highest precision
● During other times:
○ Running extra buses is less costly because fewer clients use the service
○ It’s better to use the XGBoost Classifier, which balances recall (covering more canceled trips) and precision (making fewer errors when predicting cancellations)
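The time-of-day recommendation above amounts to a simple routing rule. The sketch below encodes it; the function name, the returned labels, and the treatment of the boundaries (8:00 inclusive, 15:00 exclusive) are assumptions, not details from the project.

```python
from datetime import time

# Busy window as stated in the insights; boundary handling is an assumption.
BUSY_START, BUSY_END = time(8, 0), time(15, 0)

def choose_model(pickup):
    """Pick the classifier to trust for a trip, based on its pickup time."""
    if BUSY_START <= pickup < BUSY_END:
        return "weighted_random_forest"  # favors precision during busy hours
    return "xgboost"                     # balances recall and precision otherwise

print(choose_model(time(9, 30)))
print(choose_model(time(16, 0)))
```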