Rochester Transit Service - Senior Design Day

Team

Yihe Chen

Harry Huang

Junting Chen

Kehan Yu

Mentor

Cantay Caliskan

Abstract

Predictive Analytics for Demand Responsive Para-transportation

Vision & Goal

● Create a productive schedule for Demand Responsive Para-transportation by predicting customer cancellations.

● Provide executable Python code and classification model.

● Identify the most suitable performance metrics.

● Generate well-organized supporting

Data Overview

Internal data

● We acquired the internal data from our sponsor

● Our original dataset contains 102,754 observations and 21 explanatory variables, covering May 17, 2021 to December 5, 2021.

External Data

● We acquired the external data from NOAA (National Oceanic and Atmospheric Administration)

● Acquired daily weather information

Data Visualization

Feature engineering

● Created a label for the cancellation (1 for canceled, 0 for performed)

● Transformed ‘date’ variables into informative variables (e.g., month, day, weekday)

● Encoded categorical variables

● Aggregated passengers by type (with children, need lift)
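
The feature engineering steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The field names (`date`, `status`, `passengers`) and the category values are hypothetical placeholders, not the sponsor's actual schema.

```python
from datetime import date


def one_hot(value, categories):
    """Encode a categorical value as 0/1 indicator features."""
    return {f"is_{c}": int(value == c) for c in categories}


def engineer_features(trip: dict) -> dict:
    """Turn one raw trip record (hypothetical schema) into model features."""
    d = date.fromisoformat(trip["date"])
    return {
        # Label: 1 for canceled, 0 for performed.
        "canceled": 1 if trip["status"] == "canceled" else 0,
        # Date decomposed into informative parts.
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),  # Monday = 0 ... Sunday = 6
        # Passenger counts aggregated by type.
        "n_with_children": sum(p == "with_children" for p in trip["passengers"]),
        "n_need_lift": sum(p == "need_lift" for p in trip["passengers"]),
    }


# Hypothetical example record.
example = engineer_features({
    "date": "2021-05-17",
    "status": "canceled",
    "passengers": ["need_lift", "with_children", "need_lift"],
})
```

With a full dataset the same transformations would more likely be written against a data-frame library, but the logic per record is the same.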

Modeling

Handle the Class Imbalance

Random Forest with SMOTE

Accuracy: 81.5% -> 84.8%

Precision: 32.2% -> 62.1%

Recall: 53.9% -> 57.7%

Precision improves substantially, a relative increase of 92.8%, while Recall and Accuracy improve slightly
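
To illustrate what SMOTE does to the minority class (canceled trips), here is a minimal pure-Python sketch of its core step: each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The function name, the toy points, and the choice of `k` are assumptions for illustration. The poster does not state which implementation was used, and in practice a library implementation (for example the one in imbalanced-learn) would be applied to the training split only.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority samples: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
```

Because each corner's two nearest neighbors are the adjacent corners, every synthetic point in this toy example falls on an edge of the square, that is, between two real minority samples.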

Weighted Random Forest Classifier

Compared to the Random Forest Classifier, the Weighted Random Forest Classifier penalizes misclassification of the minority class (canceled trips) more heavily
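
The poster does not state how the class weights were chosen. One common scheme, assumed here purely for illustration, is inverse class frequency: weight = n_samples / (n_classes * class_count), which is the rule behind scikit-learn's `class_weight="balanced"` option. The label counts below are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency class weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical training labels: 1 = canceled (minority), 0 = performed.
labels = [1] * 20 + [0] * 80
weights = balanced_class_weights(labels)
# Each misclassified canceled trip counts 2.5 / 0.625 = 4 times as much
# as a misclassified performed trip.
```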

Confusion matrix

	Actually Canceled	Actually Uncanceled
Predicted Canceled	TP = 1808	FP = 598
Predicted Uncanceled	FN = 2207	TN = 15938

XGBoost Classifier

Accuracy: 81.5% -> 86.1%

Precision: 32.2% -> 65.2%

Recall: 53.9% -> 62.3%

Confusion matrix

	Actually Canceled	Actually Uncanceled
Predicted Canceled	TP = 2503	FP = 1336
Predicted Uncanceled	FN = 1512	TN = 15200
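
The reported metrics can be checked directly against the confusion matrices. The short sketch below recomputes accuracy, precision, and recall from the four cell counts of each matrix. The helper name `metrics` is ours.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),  # share of predicted cancellations that were real
        "recall": tp / (tp + fn),     # share of real cancellations that were caught
    }


# Counts taken from the two confusion matrices above.
wrf = metrics(tp=1808, fp=598, fn=2207, tn=15938)   # Weighted Random Forest
xgb = metrics(tp=2503, fp=1336, fn=1512, tn=15200)  # XGBoost

for name, m in (("Weighted RF", wrf), ("XGBoost", xgb)):
    print(name, {k: f"{v:.1%}" for k, v in m.items()})
```

The XGBoost counts reproduce the reported 86.1% accuracy, 65.2% precision, and 62.3% recall. The Weighted Random Forest counts work out to roughly 86.4% accuracy, 75.1% precision, and 45.0% recall, so it has the higher precision and XGBoost the higher recall.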

Key Insights

● Our sponsor (RTS) keeps an extra bus on standby to cover any missed cases.

● During busy hours (from 8 am to 3 pm):

○ Excessively running the extra bus is costly when the prediction is not precise
○ It’s better to use the Weighted Random Forest Classifier, which gives the highest precision

● During other times:

○ Running extra buses is less costly since fewer clients use the service
○ It’s better to use the XGBoost Classifier, which balances recall (covering more canceled trips) and precision (making fewer errors when predicting cancellations)
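
The recommendation above amounts to a simple time-of-day rule for choosing a model. A minimal sketch follows. The function name, the returned labels, and the treatment of the 3 pm boundary as exclusive are assumptions made for illustration.

```python
def choose_model(hour: int) -> str:
    """Pick the classifier to use for a trip starting at the given hour (0-23)."""
    # Assumption: busy hours run from 8:00 up to, but not including, 15:00.
    if 8 <= hour < 15:
        return "weighted_random_forest"  # highest precision
    return "xgboost"  # better balance of recall and precision


busy_choice = choose_model(9)
quiet_choice = choose_model(18)
```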