The data for this project includes survey responses from 129,880 customers. It includes data points such as class, flight distance, and in-flight entertainment, among others. In a previous activity, you utilized a binomial logistic regression model to help the airline better understand this data. In this activity, your goal will be to utilize a decision tree model to predict whether or not a customer will be satisfied with their flight experience.
1. Predict whether a future customer will be satisfied with the airline's services, given previous customer feedback about their flight experience.
2. Construct and evaluate a model to determine which features are most important to customer satisfaction.
Importing packages and loading data
Exploring the data and completing the cleaning process
Building a model to determine which features are most important to customer satisfaction:
Decision trees require no assumptions about the distribution of the underlying data and do not require feature scaling, although they are susceptible to overfitting. In this work, I use a decision tree model to determine which features are most important for customer satisfaction.
Tuning hyperparameters using GridSearchCV. Hyperparameter tuning with grid search helps prevent overfitting in the model.
Evaluating a decision tree model using a confusion matrix and various other plots
In this section, I will be preparing the data to be suitable for decision tree classifiers. This includes:
Exploring the data
Checking for missing values
Encoding the data
Renaming a column
Creating the training and testing data
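The encoding, renaming, and splitting steps listed above could be sketched roughly as follows. This is only an illustration on a tiny made-up DataFrame: the column names (`satisfaction`, `Class`, `Customer Type`), the category labels, the renamed column `Loyal_customer`, the 25% test size, and the random seed are all assumptions, not details taken from the actual dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up sample; column names and labels are assumptions.
df = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied", "dissatisfied",
                     "satisfied", "dissatisfied", "satisfied", "dissatisfied"],
    "Class": ["Business", "Eco", "Eco Plus", "Eco",
              "Business", "Eco", "Business", "Eco Plus"],
    "Customer Type": ["Loyal", "Disloyal", "Loyal", "Loyal",
                      "Disloyal", "Loyal", "Loyal", "Disloyal"],
    "Inflight entertainment": [5, 1, 4, 2, 5, 1, 4, 2],
})

# Ordered categories: map each level to an integer.
df["Class"] = df["Class"].map({"Eco": 1, "Eco Plus": 2, "Business": 3})

# Target: 1 for satisfied, 0 for dissatisfied.
df["satisfaction"] = df["satisfaction"].map({"satisfied": 1, "dissatisfied": 0})

# Unordered categories: one-hot encode, dropping the first level.
df = pd.get_dummies(df, columns=["Customer Type"], drop_first=True, dtype=int)

# Hypothetical rename of the generated dummy column.
df = df.rename(columns={"Customer Type_Loyal": "Loyal_customer"})

# Separate the target from the features, then split.
y = df["satisfaction"]
X = df.drop(columns="satisfaction")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying on `y` keeps the share of satisfied and dissatisfied customers similar in the training and testing sets.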
Checking the data type of each column, keeping in mind that decision trees expect numeric data.
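A data type check of this kind might look like the sketch below. The DataFrame is a made-up stand-in and its column names are assumptions; the point is only to show how non-numeric columns that still need encoding can be listed.

```python
import pandas as pd

# Made-up sample; the column names are assumptions.
df = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Class": ["Business", "Eco", "Eco Plus"],
    "Flight Distance": [265, 2464, 2138],
    "Seat comfort": [0, 3, 5],
})

# Data type of every column.
print(df.dtypes)

# Columns that are not numeric and therefore still need encoding.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```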
Keep in mind that the sklearn decision tree implementation has not traditionally supported missing values (native handling of NaN arrived only in scikit-learn 1.3), so checking the rows of the data for missing values is important.
The dataset has 129880 rows and 22 columns. Since the 393 rows with missing values represent a small percentage of the entire dataset (about 0.3%), I drop these rows and assign the resulting DataFrame a new name.
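The missing-value check and the drop could be sketched as below, assuming pandas. The sample data and the column name `Arrival Delay in Minutes` are hypothetical, and `df_subset` is just an illustrative name for the new DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps; the column names are assumptions.
df = pd.DataFrame({
    "Flight Distance": [265, 2464, 2138, 623, 354],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0, np.nan, 3.0],
})

# Missing values per column.
print(df.isnull().sum())

# Number of rows with at least one missing value, and their share of the data.
n_missing_rows = int(df.isnull().any(axis=1).sum())
share_missing = n_missing_rows / len(df)
print(n_missing_rows, f"{share_missing:.2%}")

# Drop the incomplete rows and store the result under a new name.
df_subset = df.dropna(axis=0).reset_index(drop=True)
print(df_subset.shape)

# Share reported for the real dataset: 393 of 129880 rows.
reported_share = 393 / 129880
print(f"{reported_share:.2%}")  # → 0.30%
```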
The confusion matrix shows a high proportion of true positives and true negatives, where the model accurately predicted that the customer would be satisfied or dissatisfied, respectively.
The matrix also shows a relatively low number of false positives and false negatives, where the model inaccurately predicted that the customer would be satisfied or dissatisfied, respectively.
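The four cells discussed above can be read off a confusion matrix as in the sketch below. It uses synthetic data from `make_classification` as a stand-in for the survey data, so the counts it prints say nothing about the real model; the sample size, split, and seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (an assumption, not the real dataset).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

decision_tree = DecisionTreeClassifier(random_state=0)
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

If matplotlib is available, `sklearn.metrics.ConfusionMatrixDisplay` can draw the same matrix as a plot.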
A decision tree can keep growing with no built-in limit. Tuning the model can improve its performance: grid search with cross-validation finds the best values for the hyperparameters max_depth and min_samples_leaf, which determine how far the tree should grow.
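A grid search over these two hyperparameters might be set up as in the following sketch. The data is synthetic, and the candidate values, the F1 scoring choice, and the 5-fold cross-validation are illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (an assumption, not the real dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate values are illustrative assumptions.
tree_para = {
    "max_depth": [2, 4, 6, 8, None],
    "min_samples_leaf": [1, 5, 10, 20],
}

# Try every combination with 5-fold cross-validation, scored by F1.
clf = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=tree_para,
    scoring="f1",
    cv=5,
)
clf.fit(X_train, y_train)

print(clf.best_params_)   # best combination found
print(clf.best_score_)    # mean cross-validated F1 for that combination
tuned_tree = clf.best_estimator_  # refit on the full training set
```

Larger `min_samples_leaf` values and smaller `max_depth` values both limit how far the tree grows, which is what makes them useful against overfitting.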
Print out the decision tree model's accuracy, precision, recall, and F1 score. This task can be done in a number of ways.
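One of those ways is to call the scoring functions in `sklearn.metrics` directly. The sketch below uses ten hand-written labels (made up for illustration, with 1 meaning satisfied) so that each score can be checked by hand: there are 4 true positives, 1 false negative, 2 false positives, and 3 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 1 = satisfied, 0 = dissatisfied.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / all = 7/10
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP) = 4/6
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN) = 4/5
    "f1": f1_score(y_true, y_pred),                # 2TP / (2TP + FP + FN) = 8/11
}

for name, value in scores.items():
    print(f"{name}: {value:.6f}")
# accuracy: 0.700000
# precision: 0.666667
# recall: 0.800000
# f1: 0.727273
```

Precision is measured over the customers the model predicted as satisfied, while recall is measured over the customers who actually were satisfied.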
The F1 score for the decision tree without hyperparameter tuning is 0.940940, while the F1 score for the hyperparameter-tuned decision tree is 0.945422, a difference of about 0.0045. Although guarding against overfitting is necessary for some models, hyperparameter tuning did not make a meaningful difference for this model.
In summary:
The decision tree model predicted customer satisfaction correctly over 94 percent of the time.
The precision score shows that over 95% of the model's positive predictions are truly positive. In other words, of the customers the model predicted to be satisfied, over 95% were actually satisfied.
The recall score shows that the model correctly identified over 93% of the actual positives, that is, of the customers who were truly satisfied. A high recall helps the airline avoid wasting resources trying to improve the experience of customers who are already satisfied.
The confusion matrix is also useful: it shows similarly high numbers of true positives and true negatives, which suggests that the model predicts both classes accurately.
The visualization of the decision tree and the feature importance graph both suggest that 'Inflight entertainment', 'Seat comfort', and 'Ease of Online booking' are the most important features in the model.
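Feature importances of this kind can be read from a fitted tree's `feature_importances_` attribute, as in the sketch below. The data is randomly generated, and the rule that defines the target is made up so that only two features matter; the resulting ranking is therefore illustrative and does not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Randomly generated stand-in data; values and rule are assumptions.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Inflight entertainment": rng.integers(0, 6, size=n),
    "Seat comfort": rng.integers(0, 6, size=n),
    "Ease of Online booking": rng.integers(0, 6, size=n),
    "Flight Distance": rng.integers(100, 5000, size=n),
})

# Made-up rule: a customer is satisfied if either rating is 4 or higher.
y = ((X["Inflight entertainment"] >= 4) | (X["Seat comfort"] >= 4)).astype(int)

decision_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; higher values mean the feature reduced impurity more.
importances = pd.Series(
    decision_tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

With matplotlib installed, `importances.plot.bar()` gives a feature importance graph, and `sklearn.tree.plot_tree` can draw the fitted tree itself.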
Recommendation:
Customer satisfaction is highly tied to 'Inflight entertainment', 'Seat comfort', and 'Ease of Online booking'. Improving these experiences should lead to better customer satisfaction.
The success of the model suggests that the airline should invest more effort in model building and model understanding, since this model seemed to be very good at predicting customer satisfaction.