Demystifying Data Science: A Case Study on Customer Churn
Article published: March 11, 2021
Customer churn -- the loss of customers using your product or subscribing to your services -- can be a death blow for subscription businesses. No matter how many customers you sign on in a given month, if they can’t be retained, your business will fail.
While churn rate is a seemingly obvious metric to track, understanding the root causes of customer churn requires in-depth analysis.
Churn analysis involves using data to gain insights into:
- How many or what proportion of your customers are leaving
- Which customers are more likely to leave
- Where in the pipeline customers tend to leave
- Why people stop using your product or service
By answering the first question, you’ve got a useful indicator of the overall health of your company, month to month.
By answering the remaining three questions, you’ll likely reveal specific opportunities to increase customer retention, allocate resources more effectively, and potentially even improve your product (depending upon your sector).
We’ll walk through how a data scientist might approach this analysis using an example dataset.
Our dataset represents customer churn for a cellular service provider over a one-month period. Let’s start with the basics. How many customers churned this month? To answer this question, we’ll use the ‘Churn’ column of the dataset, a binary variable encoded as either 1 (the customer churned) or 0 (the customer was retained). Taking the sum of the entire column reveals that 20,000 customers churned!
Obviously that’s not good news, but we need to put this raw count into context. What proportion of the total customers we had at the beginning of the month does this lost 20,000 represent? In other words, what was our churn rate?
We need to divide the number of customers lost by the total number of customers we had at the beginning of the month. The graph below visualizes our result:
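Both of these numbers take only a couple of lines of pandas to compute. Here is a minimal sketch, with a toy ten-customer dataset standing in for the real one:

```python
import pandas as pd

## Toy stand-in for the real dataset: 1 = churned, 0 = retained
model_data = pd.DataFrame({'Churn': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

## Number of customers lost: sum of the binary Churn column
churned = model_data['Churn'].sum()

## Churn rate: customers lost divided by customers at the start of the month
churn_rate = churned / len(model_data)

print(churned, churn_rate)
```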
Luckily our dataset also contains numerous variables that give us information about each customer’s behavior and demographics. This will help us understand what are potentially leading indicators of a customer’s propensity to churn. Here is a preview showing several columns of the first few rows:
We’ve got data on customer behavior including things like:
- Mou — mean monthly minutes of use
- Changem — percent change in minutes of use
- Custcare — number of calls between the customer and the customer care team
- Incalls — average number of inbound calls per month
- Outcalls — average number of outbound calls per month
We also have information on customer demographics, as well as how long they’ve been with our company and how old their current equipment is. The descriptions for all the variables we’ll use can be found here.
All this information could potentially be useful for predicting whether a customer will churn. But before we go into that, we ought to explore our data a little more. This will not only give us an idea of which features (a fancy word for variables) might be useful predictors but also inform another extremely important step in the process: data cleaning.
The quality of your model’s predictions is largely determined by the quality of the data we feed into the model. Important steps for preparing a high-quality dataset for modeling include:
- Dealing with missing values
- Investigating and handling outliers
- Deriving new features based on information you already have
We won’t spend a whole lot of time on this here so that we can focus on the modeling. However, keep in mind that a good chunk of the time you spend on a churn analysis will involve an iterative process of exploring and cleaning your data.
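As a rough sketch of what those three cleaning steps can look like in pandas (the column names are real, but the toy values and the 99th-percentile cap for outliers are illustrative choices, not taken from the actual analysis):

```python
import pandas as pd

## Toy slice of the data: mean minutes of use and months with the company
df = pd.DataFrame({'Mou': [520.0, None, 14000.0, 310.0],
                   'Months': [3, 14, 26, 8]})

## 1. Missing values: fill with the column median
df['Mou'] = df['Mou'].fillna(df['Mou'].median())

## 2. Outliers: cap extreme usage values at the 99th percentile
cap = df['Mou'].quantile(0.99)
df['Mou'] = df['Mou'].clip(upper=cap)

## 3. Derive a new feature from information we already have
df['FirstYear'] = (df['Months'] <= 12).astype(int)
```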
Building a Classification Model
Our task here is what’s known as binary classification. Given an input of various information about a customer, we must build a model that classifies the customer as churned (1) or not (0). The input consists of what are known as predictors or predictive features. Essentially, these are the variables (columns) from our dataset.
Once we have a model, we can home in on the predictive features the model relies on most heavily to make its classifications. Those features have the most potential to provide insight into which customers are churning and why.
So our target variable is Churn, and information about all the predictors included in our model can be found here. We’ll use these to build what’s known as a Random Forest Classifier. If you don’t know exactly what that means, don’t worry too much.
Basically, think of a random forest model as a process flow in which you ask a yes/no question about a customer and the logic branches based on the responses. Has the customer been with the company for more than 6 months? No? Ok, does the customer make on average 20 or more calls per month? Yes? Next branch, next question. All these branches make up a “tree”. At some point you get to the end of the last branch, and there on that leaf you find your answer in the form of 1 or 0 (i.e., churned or retained). Now imagine a whole forest of these “trees”, where each tree essentially gets a vote. Whichever output gets the majority of votes, that is our random forest classifier’s prediction for that customer.
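To make the “forest of voting trees” idea concrete, here is a toy sketch in plain Python. The questions and thresholds are invented for illustration; they are not taken from the real model:

```python
## Each "tree" is a chain of yes/no questions ending in 1 (churn) or 0 (retain)
def tree_a(customer):
    if customer['months'] <= 6:
        return 1
    return 0 if customer['calls_per_month'] >= 20 else 1

def tree_b(customer):
    return 1 if customer['calls_per_month'] < 10 else 0

def tree_c(customer):
    return 1 if customer['months'] <= 12 else 0

def forest_predict(customer):
    ## Each tree gets one vote; the majority wins
    votes = [tree(customer) for tree in (tree_a, tree_b, tree_c)]
    return 1 if sum(votes) > len(votes) / 2 else 0

## A long-tenured, high-usage customer: two of three trees vote "retained"
print(forest_predict({'months': 8, 'calls_per_month': 25}))
```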
Here’s how we’ll set up our random forest model (Note: If at any point your eyes start to cross while reading the blocks of code, feel free to ignore them. Or you can just read the comments marked with ‘##’ to get an idea of what each step is accomplishing.):
1. Read in our dataset and separate our predictive features (model input) from our target (churn). Next, do a 75/25 train-test split of the full dataset (40,000 customers). This split enables us to evaluate the performance of our model on data it’s never “seen” before after it has “learned” from most of the available data. We’ll use 75% of the data to train our model and then test how well it performs on the remaining 25%.
```python
## Import the libraries needed for this step
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Read in dataset from csv file
model_data = pd.read_csv('Data/model_data.csv')

## Define target variable
target = 'Churn'

## Separate features (X) and target (y) for train-test split
X = model_data.drop(columns=[target]).copy()
y = model_data[target].copy()

## Define random seed to use for train-test split and
## classifiers for reproducibility
random_seed = 319

## Split the data into training and test sets prior to preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=random_seed)
```
2. Next, we set up a pipeline (basically just a series of steps) that does some minimal cleaning to convert our data into a form that a machine learning algorithm can work with. Each column is a predictive feature from our dataset. Any categorical features need to be encoded as numbers rather than strings of letters. Some of our numerical columns have missing values, so we’ll just impute (a fancy word for “fill-in”) the median for each of those variables. We’ll also standardize the numerical variables so all values fall within a similar range.
```python
## Import preprocessing tools
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

## Define single categorical column
cat_col = ['Csa3_grp']

## List of numerical columns to be transformed
num_cols = ['Mou', 'Recchrge', 'Roam', 'Changem', 'Custcare',
            'Dropvce', 'Blckvce', 'Unansvce', 'Threeway', 'Outcalls',
            'Incalls', 'Peakvce', 'Callwait', 'Months', 'Eqpdays']

## List of remaining columns
rem_cols = ['Credita', 'Creditaa', 'Creditb', 'Creditc', 'Creditde',
            'Creditgy', 'Creditz', 'Prizmrur', 'Prizmub', 'Prizmtwn', 'Refurb',
            'Occprof', 'Occstud', 'Occhmkr', 'Occret', 'Occself',
            'Refery', 'Marryyes', 'Mailord', 'Mailres', 'Income', 'Incmiss']

## Create a pipeline for transforming categorical columns
cat_transformer = Pipeline(steps=[('encoder', OneHotEncoder(handle_unknown='error'))])

## Create a pipeline for transforming numerical columns:
## impute missing values with the median, then standardize
num_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='median')),
                                  ('scale', StandardScaler())])

## Apply each transformer to its columns; pass the remaining columns through
preprocessing = ColumnTransformer(transformers=[('cat', cat_transformer, cat_col),
                                                ('num', num_transformer, num_cols)],
                                  remainder='passthrough')
```
3. Now we actually fit the preprocessing pipeline to our training data and use it to transform both the training and test set of the predictors. Once transformed, we need to save the feature names in the order we fed the columns into the pipeline.
```python
## Preprocess training and test data
X_train_tf = preprocessing.fit_transform(X_train)
X_test_tf = preprocessing.transform(X_test)

## Get the feature names in the order they appear in preprocessed data
feature_names = preprocessing.named_transformers_['cat'].named_steps['encoder'].get_feature_names(cat_col)
feature_names = np.r_[feature_names, num_cols, rem_cols]
```
4. Finally, we fit our model to the training data and evaluate how our model performs making classifications for customers in the unseen test data.
```python
## Import the classifier
from sklearn.ensemble import RandomForestClassifier

## Instantiate random forest classifier
rf = RandomForestClassifier(random_state=random_seed)

## Fit random forest classifier to training data
rf.fit(X_train_tf, y_train)

## Evaluate how the model performs on the test set
eval_clf(rf, X_test_tf, y_test);
```
The classification report shows us that, of the 10,000 customers in the test set, our model classified 60% of them correctly. In other words, our overall accuracy is 60%: on average, the model classifies 6 out of every 10 customers correctly. The recall (the number of true positives divided by the sum of true positives and false negatives) for both classes is also 60%, which tells us that this rate of 6 in 10 holds for churned and retained customers alike. Let’s look at this in a slightly different way:
The confusion matrix above shows the proportion of cases from each class that our model classified correctly vs. incorrectly. For example, the bottom right corner shows that the model correctly classified 60% of the customers who actually did churn, while the bottom left corner shows that it incorrectly classified the remaining 40% of churned customers as retained. Its performance happens to be identical for the customers who were, in fact, retained.
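These metrics can be reproduced directly from the four confusion-matrix counts. The counts below are implied by the figures quoted in the text, assuming the test split preserves the overall 50/50 class balance:

```python
## Confusion-matrix counts implied by the text
## (assumes the 10,000-customer test set is split 5,000 churned / 5,000 retained)
tp, fn = 3000, 2000   # churned customers: caught vs. missed
tn, fp = 3000, 2000   # retained customers: correct vs. false alarms

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_churn = tp / (tp + fn)        # "catch rate" for churners
recall_retained = tn / (tn + fp)

print(accuracy, recall_churn, recall_retained)   # all 0.6
```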
The ROC curve plots the true positive rate (or recall, essentially the “catch rate” for correctly identifying customers who churned) against the false positive rate. Our model’s performance is shown by the blue line, whereas the red line represents how well we could expect to do by randomly guessing whether each customer churned. Ideally, we want lots of true positives and very few false positives, which would cause the blue line to hug the top left corner. Obviously, that’s not what we see here, but the fact that our blue line sits above the red line means our model performs better than random guessing!
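For intuition, here is a minimal hand-rolled version of what the ROC curve computes at each probability threshold. The labels and scores are toy values, not the model’s actual output:

```python
def roc_point(y_true, y_prob, threshold):
    """True/false positive rates when predicting churn above `threshold`."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, fp / neg   # (TPR, FPR)

## Toy labels and predicted churn probabilities
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

## Sweeping the threshold from high to low traces out the curve
for t in (0.8, 0.5, 0.1):
    print(t, roc_point(y_true, y_prob, t))
```

Lowering the threshold catches more churners (higher TPR) at the cost of more false alarms (higher FPR), which is exactly the trade-off the curve visualizes.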
You may be underwhelmed by the performance of our model because 60% accuracy is not fantastic. Keep in mind that model performance could likely be improved substantially with only a little extra effort. However, as we will see in the next section, even an imperfect model can be very useful in identifying business opportunities.
Let’s take a look at the features our model relied on most heavily to classify whether or not a customer churned.
```python
## Obtain and sort feature importances from fitted model
feature_importances = rf.feature_importances_
sorted_idx = feature_importances.argsort()
importance = pd.Series(feature_importances, index=feature_names)

## Plot top 10 most predictive features
fig = importance.sort_values().tail(10).plot(kind='barh')
fig.set_title('Top 10 Most Predictive Features')
```
This plot shows us the top 10 features most predictive of churn in descending order of importance (the longer the bar, the more weight the model gives to the predictor). Referencing our model variable documentation, we see that the number of days a customer has had their current equipment is the most predictive feature. Another variable that involves timing, Months, also makes the top 5. Most of the other variables deal with customer behavior, except for Recchrge (the mean total recurring charge paid by the customer for our company’s service) and Dropvce (the mean number of dropped voice calls).
Exploring the timing of churn, there’s a big spike around 365 Eqpdays and at 12 Months, presumably because this is when most contracts come up for renewal. We only see a very slight spike in customers churning 2 years into their contract, and none at 3 years.
These general trends can be informative and present an opportunity for business users like us to figure out why customers who stick around for at least two years are more likely to continue with our company. But those customers are not the focus of this analysis.
Let’s zoom in on the customers our model predicted were most likely to churn. To do this, we’ll divide our customers into deciles based on churn probability. Creating deciles entails:
- Obtaining the churn probability from our model for each customer in our test data
- Sorting all 10,000 customers in the test data in order of increasing churn probability
- Dividing the ordered set of customers into 10 groups of roughly equal size.
Then we’ll focus on the highest probability decile (i.e., the top ten percent of customers most likely to churn) for the rest of our analysis.
```python
## Get a Pandas Series that contains the predicted likelihood
## (probability) that each customer in the test set would churn
churn_probs = []
for obsv in rf.predict_proba(X_test_tf):
    churn_probs.append(obsv[1])
churn_probs = pd.Series(data=churn_probs)

## Reset the index of y_test set
y_test_ri = y_test.reset_index(drop=True)

## Create a new DataFrame that includes our predictors (X) from the
## test set, our target (y) from the test set, and the probability
## of churn output by the model for each customer
prob_df = X_test.copy().reset_index(drop=True)
prob_df['Churn'] = y_test_ri
prob_df['churn_prob'] = churn_probs

## Split the test set into deciles based on predicted churn probability
prob_df['decile_rank'] = pd.qcut(prob_df['churn_prob'],
                                 q=10, labels=False)
```
Now let’s look at how many customers in the top decile actually churned, keeping in mind that our model classified all these customers as churned.
Not bad! Even though our model only had 60% accuracy overall, when we focus on the top 10% of “high risk” customers we see a substantial improvement. Within the top decile, 72% of customers are classified correctly. From here, we’ll investigate some of our most predictive features by comparing customers in the most-likely-to-churn decile with the rest of the customers.
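The top-decile check itself is a one-liner once `prob_df` has its `decile_rank` column. Here is a sketch with toy data standing in for the real test set:

```python
import pandas as pd

## Toy stand-in for prob_df: churn outcome and decile rank (highest = riskiest)
prob_df = pd.DataFrame({
    'Churn':       [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    'decile_rank': [9, 9, 9, 9, 8, 8, 1, 0, 0, 0],
})

## Fraction of customers in the highest-risk decile who actually churned
top = prob_df[prob_df['decile_rank'] == prob_df['decile_rank'].max()]
hit_rate = top['Churn'].mean()
print(hit_rate)
```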
Investigating those two time variables again, we see that customers who are more likely to churn have had the same cellular equipment for longer. The relationship between churn risk and the number of months with the company is not as straightforward.
The plots below paint a picture of customers who are at a high risk of churning and are disengaged with our product. Customers who are likely to churn this month tend to show a considerable drop in the number of minutes of use compared to previous months. They also use substantially fewer minutes per month on average.
Two more of our top predictive features, Recchrge (cost of service) and Dropvce (quality of service), could potentially drive customer dissatisfaction. However, we see the opposite of the trend we might expect if these factors were causing customers to leave our company. In general, high risk customers pay less and experience fewer dropped calls than other customers. Granted, this may be driven at least partially by the fact that they appear to be using fewer minutes.
While our model could certainly use some work, we’ve already obtained some useful insights.
Timing is important, both in terms of how long a customer has had their current equipment and how long they’ve been with the company. Obviously, we can’t alter contracts, so there will inevitably be some churn at 12 months. However, we can:
- Offer targeted deals on new equipment to customers who have had their current phone for a while.
- Provide incentives for people to renew after their first year, but don’t waste time or lose money incentivizing customers at 2 and 3 years. We don’t see dramatic spikes in churn beyond that first year.
- Deploy a push notification system based on our model that automatically offers deals or incentives when a customer crosses a threshold into “high risk territory” (the highest probability decile), since customers who are more likely to churn are not actively engaged in using our services.
- Investigate further why customers are leaving. It doesn’t appear to be high prices or poor quality cell service driving individuals to churn, so we still need to pinpoint the ultimate cause(s).
As demonstrated, we can use an imperfect model to generate real business impact. Our analysis and model identified a segment of customers with the highest probability of churning, leading indicators of customer churn, and areas for further study. The actions suggested above could save the company valuable customers and revenue.
If it seems like your company could benefit from a thorough customer churn analysis, Propheto can help. Whether you’re not sure where to start, or you simply don’t have the time within your department to take on yet another project, Propheto will partner your team with an academic who can fill the gap.
Learn more about how you could benefit from using an expert data scientist to solve churn in this Propheto case study.
Contact us today to set up a scoping call. We’ll work with you to help assess your company’s specific needs and how to address them by leveraging your data.
Max Steele is a data scientist with a background in biological research and education, and experience with machine learning and data visualization. You can learn more about or contact them here. You can view the code used for this analysis in its GitHub repo here.