Model Selection for SaaS Churn Prediction Using Machine Learning

This is a post in a series about churn and customer satisfaction. If you want churn prediction and management without more work, check out Keepify. If you want more details, email away.

Recently I have been developing machine-learning systems that predict SaaS churn. Churn prediction has many business benefits and applications, but here I will focus on the technical details of selecting a durable model for predicting churn and some of the lessons I’ve learned along the way.

Beginnings

Most learning problems should be attacked initially with a linear model, and I tried two linear approaches in the early days. The first was an attempt to predict the number of months a user would stay using linear regression. This was a terrible failure: it was wrong roughly 90% of the time, and the root mean square error was absurdly high. I think this was the wrong approach with the wrong data, but it was a fun initial experiment to get some momentum. The second was an attempt to classify users as churners or non-churners using logistic regression. I’ll address that one more in the next few sections.
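For concreteness, here is a minimal sketch of those two early approaches on synthetic data, using scikit-learn rather than the tools I actually used; the feature names and churn rule are invented for illustration, not drawn from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
n = 500
logins_per_week = rng.poisson(5, n)   # hypothetical usage feature
tickets_opened = rng.poisson(1, n)    # hypothetical support feature
X = np.column_stack([logins_per_week, tickets_opened])

# Approach 1: regress months retained directly (the one that failed badly).
months_stayed = 2 + 0.8 * logins_per_week + rng.normal(0, 6, n)  # very noisy
reg = LinearRegression().fit(X, months_stayed)
rmse = np.sqrt(np.mean((reg.predict(X) - months_stayed) ** 2))  # stays large

# Approach 2: classify churner vs. non-churner instead.
churned = (logins_per_week < 4).astype(int)  # toy churn rule
clf = LogisticRegression().fit(X, churned)
accuracy = clf.score(X, churned)
```

Even on toy data the pattern holds: the regression's error is dominated by noise in tenure, while the binary formulation is far easier to learn.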

Literature Review

After my initial failure, I decided to fire up Google Scholar like my old days in graduate school and try to find some meaningful research on a similar subject. It turns out that a lot of subscription-based services like cable, Internet, and periodical publications fund both academic and industry research in churn prediction. There isn’t any apparent research on SaaS specifically, but the foundations of predicting churn for a newspaper subscription should be similar. In fact, I thought that SaaS should have far superior data to use in prediction.

The research says that the most successful models are Logistic Regression and Random Forests. Several studies place Support Vector Machines between these two in effectiveness. Neural networks are another popular option with varied but solid results [1]. My later experiments used some of this insight and focused on the models that showed the most promise.

Experiments

I decided to use Weka to try a lot of different experiments quickly on the same data set. I was careful about separating my data into strict train and test segments, but I was happy to use various datasets to experiment with different learning hypotheses. Weka performed beautifully for me and came with an additional benefit: the JVM. I was processing some of the data transformation in Ruby, and I wanted to integrate this system into a Rails application. JRuby made working with Weka and Rails incredibly easy.

It was easy to transform my existing data into ARFF files for Weka, and I managed to test nearly all of the relevant classifiers that Weka supports. I did not test SVMs or Neural Networks, for reasons I explain in the next section. Bayesian Nets and AdaBoost show promise as churn classifiers in my experiments, but they don’t show up much in the literature.

Classifier Comparisons and Selection

Random Forests dominate the research landscape as the model of choice, and my experiments bear that out. Random Forests win. A lot. The intuition is two-fold. First, Random Forests are extremely robust without explicit feature selection; they do their own version of it that works well for this problem. Second, Random Forests are built from decision trees, which classify data well across a small number of known classes and are especially effective when certain feature values correlate highly with certain classes. Decision trees (and Logistic Regression) share a final benefit: they show how the classifier works internally in an understandable way. If your customers churn when they use feature X only once per month, you can see that in how the decision tree is structured. This is powerful insight.
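As a hedged illustration of that built-in feature selection, here is a small scikit-learn sketch on synthetic data (the feature names are made up): the forest's importance scores single out the one column that actually drives churn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
feature_x_uses = rng.poisson(3, n)     # the column that drives churn here
noise_a = rng.normal(size=n)           # irrelevant columns the forest
noise_b = rng.normal(size=n)           # should largely ignore
X = np.column_stack([feature_x_uses, noise_a, noise_b])
y = (feature_x_uses <= 1).astype(int)  # churn when feature X is used rarely

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
# The informative column should receive the highest importance score.
```

Reading off `feature_importances_` is exactly the kind of insight described above: it tells you which customer behaviors the model leans on.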

Logistic Regression works really well, if not quite as well as Random Forests. It not only presents a model that explains how it works, but also reports how confident it is that a customer falls into one class or the other.
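A quick sketch of that extra information, again on invented data: scikit-learn's `predict_proba` exposes a per-customer churn probability rather than just a hard label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
logins = rng.integers(0, 20, 800).reshape(-1, 1)  # hypothetical feature
churned = (logins.ravel() < 5).astype(int)        # toy churn rule

model = LogisticRegression().fit(logins, churned)

# Churn probability for a very inactive vs. a very active customer.
p_inactive = model.predict_proba([[1]])[0, 1]
p_active = model.predict_proba([[18]])[0, 1]
```

The probabilities give you a ranking of customers by risk, which matters later when deciding who to reach out to.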

I didn’t use Neural Networks in any experiments, in large part because they weren’t something I could run out of the box with my data in Weka, and they famously lend no insight into how the classification works. Neural Nets are a black box. Ideally, my classification engine for Keepify will provide customers with more insight than classification alone.

Support Vector Machines are a very cool family of classifiers that optimize a maximum-margin hyperplane. They are a sexy choice, but their performance does not quite match Random Forests, they don’t show their work (like Neural Nets), and they are really slow. I can generate predictions for thousands of customers with hundreds of features using a Random Forest in a few seconds. An SVM might take minutes or worse.

In the end, I decided to use Random Forests and Logistic Regression. I do plan to experiment further with AdaBoost, however, as it is effective at eliminating bias from data sets that have classes with low prevalence.
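For those planned AdaBoost experiments, a minimal sketch of the low-prevalence setting looks like this (synthetic data; in my toy setup only about 5% of customers churn, and ROC AUC is used because plain accuracy is misleading when one class is rare):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 2000
usage = rng.normal(10, 3, n)                             # hypothetical feature
churned = (usage + rng.normal(0, 1, n) < 5).astype(int)  # rare class (~5%)
X = usage.reshape(-1, 1)

ada = AdaBoostClassifier(n_estimators=50, random_state=7)
# ROC AUC is a better yardstick than accuracy for imbalanced classes.
auc_scores = cross_val_score(ada, X, churned, cv=5, scoring="roc_auc")
mean_auc = auc_scores.mean()
```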

[1] http://cjou.im.tku.edu.tw/bi2009/DM-usage.pdf

How To Predict Customer Churn Using Machine Learning

This is the first post in a series about churn and customer satisfaction. If you want churn prediction and management without more work, check out Keepify. If you want more details, email away.

Last year, Rob Walling gave a great talk at LessConf that helped me really click with the idea of Customer Lifetime Value. It also connected a few parts of my past with an interesting idea I could tinker with. He spoke at length about Hubspot’s Customer Happiness Index and reducing churn to improve CLV. It turns out that CLV varies inversely with churn. Churn prediction could dramatically move the needle for a lot of online businesses. It sounded like a cool thing to explore, but how do you predict when someone will leave? What factors are in play? What methods work?

The first step is data collection. You need to start collecting digital information about customer purchases, actions, support, and visits. More data probably won’t hurt you, but I’ve got experiments that show sparse data can work well. You likely need to collect 30-90 days of data before performing your first attempts at classification. This depends quite a bit on customer activity and the number of customers you see in that time period. The data can be in many formats, but you need at least a customer id, event id, and timestamp.
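A hedged sketch of that minimal schema and a per-customer rollup, using pandas (the column names and events are my own invention, not a required format):

```python
import pandas as pd

# Raw event log: one row per customer action.
events = pd.DataFrame(
    [
        ("cust_1", "login",      "2013-01-02T09:15:00"),
        ("cust_1", "report_run", "2013-01-02T09:20:00"),
        ("cust_2", "login",      "2013-01-05T14:02:00"),
        ("cust_1", "login",      "2013-01-09T08:55:00"),
    ],
    columns=["customer_id", "event_id", "timestamp"],
)
events["timestamp"] = pd.to_datetime(events["timestamp"])

# Roll raw events up to one row per customer for the classifier.
features = events.groupby("customer_id").agg(
    total_events=("event_id", "count"),
    last_seen=("timestamp", "max"),
)
```

Whatever tool you pick, this event-log-to-feature-table step is the common denominator.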

There are many choices for basic classification like this. You could use Excel, R, NumPy, Weka, Mahout, or any of a number of options. Most will be well served using a familiar tool. Failing that, try out Weka for its GUI, community, and documentation.

Once you have tools in mind, you’ll need to transform your data for the tool to consume. For Weka, this means ARFF files, which are well documented but can also be frustrating; the ARFF parser is less than descriptive about parsing problems. For R or Python you will probably use CSV or JSON. The details of the transformation depend on your tool of choice as well as the source and destination formats. Transformation also implies some kind of feature selection (which you should experiment with, but which is beyond the scope of this post). This will likely require additional computation from the source events.
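Since ARFF is plain text, the transformation can be as simple as writing the header and rows yourself. A minimal sketch (the attribute names are invented):

```python
# ARFF = an @relation line, @attribute declarations, then @data rows.
rows = [
    (12, 0, "no"),   # (logins, support_tickets, churned)
    (2, 3, "yes"),
]
lines = [
    "@relation churn",
    "@attribute logins numeric",
    "@attribute support_tickets numeric",
    "@attribute churned {yes,no}",
    "@data",
]
for logins, tickets, churned in rows:
    lines.append(f"{logins},{tickets},{churned}")
arff_text = "\n".join(lines)
```

Nominal attributes (like the class) list their values in braces; numeric attributes are just declared `numeric`.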

Good tools and well-organized data make running experiments really easy these days. You should be able to try out all kinds of linear classifiers without any additional effort, but be aware that the most popular models for this task are Logistic Regression and Random Forests.
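A small head-to-head sketch of those two models with scikit-learn, on synthetic data (the churn rule is invented, and on real data feature engineering matters far more than this comparison suggests):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 1000
logins = rng.poisson(6, n)
tickets = rng.poisson(1, n)
X = np.column_stack([logins, tickets])
y = ((logins < 4) | (tickets > 3)).astype(int)  # toy churn rule

# Mean cross-validated accuracy for each candidate model.
lr_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
rf_acc = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=3), X, y, cv=5
).mean()
```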

Finally, you need to take the output of each classification and use it to predict customer churn. A test data set or cross-fold validation experiment will give you a clear idea of the efficacy of your model, but since we are interested in predicting future churn, you will want to target your marketing efforts at the apparent false positives: the users your model thinks will churn based on the data available, even though they haven’t left yet.
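Putting it together, a hedged sketch of that last step: train on historical customers, score the current ones, and flag those with a high predicted churn probability for outreach (all data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
logins = rng.poisson(6, 600).reshape(-1, 1)  # hypothetical feature
churned = (logins.ravel() < 3).astype(int)   # toy churn rule

# Train on "historical" customers, then score the "current" ones.
model = RandomForestClassifier(n_estimators=100, random_state=5)
model.fit(logins[:400], churned[:400])

churn_prob = model.predict_proba(logins[400:])[:, 1]
at_risk = np.flatnonzero(churn_prob > 0.5)   # candidates for outreach
```

The `at_risk` indices are the customers to hand to marketing before they actually leave.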