Last year, Rob Walling gave a great talk at LessConf that made the idea of Customer Lifetime Value (CLV) finally click for me. It also connected a few parts of my past with an interesting idea I could tinker with. He spoke at length about HubSpot’s Customer Happiness Index and reducing churn to improve CLV. It turns out that CLV varies inversely with churn: the slower customers leave, the longer each one keeps paying. Churn prediction could dramatically move the needle for a lot of online businesses. It sounded like a cool thing to explore, but how do you predict when someone will leave? What factors are in play? What methods work?
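To make the inverse relationship concrete, here is a minimal sketch using a common simplification: if churn is constant, the expected customer lifetime is one over the churn rate, so CLV is roughly revenue per period divided by churn. This formula, the function name `simple_clv`, and the numbers are my own illustration, not something taken from the talk.

```python
def simple_clv(monthly_revenue_per_customer: float, monthly_churn_rate: float) -> float:
    """Rough CLV under the assumption of a constant monthly churn rate."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    # With constant churn, the expected lifetime in months is 1 / churn rate.
    expected_lifetime_months = 1 / monthly_churn_rate
    return monthly_revenue_per_customer * expected_lifetime_months


# Hypothetical numbers: $50 per month per customer.
clv_at_5_percent = simple_clv(50.0, 0.05)  # about 1000
clv_at_4_percent = simple_clv(50.0, 0.04)  # about 1250
```

Under these assumptions, shaving a single percentage point off monthly churn lifts CLV by about a quarter, which is why predicting churn is worth the trouble.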
The first step is data collection. You need to start recording customer purchases, actions, support requests, and visits. More data probably won’t hurt you, but I’ve got experiments that show sparse data can work well. You will likely need 30-90 days of data before your first attempts at classification; the exact amount depends quite a bit on customer activity and on how many customers you see in that period. The data can be in many formats, but each record needs at least a customer ID, an event ID, and a timestamp.
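As a rough sketch of that minimal record, assuming events arrive as Python objects, something like the following would do. The `Event` class and the `observation_days` helper are hypothetical names, and I am assuming the event ID names the kind of event (a login, a purchase) rather than a unique row key.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    customer_id: str
    event_id: str        # assumed here to name the kind of event, e.g. "login"
    timestamp: datetime


def observation_days(events):
    """Number of whole days between the earliest and latest event."""
    stamps = [e.timestamp for e in events]
    return (max(stamps) - min(stamps)).days


# Made-up sample events for illustration.
sample_events = [
    Event("c1", "login", datetime(2024, 1, 1, 9, 0)),
    Event("c2", "purchase", datetime(2024, 1, 15, 14, 30)),
    Event("c1", "support_ticket", datetime(2024, 1, 31, 9, 0)),
]
window = observation_days(sample_events)
enough_history = window >= 30  # lower end of the 30-90 day guideline
```

A check like `enough_history` is only a sanity test on the window length; whether the data is actually sufficient still depends on how active your customers are.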
There are many tool choices for basic classification like this. You could use Excel, R, NumPy, Weka, Mahout, or any of a number of other options. Most people will be best served by a tool they already know. Failing that, try out Weka for its GUI, community, and documentation.
Once you have a tool in mind, you’ll need to transform your data into something it can consume. For Weka, this means ARFF files, which are well documented but can also be frustrating: the ARFF parser is less than descriptive about parsing problems. For R or Python you will probably use CSV or a JSON-like format. The details of the transformation depend on your tool of choice as well as the source and destination formats. Transformation also implies some kind of feature selection (which you should experiment with, but which is beyond the scope of this post), and that will likely require additional computation over the source events.
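Here is a minimal sketch of that kind of transformation, using only the Python standard library. It rolls raw events up into one row per customer and writes CSV. The two features (`event_count` and `days_since_last_event`) are placeholders I picked for illustration, not a recommendation, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

FIELDS = ["customer_id", "event_count", "days_since_last_event"]


def build_features(events, as_of):
    """Aggregate (customer_id, event_id, timestamp) tuples into one row per customer."""
    counts = defaultdict(int)
    last_seen = {}
    for customer_id, _event_id, ts in events:
        counts[customer_id] += 1
        if customer_id not in last_seen or ts > last_seen[customer_id]:
            last_seen[customer_id] = ts
    return [
        {
            "customer_id": cid,
            "event_count": counts[cid],
            "days_since_last_event": (as_of - last_seen[cid]).days,
        }
        for cid in sorted(counts)
    ]


def to_csv(rows):
    """Serialize feature rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up events for illustration.
raw_events = [
    ("c1", "login", datetime(2024, 1, 1)),
    ("c1", "purchase", datetime(2024, 1, 20)),
    ("c2", "login", datetime(2024, 1, 5)),
]
feature_rows = build_features(raw_events, as_of=datetime(2024, 1, 31))
csv_text = to_csv(feature_rows)
```

Keeping the aggregation separate from the serialization makes it easier to swap the output format later, for example if you decide to target ARFF instead.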
Good tools and well-organized data make running experiments really easy these days. You should be able to try out all kinds of classifiers without much additional effort, but be aware that the most popular models for this task are Logistic Regression (a linear model) and Random Forests (an ensemble of decision trees).
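To show what one of these models is doing under the hood, here is a toy logistic regression trained with plain batch gradient descent, standard library only. In practice you would use the implementation that ships with your tool; the function names, the hyperparameters, and the tiny made-up data set below are all assumptions for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - label  # gradient of the log loss with respect to z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_proba(model, features):
    """Estimated probability of churn for one customer's feature vector."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


# Made-up features scaled to 0..1: [activity level, time since last visit].
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
y = [0, 0, 0, 1, 1, 1]  # 1 = churned
model = fit_logistic(X, y)
p_quiet_customer = predict_proba(model, [0.1, 0.9])
p_active_customer = predict_proba(model, [0.9, 0.1])
```

On this toy data the quiet customer ends up with a higher churn probability than the active one, which is all the example is meant to demonstrate.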
Finally, you need to take the output of the classifier and use it to predict customer churn. A held-out test set or a k-fold cross-validation experiment will give you a clear idea of how well your model performs. But since we are interested in predicting future churn, you will want to target your marketing efforts at the apparent false positives: customers the model labels as churners who have not actually left yet. Those users are the ones your model thinks will churn based on the data available.
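Pulling out those apparent false positives can be sketched in a few lines. This assumes you already have a churn probability per customer from whatever model you trained; the dictionary keys, the 0.5 threshold, and the `likely_churners` name are hypothetical.

```python
def likely_churners(customers, threshold=0.5):
    """Customers the model flags as churners who have not actually churned yet.

    Each customer is a dict with "customer_id", "churn_probability",
    and "has_churned" keys. Results are sorted with the riskiest first.
    """
    flagged = [
        c for c in customers
        if c["churn_probability"] >= threshold and not c["has_churned"]
    ]
    return sorted(flagged, key=lambda c: c["churn_probability"], reverse=True)


# Made-up scores for illustration.
scored = [
    {"customer_id": "c1", "churn_probability": 0.91, "has_churned": True},
    {"customer_id": "c2", "churn_probability": 0.72, "has_churned": False},
    {"customer_id": "c3", "churn_probability": 0.15, "has_churned": False},
    {"customer_id": "c4", "churn_probability": 0.88, "has_churned": False},
]
outreach_list = [c["customer_id"] for c in likely_churners(scored)]
```

Sorting by probability lets you spend a limited retention budget on the customers the model is most worried about first.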