Hidden Technical Debts in Churn Prediction Systems

January 19, 2026

This poster was accepted at the Montreal AI Symposium 2018. It presents our framework for addressing critical gaps in production churn prediction systems.

Co-Authors: Nikhil Saldanha, Shivam Shakti, Raman Shrivastava

Abstract

Traditionally, businesses have defined churn simply as the loss of customers of a product or service they provide. This definition is vague: it fails to specify who counts as a customer and which event constitutes churn. Using a lagging indicator of churn, such as cancellation of a subscription, renders the prediction inactionable, since by then it is too late to prevent the churn. Further, if the business provides a non-contractual service, identifying an event that aligns well with both business and user intent becomes tricky.

We have improved on this traditional definition by creating a framework for defining churn across various businesses. We start by defining the active users: those who have performed a key revenue-generating event for the business in the past. The next step is to select the critical event for churn; we formulate churn as the absence of this event over a period of time. However, this definition has its own drawback: a constant time interval applied uniformly across users may mislabel users whose natural activity cadences differ. To address this, we can formulate the problem as a ranking problem in which users are ranked by the time taken to perform the critical event. Users who take longer to perform the critical event are likely to be "more churned" than those who take less time.
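Both formulations can be sketched in a few lines. The user IDs, event timestamps, and 30-day window below are hypothetical placeholders for illustration, not values from the poster:

```python
from datetime import datetime, timedelta

# Hypothetical event log: user -> timestamps of the critical event
events = {
    "u1": [datetime(2018, 1, 1), datetime(2018, 1, 20)],
    "u2": [datetime(2018, 1, 5)],
    "u3": [datetime(2018, 2, 10)],
}

now = datetime(2018, 3, 1)
churn_window = timedelta(days=30)

# Binary formulation: churned if no critical event in the last window
churned = {u: (now - max(ts)) > churn_window for u, ts in events.items()}

# Ranking formulation: order users by time since the last critical event;
# a longer gap means "more churned"
ranking = sorted(events, key=lambda u: now - max(events[u]), reverse=True)
```

The ranking variant avoids committing to a single cutoff that may not suit every user's activity cadence.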

We propose to model churn prediction as a supervised machine learning problem in which we predict the churn risk (the probability that a user will churn) given time-series data of the user's behavior and the model parameters. Features describing user behaviour on the platform are fed into an RNN. To feed the data into an RNN, the features are aggregated over smaller time windows sized according to the frequency of user activity. Absence of the critical event in a period following the prediction is labeled as churn; presence is labeled as not churn.
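The windowing step can be illustrated with a minimal sketch. The per-day activity log, the 7-day window, and the (event count, total value) feature pair are illustrative assumptions, not the poster's exact feature set; the output is a fixed-length sequence suitable for one RNN timestep per window:

```python
from collections import defaultdict

def aggregate_windows(events, window_days, num_windows, now_day):
    """Bucket raw (day, value) events into fixed-size windows so the
    resulting sequence can be fed to an RNN one window per timestep."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, value in events:
        age = now_day - day
        idx = age // window_days
        if 0 <= idx < num_windows:
            totals[idx] += value
            counts[idx] += 1
    # Oldest window first; one (event count, total value) pair per timestep
    return [(counts[i], totals[i]) for i in reversed(range(num_windows))]

# e.g. minutes of activity logged per day for one user
user_events = [(1, 30.0), (2, 45.0), (9, 10.0), (15, 5.0)]
sequence = aggregate_windows(user_events, window_days=7, num_windows=3, now_day=16)
```

Here the sequence shows activity tapering off toward the present, which is exactly the kind of temporal pattern a snapshot aggregate would flatten away.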

The churn prediction by itself is not very useful to a business: ultimately, the company's goal is to retain these users by taking a set of actions. To take actions effectively, the right context must come from the predictions themselves. In addition, the cost of retention must be considered, since budgets are limited. All of this means that churn predictions need to be trustworthy and correlate well with the features used to make them in order to be actionable. In the past, LIME has been proposed as a way to attribute predictions to specific features in simple feedforward networks and CNNs and thereby build trust in the predictions. We propose variations of LIME and DeepLIFT better suited to time-series tabular data for RNNs.
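As a rough illustration of the idea, not the exact LIME/DeepLIFT variants proposed here, an occlusion-style attribution perturbs one timestep at a time and measures how the churn score moves; `toy_model` below is a stand-in for a trained RNN and its scoring rule is invented for the example:

```python
def occlusion_attributions(model, sequence, baseline=0.0):
    """Perturbation-based attribution for a time-series input: replace
    one timestep with a baseline value and record how much the model's
    churn score changes. Larger magnitude = more influential timestep."""
    base_score = model(sequence)
    attributions = []
    for t in range(len(sequence)):
        perturbed = list(sequence)
        perturbed[t] = baseline
        attributions.append(base_score - model(perturbed))
    return attributions

# Stub "model": churn score rises when recent activity is low
def toy_model(seq):
    recent = seq[-2:]
    return 1.0 - sum(recent) / (2 * 10.0)

scores = occlusion_attributions(toy_model, [8.0, 6.0, 1.0, 0.0])
most_influential = max(range(len(scores)), key=lambda t: abs(scores[t]))
```

The attribution correctly points at the recent drop in activity, which is the kind of per-user context a retention team can act on.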

Conventionally, models are evaluated using metrics such as accuracy, ROC AUC, and F1 score on a held-out set. While these metrics give a good overview of model performance, in practice they tend to be misleading because of business constraints. Since churn predictions are a means to take preventive actions, their success depends on the success of those actions: the model may predict very accurately for users who are the hardest to retain, but it is ultimately limited by the effectiveness of the actions. Very often the model performs well on low-value users but fails on high-value users, resulting in a net loss of revenue. Models with good average performance may have hidden failure modes that are especially insidious in production, where they can introduce long-term biases into the model.

Slicing metrics allows us to analyze model performance at a more granular level. Usually, metrics are sliced by a particular feature value, which highlights the model's performance on the subset of data with that value. We propose that, along with slicing metrics by feature value, slicing metrics by customer segment is crucial to the evaluation of churn prediction models because of the variation in data distribution observed across these segments. Slicing this way allows us to develop specific features for specific customer segments and to reduce negative bias towards outlying or minority segments, which often contain the highest-value users of a product.
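A minimal sketch of segment-sliced evaluation, using plain accuracy for simplicity and hypothetical `smb`/`enterprise` segment labels; real deployments would slice richer metrics such as ROC AUC the same way:

```python
from collections import defaultdict

def sliced_accuracy(y_true, y_pred, segments):
    """Accuracy computed separately per customer segment, alongside
    the overall figure, to surface segment-level failure modes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, s in zip(y_true, y_pred, segments):
        correct[s] += int(t == p)
        total[s] += 1
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment

# Hypothetical labels: the model looks tolerable overall but fails
# completely on the (high-value) enterprise segment
y_true =   [1, 0, 1, 0, 1, 1, 0, 1]
y_pred =   [1, 0, 1, 0, 0, 0, 1, 0]
segments = ["smb"] * 4 + ["enterprise"] * 4

overall, per_segment = sliced_accuracy(y_true, y_pred, segments)
```

The overall number alone would hide that every enterprise prediction is wrong, which is precisely the failure mode segment slicing is meant to expose.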

Key Contributions

1. Classification → Ranking

Traditional churn models output binary predictions or probability scores evaluated against accuracy metrics. We argue this misses the point: what businesses actually need is a ranking of the users who are most likely to churn and also worth retaining. A user with an 80% churn probability isn't actionable information until you know how they compare to other users and what intervention resources are available.
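One way to operationalize this, under the simplifying assumption of a flat per-user retention cost so the budget is just a head count, is to rank users by expected value at risk (churn probability times customer value) and spend the budget from the top:

```python
def prioritize(users, budget):
    """Rank users by expected value at risk (churn probability times
    customer value) and select as many as the intervention budget
    allows, assuming a hypothetical flat per-user retention cost."""
    ranked = sorted(users, key=lambda u: u["churn_prob"] * u["value"], reverse=True)
    return [u["id"] for u in ranked[:budget]]

users = [
    {"id": "a", "churn_prob": 0.80, "value": 10},   # high risk, low value
    {"id": "b", "churn_prob": 0.30, "value": 500},  # moderate risk, high value
    {"id": "c", "churn_prob": 0.95, "value": 200},  # high risk, high value
]
targets = prioritize(users, budget=2)
```

Note that the highest raw churn probability alone would not have surfaced user "b", whose expected value at risk outranks user "a" despite a much lower probability.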

2. LSTM on Time-Series Events

Rather than using static feature snapshots, we model user behavior as sequences of interactions over time. This captures patterns like declining engagement, irregular usage, or behavioral shifts that static features miss. The temporal structure of user behavior contains signals that aggregate statistics destroy.

3. LIME for Multivariate Time-Series

Black-box models are deployment liabilities. We adapted LIME (Local Interpretable Model-agnostic Explanations) for multivariate time-series inputs, allowing stakeholders to understand why a specific user was flagged as high-risk. This enables customer success teams to take targeted, meaningful actions rather than generic retention campaigns.

4. Segment-Aware Evaluation

Average metrics across all users can hide critical failures. A model performing well "on average" might completely fail on high-value enterprise customers while excelling at predicting churn for low-value segments that wouldn't receive intervention anyway. We advocate for evaluating model performance across meaningful business segments.

Why This Matters

Churn prediction systems often accumulate technical debt that isn't visible in standard evaluation metrics. A model can achieve high accuracy while being practically useless—or worse, systematically biased against the customer segments that matter most to the business.

The core insight is that churn prediction is fundamentally a resource allocation problem, not a classification problem. You're deciding where to spend limited retention budget and customer success time. This reframing changes everything about how you build, evaluate, and deploy these systems.

Model explainability isn't just a nice-to-have—it's what makes the difference between a model that generates dashboards and one that drives action. When a customer success manager can see that a user's churn risk spiked because they stopped opening audiobooks after consistently using them for six months, they can craft a meaningful intervention.

These ideas have become more mainstream since 2018, with model explainability and fairness now considered essential components of production ML systems. The lessons about segment-aware evaluation apply broadly to any ML system where aggregate metrics can mask disparate performance across subgroups.