Feature selection in churn prediction - machine-learning

I have built a churn prediction model on data from an e-commerce company. In this model, the churn criterion is being inactive for 12 months from the last available date in the data. While building the model, I created some calculated features to capture customer activity. I added the customers' last-3-month and last-6-month activity as binary features. Their correlations with churn are 0.5 and 0.7 respectively. When I checked other churn prediction models on the web, I saw similar features in some projects, while others do not include such a feature.
My model's accuracy is around 90%, and I am concerned that I may be doing something wrong by feeding the customers' last 3 and/or 6 months of activity into the model. Moreover, should I be worried about the correlation between the 3-month and 6-month activity features? I used PCA for feature extraction, keeping 0.95 of the variance, but is that enough to avoid the correlation problem?

Related

Best algorithm for time series prediction?

I would like to ask you for some suggestions about a time series prediction problem. In particular, I have to predict on a daily basis the total water demand in a certain area, creating a model based on 4 CSV files containing:
water demand in aggregated form (time series with daily granularity, 2 years data)
amount of water entering the area's cistern (time series with daily granularity, 2 years data)
amount of water leaving the area's cistern (time series with daily granularity, 2 years data)
water request from 4,000 measurements points across the area (time series with daily granularity, 2 years data).
In your opinion, what is the best model for a good prediction of the water demand in the area, using the available data and features? I can only think of LSTMs or MLPs; I don't know whether something like ARIMA or SARIMA could be useful in this case, given that I have many features but not many days.
Thank you in advance for your help :)
Forecasting is inevitably a domain-specific problem because you can often make better decisions about model and methods when you know something about the system or process you are trying to forecast.
There are quite a few academic papers on forecasting domestic water demand which you could look at if you have access:
E.g.
Demand Forecasting for Water Distribution Systems by Chen and Boccelli (2014)
Urban Water Demand Forecasting: Review of Methods and Models by Donkor et al. (2014)
Predicting water demand: a review of the methods employed and future possibilities by de Souza Groppo et al. (2019)
I'm not an expert in this domain, so you should probably wait for someone who is to answer the question, but I think using an auto-regressive model (e.g. ARIMA), as you have suggested, is a good start: demand is essentially aggregate human activity, which is inherently driven by daily and weekly routines and by seasonal effects.
There are various routines for fitting such models to data. Jason Brownlee has a nice tutorial using Python's statsmodels.tsa package.
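To make the auto-regressive idea concrete without any external dependencies (in practice you would reach for something like statsmodels' SARIMAX instead), here is a toy sketch that fits y_t = c + a*y_{t-1} + b*y_{t-7} by least squares, where the weekly lag plays the role of SARIMA's seasonal term. The synthetic series with weekly seasonality stands in for the real demand data:

```python
import math
import random

# Two years of made-up daily demand with a weekly cycle plus noise.
random.seed(1)
y = [100 + 10 * math.sin(2 * math.pi * t / 7) + random.gauss(0, 1)
     for t in range(730)]

# Design matrix rows [1, y_{t-1}, y_{t-7}] and targets y_t.
X = [[1.0, y[t - 1], y[t - 7]] for t in range(7, len(y))]
z = [y[t] for t in range(7, len(y))]

# Solve the 3x3 normal equations (X^T X) beta = (X^T z) by Gaussian elimination.
A = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
b = [sum(r[i] * zt for r, zt in zip(X, z)) for i in range(3)]
for i in range(3):
    p = A[i][i]
    for j in range(i + 1, 3):
        f = A[j][i] / p
        A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
        b[j] -= f * b[i]
beta = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, 3))) / A[i][i]

# One-step-ahead forecast from the fitted coefficients.
forecast = beta[0] + beta[1] * y[-1] + beta[2] * y[-7]
print(f"coefficients: {[round(v, 3) for v in beta]}, forecast: {forecast:.1f}")
```

The same structure (own lags plus seasonal lags) is what SARIMA estimates properly, with differencing and moving-average terms on top; the inflow/outflow series could then enter as exogenous regressors.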
You could also see what people have used for residential energy consumption forecasting as the problem is probably very similar to water demand forecasting.

Is there a way to predict that a customer will churn in next 30 days

I have trained a classifier that classifies customers as churn or not-churn; the data contains transaction amounts, last transaction date, and demographics. The definition of churn is that no transaction is initiated by the customer for 12 months. The classifier works well. Now, coming to the question:
I want to see whether a customer will churn in the next 30 days or not. As per my understanding, an ML classifier just puts cases into their respective classes.
Please share any recommendations on whether we can classify into the future this way or not.
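One common way to get a forward-looking prediction is to reframe the label rather than the classifier: at a reference date, a customer "churns within 30 days" if their 12-month inactivity window closes inside the next 30 days. A hedged sketch of that labeling logic (dates and the 365-day approximation of 12 months are illustrative; at training time you would also check that no transaction actually occurred before the window closed):

```python
from datetime import date, timedelta

def churns_within_30_days(last_txn: date, ref: date) -> bool:
    # The customer becomes officially churned 12 months (approximated here
    # as 365 days) after their last transaction. They "churn within 30 days"
    # of the reference date iff that moment falls inside the next 30 days.
    window_closes = last_txn + timedelta(days=365)
    return ref < window_closes <= ref + timedelta(days=30)

ref = date(2024, 6, 1)
print(churns_within_30_days(date(2023, 6, 15), ref))  # window closes 2024-06-14
print(churns_within_30_days(date(2024, 1, 1), ref))   # window closes 2024-12-31
```

Training labels can then be generated by sliding the reference date through history, so the classifier learns "will churn within 30 days" directly instead of the static churned/not-churned status.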

How to count the number of multiplies in a TensorFlow model?

For many machine learning models it is desirable to compare models that have a similar number of multiplies, so as to limit the total inference time of the finished product (e.g. when such an algorithm is to be released on a mobile device). Many neural network libraries have functionality that supports easily calculating the total number of multiplies in a model; however, I could not find a similar feature for TensorFlow.
Hence my question is: how would one go about calculating the total number of multiply operations? Is there possibly a tool that someone has developed for that purpose?
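TensorFlow's profiler can report FLOP counts, but a back-of-the-envelope multiply count from the layer shapes alone is framework-independent and a useful cross-check. A sketch for dense and conv layers (the example architecture below is hypothetical):

```python
def dense_mults(in_features: int, out_features: int) -> int:
    # Each output unit takes one multiply per input feature.
    return in_features * out_features

def conv2d_mults(out_h: int, out_w: int, out_ch: int,
                 k_h: int, k_w: int, in_ch: int) -> int:
    # Each output position computes a k_h * k_w * in_ch dot product,
    # once per output channel.
    return out_h * out_w * out_ch * k_h * k_w * in_ch

# Example: a small conv net on a 28x28x1 input (made-up architecture).
total = (conv2d_mults(26, 26, 32, 3, 3, 1)      # 3x3 conv, 'valid' padding
         + conv2d_mults(24, 24, 64, 3, 3, 32)   # second 3x3 conv
         + dense_mults(24 * 24 * 64, 10))       # flatten -> 10-way dense
print(f"total multiplies per input: {total:,}")
```

Summing these per-layer counts over a real model's shapes gives the per-inference multiply budget; multiply-accumulate (MAC) counts are the same numbers, while FLOP counts from profilers are typically about twice as large because they include the additions.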

How easy/fast are support vector machines to create/update?

If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.
Generally, adding a class to a 1 vs. many SVM classifier requires retraining all classes. In case of large data sets, this might turn out to be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to a NB classifier requires training of the new class only).
However, according to your comment, which states that the data has tens of dimensions and up to thousands of samples, the problem is relatively small, so in practice an SVM retrain can be performed very fast (probably on the order of seconds to tens of seconds).
You need to give us more details about your problem, since there are many different scenarios: an SVM can be trained fairly quickly (I could train one in real time in a third-person shooter game without any latency), or training can take much longer (I have a face-detector case where training took an hour).
As a rule of thumb, the training time is proportional to the number of samples and the dimension of each vector.
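At the scale described, the "just retrain everything" cost is easy to measure empirically. A rough timing sketch with scikit-learn on synthetic data (the shapes mirror the tens-of-dimensions, thousands-of-samples comment; treat the resulting numbers as order-of-magnitude only):

```python
import time

import numpy as np
from sklearn.svm import SVC

# Made-up data: 3000 samples, 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 30))
y = rng.integers(0, 3, size=3000)       # three classes: apples/oranges/bananas
y_new = rng.integers(0, 4, size=3000)   # relabeled with a fourth class (tomatoes)

t0 = time.perf_counter()
SVC(kernel="rbf").fit(X, y)
t1 = time.perf_counter()
SVC(kernel="rbf").fit(X, y_new)         # full retrain including the new class
t2 = time.perf_counter()
print(f"3-class fit: {t1 - t0:.2f}s, 4-class refit: {t2 - t1:.2f}s")
```

On data of this size the full retrain typically lands in the seconds range, which is well inside a 5-minute on-demand budget; it is only at hundreds of thousands of samples that the retrain-everything approach starts to hurt.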

Weighted Naive Bayes Classifier in Apache Mahout

I am using a Naive Bayes classifier for sentiment analysis on customer support. Unfortunately, I don't have a large annotated data set in the customer-support domain, only a small amount of annotated data in that domain (around 100 positive and 100 negative examples). I also have the Amazon product review data set.
Is there any way I can implement a weighted Naive Bayes classifier using Mahout, so that I can give more weight to the small set of customer-support data and less weight to the Amazon product review data? I guess training on such a weighted data set would noticeably improve accuracy. Kindly help me with the same.
One really simple approach is oversampling, i.e. just repeat the customer-support examples in your training data multiple times.
Though it's not the same problem, you might get some further ideas by looking into the approaches used for class imbalance, in particular oversampling (as mentioned) and undersampling.
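The oversampling trick needs no classifier-level support at all, which is why it works with Mahout as-is: you just duplicate the in-domain examples until they carry roughly as much weight as the out-of-domain corpus. A minimal sketch (dataset contents and the resulting 10x factor are illustrative, not from Mahout):

```python
import random

# Made-up stand-ins: ~200 labeled in-domain examples vs a large review corpus.
support_data = [("great help", "pos"), ("useless reply", "neg")] * 100
amazon_data = [("nice product", "pos"), ("broke fast", "neg")] * 1000

# Repeat the small set so both sources contribute comparable total counts.
factor = max(1, len(amazon_data) // len(support_data))
training = amazon_data + support_data * factor
random.shuffle(training)
print(f"oversampling factor: {factor}, training size: {len(training)}")
```

Since Naive Bayes just counts word-class co-occurrences, repeating an example k times is exactly equivalent to giving it weight k, so this reproduces a weighted Naive Bayes without touching the training code.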
