Churn rate prediction based on sequence data - machine-learning

I am trying to build a machine learning model that can predict whether a given user will churn, based on that user's historical static and dynamic data. The data looks like below:
1) timestamp, user1, user info (static), event info(dynamic), 0
2) timestamp, user1, user info (static), event info(dynamic), 0
3) timestamp, user1, user info (static), event info(dynamic), 1
4) timestamp, user2, user info (static), event info(dynamic), 0
5) timestamp, user2, user info (static), event info(dynamic), 0
6) timestamp, user2, user info (static), event info(dynamic), 0
7) timestamp, user2, user info (static), event info(dynamic), 0
8) timestamp, user2, user info (static), event info(dynamic), 0
There are many user_ids in the dataset, and each corresponds to a variable-length sequence. The features for each user come in two parts: the user info, which can be regarded as constant for each user, and the event info, which includes features that change over time.
Please let me know how I should approach such a problem with machine learning or deep learning; ideally with a detailed step-by-step tutorial in PyTorch.

"user_id" is divided into two parts "user_info" and "event_info" where "user_info" is constant part so we need not to worry about it and "event_info" is dynamic in nature which include features which sometimes contain null/NA values and other time have legit values.
Because we don't know the nature of features, we're taking both cases continuous and categorical.
Cases:
If dynamic features are continuous values-
you can opt for zero, -1 or any other integer (depends upon the present values of feature(s)) whenever you find null/NA value in the features - by this your model(ML or DL) will able to work efficiently with the dynamic features.
Second case when its categorical values-
For this just include one more category in each feature(s) which represents, may be the Null/NA value.
That's how i handled this situation back and it worked pretty well. and for this you don't need any pytorch tutorial, python will do the work.
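A minimal pandas sketch of both cases (the column names event_value and event_category are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical event data; column names are placeholders.
df = pd.DataFrame({
    "event_value": [0.5, np.nan, 1.2, np.nan],       # continuous dynamic feature
    "event_category": ["click", None, "view", None]  # categorical dynamic feature
})

# Case 1: continuous feature -- impute a sentinel value such as -1
# (pick something outside the feature's real range).
df["event_value"] = df["event_value"].fillna(-1)

# Case 2: categorical feature -- add an explicit "missing" category.
df["event_category"] = df["event_category"].fillna("MISSING")

print(df)
```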

Related

Choosing a content-based recommendation model for prediction

I am trying to build a content-based recommendation model, but I am stuck on which algorithm to choose. My features are user_id, user_age, gender, location, show_id, show_type (e.g. series, movies, etc.), show_duration, user_watched_duration, genre, and rating. The model has to predict the top 5 recommended shows for a given user.
The ratio of shows to users is huge: there are only around 10k users, but roughly 150 shows are mapped to each user on average. So the total number of records is 10k x 150 = 1,500,000.
Now I am confused about which algorithm fits this scenario. I read that a content-based method is ideal for my case. But when I checked SVD from the surprise library, it only takes three features as its input Dataset - "user_id", "item_id", "rating" - when fitting the model. I also need to fit other features, such as user_watched_duration out of show_duration, and give preference when a user has fully watched a show. Similarly, I want the model to consider gender and age when recommending: for example, if young men (male, under 20 years old) have watched a show and given it a higher rating, then that show should be recommended to other users in the same young-men category.
Can I train a normal classification model like KNN for this? I thought of using a sparse matrix built with csr_matrix, with rows for user_id and columns for show_id, and then computing (user_show_matrix.T * user_show_matrix) so that I can get co-watch counts of shows. But the problem with this approach is that I cannot map the other features onto it, right?
So please suggest how to proceed. I have already done the data cleaning, label-encoded the categories, etc. Will I be able to use any classification algorithm for this? I'd appreciate any references on similar approaches. Thank you!
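For reference, a minimal sketch of the sparse construction described in the question, with toy IDs; note that user_show.T @ user_show yields a show-by-show co-occurrence matrix (users who watched both shows), not a per-user count:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction data: rows are users, columns are shows.
user_ids = np.array([0, 0, 1, 1, 2])
show_ids = np.array([0, 1, 1, 2, 0])
watched = np.ones(len(user_ids))  # 1 = user watched the show

user_show = csr_matrix((watched, (user_ids, show_ids)), shape=(3, 3))

# Entry (i, j) counts users who watched both show i and show j.
co_occurrence = user_show.T @ user_show
print(co_occurrence.toarray())
```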

Handling the device delete event in a rule chain to reduce the total device count at customer level

I am using the total count of devices as a "server attribute" at the customer entity level, which is in turn used by dashboard widgets such as doughnut charts. To keep this count up to date, I have a rule chain in place that handles the device addition/assignment event and increments the "totalDeviceCount" attribute at the customer level. But when a device is deleted/unassigned, I am unable to reach the customer entity using an "Enrichment" node, because the relation has already been removed by the time this event fires. This leaves me with the challenge of maintaining the right count for the widgets.
Has anyone come across similar requirement? How to handle this scenario?
What you could do is count your devices periodically, instead of tracking each individual addition/removal.
You can achieve this using the Aggregate Latest Node, where you indicate a period (say, every minute), the entity or devices you want to count, and the variable name you want to save the count to.
This node outputs a POST_TELEMETRY_REQUEST. If you are OK with that, just route it to a Save Timeseries node. If you want an attribute instead, route it to a Script Transformation Node and change the msgType to POST_ATTRIBUTE_REQUEST.

How to recognise input ranges that result in customer dissatisfaction?

I'm looking for a machine learning method to recognise input ranges that result in customer dissatisfaction.
For instance, assume that we have a database with each customer's age and gender, the date and time the customer stopped by, the person in charge of serving the customer, etc., and finally a number in the range 0 to 10 that stands for customer satisfaction (extracted from the customer's feedback).
Now I'm looking for a method to determine the input ranges that result in dissatisfaction. For example: male customers who are served by John between 10 and 12 pm are mostly dissatisfied.
I believe there is already some kind of clustering or neural network method for this purpose. Could you help me?
This is not a clustering problem. You have training data (the satisfaction score acts as a label), so it is a supervised task.
Instead, you may be looking for a decision tree.
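For instance, a quick sketch with scikit-learn's DecisionTreeClassifier on synthetic data (the column names only mirror the question's example):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data mirroring the question's features.
df = pd.DataFrame({
    "age":       [25, 40, 33, 52, 19, 60],
    "gender":    ["M", "F", "M", "F", "M", "F"],
    "hour":      [10, 14, 11, 16, 10, 9],
    "server":    ["John", "Anna", "John", "Anna", "John", "Anna"],
    "satisfied": [0, 1, 0, 1, 0, 1],  # derived from the 0-10 score
})

# One-hot encode the categorical columns; numeric ones pass through.
X = pd.get_dummies(df[["age", "gender", "hour", "server"]])
tree = DecisionTreeClassifier(max_depth=3).fit(X, df["satisfied"])

# The learned splits directly expose the input ranges tied to dissatisfaction.
print(export_text(tree, feature_names=list(X.columns)))
```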
There is more than one method to do this (correlation analysis, for example).
One simple way is to classify your data by the degree of satisfaction (the target).
Classes:
0-5 DISSATISFIED
6-10 SATISFIED
Then look for repetitions along features within each class.
For example:
If you are interested in one feature, e.g. the person who served the clients, then just get the most frequent name within each of the two classes, to get a result like "80% of dissatisfied clients were served by John".
If you are interested in more than one feature, e.g. the person who served the client AND the time of day, you can consider the pair of features as one and do the same thing as in the first case; then you will get something like "30% of dissatisfied clients were served by John between 10 and 11 am".
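A rough pandas sketch of those frequency counts, on toy data with assumed column names:

```python
import pandas as pd

# Toy data; real column names would come from the feedback database.
df = pd.DataFrame({
    "satisfaction": [2, 8, 3, 9, 1, 4, 10, 2],
    "server":       ["John", "Anna", "John", "Anna", "John", "John", "Anna", "John"],
    "hour":         [10, 14, 11, 16, 10, 10, 9, 11],
})

# Bin the target into the two classes described above.
df["class"] = (df["satisfaction"] >= 6).map({True: "SATISFIED", False: "DISSATISFIED"})
dissatisfied = df[df["class"] == "DISSATISFIED"]

# Single feature: share of each server among dissatisfied clients.
print(dissatisfied["server"].value_counts(normalize=True))

# Feature pair: treat (server, hour) as one combined feature.
pair_share = dissatisfied.groupby(["server", "hour"]).size() / len(dissatisfied)
print(pair_share.sort_values(ascending=False))
```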
What do you want to know - which service person to fire, what the best hours to provide the service are, or something else? In other words: what are your classes?
Provided you want to evaluate the service persons, the classes are the persons. For SVM (and I think the same applies to NNs) I would split all not purely numerical data into boolean attributes:
Age: unchanged, a number
Gender: male 1/0, female 1/0
Date: 7 features for the days of the week, possibly the number of days of experience of the service person, and an attribute for each special date, e.g. national holiday 1/0
Time: split the time span into reasonable ranges, e.g. 15 min; each range is a feature
Satisfaction: unchanged - a number from 0 to 10
With this model you could predict the satisfaction index for each service person for a given date, time, gender, and age.
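A sketch of that encoding in pandas, with assumed raw column names:

```python
import pandas as pd

# Toy raw records; "visit" is the assumed timestamp column.
df = pd.DataFrame({
    "age": [25, 40],
    "gender": ["male", "female"],
    "visit": pd.to_datetime(["2023-05-01 10:17", "2023-05-02 15:45"]),
    "satisfaction": [3, 9],
})

features = pd.DataFrame({"age": df["age"]})  # age stays a plain number
features["male"] = (df["gender"] == "male").astype(int)
features["female"] = (df["gender"] == "female").astype(int)

# 7 boolean day-of-week features.
dow = pd.get_dummies(df["visit"].dt.dayofweek, prefix="dow").astype(int)

# 15-minute time-of-day buckets, one boolean feature per bucket.
bucket = (df["visit"].dt.hour * 60 + df["visit"].dt.minute) // 15
buckets = pd.get_dummies(bucket, prefix="t15").astype(int)

features = pd.concat([features, dow, buckets], axis=1)
print(features)
```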
I guess you can try anomaly detection algorithms. Basically, if you consider the satisfaction level as the dependent variable, you can try to find the samples that lie away from the majority of the samples in Euclidean space. These outlying samples could signify dissatisfaction.
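For example, a rough sketch with scikit-learn's IsolationForest (one possible anomaly detector among several; the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins: X = encoded customer features, plus satisfaction scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
satisfaction = rng.integers(0, 11, size=200)

# Append satisfaction as a column so outliers reflect both inputs and target.
data = np.column_stack([X, satisfaction])

iso = IsolationForest(random_state=0).fit(data)
labels = iso.predict(data)  # -1 = outlier, 1 = inlier

# Inspect the outlying samples for patterns of dissatisfaction.
print(data[labels == -1][:5])
```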

How do CEP rules engines store time data?

I'm thinking about designing an event processing system.
The rules per se are not the problem.
What bogs me down is how to store the event data so that I can efficiently answer questions/facts like:
If number of events of type A in the last 10 minutes equals N,
and the average events of type B per minute over the last M hours is Z,
and the current running average of another metric is Y...
then
fire some event (or store a new fact/event).
How do Esper/Drools/MS StreamInsight store their time-dependent data so that they can efficiently calculate event stream properties? Do they just store it in SQL databases and continuously query them?
Do they preprocess the rules so they know beforehand what "knowledge" they need to store?
Thanks
EDIT: I found that what I want is called Event Stream Processing, and the Wikipedia example shows what I would like to do:
WHEN Person.Gender EQUALS "man" AND Person.Clothes EQUALS "tuxedo"
FOLLOWED-BY
Person.Clothes EQUALS "gown" AND
(Church_Bell OR Rice_Flying)
WITHIN 2 hours
ACTION Wedding
Still the question remains: how do you implement such a data store? The key is the "WITHIN 2 hours" constraint and the ability to process thousands of events per second.
Esper analyzes the rule and only stores derived state (aggregations etc., if any) and, if the rule needs it, a subset of the events. Esper allows defining contexts as described in the book by Opher Etzion and Peter Niblett, which I recommend reading. By specifying a context, Esper can minimize the amount of state it retains and can make queries easier to read.
It's not difficult to store the events that happen within a time window of a certain length. The problem gets more difficult if you have to consider additional constraints: there, an analysis of the rules is indicated, so that you can maintain sets of events matching the constraints.
Storing events in an (external) database will be too slow.
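As a toy illustration of keeping only the events inside the window in memory (just the basic idea, not how Esper is actually implemented):

```python
import time
from collections import deque


class SlidingWindowCounter:
    """Counts events of one type within a fixed time window, in memory."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps, oldest first

    def add(self, ts=None):
        self.events.append(ts if ts is not None else time.time())

    def count(self, now=None):
        now = now if now is not None else time.time()
        # Evict everything that fell out of the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events)


# Rule: fire when >= N events of type A occurred in the last 10 minutes.
type_a = SlidingWindowCounter(window_seconds=600)
type_a.add()
if type_a.count() >= 1:
    print("fire event")
```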

Classifying customer churn

For an academic project I have to analyse a customer database of an insurance company.
This insurance company would like to identify a couple of things. First of all, it wants to classify customers who leave the company, in order to make them offers and such.
They also would like to know which customers to target for upselling or cross-selling, as well as which customers are risky in terms of insurance claims.
So I am focusing on customer cancellations, as it seems the most important task.
The attributes provided by the insurance company are:
Bundled/Unbundled, Policy Status, Policy Type, Policy Combination, Issue Date, Effective Date, Maturity Date, Policy Duration, Loan Duration, Cancellation Date, Reason for cancellation, Total Premium, Splitter Premium, Partner ID, Agency ID, Country Agency, Zone ID, Agency potential, Sex Contractor, Birth Year Contractor, Job Contractor, Sex Insured, Job Insured, Birth Year Insured, Year Claim, Claim Status, Claim Provision, Claim Payments
The database is composed of ~200k records and there are many missing values for some attributes.
I started using RapidMiner to mine the dataset.
I cleaned the dataset a bit, removing incoherent or wrong values.
I then tried applying decision trees, adding a new attribute called isCanceled, derived from Policy Status (which can be issued, renewed, or cancelled), and using it as the label of the decision tree.
I tried changing every single parameter of the decision tree, but I either get a tree with only one leaf node and no splits, or a tree that is completely useless because its leaf nodes contain almost the same number of instances of the two classes.
This is getting really frustrating.
I'd like to know what the usual procedures for churn analysis are, possibly using RapidMiner. Can anybody help me?
In my experience, most of the time in data mining or machine learning work is spent cleaning, tidying, formatting, and understanding the data.
Assuming this has been done, then as long as there is a relationship between some or all of the attributes and the label to be predicted, it will be possible to perform some sort of churn analysis.
There are lots of ways to determine this relationship, of course, but a quick way is to try one of the Weight By operators. This outputs a set of weights for the attributes; those near 1 are potentially more predictive of the label.
If you determine there are attributes of value, you can use Decision Trees or another operator to build a model for prediction. The attributes you have are a mix of nominal and numeric types, so Decision Trees will work, and this operator is in any case easy to visualize. The tricky part is getting the parameters right; the way to do this is to observe the performance of the model on unseen data as the parameters are varied. The Loop Parameters operator can help with this.
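Outside RapidMiner, a quick analogue of this attribute-weighting step can be sketched in Python, e.g. with scikit-learn's mutual information scores (a related but not identical computation to the Weight By operators):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy stand-in for the cleaned policy table; real data would be loaded instead.
df = pd.DataFrame({
    "PolicyType":   ["A", "B", "A", "C", "B", "A"],
    "TotalPremium": [120.0, 300.0, 150.0, 80.0, 310.0, 95.0],
    "AgencyID":     [1, 2, 1, 3, 2, 1],
    "isCanceled":   [1, 0, 1, 0, 0, 1],
})

X = pd.get_dummies(df.drop(columns=["isCanceled"]))
y = df["isCanceled"]

# Higher scores suggest attributes more predictive of cancellation.
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False))
```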
