I have a dataset of customers from 2019-2022. My goal is to predict customer churn at a specific point in time, say exactly 3 months from the observation point, using logistic regression.
So if I look at my customer base at Jan-2022 (say month 0), I can tag churners as customers who churned exactly at month 3 (April) and non-churners as customers who were still active at month 3 (April).
The issue I was thinking of is that there could be a group of customers who churned at month 1 or month 2.
I wouldn't be able to include them in the training dataset because technically they did not churn at month 3 but before (Feb or March). Is excluding these customers the right approach to modelling this problem?
There are plenty of articles on modelling churn within a specific window (say, within 3 months) using logistic regression, but since I would be modelling churn at a specific point in time (exactly at 3 months), any guidance on this question is helpful. Thanks.
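To make the labeling question concrete, here is a minimal sketch of the point-in-time scheme described above. The customer records, field names, and the `churn_month` encoding are all invented for illustration, not taken from any real schema:

```python
# Hypothetical customer records: months after the Jan-2022 observation
# point at which each customer churned (None = still active at month 3).
customers = [
    {"id": 1, "churn_month": 3},     # churned exactly at month 3 -> positive
    {"id": 2, "churn_month": None},  # still active at month 3    -> negative
    {"id": 3, "churn_month": 1},     # churned before month 3     -> excluded
    {"id": 4, "churn_month": 2},     # churned before month 3     -> excluded
]

def label_for_point_in_time(customer, horizon=3):
    """Return 1/0 label for churn exactly at `horizon`, or None to exclude."""
    m = customer["churn_month"]
    if m is None:
        return 0      # survived past the horizon
    if m == horizon:
        return 1      # churned exactly at the horizon
    return None       # churned earlier: outside this task's definition

training = [(c["id"], label_for_point_in_time(c))
            for c in customers
            if label_for_point_in_time(c) is not None]
# training -> [(1, 1), (2, 0)]; customers 3 and 4 are dropped
```

This just makes explicit what "excluding these customers" means for the training set; whether exclusion is the right call (versus, say, a survival-analysis framing) is the open question.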
I want to build a model to help me pick a team in Fantasy Premier League. There are two parts to the problem:
1) Predicting player performances next week given the data for the last week and for the last season.
2) Using the result of the predictive model to build a team within a budget of 100 million euros.
For part 2), I was thinking of using either a 6D knapsack algorithm (two dimensions for budget and number of items, and the other four to ensure the appropriate number of players is picked from each category) or min-cost max-flow (though I am not sure how I can add categories or restrict the number of players from each category).
For part 1), the only examples and papers I have come across either use models to predict whether or not a team will win, or simply classify players as "good" or "bad". The second part of my problem requires that I predict a specific value for each player. At the moment I am thinking of using regression, but I am not sure what kind of features I should use.
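For a small candidate pool, the constrained selection in part 2) can even be brute-forced, which makes the budget and per-category constraints explicit before committing to a knapsack or flow formulation. A hedged sketch with invented players, prices, points, and a toy budget/quota:

```python
from itertools import combinations

# Hypothetical player pool: (name, category, price, predicted_points).
players = [
    ("GK1", "GK", 5.0, 4.2), ("GK2", "GK", 4.5, 3.8),
    ("DF1", "DF", 6.0, 5.1), ("DF2", "DF", 5.5, 4.9), ("DF3", "DF", 4.0, 3.0),
    ("FW1", "FW", 9.0, 7.5), ("FW2", "FW", 8.0, 6.8), ("FW3", "FW", 7.0, 5.5),
]

BUDGET = 25.0
QUOTA = {"GK": 1, "DF": 2, "FW": 1}   # exact count required per category

def best_team(players, budget, quota):
    """Exhaustively search all teams of the right size, keeping the
    highest-points team that satisfies the budget and category quotas."""
    size = sum(quota.values())
    best, best_points = None, float("-inf")
    for team in combinations(players, size):
        if sum(p[2] for p in team) > budget:
            continue
        counts = {}
        for p in team:
            counts[p[1]] = counts.get(p[1], 0) + 1
        if counts != quota:
            continue
        points = sum(p[3] for p in team)
        if points > best_points:
            best, best_points = team, points
    return best, best_points

team, points = best_team(players, BUDGET, QUOTA)
```

For a full 15-player squad from hundreds of candidates this enumeration explodes, which is exactly when an integer-programming or multidimensional-knapsack formulation of the same constraints pays off.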
Say I have a dataset of students with features such as income level, gender, parents' education levels, school, etc., and the target variable is, say, passing or failing a national exam. We can train a machine learning model to predict, given these values, whether a student is likely to pass or fail (in sklearn, using predict_proba we can get the probability of passing).
Now say I have a different set of information, separate from the previous dataset, which lists each school and the percentage of its students who passed that national exam last year and in the years before: say, schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model? This data is surely valuable. (Students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.)
Do I somehow add this information as a new feature to the dataset? If so, what is the recommended way? Or do I use this information after the model prediction and somehow combine the two to get a final probability? Obviously an average or a weighted average doesn't work, since the second dataset has probabilities below 20%, which drags the combined probability very low. How do data scientists usually incorporate this kind of prior knowledge? Thank you.
You can try different ways to add this data and see whether your model is able to learn from it. More likely you'll see right away that this additional data just confuses the model, mostly because you're already providing more precise data on each student of the school, so the model has more freedom to use that information.
But artificial neural network training is all about continuous trial and error, so you should definitely try training with all the data you can imagine, to see whether you get a decent error in the end.
Using the average pass percentage of the student's school as a new feature for each student is worth a try.
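A minimal sketch of that last suggestion, merging an assumed school-level pass-rate table into per-student records (names and numbers are invented), with a fallback for schools missing from the table:

```python
# Hypothetical per-student records and a separate school-level table of
# historical pass rates (the second dataset from the question).
students = [
    {"name": "Ann", "income": 30000, "school": "schoolA"},
    {"name": "Ben", "income": 45000, "school": "schoolB"},
    {"name": "Cal", "income": 25000, "school": "schoolC"},  # unseen school
]
school_pass_rate = {"schoolA": 0.10, "schoolB": 0.15}

# Schools absent from the table fall back to the overall mean rate,
# so the feature stays defined for every student.
overall = sum(school_pass_rate.values()) / len(school_pass_rate)

for s in students:
    s["school_pass_rate"] = school_pass_rate.get(s["school"], overall)
```

The new column then goes into the training matrix alongside income, gender, etc., and the model learns how much weight it deserves, rather than you averaging probabilities by hand after prediction.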
I'm new to, and investigating, machine learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model that runs on a continuous basis, so that it gives a best guess at all times, whether it is run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate, or advanced, and procedures are weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0], x[1], ..., x[N-1])
where y (boolean output) ∈ {pass, fail} and the x[i] are the different features.
There is a plethora of ML classification models, such as Naive Bayes, neural networks, and decision trees, which can be used depending on the type of data. If you are looking for an answer that suggests a particular ML model, I would need more data. In general, though, a model-selection flow chart can be helpful here. You can also read about model selection in Andrew Ng's CS229 lectures (lecture 5).
Now coming back to the basic methodology: some of these features, such as initial interview scores and entry exam scores, you already know in advance, whereas others, such as performance in procedures, only become known over the course of the semester.
So there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure scores as 0, take them as the mean/median of the subject apprentice's past performances in other procedures.
You can even build a sub-model to analyze the relation between procedure scores and interview scores, as they are not completely independent. (I will explain this in the later part of the answer.)
However, if this is the subject apprentice's very first semester, you won't have such data for that apprentice yet. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject apprentice's. If the dataset is not very large, a K Nearest Neighbours approach can be quite useful here; for high-dimensional data, however, KNN suffers from the curse of dimensionality.
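A minimal sketch of that KNN idea in plain Python (the profiles and scores below are invented for illustration): estimate a new apprentice's initial procedure score from the k nearest past apprentices in interview/exam space.

```python
import math

# Hypothetical profiles of past apprentices: (interview, entry_exam)
# features plus the average procedure score they eventually achieved.
history = [
    ((70.0, 65.0), 62.0),
    ((80.0, 85.0), 78.0),
    ((60.0, 55.0), 50.0),
    ((85.0, 90.0), 84.0),
]

def knn_impute(profile, history, k=2):
    """Estimate an initial procedure score as the mean score of the
    k nearest past apprentices by Euclidean distance in feature space."""
    by_dist = sorted(history, key=lambda h: math.dist(profile, h[0]))
    nearest = by_dist[:k]
    return sum(score for _, score in nearest) / k

# New apprentice with strong interview/exam scores but no procedures yet.
estimate = knn_impute((82.0, 88.0), history, k=2)
```

In practice you would standardize the features first so that no single score dominates the distance, and use a library implementation once the dataset grows.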
Also, plot graphs of y against the different variables x[i], so as to see the variation of y with respect to each variable independently.
Most probably (although this is just a hypothesis), y will depend more on the initial variables than on the variables collected later, the reason being that the later variables are not completely independent of the former ones.
My point is that if a model can be created to predict the output of a semester, then a similar model can be created to predict just the output of the first procedure test.
In the end, as the model might be heavily based on demographic factors and similar attributes, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they depend heavily on real-time dynamic data.
For dynamic predictions based on the different procedure performances, time-series analysis can be somewhat helpful. But in any case, the final result will depend heavily on the apprentice's continued motivation and performance, which becomes clearer towards the end of each semester.
I have a market-transactions dataset that includes timestamps and goods, as follows.
John always buys milk and bread in the supermarket. Besides that, he also buys some other goods, like the following:
On Monday, John bought milk, bread {beer, chocolate}.
On Tuesday, John bought milk, bread {potato}.
On Wednesday, John bought milk, bread {chocolate, avocado, peanuts}.
Can we answer the question: "What will he buy on Thursdays?".
For example: He will buy {beer, avocado} besides milk and bread on Thursdays.
I think it is a kind of multiple regression. Which model can I use to predict a set of goods in this case?
If I understand your question correctly, then it is multi-label classification.
You have some input features (dayofweek, HasBoughtMilk, HasBoughtBread, etc.), and you want to predict several other labels (Beer, Avocado) based on them. You can do this easily with sklearn, which supports multilabel classification.
If you want to take into account what was bought on previous days (since it could affect your labels), you could do this in two ways:
1) Add synthetic features, such as binary flags like 'HasBoughtBread this week already'.
2) Or use RNNs, which are good at handling time series.
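A minimal sketch of how such a training set could be laid out (goods and days invented from the example above): one-hot day-of-week features and a multi-hot target vector per transaction, which is the shape sklearn's multilabel estimators expect.

```python
# Hypothetical transaction log: (day, set of extra goods beyond milk & bread).
transactions = [
    ("Mon", {"beer", "chocolate"}),
    ("Tue", {"potato"}),
    ("Wed", {"chocolate", "avocado", "peanuts"}),
]

# Fixed, sorted label vocabulary so every row has the same column order.
labels = sorted({g for _, goods in transactions for g in goods})
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

X, Y = [], []
for day, goods in transactions:
    X.append([1 if d == day else 0 for d in days])       # one-hot day of week
    Y.append([1 if g in goods else 0 for g in labels])   # multi-hot targets
```

With X and Y in this form, something like sklearn's MultiOutputClassifier (or any estimator with native multilabel support) can fit one binary decision per good.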
The problem you are describing seems to be a textbook case for random forests. The inference rules you are trying to express fit really well into decision trees. Random forests would give you a model that is flexible and fast to train.
Of course this is not the only way; you could use SVMs or some deep learning like RNNs, but that feels like using a bazooka to swat a fly to me.
Cheers,
Quentin
This depends on the actual factors you're trying to model. Are some items dependent on one another? Is there an actual time element in the data, or are we just conditioned to infer one?
Assuming that you have a time element, you will definitely want some sort of time-series analysis: a sequencing of purchases, perhaps with actual time lags. For instance, if John doesn't go to the store one day, what happens to his purchases? Do we need to learn how often certain things get bought? Does one product purchase hasten or delay another?
These considerations suggest either pre-processing the data (for time lags) or an RNN, LSTM, or Q-network of some sort. Naive Bayes or random forests might be of some help, but you'd still need to pre-process the time relations first.
I need to classify incoming car rentals, but the historical data that I could use for training is in "grouped" form, and I can't see how I could train a classification model on it.
My incoming data is a list of car model, quantity and unit price:
Chevrolet Spark, 1, 196.91
Fiat 500, 1, 196.91
Toyota Prius Hybrid, 3, 213.73
This incoming data is currently classified manually and saved grouped by class with a total price per group (the Chevy and Fiat are Economy, the Prius is Hybrid):
Economy, 393.82
Hybrid, 641.19
This problem should be solvable by machine learning, but I can't figure out how to build a training set for a supervised classifier. Any guidance appreciated.
Thanks
A naive Bayes classifier should do what you are trying to do... You can use the price as the feature and learn from what is already tagged.
However, I don't see how you can get consistent data using the TOTAL price to classify, since you don't always have the same number of items in each group... You would have to use the unit price.
There are lots of algorithms that provide multiclass classification, but could you explain more about what you're trying to predict? From what you've written, it sounds more like a scenario for an ETL process than a machine learning model.
If I understand your example correctly, an incoming record with a car model of "Chevy Spark" or "Fiat 500" would always be labeled "Economy", while an incoming record with a car model of "Toyota Prius Hybrid" would be labeled "Hybrid". A simple lookup table would do the job here; no need for fancy machine learning mathematics. :)
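A minimal sketch of that lookup-table approach (the mapping is assumed from the example above, and unknown models deliberately fall through to manual review rather than a guess):

```python
# Model-to-class lookup built from the already-labeled history.
class_by_model = {
    "Chevrolet Spark": "Economy",
    "Fiat 500": "Economy",
    "Toyota Prius Hybrid": "Hybrid",
}

def classify(model):
    # Unseen models are flagged for manual classification.
    return class_by_model.get(model, "Unclassified")

# Incoming rows: (model, quantity, unit_price), as in the question.
incoming = [("Chevrolet Spark", 1, 196.91),
            ("Fiat 500", 1, 196.91),
            ("Toyota Prius Hybrid", 3, 213.73)]

# Reproduce the grouped output: total price per class.
totals = {}
for model, qty, unit_price in incoming:
    cls = classify(model)
    totals[cls] = round(totals.get(cls, 0.0) + qty * unit_price, 2)
```

This reproduces the grouped totals from the question (Economy 393.82, Hybrid 641.19) with no model training at all; machine learning only becomes interesting if new, unlisted car models must be classified from their attributes.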