I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognize which customers are most likely to buy the item X. So i want to be able to sort the predictions and show a particular item to only 5 customers who will want to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using decision tree with a small sample set, you will definitely run into overfitting problem. Specially at the lower levels of the decision, where tree you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each categories.
Speaking of decision boundaries, make sure you understand how you are handling data type for each dimension. For example, 'sex' is a categorical data, where 'age', 'time of day', etc. are real valued inputs (discrete/continuous). So, different part of your tree will need different formulation. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
Related
I am a beginner in data science and need help with a topic.
I have a data set about the customers of an institution. My goal is to first find out which customers will pay to this institution and then find out how much money the paying customers will pay.
In this context, I think that I can first find out which customers will pay by "classification" and then how much will pay by applying "regression".
So, first I want to apply "classification" and then apply "regression" to this output. How can I do that?
Sure, you can definitely apply a classification method followed by regression analysis. This is actually a common pattern during exploratory data analysis.
For your use case, based on the basic info you are sharing, I would intuitively go for 1) logistic regression and 2) multiple linear regression.
Logistic regression is actually a classification tool, even though the name suggests otherwise. In a binary logistic regression model, the dependent variable has two levels (categorical), which is what you need to predict if your customers will pay vs. will not pay (binary decision)
The multiple linear regression, applied to the same independent variables from your available dataset, will then provide you with a linear model to predict how much your customers will pay (ie. the output of the inference will be a continuous variable - the actual expected dollar value).
That would be the approach I would recommend to implement, since you are new to this field. Now, there are obviously many different other ways to define these models, based on available data, nature of the data, customer requirements and so on, but the logistic + multiple regression approach should be a sure bet to get you going.
Another approach would be to make it a pure regression only. Without working on a cascade of models. Which will be more simple to handle
For example, you could associate to the people that are not willing to pay the value 0 to the spended amount, and fit the model on these instances.
For the business, you could then apply a threshold in which if the predicted amount is under a more or less fixed threshold, you classify the user as "non willing to pay"
Of course you can do it by vertically stacking models. Assuming that you are using binary classification, after prediction you will have a dataframe with target values 0 and 1. You are going to filter where target==1 and create a new dataframe. Then run the regression.
Also, rather than classification, you can use clustering if you don't have labels since the cost is lower.
We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but obviously, it didn't work since AutoML needed more text for that code(label). I found a dataset on Kaggle but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with codes that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found in these pages and assigned them to their code(labels), will it be enough for AutoML dataset training? since each label will have 2 or more texts finally instead of just one, but definitely still a lot less than a 100 for each code unlike those in demos/tutorials.
From what I can see here, the disease code has a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90000 examples for 90000 different independent labels, but a decision tree (you take several decisions in function of the previous decision: the first step would be choosing which of the about 15 most general categories fits best, then choosing which of the subcategories etc.)
In this sense, probably autoML is not the best product, given that you cannot implement a specially designed decision tree model that takes into account all of this.
Another way of using autoML would be training separately for each of the decisions and then combine the different models. This would easily work for the first layer of decision but would be exponentially time consuming (the number of models to train in order to be able to predict more accurately grows exponentially with the level of accuracy, by accurate I mean afirminng it is L00-L08 instad of L00-L99).
I hope this helps you understand better the problem and the different approaches you can give to it!
Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc. And the target variable is say, passing or failing a national exam. We can train a machine learning model to predict, given these values whether a student is likely to pass or fail (say in sklearn, using predict_prob we can say the probability of passing)
Now say I have a different set of information which has nothing to do with the previous data set, which includes the schools and percentage of students from that particular school who has passed that national exam last year and years before. say, schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model. For sure this data is valuable. (Students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do i some how add this information as a new feature to the data set? If so what is the recommend way. Or do I use this information after the model prediction and somehow combine these to get a final probability ? Obviously an average or a weighted average doesn't work due to the second data set having probabilities in the range below 20% which then drags the total probability very low. How do data scientist usually incorporate this kind of prior knowledge? Thank you
You can try different ways to add this data and see if your model will be able to learn on this set. More likely you'll see right away, that this additional data will just confuse the model. Mostly because you're already providing more precise data on each student of the school and the model has more freedom to use this information.
But artificial neural network training is all about continuous trials and errors, so you definitely should try to train it with all possible data you can imagine to see if it'll be able to get a descent error in the end.
Use the average pass percentage of the students' school as a new feature of each student is worth to try.
Im new to &investigating Machine Learning. I have a use case & data but I am unsure of a few things, mainly how my model will run, and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes & are scored on how well they do. After each semester, the apprentices either have sufficient score to move on to semester 2, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and are weighted by complexity.
Regarding Features, we have the following: -
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demograph information (nationality, age, gender)
I am unsure of is how the model will work and when we will run it. i.e. - if we run it on day one of the semester, I assume everyone will fail as everyone has procedure scores of 0
Current plan is to run the model 2-3 months into each semester, so there is enough score data & also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models like Naive Bayes, Neural Networks, Decision Trees, etc. which can be used depending upon the type of the data. In case you are looking for an answer which suggests a particular ML model, then I would need more data for the same. However, in general, this flow-chart can be helpful in selection of the same. You can also read about Model Selection from Andrew-Ng's CS229's 5th lecture.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if the semester is very first semester of the subject-apprentice, then you won't have such data already present for that apprentice. In that case, you might need to consider the average performances of other apprentices with similar profiles as the subject-apprentice. If the data-set is not very large, K Nearest Neighbors approach can be quite useful here. However, for large data-sets, KNN suffers from the curse of dimensionality.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypotheses), y will depend more the initial variables in comparison the variables achieved later. The reason being that the later variables are not completely independent of the former variables.
My point is, if a model can be created to predict the output of a semester, then, a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc. As they are heavily dependent upon real-time dynamic data.
For dynamic predictions based on different procedure performances, Time Series Analysis can be a bit helpful. But in any case, the final result will heavily dependent on the apprentice's continuity in motivation and performance which will become more clear towards the end of each semester.
I have become familiar with many various approaches to machine learning, but I am having trouble identifying which approach might be most appropriate for my given fun problem. (IE: is this a supervised learning problem and if so, what is my input x and output y?)
A magic the gathering draft consists of many actors sitting around a table holding a pack of 15 cards. the players pick a card and pass the remainder of the pack to the player next to them. They open a new pack and do this again for 3 total rounds (45 decisions). People end up with a deck which they use to compete.
I am having trouble understanding how to structure the data I have into trials which are needed for learning. I want a solution that 1) builds knowledge about which cards are picked relative to the previous cards that are picked 2) can then be used to make a decision about which card to pick from a given new pack.
I've got a dataset of human choices I'd like to learn from. It also includes info on cards they ended up picking but discarding ultimately. What might be an elegant way to structure this data for learning, aka, what are my features?
These kind of problems are usually tackled by reinforcment learning, planning and markov decision processes. Thus this is not oa typical scheme of supervised/unsupervised learning. This is rather about learning to play something - to interact with the environment (rules of the game, chances etc.). Take a look at methods like:
Q-learning
SARSA
UCT
In particular, great book by Sutton and Barto "Reinforcement Learning: An Introduction" can be a good starting point.
Yes, you can train a model do handle this -- eventually -- with either supervised or unsupervised learning. The problem is the quantity of factors and local stable points. Unfortunately, the input at this point is the state of the game: cards picked by all players (especially the current state of the AI's deck) and those available from the deck in hand.
Your output result should be, as you say, the card chosen ... out of those available. I feel that you have a combinatorial explosion that will require either massive amounts of data, or simplification of the card features to allow the algorithm to extract a meaning deeper than "Choose card X out of this set of 8."
In practice, you may want the model to score the available choices, rather than simply picking a particular card. Return rankings, or fitness metrics, or probabilities of picking each particular card.
You can supply some supervision in choice of input organization. For instance, you can provide each card as a set of characteristics, rather than simply a card ID -- give the algorithm a chance to "understand" building a consistent deck type.
You might also wish to put in some work to abstract (i.e. simplify) the current game state, such as writing evaluation routines to summarize the other decks being built. For instance, if there are 6 players in the group, and your RHO and his opposite are both building burn decks, you don't want to do the same -- RHO will take the best burn cards in 5 of 6 decks passed around, leaving you with 2nd (or 3rd) choice.
As for the algorithm ...
A neural network will explode with this many input variables. You'll want something simpler that matches your input data. If you go with abstracted properties, you might consider some decision-tree algorithm (Naive Bayes, Random Forest, etc.). You might also go for a collaborative filtering model, given the similarities in situations.
I hope this helps launch you toward designing your features. Do note that you're attacking a complex problem: one of the features that can make a game popular for humans is that it's hard to automate the decision-making.
Every single "pick" is a decision, with the input information as A:(what you already have), and B:(what the available choices are).
Thus, a machine that decides "whether you should pick this card", can be a simple binary classifier given the input of A+B+(the card in question).
For example, the pack 1 pick 2 of someone basically provides 1 "yes" (the card picked) and 13 "no" (the cards not picked), total 14 rows of training data.
We may want to weight these training data according to which pick it is. (When there are less cards left, the choice might be less important than when there are more choices.)
We may also want to weight these training data according to the rarity of cards.
Finally, the main challenge is that the input data (the features), A+B+card, is inappropriate unless we do a smart transformation. (Simply treating the card as a categorical variable and 1-hot coding them leads to something that is too big and very low information density. That will definitely fail.)
This challenge can be resolved by making it a 2-stage process. First we vectorize the cards, then build features based on the vectors.
http://www.cs.toronto.edu/~mvolkovs/recsys2018_challenge.pdf