Assistance regarding model choice - machine-learning

Im new to &investigating Machine Learning. I have a use case & data but I am unsure of a few things, mainly how my model will run, and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes & are scored on how well they do. After each semester, the apprentices either have sufficient score to move on to semester 2, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and are weighted by complexity.
Regarding Features, we have the following: -
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demograph information (nationality, age, gender)
I am unsure of is how the model will work and when we will run it. i.e. - if we run it on day one of the semester, I assume everyone will fail as everyone has procedure scores of 0
Current plan is to run the model 2-3 months into each semester, so there is enough score data & also enough time to help any apprentices who are in danger of failing.

This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models like Naive Bayes, Neural Networks, Decision Trees, etc. which can be used depending upon the type of the data. In case you are looking for an answer which suggests a particular ML model, then I would need more data for the same. However, in general, this flow-chart can be helpful in selection of the same. You can also read about Model Selection from Andrew-Ng's CS229's 5th lecture.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if the semester is very first semester of the subject-apprentice, then you won't have such data already present for that apprentice. In that case, you might need to consider the average performances of other apprentices with similar profiles as the subject-apprentice. If the data-set is not very large, K Nearest Neighbors approach can be quite useful here. However, for large data-sets, KNN suffers from the curse of dimensionality.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypotheses), y will depend more the initial variables in comparison the variables achieved later. The reason being that the later variables are not completely independent of the former variables.
My point is, if a model can be created to predict the output of a semester, then, a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc. As they are heavily dependent upon real-time dynamic data.
For dynamic predictions based on different procedure performances, Time Series Analysis can be a bit helpful. But in any case, the final result will heavily dependent on the apprentice's continuity in motivation and performance which will become more clear towards the end of each semester.

Related

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please mind, this is not a time-series question, I have deleted the time-series tag now and I changed the question. This question is about features that change regularly over time, yes! But we do not create a time-series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus a time-series, but I am not using it as a time-series, it is just monthly avaiable data of various different features.
It is a classification model, and now I want to predict a label column of a selected month of that time-series. The previous months before the selected label month are now the point of the question.
I do not want to just drop the past months of the label just because they are "almost" a label (or in other words: they were just the label columns of the preceding models in time). I know the past of the label, why not considering it as features as well?
My predictions are of course much better when adding the past labels of the time-series of labels to the features. This is logical as the labels usually do not change so much from one month to the other and thus can be predicted very well if you have fed the data with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would then be better than the ml model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education aso. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features. I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like >Good question, try them [feature selection methods] all and see what works best< in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ >If the features are relevant to the outcome, the model will figure out how to use them. Or most models will.< The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: >Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to consider other and your past label columns to get the next label predicted, as a sort of adding a time-series character to the model without using a time-series?
The sequence in time is not even important, as long as all of such monthly columns are shifted in time consistently by the same time when going over to the predicting set. The model does not care if it is just January and February of the same column type, for the model, every feature is isolated.
Example: You can perfectly run a random forest model on various features, including their past label columns that repeat the same column type again and again, only representing different months. Any month's column can be dealt with as an independent new feature in the ml model, the only importance is to shift all of those monthly columns by the exactly same period to reach a consistent predicting set. In other words, obviously you should avoid replacing January with March column when you go from a training set January-June to a predicting set February-July, instead you must replace January with February of course.
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, it shows only training and testing set. It writes "validation set", but it is known that this gets mixed up all over the place, see What is the Difference Between Test and Validation Datasets?, and it must be meant as the testing set in the default understanding of it.
Thus, with the right wording, it is:
This should be the best model for labels that become features in time.
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed since the test data must be biased "forward" in time, that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a needed dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be put against each other. As the model is to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and predict the real future as a last step only.
But then, why not still having a validation set on top of this, at least if it is just a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows but which are still important to be predicted, or at least hinted at, but should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any other method than k-fold would shrink the power of the model too much. The more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.

How can less amount of data lead to overfitting?

I am studying Machine learning course by Andrew Ng and in it he says that more number of features and less amount of data can lead to overfitting. Can someone elaborate on this.
In general, the less data you have the better your model can memorize the exceptions in your training set which leads to high accuracy on training but low accuracy on test set since your model generalizes what it has learned from the small training set.
For example, consider a Bayesian classifier. We want to predict the math grades of students based on
their grade on science
their last years math grade
their height
As we know the last feature is probably irrelevant. provided we have enough data, our model will learn that this data is irrelevant since there will be a people with different heights getting different grades if we the dataset is big enough.
now consider a very small dataset (e.g. only one class). in this case it's very unlikely that students grades are uncorrelated with their heights (e.g. the tall students will be better or less than average). so our model will be able to make use of that feature. the problem is our model has learned a correlation between grade and height that does not exist outside training dataset.
It could also go the other way, our model might learn that everyone who got a good grade last semester will get a good grade this semester (since that might hold in small datasets) and not use other features at all.
A more general reason, as I mentioned earlier, is that the model can memorize the dataset. There are always outlayer samples, which can't be classified easily. When data size is small the model can find a way to detect these outlayers since there are only few of them. However the it will not be able to predict the real outliers in the test set.

Incorporating prior knowledge to machine learning models

Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc. And the target variable is say, passing or failing a national exam. We can train a machine learning model to predict, given these values whether a student is likely to pass or fail (say in sklearn, using predict_prob we can say the probability of passing)
Now say I have a different set of information which has nothing to do with the previous data set, which includes the schools and percentage of students from that particular school who has passed that national exam last year and years before. say, schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model. For sure this data is valuable. (Students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do i some how add this information as a new feature to the data set? If so what is the recommend way. Or do I use this information after the model prediction and somehow combine these to get a final probability ? Obviously an average or a weighted average doesn't work due to the second data set having probabilities in the range below 20% which then drags the total probability very low. How do data scientist usually incorporate this kind of prior knowledge? Thank you
You can try different ways to add this data and see if your model will be able to learn on this set. More likely you'll see right away, that this additional data will just confuse the model. Mostly because you're already providing more precise data on each student of the school and the model has more freedom to use this information.
But artificial neural network training is all about continuous trials and errors, so you definitely should try to train it with all possible data you can imagine to see if it'll be able to get a descent error in the end.
Use the average pass percentage of the students' school as a new feature of each student is worth to try.

Machine Learning: How to detect the independent variables that are generating a dependent boolean value

I'm Trying to use machine learning in my job, but I can't find a way to adapt it to what I need. And I don't know if it is already a known problem or if I'm working with something that doesn't have a known solution yet.
Let's say that I have a lot of independent variables, encoded as onehot, and a dependent variable with only two status: True (The result had an error) and False (The result was successful)
My independent variables are the parameters I use for a query in an API, and the result is the one that returned the API.
My objective is to detect a pattern where I can see in a dataset in a certain timeframe of a few hours, the failing parameters, so I can avoid to query the API if I'm certain that it could fail.
(I'm working with millions of queries per day, and this mechanism is critical for a good user experience)
I'll try to make an example so you can understand what I need.
Suppose that I have a delivery company, I count with 3 trucks, and 3 different routes I could take.
So, my dummy variables would be T1,T2,T3,R1,R2 and R3 (I could delete T3 and R3 since there are considered by the omission of the other 2)
Then, I have a big dataset of the times that the delivery was delayed. So: Delayed=1 or Delayed=0
With this, I would have a set like this:
T1_|_T2_|_T3_|_R1_|_R2_|_R3||Delayed
------------------------------------
_1_|_0__|_0__|_1__|_0__|_0_||____0__
_1_|_0__|_0__|_0__|_1__|_0_||____1__
_0_|_1__|_0__|_1__|_0__|_0_||____0__
_1_|_0__|_0__|_0__|_1__|_0_||____1__
_1_|_0__|_0__|_1__|_0__|_0_||____0__
Not only I want to say "in most cases, truck 1 arrives late, it could have a problem, I shouldn't send it more", that is a valid result too, but I also want to detect things like: "in most cases, truck 1 arrives late when it goes in the route 1, probably this type of truck has a problem on this specific route"
This dataset is an example, the real one is huge, with thousand of dependent variables, so it could probably have more than one problem in the same dataset.
example: truck 1 has problems in route 1, and truck 3 has problems in route 1.
example2: truck 1 has problems in route 1, and truck 3 has problems in any route.
So, I would make a blacklist like:
example: Block if (truck=1 AND route=1) OR (truck=3 AND route=1)
example2: Block if (truck=1 AND route=1) OR truck=3
I'm actually doing this without machine learning, with an ugly code that makes a massive cartesian product of the independent columns, and counts the quantity of "delayed". Then I choose the worst delayed/total proportion, I blacklist it, and I iterate again with new values.
This errors are commonly temporary, so I would send a new dataset every few hours, I don't need a lifetime span analysis, except that the algorithm considers these temporary issues.
Anyone has a clue of what can I use, or where can I investigate about it?
Don't hesitate to ask for more info if you need it.
Thanks in advance!
Regards
You should check out the scikit-learn package for machine learning classifiers (Random Forest is an industry standard). For this problem, you could feed a portion of the data (training set, say 80% of the data) to the model and it would learn how to predict the outcome variable (delayed/not delayed).
You can then test the accuracy of your model by 'testing' on the remaining 20% of your data (the test set), to see if your model is any good at predicting the correct outcome. This will give you a % accuracy. Higher is better generally, unless you have severely imbalanced classes, in which case your classifier will just always predict the more common class for easy high accuracy.
Finally, if the accuracy is satisfactory, you can find out which predictor variables your model considered most important to achieve that level of prediction, i.e. Variable Importance. I think this is what you're after. So running this every few hours would tell you exactly which features (columns) in your set are best at predicting if a truck is late.
Obviously, this is all easier said than done and often you will have to perform significant cleaning of your data, sometimes normalisation (not in the case of random forests though), sometimes weighting your classifications, sometimes engineering new features... there is a reason this is a dedicated profession.
Essentially what you're asking is "how do I do Data Science?". Hopefully this will get you started, the rest (i.e. learning) is on you.

Using decision tree in Recommender Systems

I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognize which customers are most likely to buy the item X. So i want to be able to sort the predictions and show a particular item to only 5 customers who will want to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using decision tree with a small sample set, you will definitely run into overfitting problem. Specially at the lower levels of the decision, where tree you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each categories.
Speaking of decision boundaries, make sure you understand how you are handling data type for each dimension. For example, 'sex' is a categorical data, where 'age', 'time of day', etc. are real valued inputs (discrete/continuous). So, different part of your tree will need different formulation. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?

Resources