I am trying to apply machine learning methods to predict/ analyze user's behavior. The data which I have is in the following format:
data type
I am new to the machine learning, so I am trying to understand what I am doing makes sense or not. Now in the activity column, either I have two possibilities which I am representing as 0 or 1. Now in time column, I have time in a cyclic manner mapped to the range (0-24). Now at a certain time (onehot encoded) user performs an activity. If I use activity column as a target column in machine learning, and try to predict if at a certain time user will perform one activity or another, does it make sense or not?
The reason I am trying to predict activity is that if my model provides me some result about activity prediction and in real time a user does something else (which he has not been doing over the last week or so), I want to consider it as a deviation from normal behavior.
Am I doing right or wrong? any suggestion will be appreciated. Thanks.
I think your idea is valid, but machine learning models are not 100 % accurate all the time. That is why "Accuracy" is defined for a model.
If you want to create high-performance predictive models then go for deep learning models because its performance improves over time with the increase in the size of training data sets.
I think this is a great use case for a Classification problem. Since you have only few columns (features) in your dataset, i would say start with a simple Boosted Decision Tree Classification algorithm.
Your thinking is correct, that's basically how fraud detection AI works in some cases, one option to pursue is to use the decision tree model, this may help to scale dynamically.
I was working on the same project but in a different direction, have a look maybe it can help :) https://github.com/dmi3coder/behaiv-java.
Related
I'm now in the middle of the semester and trying to understand the background of the algorithms and features.
I would like to understand some theory.
If I have a dataset with N samples.
each sample has 5 features for example.
I have done 3 kinds of classifications algorithms for example : SVM, decision tree and kMeans.
In all 3, I got nice results
In a mystery way, a new feature added to the dataset. The value of the features for every sample selected randomly.
I restarted the algorithms on the dataset ( with the new feature)
Are the classification results gonna change from the first results without the new feature? If yes, why are they gonna change and by how much ?
In addition, if I do not have the dataset how can I know how to recognize that new feature?
The results of your classification algorithm are going to either change or stay the same depending on how much information the model gains from the feature. If the feature for instance is random noise then it will have little to no effect on your model, other than slowing it down. If it contains useful information it might be able to increase parameters such as recall and precision. Hope this might help.
I'm Trying to use machine learning in my job, but I can't find a way to adapt it to what I need. And I don't know if it is already a known problem or if I'm working with something that doesn't have a known solution yet.
Let's say that I have a lot of independent variables, encoded as onehot, and a dependent variable with only two status: True (The result had an error) and False (The result was successful)
My independent variables are the parameters I use for a query in an API, and the result is the one that returned the API.
My objective is to detect a pattern where I can see in a dataset in a certain timeframe of a few hours, the failing parameters, so I can avoid to query the API if I'm certain that it could fail.
(I'm working with millions of queries per day, and this mechanism is critical for a good user experience)
I'll try to make an example so you can understand what I need.
Suppose that I have a delivery company, I count with 3 trucks, and 3 different routes I could take.
So, my dummy variables would be T1,T2,T3,R1,R2 and R3 (I could delete T3 and R3 since there are considered by the omission of the other 2)
Then, I have a big dataset of the times that the delivery was delayed. So: Delayed=1 or Delayed=0
With this, I would have a set like this:
T1_|_T2_|_T3_|_R1_|_R2_|_R3||Delayed
------------------------------------
_1_|_0__|_0__|_1__|_0__|_0_||____0__
_1_|_0__|_0__|_0__|_1__|_0_||____1__
_0_|_1__|_0__|_1__|_0__|_0_||____0__
_1_|_0__|_0__|_0__|_1__|_0_||____1__
_1_|_0__|_0__|_1__|_0__|_0_||____0__
Not only I want to say "in most cases, truck 1 arrives late, it could have a problem, I shouldn't send it more", that is a valid result too, but I also want to detect things like: "in most cases, truck 1 arrives late when it goes in the route 1, probably this type of truck has a problem on this specific route"
This dataset is an example, the real one is huge, with thousand of dependent variables, so it could probably have more than one problem in the same dataset.
example: truck 1 has problems in route 1, and truck 3 has problems in route 1.
example2: truck 1 has problems in route 1, and truck 3 has problems in any route.
So, I would make a blacklist like:
example: Block if (truck=1 AND route=1) OR (truck=3 AND route=1)
example2: Block if (truck=1 AND route=1) OR truck=3
I'm actually doing this without machine learning, with an ugly code that makes a massive cartesian product of the independent columns, and counts the quantity of "delayed". Then I choose the worst delayed/total proportion, I blacklist it, and I iterate again with new values.
This errors are commonly temporary, so I would send a new dataset every few hours, I don't need a lifetime span analysis, except that the algorithm considers these temporary issues.
Anyone has a clue of what can I use, or where can I investigate about it?
Don't hesitate to ask for more info if you need it.
Thanks in advance!
Regards
You should check out the scikit-learn package for machine learning classifiers (Random Forest is an industry standard). For this problem, you could feed a portion of the data (training set, say 80% of the data) to the model and it would learn how to predict the outcome variable (delayed/not delayed).
You can then test the accuracy of your model by 'testing' on the remaining 20% of your data (the test set), to see if your model is any good at predicting the correct outcome. This will give you a % accuracy. Higher is better generally, unless you have severely imbalanced classes, in which case your classifier will just always predict the more common class for easy high accuracy.
Finally, if the accuracy is satisfactory, you can find out which predictor variables your model considered most important to achieve that level of prediction, i.e. Variable Importance. I think this is what you're after. So running this every few hours would tell you exactly which features (columns) in your set are best at predicting if a truck is late.
Obviously, this is all easier said than done and often you will have to perform significant cleaning of your data, sometimes normalisation (not in the case of random forests though), sometimes weighting your classifications, sometimes engineering new features... there is a reason this is a dedicated profession.
Essentially what you're asking is "how do I do Data Science?". Hopefully this will get you started, the rest (i.e. learning) is on you.
Im creating my first predictive model and its results are absolutely awful.
Im in need of some help identifying how i troubleshoot this.
Im doing linear regression & logistic regression classification, to predict if a student will pass a course, 1 for yes, 0 for no.
The dataset is tiny, as we only have complete data for one class, 16 features just under 60 rows, 35 passed and 25 failed.
I'm wondering if my dataset is simply too small.
I dont want to share the dataset just yet, but will clean it up so its completely anonymous.
The ROC is very very jagged and mostly (for log regression), and predicts more false positives than anything else.
Id appreciate some general troubleshooting advice for a beginner that i can try before we hire in a professional.
Thanks for any help provided.
Id suggest some tips:
In Azure ML theres a module called "filter based feature selection", you can use it to score your features and check if there is really predictive power in them or even select just the ones with the highest score.
If you haven't ,splitt in train/cross validation set and evaluate your model in both and use it as a diagnosis to identify underfitting(high bias) or overfitting(high variance), and depending on the diagnosis perform actions like:
For overfitting: get more data, use less features, use a less complex model , add or increase regularization
For underfitting: add more features, use a more complex model, decrease regularization.
And don't forget ,before start training to explore and evaluate your data, use scatter plots to see if indeed its separable, perform feature engineering and preprocessing for this ask yourself: given this features, would a human expert be able to perform predictions?, if your answer is not, transform or drop features so that the answer is positive
Im new to &investigating Machine Learning. I have a use case & data but I am unsure of a few things, mainly how my model will run, and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes & are scored on how well they do. After each semester, the apprentices either have sufficient score to move on to semester 2, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and are weighted by complexity.
Regarding Features, we have the following: -
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demograph information (nationality, age, gender)
I am unsure of is how the model will work and when we will run it. i.e. - if we run it on day one of the semester, I assume everyone will fail as everyone has procedure scores of 0
Current plan is to run the model 2-3 months into each semester, so there is enough score data & also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models like Naive Bayes, Neural Networks, Decision Trees, etc. which can be used depending upon the type of the data. In case you are looking for an answer which suggests a particular ML model, then I would need more data for the same. However, in general, this flow-chart can be helpful in selection of the same. You can also read about Model Selection from Andrew-Ng's CS229's 5th lecture.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure-scores as 0, take them as a mean/median of the past performances in other procedures by the subject-apprentice.
You can even build a sub-model to analyze the relation between procedure-scores and interview-scores as they are not completely independent. (I will explain this sentence in the later part of the answer)
However, if the semester is very first semester of the subject-apprentice, then you won't have such data already present for that apprentice. In that case, you might need to consider the average performances of other apprentices with similar profiles as the subject-apprentice. If the data-set is not very large, K Nearest Neighbors approach can be quite useful here. However, for large data-sets, KNN suffers from the curse of dimensionality.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypotheses), y will depend more the initial variables in comparison the variables achieved later. The reason being that the later variables are not completely independent of the former variables.
My point is, if a model can be created to predict the output of a semester, then, a similar model can be created to predict just the output of the 1st procedure-test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc. As they are heavily dependent upon real-time dynamic data.
For dynamic predictions based on different procedure performances, Time Series Analysis can be a bit helpful. But in any case, the final result will heavily dependent on the apprentice's continuity in motivation and performance which will become more clear towards the end of each semester.
I'm working on a project where I need to predict future stats based on past stats of basketball players. I would like to be able to predict next season's statistics based on the statistics of the past three seasons (if there are three previous seasons to choose from). Does anyone have a suggestion for a good prediction algorithm I could use? The data is continuous and there can be anywhere between 5-14 dimensions (age, minutes, points, etc.)
Thanks!
Note: I'd really like to use the program Weka to do this.
Out of the box, random forest would likely give you a strong baseline, so I would start with this.
You can also try try linear regression, which is a simple yet relative effective method, but depending on the data might require a bit more tweaking (for example transforming some of the input and/or out variables).
Gradient boosting regression is another strong predictor, but typically also needs more tweaking to work well.
All of these algorithms have Weka implementations.
There obviously isn't one correct answer, but for anyone looking to do something similar, I'll better describe my problem and the solution that I've found. I created a csv file where each row is a different season, and each column contains a different attribute. For each attribute that I would like to predict, I have the stats for the current season and then another column for the stats for the previous season. The first (rookie) season will have 0 for all 'previous season' columns. With this data set, I loaded it into Weka and used a Multilayer Perceptron with the test-option set to Cross-Validation. I set the number of folds to somewhere between 80-90% of the number of seasons available.
Finally, to predict the next season's statistics, you add one more row to the end and input the last-season values with "?" in the columns that you would like to predict. If anyone would like a deeper example, I'd be glad to provide one.
I think also if you truly want to create an accurate prediction you have to look at player movement and if a player moves to a team with a losing record, do they increase their minutes to have a larger role which would inflate stats or move to a winning team for a lesser role where they could see a decrease in stats.