How to model my machine learning problem?

I'm working on generating alerts in a motorcycle competition with approximately 100 competitors. We have data about rider position, speed, timestamp, and so on. In this project I need to create a machine learning algorithm that predicts the time a rider needs to go from one specific point on the course to another, based on the rider's history on the course.
Do you have any suggestions for how I can model my problem, or any research or proposals that may help me?

It depends on how sophisticated a model you need.
For a start, assuming that:
You have the rider's position A and want to estimate when he'll arrive at position B
There is a rich history of this rider riding on this track
I'd simply:
Pick a set of historical points around A and around B
Figure out which points form pairs (i.e. they correspond to the same lap in the same race)
Compute the time difference for each pair
Return the average
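The steps above can be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: each history sample is a hypothetical tuple `(race_id, lap, position, timestamp)`, `position` is a distance along the track in metres, and `radius` is the tolerance used to decide that a sample is "around" A or B.

```python
from statistics import mean

def closest_samples(history, target, radius):
    """For each (race, lap), keep the sample closest to `target` within `radius`."""
    best = {}
    for race_id, lap, position, timestamp in history:
        distance = abs(position - target)
        if distance > radius:
            continue
        key = (race_id, lap)
        if key not in best or distance < best[key][0]:
            best[key] = (distance, timestamp)
    return {key: timestamp for key, (_, timestamp) in best.items()}

def estimate_travel_time(history, a, b, radius):
    """Average time from A to B over all laps that have a sample near both points."""
    near_a = closest_samples(history, a, radius)
    near_b = closest_samples(history, b, radius)
    # A pair is one sample near A and one near B from the same lap of the same race.
    diffs = [near_b[key] - near_a[key]
             for key in near_a
             if key in near_b and near_b[key] > near_a[key]]
    return mean(diffs) if diffs else None

# Hypothetical history for one rider: (race_id, lap, position_m, timestamp_s)
history = [
    (1, 1, 100, 10.0), (1, 1, 500, 30.0),
    (1, 2, 102, 110.0), (1, 2, 498, 134.0),
    (2, 1, 99, 5.0),  # no sample near B on this lap, so it forms no pair
]
estimate = estimate_travel_time(history, a=100, b=500, radius=5)
```

Returning `None` when no pairs exist makes the "rich history" assumption explicit: without paired samples there is nothing to average.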

Related

Right method for an ML model

I'm taking my first steps in AI and ML.
I chose a project that I want to solve with ML, but I'm unsure which method to use.
Business case: a customer can place orders and set the date on which he wants to receive his products.
He can change the number of products he buys at any time.
I have to deal with the cost of unsold products, and with missed profit in case I produced less than he wanted.
I have plenty of data from past transactions containing the original number of products ordered and the number I sent to the customer.
My goal is a predictive analytics model that can tell me, after a customer has changed the number of products in an order, how likely it is that this change is final.
I'm really new to this topic and am not quite grasping all the information about the different methods. I know classification and regression are the big players and can be implemented in different ways. But does either of those approaches fit my problem?
Many thanks in advance.
You can go with a classification-based approach, since your goal is to predict whether the order change is final or not. Many classifiers can also output a probability for each prediction, and the accuracy/F1 score of your model tells you how far to trust those predictions: the higher the values, the more successful the predictions. In layman's terms, think of this as classifying whether the order is final or not.
You would go for a regression approach if you were trying to predict a value based on the order change. For example, if you wanted to predict the cost of the next order change, you would use regression.
As I understand it, your use case matches the first scenario.
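As a rough illustration of the classification route, here is a minimal logistic regression written with the standard library only. It is a sketch, not a recommendation of a specific model: the single feature (days remaining before delivery when the change was made), the toy data and the function names are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit P(final | x) = sigmoid(w * standardized_x + b) by gradient descent."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    zs = [(x - mean) / std for x in xs]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for z, y in zip(zs, ys):
            error = sigmoid(w * z + b) - y  # gradient of the log loss
            grad_w += error * z
            grad_b += error
        w -= learning_rate * grad_w / len(zs)
        b -= learning_rate * grad_b / len(zs)
    return {"w": w, "b": b, "mean": mean, "std": std}

def probability_final(model, x):
    """Estimated probability that a change made x days before delivery is final."""
    z = (x - model["mean"]) / model["std"]
    return sigmoid(model["w"] * z + model["b"])

# Hypothetical past changes: days before delivery, and whether the change was final (1) or not (0).
days_before_delivery = [1, 2, 3, 4, 10, 12, 15, 20]
change_was_final = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit_logistic(days_before_delivery, change_was_final)
```

Calling `probability_final(model, 2)` then gives a number between 0 and 1 rather than only a yes/no label, which is what the question asks for. In practice a library implementation with more features would be the usual choice.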

What is a good approach to clustering multi-dimensional data?

I created a k-means clustering of data based on one multidimensional feature, i.e. 24-hour power usage by customer for many customers, but I'd like to figure out a good way to take data which hypothetically comes from matches played within a game by a player and try to predict the win probability.
It would be something like:
Player A
Match 1
Match 2
.
.
.
Match N
And each match would have stats of differing dimensions for that player such as the player's X/Y coordinates at a given time, time a score was made by the player, and such. Example, the X/Y would have data points based on the match length, while scores could be anywhere between 0 and X, while other values might only have 1 dimension such as difference in skill ranking for the match.
I want to take all of the matches of the player and cluster them based on the features.
My idea to approach this is to cluster each multi-dimensional feature of the matches to summarize them into a cluster, then represent that entire feature for the match with a cluster number.
I would repeat this process for all of the features which are multi-dimensional until the row for each match is a vector of scalar values and then run one last cluster on this summarized view to try to see if wins and losses end up in distinctive clusters, and based on the similarity of the current game being played with the clustered match data, calculate the similarity to other clusters and assign a probability on whether it is likely going to become a win or a loss.
This seems like a decent approach, but there are a few problems that make me want to see if there is a better way.
One of the key issues I'm seeing is that building the model seems very slow: I'd want to run PCA and calculate the best number of components to use for each feature for each player, and also run a separate calculation to determine the best number of clusters to assign for each feature/player when I am clustering those individual features. I think that, hypothetically, scaling this out over thousands to millions of players with trillions of matches would take an extremely long time, both for this computation and for updating the model with new data, features, and/or players.
So my question to all of you ML engineers/data scientists is how is my approach to this problem?
Would you use the same method and just allocate a ton of hardware to build the model quickly, or is there some better/more efficient method which I've missed in order to cluster this type of data?
It is a completely random approach.
Just calling a bunch of functions because you've used them once and they sound cool was never a good idea.
Instead, you should first formalize your problem. What are you trying to do?
You appear to want to predict wins vs. losses. That is classification, not clustering. Secondly, k-means minimizes the sum of squares. Does it actually make sense to minimize this on your data? I doubt it. Lastly, you are starting to worry about scaling something to huge data that does not even work yet...
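One way to formalize this as classification is to turn every match into a fixed-length feature vector with a win/loss label, which any standard classifier can then consume. The sketch below only covers that data-preparation step. The field names (`x`, `y`, `score_times`, `skill_diff`, `won`) and the choice of summary statistics are assumptions for illustration, not something the question specifies.

```python
from statistics import mean

def summarize(values):
    """Reduce a variable-length series to a fixed-length summary: count, mean, min, max."""
    if not values:
        return [0, 0.0, 0.0, 0.0]
    return [len(values), mean(values), min(values), max(values)]

def match_to_example(match):
    """Convert one match (hypothetical dict layout) into (feature_vector, label)."""
    features = (
        summarize(match["x"])              # X coordinates over the match
        + summarize(match["y"])            # Y coordinates over the match
        + summarize(match["score_times"])  # times at which the player scored
        + [match["skill_diff"]]            # already a scalar
    )
    label = 1 if match["won"] else 0
    return features, label

example_match = {
    "x": [1.0, 3.0],
    "y": [2.0],
    "score_times": [],
    "skill_diff": -0.5,
    "won": True,
}
features, label = match_to_example(example_match)
```

Every match yields a vector of the same length regardless of how long it lasted, so no per-feature clustering step is needed before training.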

User behavior prediction/analysis

I am trying to apply machine learning methods to predict/analyze a user's behavior. The data I have is in the following format:
data type
I am new to machine learning, so I am trying to understand whether what I am doing makes sense. In the activity column, I have two possibilities, which I represent as 0 or 1. In the time column, I have the time in a cyclic manner, mapped to the range 0-24. At a certain time (one-hot encoded) the user performs an activity. If I use the activity column as the target column and try to predict whether the user will perform one activity or the other at a certain time, does that make sense?
The reason I am trying to predict the activity is that if my model predicts one activity and in real time the user does something else (which he has not been doing over the last week or so), I want to consider it a deviation from normal behavior.
Am I doing this right or wrong? Any suggestions will be appreciated. Thanks.
I think your idea is valid, but machine learning models are never 100% accurate. That is why accuracy is measured for a model.
If you want to create high-performance predictive models, consider deep learning models, because their performance tends to improve as the size of the training data set increases.
I think this is a great use case for a classification problem. Since you have only a few columns (features) in your dataset, I would start with a simple boosted decision tree classifier.
Your thinking is correct; that's basically how fraud detection AI works in some cases. One option to pursue is a decision tree model, which may help to scale dynamically.
I was working on the same kind of project but in a different direction. Have a look, maybe it can help :) https://github.com/dmi3coder/behaiv-java
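Before reaching for a trained classifier, the deviation idea from the question can be tried with a simple per-hour frequency baseline. The sketch below is a swapped-in simplification, not any of the models suggested above: it estimates how often activity 1 occurs in each hour and flags an observation whose historical probability falls below a threshold. The event layout `(hour, activity)`, the smoothing constant and the threshold value are assumptions.

```python
from collections import defaultdict

def fit_hourly_rates(events, smoothing=1.0):
    """Estimate P(activity == 1 | hour) from (hour, activity) pairs, with Laplace smoothing."""
    counts = defaultdict(lambda: [0, 0])  # hour -> [count of activity 0, count of activity 1]
    for hour, activity in events:
        counts[int(hour) % 24][activity] += 1
    rates = {}
    for hour in range(24):
        zeros, ones = counts[hour]
        rates[hour] = (ones + smoothing) / (zeros + ones + 2 * smoothing)
    return rates

def is_deviation(rates, hour, activity, threshold=0.2):
    """Flag an observation whose estimated probability at this hour is below `threshold`."""
    p_one = rates[int(hour) % 24]
    p_observed = p_one if activity == 1 else 1.0 - p_one
    return p_observed < threshold

# Hypothetical history: the user performed activity 1 at 09:00 on eight occasions.
events = [(9, 1)] * 8
rates = fit_hourly_rates(events)
```

With this history, `is_deviation(rates, 9, 0)` flags activity 0 at 09:00 as unusual, while hours with no data stay at a neutral 0.5 and are never flagged.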

Assistance regarding model choice

I'm new to and investigating machine learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and questions are below. Any advice is appreciated.
My main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and is weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we will run it, i.e. if we run it on day one of the semester, I assume everyone will fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models, such as Naive Bayes, neural networks, decision trees, etc., which can be used depending on the type of data. If you are looking for an answer that suggests a particular ML model, I would need more information about the data. In general, however, this flow-chart can be helpful in making the selection. You can also read about model selection in lecture 5 of Andrew Ng's CS229.
Now coming back to the basic methodology, some of these features like initial interview scores, entry exam scores, etc. you already know in advance. Whereas, some of them like performance in procedures are known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure scores as 0, take them as the mean/median of the subject apprentice's past performances in other procedures.
You can even build a sub-model to analyze the relation between procedure scores and interview scores, as they are not completely independent. (I explain this sentence later in the answer.)
However, if it is the subject apprentice's very first semester, you won't have such data for that apprentice. In that case, you might consider the average performance of other apprentices with profiles similar to the subject apprentice's. If the data set is not very large, a k-nearest-neighbors (KNN) approach can be quite useful here. However, KNN gets slow on large data sets, and with many features it suffers from the curse of dimensionality.
Also, plot y against the different variables x[i], to see how y varies with each variable independently.
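The nearest-neighbors suggestion for first-semester apprentices could look roughly like this. It is a sketch under assumptions: a profile is a hypothetical tuple of numeric features (here interview score and entry exam score), the features are taken to be on comparable scales already, and `k=3` is an arbitrary choice.

```python
import math

def knn_initial_score(new_profile, past_apprentices, k=3):
    """Estimate a starting procedure score as the mean score of the k most similar past apprentices.

    past_apprentices: list of (profile, average_procedure_score) pairs.
    """
    if not past_apprentices:
        raise ValueError("need at least one past apprentice")
    # Sort past apprentices by Euclidean distance between profiles.
    by_distance = sorted(past_apprentices,
                         key=lambda item: math.dist(new_profile, item[0]))
    nearest = by_distance[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical data: ((interview_score, entry_exam_score), average_procedure_score)
past = [
    ((70, 80), 60.0),
    ((72, 78), 64.0),
    ((30, 40), 20.0),
    ((71, 79), 62.0),
]
initial_score = knn_initial_score((71, 80), past, k=3)
```

The estimate can then replace the zero that a brand-new apprentice would otherwise have on day one. Categorical features such as nationality would need encoding before a distance makes sense.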
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, because the later variables are not completely independent of the earlier ones.
My point is, if a model can be created to predict the output of a semester, then a similar model can be created to predict just the output of the first procedure test.
In the end, as the model might be heavily based on demographic factors and other things, it might not be very successful. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they are heavily dependent on real-time dynamic data.
For dynamic predictions based on different procedure performances, time series analysis can be somewhat helpful. But in any case, the final result will depend heavily on the continuity of the apprentice's motivation and performance, which will become clearer towards the end of each semester.

Prediction Algorithm for Basketball Stats

I'm working on a project where I need to predict future stats based on past stats of basketball players. I would like to be able to predict next season's statistics based on the statistics of the past three seasons (if there are three previous seasons to choose from). Does anyone have a suggestion for a good prediction algorithm I could use? The data is continuous and there can be anywhere between 5-14 dimensions (age, minutes, points, etc.)
Thanks!
Note: I'd really like to use the program Weka to do this.
Out of the box, random forest would likely give you a strong baseline, so I would start with this.
You can also try linear regression, which is a simple yet relatively effective method, but depending on the data it might require a bit more tweaking (for example, transforming some of the input and/or output variables).
Gradient boosting regression is another strong predictor, but typically also needs more tweaking to work well.
All of these algorithms have Weka implementations.
There obviously isn't one correct answer, but for anyone looking to do something similar, I'll better describe my problem and the solution that I've found. I created a csv file where each row is a different season, and each column contains a different attribute. For each attribute that I would like to predict, I have the stats for the current season and then another column for the stats for the previous season. The first (rookie) season will have 0 for all 'previous season' columns. With this data set, I loaded it into Weka and used a Multilayer Perceptron with the test-option set to Cross-Validation. I set the number of folds to somewhere between 80-90% of the number of seasons available.
Finally, to predict the next season's statistics, you add one more row to the end and input the last-season values with "?" in the columns that you would like to predict. If anyone would like a deeper example, I'd be glad to provide one.
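For anyone building the same table programmatically, the layout described above can be sketched as follows. The column names (`points`, `minutes`, and the `prev_` prefix) are assumptions for illustration. The rows would still need to be written out to CSV before loading them into Weka.

```python
def add_previous_season_columns(seasons, stats):
    """Return one row per season with extra 'prev_<stat>' columns; the rookie season gets 0."""
    rows = []
    previous = None
    for season in seasons:
        row = dict(season)
        for stat in stats:
            row["prev_" + stat] = previous[stat] if previous is not None else 0
        rows.append(row)
        previous = season
    return rows

def prediction_row(last_season, stats):
    """Row for the season to predict: '?' for unknown targets, last season's values as inputs."""
    row = {stat: "?" for stat in stats}
    for stat in stats:
        row["prev_" + stat] = last_season[stat]
    return row

# Hypothetical seasons for one player, oldest first.
seasons = [
    {"points": 10, "minutes": 20},
    {"points": 15, "minutes": 25},
]
stats = ["points", "minutes"]
rows = add_previous_season_columns(seasons, stats)
rows.append(prediction_row(seasons[-1], stats))
```

The final row carries `"?"` in the target columns, matching the convention the answer uses for values to be predicted.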
I also think that if you truly want an accurate prediction, you have to look at player movement: if a player moves to a team with a losing record, do their minutes increase as they take a larger role, which would inflate their stats, or do they move to a winning team for a lesser role, where their stats could decrease?
