Prediction Algorithm for Basketball Stats - machine-learning

I'm working on a project where I need to predict future stats based on past stats of basketball players. I would like to be able to predict next season's statistics based on the statistics of the past three seasons (if there are three previous seasons to choose from). Does anyone have a suggestion for a good prediction algorithm I could use? The data is continuous and there can be anywhere between 5-14 dimensions (age, minutes, points, etc.)
Thanks!
Note: I'd really like to use the program Weka to do this.

Out of the box, random forest would likely give you a strong baseline, so I would start with this.
You can also try try linear regression, which is a simple yet relative effective method, but depending on the data might require a bit more tweaking (for example transforming some of the input and/or out variables).
Gradient boosting regression is another strong predictor, but typically also needs more tweaking to work well.
All of these algorithms have Weka implementations.

There obviously isn't one correct answer, but for anyone looking to do something similar, I'll better describe my problem and the solution that I've found. I created a csv file where each row is a different season, and each column contains a different attribute. For each attribute that I would like to predict, I have the stats for the current season and then another column for the stats for the previous season. The first (rookie) season will have 0 for all 'previous season' columns. With this data set, I loaded it into Weka and used a Multilayer Perceptron with the test-option set to Cross-Validation. I set the number of folds to somewhere between 80-90% of the number of seasons available.
Finally, to predict the next season's statistics, you add one more row to the end and input the last-season values with "?" in the columns that you would like to predict. If anyone would like a deeper example, I'd be glad to provide one.

I think also if you truly want to create an accurate prediction you have to look at player movement and if a player moves to a team with a losing record, do they increase their minutes to have a larger role which would inflate stats or move to a winning team for a lesser role where they could see a decrease in stats.

Related

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please mind, this is not a time-series question, I have deleted the time-series tag now and I changed the question. This question is about features that change regularly over time, yes! But we do not create a time-series from this, as there are many other features as well which are not like the label and are also important features in the model. Now please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus a time-series, but I am not using it as a time-series, it is just monthly avaiable data of various different features.
It is a classification model, and now I want to predict a label column of a selected month of that time-series. The previous months before the selected label month are now the point of the question.
I do not want to just drop the past months of the label just because they are "almost" a label (or in other words: they were just the label columns of the preceding models in time). I know the past of the label, why not considering it as features as well?
My predictions are of course much better when adding the past labels of the time-series of labels to the features. This is logical as the labels usually do not change so much from one month to the other and thus can be predicted very well if you have fed the data with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would then be better than the ml model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education aso. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features. I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features - obviously. This is because the historical labels, if there are any, are of course better indicators of the final outcome than normal columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as independent variable, see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like >Good question, try them [feature selection methods] all and see what works best< in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/ >If the features are relevant to the outcome, the model will figure out how to use them. Or most models will.< The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/ though multicollinearity is said to be no issue for the prediction: >Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
It is perfectly possible and also good practice to include past label columns as features, though it depends on your question: do you want to explain the label only with other features (on purpose), or do you want to consider other and your past label columns to get the next label predicted, as a sort of adding a time-series character to the model without using a time-series?
The sequence in time is not even important, as long as all of such monthly columns are shifted in time consistently by the same time when going over to the predicting set. The model does not care if it is just January and February of the same column type, for the model, every feature is isolated.
Example: You can perfectly run a random forest model on various features, including their past label columns that repeat the same column type again and again, only representing different months. Any month's column can be dealt with as an independent new feature in the ml model, the only importance is to shift all of those monthly columns by the exactly same period to reach a consistent predicting set. In other words, obviously you should avoid replacing January with March column when you go from a training set January-June to a predicting set February-July, instead you must replace January with February of course.
Update 202301: model name is "walk-forward"
This model setup is called "walk-forward", see Why isn’t out-of-time validation more ubiquitous? --> option 3 almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
In the following, it shows only training and testing set. It writes "validation set", but it is known that this gets mixed up all over the place, see What is the Difference Between Test and Validation Datasets?, and it must be meant as the testing set in the default understanding of it.
Thus, with the right wording, it is:
This should be the best model for labels that become features in time.
validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed since the test data must be biased "forward" in time, that is the whole idea of predicting the "step forward in time", and any validation set would have to be in that same biased artificial future - which is already the past at the time of training, but the model does not know this.
The validation happens by default, without a needed dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be put against each other. As the model is to predict the time-biased future, there is no need to prove that or how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and predict the real future as a last step only.
But then, why not still having a validation set on top of this, at least if it is just a small k-fold validation? It could play a role if the testing set has a few strong changes that happen in small time windows but which are still important to be predicted, or at least hinted at, but should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any other method than k-fold would shrink the power of the model too much. The more you take away from the testing set during training, the less it can predict the future.
Wrap up:
Try it out, and in doubt, leave the validation aside and judge upon the model by checking its metrics over time, during the "walk-forward". This model is not like the others.
Thus, in the end, you can, but you do not have to, split a k-fold validation from the testing set. That would look like:
After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.

What is a good approach to clustering multi-dimensional data?

I created a k-means clustering for clustering data based on 1 multidimentional feature i.e. 24-hour power usage by customer for many customers, but I'd like to figure out a good way to take data which hypothetically comes from matches played within a game for a player and tries to predict the win probability.
It would be something like:
Player A
Match 1
Match 2
.
.
.
Match N
And each match would have stats of differing dimensions for that player such as the player's X/Y coordinates at a given time, time a score was made by the player, and such. Example, the X/Y would have data points based on the match length, while scores could be anywhere between 0 and X, while other values might only have 1 dimension such as difference in skill ranking for the match.
I want to take all of the matches of the player and cluster them based on the features.
My idea to approach this is to cluster each multi-dimensional feature of the matches to summarize them into a cluster, then represent that entire feature for the match with a cluster number.
I would repeat this process for all of the features which are multi-dimensional until the row for each match is a vector of scalar values and then run one last cluster on this summarized view to try to see if wins and losses end up in distinctive clusters, and based on the similarity of the current game being played with the clustered match data, calculate the similarity to other clusters and assign a probability on whether it is likely going to become a win or a loss.
This seems like a decent approach, but there are a few problems that make me want to see if there is a better way
One of the key issues I'm seeing is that building model seems very slow - I'd want to run PCA and calculate the best number of components to use for each feature for each player, and also run a separate calculation to determine the best number of clusters to assign for each feature/player when I am clustering those individual features. I think hypothetically scaling this out over thousands to millions of players with trillions of matches would take an extremely long time to do this computation as well as update the model with new data, features, and/or players.
So my question to all of you ML engineers/data scientists is how is my approach to this problem?
Would you use the same method and just allocate a ton of hardware to build the model quickly, or is there some better/more efficient method which I've missed in order to cluster this type of data?
It is a completely random approach.
Just calling a bunch of functions just because you've used them once and they sound cool never was a good idea.
Instead , you first should formalize your problem. What are you trying to do?
You appear to want to predict wins vs. losses. That is classification not clustering. Secondly, k-means minimizes the sum-of-squares. Does it actually !ake sense to minimize this on your data? I doubt so. Last, you begin to be concerned about scaling something to huge data, which does not even work yet...

User behavior prediction/analysis

I am trying to apply machine learning methods to predict/ analyze user's behavior. The data which I have is in the following format:
data type
I am new to the machine learning, so I am trying to understand what I am doing makes sense or not. Now in the activity column, either I have two possibilities which I am representing as 0 or 1. Now in time column, I have time in a cyclic manner mapped to the range (0-24). Now at a certain time (onehot encoded) user performs an activity. If I use activity column as a target column in machine learning, and try to predict if at a certain time user will perform one activity or another, does it make sense or not?
The reason I am trying to predict activity is that if my model provides me some result about activity prediction and in real time a user does something else (which he has not been doing over the last week or so), I want to consider it as a deviation from normal behavior.
Am I doing right or wrong? any suggestion will be appreciated. Thanks.
I think your idea is valid, but machine learning models are not 100 % accurate all the time. That is why "Accuracy" is defined for a model.
If you want to create high-performance predictive models then go for deep learning models because its performance improves over time with the increase in the size of training data sets.
I think this is a great use case for a Classification problem. Since you have only few columns (features) in your dataset, i would say start with a simple Boosted Decision Tree Classification algorithm.
Your thinking is correct, that's basically how fraud detection AI works in some cases, one option to pursue is to use the decision tree model, this may help to scale dynamically.
I was working on the same project but in a different direction, have a look maybe it can help :) https://github.com/dmi3coder/behaiv-java.

How much prediction accuracy of SVM (or other ML models) depend on the way features are encoded?

Suppose that for a given ML problem, we have a feature which car the person possesses. We can encode this information in one of the following ways:
Assign an id to each of the car. Make a column 'CAR_POSSESSED' and put feature id as value.
Make columns for each of the car and put 0 or 1 according to whether that car is possessed by the considered sample or not. Columns will be like "BMW_POSSESSED", "AUDI_POSSESSED".
In my experiments the 2nd way performed much better than 1st one, when tried with SVM.
How does the encoding way affects the model learning, and are there some resources in which affect of encoding has been studied? Or do we need to do hit and trials to check where it performs best?
The problem with the first way is that you use arbitrary numbers to represent the features (e.g. BMW=2, etc.) and SVM take those numbers seriously, as if they have order: e.g. it may try to use cases with CAR_OWNED>3 for the prediction.
So the second way is better.
Chapter 2.1 Categorical Features:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
You'll find many more if you search for "svm Categorical Features"

Using decision tree in Recommender Systems

I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognize which customers are most likely to buy the item X. So i want to be able to sort the predictions and show a particular item to only 5 customers who will want to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using decision tree with a small sample set, you will definitely run into overfitting problem. Specially at the lower levels of the decision, where tree you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each categories.
Speaking of decision boundaries, make sure you understand how you are handling data type for each dimension. For example, 'sex' is a categorical data, where 'age', 'time of day', etc. are real valued inputs (discrete/continuous). So, different part of your tree will need different formulation. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)
A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?

Resources