I have a problem set where I need to predict the winners of running races. The data comes in groups: within each group I have the attributes of each runner and their best-time ranking within that group. The number of runners differs from group to group, so the slowest of 2 runners ranks 2nd, even though the time he achieved might only have ranked 10th in a group of 12. I don't have the exact time of each runner. Given that, which machine learning model would be useful here, i.e. one that can be trained on a group of runners rather than on each individual runner?
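One common way to frame this kind of grouped ranking data is pairwise learning-to-rank: instead of training on individual runners, train on pairs of runners from the same group, labelled by who ranked better. A minimal sketch with invented features (dedicated ranking learners exist as well; this is only meant to show the data transformation):

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Invented example: each group is one race; each runner has a feature vector and a rank
# within that group (1 = fastest). Group sizes differ and no absolute times are needed.
groups = [
    {"features": np.array([[25.0, 60.0], [31.0, 55.0]]),               "ranks": [1, 2]},
    {"features": np.array([[22.0, 70.0], [28.0, 65.0], [35.0, 50.0]]), "ranks": [2, 1, 3]},
]

# Pairwise transform: for every pair of runners in the same group, the training example
# is the difference of their feature vectors, labelled by whether the first ranked better.
X_pairs, y_pairs = [], []
for g in groups:
    for i, j in combinations(range(len(g["ranks"])), 2):
        X_pairs.append(g["features"][i] - g["features"][j])
        y_pairs.append(1 if g["ranks"][i] < g["ranks"][j] else 0)

clf = LogisticRegression().fit(X_pairs, y_pairs)
# Scoring the runners of a new group against each other (or sorting by the model's
# decision function) then gives a predicted finishing order for that group.
```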
Most tutorials and RL courses focus on teaching how to apply a model (e.g. Q-learning) to an environment (such as the Gym environments) where one can input a state in order to get some output / reward.
How is it possible to use RL on historical data, where you cannot collect new data? (For example, given a massive auction dataset, how can I derive the best policy using RL?)
If your dataset consists, for example, of time series, you can set each instant of time as a state. Then you can have your agent explore the data series to learn a policy over it.
If your dataset is already labelled with actions, you can train the agent on it to learn the policy underlying those actions.
The trick is to feed your agent each successive instant of time, as if it were exploring the data in real time.
Of course, you need to model the different states from the information available at each instant of time.
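To make that concrete, here is a minimal sketch of replaying logged transitions through a tabular Q-learning update. The states, actions and rewards are invented placeholders; in practice each state would be built from the information at one instant of your series.

```python
import numpy as np
from collections import defaultdict

# Hypothetical logged transitions (state, action, reward, next_state),
# e.g. each state is a discretised snapshot of one instant of the series.
logged_transitions = [
    (0, 1, 0.5, 1),
    (1, 0, -0.2, 2),
    (2, 1, 1.0, 3),
]

n_actions = 2
alpha, gamma = 0.1, 0.95                     # learning rate and discount factor
Q = defaultdict(lambda: np.zeros(n_actions))

# Replay the dataset repeatedly, applying the standard Q-learning update
# to each logged transition as if the agent had just experienced it.
for epoch in range(100):
    for s, a, r, s_next in logged_transitions:
        td_target = r + gamma * Q[s_next].max()
        Q[s][a] += alpha * (td_target - Q[s][a])

# Greedy policy derived from the learned Q-values.
policy = {s: int(q.argmax()) for s, q in Q.items()}
print(policy)
```

Keep in mind that the agent can only evaluate actions that actually appear in the log, which is the usual limitation of learning from fixed historical data.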
I am working on a personal project in which I log data from my city's bike rental service into a MySQL database. A script runs every thirty minutes and logs, for every bike station, the number of free bikes it has. Then, in my database, I average the availability of each station for each day at that given time, which gives me, as of today, an approximate prediction based on 2 months of logged data.
I've read a bit about machine learning and I'd like to learn more. Would it be possible to train a model with my data and make better predictions with ML in the future?
The answer is very likely yes.
The first step is to have some data, and it sounds like you do. You have a response (free bikes) and some features on which it varies (time, location). You have already applied a basic conditional means model by averaging values over factors.
You might augment the data you know about locations with some calendar events like holiday or local event flags.
Prepare a data set with one row per observation, and benchmark the accuracy of your current forecasting process for a period of time on a metric like Mean Absolute Percentage Error (MAPE). Ensure your predictions (averages) for the validation period do not include any of the data within the validation period!
Use the data for this period to validate other models you try.
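As a reference point, a minimal MAPE helper could look like the sketch below; the observed and predicted numbers are made up for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, in percent.
    # Note: undefined when the true value is 0 (an empty station), which needs special handling.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Observed free bikes vs. the per-station averages computed only from data
# recorded *before* the validation period.
observed = [12, 5, 9, 20]
predicted = [10, 6, 8, 18]
print(f"Baseline MAPE: {mape(observed, predicted):.1f}%")
```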
Split off part of the remaining data into a test set, and use the rest for training. If you have a lot of data, then a common training/test split is 70/30. If the data is small you might go down to 90/10.
Learn one or more machine learning models on the training set, checking performance periodically on the test set to ensure generalization performance is still increasing. Many training algorithm implementations will manage this for you and stop automatically when test performance starts to decrease due to overfitting. This is a big benefit of machine learning over your current straight average: the ability to learn what generalizes and discard what does not.
Validate each model by predicting over the validation set, computing the MAPE, and comparing the model's MAPE to that of your original process over the same period. Good luck, and enjoy getting to know machine learning!
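A rough end-to-end sketch of that loop is below. The column names and the synthetic data are stand-ins for however your MySQL log is structured, and the random split is only for illustration; with time-ordered data you would normally hold out the most recent weeks instead, as described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in for the exported log: one row per half-hourly observation per station.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "station_id": rng.integers(0, 30, 5000),
    "hour": rng.integers(0, 24, 5000),
    "weekday": rng.integers(0, 7, 5000),
    "is_holiday": rng.integers(0, 2, 5000),
    "free_bikes": rng.integers(0, 20, 5000),
})

X = df[["station_id", "hour", "weekday", "is_holiday"]]
y = df["free_bikes"]

# Random 70/30 split for illustration only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

def mape(y_true, y_pred):
    # Same metric as above, with a crude guard against stations that had zero free bikes.
    y_true = np.asarray(y_true, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / np.maximum(y_true, 1))) * 100

print(f"Test MAPE: {mape(y_test, model.predict(X_test)):.1f}%")
```

The same mape call applied to your averaging baseline over the identical validation period gives the comparison described above.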
I have a dataset (1M entries) on companies where all companies are tagged based on what they do.
For example, Amazon might be tagged with "Retail;E-Commerce;SaaS;Cloud Computing" and Google would have tags like "Search Engine;Advertising;Cloud Computing".
So now I want to analyze a cluster of companies, e.g. all online marketplaces like Amazon, eBay, Etsy, and the like. But there is no single tag I can look for; instead, I have to use a set of tags to quantify the likelihood that a company is a marketplace.
For example, tags like "Retail", "Shopping", and "E-Commerce" are good indicators, but there may also be small consulting agencies or software development firms that consult for / build software for online marketplaces and carry tags like "consulting;retail;e-commerce" or "software development;e-commerce;e-commerce tools", which I want to exclude since they are not online marketplaces themselves.
I'm wondering what the best way is to identify all online marketplaces in my dataset. Which machine learning algorithm is suited to selecting the maximum number of companies in the industry I'm looking for while excluding the ones that are obviously not part of it?
I thought about supervised learning, but I'm not sure because of a few issues:
Labelling is needed, which means I would have to go through thousands of companies and flag them across multiple industries (marketplace, finance, fashion, ...), as I'm interested in 20-30 industries overall.
There are more than 1,000 tags associated with the companies. How would I define my features? 1 feature per tag would lead to a massive dimensionality.
Are there any best practices for such cases?
UPDATE:
It should be possible to assign companies to multiple clusters, e.g. Amazon should be identified as "Marketplace", but also as "Cloud Computing" or "Online Streaming".
I used tf-idf and k-means to identify tags that form clusters, but I don't know how to assign likelihoods / scores to companies that indicate how well a company fits a cluster based on its tags.
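One rough way to get such scores is to reuse the distances that k-means already computes for each company and turn them into soft cluster scores. A minimal sketch with made-up tag strings, assuming the tags are stored as semicolon-separated strings as in the examples above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up tag strings; the real data would have one string per company.
companies = [
    "Retail;E-Commerce;SaaS;Cloud Computing",
    "Search Engine;Advertising;Cloud Computing",
    "Consulting;Retail;E-Commerce",
    "Software Development;E-Commerce;E-Commerce Tools",
]

# Treat each semicolon-separated tag as a single token.
vectorizer = TfidfVectorizer(tokenizer=lambda s: s.split(";"), token_pattern=None)
X = vectorizer.fit_transform(companies)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# KMeans.transform returns each company's distance to every cluster centre;
# converting distances into normalised soft scores gives a rough per-cluster fit.
distances = kmeans.transform(X)
scores = np.exp(-distances)
scores /= scores.sum(axis=1, keepdims=True)
print(scores.round(2))
```

The softmax over negative distances is only a heuristic; a mixture model fitted on the same features, as attempted below, is the more principled route to actual probabilities.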
UPDATE:
While tf-idf in combination with k-means delivered pretty neat clusters (meaning the companies within a cluster were actually similar), I also tried to calculate probabilities of cluster membership with Gaussian Mixture Models (GMMs), which led to completely messed-up results where the companies within a cluster were more or less random or came from a handful of different industries.
No idea why this happened though...
UPDATE:
Found the error. I applied PCA before the GMM to reduce dimensionality; apparently this is what led to the random results. Removing the PCA improved the results significantly.
However, the resulting posterior probabilities of my GMM are exactly 0.0 or 1.0 about 99.9% of the time. Is there a parameter (I'm using a sklearn BayesianGMM) that needs to be adjusted to get more useful probabilities that are a little more spread out? Right now everything < 1.0 is no longer part of a cluster, but there are also a few outliers that get a posterior of 1.0 and are thus assigned to an industry. For example, a company tagged "Baby;Consumer" gets assigned to the "Consumer Electronics" cluster, even though only 1 of its 2 tags suggests this. I'd like such a company to get a probability < 1.0 so that I can define a threshold based on some cross-validation.
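For reference, a minimal sketch of the setup and the threshold logic in question, using sklearn's BayesianGaussianMixture; the feature matrix is a random placeholder and the parameter values are just starting points to experiment with, not a known fix for the peaked posteriors.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Placeholder for the real (dense) tf-idf feature matrix.
X_dense = np.random.rand(200, 50)

gmm = BayesianGaussianMixture(
    n_components=20,
    covariance_type="diag",   # fewer covariance parameters than "full"
    reg_covar=1e-4,           # covariance regularisation; one knob to experiment with
    max_iter=500,
    random_state=0,
).fit(X_dense)

proba = gmm.predict_proba(X_dense)

# Check how peaked the posteriors are, then assign only above a tunable threshold.
print("share of rows with max posterior > 0.99:", (proba.max(axis=1) > 0.99).mean())
threshold = 0.8
assignments = np.where(proba.max(axis=1) >= threshold, proba.argmax(axis=1), -1)
```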
I have no experience or knowledge of machine learning. I want to build a system for classifying RTS game replays by their build orders. A build order is a set of tasks (constructing buildings, training soldiers and workers) laid out over time. A replay consists of the timings of all the tasks done by the two players. There are standard build orders that players usually follow, and these define the strategy of the game.
So, the problem is to label/classify replays based on their build orders. Since every replay is a unique game, it is not easy to identify build orders from the timings or order of tasks. Here you can see a standard build order for a StarCraft 2 game: http://lotv.spawningtool.com/build/52587/. It is not possible to match on exact timings, build order, or unit counts, since they differ in every game, and build orders evolve and new ones are created over time.
To solve this problem, I believe building a machine learning system is necessary.
The question is: what would be an easy and simple way to approach this problem? Are neural networks good for such classification, or is there another concept more relevant to this kind of problem?
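For illustration, one very simple starting representation could reduce each replay to the first time each key task appears and feed those fixed-length vectors to an off-the-shelf classifier, before trying anything sequence-aware such as a neural network. All timings, task names, and labels below are invented placeholders.

```python
from sklearn.ensemble import RandomForestClassifier

# Each replay reduced to its build events: (game_time_in_seconds, task_name).
replays = {
    "r1": [(17, "Pylon"), (38, "Gateway"), (49, "Assimilator"), (95, "Cybernetics Core")],
    "r2": [(18, "Pylon"), (41, "Gateway"), (120, "Robotics Facility")],
}
labels = {"r1": "economic_opener", "r2": "robo_opener"}   # invented build-order labels

KEY_TASKS = ["Pylon", "Gateway", "Assimilator", "Cybernetics Core", "Robotics Facility"]

def features(events, horizon=600):
    """Time of the first occurrence of each key task (or the horizon if it never happens)."""
    first = {task: horizon for task in KEY_TASKS}
    for t, task in events:
        if task in first and t < first[task]:
            first[task] = t
    return [first[task] for task in KEY_TASKS]

# With a set of labelled replays, a standard classifier can be trained on these vectors.
X = [features(events) for events in replays.values()]
y = [labels[name] for name in replays]
clf = RandomForestClassifier(random_state=0).fit(X, y)
```

A representation like this throws away the ordering details, so if the exact sequence matters, a sequence model would be the natural next step.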
I am trying to make predictions from my past data, which has around 20 attribute columns and a label. Of those 20, only 4 are significant for prediction. But I also want to know, when a row falls into one of the classified categories, which other important correlated columns matter apart from those 4, and what their weights are. I want to get that result from my deployed web service on Azure.
You can use the permutation feature importance module, but that will give the importance of the features across the whole sample set. Retrieving the weights on a per-call basis is not available in Azure ML.
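Outside of Azure, the same dataset-level computation can be reproduced offline with scikit-learn's permutation_importance; a minimal sketch on synthetic data standing in for the real 20-column dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in: 20 attribute columns, of which only a few are informative.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Global importances: how much the score drops when each column is shuffled.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:4]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```

Per-row explanations (which columns drove one specific prediction) need a local explanation method rather than permutation importance, which is inherently a global measure.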