I am using Brain.js to develop a neural network for a project of mine. The idea is to have something like this:
{ input: [day, month, hour, minute], output: some categoryID }
Basically, I will train the network to learn a user's habits over time, so that it can later predict the user's favourite category for a given time.
My main problem is that these "categories" are not a fixed array. Over time I will most likely add, change or remove categories from the project. Each category has its own categoryID, which is auto-incremented.
What is the best way to output a categoryID from a neural network?
Please keep in mind that the project may have thousands of categories in the future. The network must be able to handle that.
Thanks for your help!
Related
I'm trying to build a model on labeled data that can cluster on a specific field.
Example data:
The field I want to cluster on is class_id. I want to be able to give the model (class_id, date, class_time) and get an estimated time in minutes that a student stays in my class for a specific date and time. I want to cluster by class_id because each class is different in its own way. Is there a model or way that can do this? Thanks!
You are not looking for clustering.
Instead, what you ask for is trivial SQL: GROUP BY class_id.
If the classes are indeed that different, then you'll likely want to learn a separate regression for each class, after partitioning.
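A minimal sketch of "partition by class_id, then fit one regression per class", assuming class_time is reduced to a single numeric start hour and the relationship is roughly linear. The rows, the class names, and the helpers `fit_line`, `fit_per_class` and `predict` are made up for illustration and do not come from the question's data.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; falls back to the mean if x is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_per_class(rows):
    """rows: iterable of (class_id, start_hour, minutes_stayed). Returns {class_id: (intercept, slope)}."""
    groups = defaultdict(list)  # the GROUP BY class_id step
    for class_id, start_hour, minutes in rows:
        groups[class_id].append((start_hour, minutes))
    return {
        cid: fit_line([h for h, _ in pts], [m for _, m in pts])
        for cid, pts in groups.items()
    }

def predict(models, class_id, start_hour):
    intercept, slope = models[class_id]
    return intercept + slope * start_hour

# Made-up rows, purely illustrative
rows = [
    ("math", 9, 50), ("math", 11, 54), ("math", 13, 58),
    ("art", 9, 90), ("art", 11, 80), ("art", 13, 70),
]
models = fit_per_class(rows)
print(predict(models, "math", 15), predict(models, "art", 15))
```

Each class gets its own intercept and slope, so nothing learned from one class leaks into another. A class with very few rows will get an unreliable fit.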
Let's say I have a large dataset from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I am not sure what to do when I have millions of unique classes. Also, please suggest any other method we could use to preprocess the data.
Using the user id directly in the model is not a good idea: as you said, it would result in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model each time a new user appears.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users you have based on other features, and then pass the cluster as a feature instead of the user id. I don't know if this is a good idea, though, since I don't know the kind of data you have.
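To illustrate the clustering idea, here is a rough sketch that summarises each user by their mean hours played and groups users with a tiny 1-D k-means, so the model sees a small cluster id instead of millions of user ids. The rows, the choice of mean hours as the only feature, and the helper names are assumptions for illustration. A real dataset would likely need more features and a library implementation.

```python
from collections import defaultdict

def mean_hours_per_user(rows):
    """rows: iterable of (user_id, hours_played). Returns {user_id: mean hours}."""
    totals = defaultdict(lambda: [0.0, 0])
    for user_id, hours in rows:
        totals[user_id][0] += hours
        totals[user_id][1] += 1
    return {u: s / n for u, (s, n) in totals.items()}

def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means; centroids start at evenly spaced positions in the sorted values."""
    ordered = sorted(values)
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            buckets[nearest].append(v)
        # Keep the old centroid if a bucket ends up empty
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids

def assign_cluster(value, centroids):
    """Index of the nearest centroid; also works for users not seen during clustering."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

# Made-up rows, purely illustrative
rows = [("u1", 0.5), ("u1", 1.0), ("u2", 0.8), ("u3", 6.0), ("u3", 7.0), ("u4", 6.5)]
per_user = mean_hours_per_user(rows)
centroids = kmeans_1d(list(per_user.values()), k=2)
user_cluster = {u: assign_cluster(v, centroids) for u, v in per_user.items()}
print(user_cluster)
```

A new user can be assigned to the nearest existing centroid once they have some history, which avoids retraining for every new id.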
Also, you are talking about making a prediction on a given date. The data you described doesn't suggest that but if you have the number of hours per multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I'm looking for a machine learning method to recognize input ranges that result in customer dissatisfaction.
For instance, assume that we have a database with the customer's age, the customer's gender, the date and time the customer stops by, the person in charge of serving the customer, etc., and finally a number in the range 0 to 10 which stands for customer satisfaction (extracted from the customer's feedback).
Now I'm looking for a method to determine the input ranges which result in dissatisfaction. For example: male customers who are served by John between 10 and 12 are mainly dissatisfied.
I believe there already is a kind of clustering or neural network method for this purpose. Could you help me?
This is not a clustering problem. You have training data.
Instead, you may be looking for a decision tree.
There is more than one method to do it (correlation analysis, for example).
One simple way is to split your data into classes by the degree of satisfaction (the target):
Classes:
0-5 DISSATISFIED
6-10 SATISFIED
Then look for values that repeat across the features within each class.
For example:
If you are interested in one feature, e.g. the person who served the clients, then just take the most frequent name within each of the two classes to get a result like "80% of dissatisfied clients were served by John".
If you are interested in more than one feature, e.g. the person who served the client AND the time of day, you can treat the pair of features as a single feature and do the same thing as in the first case. You will then get something like "30% of dissatisfied clients were served by John between 10 and 11 am".
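The counting described above can be sketched as follows. The records, the field names (`server`, `hour`, `score`) and the helper names are made up for illustration. The 0-5 / 6-10 split follows the classes listed above.

```python
from collections import Counter

def label(score, threshold=5):
    """Scores 0-5 count as DISSATISFIED, 6-10 as SATISFIED."""
    return "DISSATISFIED" if score <= threshold else "SATISFIED"

def share_by_feature(records, keys, target_label):
    """Share of each combination of `keys` among the records carrying target_label."""
    subset = [r for r in records if label(r["score"]) == target_label]
    counts = Counter(tuple(r[k] for k in keys) for r in subset)
    total = len(subset)
    return {combo: n / total for combo, n in counts.most_common()} if total else {}

# Made-up records, purely illustrative
records = [
    {"server": "John", "hour": 10, "score": 2},
    {"server": "John", "hour": 10, "score": 3},
    {"server": "John", "hour": 15, "score": 8},
    {"server": "Mary", "hour": 10, "score": 9},
    {"server": "Mary", "hour": 15, "score": 4},
]

one_feature = share_by_feature(records, ["server"], "DISSATISFIED")
two_features = share_by_feature(records, ["server", "hour"], "DISSATISFIED")
print(one_feature)
print(two_features)
```

One caveat: a share within the dissatisfied class is only meaningful next to the same share over all records. If John serves most customers, he will also appear most often among the dissatisfied ones.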
What do you want to know? Which service person to fire, what the best hours to provide the service are, or something else? I mean, what are your classes?
Provided you want to evaluate the service person, the classes are the persons. In an SVM (and I think the same applies to a NN) I would split all data that is not purely numerical into boolean attributes.
Age: unchanged, number
Gender: male 1/0, female 1/0
Date: 7 features for the days of the week, possibly the service person's number of days of experience, and one attribute for each special date, e.g. national holiday 1/0
Time: split the time span into reasonable ranges, e.g. 15 min. Each range is a feature
Satisfaction: unchanged - number 0-10
With this model you could predict the satisfaction index for each service person for given date, time, gender, age.
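A sketch of the encoding listed above, assuming 15-minute slots over the full 24 hours and a caller-supplied set of holiday dates. The function name `encode`, its parameters and the feature order are illustrative choices, and the experience-days feature is left out.

```python
from datetime import datetime

SLOT_MINUTES = 15
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 96 slots of 15 minutes

def encode(age, gender, when, holidays=frozenset()):
    """Flat feature list: age, 2 gender flags, 7 weekday flags, 1 holiday flag, 96 time-slot flags."""
    features = [float(age)]
    features += [1.0 if gender == g else 0.0 for g in ("male", "female")]
    features += [1.0 if when.weekday() == i else 0.0 for i in range(7)]  # Monday is index 0
    features.append(1.0 if when.date() in holidays else 0.0)
    slot = (when.hour * 60 + when.minute) // SLOT_MINUTES
    features += [1.0 if slot == i else 0.0 for i in range(SLOTS_PER_DAY)]
    return features

# A 34-year-old male customer stopping by on Monday 2024-01-01 at 10:20
vec = encode(34, "male", datetime(2024, 1, 1, 10, 20))
print(len(vec))
```

If the service is only open for part of the day, the slots can be limited to opening hours to keep the vector shorter.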
I guess you could try using anomaly detection algorithms. Basically, if you consider the satisfaction level as the dependent variable, you can try to find the samples which are located far away from the majority of the samples in Euclidean space. These outlying samples could signify dissatisfaction.
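One very simple way to read "far away from the majority" is the distance to the centroid, flagging samples whose distance is unusually large. The sketch below assumes purely numeric features on comparable scales. The points, the threshold of 2 standard deviations and the helper name are illustrative assumptions rather than a recommended detector.

```python
import math
import statistics

def distance_outliers(points, z_threshold=2.0):
    """Indices of points whose distance to the centroid is more than
    z_threshold standard deviations above the mean distance."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    dists = [math.dist(p, centroid) for p in points]
    mean_d = statistics.mean(dists)
    std_d = statistics.pstdev(dists)
    if std_d == 0:
        return []
    return [i for i, d in enumerate(dists) if (d - mean_d) / std_d > z_threshold]

# Made-up 2-D samples: nine close together and one far away
points = [
    (1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
    (1, 1.5), (2, 1.5), (1.5, 1), (1.5, 2), (10, 10),
]
print(distance_outliers(points))
```

Whether the flagged samples actually correspond to dissatisfied customers still has to be checked against the satisfaction scores.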
I have recently been working on my course project: an Android app that helps fill in an expense form automatically based on the user's voice. Here is one sample sentence:
What I want is for the app to fill in the form automatically. My form has several fields: time (yesterday), location (MacDonald), cost (10 dollars), type (food). Here the "type" field can be food, shopping, transport, etc.
I have used the word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. So there should be some records in advance, entered manually by the user, to train the model. After training, when a new record comes in, I first extract the time, location and cost fields, and then calculate the type field based on the model.
But I don't know how to represent the location field. Should I use a dictionary of many well-known locations and represent each location by its index? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an ordering of the dictionary and use a word-frequency vector to represent each document. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of model: Naive Bayes, which is very efficient for classification tasks like this. At this stage, you can just find a suitable package, feed in your training data, and enjoy the classifier it returns.
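The three stages above can be sketched end to end in plain Python: word counts as the bag-of-words features, a handful of labelled sentences as the training set, and a multinomial Naive Bayes with add-one smoothing as the model. The training sentences and the `NaiveBayes` class are made up for illustration. In practice a package such as scikit-learn would replace the hand-written class.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        # Words never seen in training carry no information and are skipped
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up training pairs, purely illustrative
texts = [
    "lunch at mcdonald", "dinner at pizza place",
    "bus ticket downtown", "taxi to airport",
    "shoes at mall", "jacket at mall",
]
labels = ["food", "food", "transport", "transport", "shopping", "shopping"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("taxi downtown"), model.predict("lunch pizza"))
```

Working in log space avoids numeric underflow when sentences get longer, and the add-one smoothing keeps a word that never occurred in a class from zeroing out that class.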
I'm a huge football (soccer) fan and interested in machine learning too. As a project for my ML course I'm trying to build a model that predicts the home team's chance of winning, given the names of the home and away teams. (I query my dataset and create data points based on previous matches between those two teams.)
I have data for several seasons for all teams; however, I have the following issues that I would like some advice on. The EPL (English Premier League) has 20 teams which play each other at home and away (380 games in total per season). Thus, each season, any two teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10=20 datapoints for the two teams. However I do not want to go past 3 years since I believe teams change quite considerably over time (ManCity, Liverpool) and this would only introduce more error into the system.
So this results in only around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form, etc.
However, the idea of having only 6-8 data points to train with seems incorrect to me. Any thoughts on how I could counter this problem (if it is a problem in the first place, that is)?
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of little things that I would try if I were in your position.
I share your concern that 6-8 points per class is too little data to build a reliable model. So I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away), and I would add two features, one for the home team and one for the away team. In that setup, you can still predict which team would win given whether it is playing at home or away, and your problem has more data to produce a result.
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model, and you could benefit from the additional data (assuming that those features are valid in other leagues).
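A small sketch of what "teams as features" could look like: each match becomes one row with a one-hot block for the home team and another for the away team, and the label is whether the home side won. The three matches and their results are invented for illustration, and the helper names are hypothetical.

```python
def build_team_index(matches):
    """Map every team name appearing in the data to a stable column index."""
    teams = sorted({t for home, away, _ in matches for t in (home, away)})
    return {team: i for i, team in enumerate(teams)}

def encode_match(home, away, team_index):
    """Concatenate a one-hot vector for the home team with one for the away team."""
    n = len(team_index)
    vec = [0.0] * (2 * n)
    vec[team_index[home]] = 1.0
    vec[n + team_index[away]] = 1.0
    return vec

# Invented results: (home team, away team, 1 if the home team won else 0)
matches = [
    ("Arsenal", "Chelsea", 1),
    ("Chelsea", "Liverpool", 0),
    ("Liverpool", "Arsenal", 1),
]
team_index = build_team_index(matches)
X = [encode_match(h, a, team_index) for h, a, _ in matches]
y = [won for _, _, won in matches]
print(X[0], y[0])
```

With this layout every match in the league contributes a training row, instead of only the handful of meetings between one specific pair of teams.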
I have some similar system - a good base for source data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). Depending on your criterion function (best fit or maximum profit), you may build your own prediction model.
One very good thing to know is that each league is different. Bookmakers also give different home-win odds on the favourite in Belgium than in the 5th English league, where you can find real value odds, for instance.
From that you can build an interesting model, such as betting tips to beat bookmakers on specific matches, using your pattern to find value bets. Or you can try to chase as many winning tips as you can, but possibly earn less (draws pay well, even though fewer draw bets win).
Hopefully I gave you some ideas, for more feel free to ask.
Don't know if this is still helpful, but features like Full-time goals, half time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also apply to the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which team is the home team (already mentioned), goals scored in the last few matches home and away, etc.
Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.
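To make the "only use what is known BEFORE the match" point concrete, here is a sketch that computes a team's recent form strictly from earlier matches. The match records, the team names A, B and C, and the helper `recent_form` are made up for illustration. The 3/1/0 points scheme is the usual league scoring.

```python
def recent_form(history, team, before_date, last_n=5):
    """history: list of dicts with date (ISO string), home, away, home_goals, away_goals.
    Returns (points per game, goals scored per game) over the team's last_n
    matches played strictly before before_date."""
    past = [m for m in history
            if m["date"] < before_date and team in (m["home"], m["away"])]
    past = sorted(past, key=lambda m: m["date"])[-last_n:]
    if not past:
        return 0.0, 0.0
    points = goals = 0
    for m in past:
        is_home = m["home"] == team
        scored = m["home_goals"] if is_home else m["away_goals"]
        conceded = m["away_goals"] if is_home else m["home_goals"]
        goals += scored
        points += 3 if scored > conceded else 1 if scored == conceded else 0
    return points / len(past), goals / len(past)

# Invented matches, purely illustrative
history = [
    {"date": "2024-01-01", "home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"date": "2024-01-08", "home": "C", "away": "A", "home_goals": 1, "away_goals": 1},
    {"date": "2024-01-15", "home": "A", "away": "C", "home_goals": 0, "away_goals": 3},
]
# Features for the match on 2024-01-15 use only the two earlier games
print(recent_form(history, "A", "2024-01-15"))
```

ISO date strings compare correctly as plain strings, which is why the strict `<` filter is enough to keep the match being predicted out of its own features.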