I'm trying to build a model on labeled data that can cluster on a specific field.
Example data:
The field I want to cluster on is class_id. I want to be able to give the model (class_id, date, class_time) and get an estimated time in minutes that a student stays in my class for a specific date and time. I want to cluster by class_id because each class is different in its own way. Is there a model or way that can do this? Thanks!
You are not looking for clustering.
Instead, what you ask for is trivial SQL: GROUP BY class_id.
It classes are indeed that different, then you'll likely want to learn a separate regression for each class, after partitioning.
Related
Let's say I have a large data from an online gaming platform (like steam) which has 'date, user_id, number_of_hours_played, no_of_games' and I have to write a model to predict how many hours a user will play in future for a given date. Now, user_id has a large number of unique values (in millions). I know for class data we can use one hot encoding, but not sure what to do when I have millions of unique classes. Also, suggest if we can use any other method to preprocess the data.
Using directly the user id in the model is not a good idea, since that would result like you said into a large number of features, but also in overfitting since you would get one id per line (If I understood correctly your data). It would also make your model useless in case of a new user id and you would have to retrain your model each time you have a new user.
What I would recommand in the first place is to drop this variable and try to build a model with only the other variables.
Another Idea that you could try is to perform a clustering on the users you have based on other features, and then pass the cluster as a feature instead of the user id, but I don't know if this is a good idea since I don't know the kind of data you have.
Also, you are talking about making a prediction on a given date. The data you described doesn't suggest that but if you have the number of hours per multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I'm looking for a machine learning method to recognize input ranges that result customer dissatisfaction.
For instance, assume that we have a database of customer's age, customer's gender, date and time that customer stops by, person who is in charge of providing service to customer, etc. and finally a number in range of 0 to 10 which stands for customer satisfaction (Extracted from customer's feedback).
Now I'm looking for a method to determine input ranges which results dissatisfaction. For example male customers who are stopping by John, between 10-12pm are mainly dissatisfied.
I believe there already is a kind of clustering or neural network method for this purpose. Could you help me?
This is not a clustering problem. You have training data.
Instead, you may be looking for a decision tree.
There is more than one method to do it (correlation analysis for ex.)
One simple way is to classify your data by the degree of satisfaction (target)
Classes:
0-5 DISSATISFIED
6-10 SATISFIED
Than look for repetition along features in each cluster.
For example:
if you are interested by one feature, ex: the person who stopped clients, than just get the most frequent name within the two classes to get a result like 80% of unsatisfied client was stopped by jhon
if you are interested by more than one feature, ex: the person who stopped the client AND the time of the day, in this case you can consider the couple of features us one and do the same thing as the first case, than you will get something like 30% of unsatisfied client was stopped by jhon between 10 and 11 am
What do you want to know? Which service person to fire, what are the best hours to provide the service, or sth. else? I mean what are your classes?
Provided, you what to evaluate the service person - the classes are the
persons. In SVM (and I think for NN applies the same) I would split all not purely numerical data in boolean attributes.
Age: unchanged, number
Gender: male 1/0, female 1/0
Date: 7 features for days of week, possibly the number of experience days of the service person. for each special date an attribute e.g. national holiday 1/0
Time: split the time-span in reasonable ranges e.g. 15 min. Each range is a feature
Satisfaction: unchanged - number 1-10
With this model you could predict the satisfaction index for each service person for given date, time, gender, age.
I guess, you can try using anomaly detection algorithms. Basically if you consider the satisfaction level as the dependent variable, then you can try to find the samples which are located away from the majority of the samples in the euclidean space. These away samples could signify dissatisfaction.
Recently I'm working on my course project, it's an android app that can automatically help fill consuming form based on the user's voice. So here is one sample sentence:
So what I want to do is let the app fill forms automatically, my forms have several fields: time(yesterday), location(MacDonald), cost(10 dollars), type(food). Here the "type" field will include food, shopping, transport, etc.
I have used the word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. So there should be some records in advance, input by user manually to train the model. After training, when new record comes in, I first extract the time, location and cost fields, and then calculate the type field based on the model.
But I don't know how to represent the location field, should I use a dictionary to include many famous locations and use index to represent the location? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you dont know python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the Bag-of-Word representation, which you fix an order of the dictionary and use a word frequency vector to represent the documents. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to training your model, which can be tedious or expensive. If you had already published your app, and collected a lot of [sentence]-[type] pair (probably chosen by app user), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of the model: Naive Bayes, which is very efficient for classification task like this. At this stage, you can just find a suitable package, insert your training data, and enjoy the classifier it returns.
I am a newbee in the field of data mining. I am working on very interesting Data Minign problem. Data description is as follows:
Data is time sensitive. Item attributes are dependent on time factor as well as its class label. I am grouping weekly data as one instance of training or test record. Each week, some of the item attributes may change along with its Popularity(i.e. Class label).
Some sample data as below:
IsBestPicture,MovieID,YearOfRelease,WeekYear,IsBestDirector,IsBestActor,IsBestActress,NumberOfNominations,NumberOfAwards,..,Label
-------------------------------------------------
0_1,60000161,2000,1,9-00,0,0,0,0,0,0,0
0_1,60004480,2001,22,19-02,1,0,0,11,3,0,0
0_1,60000161,2000,5,13-00,0,0,0,0,0,0,1
0_1,60000161,2000,6,14-00,0,0,0,0,0,0,0
0_1,60000161,2000,11,19-00,0,0,0,0,0,0,1
My research advisor suggested to use Naive Bayes algorithm which can adapt such dynamic data that is changing with time.
I am using data from 2000-2004 as Training an 2005 as Testing. If i include Week-Year attribute in my items data set, then it will cause 0 probability in Naive Bayes. Is it ok to omit this attribute from my data set after organizing my data in chronological order?
Moreover, how to adapt my model as i read new test cases ? as the new test cases might cause change in Class label ?
Can you provide a little more insight into your methods? For instance, are you using R, SPSS, Python, SQL Server 2008R2, or RapidMiner 5.2? And if you can include a very small (3-4 row segment) of some of your data, that would help people figure out how to tackle this.
One immediate approach to get an idea of what you are looking at would be to do a Random Forest/Decision Tree and K-Means clustering in order to determine common seperation points in the data. Have you begun by a quick glance at the data's histograms, averages, and outliers?
I am a newbie learning mahout.
I learned that there are five recommenders in mahout. User-based, Item-based,...
The datasets I used is movielens 100K
I am thinking implement a little different movie recommender from user based one. i.e., instead of taking user id as an input to recommend movies to only one user, I want to take user demographic information, e.g., age range, gender, occupation, and zip code.
But the problem is how do I create my own user similarity method (The original one is taking two long type user id as parameters) and how do I combine u.user file and u.data file together?
I understand your question now. I think the simplest thing is to temporary create a dummy user with the demographic properties you are querying for, and then recommend for that dummy user.
Yes, you would have to write a UserSimilarity that implements whatever similarity rule you want on top of the demographic data.
Maybe there is another solution.
I implement my own Rescorer to deal with u.user file and input (gender, age range, ...). If each piece of information is equal, then I put the according user id into a FastIDSet.
Then, in the rescore method, I will check if the current user id is in FastIDSet, if yes, the augment the score.
In my own Recommender, I will use PlusAnoymousUserDataModel to get a temp id, and call the method recommen(id, howMany, rescorer)
However, after I tried different dataset file, I get 0 recommended item.
I am thinking whether it is the right way to use PlusAnoymousUserDataModel.