I'm at the end of my project, and my company asked me to evaluate the model without metrics. In brief, after obtaining the 10 best recommendations, I should check whether these recommendations are among the movies that the new user wanted to see. I don't understand how I can do that when my algorithm is the thing predicting these movies.
Slowly, I've found a possible answer to my question. A user said that a possible approach could be to hide a few data points at random for every user, make recommendations using your algorithm, and then uncover the hidden data and see how many of those matched the recommendations.
But I still don't have a clear idea of how to do it. Could anyone help me?
Here is how you can do evaluation:
Filter the users who have more than 20 ratings with value 5 (the exact numbers will depend on your dataset);
Randomly select two movies per user;
That is our test set: it is not used during training, but these movies should appear in the top recommendations for the corresponding users.
You can find more details and a practical implementation in the article about building a recommendation system based on Bayesian Personalized Ranking.
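The hide-and-check procedure described above can be sketched in a few lines. This is only a minimal illustration, not the implementation from the article: the function names, the `min_likes` threshold and the popularity-based `recommend_popular` placeholder are all assumptions, and you would plug in your own recommender instead.

```python
import random

def split_holdout(likes_by_user, n_hidden=2, min_likes=3, seed=0):
    """Hide n_hidden liked movies for each user with at least min_likes likes."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, movies in likes_by_user.items():
        movies = sorted(movies)
        if len(movies) < min_likes:
            train[user] = set(movies)   # too few likes: keep everything for training
            continue
        hidden = set(rng.sample(movies, n_hidden))
        test[user] = hidden
        train[user] = set(movies) - hidden
    return train, test

def recommend_popular(train, user, k=10):
    """Placeholder recommender (assumption): most-liked movies the user has not liked in train."""
    counts = {}
    for movies in train.values():
        for m in movies:
            counts[m] = counts.get(m, 0) + 1
    candidates = [m for m in counts if m not in train[user]]
    candidates.sort(key=lambda m: (-counts[m], m))
    return candidates[:k]

def hit_rate_at_k(train, test, recommend, k=10):
    """Fraction of hidden movies that appear in the users' top-k recommendations."""
    hits = total = 0
    for user, hidden in test.items():
        top_k = set(recommend(train, user, k))
        hits += len(hidden & top_k)
        total += len(hidden)
    return hits / total if total else 0.0
```

The resulting number is the share of hidden movies that your top 10 recovered, which is what "see how many of those matched the recommendations" amounts to.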
From the course "Text Retrieval and Search Engines" on Coursera I learnt some feedback algorithms for information retrieval systems, such as Rocchio. But I still can't understand how feedback is used in practice.
Why do all feedback algorithms update the query vector instead of updating the document ranking directly?
Is document click-through feedback stored in the postings list?
Thanks
But I still can't understand how feedback is used in practice.
Since you've studied Rocchio feedback, I'll explain with reference to that particular approach, although the same applies to other feedback methods as well, e.g. relevance modeling.
The Rocchio algorithm first modifies the current query representation (by adding new terms and re-weighting the initial query terms). It then performs a second-pass retrieval and obtains a new ranked list.
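To make the query update concrete, here is a minimal sketch of the usual Rocchio formula, q' = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant), on sparse term-weight dictionaries. The function name and the default values of `alpha`, `beta` and `gamma` are assumptions chosen for illustration; in practice they are tuned.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector; every vector is a {term: weight} dict."""
    updated = defaultdict(float)
    for term, w in query.items():
        updated[term] += alpha * w                       # keep the original query terms
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)    # move towards relevant documents
    for doc in non_relevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(non_relevant)  # move away from non-relevant ones
    # Terms whose weight ends up non-positive are dropped (a common convention).
    return {t: w for t, w in updated.items() if w > 0}

new_query = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "car": 1.0}],
    non_relevant=[{"jaguar": 1.0, "cat": 1.0}],
)
# new_query now also contains "car", while "cat" has been dropped.
```

The expanded query is then run against the index again, which is the second-pass retrieval mentioned above.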
Why do all feedback algorithms update the query vector instead of updating the document ranking directly?
This is because if the initial query representation is not good enough, the initial ranked list won't have high recall. In that case even reranking the results won't be of much use (unless, of course, you're doing a highly precision-oriented task and all you care about is P@10). Additional terms in the query often have a significant impact on retrieving more relevant documents in the top 1000.
Is document click-through feedback stored in the postings list?
No. The postings list may additionally contain per-document statistics for a particular term (the head of the list), e.g. term positions. Whether a document was clicked or not is global information that does not pertain to a specific term.
Also, user clicks are not used to modify the ranking of the current query. They could be used, rather, to build user profiles of interest.
I’m working on a web app where users can ask questions. These questions should be categorized by criteria based on question content, title, user data, region and so on. The questions should then be processed as follows: for some, requests for additional information should be sent; others should be deleted or marked as spam; and some should be sent directly to a specialist.
The problem is that users can’t choose the right category themselves: the categories are fairly complex, and users can cheat.
Are there any approaches for doing this automatically? For now, a few people do this job by filtering the questions manually. Perhaps some ready-made solutions exist.
This is a really complex task. You should take a look at supervised machine learning classification algorithms. You can try something similar to a spam filtering algorithm (https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering):
Gather a number of questions that have already been categorized (labeled examples).
Gather a vocabulary of words used for classifying questions (words that identify a group).
Process the question text by removing “stop words” and replacing words with their stems.
Map the question text, title, user data and so on to numbers (a question vector).
Use an algorithm such as SVM to train and apply a classifier (model).
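As a rough illustration of these steps, here is a small sketch that uses multinomial Naive Bayes (as in the linked spam filtering article) in place of an SVM, so that it runs without any library. The stop-word list, the crude plural stripping that stands in for real stemming, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "my", "to", "i", "how", "do", "for", "you", "and", "of"}

def tokenize(text):
    """Lowercase, keep letter runs, drop stop words, strip a trailing plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in words if w not in STOP_WORDS]

def train(examples):
    """examples: list of (text, label). Returns per-label word counts, label counts, vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior, using Laplace smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Title, user data and region could be added as extra tokens (for example a hypothetical `region_eu` token) so that they end up in the same question vector.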
But this is only a very general approach to look at. It’s hard to say anything more specific without additional details. I don’t think you will find a ready-made solution, since it’s a pretty specific task. But of course you can use one of the many machine-learning frameworks.
I'm looking for some advice on the problem of classifying users into various groups based on their answers during a sign-up process.
The idea is that these classifications will group people with similar travel habits, i.e. adventurous, relaxing, foodie etc. This shouldn't be a classification known to the user, so it isn't as simple as just asking what sort of holidays they like (the point is to remove user bias and the problem of not really knowing where to place yourself).
The way I see it working is by asking questions about things such as the apps they use and the accounts they interact with on social media (GoPro, restaurants etc.), and by giving some scenarios and asking which sounds best. The answers would be chosen from a set provided to them, so we have control over the variables. The main problem I have is how to associate numerical values with each of these.
I've looked into various machine learning algorithms and have realised this is most likely a clustering problem, but I can't seem to figure out how to use this style of question to assign a value to each dimension that will actually give a useful categorisation.
Another question I have is whether there are any resources where I could find information on the sort of questions to ask users in order to gain information that would allow classification like this.
The sort of process I envision is one similar to https://www.thread.com/signup/introduction if anyone is familiar with it.
Any advice welcomed.
The problem you have at hand is that you want to calculate a similarity measure based on categorical variables, namely the users' choice of apps, accounts etc. Unless you measure the similarity of these apps with respect to an attribute, such as how "foodie" an app is, it will be a hard problem to specify. Also, you would need to know all the possible states a categorical variable can assume to create a similarity measure like this.
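One way to picture the attribute-based idea: give every selectable option a hand-assigned score per attribute, average the scores of what a user picked, and compare users by cosine similarity. This is only a sketch under those assumptions; the option names, the scores in `OPTION_SCORES` and the function names are all made up for illustration.

```python
import math

# Hypothetical hand-assigned scores per option: (adventurous, relaxing, foodie).
OPTION_SCORES = {
    "gopro": (0.9, 0.1, 0.0),
    "restaurant_guide": (0.1, 0.3, 0.9),
    "spa_finder": (0.0, 0.9, 0.2),
    "trail_maps": (0.8, 0.2, 0.1),
}

def profile(selected):
    """Average the attribute scores of the options a user picked."""
    vectors = [OPTION_SCORES[name] for name in selected]
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

hiker = profile(["gopro", "trail_maps"])
spa_fan = profile(["spa_finder"])
print(cosine(hiker, spa_fan))  # low value: the two profiles point in different directions
```

The resulting profile vectors are numeric, so they could also be fed to a clustering algorithm.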
If the final objective is to recommend something that similar people (based on app selection or social media account selection) have liked or enjoyed, you should look into collaborative filtering.
If your feature space is well defined and static (known apps, known accounts, a limited set with few missing values), then look into content-based recommendation systems. Something as simple as Market Basket Analysis can give you a reasonable working model.
Otherwise, if you really want to model the system with a bunch of features that can assume random states, this could be done with multivariate probabilistic models. If the structure (relationships and influences between features) is well defined, you could benefit from Probabilistic Graphical Models, such as Bayesian Networks.
You really do need to define your problem better before you start solving it though.
You can use prime numbers. If each choice on the list of all possible choices is assigned a different prime, and the user's selection is saved as the product of the chosen primes, then you can always tell whether the user made a particular choice: the selection modulo that choice's prime is 0. Beauty of prime numbers, voila!
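A minimal sketch of this encoding, assuming each choice is selected at most once; the function names and the example choices are hypothetical.

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for short choice lists)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def encode(selected, prime_of):
    """Store a selection as the product of the primes of the chosen items."""
    product = 1
    for choice in selected:
        product *= prime_of[choice]
    return product

def has_choice(code, choice, prime_of):
    """A choice was made exactly when its prime divides the stored product."""
    return code % prime_of[choice] == 0

choices = ["gopro", "restaurants", "spa"]
prime_of = dict(zip(choices, first_primes(len(choices))))  # gopro=2, restaurants=3, spa=5
code = encode(["gopro", "spa"], prime_of)                  # 2 * 5 = 10
print(has_choice(code, "spa", prime_of))                   # True
print(has_choice(code, "restaurants", prime_of))           # False
```

Note that the product grows quickly with the number of selected choices. Python integers are arbitrary precision, but in a language with fixed-width integers this can overflow.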
So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a training data set for twitter data that focuses more on the topic of the tweet as opposed to a sentiment analysis publicly available?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topic or categories are to identify it. The fact that this is Twitter may be relevant, but the data source is likely to be much less relevant to the classification accuracy you will achieve than the topic is. So if you take the infamous 20 newsgroups data set this is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
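The keyword workaround could look roughly like this. The categories and seed keywords in `SEED_KEYWORDS` are invented for the example, and tweets that match no category or more than one are simply left unlabeled, which is one possible design choice among several.

```python
import re

# Hypothetical seed keywords per category; replace with your own topics.
SEED_KEYWORDS = {
    "sports": {"football", "nba", "goal"},
    "politics": {"election", "senate", "vote"},
}

def weak_label(tweet):
    """Return the single category whose seed keywords occur in the tweet, else None."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    matches = [cat for cat, words in SEED_KEYWORDS.items() if tokens & words]
    return matches[0] if len(matches) == 1 else None

def build_training_set(tweets):
    """Keep only the tweets that received an unambiguous keyword label."""
    labeled = [(t, weak_label(t)) for t in tweets]
    return [(t, label) for t, label in labeled if label is not None]

tweets = ["Late goal wins the match", "Senate vote delayed", "Weather today"]
print(build_training_set(tweets))
```

The labeled subset then serves as the training data for the actual classifier.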
Alternatively, depending on your application (and if you actually want to build an application), you may as well start with only a small data set and accept that you will have poor classification at first. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.
Let's say we have 10-D data from a class of students. The data involves parameters like Name, Grades, Courses, No. of hours of lectures, etc. for all the students of the class. Now, we want to analyze the impact of No. of hours of lectures on Grades.
If we look closely at our parameters, the Name of the student has nothing to do with Grades, but the Courses taken by the student "might" have an impact on Grades.
So, there could be parameters which are dependent on each other, while some others can be totally independent. My question is, how do we decide which parameters have an impact on our classification/regression problem and which don't?
PS: I am not looking for exact solutions. If someone can just show me the right direction or keywords for a Google search, that should be sufficient.
Thanks.
The technique you're looking for is called dimension reduction. The Stanford machine learning class goes over one method (principal component analysis).
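To give a feel for what principal component analysis computes, here is a small sketch that finds the first principal component (the direction of greatest variance) with power iteration on the covariance matrix. The function names and the toy data are assumptions; it expects purely numeric columns, and in practice a library routine would normally be used instead.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of rows (a list of equal-length numeric lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Power iteration on the covariance matrix; returns a unit-length direction."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # start vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v

# Toy data: column 2 is twice column 1, column 3 is constant.
data = [[1, 2, 5], [2, 4, 5], [3, 6, 5], [4, 8, 5]]
print(first_principal_component(data))  # third entry is 0: the constant column carries no variance
```

Columns with very different scales are usually standardized first, otherwise the largest-valued column dominates the result.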
This is the problem of independent component analysis. ICA is a family of methods for finding the statistically independent components of data sets. This is a difficult problem, and there exists a large variety of algorithms for finding good solutions. A popular algorithm is FastICA.
There are also related concepts of whitening and decorrelation.