information retrieval feedback in practical - search-engine

From the course "Text Retrieval and Search Engines" on Coursera I learnt some feedback algorithms in information retrieval system, like Rocchio. But I still can't understand how feedback is used in practical.
Why all feedback algo update the query vector instead of updating the document rank directly?
Are the document click through feedback stored in Postings list?
Thanks

But I still can't understand how feedback is used in practical.
Since you've studied the Rocchio feedback, I'll try to explain with reference to this particular approach although this will be applicable to any other feedback methods as well, e.g. relevance modeling.
The Rocchio algorithm first modifies the current query representation (by adding new terms and re-weighting initial query terms). It then performs a 2nd pass retrieval and obtains a new ranked list.
Why all feedback algo update the query vector instead of updating the document rank directly?
This is because if the initial query representation is not good enough, the initial ranked list wont have a high recall. This means that even reranking the results won't be much useful (unless of course you're doing a highly precision oriented task and all you care about is P#10). Additional terms in the query will often have a significant impact in retrieving more relevant documents in top-1000.
Are the document click through feedback stored in Postings list?
No, the postings list may additionally contain per-document statistics for a particular term (the head of the list), e.g. term positions etc. The information of whether a document was clicked or not is a global information, not pertaining to a specific term.
Also, user clicks are not used to modify the ranking of the current query. They could be used, rather, to build user profiles of interest.

Related

How do I identify most related parameters in statistical modeling

I have data about car mechanic company which allow mechanics to apply for there garrage on freelance basis.
I have previous mechanic job history and based on this historical data, I want to recommend best possible location to mechanics so that he can get good job and company gets maximum acceptance.
I manually checked various parameters like location_ID, lang, lat of the job location, mechanic_Exp_years, open_position, mechanic_specialization etc.
Also tried to see relation using chart like this
https://imgur.com/a/jxmTXty
I am adding link because can not upload image due to less then 10 point
Is there any standards technique available which can statistically says that out of this 100 parameters this paramteres are good to be considered for prediction/training?
Any reference link or library much appreciated. I did checked many articles but no luck
There are many methods to do that. If you're using python, I would recommend scikit-learn's FeatureSelection module. There are many methods listed, but my pick would be Recursive Feature Elimination or short RFE. The RFE works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.
Other then that, you can also try to use PCA (Principal Component Analysis) to reduce your features to only useful ones that bring some information to your model.

Categorize customer questions based on content

I’m working on web app where users can ask questions. These questions should be categorized by some criteria based on question content, title, user data, region and so on. Next these questions should be processed in so way: for some additional information requests should be sent, others should be deleted or marked as spam and some – sent directly to some specialist.
The problem is that users can’t choose the right category themselves, it’s pretty complex things and users can cheat.
Are there any approaches how to do that automatically? For now a few persons do this job filtering questions. Perhaps some already done solutions exist.
This is a really complex task. You should take a look at supervised machine learning classification algorithms. You can try to use similar to some spam filtering algorithm (https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)
Gather some number of questions categorized before (labeled examples).
Gather some number of words (vocabulary) used for questions classifications (identify group).
Process question text removing “stop words” and replace words with their stems.
Map question text, title, user data and so to some numbers (question vector).
Use some algorithm like SVM to create and use classifier (model)
But it’s like very general approach you can look at. It’s hard to say something more specific without additional details. I don’t think you can find already done solution, it’s pretty specific task. But of cause you can use a lot of machine-learning frameworks.

Advice on classifying users in machine learning scenario

I'm looking for some advice in the problem of classifying users into various groups based on there answers to a sign up process.
The idea is that these classifications will group people with similar travel habits, i.e. adventurous, relaxing, foodie etc. This shouldn't be a classification known to the user, so isn't as simple as just asking what sort of holidays they like ( The point is to remove user bias/not really knowing where to place yourself).
The way I see it working is asking questions such as apps they use, accounts they interact with on social media (gopro, restaurants etc) , giving some scenarios and asking which sounds best, these would be chosen from a set provided to them, hence we have control over the variables. The main problem I have is how to get numerical values associated to each of these.
I've looked into various Machine learning algorithms and have realised this is most likely a clustering problem but I cant seem to figure out how to use this style of question to assign a value to each dimension that will actually give a useful categorisation.
Another question I have is whether there is some resources where I could find information on the sort of questions to ask users to gain information that'd allow classification like this.
The sort of process I envision is one similar to https://www.thread.com/signup/introduction if anyone is familiar with it.
Any advice welcomed.
The problem you have at hand is that you want to calculate a similarity measure based on categorical variables, which is the choice of their apps, accounts etc. Unless you measure the similarity of these apps with respect to an attribute such as how foodie is the app, it would be a hard problem to specify. Also, you would need to know all the possible states a categorical variable can assume to create a similarity measure like this.
If the final objective is to recommend something that similar people (based on app selection or social media account selection) have liked or enjoyed, you should look into collaborative filtering.
If your feature space is well defined and static (known apps, known accounts, limited set with few missing values) then look into content based recommendation systems, something as simple as Market Basket Analysis can give you a reasonable working model.
Else if you really want to model the system with a bunch of features that can assume random states, this could be done with multivariate probabilistic models, if the structure (relationships and influences between features) is well defined, you could benefit from Probabilistic Graphical Models, such as Bayesian Networks.
You really do need to define your problem better before you start solving it though.
You can use prime numbers. If each choice on the list of all possible choices is assigned a different prime, and the user's selection is saved as a product, then you will always know if the user has made a particular choice if the modulo of selection/choice is 0. Beauty of prime numbers, voila!

semantic search that finds sentences that could be expressed visually

Let's say I want to build a search engine that goes through a text and finds sentences or paragraphs that could be turned into an image, video or 3d-animation. So sentences that contain information that could be expressed visually.
Ideally, this search engine would get better over time.
Is there already search engine that could to that?
If not, which type of things would I need to look at/consider? My point here being that I don't really know much about machine learning and search engines. I am trying to get a feeling of which areas of machine learning, information retrieval and so forth I would need to look at.
I don't expect long answers here, just things like "well, take a look at this type of machine learning" or "this part of information retrieval theory may be relevant".
Just to get a broad overview of what I would need to look at.
Natural Language Understanding
I don't know about any existing search engine doing that. But this can be done with the help of Natural Language Understanding and Semantic Parsing.
Have a look at Stanford's Natural Language Understanding course (discussion of the text-to-scene problem can be found here) for further details.
How semantic search works is, it analysis data and put them into a 3-D vector space. One it's done with the help of bid data and knowledge graph the algorithm will try to find data points that connect to the article, the authority of the author, website relevance, and a couple of other factors. Once these factors are factored in, it then tries to create co-relate data to create a layer of information interconnected to each other. Once these information's are gathered then it is used to arrive at a conclusion to decide how relevant the data is.

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a training data set for twitter data that focuses more on the topic of the tweet as opposed to a sentiment analysis publicly available?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topic or categories are to identify it. The fact that this is Twitter may be relevant, but the data source is likely to be much less relevant to the classification accuracy you will achieve than the topic is. So if you take the infamous 20 newsgroups data set this is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
Alternatively, depending on your application (and if you actually want to build an applicaion), you may as well start with only a small data set and accept that you have poor classification. Then you generate classifications, show them to the users of your apps, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for stackoverflow and is probably better answered by a text book.

Resources