"categorisation engine"? - search-engine

Can anyone explain "Categorization Engine" in the search-engine domain?
I have googled it, but could not find any satisfactory explanations. Even reference links would help!
P.S.: Thanks in advance!

It would be easier if you could provide more context, but generally I think you are referring to the area of Natural Language Processing known as Categorization or Text Categorization.
That discipline is about parsing natural-language text (e.g. English) and assigning it to one or more categories: was the speaker talking about cars, new medical products, the latest fashion trends, etc.?
Some references:
Classification of entire documents:
http://en.wikipedia.org/wiki/Document_classification
Search for concepts in documents:
http://en.wikipedia.org/wiki/Concept_Mining
Automatic text categorization:
http://nlp.hivefire.com/articles/11632/fully-automatic-text-categorization-by-exploiting-/
Commercial categorization engine:
http://www.sightup.com/en/produits_sightis.html
If you want to use a search engine to find further references, I would suggest searching on "natural language processing" categorization
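For a concrete feel of what text categorization does, here is a minimal sketch using scikit-learn (my own illustration; the library choice, training texts, and category names are invented, not something the answer above specifies):

    # Minimal text-categorization sketch with scikit-learn.
    # Training data and category names are made up for illustration.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "The new sedan gets excellent mileage and handles well",
        "Clinical trials showed the drug reduced symptoms",
        "This season's runway featured bold colors and long coats",
    ]
    train_labels = ["cars", "medicine", "fashion"]

    # TF-IDF turns raw text into weighted term vectors; a Naive Bayes
    # classifier then assigns each new text to the most likely category.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["the new sedan handles well"]))  # -> ['cars']

A real categorization engine does the same thing at scale, with far more categories and training data.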

Related

Text recommendation based on keywords

I need some advice on the following problem.
I'm given a set of weighted keywords (by percentage) and need to find a text in a database that best matches those keywords. I will give an example.
I'm presented with these keywords:
Sun (90%)
National Park (85%; some keywords contain two words)
Landmark (60%)
Now let's say my database contains three text entries, e.g.:
Going-to-the-Sun Road is a scenic mountain road in the Rocky Mountains of the western United States, in Glacier National Park in Montana.
Everybody has a little bit of the sun and moon in them. Everybody has a little bit of man, woman, and animal in them.
A hybrid car is one that uses more than one means of propulsion - that means combining a petrol or diesel engine with an electric motor.
Obviously the first text is the one that best matches the given set of keywords, so this is what I want to recommend to the user. The second text, which relates somewhat to the "sun" keyword, could be an acceptable choice too.
The third text is totally irrelevant and should only be recommended as a last resort, when everything else fails.
I'm totally new to this kind of problem, so I need some advice on which technologies/algorithms to use. It seems like there is some machine learning (NLP) involved, or some kind of fuzzy logic; I'm not really sure.
You need to use a combination of query-term boosting and synonyms.
Look into the question "Is there a way to do fuzzy string matching for words on string?"
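One simple way to implement the weighted-keyword ranking described above is TF-IDF vectors plus a weight-scaled query. This is a hand-rolled sketch of mine; a search engine with query-term boosting, as suggested above, does the same thing more robustly:

    # Sketch: rank texts against weighted keywords using TF-IDF vectors
    # and cosine similarity.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    texts = [
        "Going-to-the-Sun Road is a scenic mountain road in Glacier National Park in Montana.",
        "Everybody has a little bit of the sun and moon in them.",
        "A hybrid car combines a petrol or diesel engine with an electric motor.",
    ]
    keywords = {"sun": 0.90, "national park": 0.85, "landmark": 0.60}

    # ngram_range=(1, 2) lets two-word keywords like "national park"
    # match as a single unit.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    doc_matrix = vectorizer.fit_transform(texts)

    # Build the query vector as the weight-scaled sum of each keyword's vector.
    query = sum(w * vectorizer.transform([kw]) for kw, w in keywords.items())

    scores = cosine_similarity(query, doc_matrix).ravel()
    for i in np.argsort(-scores):
        print(f"{scores[i]:.3f}  {texts[i][:60]}")

On the example texts this ranks the Glacier National Park sentence first, the sun/moon quote second, and the hybrid-car text last, which is the desired ordering. Synonym expansion would go on top of this so that, e.g., "landmark" can still contribute a score.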

How to Convert NLP Question to Knowledge Graph triple?

I have what I think is a simple question. I am trying to put together a question answering system and I am having trouble converting a natural question to a knowledge graph triple. Here is an example of what I mean:
Assume I have a prebuilt knowledge graph with the relationship:
((Todd) -[:picked_up_by]-> (Jane))
How can I make this conversion:
"Who picked up Todd today?" -> ((Todd) -[:picked_up_by]-> (?))
I am aware that there is a field dedicated to "Relationship Extraction", but I don't think this quite fits the problem. If I could name it, "question triple extraction" would be the name of what I am trying to do.
Generally speaking, it looks like a relation extraction problem with your custom relations. Since the question is too generic, this is not a full answer, just some links.
Check out reading comprehension: projects on GitHub and a lecture by Christopher Manning.
Also, look up Semantic Role Labeling.
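As a starting point, a toy template-based mapper shows the shape of the task. This is my own sketch; the template and relation name just mirror the example in the question, and real systems would use learned semantic parsing or SRL instead:

    # Toy sketch: map a natural-language question to a knowledge-graph
    # triple pattern via regex templates.
    import re

    TEMPLATES = [
        # "Who picked up Todd today?" -> (Todd) -[:picked_up_by]-> (?)
        (re.compile(r"\bwho picked up (\w+)\b", re.IGNORECASE), "picked_up_by"),
    ]

    def question_to_triple(question):
        for pattern, relation in TEMPLATES:
            match = pattern.search(question)
            if match:
                # The matched entity is the subject; '?' marks the unknown.
                return (match.group(1), relation, "?")
        return None

    print(question_to_triple("Who picked up Todd today?"))
    # -> ('Todd', 'picked_up_by', '?')

Each template hard-codes which slot of the triple the question word binds to; the learned approaches in the links above generalize exactly this mapping.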

Matching TV and Movie File names with NLP/Machine Learning?

So I've wondered if there would be a way to tokenize/tag TV or movie file names using NLP/Machine Learning.
I know there are a lot of regexp approaches out there which do this already, but shouldn't it be possible to get this done with NLP/Machine Learning as well?
Example:
The.Heart.Guy.S01E07.Die.Belastungsprobe.German.DL.720p.HDTV.x264-GDR
Should be tagged something like:
The Heart Guy       -> SHOW-NAME
1                   -> SEASON
7                   -> EPISODE
Die Belastungsprobe -> EP-NAME
German DL           -> LANGUAGE
720p                -> RESOLUTION
HDTV                -> SOURCE
x264                -> CODEC
GDR                 -> GROUP
Has anyone ever tried something like this? Any hints on where one should start, or on whether it's even possible to get something like this working?
Machine learning approaches would cost more (in effort and training data) than rule-based approaches. But if you want to try a machine learning solution, the best that comes to my mind is to use Markov models, since the problem has sequential observations and can be handled with finite state automata. You can use this paper as a reference.
I suspect using regexes is the easiest solution to this, but if you're willing to put in some time, Conditional Random Fields (CRFs) are also a great fit. Here's an article about the New York Times using a CRF-based model on recipe data.
Another example of CRFs on short text is libpostal, which extracts the parts of postal addresses.
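If you want to try the CRF route, here is a minimal sketch with sklearn-crfsuite (my choice of library; the feature set and the single training example are invented, and a usable model would need many labeled filenames):

    # Sketch: CRF sequence labeling for release-name tokens with
    # sklearn-crfsuite (pip install sklearn-crfsuite).
    import re
    import sklearn_crfsuite

    def featurize(tokens, i):
        tok = tokens[i]
        return {
            "lower": tok.lower(),
            "is_digit": tok.isdigit(),
            "like_season_ep": bool(re.fullmatch(r"S\d+E\d+", tok)),
            "like_resolution": bool(re.fullmatch(r"\d{3,4}p", tok)),
            "position": i / len(tokens),  # tags cluster at the end of names
        }

    # One toy training example; real training needs many labeled filenames.
    name = "The.Heart.Guy.S01E07.Die.Belastungsprobe.German.DL.720p.HDTV.x264-GDR"
    tokens = name.replace("-", ".").split(".")
    labels = ["SHOW", "SHOW", "SHOW", "SEASON_EP", "EP_NAME", "EP_NAME",
              "LANGUAGE", "LANGUAGE", "RESOLUTION", "SOURCE", "CODEC", "GROUP"]

    X = [[featurize(tokens, i) for i in range(len(tokens))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(list(zip(tokens, crf.predict(X)[0])))

The CRF learns both the token features and the label transitions (e.g. RESOLUTION tends to follow LANGUAGE), which is exactly the sequential structure the Markov-model answer points at.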

Translation API with candidates

I am looking for a translation API that outputs all the candidates, not just the single "best" candidate.
All statistical machine translation systems score the list of translation candidates at the last stage and choose the best one. I wonder if there is a system like Google Translate or Microsoft Translate that returns the list of all possible candidates so that I can score them myself.
Thanks.
I think WordNet is good for this:
https://wordnet.princeton.edu/
Originally WordNet is an English ontology describing English words in English, showing synonyms, definitions, etc., but there are WordNet projects for many other languages as well as multilingual WordNets. Some interesting links:
http://globalwordnet.org/wordnets-in-the-world/
http://www.certifiedchinesetranslation.com/openaccess/WordNet/
There is also a big dictionary project that builds on WordNets:
http://babelnet.org/about
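Note that WordNet gives word-level candidates per sense rather than full sentence translations. Here is a small sketch using NLTK's Open Multilingual WordNet (my own illustration; it requires the NLTK "wordnet" and "omw-1.4" downloads):

    # Sketch: word-level translation candidates via NLTK's Open
    # Multilingual WordNet. Yields candidate lemmas per sense,
    # not full sentence translations.
    import nltk
    nltk.download("wordnet")
    nltk.download("omw-1.4")
    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("engine", pos=wn.NOUN):
        print(synset.name(), "-", synset.definition())
        # lemma_names('fra') lists French lemmas linked to the same concept.
        print("  fr candidates:", synset.lemma_names("fra"))

This gives you the full candidate list per concept, which you can then score yourself, though only at the word level, not for whole sentences as the question asks.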

How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score, to help determine which hypothesis is the strongest or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get generalization good enough to actually save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules, before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, rather than to implement a full classifier. But if the purpose is learning, then go ahead.
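If you do try the Bayesian route, a naive-Bayes-style combination is a reasonable first sketch (my own illustration; the rule names and reliability numbers are invented and would have to be estimated from your own data):

    # Sketch: naive-Bayes-style combination of rule hypotheses. Each rule
    # has an assumed reliability P(rule is correct); a hypothesis's score
    # multiplies the reliabilities of rules that proposed it and
    # (1 - reliability) for rules that proposed something else.
    RULE_RELIABILITY = {
        "isbn_match": 0.98,           # invented numbers; estimate from data
        "author_title_format": 0.70,
        "known_title_search": 0.55,
    }

    def combine(proposals):
        """proposals: list of (rule_name, hypothesis) tuples."""
        scores = {}
        for hypothesis in {h for _, h in proposals}:
            score = 1.0
            for rule, proposed in proposals:
                p = RULE_RELIABILITY[rule]
                score *= p if proposed == hypothesis else (1.0 - p)
            scores[hypothesis] = score
        total = sum(scores.values())
        return {h: s / total for h, s in scores.items()}  # normalized posterior

    print(combine([
        ("isbn_match", "Mastering VSphere 5 / Scott Lowe"),
        ("author_title_format", "Vsphere5 / Unknown"),
    ]))

Run on the example above, the ISBN rule's high reliability makes its hypothesis dominate (about 0.95 after normalization), which matches the intuition in the question that ISBN matches should outweigh weaker name-format hints.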
