Semantic search engines

Are there any semantic search engines or information search engines out there that people have heard of? Google is, of course, the biggest search engine right now. Other than Google, Bing, and Yahoo, are there any semantic search engines that are interesting?

Maybe the most famous semantic search engine is Swoogle, from UMBC.
You can also find a comprehensive list of such search engines at http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/SemanticWebSearchEngines.
Moreover, in recent years Google has added support for semantic search.

My company IP Street provides semantic patent search as an API service. Specifically, we use a latent semantic indexing algorithm: you send us raw text, and we send you patents ranked from most to least conceptually similar.
Check out our API service here: http://docs.ipstreet.com/docs/concept-search

Although the question is a bit old, you might be interested in looking into Weaviate.
Example of a semantic query from the docs:
{
  Get {
    Things {
      Publication(
        explore: {
          concepts: ["fashion"],
          certainty: 0.7,
          moveAwayFrom: {
            concepts: ["finance"],
            force: 0.45
          },
          moveTo: {
            concepts: ["haute couture"],
            force: 0.85
          }
        }
      ) {
        name
      }
    }
  }
}
You can also try out the query in real-time.

Related

Ruby Solr search: give less priority to some words

I am using Solr search, and I want to give less priority to some words.
Say I search for "dance lessons": results matching "dance" should appear at the top, followed by results matching "lessons", and so on.
If you are using Sunspot, the answer would be to use boost.
This bit of the docs shows how to do phrase boosts specifically.
In your case, you may want to give a low number to the boost to effectively reduce the relevance of the word "lesson".
First, apologies if this doesn't point to the exact code (I used Sunspot long, long ago).
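As a rough sketch of what that might look like (assuming a Post model indexed with Sunspot; the field names and boost values here are illustrative, not from the question):
Post.search do
  fulltext 'dance lessons' do
    # Documents whose title contains the exact phrase score higher.
    phrase_fields title: 2.0
    # Field-level boost: matches in the title count more than elsewhere.
    boost_fields title: 1.5
  end
end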
It is also possible to boost a specific term while searching some content. Based on your context, I assume the following code should work:
Post.search do
  fulltext 'dance^2 lessons'
end
For further information on term boosting, check out this page: Boosting a Term

Find location from text

I am currently thinking about how to find a location in a text, such as a blog post, without the user having to input any additional information. For example, a post could look like this:
"Aberdeen, With a Foot on the Seafloor
Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty fishing town into the world’s center of innovation in technology for the offshore energy industry."
By reading it I realize that the post is about Aberdeen, Scotland, but how can I geotag it? I have been using the geocoder gem (https://github.com/alexreisner/geocoder) by Alex Reisner, but it seems wasteful to check every word against Google/Nominatim (OSM). My initial idea was simply to brute-force it by checking every word with the geocoder and looking for similarities between the words. But it seems like there could be a better way to go about this.
Has anyone done anything similar to this? Any algorithm that could be suggested (or gem :) ) would be immensely appreciated!
I'm sure there have been projects dedicated to this; consider, for example, Google's uncanny ability to geotag and pick data out of your personal emails effortlessly.
The most obvious answer I can see here would be to create a few regular expressions for locations. The simplest one would be for "City, Country":
Regexp.new("((?:[a-z][a-z]+))(.)(\\s+)((?:[a-z][a-z]+))",Regexp::IGNORECASE);
This would recognize "Aberdeen, Scotland", but also "course, I" or even "thanks, bye". It would be a start, though: you could query only those recognized spots instead of every word in the document.
There are also widely known regular expressions for addresses, cities, etc. You could use those as well if you find your algorithm missing matches.
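As a rough illustration, a minimal Ruby sketch along those lines might look like this, assuming the geocoder gem is configured; the regular expression and the example output are illustrative only:
require 'geocoder'

# Candidate "City, Country" pairs: capitalized word, comma, capitalized word.
CITY_COUNTRY = /([A-Z][a-z]+),\s+([A-Z][a-z]+)/

def geotag(text)
  text.scan(CITY_COUNTRY).map do |city, country|
    results = Geocoder.search("#{city}, #{country}")
    next if results.empty?
    { place: "#{city}, #{country}", coordinates: results.first.coordinates }
  end.compact
end

geotag("Since the early 1970s, Aberdeen, Scotland, has evolved ...")
# => [{ place: "Aberdeen, Scotland", coordinates: [...] }] (coordinates depend on the backend)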
Cheers!

"categorisation engine"?

Can anyone explain "categorization engine" in the search engine domain?
I have googled it, but could not find any satisfactory explanations. Even reference links would help!
P.S.: Thanks in advance!
It would be easier if you could provide more context, but generally I think you are referring to the area of natural language processing known as categorization, or text categorization.
That discipline is about parsing natural language text (e.g., English) and assigning it to one or more categories: was the speaker talking about cars, new medical products, the latest fashion trends, and so on?
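As a tiny concrete illustration, here is a sketch using the Ruby classifier-reborn gem; the categories and training text are made up:
require 'classifier-reborn'

# Train a naive Bayes classifier on two hand-picked categories.
classifier = ClassifierReborn::Bayes.new 'Cars', 'Fashion'
classifier.train 'Cars', 'engine wheels horsepower sedan mileage'
classifier.train 'Fashion', 'runway couture fabric designer trends'

classifier.classify 'the new sedan has a quiet engine' # => "Cars"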
Some references:
Classification of entire documents:
http://en.wikipedia.org/wiki/Document_classification
Search for concepts in documents:
http://en.wikipedia.org/wiki/Concept_Mining
Automatic text categorization:
http://nlp.hivefire.com/articles/11632/fully-automatic-text-categorization-by-exploiting-/
Commercial categorization engine:
http://www.sightup.com/en/produits_sightis.html
If you want to use a search engine to find further references, I would suggest searching on "natural language processing" categorization

How do you think the "Quick Add" feature in Google Calendar works?

I am thinking about a project that might use functionality similar to how "Quick Add" parses natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts are on how it might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on natural language parsing (NLP) yields results which are MUCH broader than what I believe is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than truly free-form text, I'm thinking it is a much narrower implementation of NLP. If anyone could suggest a narrower topic to research rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
  <name>meet Sam</name>
  <starttime>16:30 07/06/2010</starttime>
  <endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now that I've got a corpus, it can serve as a set of unit tests, and I need to code a parser to fit them. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is called tokenising, and there is off-the-shelf software available to do it (for example, see NLTK). To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here; it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases, and constructions in this domain.
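A minimal sketch of that hand-coded rule approach might look like this in Ruby; the patterns and field names below are my own illustrative assumptions, not Google's actual implementation:
def parse_quick_add(text)
  event = {}
  # A time like "4:30pm" becomes the start time.
  event[:start] = $1 if text =~ /(\d{1,2}(?::\d{2})?\s*(?:am|pm))/i
  # "for X minutes" gives a duration to add to the start time.
  event[:duration_minutes] = $1.to_i if text =~ /for (\d+) minutes/i
  # Text following "at" or "in" is tagged as a location.
  event[:location] = $1.strip if text =~ /\b(?:at|in) ([A-Z][A-Za-z' ]+)/
  # Whatever precedes the first preposition serves as the event name.
  event[:name] = text.split(/\s+(?:at|in|from|for)\s+/).first
  event
end

parse_quick_add("Meet Sam at Starbucks 4:30pm for 60 minutes")
# => {:start=>"4:30pm", :duration_minutes=>60, :location=>"Starbucks", :name=>"Meet Sam"}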
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull in the entirety of NLP to figure out a solution, but I haven't found an alternative. I'll update this if I find a really great solution later.

What are some good methods to find the "relatedness" of two bodies of text?

Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2k on disk. I want to be able to compare each one to every other and calculate a relatedness factor so that I can show users related information.
What are some good ways to do this? Are there known algorithms for doing this that are any good? Are there any GPL'd solutions, etc.?
I don't need this to run in real time, as I can precalculate everything. I'm more concerned with getting good results than with runtime.
I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.
These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.
You could also look into Soundex for words that "sound alike" phonetically.
I've never used it, but you might want to look into Levenshtein distance.
Jeff talked about something like this on the podcast (episode 32), describing how the related questions listed on the right side here are found.
One big tip was to remove all common words, like "the", "and", "this", etc. This will leave you with more meaningful words to compare.
And here is a similar question Is there an algorithm that tells the semantic similarity of two phrases
This is quite doable for reasonably large texts, but harder for smaller ones.
I did it once like this, and it worked pretty well (a sketch follows the steps below):
Filter out all "general" words (like a, an, the, in, etc.); this removes about 10-30% of the words.
Count the frequencies of the remaining words, and store the top x most frequent words; these are your topics.
As an extra step, you can create groups of 2/3/4 consecutive words and compare them with the groups in other texts. I used this as a measure for plagiarism.
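Here is a minimal Ruby sketch of that approach; the stop-word list and the top-10 cutoff are illustrative, and I've used the Jaccard index as the overlap measure between topic sets:
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = %w[a an and the in of to is are was for on with this that it].freeze

def topics(text, top_n = 10)
  words = text.downcase.scan(/[a-z']+/) - STOP_WORDS
  words.tally.max_by(top_n) { |_, count| count }.map(&:first)
end

# Relatedness as the overlap of the two topic sets (Jaccard index).
def relatedness(a, b)
  ta, tb = topics(a), topics(b)
  (ta & tb).size.to_f / (ta | tb).size
end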
See Manning and Raghavan's course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
This book may be relevant.
Edit: here is a related SO question
Phonetic algorithms
The article Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server shows how to install and use the SimMetrics library in SQL Server. This library lets you find the relative similarity between strings and includes numerous algorithms.
I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name
A few algorithms based on Levenshtein Distance are also available in the SimMetric library and would probably be useful in your application.
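If you want to experiment before pulling in a library, a plain-Ruby Levenshtein distance is short enough to sketch here:
def levenshtein(a, b)
  # d[i][j] = edit distance between the first i chars of a and the first j of b.
  d = Array.new(a.length + 1) { |i| Array.new(b.length + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) } }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1,             # deletion
                 d[i][j - 1] + 1,             # insertion
                 d[i - 1][j - 1] + cost].min  # substitution
    end
  end
  d[a.length][b.length]
end

levenshtein('Jellyfish', 'Smellyfish') # => 2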
