What is the best approach for a interpreting an text input for geocoding purposes? - parsing

Consider the following site:
http://maps.google.com
It has a main text input, where the user can type business, countries, provinces, cities, addresses and zip codes. I wonder which is the best way to implement a search like this. I realize that probably Google Maps uses a full text search with all kinds of data in the same table, and it has a chance of having a parser which classifies the input (i.e. between numeric, like zip codes and coordinates, and textual, like business and addresses).
With the data spread in many tables and systems, a parser is essential. The parser could be built from regular expressions, or could be built with IA tools like Artificial Neural Networks and Genetic Algorithms.
Which approach would you recommend?

It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search sites syntax). The only drawback of using something like Lucene is you would need to build its index. You wouldn't be querying your database directly (which could get very complicated...inverted index are pretty much designed for what your trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.

I'd have the data in one database. If the data got to big or I knew it would be huge, I'd assign an id to each business, address etc, then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (Sorry, I'm English), maybe then get the location of the user based on their IP but what happens if they say:
business: Argos Scotland
Now you don't know if the company has two words, or if there is a location next to it. All of this has to be taken into consideration.
P.s Sorry if that made no sense.

You will need to pre process the query before doing a full text search on it. If you are using a GIS database, then you will already have columns like city, areacode, country etc. Convert your query into tokens seperated on space or commas, or both. Then hit individual columns to see match. This way you will know what part of the query is the city, the areacode etc.
You could also try some naive approximation approaches,example - 6 consecutive numbers will probably be an area code. Look for common words like "road" , "restaurant" , "street" etc which will be part of many queries and then use some approximation to figure out what they are looking for. Hope this helps.

Related

Query-document similarity with doc2vec

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.

Recognizing language patterns in a list of sentences on Google Sheets

I am trying to analyze a series of sentences by identifying the most common adverb-adjective-noun strings. I have managed to get answers for how to do so with random words but I think this is a standalone question, and it might better to be dealt with separately.
In this case, I would like to omit common word types like personal pronouns, articles, prepositions and even verbs. Ideally, the results should produce:
Most common nouns
Most common adjectives
Most common adverbs
Most common adjective+noun strings
Most common adverb+noun strings
I understand there is a way to do this by using an online dictionary but I have been unable to integrate that in my code to get the results I want. Is there any way of automating this without listing all the words that you want omitted? How could it be done?
Here's a link to the spreadsheet I'm using (for this particular query, see page 2) and a screenshot of the types of text I would like to analyze with a manual color-coded visualization of what I want to achieve:

How to best use Solr parser syntax in a specific business requirement

Just starting to learn Solr for a project at work and was wondering on how to go about this issue. Our application allows a user to search based on a business name. The business name is comprised of 3 different categories ( English, French and Combined Name ). Based on a single query entered by the user, how would one go about using Solr to provide the most relevant search results? I have looked into fuzzy and proximity searches which seem reasonable enough. Although fuzzy search only applies to a single term, which makes me believe that I would need to split the query into single terms and apply fuzzy search to each and merge the results if I were to use it ? My question is how to best approach the problem ? Thanks!
To provide relevancy to your documents , you need to have a combination of proper boosting queries and your priorities as what relevance means to your use case . If Regex based search is included in use case you may go for NGrams , if exact search is what you seeking for , boosting is important . You can use parameters like phrase slope , mm, and other edismax parameters to your advantage . You may use a combination of title and text content search, with a good combination of boosts . Also , Solr allows you to pass your query in parenthesis, that functions like an SQL IN query , that further boosts relevancy in your documents by sticking to keywords only mentioned in the query . And , at last , if all these doesn't suffice, you may use custom function queries to meet your needs . While doing all this, just keep in mind the Analyzers in schema.xml file are just right and serve the purpose to execute above mentioned queries .
You can go as far down this rabbit-hole as you have time for wrt Business Name search. (Fuzzy, sound-alike, language-specific analysis, weird compounded-terms used as a domain name (eg: getting "EZBake" to match "easy bake", or "1-to-1" to match "one to one" is non-trivial)
Since this sounds like a pre-existing application, I typically look to query logs (when available) to sample the frequency of different types of mismatches (dig out the zero-result search terms and start manually categorizing the high-level issues behind the more common mismatches).
That will provide you with a backlog of "matching use cases to research how to implement" (in the order of maximal benefit, as determined by your sample).
Then you're ready to start burning them down, and asking much more specific questions about how to get Solr to jump through your domain-specific hoops.

How to implement fuzzy search

I'm using Neo4j 3 REST API and i have node named customer it has properties like name etc i need to get search results of name of customer eg i should get results for name "john" for my input "joan".how to implement fuzzy search to get my desired results.
Thanks in advance
First off, I want to make that you know that if you're using Neo4j 3.x that 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to implement Lecene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension which will allow you to use Lucene a bit more directly.
Perhaps the easier alternative is to use elasticsearch with Neo4j and have elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. There they provide a link to a GitHub repository which is a plugin for Neo4j which automagically updates ElasticSearch with data from Neo4j and which provides and endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
You will have to try using https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/ which in this case will work. If your input is Joan you will not get John as the response, unless you just give jo as input in which you will get both. To get what you are expecting you will have to use the soundex search.
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

Lucene partial word matching

Lucene does not support it out of the box, so I need some help building my query.
Lets say I have the document with a field value "Develop"
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating two queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a words list and do the previous example for each word?
What you're asking is if I understand you correctly not feasible on any large scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see links at the bottom). A fully fledged string matching might be very nice to have, but it does not scale: you will never be able to query a decently sized index (say more than a couple of dozen/hundreds of documents) in an acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop'. As long as you are happy with letting users search for syllables I guess you can do something.
You would have to create use tokenizer that splits up words in your indexed according to their syllables and create a database index over the syllables. (I am not sure there are built in tokenizers for the English language that can do that and writing one on your own might be tricky...)
An important thing to note:
If you would index the full words AND the seperate syllables the size of your index will be much larger than if you only index one of the two.
However I would not suggest to index only syllables. If you want to also allow your users to search for the full word 'Develop' (which I guess you want) this would result in two queries with a logical and between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
Stemming
Another way to somehow arrive at what you're trying could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow to search for 'dev' but not for 'lop'...)
Again, I don't think such a word stem feature is already in Lucene. Writing one for yourself will be a pain and involve working with/importing huge dictionaries.
Links
These might be looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf

Resources