I'm writing an iOS application that pulls events from a public Google calendar, pulls out the free-form "Location" field, and drops a pin on a map corresponding to the given location. I want to make the app as flexible as possible using some kind of string search or fuzzy matching algorithms, but I'm not sure where to begin.
There are several things a calendar moderator may enter into the Location field:
A building name and room number (e.g. Foo Hall Room 123)
A building abbreviation and room number (e.g. FOO 123)
A shorthand room or location name (e.g. Foo)
Currently, I have a sqlite database composed of one table, each row storing a latitude, longitude, full building name (Foo Hall), and standardized building abbreviation (FOO).
I want to take the moderator's free-form string and obtain the correct coordinates from the database (if present).
I've tried using LIKE '%FOO%' and similar patterns, as well as Levenshtein Distance, but I run into issues, for instance if the actual building name is "Example Foo and Bar Building" and the location entered by moderator is "Example Bar Building".
The three options I've considered are...
Force the moderator to enter in a standardized abbreviation or building name. This could potentially be a tedious process for the calendar moderators, so I'm trying to avoid this if possible.
Do a crude substring search that checks if the entered string is contained anywhere in the database string. This is what my university does on their website, but it obviously isn't very flexible.
Implement a more complex fuzzy string matching algorithm that provides maximum flexibility but will take an order of magnitude more time to implement. If the right one already exists, that would be the ideal solution!!
Which of these options (if any) seems the best? Is there a better alternative that I haven't thought of? Is there a library that does what I need and I just haven't found it yet?
Thanks in advance for any help!
Related
I am planning to implement bi-gram model to predict a search text. If a user has frequently searched "Test search word" and then if user types "Test" I am looking to automatically suggest "Test search word"
I have the list of data of searched text. I am trying with bi-gram as even if user types "Tast" it should still provide "Test search word". I am implementing it in Java. I am looking for a library to supply the data that I have and when I pass the user keyed in text, it should provide the prediction.
After research I found below links
https://www.javatips.net/api/Solbase-Lucene-master/contrib/analyzers/common/src/java/org/apache/lucene/analysis/shingle/ShingleFilter.java
https://opennlp.apache.org/docs/1.8.1/apidocs/opennlp-tools/opennlp/tools/ngram/NGramUtils.html
but they are not helping in my case. Are there any Java libraries that suits my purpose?
I'm thinking of two solutions:
First
Index each of your user string queries in a MARISA (Matching Algorithm with Recursively Implemented StorAge) TRIE data structure (data structure optimised for keywords search and autocomplete).
Prepare a Levenshtein distance measurement method to tolerate typos.
Now for each new user query q, get all strings indexed in MARISA TRIE that has your query q as prefix (after typo tolerance).
Second
Use a elasticsearch suggester
Documentation https://www.elastic.co/guide/en/elasticsearch/reference/7.5/search-suggesters.html#completion-suggester
Please notice that parts of the suggest feature are still under development.
I am putting my collection of some 13000 books in a mySQL database. Most of the copies I possess
can be identified uniquely by ISBN. I need to use this distinguishing code as a foreign key into
another database table.
However, quite a few of my books date from pre-ISBN ages. So for these, I am trying to devise a
scheme to uniquely assign a code, sort of like an SKU.
The code would be strictly for private use. It should have the important property that, when I
obtain a pre-ISBN publication, I could build the code from inspecting the work, and based on the
result search the database to see if I already have other copies in my possession.
Many years ago I think I saw a search scheme for some university(?) catalogue, where you could
perform a search of a title based on a concatenated string' (or code) that was made up of let's
say 8 letters from the title, and 4 from the author, and maybe some other data. For example,
to search 'The Nature of Space and Time' by Stephen Hawking and Roger Penrose you might perform
a search on the string 'Nature SHawk', being comprised of 8 characters from the title (omitting
non-filing words and stopwords) and 4 from the author(s).
I haven't been able to find any information on such scheme's, or whether or not such an approach
was standardized in any way.
Something along these lines could be made up of course, but I was wondering if people here have
heard of such schemes, of have ideas on how to come to a solution to this.
So keep in mind the important property of 'replicability': using the scheme, inspection of a pre-
ISBN dated work should --omitting very special or exclusive cases-- in general lead to a code
that can singly be used to subsequently determine if such a copy is already in the database.
Thank you for your time.
Just use the Title (add Author and Publisher as options) and a series id to produce a fake isbn. Take a look at fake_isbn.
NOTE: use the first digit as a series id but don't use 9!
Delphi XE6. Looking to implemented a limited style of search, specifically an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or part of multiple words. For Example, on a business "First Bank of Kansas", user should be able to enter "Fir Kan", and it should return a match. This means an inverted index type of structure. I have some type of list of each unique word, then a (document ID, primary Key ID, etc, which is an integer). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which have 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc) I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial product use? Outlook, etc have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?.
I recommend using a binary tree data structure because effort for searching is normally log(n), which is quite fast. Especially, if business names are changing at runtime, an AVLTree should do well, although it's quite some work to implement it by yourself. But there should be many ready-to-use units on binary trees all over the internet.
For each successful search for a word in your tree data structure, you should take their list of IDs and aggregate those grouped by the entered word they succeeded for.
As the last step you take all those aggregated lists of IDs and do an intersection.
There should only be IDs left which are fitting to all entered words. Those IDs are referencing the searched business names.
I'm building an application that takes inputs from SMS text thru Twilio. I'd like to build a table the matches the incoming SMS body with the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through Regular Expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin.....
With one or two teams, one could just hardcode the RegExp and response into the controller...but with 30 teams, that seems wrong. (And with 120 entries -- say all pro sports-- even worse).
Does any one have any tips on getting the team names from the input stage, thru the DB table stage with a 'RegExp' conversion in the middle?
Thanks in advance.
for a modest number of keywords, I recommend a two table approach with Keywords and Aliases, always stores in lower case. Convert input to lower case. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) in Aliases all of which have Alias.keyword_id = the id of the keyword. So you simply search Alias for the user input, and if you find a match you have the keyword_id of the keyword.
It has two advantages: fast and easy to extend... i fyou log the "no matches" you'll get a list of new aliases to add once to the dbase. MUCH easier and more reliable than trying to do via regex.
I don't think you want regexps here. What about spelling errors? For helpfulness (esp coming from a txt msg) I think you want to allow shortenings too.
Maybe a Soundex-based library or spelling correction thing would be best. You want a nearest match algorithm not a patterned match one.
If the text message is not too long, you should first chop that into words, and then take an intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger text with more accuracy, you should use some library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexp, you should note that you can create a regexp from a string, so you could easily have a table that contains column of regexp in string form. You can then dynamically create the array of regexp objects that you'd be using to search the incoming string with. The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that basically says which response to pick based on incoming text. For example, if your rule is simply "match based on the team name and say where they play", that's pretty easy. Each regexp that you are searching for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, and then look to see if the word "schedule" is in there too as well as "first game(s)", then you'd need another table that maps a collection of matches to a response.
Consider the following site:
http://maps.google.com
It has a main text input, where the user can type business, countries, provinces, cities, addresses and zip codes. I wonder which is the best way to implement a search like this. I realize that probably Google Maps uses a full text search with all kinds of data in the same table, and it has a chance of having a parser which classifies the input (i.e. between numeric, like zip codes and coordinates, and textual, like business and addresses).
With the data spread in many tables and systems, a parser is essential. The parser could be built from regular expressions, or could be built with IA tools like Artificial Neural Networks and Genetic Algorithms.
Which approach would you recommend?
It might be best to aggregate the data from all of your tables into a search index. Lucene is a free search engine, similar to how Google's search engine works (inverted index), and it should allow you to search by any of those values or any combination of them with relative ease.
http://lucene.apache.org/java/docs/
Lucene comes with its own query language (again, very similar to Google's or any other Internet search sites syntax). The only drawback of using something like Lucene is you would need to build its index. You wouldn't be querying your database directly (which could get very complicated...inverted index are pretty much designed for what your trying to do), so you need to periodically gather up new information from your database and add it to your index. It might also be necessary to rebuild your index to remove unneeded data.
With Lucene, you get a pretty flexible query syntax that most people are familiar with (because pretty much everyone searches the internet), it performs very well, and is not terribly complicated. By using Lucene, you avoid the hit of using regular expressions (which are not the most performant text searching mechanism), and you don't have to write your own parser. Should be a win-win, aside from a little learning curve to build a Lucene index generator and figure out how to query that index.
I'd have the data in one database. If the data got to big or I knew it would be huge, I'd assign an id to each business, address etc, then have other tables which reference this data.
Regular Expressions would only be necessary if the user could define what they want to search for:
business: Argos
But then what happens if they want an Argos in Manchester (Sorry, I'm English), maybe then get the location of the user based on their IP but what happens if they say:
business: Argos Scotland
Now you don't know if the company has two words, or if there is a location next to it. All of this has to be taken into consideration.
P.s Sorry if that made no sense.
You will need to pre process the query before doing a full text search on it. If you are using a GIS database, then you will already have columns like city, areacode, country etc. Convert your query into tokens seperated on space or commas, or both. Then hit individual columns to see match. This way you will know what part of the query is the city, the areacode etc.
You could also try some naive approximation approaches,example - 6 consecutive numbers will probably be an area code. Look for common words like "road" , "restaurant" , "street" etc which will be part of many queries and then use some approximation to figure out what they are looking for. Hope this helps.