I'm building an application that takes inputs from SMS text thru Twilio. I'd like to build a table the matches the incoming SMS body with the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through Regular Expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin.....
With one or two teams, one could just hardcode the RegExp and response into the controller...but with 30 teams, that seems wrong. (And with 120 entries -- say all pro sports-- even worse).
Does any one have any tips on getting the team names from the input stage, thru the DB table stage with a 'RegExp' conversion in the middle?
Thanks in advance.
for a modest number of keywords, I recommend a two table approach with Keywords and Aliases, always stores in lower case. Convert input to lower case. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) in Aliases all of which have Alias.keyword_id = the id of the keyword. So you simply search Alias for the user input, and if you find a match you have the keyword_id of the keyword.
It has two advantages: fast and easy to extend... i fyou log the "no matches" you'll get a list of new aliases to add once to the dbase. MUCH easier and more reliable than trying to do via regex.
I don't think you want regexps here. What about spelling errors? For helpfulness (esp coming from a txt msg) I think you want to allow shortenings too.
Maybe a Soundex-based library or spelling correction thing would be best. You want a nearest match algorithm not a patterned match one.
If the text message is not too long, you should first chop that into words, and then take an intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger text with more accuracy, you should use some library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexp, you should note that you can create a regexp from a string, so you could easily have a table that contains column of regexp in string form. You can then dynamically create the array of regexp objects that you'd be using to search the incoming string with. The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that basically says which response to pick based on incoming text. For example, if your rule is simply "match based on the team name and say where they play", that's pretty easy. Each regexp that you are searching for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, and then look to see if the word "schedule" is in there too as well as "first game(s)", then you'd need another table that maps a collection of matches to a response.
Related
This is a project in really early phase and I'm trying to find ideas on where to start.
Any help or pointers would be greatly appreciated!
My problem:
I have text on one side, and a list of named GraphDB elements on the other (usually the name is either an acronym or a multi-word expression). My texts are not annotated.
I want to detect whenever a name is explicitly used in the text. The trick is that it will not necessarily be a perfect string match (for example an acronym can be used to shorten a multi-word expression, or a small part can be left out). So a simple string search will not have a 100% recall (even though it can be used as a starter).
If I just had an input and I wanted it to match it to one of the names, I would do a simple edit distance computation and that's it. What bugs me is that I have to do this for a whole text, and I don't know how to approach/break down the problem.
I cannot break down everything in N-grams because my named entities can be a single word or up to seven words long... Or can I?
I have thousands of Graph elements so I don't think NER can be applied here... Or can it?
An example could be:
My list of names is ['Graph Database', 'Manager', 'Employee Number 1']
The text is:
Every morning, the Manager browse through the Graph Database to look for updates. Every evening, Employee 1 updates the GraphDB.
I want in this block of text to map the 4 highlighted portions to their corresponding item in the list.
I have a small background in Machine Learning but I haven't really ever done NLP. To be clear, I do not care about the meaning of these words, I just want to be able to detect them.
Thanks
Dear programmers and IT experts, I need your help. I've just started to research what Sphinx is. I even made my own "google suggest", that fix frequent and common human search input mistakes. The problem is, that it tries to fix errors all the time and interrupt the real input.
Whell, I want the search engine try to find consilience in searched field by substring first, than, if consiliences are not found, than use my logic for fixing errors. If to say shortly, I want sphinx, first of all, execute this SQL equivalent command
SELECT * FROM suggest WHERE keyword LIKE('%$keyword%')
than, if nothing found, continue mistakes fixing.
The main questioin is....is it possible to tell spinx to search by substring?
Sphinx can mostly do that, but need to understand how it works. Sphinx indexes individual words, and matches by keywords. It uses a large inverted index to make queries fast (rather than running a substring match)
So can do MATCH('one two') as a query, and it will match a document that contains '... one two ...', but the order doesn't matter, and other words can be present, so will ALSO match '... two three one ...' which wouldn't happen with mysql LIKE (its a pure substring match)
Can use phrase operator to do that MATCH('"one two"')
Furthermore Sphinx matches whole words by default. So MATCH('one two') will only match those two works. It wont match against a document say "... one twotwentyone ...' whereas LIKE doesn't restrict to whole words.
So can use wildcards to allow part matches. MATCH('"*one two*"') --- also need to enable it on the index with min_infix_len config!
And even more, sphinx doesn't index punctuation etc (with default charset_table), so a document say '... One! (two?) ... ' WOULD still match MATCH('"one two"'). SQL like would NOT ignore that.
You could change sphinx to index more punctuation (via charset_table) to get a closer to substring match.
So SELECT * FROM index WHERE MATCH('"*$keyword*"') is possibly the closest sphinx query to your original (ie a substring match). Just as long as you aware of the differences. Also there is MySQL Collations to consider, they not exactly same as charset_table.
(frankly while this is correct. I wonder, if a bit OTT. If you just have a textual corpus you want to search, you could index it as normal. Then run queries though CALL KEYWORDS(), to get idea if the query is a valid word in the index (ie just tells you how many times given words appear in index). Can then run your algorithm to fix the mistakes)
As a side note sphinx does have a built in suggest system
http://sphinxsearch.com/blog/2016/10/03/2-3-2-feature-built-in-suggests/
I am putting my collection of some 13000 books in a mySQL database. Most of the copies I possess
can be identified uniquely by ISBN. I need to use this distinguishing code as a foreign key into
another database table.
However, quite a few of my books date from pre-ISBN ages. So for these, I am trying to devise a
scheme to uniquely assign a code, sort of like an SKU.
The code would be strictly for private use. It should have the important property that, when I
obtain a pre-ISBN publication, I could build the code from inspecting the work, and based on the
result search the database to see if I already have other copies in my possession.
Many years ago I think I saw a search scheme for some university(?) catalogue, where you could
perform a search of a title based on a concatenated string' (or code) that was made up of let's
say 8 letters from the title, and 4 from the author, and maybe some other data. For example,
to search 'The Nature of Space and Time' by Stephen Hawking and Roger Penrose you might perform
a search on the string 'Nature SHawk', being comprised of 8 characters from the title (omitting
non-filing words and stopwords) and 4 from the author(s).
I haven't been able to find any information on such scheme's, or whether or not such an approach
was standardized in any way.
Something along these lines could be made up of course, but I was wondering if people here have
heard of such schemes, of have ideas on how to come to a solution to this.
So keep in mind the important property of 'replicability': using the scheme, inspection of a pre-
ISBN dated work should --omitting very special or exclusive cases-- in general lead to a code
that can singly be used to subsequently determine if such a copy is already in the database.
Thank you for your time.
Just use the Title (add Author and Publisher as options) and a series id to produce a fake isbn. Take a look at fake_isbn.
NOTE: use the first digit as a series id but don't use 9!
Delphi XE6. Looking to implemented a limited style of search, specifically an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or part of multiple words. For Example, on a business "First Bank of Kansas", user should be able to enter "Fir Kan", and it should return a match. This means an inverted index type of structure. I have some type of list of each unique word, then a (document ID, primary Key ID, etc, which is an integer). I am struggling with WHAT type of structure to make this... I have approximately 250,000 business names, which have 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc) I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial product use? Outlook, etc have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?.
I recommend using a binary tree data structure because effort for searching is normally log(n), which is quite fast. Especially, if business names are changing at runtime, an AVLTree should do well, although it's quite some work to implement it by yourself. But there should be many ready-to-use units on binary trees all over the internet.
For each successful search for a word in your tree data structure, you should take their list of IDs and aggregate those grouped by the entered word they succeeded for.
As the last step you take all those aggregated lists of IDs and do an intersection.
There should only be IDs left which are fitting to all entered words. Those IDs are referencing the searched business names.
I was thinking about text driven search by user input.
often you are searching in a database of addresses, where you can find customers and so on.
has anybody any idea how to find out which of the typed words is the name, which is the street name, which is the company name?
and secondly if the name is a double name like "Lee Harvey", how can I find out that the two words Lee and Harvey belong together?
Same problem with company names like "frank the baker inc."...
Is there any algorithm or best practice strategy?
thanks for links, tutorials, scripts and all other help ;-)
What you basically want is a search engine :) Here are the basic steps you need to follow -
You need to create an 'Inverted Index' of the content you want to be searched on.
The index is 'name'=>'value' pair. You can have this pair in whichever way you want (tuned according to your data & needs.
Eg. for your problem of double names, you could split all your names into single words & index it like so -
'lee'=>'lee harvey'
'harvey'=>'lee harvey'
...
this way when anyone searches for 'lee' they get 'lee harvey'. There are other better approaches to this called "n-gram" indexing. Check it out...
You could possibly build indexes of names, addresses, emails etc & when the user types a query check it against all your indexes with the approach suggested above. After you get the results then merge them. Maybe you could introduce the notion of rank so that you can sort your results & show the most latest or most relevant ones at the top. For this you need to figure out a way to score your terms...
Don't care, just perform full-text search. Then you should check the result items for which field contains the search terms. Also, you may display items in separate lists (terms found int name, term found in address). The only difficulty is if John Smith is living in the John Smiht street, you must decide, which list/lists the result item belongs to.