Entity matching conflict in LUIS

I have two entities created in LUIS: one to identify alphanumeric words and another to identify words matching a specific pattern. Both entities are created using a regular expression.
To identify alphanumeric words I used the regular expression \w+\d+.
To identify the patterned words I used ^venid\d+ (words like venid12345, venid32310, ...).
These two entities are mapped to two different intents. But no matter how much I train LUIS, only the first entity gets recognized. How can I overcome this?

Add the regex entities from the Entities tab and train the app, then add utterances from the Intents tab for the respective intent. This should enable the model to pick up the regex entities irrespective of the mapping. Here is a screenshot of both entities getting recognized for some alphanumeric patterns.
numericalpha is the first pattern and vendor is the second pattern that I have added in the LUIS app.
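For illustration, here is a minimal Ruby sketch (not LUIS itself; the sample strings are made up) showing why the two patterns compete: every string matched by ^venid\d+ is also matched by \w+\d+, so the vendor entity can only win if both regex entities are registered and trained as described above.

samples = %w(venid12345 venid32310 abc123)
samples.each do |s|
  alphanumeric = s.match?(/\w+\d+/)      # first entity's pattern
  vendor       = s.match?(/\Avenid\d+/)  # second entity's pattern (anchored like ^venid\d+)
  puts "#{s}: alphanumeric=#{alphanumeric}, vendor=#{vendor}"
end
# venid12345: alphanumeric=true, vendor=true
# venid32310: alphanumeric=true, vendor=true
# abc123: alphanumeric=true, vendor=false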

Related

Custom names detection

This is a project in a really early phase, and I'm trying to find ideas on where to start.
Any help or pointers would be greatly appreciated!
My problem:
I have text on one side, and a list of named GraphDB elements on the other (usually the name is either an acronym or a multi-word expression). My texts are not annotated.
I want to detect whenever a name is explicitly used in the text. The trick is that it will not necessarily be a perfect string match (for example, an acronym can be used to shorten a multi-word expression, or a small part can be left out). So a simple string search will not achieve 100% recall (even though it can be used as a starting point).
If I just had a single input and wanted to match it to one of the names, I would do a simple edit distance computation and that's it. What bugs me is that I have to do this for a whole text, and I don't know how to approach or break down the problem.
I cannot break everything down into N-grams because my named entities can be anywhere from a single word up to seven words long... Or can I?
I have thousands of graph elements, so I don't think NER can be applied here... Or can it?
An example could be:
My list of names is ['Graph Database', 'Manager', 'Employee Number 1']
The text is:
Every morning, the Manager browses through the Graph Database to look for updates. Every evening, Employee 1 updates the GraphDB.
In this block of text, I want to map the four highlighted portions to their corresponding items in the list.
I have a small background in Machine Learning but I haven't really ever done NLP. To be clear, I do not care about the meaning of these words, I just want to be able to detect them.
Thanks
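A minimal Ruby sketch of the n-gram plus edit-distance idea from the question above, assuming names are at most seven words long: slide a window of one to seven words over the text and keep windows whose edit distance to some name is small relative to the name's length. The 0.3 threshold is an arbitrary illustration, and abbreviations like "GraphDB" for "Graph Database" stay out of reach of plain edit distance; those would need a separate acronym-expansion step.

def levenshtein(a, b)
  # Classic dynamic-programming edit distance.
  rows = Array.new(a.length + 1) { |i| [i] + [0] * b.length }
  (0..b.length).each { |j| rows[0][j] = j }
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      rows[i][j] = [rows[i - 1][j] + 1,
                    rows[i][j - 1] + 1,
                    rows[i - 1][j - 1] + cost].min
    end
  end
  rows[a.length][b.length]
end

names = ['Graph Database', 'Manager', 'Employee Number 1']
text  = 'Every morning, the Manager browses through the Graph Database ' \
        'to look for updates. Every evening, Employee 1 updates the GraphDB.'

tokens  = text.scan(/\w+/)
matches = []
(1..7).each do |n|                 # window sizes up to the longest name
  tokens.each_cons(n) do |window|
    candidate = window.join(' ')
    names.each do |name|
      d = levenshtein(candidate.downcase, name.downcase)
      # Keep windows whose distance is small relative to the name's length.
      matches << [candidate, name, d] if d.to_f / name.length <= 0.3
    end
  end
end
matches.each { |c, name, d| puts "#{c} -> #{name} (distance #{d})" }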

database design for dictionary of words

(My reason for asking this question is based on having read this answer, which made me rethink my current setup.)
I am currently developing a Ruby on Rails application in which there are many languages, each of which has a dictionary of base words attached to it, as well as a list of the words that map to each base word. The way I currently have it set up, there is a base_words table that contains the base_word as a string, along with the language_id as a foreign key. There is also a words table, each row of which contains a word string, along with the base_word_id as a foreign key. The words table also has an indexed language_id column, although I'm almost positive this is superfluous given the language_id on base_words, so I'm planning to remove it (although this could be a bad assumption on my part).
In sum, contrary to the answer I mentioned at the beginning, the tables are not separated by language, because I've reasoned that I can simply pull out a language's words programmatically when the time comes. However, my application will also have translations associated with each base word (as did the referenced answer), and so I'm doubting my structure: each translation will actually be a base word in the same table, which means a translation would really just be the id of another row in that table. This may be completely fine, or it may not be; I have no clue (this is my first ever programming project).
Is this ok? Do I need to separate my base_words into separate tables for each language, or can I leave it all in one table?
Another example: I also need to store many phrases for each language, along with their translations. Should I have one table where each row has the appropriate translation of the phrase, or one table where each row contains simply one phrase and a language_id, or multiple tables (one for each language)?
Regards,
Michael
As in the other scenario, you'll have a translations table. There is no technical reason it couldn't have multiple foreign keys to base_words (a source_word_id and target_word_id, perhaps). So yes, you can absolutely store all your words in one table. There are some minor side effects involved with translations being directional relationships: it becomes possible to have translations which only work one way, and there will be many pairs of entries with opposite source and target. Neither of these is much of a worry: the first is even potentially desirable in order to represent words with double meanings in one language but not the other, and as for the second, space is cheap and indexing is easy.
You are correct that you do not need words.language_id, so long as you always join base_words when you're querying words and the language matters. This obviously changes if you have a use case where it makes sense to leave base_words out, but that scenario sounds unlikely based on what you describe.
As for phrases: why should they be handled any differently than base_words?
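A hedged ActiveRecord sketch of that single-table layout; the model and column names (source_word_id, target_word_id) are illustrative, not prescribed:

class Language < ActiveRecord::Base
  has_many :base_words
end

class BaseWord < ActiveRecord::Base
  belongs_to :language
  has_many :words
  # Translations where this word is the source of the directional pair.
  has_many :translations, foreign_key: :source_word_id
end

class Word < ActiveRecord::Base
  belongs_to :base_word
  # Language is reachable through base_word, so no words.language_id needed.
  has_one :language, through: :base_word
end

class Translation < ActiveRecord::Base
  belongs_to :source_word, class_name: 'BaseWord'
  belongs_to :target_word, class_name: 'BaseWord'
end

A unique index on (source_word_id, target_word_id) would keep duplicate directional pairs out.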

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding on the best way to implement the graph so that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper.
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
Keeping sentence and positional information as an array inside a property of each node doesn't seem efficient, especially when dealing with multiple word domains and subdomains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the best way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains/positions on the relationships (and perhaps also add a source id).
On the other hand, you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
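For example, a sketch in Cypher of the relationship-property variant, matching the MERGE style above; the property names (sid, pid, source) and the aggregation query are illustrative:

MERGE (a:Word:TwitterWord {orth: "#stackoverflow"})
MERGE (b:Word:TwitterWord {orth: "is"})
// Sentence id, position, and source live on the relationship,
// not in an ever-growing array property on the word nodes:
CREATE (a)-[:NEXT {sid: 1, pid: 1, source: "tweet-42"}]->(b)

// Later, aggregate the parallel NEXT relationships dynamically:
MATCH (a:Word:TwitterWord)-[r:NEXT]->(b:Word:TwitterWord)
RETURN a.orth, b.orth, count(r) AS strength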

enlarging a text corpus with classes

I have a text corpus of many sentences, with some named entities marked within it.
For example, the sentence:
what is the best restaurant in wichita texas?
which is tagged as:
what is the best restaurant in <location>?
I want to expand this corpus by taking or sampling the sentences already in it and replacing the named entities with other, similar entities of the same type, e.g. replacing "wichita texas" with "new york", so the corpus will be bigger (more sentences) and more complete (more entities covered). I have lists of similar entities, including ones which don't appear in the corpus but which I would like to have some probability of inserting in my replacements.
Can you recommend a method or direct me to a paper regarding this?
For your specific question:
This type of work, assuming you have an organized list of named entities (like a separate list for 'places', 'people', etc.), generally starts with manually removing potentially ambiguous names (for example, 'jersey' could be removed from your places list to avoid instances where it refers to the garment). Once you're confident you have removed the most ambiguous names, select an appropriate tag for each group of terms ("location" or "person", for instance) and, in each sentence containing one of these words, replace the word with the tag. Then you can perform some basic expansion with the programming language of your choice, so that each sentence containing 'location' is repeated with every location name, each sentence containing 'person' is repeated with every person name, etc., as in the sketch below.
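A minimal Ruby sketch of that expansion step; the tag names and entity lists are made-up placeholders:

entities = {
  '<location>' => ['wichita texas', 'new york', 'boston'],
  '<person>'   => ['alice', 'bob']
}
sentences = ['what is the best restaurant in <location>?']

expanded = sentences.flat_map do |sentence|
  tags = entities.keys.select { |tag| sentence.include?(tag) }
  next [sentence] if tags.empty?
  # One substitution pass per tag; weighted sampling could be added here
  # to favor entities that never appear in the original corpus.
  tags.reduce([sentence]) do |acc, tag|
    acc.flat_map { |s| entities[tag].map { |e| s.gsub(tag, e) } }
  end
end
puts expanded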
For a general overview of clustering using word classes, check out the seminal Brown et al. paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.9919&rep=rep1&type=pdf

RegExp as table entries

I'm building an application that takes input from SMS texts via Twilio. I'd like to build a table that matches the incoming SMS body with the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through regular expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin...
With one or two teams, one could just hardcode the regexp and response into the controller... but with 30 teams, that seems wrong. (And with 120 entries -- say, all pro sports -- even worse.)
Does anyone have any tips on getting the team names from the input stage through the DB table stage, with a 'RegExp' conversion in the middle?
Thanks in advance.
For a modest number of keywords, I recommend a two-table approach with Keywords and Aliases, always stored in lower case. Convert the input to lower case. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) to Aliases, all of which have Alias.keyword_id = the id of the keyword. Then you simply search Aliases for the user input, and if you find a match you have the keyword_id of the keyword.
This has two advantages: it's fast and easy to extend... if you log the "no matches" you'll get a list of new aliases to add to the database once. MUCH easier and more reliable than trying to do this via regexes.
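A hedged ActiveRecord sketch of that lookup; the model names mirror the tables described above, the rest is illustrative:

class Keyword < ActiveRecord::Base
  has_many :aliases
end

class Alias < ActiveRecord::Base
  belongs_to :keyword   # Alias.keyword_id as described above
end

def keyword_for(sms_body)
  # Everything is stored lower-case, so normalize the input the same way.
  match = Alias.find_by(name: sms_body.strip.downcase)
  match && match.keyword
end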
I don't think you want regexps here. What about spelling errors? For helpfulness (especially coming from a text message) I think you want to allow shortenings too.
Maybe a Soundex-based library or a spelling-correction tool would be best. You want a nearest-match algorithm, not a pattern-match one.
If the text message is not too long, you could first chop it into words, and then take an intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger texts with more accuracy, you should use a library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexps, note that you can create a regexp from a string, so you could easily have a table with a column of regexps in string form. You can then dynamically create the array of Regexp objects that you'd use to search the incoming string. The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that says which response to pick based on the incoming text. For example, if your rule is simply "match on the team name and say where they play", that's easy: each regexp you search for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, then look to see whether the word "schedule" is there too, as well as "first game(s)"), you'd need another table that maps a collection of matches to a response.
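A minimal Ruby sketch of the regexp-from-string approach; the table and column names (responses, pattern, reply) are made up for illustration:

# Hypothetical table: responses(pattern TEXT, reply TEXT),
# e.g. pattern = 'redskins?', reply = 'The Redskins play at FedEx Field.'
class Response < ActiveRecord::Base; end

def reply_for(sms_body)
  Response.find_each do |row|
    # Compile the stored string into a case-insensitive regexp.
    return row.reply if Regexp.new(row.pattern, Regexp::IGNORECASE).match?(sms_body)
  end
  "Sorry, I don't recognize that team."
end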
