Word sense disambiguation in classification - machine-learning

I am classifying input sentences into different categories, like time, distance, speed, location, etc.
I trained a classifier using MultinomialNB.
The classifier mainly uses term frequency (tf) as its feature; I also tried taking sentence structure into account (using 1-4 grams).
Using MultinomialNB with alpha = 0.001, these are the results for a few queries:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} #for above two query also result should be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is a meter
{"1": {"period": "90.74%"}, "2": {"dist": "9.26%"}} #better result should be distance
Using MultinomialNB with n-grams (1-4) taken into account:
what is the value of Watch
{"1": {"other": "33.27%"}, "2": {"identity": "25.40%"}, "3": {"desc": "16.20%"}, "4": {"country": "9.32%"}}
what is the price of Watch
{"1": {"other": "25.37%"}, "2": {"money": "23.79%"}, "3": {"identity": "19.37%"}, "4": {"desc": "12.35%"}, "5": {"country": "7.11%"}}
what is the cost of Watch
{"1": {"money": "48.34%"}, "2": {"other": "17.20%"}, "3": {"identity": "13.13%"}, "4": {"desc": "8.37%"}} # for above two query also result should be money
How early can I go to mumbai
{"1": {"manner": "97.77%"}} #result should be time
How fast can I go to mumbai
{"1": {"speed": "97.41%"}}
How come can I go to mumbai
{"1": {"manner": "100.00%"}}
How long is an hour
{"1": {"dist": "99.61%"}} #result should be time
So the result depends purely on word occurrence. Is there any way to add word-sense disambiguation (or any other means by which some kind of understanding could be brought in) here?
I already checked Word sense disambiguation in NLTK Python,
but the issue here is identifying the main word in the sentence, which differs in every sentence.
I already tried POS tagging (it gives tags like NN and JJ, which the sentence meaning does not hinge on) and NER (it is highly dependent on capitalization, and it often fails to disambiguate words like "early" or "cost" in the sentences above); neither of them helps.
**"How long" is sometimes considered time and sometimes distance. So, based on the nearby words in the sentence, the classifier should be able to understand which one is meant. Similarly for "how fast", "how come", "how early": any [how + word] pattern should be understandable.**
I am using NLTK, scikit-learn, and Python.
Update:
40 classes (each with sentences belonging to that class)
Total data: 300 KB
Accuracy depends on the query: sometimes very good (>90%), sometimes an irrelevant class comes out on top, depending on how well the query matches the dataset.
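For reference, a minimal sketch of the setup as described above, assuming scikit-learn's CountVectorizer for the tf / n-gram features (the two training examples are placeholders for the real 40-class data):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# placeholder stand-ins for the real dataset
train_sentences = ["what is the cost of Watch", "how fast can I go to mumbai"]
train_labels = ["money", "speed"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 4)),  # term frequencies over 1-4 grams
    MultinomialNB(alpha=0.001),
)
model.fit(train_sentences, train_labels)
print(model.predict_proba(["what is the price of Watch"]))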

Attempting to deduce semantics purely by looking at individual words out of context is not going to take you very far. In your "watch" examples, the only term which actually indicates that you have "money" semantics is the one you hope to disambiguate. What other information is there in the sentence to help you reach that conclusion, as a human reader? How would you model that knowledge? (A traditional answer would reason about your perception of watches as valuable objects, or something like that.)
Having said that, you might want to look at WordNet synsets as a possibly useful abstraction. At least then you could say that "cost", "price", and "value" are related somehow, but I suppose the word-level statistics you have already calculated show that they are not fully synonymous, and the variation you see basically accounts for that fact (though your input size sounds rather small to adequately cover the variance in usage patterns of individual word forms).
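A quick way to explore this in NLTK (the WordNet data must be downloaded first; path_similarity is just one of several relatedness measures available):
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

for word in ("value", "price", "cost"):
    print(word, [s.name() for s in wn.synsets(word, pos=wn.NOUN)][:3])

# a rough relatedness score between the first noun senses of two words
value, price = wn.synsets("value", pos=wn.NOUN)[0], wn.synsets("price", pos=wn.NOUN)[0]
print(value.path_similarity(price))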
Another hint could be provided by part of speech annotation. If you know that "value" is used as a noun, that (to my mind, at least) narrows the meaning to "money talk", whereas the verb reading is much less specifically money-oriented ("we value your input", etc). In your other examples, it is harder to see whether it would help at all. Perhaps you could perform a quick experiment with POS-annotated input and see whether it makes a useful difference. (But then POS is not always possible to deduce correctly, for much the same reasons you are having problems now.)
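The experiment is cheap to run with NLTK's default tagger (tokenizer and tagger data must be downloaded first):
import nltk  # requires the punkt and averaged_perceptron_tagger data

for s in ("what is the value of the watch", "we value your input"):
    print(nltk.pos_tag(nltk.word_tokenize(s)))
# 'value' should come out as a noun (NN) in the first sentence
# and as a verb (VBP) in the second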
The sentences you show as examples are all rather simple. It would not be very hard to write a restricted parser for a small subset of English where you could actually start to try to make some sense of the input grammatically, if you know that your input will generally be constrained to simple questions with no modal auxiliaries etc.
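As an illustration of how small such a parser could start out, here is a toy NLTK context-free grammar covering just the question shapes quoted above (the rule set is invented for this example and would need to grow with your data):
import nltk

grammar = nltk.CFG.fromstring("""
  Q   -> 'how' ADJ 'is' NP | 'what' 'is' 'the' N 'of' NP
  NP  -> DET N | N
  DET -> 'a' | 'an' | 'the'
  ADJ -> 'long' | 'fast' | 'early'
  N   -> 'meter' | 'hour' | 'watch' | 'value' | 'price' | 'cost'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("how long is a meter".split()):
    print(tree)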
(Incidentally, I'm not sure "how come can I go to Mumbai" is "manner", if it is grammatical at all. Strictly speaking, you should have subordinate clause word order here. I would understand it to mean roughly "Why is it that I can go to Mumbai?")

Your result "depends purely on word occurrence" because that is the kind of features your code produces. If you feel that this approach is not sufficient for your problem, you need to decide what other information you need to extract. Express it as features, i.e. as key-value pairs, add them to your dictionary, and pass them to the classifier exactly as you do now. To avoid overtraining you should probably limit the number of n-grams you include in the dictionary; e.g., keep only the frequent ones, or the ones containing certain keywords you consider relevant, or whatever.
I'm not quite sure what classification you mean by "distance, speed, location, etc.", but you've mentioned most of the tools I'd think to use for something like this. If they didn't work to your satisfaction, think about more specific ways to detect properties that might be relevant; then express them as features so they can contribute to classification along with the "bag of words" features you have already, as in the sketch below. (But note that many experts in the field get acceptable results using just the bag-of-words approach.)
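A sketch of what that could look like with NLTK's dict-based features (the feature names, cue lists, and tiny training set here are invented for illustration):
import nltk

UNITS = {"meter", "km", "hour", "minute", "rupee"}

def extract_features(sentence):
    words = sentence.lower().split()
    feats = {f"has({w})": True for w in words}  # plain bag of words
    if words[0] == "how" and len(words) > 1:
        feats["how+"] = words[1]                # captures "how long", "how fast", ...
    feats["mentions_unit"] = bool(UNITS & set(words))
    return feats

labelled = [("how long is a meter", "dist"), ("how long is an hour", "period")]  # placeholder
classifier = nltk.NaiveBayesClassifier.train(
    [(extract_features(s), label) for s, label in labelled])
print(classifier.classify(extract_features("how long is a km")))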

Based on my understanding of the nature of your problem so far, I would suggest using an unsupervised classification method, meaning that you use a set of rules for classification. By rules I mean if ... then ... else conditions. This is how some expert systems work. But to add understanding of similar concepts and synonyms, I suggest you create an ontology. Ontologies are a sub-concept of the Semantic Web; problems such as yours are usually addressed with Semantic Web techniques, be it RDF schemas or ontologies. You can learn more about the Semantic Web here and about ontologies here. My suggestion is not to go too deep into these fields, but just to learn the general high-level idea, and then write your own ontology in a text file (avoid using any of the ontology-building tools, because they take too much effort, and your problem is simple enough not to need it).
Now, when you search the web you will find some existing ontologies, but in your case it's better to write a small ontology of your own, use it to build the set of rules, and you are good to go.
One note about your solution (using NB) on this kind of data: you can easily run into an overfitting problem, which would result in low accuracy for some queries and high accuracy for others. I think it's better to avoid supervised learning for this problem. Let me know if you have further questions.
Edit 1: In this edit I would like to elaborate on the above answer.
Let's say you want to build an unsupervised classifier. The data you currently have can be split into about 40 different classes. Because the sentences in your dataset are already fairly restricted and simple, you can simply classify them with a set of rules. Let me show you what I mean. Say a random sentence from your dataset is kept in the variable sentence:
if "long" in sentence:
    if "meter" in sentence:
        print("it is distance")
    # elif ... (more checks in the same spirit)
    else:
        print("it is period")
if "fast" in sentence:
    print("it is speed or time")
if "early" in sentence:
    print("it is time")
So you get the idea of what I mean. If you build a simple classifier this way, and make it as precise as possible, you can easily reach overall accuracies close to 100%. Now, if you want to automate some more complicated decision-making, you need a form of knowledge base, which I'd refer to as an ontology. Say that in a text file you had something like the following (I am writing it in plain English just to keep it easy to understand; you can write it in a concise coded manner; it is just a general example to show what I mean):
"Value" depends 60% on "cost (measured with money)", 20% on "durability (measured in time)", 20% on "ease of use (measured in quality)"
Then, if you want to measure value, you already have a formula for it. You should decide whether you need such a formula based on your data. Alternatively, if you wanted to keep a list of synonyms, you could store them in a text file and substitute them for one another.
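For example, a hypothetical in-code form of the plain-English ontology line above could be a weighted mapping (the names and weights are just the ones from the example):
ontology = {
    "value": {"cost": 0.60, "durability": 0.20, "ease_of_use": 0.20},
}

def score(concept, measurements):
    # weighted combination of the sub-concept measurements
    return sum(weight * measurements[part]
               for part, weight in ontology[concept].items())

print(score("value", {"cost": 0.9, "durability": 0.5, "ease_of_use": 0.7}))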
The overall implementation of the classifier for 40 classes in the way I described takes a few days, and since the method is quite deterministic, you are bound to achieve a very high accuracy of up to 100%.

Related

Explain to a noob how to approach nested named entity recognition / tokens within spans?

I have a noob question, go easy on me — I'll probably get the terminology wrong. I'm hoping someone can give me the "here's what to google next" explanation for how to approach creating a CoreML model that can identify tokens within spans. Since my question falls between the hello world examples and the intellectual papers that cover the topics in detail, it has been hard to google for.
I'm taking my first stab at doing some natural language processing, specifically parsing data out of recipe ingredients. CreateML supports word tagging, which I interpret to mean Named Entity Recognition — split a string into tokens (probably words), annotate them, feed them to the model.
"1 tablespoon (0.5 oz / 14 g) baking soda"
This scenario immediately breaks my understanding of word tagging. Tokenize this by words and it includes three measurements. However, it is really one measurement, with a clarification that contains two alternate measurements. What I really want to do is label "(0.5 oz / 14 g)" as a clarification which contains measurements.
Or how about "Olive oil"? If I were tokenizing by words, I'd probably get two tokens labeled "ingredient", which I'd interpret to mean I have two ingredients, but they go together as one.
I've been looking at https://prodi.gy/ which does span categorization, and seemingly handles this scenario — tokenize, then name the entities, then categorize them into spans. However, as far as I understand it, spans are an entirely different paradigm which wouldn't convert over to CoreML.
My naive guess for how you'd do this in CoreML is that I use multiple models, or something that works recursively — one pass would tokenize "(0.5 oz / 14 g)" as a single token labeled as "clarification" and then the next pass would tokenize it into words. However, this smells like a bad idea.
So, how does one solve this problem with CoreML? Code is fine, if relevant, but I'm really just asking about how to think about the problem so I can continue my research.
Thanks for your help!

How to find the characteristics of a bunch of word clusters?

My Motivations
I'm trying to learn German and realized there's a confounding fact about the structure of German: every noun has a gender, which in many cases seems unrelated to the noun itself.
Unlike in languages such as English, each noun has a different definite article depending on its gender: der (masculine), die (feminine), and das (neuter). For example:
das Mädchen ("the girl"), der Rock ("the skirt"), die Hose ("the trousers/pants"). So there seems to be no correlation between the gender assignment of nouns and their meanings.
The Data
I gathered up to 5000 German words, with 3 columns (das, der, die) of 1's and 0's for each word. So my data is already clustered with one-hot encoding, and I'm not trying to predict anything.
Why I'm here
I am clueless about where to start and how to approach this problem, as the concept of distance in clustering doesn't make sense to me in this setting. I can't think of a way to generate an understandable description of these clusters, and the mixed data makes it impossible for me to think of hard-coded metrics for evaluation.
So, my question is:
I want to find some patterns, some characteristics of these words, that made them fall into a specific cluster. I don't know if I'm making any sense, but some people have already managed to find such patterns (for example, certain word endings; elongated objects tend to be masculine; etc.), and I believe ML/AI could do a much better job at this. Would it be possible for me to do something like this?
Some personal thoughts
While doing some (perhaps naive) research, I realized the potential options are decision trees and the COBWEB algorithm. I was also thinking I could scrape a few images (say 5) for every word, run some image classification, and inspect the intermediate layers of the NN to see whether any specific shapes support a specific gender. In addition, I was wondering whether scraping the data of the Google n-gram viewer for these words could help in any way. I couldn't think of a way to use NLP or its subdomains.
Alternatives
If everything I just wrote sounds nonsensical, please suggest a way to make visual representations of my dataframe (more like nodes and paths, with images at the nodes, one for each cluster) in Python, so that I could make pictorial mind maps and try to learn them by heart.
The ultimate purpose is to make learning German simpler for myself and possibly for others.

I have a dataset on which I want to do phrase extraction using NLP, but I am unable to do so

How can I extract a phrase from a sentence, using a dataset which has a set of sentences and corresponding labels in the form:
Sentence 1: I want to play cricket
Label 1: play cricket
Sentence 2: Need to wash my clothes
Label 2: wash clothes
I have tried chunking with NLTK, but I am not able to use the training data along with the chunks.
The "reminder paraphrases" you describe don't map exactly to other kinds of "phrases" with explicit software support.
For example, the gensim Phrases module uses a purely statistical approach to discover neighboring word pairings that are so common, relative to the base rates of each word individually, that they might usefully be considered a combined unit. It might turn certain entities into phrases (e.g. "New York" -> "New_York") or repeated idioms (e.g. "slacking off" -> "slacking_off"). But it only finds neighboring runs of words, not the sort of contextual paraphrase you're seeking.
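For instance, a minimal use of gensim's Phrases (the toy corpus and thresholds below are contrived; real use needs many sentences to find reliable pairings):
from gensim.models import Phrases
from gensim.models.phrases import Phraser

sentences = [["new", "york", "is", "big"],
             ["i", "love", "new", "york"],
             ["new", "york", "never", "sleeps"]]
bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
print(bigram[["i", "love", "new", "york"]])  # may yield ['i', 'love', 'new_york']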
Similarly, libraries which are suitably grammar-aware to mark-up logical parts-of-speech (and inter-dependencies) also tend to simply group and label existing phrases in the text – not create simplified, imperative summaries like you desire.
Still, such libraries' output might help you work up your own rules of thumb. For example, it appears that in your examples so far, the desired "reminder paraphrase" is always one verb and one noun (that verb's object). So, after part-of-speech tagging (as from NLTK or spaCy), choosing the last verb (perhaps preferring verbs in present/imperative tense) and the following noun phrase (perhaps stripped of other modifiers/prepositions) may do most of what you need.
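A rough sketch of that heuristic with NLTK (tokenizer and tagger data must be downloaded first; it keeps only the first noun after the last verb, so it will miss harder cases):
import nltk  # requires the punkt and averaged_perceptron_tagger data

def paraphrase(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    verb_idxs = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")]
    if not verb_idxs:
        return None
    last_verb = verb_idxs[-1]
    nouns = [w for w, tag in tagged[last_verb + 1:] if tag.startswith("NN")]
    return " ".join([tagged[last_verb][0]] + nouns[:1])

print(paraphrase("I want to play cricket"))   # should print "play cricket"
print(paraphrase("Need to wash my clothes"))  # should print "wash clothes"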
Of course, more complicated examples would need better heuristics. And if the full range of texts you need to work on is very varied, finding a general approach might require many more (hundreds/thousands) of positive training examples: what you think the best paraphrase is, given certain texts. Then, you could consider a number of machine-learning methods that might be able to pick the right ~2 words from larger texts.
Researching published work for "paraphrasing", rather than just "phrase extraction", might also guide you to ideas, but I unfortunately don't know any ready-to-use paraphrasing libraries.

How to use word embeddings/word2vec ... differently? With an actual, physical dictionary

If my title is incorrect/could be better, please let me know.
I've been trying to find an existing paper/article describing the problem that I'm having: I'm trying to create vectors for words so that they are equal to the sum of their parts.
For example: Cardinal (the bird) would be equal to the vectors of red, bird, and ONLY that.
In order to train such a model, the input might be something like a dictionary, where each word is defined by its attributes.
Something like:
Cardinal: bird, red, ....
Bluebird: blue, bird,....
Bird: warm-blooded, wings, beak, two eyes, claws....
Wings: Bone, feather....
So in this instance, each word-vector is equal to the sum of the word-vector of its parts, and so on.
I understand that in the original word2vec, semantic distance was preserved, such that Vec(Madrid) - Vec(Spain) + Vec(France) is approximately Vec(Paris).
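For concreteness, this property can be checked with gensim against any pretrained vectors (the file name below is a placeholder for vectors you would have downloaded):
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)
print(wv.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=3))
# 'Paris' should appear near the top of the list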
Thanks!
PS: Also, if it's possible, new words should be able to be added later on.
If you're going to be building a dictionary of the components you want, you don't really need word2vec at all. You've already defined the dimensions you want specified: just use them, e.g. in Python:
kb = {"wings": {"bone", "feather"},
"bird": {"wings", "warm-blooded", ...}, ...}
Since the values are sets, you can take set intersections:
kb["bird"] & kb["reptile"]
You'll need to find some ways to decompose the elements recursively for comparisons, simplifications, etc. These are decisions you'll have to make based on what you expect to happen during such operations.
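A minimal sketch of one such recursive decomposition, expanding a word down to attributes that have no further definition in the dictionary (the stopping rule is an assumption; the seen set guards against cycles):
def primitives(kb, word, seen=None):
    seen = set() if seen is None else seen
    parts = set()
    for attr in kb.get(word, set()):
        if attr in seen:
            continue
        seen.add(attr)
        deeper = primitives(kb, attr, seen)
        parts |= deeper if deeper else {attr}  # keep attr itself if it has no definition
    return parts

kb = {"wings": {"bone", "feather"}, "bird": {"wings", "warm-blooded"}}
print(primitives(kb, "bird"))  # bone, feather, warm-blooded (in some order)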
This sort of manual dictionary development is quite an old-fashioned approach. Folks like Schank and Abelson used to do stuff like this in the 1970s. The problem is that as these dictionaries get more complex, they become intractable to maintain and more inaccurate in their approximations. You're welcome to try it as an exercise (it can be kind of fun!) but keep your expectations low.
You'll also find aspects of meaning lost in these sorts of decompositions. One of word2vec's remarkable properties is its sensitivity to the gestalt of words: words may have meaning that is composed of parts, but there's a piece of that composition that makes the whole greater than the sum of the parts. In a decomposition, the gestalt is lost.
Rather than trying to build a dictionary, you might be best off exploring what W2V gives you anyway, from a large corpus, and seeing how you can leverage that information to your advantage. The linguistics of what exactly W2V renders from text aren't wholly understood, but in trying to do something specific with the embeddings, you might learn something new about language.

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text i.e. given a sample text, identify the language (for e.g. English, Swedish, German, etc.) it is written in.
Now, the strategy I have decided to follow (based on a few references I have gathered) is as follows:
a) Create a character n-gram model (the value of n is decided based on certain heuristics and computations).
b) Use a machine-learning classifier (such as naive Bayes) to predict the language of the given text.
Now, the doubt I have is: is creating a character n-gram model necessary? That is, what disadvantage does a simple bag-of-words strategy have? In other words, if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail?
The reason this doubt arose is that every reference document/research paper I've come across states that language identification is a very difficult task. However, just using the words of the language seems to be a simple approach.
EDIT: One reason why n-grams should be preferred is to make the model robust to typos, as stated here. Can anyone point out more?
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases where a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words.(*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g., had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui", and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first, and move to a more complex ML approach if the C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= N <= 5). Doing so, you are very likely to capture most of the functional words/morphemes of the considered language... without any prior linguistic knowledge of that language!
Why would you choose character n-grams over a BoW approach? I think the notion of a character n-gram is pretty straightforward and applies to every written language, whereas a word is a much more complex notion which differs greatly from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
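A compact sketch of the C&T idea, with both the profile building and the "out-of-place" rank distance simplified (the padding and cutoff details are approximations of the paper):
from collections import Counter

def profile(text, max_n=5, top=300):
    # rank the most frequent character n-grams, n = 1..max_n
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [g for g, _ in counts.most_common(top)]

def out_of_place(lang_profile, doc_profile):
    # sum of rank differences; n-grams missing from the language
    # profile get the maximum penalty
    rank = {g: r for r, g in enumerate(lang_profile)}
    return sum(abs(rank.get(g, len(lang_profile)) - r)
               for r, g in enumerate(doc_profile))

# the predicted language is the model with the smallest distance, e.g.:
# best = min(models, key=lambda lang: out_of_place(models[lang], profile(doc)))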
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.