Hierarchy of meaning - machine-learning

I am looking for a method to build a hierarchy of words.
Background: I am an "amateur" natural language processing enthusiast, and one of the problems I am currently interested in is determining the hierarchy of word semantics from a group of words.
For example, suppose I have a set in which one element is a "super" representation of the others, i.e.
[cat, dog, monkey, animal, bird, ... ]
I am interested in any technique that would allow me to extract the word 'animal', which is the most meaningful and accurate representation of the other words inside this set.
Note: they are NOT the same in meaning. cat != dog != monkey != animal
BUT cat is a subset of animal and dog is a subset of animal.
I know by now a lot of you will be telling me to use WordNet. Well, I will try it, but I am actually interested in a very domain-specific area to which WordNet doesn't apply, because:
1) Most of the words are not found in WordNet
2) All the words are in another language; translation is possible, but only to limited effect.
Another example would be:
[ noise reduction, focal length, flash, functionality, .. ]
Here, 'functionality' includes everything else in this set.
I have also tried crawling Wikipedia pages and applying techniques such as tf-idf, but Wikipedia pages don't really help much either.
Can someone possibly enlighten me as to what direction my research should go towards? (I could use anything)

It looks like you want to use something like the hypernym/hyponym relationships in WordNet, but without actually using WordNet due to language and domain-specific coverage issues? That is, if you had the domain-specific hypernym relationships, you could get the "super" representation by just looking for the nearest parent that subsumed all of the words in the list, or the nearest node that was equal to one of the list words and subsumed all of the others.
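For illustration, NLTK's WordNet interface can already do this subsumption lookup in English (a mini-example; it assumes the WordNet corpus has been fetched via nltk.download('wordnet'), and the exact ancestor returned depends on the synsets chosen):

    from nltk.corpus import wordnet as wn

    # Walk pairwise up the hierarchy to the nearest shared ancestor.
    synsets = [wn.synset(n) for n in ('cat.n.01', 'dog.n.01', 'monkey.n.01')]
    common = synsets[0]
    for s in synsets[1:]:
        common = common.lowest_common_hypernyms(s)[0]
    print(common)  # some shared ancestor, e.g. Synset('placental.n.01')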
To start, I would first point out that WordNets are actually available for many of the world's major languages; see the list at Global WordNet.
To get domain-specific hypernym relationships, you could use the technique presented in Snow et al.'s Learning syntactic patterns for automatic hypernym discovery. That is, you could start off with a small list of seed hypernym pairs, and then use them to train a classifier to detect hypernyms in a corpus. You would then run this classifier over data from your domain in order to build a list of domain-specific hypernym pairs.
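A toy sketch of that bootstrapping step (everything here is a made-up stand-in: Snow et al. use dependency-path features from a parser, which I crudely approximate with the surface tokens between the pair):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def pair_features(sentence, hypo, hyper):
        # Crude proxy for a dependency path: the tokens between the pair.
        toks = sentence.lower().split()
        if hypo not in toks or hyper not in toks:
            return None
        i, j = toks.index(hypo), toks.index(hyper)
        lo, hi = sorted((i, j))
        return {'between=' + ' '.join(toks[lo + 1:hi]): 1, 'hyper_first': j < i}

    # Seed examples: (sentence, (hyponym, hypernym candidate), label)
    seeds = [
        ('animals such as cats make popular pets', ('cats', 'animals'), 1),
        ('birds and other animals were seen', ('birds', 'animals'), 1),
        ('cats sat near dogs all day', ('cats', 'dogs'), 0),
    ]
    feats, labels = [], []
    for sent, (a, b), y in seeds:
        f = pair_features(sent, a, b)
        if f is not None:
            feats.append(f)
            labels.append(y)

    vec = DictVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(feats), labels)
    # Run clf over candidate noun pairs from your domain corpus to harvest
    # domain-specific hypernym pairs, then grow the seed list and repeat.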

The opinion mining and sentiment analysis folks might be doing related things, in terms of deciding what words represent features of products, without knowing anything about the products.
A quick sketch of an idea for how you might do this, which I've totally made up on the spot:
Parse a bunch of sentences in the relevant domain; find the noun phrases and adjectives. Figure out which noun phrases are associated with which adjectives. Cluster the noun phrases together based on the set of adjectives used to describe them. Animal words will tend to cluster together because they're going to be described by adjectives like "furry" or "cute", etc. (In particular, hierarchical clustering would probably be most appropriate.)
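If I were to sketch that in code (a totally hypothetical corpus; assumes spaCy's en_core_web_sm model is installed, and the exact clusters depend on the parser):

    import spacy
    from collections import defaultdict
    from sklearn.feature_extraction import DictVectorizer
    from scipy.cluster.hierarchy import linkage, fcluster

    nlp = spacy.load('en_core_web_sm')
    sentences = [
        'the furry cat chased the cute dog',
        'a cute cat and a furry dog slept',
        'the wide lens produced a sharp image',
        'a sharp lens makes a wide image',
    ]

    # Count which adjectives modify which nouns (adjectival-modifier arcs).
    noun_adjs = defaultdict(lambda: defaultdict(int))
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.dep_ == 'amod' and tok.head.pos_ == 'NOUN':
                noun_adjs[tok.head.lemma_][tok.lemma_] += 1

    nouns = list(noun_adjs)
    X = DictVectorizer().fit_transform([noun_adjs[n] for n in nouns]).toarray()

    # Agglomerative (hierarchical) clustering over adjective-profile vectors.
    Z = linkage(X, method='average', metric='cosine')
    print(dict(zip(nouns, fcluster(Z, t=2, criterion='maxclust'))))
    # cat/dog should land in one cluster, lens/image in the other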
If you try this, and it works, let me know. :)

Related

How to find the characteristics of a bunch of word Clusters?

My Motivations: I'm trying to learn German and realized there's a confounding aspect of its structure: every noun has a gender which, in many cases, seems unrelated to the noun itself.
Unlike in languages such as English, each noun takes a different definite article depending on gender: der (masculine), die (feminine), and das (neuter). For example:
das Mädchen ("the girl"), der Rock ("the skirt"), die Hose ("the trousers/pants"). So there seems to be no correlation between the gender assignment of nouns and their meanings.
The Data
I gathered up to 5000 German words, with 3 columns (das, der, die) of 1s and 0s for each word. So my data is already clustered with one-hot encoding, and I'm not trying to predict anything.
Why I'm here: I am clueless about where to start and how to approach this problem, as the concept of distance in clustering doesn't make sense to me in this setting. I can't think of a way to generate an understandable description of these clusters, and the mixed data makes it impossible for me to come up with hard-coded metrics for evaluation.
So, my question is:
I want to find some patterns, some characteristics of these words, that made them fall into a specific cluster. I don't know if I'm making any sense, but some people have managed to find patterns already (for example, word endings; the tendency of elongated objects to be masculine; etc.), and I believe ML/AI could do a far better job at this. Would it be possible for me to do something like this?
Some personal thoughts
While I was doing some research (perhaps naive), I realized the potential options are decision trees and the COBWEB algorithm (a toy sketch of the decision-tree idea follows below). Also, I was thinking I could scrape a few images (say 5) for every word, run some image classification, and inspect the intermediate NN layers to see whether any specific shapes support a specific noun gender. In addition, I was wondering whether scraping Google Ngram Viewer data for these words could help in any way. I couldn't think of a way to use NLP or its subdomains.
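To make the decision-tree option concrete, here is the kind of sketch I have in mind (made-up words; my real 5000-word table would replace them):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.tree import DecisionTreeClassifier, export_text

    words = ['Mädchen', 'Rock', 'Hose', 'Lehrerin', 'Schirm', 'Blume']
    articles = ['das', 'der', 'die', 'die', 'der', 'die']

    def suffix_features(word):
        # Word endings of length 1-3 as candidate gender cues.
        w = word.lower()
        return {f'suffix{k}:{w[-k:]}': 1 for k in (1, 2, 3)}

    vec = DictVectorizer()
    X = vec.fit_transform(suffix_features(w) for w in words)
    tree = DecisionTreeClassifier(max_depth=3).fit(X, articles)

    # The printed rules expose which endings the tree found predictive
    # (e.g. '-in' and '-e' leaning feminine, '-chen' neuter).
    print(export_text(tree, feature_names=vec.get_feature_names_out().tolist()))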
Alternatives: If everything I just wrote sounds nonsensical, please suggest a way to make visual representations of my dataframe (more like nodes and paths with images at the nodes, one for each cluster) in Python, so that I could make pictorial mind maps and try to learn them by heart.
The ultimate purpose is to make learning German simpler for myself, and possibly for others.

I have a dataset on which I want to do phrase extraction using NLP, but I am unable to do so

How can I extract a phrase from a sentence, using a dataset that has a set of sentences and corresponding labels in the form:
Sentence 1: I want to play cricket
Label 1: play cricket
Sentence 2: Need to wash my clothes
Label 2: wash clothes
I have tried chunking with NLTK, but I am not able to use the training data along with the chunks.
The "reminder paraphrases" you describe don't map exactly to other kinds of "phrases" with explicit software support.
For example, the gensim Phrases module uses a purely statistical approach to discover neighboring word-pairings that are so common, relative to the base rates of each word individually, that they might usefully be considered a combined unit. It might turn certain entities into phrases (e.g. "New York" -> "New_York"), or repeated idioms (e.g. "slacking off" -> "slacking_off"). But it only finds neighboring runs of words, not the sort of contextual paraphrase you're seeking.
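For a flavour of what Phrases does (a toy corpus; min_count and threshold are tuned way down here just so the pairs register):

    from gensim.models.phrases import Phrases

    sentences = [
        ['i', 'visited', 'new', 'york', 'last', 'week'],
        ['new', 'york', 'is', 'crowded'],
        ['she', 'kept', 'slacking', 'off'],
    ] * 10  # repeated so the pair statistics clear the threshold

    bigrams = Phrases(sentences, min_count=1, threshold=0.5)
    print(bigrams[['i', 'love', 'new', 'york']])
    # -> ['i', 'love', 'new_york'] once the pairing is statistically strong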
Similarly, libraries which are suitably grammar-aware to mark-up logical parts-of-speech (and inter-dependencies) also tend to simply group and label existing phrases in the text – not create simplified, imperative summaries like you desire.
Still, such libraries' output might help you work up your own rules of thumb. For example, it appears that in your examples so far, your desired "reminder paraphrase" is always one verb plus one noun (that verb's object). So after part-of-speech tagging (as from NLTK or SpaCy), choosing the last verb (perhaps preferring verbs in the present/imperative tense) and the noun phrase that follows it (perhaps stripped of other modifiers/prepositions) may do most of what you need.
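A rough sketch of that heuristic with spaCy (assumes the en_core_web_sm model; the function name is mine, and the rule will certainly fail on harder sentences):

    import spacy

    nlp = spacy.load('en_core_web_sm')

    def reminder_phrase(sentence):
        doc = nlp(sentence)
        verbs = [t for t in doc if t.pos_ == 'VERB']
        if not verbs:
            return None
        verb = verbs[-1]  # the last verb in the sentence
        # First noun chunk after the verb, stripped down to its head noun.
        for chunk in doc.noun_chunks:
            if chunk.start > verb.i:
                return f'{verb.lemma_} {chunk.root.text}'
        return verb.lemma_

    print(reminder_phrase('I want to play cricket'))   # expected: 'play cricket'
    print(reminder_phrase('Need to wash my clothes'))  # expected: 'wash clothes'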
Of course, more complicated examples would need better heuristics. And if the full range of texts you need to work on is very varied, finding a general approach might require many more (hundreds/thousands) of positive training examples: what you think the best paraphrase is, given certain texts. Then, you could consider a number of machine-learning methods that might be able to pick the right ~2 words from larger texts.
Researching published work for "paraphrasing", rather than just "phrase extraction", might also guide you to ideas, but I unfortunately don't know any ready-to-use paraphrasing libraries.

Can Word2Vec be used for information extraction?

I am using Gensim to train Word2Vec. I know word similarities are determined by whether the words can replace each other and still make sense in a sentence. But can word similarities be used to extract relationships between entities?
Example:
I have a bunch of interview documents, and in each interview the interviewee always says the name of their manager. If I wanted to extract the name of the manager from these interview transcripts, could I just get a list of all human names in the document (using NLP), and take the name most similar to the word "manager" according to Word2Vec as the most likely manager?
Does this thought process make any sense with Word2Vec? If it doesn't, would the ML solution to this problem then be to input my word embeddings into a sequence to sequence model?
Yes, word-vector similarities & relative-arrangements can indicate relationships.
In the original Word2Vec paper, this was demonstrated by using word-vectors to solve word-analogies. The most famous example involves the analogy "'man' is to 'king' as 'woman' is to ?".
By starting with the word-vector for 'king', then subtracting the vector for 'man', and adding the vector for 'woman', you arrive at a new point in the coordinate system. And then, if you look for other words close to that new point, often the closest word will be queen. Essentially, the directions & distances have helped find a word that's related in a particular way – a gender-reversed equivalent.
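You can try this directly in gensim with any sufficiently large pretrained model (the GloVe vectors below are one convenient option from gensim-data; results vary by corpus):

    import gensim.downloader as api

    wv = api.load('glove-wiki-gigaword-100')  # pretrained vectors, ~130 MB download
    print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))
    # 'queen' is typically at or near the top of the returned list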
And, in large news-based corpuses, famous names like 'Obama' or 'Bush' do wind up with vectors closer to their well-known job titles like 'president'. (There will be many contexts in such corpuses where the words appear immediately together – "President Obama today signed…" – or simply in similar roles – "The President appointed…" or "Obama appointed…", etc.)
However, I suspect that's less likely to work with your 'manager' interview-transcripts example. Achieving meaningful word-to-word arrangements depends on lots of varied examples of the words in shared usage contexts, and strong vectors require large corpuses of millions to billions of words. So transcripts about a single manager likely wouldn't be enough to train a good model – you'd need transcripts across many managers.
And in such a corpus, each manager's name might not be strongly associated with just manager-like contexts. The same name(s) will be repeated when other roles are mentioned too, and transcripts may not refer to managerial action in the helpful third-person ways that make specific name-vectors well-positioned. (That is, there won't be clean expository statements like "John_Smith called a staff meeting" or "John_Smith cancelled the project", alongside others like "…manager John_Smith…" or "The manager cancelled the project".)

Named entity recognition (NER) features

I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams)
bag: bag of words
cas: global case information
chu: chunk tags
doc: global document information
gaz: gazetteers
lex: lexical features
ort: orthographic information
pat: orthographic patterns (like Aa0)
pos: part-of-speech tags
pre: previously predicted NE tags
quo: flag signing that the word is between quotes
tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is to just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy to read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience, BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature; it is a method of generating features out of a sentence (or whatever block of text you are using). Using n-grams can help account for sequence, but BOW features amount to unordered bags of strings.
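For example, with scikit-learn's CountVectorizer (a minimal illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
    X = vec.fit_transform(['the cat sat', 'the dog sat'])
    print(vec.get_feature_names_out())  # each word/bigram is its own feature
    print(X.toarray())                  # per-sentence counts of those features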
how can POS tags exactly be used as features ? Don't we have a POS tag for each word?
POS tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be the name of a person, a month of the year, or a poorly capitalized conjugated verb, but the POS tag can be the feature that differentiates that fact. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space", the words themselves carry no information about their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I take this one to mean the following: most NLP tasks operate on a sentence. Global document information is data from all the surrounding text in the entire document. For instance, if you are trying to extract geographic placenames and disambiguate them, and you find the word Paris, which one is it? Well, if France is mentioned 5 sentences above, that could increase the likelihood of it being Paris, France rather than Paris, Texas or, worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you correlate a name to a pronoun reference (mapping a name mention to "he" or "she", etc.).
what is the feature trigger words?
Trigger words are specific tokens or sequences that, on their own, reliably indicate a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but that's the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS), while external ones rely on contextual information (previous and next words, document features).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
Yes, BOW generates one feature per word, sometimes combined with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "George Washington" can lead to two features: the entire "George Washington" as a celebrity, and "Washington" as a city.
how can POS tags exactly be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence-labelling methods (e.g. CRF) are used: they make it possible to leverage the previous and next words as additional contextual features when classifying the current word. Labelling a whole text is then a process that relies on the most likely NE type for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues: contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
I recently implemented an NER system in Python, and I found the following features helpful (a rough sketch combining some of them follows the list):
character-level ngrams (using CountVectorizer)
previous word features and labels (i.e. context)
viterbi or beam-search on label sequence probability
part of speech (pos), word-length, word-count, is_capitalized, is_stopword
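To give a flavour of how such features fit together, here is a minimal sketch using the sklearn-crfsuite package (the feature names and toy data are mine, not canonical):

    import sklearn_crfsuite

    def word2features(sent, i):
        # Feature dict for the i-th token of a POS-tagged sentence.
        word, pos = sent[i]
        features = {
            'word.lower': word.lower(),
            'word.is_capitalized': word[0].isupper(),
            'word.length': len(word),
            'pos': pos,
            'suffix3': word[-3:],  # character-level n-gram (affix) clue
        }
        if i > 0:  # context: previous word and its POS
            prev_word, prev_pos = sent[i - 1]
            features.update({'prev.word.lower': prev_word.lower(),
                             'prev.pos': prev_pos})
        else:
            features['BOS'] = True  # beginning of sentence
        return features

    # Toy training data: sentences as (word, POS) pairs, with gold NE labels.
    train_sents = [[('Mr', 'NNP'), ('Smith', 'NNP'),
                    ('visited', 'VBD'), ('Paris', 'NNP')]]
    train_labels = [['O', 'B-PER', 'O', 'B-LOC']]

    X_train = [[word2features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
    crf.fit(X_train, train_labels)
    print(crf.predict(X_train))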

How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct when they apply. Other rules, like matches based on known book titles and authors, would fire more often but be less accurate.
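To make this concrete, here is a toy sketch of the kind of combination I'm imagining (rule names, precisions, and the prior are all made up, and the rules are naively assumed independent):

    import math
    from collections import defaultdict

    # (rule_name, hypothesis) tuples the rules produced for one file.
    matches = [
        ('isbn_match', 'Atlas Shrugged / Ayn Rand'),
        ('title_in_filename', 'Atlas Shrugged / Ayn Rand'),
        ('title_in_filename', 'Atlas of the World / DK'),
    ]

    # Hand-estimated rule precisions: P(hypothesis correct | rule fired).
    rule_precision = {'isbn_match': 0.98, 'title_in_filename': 0.60}

    PRIOR = 0.01  # prior probability that a given candidate book is the right one

    def log_odds(p):
        return math.log(p / (1.0 - p))

    # Naive-Bayes-style combination: treat each firing rule as independent
    # evidence and add its log likelihood ratio to the hypothesis's score.
    scores = defaultdict(lambda: log_odds(PRIOR))
    for rule, hypothesis in matches:
        scores[hypothesis] += log_odds(rule_precision[rule]) - log_odds(PRIOR)

    best = max(scores, key=scores.get)
    print(best, scores[best])  # highest posterior log-odds wins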
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score, to help determine which hypothesis is the strongest or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get generalization good enough to actually save you time: for any type of classifier you will have to create a large training set, and also find a lot of rules, before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, rather than to implement a full classifier. But if the purpose is learning, then go ahead.
