How to use bigrams + trigrams + word-marks vocabulary in CountVectorizer? - machine-learning

I'm using text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of:
bigrams + trigrams + word-marks vocabulary
By word-marks he means the words that are specific to a certain dialect.
How can I tweak those parameters in CountVectorizer?
word marks
These are examples of word marks. They aren't exactly what I have, because mine are Arabic, so I translated them:
word_marks=['love', 'funny', 'happy', 'amazing']
These words are used to classify a text.
Also, in this post:
Understanding the `ngram_range` argument in a CountVectorizer in sklearn
there was this answer:
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
I couldn't understand the output. What does [1, 1] mean here? And how was he able to use n-grams together with a vocabulary? Aren't the two mutually exclusive?

You want to use the ngram_range argument to get bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(1, 3)).
See the accepted answer to this question for more details.
Please provide examples of "word-marks" for the other part of your question.
You may have to run CountVectorizer twice - once for n-grams and once for your custom word-mark vocabulary. You can then concatenate the two outputs from the two CountVectorizers to get a single feature set of n-gram counts and custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
Here's a SO answer on concatenating arrays
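If it helps, here is a minimal sketch of that two-vectorizer idea (my own code, not from the linked answer). The docs list is a toy English placeholder standing in for your Arabic dialect texts, word_marks is your translated custom vocabulary, and scipy.sparse.hstack does the concatenation:
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ["i love this funny amazing movie", "happy to see you"]  # placeholder texts
word_marks = ['love', 'funny', 'happy', 'amazing']              # placeholder word-marks

# n-gram features; (1, 3) keeps unigrams too, use (2, 3) for only bi/tri-grams
ngram_vec = CountVectorizer(ngram_range=(1, 3))
X_ngrams = ngram_vec.fit_transform(docs)

# counts restricted to the custom word-mark vocabulary
mark_vec = CountVectorizer(vocabulary=word_marks)
X_marks = mark_vec.fit_transform(docs)

# column-wise concatenation into a single sparse feature matrix
X = sp.hstack([X_ngrams, X_marks])
X can then be fed to MultinomialNB as usual; just remember to apply the same transform (without refitting) to your test texts.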

Related

How to train a word embedding representation with gensim fasttext wrapper?

I would like to train my own word embeddings with fastText. However, after following the tutorial I cannot manage to do it properly. So far I tried:
In:
from gensim.models.fasttext import FastText as FT_gensim
# Set file names for train and test data
corpus = df['sentences'].values.tolist()
model_gensim = FT_gensim(size=100)
# build the vocabulary
model_gensim.build_vocab(sentences=corpus)
model_gensim
Out:
<gensim.models.fasttext.FastText at 0x7f6087cc70f0>
In:
# train the model
model_gensim.train(
    sentences=corpus,
    epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count,
    total_words=model_gensim.corpus_total_words,
)
print(model_gensim)
Out:
FastText(vocab=107, size=100, alpha=0.025)
However, when I look for a word in the vocabulary:
print('return' in model_gensim.wv.vocab)
I get False, even though the word is present in the sentences I am passing to the FastText model. Also, when I check the most similar words to 'return', I get single characters:
model_gensim.most_similar("return")
[('R', 0.15871645510196686),
('2', 0.08545402437448502),
('i', 0.08142799884080887),
('b', 0.07969795912504196),
('a', 0.05666942521929741),
('w', 0.03705815598368645),
('c', 0.032348938286304474),
('y', 0.0319858118891716),
('o', 0.027745068073272705),
('p', 0.026891689747571945)]
What is the correct way of using gensim's fasttext wrapper?
The gensim FastText class doesn't take plain strings as its training texts. It expects lists-of-words, instead. If you pass plain strings, they will look like lists-of-single-characters, and you'll get a stunted vocabulary like you're seeing.
Tokenize each item of your corpus into a list-of-word-tokens and you'll get closer-to-expected results. One super-simple way to do this might just be:
corpus = [s.split() for s in corpus]
But, usually you'd want to do other things to properly tokenize plain-text as well – perhaps case-flatten, or do something else with punctuation, etc.
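For example, one simple option (an assumption on my part, not the only way to tokenize) is gensim's own simple_preprocess helper, which lowercases and strips punctuation while tokenizing:
from gensim.utils import simple_preprocess

# each document becomes a list of lowercased word tokens
corpus = [simple_preprocess(s) for s in corpus]
Any other tokenizer that returns a list of word tokens per document would work just as well.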
If you want to inspect the vocabulary, you can write the words to a text file and look at them there. This could be helpful for you:
with open("vocab.txt", "w", encoding="utf8") as vocab_out:
    for word in model_gensim.wv.vocab:
        vocab_out.write(word + "\n")

Feedback in NaiveBayes Text Classification

I am a newbie in machine learning. I am building a complaint categorizer and I want to provide a feedback mechanism so that it can improve over time.
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
value=[
'drought',
'robber',
]
targets=[
'water_department',
'police_department',
]
classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1],classes=numpy.unique(targets))
for c,t in zip(counts[1:],targets[1:]):
    classifier.partial_fit(c, t.split())
value.append('dogs') #new value to train
targets.append('animal_department') #new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print counts
print targets
print vectorize.vocabulary_
####problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
####problem lies here
Even if I somehow find the index of the newly inserted value, it gives the error
ValueError: Number of features 3 does not match previous data 2.
even though I made it insert one value at a time.
I will try to answer the question from a general point of view. There are two sources of problems in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible ways to fill each place, and hence the total number of possible character 3-grams is 26^3 = 17576, which is significantly lower than the number of possible English words that you're likely to see in text.
Hence, generally speaking, while training NB, a good idea is to use probabilities of character n-grams (n=3,4,5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams). Update the count of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times the term t was observed in class c, simply update the count if you see t in class 0 (or class 1) during incremental training. Updating the counts will update the maximum likelihood probability estimates as well.
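As a rough illustration of both points (my own sketch, not the answerer's code, assuming scikit-learn), HashingVectorizer gives a fixed-size character n-gram feature space, so partial_fit never sees a changing number of features and unseen words still share character 3-5-grams with known ones:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

classes = np.array(['water_department', 'police_department', 'animal_department'])

# fixed-size hashed feature space over character 3-5-grams;
# alternate_sign=False keeps counts non-negative for MultinomialNB
vectorizer = HashingVectorizer(analyzer='char_wb', ngram_range=(3, 5),
                               n_features=2**18, alternate_sign=False)
classifier = MultinomialNB()

# initial training, one example at a time
classifier.partial_fit(vectorizer.transform(['drought']), ['water_department'], classes=classes)
classifier.partial_fit(vectorizer.transform(['robber']), ['police_department'])

# later feedback: add a new example without refitting the vectorizer
classifier.partial_fit(vectorizer.transform(['dogs']), ['animal_department'])
print(classifier.predict(vectorizer.transform(['stray dogs in my street'])))
Because the vectorizer is stateless, there is no vocabulary to re-fit, which is exactly what avoids the "Number of features ... does not match previous data" error.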

Use pos tagging in bag of words

I'm using the bag of words for text classification.
Results aren't good enough; test set accuracy is below 70%.
One of the things I'm considering is to use POS tagging to distinguish the function of words. What is the best approach to doing it?
I'm thinking of appending the tags to the words. For example, for the word "love", if it's used as a noun, use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you propose sounds good; it is essentially adding feature conjunctions as additional features (a short sketch of this appears after these suggestions). Here are a few suggestions:
Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you get two features from love:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start from the nltk Python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
Consider using n-grams, maybe starting by adding 2-grams. For example, you might have "in" and "stock", and you might just remove "in" because it is a stop word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning from "stock" alone. It might help a lot in certain cases, for example, to distinguish "finance" from "shopping".
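Here is a minimal sketch of the word + word_POS idea (mine, not the answerer's exact code). It uses a callable analyzer so CountVectorizer consumes the token list directly, and it assumes the NLTK punkt and averaged_perceptron_tagger data are installed:
from nltk import pos_tag, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def word_and_pos_features(text):
    # tag first, then lowercase, keeping both the word and the word_TAG conjunction
    tagged = pos_tag(word_tokenize(text))
    return [w.lower() for w, t in tagged] + ["%s_%s" % (w.lower(), t) for w, t in tagged]

vectorizer = CountVectorizer(analyzer=word_and_pos_features)
X = vectorizer.fit_transform(["Love is a lovely thing"])
print(sorted(vectorizer.vocabulary_))
This keeps the original bag-of-words features intact while adding the POS-disambiguated variants alongside them.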

Stanford GloVe's lack of punctuation?

I understand that GloVe trains vectors by noticing what frequently co-occurs, etc., but how come commas and periods are not included? For anything NLP, it seems like having a vector representation for them would be important. I realize that something like (king - man = queen) would make no sense with (word - , = ?), but is there a way to represent punctuation marks and numbers?
Is there a pre-made data set that includes such things? Would this even work?
I tried training GloVe with my own data set, but I ran into a problem with separating the punctuation (with a blank space) between words, etc.
Pre-trained GloVe vectors do have punctuation; what makes you think they don't? At least the Wikipedia 2014 + Gigaword 5 (6B tokens) set from http://nlp.stanford.edu/projects/glove/ has embeddings for ",", ".", "-" and others included. Just download these word vectors and verify it yourself; they are in plain text format, so it's easy to do.
I have worked a bit with the word vectors used by Senna, and I am looking at the vocab list.
http://ml.nec-labs.com/senna/
I definitely see entries for punctuation.
A trick for handling numbers is to replace every digit with 0, and then learn a distribution for each pattern. For instance 1999 is mapped to 0000, 01-01-2015 is mapped to 00-00-0000, etc...
Senna has entries for these patterns like 0000, etc...
I will look over GloVe and try to update this answer soon...
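The digit-to-0 trick described above is essentially a one-line regex; here is a quick hedged illustration (my own code, not Senna's actual preprocessing):
import re

print(re.sub(r"\d", "0", "born 01-01-2015, price 1999"))
# -> born 00-00-0000, price 0000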
It is totally OK, and also common, to handle punctuation marks as single tokens for word vector generation. See, for example, the word2vec papers. I assume that the prebuilt word2vec datasets include punctuation, and I'm sure the prebuilt GloVe vectors include punctuation as well.
There are a lot of tokenizers that separate punctuation marks into separate tokens. One I know for sure is the ARK Tweet Tokenizer.
I have used the following conversion for numbers and punctuation. It is not a great approach, but it can be slightly useful.
For numbers, I convert all numbers to "NUM".
ex: 178 = "NUM" or 654 = "NUM"
For punctuation, I convert the marks to "PUNC".
ex: apple, orange, banana = apple "PUNC" orange "PUNC" banana
This is not a good solution, but it works in some way.
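A rough sketch of that conversion (my own code, only an illustration of the idea described above):
import re

def normalize(text):
    text = re.sub(r"\d+", "NUM", text)          # whole numbers -> NUM
    text = re.sub(r"[^\w\s]", " PUNC ", text)   # punctuation marks -> PUNC
    return " ".join(text.split())

print(normalize("apple, orange, banana cost 178"))
# apple PUNC orange PUNC banana cost NUM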

How do you write a program to find if certain words are similar?

Ie: "college" and "schoolwork" and "academy" belong in the same cluster,
the words "essay", "scholarships" , "money" also belong in the same cluster. Is this a ML or NLP problem?
It depends on how strict your definition of similar is.
Machine Learning Techniques
As others have pointed out, you can use something like latent semantic analysis or the related latent Dirichlet allocation.
Semantic Similarity and WordNet
As was pointed out, you may wish to use an existing resource for something like this.
Many research papers (example) use the term semantic similarity. The basic idea is that it is usually computed by finding the distance between two words in a graph, where a word is a child of another word if it is a type of it. Example: "songbird" would be a child of "bird". Semantic similarity can be used as a distance metric for creating clusters, if you wish.
Example Implementation
In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a Gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully that points you in the right direction and gives you a few more search terms.
def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.
    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.
    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn
    results = []
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except:
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print net1
            print net1.definition
            print net2
            print net2.definition
            print 'path similarity:'
            print net1.path_similarity(net2)
            print 'lch similarity:'
            print net1.lch_similarity(net2)
            print 'wup similarity:'
            print net1.wup_similarity(net2)
            print '-' * 79
    return True
Example output
>>> sim('college', 'academy')
True
>>> sim('essay', 'schoolwork')
False
>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True
>>> sim('human', 'man')
True
>>> sim('human', 'car')
False
>>> sim('fare', 'food')
True
>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True
>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True
>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
I suppose you could build your own database of such associations using ML and NLP techniques, but you might also consider querying existing resources such as WordNet to get the job done.
If you have a sizable collection of documents related to the topic of interest, you might want to look at Latent Dirichlet Allocation (LDA). LDA is a fairly standard NLP technique that automatically clusters words into topics, where similarity between words is determined by collocation in the same document (you can treat a single sentence as a document if that serves your needs better).
You'll find a number of LDA toolkits available. We'd need more detail on your exact problem before recommending one over another. I'm not enough of an expert to make that recommendation anyway, but I can at least suggest you look at LDA.
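If you just want to experiment before committing to a toolkit, here is a hedged sketch with scikit-learn's LatentDirichletAllocation on a toy corpus (the documents are placeholders; get_feature_names_out needs a reasonably recent scikit-learn):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "college schoolwork academy exams lectures",
    "essay scholarships money tuition grants",
    "college tuition scholarships essay",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# fit two topics and print the top words in each
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print("topic %d: %s" % (k, ", ".join(top)))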
The famous quote regarding your question is by John Rupert Firth in 1957:
You shall know a word by the company it keeps
To start delving into this topic you can look into this presentation.
Word2Vec can play a role in finding similar words (contextually/semantically). In word2vec, words are represented as vectors in an n-dimensional space, so we can calculate the distance between words (e.g., Euclidean distance) or simply cluster them.
From this, we can come up with a numerical value for the similarity between two words.
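A minimal sketch with gensim's Word2Vec (gensim 4.x API; the sentences are toy placeholders, and a real model needs far more data to give meaningful neighbours):
from gensim.models import Word2Vec

sentences = [
    ["college", "schoolwork", "academy"],
    ["essay", "scholarships", "money"],
    ["college", "essay", "scholarships"],
]
model = Word2Vec(sentences, vector_size=50, min_count=1, window=2, epochs=200)

# cosine similarity between two words, and nearest neighbours of one word
print(model.wv.similarity("college", "academy"))
print(model.wv.most_similar("college", topn=3))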
