NLP & ML Phrase Extraction - machine-learning

Which ML algorithms can I use to learn to extract action phrases from a given sentence?
Sentence1: I want to play cricket
Label1: play cricket
Sentence2: Need to wash my clothes
Label2: wash clothes
I have a dataset of ~2k sentences with their corresponding action phrases (labels), and I need to predict the action phrases for another batch of sentences based on them. Can someone guide me on how to do this using NLP/ML? Which algorithms should I use (preferably in Python)?

Here's the process for sentence classification:
1) Normalize the text - convert all text to lower case
2) Remove stop words - this leaves mostly the informative tokens as features
3) Tokenize the sentences into unigram tokens
4) Apply stemming - try different stemmers or a lemmatizer to reduce words to their base form, and see which works best for your case. For example, play, played and plays are all converted to the base word "play". This step reduces the number of features.
5) Create a Term Document Matrix (TDM) for all the sentences. Each row of the TDM corresponds to a sentence and each column corresponds to a token from the vocabulary. (Tf-Idf is another way of representing the text as a matrix.)
6) The term document matrix now has tokens as columns, and you already have the labels in place, so you can start training ML models. I'm assuming you know how to do this part.
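Steps 1 to 5 can be sketched in plain Python as follows. This is only a minimal illustration: the stop-word list and the suffix-stripping naive_stem function are made-up stand-ins (a real project would more likely use a library stemmer or lemmatizer), and the example sentences are invented.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; real lists are much longer.
STOP_WORDS = {"i", "to", "my", "the", "a", "he", "and"}


def naive_stem(token):
    """Toy stemmer (an assumption for this sketch): strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Lower-case, tokenize into unigrams, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_tdm(sentences):
    """Return (vocabulary, matrix): one row per sentence, one column per token."""
    docs = [preprocess(s) for s in sentences]
    vocabulary = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix


sentences = [
    "I want to play cricket",
    "Need to wash my clothes",
    "He played and plays cricket",
]
vocabulary, matrix = build_tdm(sentences)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of matrix can then be paired with its label and passed to whichever classifier you pick in step 6.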

Take a look at NLTK's Naive Bayes classifier. It's multiclass, and you can feed it your sentence/label pairs once each sentence is turned into a feature dictionary.
NaiveBayesClassifier.train() expects training features; I would start with the features simply being the words in each sentence. You can refine the feature selection with more complex methods until you get the results you want.
You can use nltk.classify.util.accuracy to evaluate the results. Remember to split your sentences into training and test data.
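A minimal sketch of that idea, assuming NLTK is installed. The sentences are toy data invented for illustration, and word_features is a hypothetical helper name. Note that this treats every distinct action phrase as one class, so it can only predict phrases that already occur in the training labels.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy


def word_features(sentence):
    """Bag-of-words features: every lower-cased word maps to True."""
    return {word: True for word in sentence.lower().split()}


# Toy data (made up); each label is treated as one class.
train_pairs = [
    ("I want to play cricket", "play cricket"),
    ("Let us play cricket tomorrow", "play cricket"),
    ("We play cricket every weekend", "play cricket"),
    ("Need to wash my clothes", "wash clothes"),
    ("I must wash the clothes", "wash clothes"),
    ("Please wash these clothes today", "wash clothes"),
]
test_pairs = [
    ("They want to play cricket", "play cricket"),
    ("Need to wash clothes now", "wash clothes"),
]

# NLTK expects a list of (feature_dict, label) tuples.
train_set = [(word_features(s), label) for s, label in train_pairs]
test_set = [(word_features(s), label) for s, label in test_pairs]

classifier = NaiveBayesClassifier.train(train_set)
test_accuracy = accuracy(classifier, test_set)
predicted = classifier.classify(word_features("They want to play cricket"))
print(test_accuracy, predicted)
```

With ~2k real sentences you would hold out a larger share for testing than the two sentences used here.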

Related

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names. The idea is that the user can provide a new name (which may not be in the vocabulary) and I calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc). In addition, I applied stemming and separated prefixes in the hope of getting more precision. As a result, words such as BIOCHEMICAL ended up as BIO CHEMIC, which is the word divided in two (prefix and stem).
The average company name is 3 words long. [Figure: frequency of name lengths in words]
The tokens that are the result of preprocessing are sent to word2vec:
#window: Maximum distance between the current and predicted word within a sentence
#min_count: Ignores all words with total frequency lower than this.
#workers: Use these many worker threads to train the model
#sg: The training algorithm, either CBOW(0) or skip gram(1). Default is 0
word2vec_model = Word2Vec(prepWords,size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector']=df2.apply(lambda row : avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I proceed to create the model again and calculate the average sentence vector and save it again.
I have found that this model is not working as expected. As an example, asking for the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset, I have words such as paws or petcare, but unrelated words are ending up closest to pet.
[Figure: distribution of the nearest words for pet]
On the other hand, when I used the GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and words around was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
[Figure: distribution of the nearest words for pet with the GoogleNews vectors]
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the dataset large enough to allow word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same type as GoogleNews, where for instance the word pet is correctly placed among similar words?
Is it feasible to use an alternative such as fasttext, considering the nature of the current dataset?
Do you know of any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some words' influence compared to other methods.
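To make the averaging baseline concrete, here is a small sketch in plain Python. The 3-dimensional toy_vectors are invented for illustration (a trained model would supply the real ones), and average_vector and cosine_similarity are hypothetical helper names. It also shows the limitation mentioned above: a token with no vector is simply skipped.

```python
import math

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
toy_vectors = {
    "pet": [0.9, 0.1, 0.0],
    "care": [0.7, 0.3, 0.1],
    "paws": [0.8, 0.2, 0.1],
    "auto": [0.0, 0.9, 0.4],
    "parts": [0.1, 0.8, 0.5],
}


def average_vector(tokens, vectors):
    """Mean of the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


pet_care = average_vector(["pet", "care"], toy_vectors)
happy_paws = average_vector(["happy", "paws"], toy_vectors)  # "happy" is unseen and skipped
auto_parts = average_vector(["auto", "parts"], toy_vectors)

print(cosine_similarity(pet_care, happy_paws))
print(cosine_similarity(pet_care, auto_parts))
```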
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison - edit distances, count of shared character-n-grams - may be helpful too, though it won't detect all synonyms/semantically-similar words.
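As a rough sketch of the lexicographic option, the following uses only the standard library. The function names are hypothetical, and the choice of trigrams with a Jaccard ratio is just one reasonable way to score shared character n-grams, not the only one.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=3):
    """Set of character n-grams of the lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams of both strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(edit_distance("petcare", "pet care"))
print(ngram_jaccard("petcare", "petshop"))
```

For a few thousand short names, comparing a new name against every stored name this way should be cheap enough to do at lookup time.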
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough - you still only have one data point. You may as well leave out unseen words, word2vec cannot work better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
At a glance, I think you can also use FastText without needing to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
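To illustrate what "vectors of substrings of characters" means, here is a small sketch that lists the character n-grams of a word wrapped in boundary markers. The subword_ngrams name and the 3 to 6 length range are assumptions for this illustration; the real library also hashes the n-grams into a fixed number of buckets, which is not shown.

```python
def subword_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A word seen in training and a made-up word that was never seen.
seen = set(subword_ngrams("petcare"))
unseen = set(subword_ngrams("petshop"))

# The unseen word can borrow from n-grams it shares with seen words.
shared = sorted(seen & unseen)
print(subword_ngrams("pet", 3, 3))
print(shared)
```

The more n-grams a new name shares with names in the training data, the more its guessed vector is informed by them.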
FastText seems to be useful for your purpose.
For your task, you can follow the FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (pretrainedVectors parameter).

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?
I did some research online on how to extend my training set with data transformations, the same way we do in image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
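The four operations above can be sketched in a few lines of Python. The stop-word list and the SYNONYMS table below are invented placeholders (a real setup might draw synonyms from a thesaurus such as WordNet), and the function names are only illustrative.

```python
import random

# Hypothetical stop-word list and synonym table, for illustration only.
STOP_WORDS = {"the", "a", "is", "to", "my", "i"}
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}


def _replaceable(words):
    """Indices of non-stop words that have at least one known synonym."""
    return [i for i, w in enumerate(words) if w not in STOP_WORDS and w in SYNONYMS]


def synonym_replacement(words, n, rng):
    """Replace up to n eligible words with a randomly chosen synonym."""
    result = list(words)
    candidates = _replaceable(result)
    for i in rng.sample(candidates, min(n, len(candidates))):
        result[i] = rng.choice(SYNONYMS[result[i]])
    return result


def random_insertion(words, n, rng):
    """n times: insert a synonym of a random eligible word at a random position."""
    result = list(words)
    for _ in range(n):
        candidates = _replaceable(result)
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[result[rng.choice(candidates)]])
        result.insert(rng.randint(0, len(result)), synonym)
    return result


def random_swap(words, n, rng):
    """n times: swap the positions of two randomly chosen words."""
    result = list(words)
    if len(result) < 2:
        return result
    for _ in range(n):
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]


rng = random.Random(0)  # seeded so that runs are repeatable
sentence = "the quick dog is happy".split()
print(synonym_replacement(sentence, 1, rng))
print(random_insertion(sentence, 1, rng))
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.2, rng))
```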
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model get more data based on external information. For instance, randomly replacing a toxic comment token in the sentence with its closest token in a pre-trained vector space trained specifically on external online comments.
Is it a good method, or am I missing some important drawbacks of this technique?
Your idea of using a word2vec embedding usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, BERT has a pre-training task that randomly masks out words in a sentence). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science, etc. This will be for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question: how does this differ from a lexicon, which instead of a list of websites contains a list of words with weights such as 0 or 1 for particular categories? An example would be a sentiment lexicon whose words have weights for positive and negative, but with categories such as Math and Science used instead of positive and negative.
You say you want to do web page categorization, so the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text, which means you work with textual input data. Since you want to categorize them, each input has one or more corresponding labels, which are the outputs you want to predict. Since a page can have more than one label, this is multi-label classification.
To tackle this problem, since most machine learning algorithms work with numerical vectors, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag-of-words technique, which first builds a dictionary or lexicon and then counts the occurrences of each word of the dictionary in each text. You can transform your output labels in the same way, assigning one index of your output vector to each category.
The final pipeline would be something like this:
[input_text] --bag_of_word--> [input_vector] --prediction--> [output_vector] --label_matching--> [labels]
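The two ends of this pipeline can be sketched in plain Python as below. The page texts and category list are invented, the helper names are hypothetical, and the prediction step in the middle is deliberately left out because it depends on the model you choose.

```python
from collections import Counter

# Hypothetical category list; each position is one index of the output vector.
CATEGORIES = ["Math", "Science", "Business"]


def build_vocabulary(texts):
    """Dictionary (lexicon) of every distinct word in the corpus, sorted."""
    return sorted({word for text in texts for word in text.lower().split()})


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def labels_to_vector(labels, categories=CATEGORIES):
    """Multi-label target: 1 at the index of every category the page has."""
    return [1 if category in labels else 0 for category in categories]


def vector_to_labels(vector, categories=CATEGORIES):
    """Label matching: turn an output vector back into category names."""
    return [category for category, flag in zip(categories, vector) if flag]


# Toy page contents (made up) and their labels.
pages = ["algebra and geometry proofs", "physics uses algebra"]
page_labels = [["Math"], ["Math", "Science"]]

vocabulary = build_vocabulary(pages)
input_vectors = [bag_of_words(page, vocabulary) for page in pages]
output_vectors = [labels_to_vector(labels) for labels in page_labels]

print(vocabulary)
print(input_vectors)
print(output_vectors)
print(vector_to_labels(output_vectors[1]))
```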

Genres classification of documents

I'm looking for a library (machine learning or something else, it doesn't matter) that will help me categorize the content I have. Basically, the content is written articles, and I want to know which of them are about politics, sports and so on, so I have to categorize them.
I tried openNLP but could not get it working the way I need. Is there anything else that would solve my need?
I guess I need some kind of Machine learning with natural language processing NLP but I can't find something that will do my job at this point.
This is a naive implementation, but you could improve on it further. To classify a paragraph under a category, first extract the unique words from the training data of each particular topic.
For example: use NLTK to extract the unique words from the collection of paragraphs that talk about sports and store them in a set. Then do the same for the other topics and store them in sets. Now subtract the words the sets have in common, so that you are left with the words that might represent one particular topic.
So, now when you input a paragraph, it should give you the one-hot output.
Now combine all the unique words into one list.
When you are analyzing a paragraph and you find one of those words, put a 1 at its position.
For example, after analysing your first paragraph, you might get the result
[ 0, 0, 1, 0, 1, .... 1, 0, 0] -> this denotes that the unique word at position 3 was found, and so on.
So your training data will have this as input and a one-hot encoded output.
I.e., if you have three categories and your first paragraph belongs to the 1st topic, then the outcome will be [1,0,0].
Collect many inputs and outcomes to train on and then test with new inputs. You will get the highest probability for the topic it fits.
You can train it with a basic neural network and a normal softmax loss function. This might take you just an hour to do.
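The feature-building part of this could look roughly like the sketch below. The training paragraphs are toy data, the helper names are made up, whitespace splitting stands in for a proper tokenizer, and the neural network itself is left out.

```python
# Toy training paragraphs per topic (invented for illustration).
topic_paragraphs = {
    "sports": ["the team won the match", "the player scored a goal"],
    "politics": ["the minister won the election", "parliament passed the law"],
}
topics = list(topic_paragraphs)


def words_of(paragraphs):
    """Set of distinct lower-cased words across the paragraphs."""
    return {word for paragraph in paragraphs for word in paragraph.lower().split()}


topic_words = {topic: words_of(paragraphs) for topic, paragraphs in topic_paragraphs.items()}

# Words that occur under more than one topic carry no topic signal here.
common = set()
for i, first in enumerate(topics):
    for second in topics[i + 1:]:
        common |= topic_words[first] & topic_words[second]

# Combined list of the remaining topic-specific words: one input position each.
feature_words = sorted(set().union(*topic_words.values()) - common)


def encode_paragraph(paragraph):
    """Binary input vector: 1 where a topic-specific word occurs in the paragraph."""
    present = set(paragraph.lower().split())
    return [1 if word in present else 0 for word in feature_words]


def one_hot(topic):
    """One-hot training target for the given topic."""
    return [1 if t == topic else 0 for t in topics]


print(feature_words)
print(encode_paragraph("the team scored"), one_hot("sports"))
```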
All the best.
I would suggest two methods, depending on your data:
First, if you already know how many classes there are in your textual data, e.g. sports vs politics vs science, you can use a supervised learning algorithm (SVM, MLP, LR, ...).
Second, if you don't know how many classes you will encounter in your data, it's best to use an unsupervised learning algorithm such as LDA or LSI, which will group documents with similar topics; you then only have to examine a few documents from each group by hand and assign a label to it.
As for your data representation, you can use the scikit-learn or Spark CountVectorizer to create BoW (Bag of Words) vectors to feed to your learning algorithm.
I will just add that it's best (more memory efficient and faster) to use scipy sparse vectors if you have a big vocabulary.
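For the supervised case, a minimal sketch with scikit-learn might look like this, assuming scikit-learn and SciPy are installed. The articles and labels are toy data, and logistic regression is just one of the algorithms listed above. CountVectorizer already returns a SciPy sparse matrix, so the memory advice comes for free.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy articles (invented) with their known classes.
articles = [
    "the team won the final match",
    "the striker scored two goals",
    "parliament passed the new law",
    "the minister won the election",
]
labels = ["sports", "sports", "politics", "politics"]

# Bag-of-words counts; the result is a sparse matrix, not a dense array.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(articles)
print(issparse(X), X.shape)

# LR from the list above; an SVM or MLP could be swapped in the same way.
model = LogisticRegression()
model.fit(X, labels)

# New text must go through the same fitted vectorizer.
new_X = vectorizer.transform(["the striker scored in the final match"])
prediction = model.predict(new_X)[0]
print(prediction)
```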

Sentence classification using Weka

I want to classify sentences with Weka. My features are the sentence terms (words) and the Part of Speech tag of each term. I don't know how to define the attributes: if each term is represented as one feature, the number of features differs from one instance (sentence) to another. And if all the words in a sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
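Here is a small Python sketch of the "append the POS tag to the word" idea, just to show how every sentence ends up with the same number of features. The tagged sentences are hand-written examples (in practice a tagger would produce them), and the helper names are hypothetical. Each feature name would become one numeric attribute in the dataset you give to Weka.

```python
from collections import Counter

# Hypothetical pre-tagged sentences: lists of (word, POS tag) pairs.
tagged_sentences = [
    [("I", "PRP"), ("play", "VBP"), ("cricket", "NN")],
    [("The", "DT"), ("play", "NN"), ("was", "VBD"), ("long", "JJ")],
]


def word_pos_tokens(tagged):
    """Join each lower-cased word with its tag, e.g. 'play_VBP' vs 'play_NN'."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged]


# One feature per distinct word_TAG token seen in the training data.
feature_names = sorted(
    {token for sentence in tagged_sentences for token in word_pos_tokens(sentence)}
)


def to_feature_row(tagged):
    """Fixed-length row: how often each known word_TAG token occurs."""
    counts = Counter(word_pos_tokens(tagged))
    return [counts[name] for name in feature_names]


rows = [to_feature_row(sentence) for sentence in tagged_sentences]
print(feature_names)
for row in rows:
    print(row)
```

Note how the verb and noun uses of "play" land in different columns, which is the partial disambiguation mentioned above.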
