Fuzzy matching sentences to stanzas - machine-learning

I have lyrics from srt subtitle files. If I want to match them to stanzas from another lyrics website, what is the best approach to this?
My approach has been to take a tf-idf vector for each lyric line and try to fuzzy match it to the stanza, using the start and end time of the lyric line as a clue to whether the line might belong to the previous stanza, the next stanza, or a stanza of its own.
I've also tried dynamic programming, but with less success. Due to the high variance in the structure of the lyrics and the stanzas, the results sometimes come out completely shifted or messed up, especially if there is a repeated chorus.
Is there an existing approach to such a problem using recurrent neural networks or another machine learning algorithm?

I suggest using Doc2Vec or Word2Vec methods. Basically, you train a neural network on some corpus; the network generates a vector for each word/document, and those vectors are similar when the words are used similarly in the real world (the corpus).
So the vectors for love and care will be very similar. These vectors also hold some other cool properties: if you subtract or add them, you can get a word whose meaning reflects the subtraction or addition.
Once you have the vectors of words or docs, you can check the similarity between vectors with various methods; cosine similarity is commonly used.
This method is often mixed with tf-idf to generate weights for best results.
Usage is very easy, for example:
from gensim.models import Word2Vec
from nltk.corpus import brown
b = Word2Vec(brown.sents())
print(b.most_similar('money', topn=5))
Output:
[('care', 0.9145717024803162), ('chance', 0.9034961462020874), ('job', 0.8980690240859985), ('trouble', 0.8751360774040222), ('everything', 0.873866856098175)]
I suggest taking a look at gensim.
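As a rough sketch of how this could be applied to the stanza-matching problem (an illustration, not the exact setup from the question: the lyric lines, stanzas, and training corpus are toy data, and gensim 4.x / scikit-learn 1.x APIs are assumed):

import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data: lyric lines from the .srt file and stanzas from the lyrics website.
srt_lines = ["you are my sunshine", "you make me happy"]
stanzas = ["you are my sunshine my only sunshine",
           "you make me happy when skies are gray"]

# Train a small Word2Vec model on whatever text is available (here, just the toy data).
w2v = Word2Vec([s.split() for s in srt_lines + stanzas],
               vector_size=50, min_count=1, seed=0)

# Fit tf-idf on the stanzas so each word gets an importance weight.
tfidf = TfidfVectorizer().fit(stanzas)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def embed(text):
    # tf-idf weighted average of the word vectors in a piece of text.
    vecs = [w2v.wv[w] * idf.get(w, 1.0) for w in text.split() if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Score each lyric line against each stanza; the best-scoring stanza is the candidate match.
for line in srt_lines:
    scores = [cosine(embed(line), embed(stanza)) for stanza in stanzas]
    print(line, "-> stanza", int(np.argmax(scores)), scores)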

Related

Does summing up word embedding vectors in ML destroy their meaning?

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a vector of uniform size.
One thing I've done is taken every word in the paragraph, vectorized it using GloVe word2vec and then summed up all of the vectors to create a "paragraph" vector, which I've then fed in as an input for my model. In doing so, have I destroyed any meaning the words might have possessed? Considering these two sentences would have the same vector:
"My dog bit Dave" & "Dave bit my dog", how do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting high accuracy (>90%) for a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion I received was to use transfer learning through a Hugging Face transformer to get a vector for the whole paragraph. Which one is more feasible?
I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting high accuracy (>90%) for a relatively simple model like RandomForestClassifier just by using this summing-up method. Any insights?
If you look up papers on aggregating word embeddings, you'll find that this does in fact happen sometimes, especially when the texts are shorter.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging.
In doing so, have I destroyed any meaning the words might have possessed?
As you remarked, you throw out information about word order. But that's not even the worst part: most of the time, for longer documents, if you embed everything, the mean gets dominated by common words ("how", "like", "do", etc.). BTW, see my answer to this question.
Other than that, one trick I've seen is to average the word vectors but subtract the first principal component of PCA on the word embedding matrix. For details you can see, for example, this repo, which also links to the paper (BTW, the paper suggests you can ignore the "Smooth Inverse Frequency" stuff, since the principal-component removal does the useful part).
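A minimal sketch of that trick (an in-memory dict mapping words to vectors stands in for GloVe; the SIF frequency weighting from the paper is omitted and only the principal-component removal is shown):

import numpy as np
from sklearn.decomposition import TruncatedSVD

def sentence_vectors(sentences, embeddings, dim):
    # Average the word vectors of each sentence.
    mat = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        vecs = [embeddings[w] for w in sent.split() if w in embeddings]
        if vecs:
            mat[i] = np.mean(vecs, axis=0)
    # Estimate the dominant direction shared by all sentence vectors...
    pc = TruncatedSVD(n_components=1, random_state=0).fit(mat).components_[0]
    # ...and project it out of every sentence vector.
    return mat - np.outer(mat @ pc, pc)

# Tiny demo with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in "my dog bit dave".split()}
print(sentence_vectors(["my dog bit dave", "dave bit my dog"], emb, 50).shape)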

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names, and the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision. So words such as BIOCHEMICAL ended up as BIO CHEMIC, which is the word divided in two (prefix and stemmed word).
The average company name is about 3 words long.
The tokens that are the result of preprocessing are sent to word2vec:
#window: Maximum distance between the current and predicted word within a sentence
#min_count: Ignores all words with total frequency lower than this.
#workers: Use these many worker threads to train the model
#sg: The training algorithm, either CBOW (0) or skip-gram (1). Default is 0.
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I proceed to create the model again and calculate the average sentence vector and save it again.
I have found that this model is not working as expected. For example, computing the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset I have words such as paws or petcare, but other, unrelated words end up closest to pet.
This is the distribution of the nearest words for pet:
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and the surrounding words was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the length of the dataset enough to allow word2vec "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where, for instance, the word pet is correctly placed among similar words?
Is it feasible to implement an alternative such as fastText, considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
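For instance, something along these lines (a sketch; the toy prepWords list stands in for the question's preprocessed company names, and gensim 4.x parameter names are assumed):

from gensim.models import Word2Vec

# Stand-in for the question's preprocessed company-name tokens.
prepWords = [["pet", "care", "center"], ["bio", "chemic", "lab"],
             ["auto", "part", "depot"], ["pet", "food", "store"]]

word2vec_model = Word2Vec(
    prepWords,
    vector_size=32,   # far smaller than the default 100, given the tiny corpus
    window=2,
    min_count=1,
    sg=1,
    epochs=100,       # far more passes than the default 5
    seed=0,
)
print(word2vec_model.wv.most_similar('pet', topn=5))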
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some words' influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison (edit distances, counts of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically-similar words.
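A quick sketch of such a lexicographic comparison, using character-trigram Jaccard overlap plus the standard library's SequenceMatcher as an edit-distance-like ratio (the blend weights and example names are arbitrary):

from difflib import SequenceMatcher

def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def name_similarity(a, b):
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    jaccard = len(grams_a & grams_b) / len(grams_a | grams_b)
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return 0.5 * jaccard + 0.5 * ratio  # simple blend of the two signals

print(name_similarity("BioChemical Corp", "Bio-Chem Corporation"))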
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still have only one data point. You may as well leave out unseen words; word2vec cannot do better than that even if you retrain.
If you only want to compute similarity between words, you probably don't need to insert new words into your vocabulary.
At a glance, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability to produce vectors for any words, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in it. This allows to build vectors even for misspelled words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (pretrainedVectors parameter).
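To illustrate the out-of-vocabulary behaviour, here is a sketch using gensim's FastText implementation rather than the fastText CLI (the company-name tokens are made up):

from gensim.models import FastText

# Hypothetical tokenized company names.
names = [["bio", "chemical", "labs"], ["pet", "care", "center"], ["auto", "parts", "depot"]]

model = FastText(names, vector_size=32, window=2, min_count=1, epochs=50, seed=0)

# FastText builds vectors from character n-grams, so even an unseen
# (or misspelled) word gets a vector and can be compared to known words.
print(model.wv.most_similar("petcare", topn=3))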

how to use both word2vec and RNN for NLP together?

I recently studied and understood how word2vec works: it converts words into numerical form, so that when we plot them or put them into vector space they spread out and reveal the relationships between words.
My question: I also came across RNNs and suddenly became confused. Is word2vec an alternative to RNNs, or can I use word2vec to convert the words to numeric form and then use them in RNNs?
I mean, both of them predict the next word, so I want to know whether they are different approaches to the same problem or whether I can use them both together.
NOTE: I finished computer vision and am just starting in NLP, so please don't judge my question; I am just getting started. Thanks in advance.
You have not understood the meaning of word2vec clearly. word2vec is a representation of words in a multi-dimensional space, while an RNN is an algorithm, like linear regression, random forest, or logistic regression. word2vec does NOT predict the next word. Here is a small explanation of word2vec:
Take three words: apple,orange and car. Suppose they are represented in word2vec as:
apple = [0.01, 0.04 ...] orange = [0.02, 0.06 ...] car = [0.03, 0.09 ...]
Now you know apple and orange are similar to each other while car is not. So if you take the dot product of the (normalized) vectors for apple and orange, i.e. their cosine similarity, the result will be close to 1, say 0.85, but if you take the dot product of apple and car, the result will be far from 1, say 0.25. This is the concept of word2vec: it gives you a vector representation of words in numerical form such that similar words are kept near each other in the vector space.
Now for RNN, as I said, it's an algorithm. You will feed some numerical data to it and it'll give you some output. You need to learn RNN in detail from some online tutorial.
To answer your question about how to use them together: an RNN takes numerical inputs. It can't take English words directly, so we need to convert all words into some kind of numerical form. This is where word2vec comes into the picture. You take each word, get its numerical representation from word2vec (like I showed above for apple, orange, and car), and then feed it to the RNN.
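A minimal sketch of wiring the two together (TensorFlow/Keras 2-style API; the vocabulary size and the random embedding matrix below are placeholders for vectors that would really come from a trained word2vec model):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim, max_len = 5000, 100, 40                 # hypothetical sizes
embedding_matrix = np.random.rand(vocab_size, embed_dim)       # stand-in for word2vec vectors

model = Sequential([
    # Frozen embedding layer initialised with the word2vec vectors:
    Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
              input_length=max_len, trainable=False),
    LSTM(64),                       # the RNN consumes the sequence of word vectors
    Dense(1, activation="sigmoid")  # e.g. a binary text classifier on top
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Each input is a sequence of word indices, which the Embedding layer maps to word2vec vectors.
dummy_batch = np.random.randint(0, vocab_size, size=(2, max_len))
print(model.predict(dummy_batch).shape)   # (2, 1)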
This is just a simple overview, and it's not possible to explain everything here. If you really want to learn more, then I would strongly suggest you take this course. Everything from word2vec to RNNs is explained beautifully there. It would be even better if you completed the whole specialisation there instead of only this course.

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) with a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(input_dim=number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one to use, and furthermore how to align the words in my input with the dictionary of words they provide.
Is there a simple way to use these word/char embeddings in Keras, and/or to construct the character/word embedding portion of the model in Keras so that further layers may be added for other NLP tasks?
The Embedding layer only picks up embeddings (rows of the weight matrix) for the integer indices of input words; it does not know anything about the strings. This means you need to first convert your input sequence of words into a sequence of indices, using the same vocabulary as the model you take the embeddings from.
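In other words, something along these lines (a sketch; the word_to_index mapping and the pre-trained matrix here are placeholders for what would really be loaded from the published model's vocabulary and checkpoint files):

import numpy as np
from tensorflow.keras.layers import Embedding

# Assumed to come from the released model: a vocabulary and an embedding matrix,
# where row i of the matrix is the vector for the word with index i.
word_to_index = {"<UNK>": 0, "the": 1, "dog": 2, "barks": 3}   # hypothetical
pre_trained_matrix = np.random.rand(len(word_to_index), 128)   # stand-in

# Convert an input sentence to the indices used by that vocabulary.
sentence = "the dog barks"
indices = [word_to_index.get(w, word_to_index["<UNK>"]) for w in sentence.split()]

embedding_layer = Embedding(input_dim=pre_trained_matrix.shape[0],
                            output_dim=pre_trained_matrix.shape[1],
                            weights=[pre_trained_matrix],
                            trainable=False)
vectors = embedding_layer(np.array([indices]))
print(vectors.shape)   # (1, 3, 128): one sequence of 3 words, each a 128-d vector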
For NLP applications related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are introduced and briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for a simple application such as a SPAM-HAM detector, while TfidfVectorizer gives a deeper insight into how relevant each term (word) is, based on its frequency in the document and the number of documents in which it appears; this results in an interesting metric of how discriminative the terms are. These text feature extractors can be combined with stop-word removal and lemmatization to improve the feature representations.
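A short sketch of both vectorizers on made-up documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["win a free prize now", "meeting moved to friday", "free free prize inside"]

# Raw term counts (bag of words).
bow = CountVectorizer(stop_words="english")
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# Tf-idf down-weights terms that appear in many documents.
tfidf = TfidfVectorizer(stop_words="english")
print(tfidf.fit_transform(docs).toarray().round(2))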

Genres classification of documents

I'm looking for a library, whether it's machine learning or something else, that will help me categorize the content I have. Basically, the content I have is written articles, and I want to know which of them are about politics, sport, and so on, so I can categorize them.
I tried OpenNLP but cannot get it working as I need. Is there anything else that will solve my need?
I guess I need some kind of machine learning with natural language processing (NLP), but I can't find something that will do the job at this point.
This is a naive implementation, but you could improve it further. For classifying a paragraph under a category, first try to extract the unique words from the training data of a particular topic.
For example: use NLTK to extract the unique words from the collection of paragraphs that talk about sports and store them in a set. Then do the same for the other topics and store them in sets. Now subtract the words the sets have in common, so that you are left with the particular unique words that might represent each topic.
Now combine all of those unique words into one list.
When you analyze a paragraph, mark each of those words you find in it as 1; then, when you input a paragraph, the model should give you a one-hot output over the topics.
For example, after analysing your first paragraph, you might get a result like
[0, 0, 1, 0, 1, ..., 1, 0, 0] -> this denotes that the unique word in position 3 was found, and so on.
So your training data will have this as input, with the one-hot-encoded category as output.
I.e., if you have three categories and your first paragraph belongs to the 1st topic, then the outcome will be [1, 0, 0].
Collect many inputs and outcomes to train on, then test with new inputs. You will get the highest probability on the topic the paragraph fits.
You can train it with a basic neural network and a normal softmax loss function, as in the sketch below. This might take you just an hour to do.
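A rough sketch of that idea (a toy vocabulary and toy paragraphs; a small scikit-learn MLP stands in for the basic neural network):

from sklearn.neural_network import MLPClassifier

# Unique words collected per topic, with common words already removed (toy example).
vocab = ["goal", "match", "striker", "election", "parliament", "vote"]

def to_binary_vector(paragraph):
    words = set(paragraph.lower().split())
    return [1 if w in words else 0 for w in vocab]

train_texts = ["the striker scored a late goal in the match",
               "parliament will vote after the election debate"]
train_labels = ["sports", "politics"]

X = [to_binary_vector(t) for t in train_texts]
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, train_labels)

print(clf.predict([to_binary_vector("the vote in parliament was close")]))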
All the best.
I would suggest two methods, and it depends on your data:
First, if you already know how many classes you are going to have in your textual data, e.g. sports vs politics vs science, you can use a supervised learning algorithm (SVM, MLP, LR, ...).
In the second case, where you don't know how many classes you will encounter in your data, it's best to use an unsupervised learning algorithm such as LDA or LSI, which will cluster documents with similar topics; you will then only have to examine a few documents from each cluster by hand and assign a label to each cluster.
As for your data representation, you can use scikit-learn's or Spark's CountVectorizer to create BoW (bag-of-words) vectors to feed to your learning algorithm.
I will just add that it's best (more memory-efficient and faster) to use SciPy sparse vectors if you have a big vocabulary.
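For the supervised case, a minimal sketch with scikit-learn on made-up articles (CountVectorizer already produces SciPy sparse matrices, which LinearSVC handles directly):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

articles = ["the team won the championship final",
            "the government passed a new budget law",
            "the striker signed for a rival club",
            "the senate debated the tax reform bill"]
labels = ["sport", "politics", "sport", "politics"]

model = make_pipeline(CountVectorizer(stop_words="english"), LinearSVC())
model.fit(articles, labels)

print(model.predict(["the club appointed a new coach"]))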
