When tokenizing a text sequence in Keras using the Tokenizer class, we can specify a parameter num_words to consider only the [top] n words in the dataset. My doubts here are:
What does the [top] value mean? Does it refer to the frequency of the word or to some other value like tf-idf?
Is the [top] value computed at the level of each document or over the entire dataset?
Pointers to any good resources, or an explanation with an example, would be very useful.
Here the [top] denotes the frequency of the word over the entire dataset. The tokenizer keeps the num_words words with the highest corpus-wide frequency, in descending order. The doubt I had was that stopwords obviously occur more often than other words, so most of them will make it into the top num_words words; to handle this, we first remove the stopwords and then apply tokenization.
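As a small illustration (the toy texts are made up and the printed values are only indicative), a minimal sketch showing that the ranking is by corpus-wide frequency:

from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus: word frequencies are counted over all documents together.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tokenizer = Tokenizer(num_words=3)   # keep only the highest-frequency words
tokenizer.fit_on_texts(texts)

# word_index is built from the whole corpus and still lists every word;
# the num_words cutoff is applied later, when converting texts to sequences.
print(tokenizer.word_index)                 # e.g. {'the': 1, 'sat': 2, 'on': 3, ...}
print(tokenizer.texts_to_sequences(texts))  # words outside the cutoff are dropped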
I have created a word-level text generator using an LSTM model. But in my case, not every word is suitable to be selected. I want them to match additional conditions:
Each word has a map: if a character is a vowel it is written as 1, otherwise as 0 (for instance, overflow would be 10100010). The generated sentence then needs to meet a given structure, for instance 01001100 (hi is 01 and friend is 001100).
The last vowel of the last word must be the one provided. Let's say it is e (friend would do the job, then).
Thus, to handle this scenario, I've created a pandas dataframe with the following structure:
word last_vowel word_map
----- --------- ----------
hello o 01001
stack a 00100
jhon o 0010
This is my current workflow:
Given the sentence structure, I choose a random word from the dataframe which matches the pattern. For instance, if the sentence structure is 0100100100100, we can choose the word hello, as its vowel structure is 01001.
I subtract the selected word from the remaining structure: 0100100100100 will become 00100100 as we've removed the initial 01001 (hello).
I retrieve all the words from the dataframe whose word_map matches the beginning of the remaining structure; in this case, stack (00100) and jhon (0010). (See the sketch after this list.)
I pass the current sentence content (just hello so far) to the LSTM model, which returns a weight for each candidate word.
But I don't want to simply select the best option; I want to select the best option among the candidates from point 3. So I choose the word with the highest estimate within that list, in this case stack.
Repeat from point 2 until the remaining sentence structure is empty.
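As a rough sketch of steps 1-3 (using the dataframe from the example above; the random choice and the LSTM scoring are omitted here):

import pandas as pd

df = pd.DataFrame({
    "word":       ["hello", "stack", "jhon"],
    "last_vowel": ["o",     "a",     "o"],
    "word_map":   ["01001", "00100", "0010"],
})

def candidates(remaining_structure, words_df):
    """Step 3: words whose vowel map is a prefix of the remaining structure."""
    return words_df[words_df["word_map"].apply(remaining_structure.startswith)]

structure = "0100100100100"
first = candidates(structure, df)                # hello (01001) fits the start
chosen = first.iloc[0]                           # step 1: pick one (randomly in my workflow)
remaining = structure[len(chosen["word_map"]):]  # step 2: "00100100" is left
print(candidates(remaining, df)["word"].tolist())  # step 3: ['stack', 'jhon']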
That works like a charm, but there is one remaining condition to handle: the last vowel of the sentence.
My way to deal with this issue is the following:
Generate 1000 sentences, forcing the last vowel to be the one specified.
Get the RMSE of the weights returned by the LSTM model. The better the output, the higher the weights will be.
Select the sentence with the highest score.
Do you think there is a better approach? Maybe a GAN or reinforcement learning?
EDIT: I think another approach would be adding a WFST (weighted finite-state transducer). I've heard about the pynini library, but I don't know how to apply it to my specific context.
If you are happy with your approach, the easiest way might be to train your LSTM on the reversed sequences, so that it gives the weight of the previous word rather than the next one. In that case, you can use the method you already employ, except that the first subset of words would be the one satisfying the last-vowel constraint. I don't believe this is guaranteed to produce the best result, though.
Now, if that reversal is not possible, or if, after reading my answer further, you find that this doesn't find the best solution, then I suggest using a pathfinding algorithm, similar to reinforcement learning but not statistical, since the weights computed by the trained LSTM are deterministic. What you currently use is essentially a greedy depth-first search which, depending on the LSTM output, might even be optimal: say, if the LSTM gives you a guaranteed monotonic increase in the sum which doesn't vary much between the acceptable consequent words (i.e. the difference between the N-1 and N sequence is much larger than the difference between the different options for the Nth word). In the general case, when there is no clear heuristic to help you, you will have to perform an exhaustive search. If you can come up with an admissible heuristic, you can use A* instead of Dijkstra's algorithm in the first option below, and it will run faster the better your heuristic is.
I suppose it is clear, but just in case: your graph connectivity is defined by your constraint sequence. The initial node (a 0-length sequence with no words) is connected to any word in your dataframe that matches the beginning of your constraint sequence. So you do not have the graph as a data structure, only its compressed description in the form of this constraint.
EDIT
As per the request in the comments, here are additional details. There are a couple of options:
Apply Dijkstra's algorithm multiple times. Dijkstra's search finds the shortest path between 2 known nodes, while in your case we only have the initial node (0-length sequence with no words) and the final words are unknown.
Find all acceptable last words (those that satisfy both the pattern and vowel constraints).
Apply Dijkstra's search for each one of those, finding the largest word sequence weight sum for each of them.
Dijkstra's algorithm is tailored to the searching of the shortest path, so to apply it directly you will have to negate the weights on each step and pick the smallest one of those that haven't been visited yet.
After finding all solutions (sentences that end with one of the last words you identified initially), select the smallest one (which corresponds exactly to the largest weight sum among all solutions).
Modify your existing depth-first search to do an exhaustive search (a sketch follows these steps).
Perform the search as you described in the OP and, if the last step yields a solution (i.e. a last word with the correct vowel is available at all), record its weight.
Roll back one step to the previous word and pick the second-best option among the previous words. You might be able to discard all words of the same length at the previous step if there was no solution at all. If there was a solution, it depends on whether your LSTM provides different weights depending on the previous word. It likely does, and in that case you have to perform that operation for all the words in the previous step.
When you run out of words at the previous step, move one more step up and restart from there.
Keep track of the current winner at all times, as well as the list of unvisited nodes at every step, and perform the exhaustive search. Eventually you will find the best solution.
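A minimal sketch of that exhaustive backtracking search; here words is a list of dicts with the word/word_map/last_vowel fields from the question's dataframe, and lstm_score is a hypothetical placeholder for the weight the trained LSTM assigns to a candidate given the sentence so far:

def exhaustive_search(structure, target_vowel, words, lstm_score):
    """Return (best_total_weight, best_sentence) over all sentences that tile
    `structure` with word maps and whose last word has the required last vowel."""
    best = (float("-inf"), None)

    def recurse(remaining, sentence, total):
        nonlocal best
        if not remaining:  # structure fully covered: check the final-vowel constraint
            if sentence and sentence[-1]["last_vowel"] == target_vowel and total > best[0]:
                best = (total, [w["word"] for w in sentence])
            return
        for w in words:
            if remaining.startswith(w["word_map"]):
                recurse(remaining[len(w["word_map"]):],
                        sentence + [w],
                        total + lstm_score(sentence, w))

    recurse(structure, [], 0.0)
    return best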
I would reach for a Beam Search here.
This is much like your current approach of starting 1000 solutions randomly. But instead of expanding each of those paths independently, it expands all candidate solutions together, step by step.
Sticking with the current candidate count of 1000, it would look like this:
Generate 1000 stub solutions, for example using random starting points or selected from some "sentence start" model.
For each candidate, compute the best extensions based on your LSTM language model which fit the constraints. This works just as it does in your current approach, except you could also try more than one option. For example, using the best 5 choices for the next word would produce 5000 child candidates.
Compute a score for each of those partial solution candidates, then reduce back to 1000 candidates by keeping only the best scoring options.
Repeat steps 2 and 3 until all candidates cover the full vowel sequence, including the end constraint.
Take the best scoring of these 1000 solutions.
You can play with the candidate scoring to trade off completed or longer solutions against very good but short fits.
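A rough sketch of that loop, under the same assumptions as the previous sketch: words is the list of word dicts, and score_extension is a hypothetical hook returning the LSTM-based score of adding a word to a partial sentence. The end-of-sentence vowel constraint can be enforced inside the extension step once the remaining structure runs out.

import heapq

def beam_search(structure, words, score_extension, beam_width=1000, branch=5):
    # Each candidate is (score, chosen_words, remaining_structure).
    beam = [(0.0, [], structure)]
    while any(remaining for _, _, remaining in beam):
        children = []
        for score, chosen, remaining in beam:
            if not remaining:                  # already complete, carry it forward
                children.append((score, chosen, remaining))
                continue
            fits = [w for w in words if remaining.startswith(w["word_map"])]
            # step 2: keep only the `branch` best extensions per candidate
            best_fits = heapq.nlargest(branch, fits,
                                       key=lambda w: score_extension(chosen, w))
            for w in best_fits:
                children.append((score + score_extension(chosen, w),
                                 chosen + [w["word"]],
                                 remaining[len(w["word_map"]):]))
        if not children:                       # every candidate hit a dead end
            break
        # step 3: keep only the best `beam_width` partial solutions
        beam = heapq.nlargest(beam_width, children, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])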
I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names, and the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision. Words such as BIOCHEMICAL then ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stemmed word).
The average company name is made up of 3 words, with the following frequency:
The tokens that are the result of preprocessing are sent to word2vec:
#window: Maximum distance between the current and predicted word within a sentence
#min_count: Ignores all words with total frequency lower than this.
#workers: Use these many worker threads to train the model
#sg: The training algorithm, either CBOW (0) or skip-gram (1). Default is 0.
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
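The avg_sentence_vector helper isn't shown in the question; a common implementation of such a helper (an assumption on my part, not necessarily the asker's exact code) averages the vectors of the in-vocabulary tokens:

import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    # Average the word2vec vectors of the tokens that are in the vocabulary.
    feature_vec = np.zeros(num_features, dtype="float32")
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec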
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), then I proceed to create the model again, calculate the average sentence vector, and save it again.
I have found that this model is not working as expected. As an example, querying the most similar words to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset I have words such as paws or petcare, but other, unrelated words are the ones forming relationships with the word pet.
This is the distribution of the nearest words for pet:
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and the surrounding words was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the length of the dataset enough to allow word2vec "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where, for instance, the word pet is correctly placed among similar words?
Is it feasible to implement another alternative such as fasttext considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
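For example, something along these lines (parameter names follow gensim 3.x, matching the question's code; in gensim 4.x size and iter become vector_size and epochs, and the exact values are guesses to tune, not recommendations):

from gensim.models import Word2Vec

# prepWords: the tokenized company names from the question.
# Much smaller vectors and many more passes than the defaults (size=100, iter=5).
word2vec_model = Word2Vec(prepWords, size=32, window=2, min_count=1,
                          workers=7, sg=1, iter=200)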
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some word's influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison (edit distances, counts of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically similar words; a sketch follows this list.
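A small sketch of that last idea using only the standard library (the example names are made up):

from difflib import SequenceMatcher

def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_jaccard(a, b, n=3):
    # Share of character n-grams the two names have in common.
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga | gb) else 0.0

def edit_similarity(a, b):
    # Normalized similarity from difflib's matching-block ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(ngram_jaccard("petcare", "petshop"))
print(edit_similarity("biochemical", "bio chemic"))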
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names; these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point. You may as well leave out unseen words, word2vec cannot work better than that even if you retrain.
If you only want to compute similarity between words, you probably don't need to insert new words into your vocabulary.
By eye, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (the pretrainedVectors parameter).
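A minimal gensim sketch of the unsupervised variant (an assumption about the setup: prepWords is the tokenized corpus from the question, and parameter names follow gensim 3.x): thanks to the subword n-grams, even a name absent from training gets a vector.

from gensim.models import FastText

# Subword n-grams let the model build vectors for unseen or misspelled names.
ft_model = FastText(prepWords, size=100, window=2, min_count=1, sg=1, iter=50)

vec = ft_model.wv["petcare"]                       # works even if never seen in training
print(ft_model.wv.most_similar("petcare", topn=5))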
In a bag-of-words model, I know we should remove stopwords and punctuation before training. But in an RNN model, if I want to do text classification, should I remove stopwords too?
This depends on what your model classifies. If you're doing something in which the classification is aided by stop words (some level of syntax understanding, for instance), then you need to either leave in the stop words or alter your stop list so that you don't lose that information. For instance, cutting out all verbs of being (is, are, should be, ...) can mess up an NN that depends somewhat on sentence structure.
However, if your classification is topic-based (as suggested by your bag-of-words reference), then treat the input the same way: remove those pesky stop words before they burn valuable training time.
Do not remove stopwords, as they add new information (context awareness) to the sentence (viz., text summarization, machine/language translation, language modeling, question answering).
Remove stopwords if we want only the general idea of the sentence (viz., sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document classification); a minimal removal sketch follows.
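If you do decide to remove them, a small NLTK sketch (the example sentence is made up; the stopword list must be downloaded once):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")                 # one-time download of the stopword list
stop_words = set(stopwords.words("english"))

tokens = "this is an example sentence for the classifier".split()
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)                            # e.g. ['example', 'sentence', 'classifier']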
I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of having the above behavior it gives percentages of 0.006, 0.5 and 0.8
So the MLP does what I would expect a Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?
I am not familiar with the internals of WEKA, so please correct me if you think I am not right.
When using a text as a "feature", the text is transformed into a vector of binary values. Each value corresponds to one concrete word. The length of the vector is equal to the size of the dictionary.
If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD
then, for example, the text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. Unless "L" is present in the dictionary (and therefore present in the examples), its probability is logically 0. Actually, it should not even be considered a feature.
Actually, you cannot reason over the probabilities of the features like that, and you cannot add them together; I think there is no such relationship between the features.
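In Python terms, the transformation described above is roughly:

dictionary = ["LCD", "VHS", "HELLO", "WORLD"]
text = "HELLO LCD"

tokens = set(text.split())
binary_vector = [1 if word in tokens else 0 for word in dictionary]
print(binary_vector)   # [1, 0, 1, 0]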
Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf-idf. In the case of tf-idf, for example, character counts are converted to a logarithmic scale, and characters that appear in every single instance may be penalized using idf normalization.
I am not sure which options you are using to convert your data into Weka features, but you can see here that Weka has parameters that can be set for such weighting and normalization options:
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij),
where fij is the frequency of word i in the jth document (instance).
-I
Transform each word frequency into:
fij*log(num of documents / num of documents containing word i),
where fij is the frequency of word i in the jth document (instance).
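As a rough illustration of what those two options compute (a simplified sketch with made-up counts, not Weka's exact implementation):

import math

def weighted_count(f_ij, num_docs, docs_with_word, use_T=True, use_I=True):
    # -T style: dampen raw counts; -I style: scale by an idf factor.
    w = math.log(1 + f_ij) if use_T else f_ij
    if use_I:
        w *= math.log(num_docs / docs_with_word)
    return w

# A character that appears in every single instance gets idf = log(N/N) = 0.
print(weighted_count(f_ij=3, num_docs=100, docs_with_word=100))  # 0.0
print(weighted_count(f_ij=3, num_docs=100, docs_with_word=5))    # > 0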
I checked the Weka documentation and I didn't see support for extracting letters as features. This implies the Weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter words, which would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.
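For example, a tiny preprocessing step of that kind might look like:

def explode_to_characters(text):
    # 'LCD Screen' -> 'L C D S c r e e n', so each letter becomes its own token.
    return " ".join(ch for ch in text if not ch.isspace())

print(explode_to_characters("LCD Screen"))   # 'L C D S c r e e n'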
I want to classify sentences with Weka. My features are the sentence terms (words) and a Part-of-Speech tag for each term. I don't know how to set up the attributes, because if each term is presented as one feature, the number of features becomes different for each instance (sentence). And if all the words in a sentence are presented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
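A small sketch of that last suggestion, tagging with NLTK and building fixed-width bag-of-words features with scikit-learn (the sentences are made up, and the required NLTK models have to be downloaded first):

import nltk
from nltk import pos_tag, word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def to_word_pos_tokens(sentence):
    # Return tokens like 'word_TAG'; the exact tags depend on the NLTK tagger.
    return [f"{word}_{tag}" for word, tag in pos_tag(word_tokenize(sentence))]

sentences = ["Dogs bark loudly", "The cat sleeps"]
vectorizer = CountVectorizer(analyzer=to_word_pos_tokens)   # one feature per word_TAG
X = vectorizer.fit_transform(sentences)                     # same width for every sentence
print(vectorizer.get_feature_names_out())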