Concept of Bucketing in Seq2Seq model - machine-learning

To handle sequences of different lengths we use bucketing and padding. In bucketing we create several buckets, each with its own maximum length, to reduce the amount of padding, and after creating the buckets we train a different model on each bucket.
This is what I have found so far. What I don't understand is how all these different models are trained and how they are used to translate a new sentence.

At both training and inference time, the algorithm needs to pick the network that is best suited for the current input sentence (or batch). Usually, it simply takes the smallest bucket whose input size is greater than or equal to the sentence length.
For example, suppose there are just two buckets [10, 16] and [20, 32]: the first one takes any input up to length 10 (padded to exactly 10) and outputs the translated sentence up to length 16 (padded to 16). Likewise, the second bucket handles inputs of length 11 to 20. The two networks corresponding to these buckets are trained on non-intersecting input sets.
Then, for a sentence of length 8, it's better to select the first bucket. Note that if this is a test sentence, the second bucket could handle it as well, but its network was trained on longer sentences, from 11 to 20 words, so it is unlikely to handle this sentence well. The network corresponding to the first bucket was trained on inputs of length 1 to 10, and hence is the better choice.
You may be in trouble if a test sentence has length 25, longer than any available bucket. There's no universal solution here; the best course of action is to trim the input to 20 words and try to translate it anyway.
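For concreteness, here is a minimal sketch of this bucket-selection logic (the bucket sizes and the PAD token are just illustrative and not tied to any particular framework):

BUCKETS = [(10, 16), (20, 32)]   # (max input length, max output length)
PAD = "<pad>"

def pick_bucket(tokens):
    """Return the smallest bucket whose input size fits the sentence, or None."""
    for in_len, out_len in BUCKETS:
        if len(tokens) <= in_len:
            return (in_len, out_len)
    return None   # longer than any bucket: truncate the input or give up

def pad_to_bucket(tokens, bucket):
    in_len, _ = bucket
    return tokens + [PAD] * (in_len - len(tokens))

sentence = "the cat sat on the mat".split()   # length 6 -> first bucket
bucket = pick_bucket(sentence)                # (10, 16)
padded = pad_to_bucket(sentence, bucket)      # 6 tokens followed by 4 pads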

Related

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that the user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I got rid of stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision, so words such as BIOCHEMICAL ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stemmed root).
The average company name is made up of 3 words.
The tokens that are the result of preprocessing are sent to word2vec:
#window: Maximum distance between the current and predicted word within a sentence
#min_count: Ignores all words with total frequency lower than this.
#workers: Use these many worker threads to train the model
#sg: The training algorithm: CBOW (0) or skip-gram (1). Default is 0.
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocab, the average sentence vector is calculated for each company name:
df['avg_vector']=df2.apply(lambda row : avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
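The avg_sentence_vector helper is not shown in the post; a common implementation (an assumption on my part, not necessarily the author's exact code) simply averages the vectors of the in-vocabulary tokens:

import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    """Average the word2vec vectors of the in-vocabulary tokens of one name.
    Returns a zero vector if none of the tokens are in the vocabulary."""
    feature_vec = np.zeros(num_features, dtype="float32")
    n_words = 0
    for word in words:
        if word in index2word_set:
            feature_vec += model.wv[word]
            n_words += 1
    if n_words > 0:
        feature_vec /= n_words
    return feature_vec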
Then, the vector is saved for further lookups:
##Saving name and vector values in file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not included in the vocab after preprocessing (removing stop words and punctuation), I rebuild the model, recalculate the average sentence vectors, and save them again.
I have found that this model is not working as expected. As an example, asking for the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
In the dataset I have words such as paws or petcare, but it is other, unrelated words that end up closest to pet.
This is the distribution of the nearest words for pet:
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocab, but the similarity between pet and its neighbouring words was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
This is the distribution of the nearest words:
I would like to get your advice about the following:
Is this dataset appropriate to proceed with this model?
Is the dataset large enough to allow word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where, for instance, the word pet is correctly placed among similar words?
Is it feasible to implement another alternative such as fasttext considering the nature of the current dataset?
Do you know any public dataset that can be used along with the current dataset to create those relationships?
Thanks
3500 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
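For example, a sketch of such an adjusted training call, keeping the gensim 3.x parameter names used in the question (in gensim 4+ these are vector_size and epochs), with a toy stand-in for prepWords:

from gensim.models import Word2Vec

prepWords = [["bio", "chemic", "lab"], ["pet", "care", "clinic"]]   # toy stand-in for the real token lists

word2vec_model = Word2Vec(
    prepWords,
    size=50,       # far smaller than the default 100 (and the 300 used above)
    window=2,
    min_count=1,
    workers=7,
    sg=1,
    iter=200,      # far more training passes than the default 5
)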
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word-tokens not seen during training. An average-of-many-word-vectors can often work as an easy baseline for comparing multiword texts, but might also dilute some word's influence compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison (edit distances, counts of shared character n-grams) may be helpful too, though it won't detect all synonyms/semantically-similar words.
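As a rough illustration of the last two ideas, the sketch below shows a Word Mover's Distance query (assuming the trained gensim model from above, plus the pyemd/POT dependency WMD needs) next to a plain standard-library string comparison:

from difflib import SequenceMatcher

# 1) Word Mover's Distance between two short names, using a trained gensim model
#    (assumed to be the word2vec_model built earlier; needs pyemd or POT installed):
# dist = word2vec_model.wv.wmdistance("pet care".split(), "animal clinic".split())

# 2) Plain lexicographic similarity, standard library only:
def char_similarity(a, b):
    """Rough string similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(char_similarity("petcare", "pet care"))      # high: nearly identical strings
print(char_similarity("biochemical", "biotech"))   # moderate: shared prefix only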
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only use company names, and these are too short. That is likely why your embeddings perform so poorly: too little data and too short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough; you still only have one data point for it. You may as well leave out unseen words; word2vec cannot do better than that even if you retrain.
If you only want to compute similarity between words, probably you don't need to insert new words in your vocabulary.
By eye, I think you can also use FastText without the need to stem the words. It also computes vectors for unknown words.
From FastText FAQ:
One of the key features of fastText word representation is its ability
to produce vectors for any words, even made-up ones. Indeed, fastText
word vectors are built from vectors of substrings of characters
contained in it. This allows to build vectors even for misspelled
words or concatenation of words.
FastText seems to be useful for your purpose.
For your task, you can follow FastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (the pretrainedVectors parameter).
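As a quick sketch of why FastText helps with unseen words, here is gensim's FastText on a toy corpus (gensim 4+ parameter names; made-up data, not the real company-name dataset):

from gensim.models import FastText

names = [["pet", "care", "clinic"], ["bio", "chemical", "labs"], ["auto", "parts"]]

ft = FastText(sentences=names, vector_size=50, window=2, min_count=1, epochs=100)

# Vectors are built from character n-grams, so even an unseen token like "petcare"
# gets a (rough) vector and can be compared with in-vocabulary words:
vec = ft.wv["petcare"]
print(ft.wv.similarity("petcare", "pet"))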

Data Encoding for Training in Neural Network

I have converted 349,900 words from a dictionary file to MD5 hashes. A sample is below:
74b87337454200d4d33f80c4663dc5e5
594f803b380a41396ed63dca39503542
0b4e7a0e5fe84ad35fb5f95b9ceeac79
5d793fc5b00a2348c3fb9ab59e5ca98a
3dbe00a167653a1aaee01d93e77e730e
ffc32e9606a34d09fca5d82e3448f71f
2fa9f0700f68f32d2d520302906e65ce
1c9b32ff1b53bd892b87578a11cbd333
26a10043bba821303408ebce568a2746
c3c32ff3481e9745e10defa7ce5b511e
I want to train a neural network to decrypt a hash using just a simple architecture like a multilayer perceptron. Since every hash value has length 32, I was thinking that the number of input nodes should be 32, but the problem is the number of output nodes. Since the outputs are words in the dictionary, they don't have any specific length; they can be of various lengths. That is why I'm confused about how many output nodes I should have.
How should I encode my data so that I can have a fixed number of output nodes?
I have found a paper in this link that decrypts a hash using a neural network. The paper says:
The input to the neural network is the encrypted text that is to be decoded. This is fed into the neural network either in bipolar or binary format. This then traverses through the hidden layer to the final output layer which is also in the bipolar or binary format (as given in the input). This is then converted back to the plain text for further process.
How would I implement what is being said in the paper? I am thinking of limiting the number of characters to decrypt. Initially, I can limit it to 4 characters only (just for test purposes).
My input layer will have 32 nodes, one for each character of the hash. Each input node will receive the ASCII value of its hash character divided by 256. My output layer will also have 32 nodes, representing a binary format. Since 8 bits/8 nodes represent one character, my network will only be able to decrypt words of up to 4 characters, because 32/8 = 4. (I can increase this if I want to.) For the hidden layer I'm planning to use 33 nodes. Is my network architecture feasible: 32 x 33 x 32? If not, why? Please guide me.
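For concreteness, a small Python sketch of the encoding just described (purely illustrative; as the answers below explain, the network will not actually be able to invert the hash):

import hashlib

def encode_hash(md5_hex):
    """32 inputs: the ASCII value of each hex character, scaled into [0, 1)."""
    return [ord(c) / 256.0 for c in md5_hex]

def encode_word(word):
    """32 outputs: 8 bits per character, so at most 4 characters (zero-padded)."""
    padded = word.ljust(4, "\0")[:4]
    bits = []
    for ch in padded:
        bits.extend(int(b) for b in format(ord(ch), "08b"))
    return bits

word = "test"
md5_hex = hashlib.md5(word.encode()).hexdigest()
x = encode_hash(md5_hex)   # 32 floats -> network input
y = encode_word(word)      # 32 bits   -> network target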
You could map the words in the dictionary into a vector space (e.g. bag of words, word2vec, ...). In that case the words are encoded with a fixed length, and the number of neurons in the output layer will match that length.
There's a great discussion about the possibility of cracking SHA256 hashes using neural networks in another Stack Exchange forum: https://security.stackexchange.com/questions/135211/can-a-neural-network-crack-hashing-algorithms
The accepted answer was that:
No.
Neural networks are pattern matchers. They're very good pattern
matchers, but pattern matchers just the same. No more advanced than
the biological brains they are intended to mimic. More thorough, more
tireless, but not more sophisticated.
The patterns have to be there to be found. There has to be a bias in
the data to tease out. But cryptographic hashes are explicitly and
extremely carefully designed to eliminate any bias in the output. No
one bit is more likely than any other, no one output is more likely to
correlate to any given input. If such a correlation were possible, the
hash would be considered "broken" and a new algorithm would take its
place.
Flaws in hash functions have been found before, but never with the aid
of a neural network. Instead it's been with the careful application of
certain mathematical principles.
The following answer also makes a funny comparison:
SHA256 has an output space of 2^256, and an input space that's
essentially infinite. For reference, the time since the big bang is
estimated to be 5 billion years, which is about 1.577 x 10^27
nanoseconds, which is about 2^90 ns. So assuming each training
iteration takes 1 ns, you would need 2^166 ages of the universe to
train your neural net.

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So the scores form plain R^n vectors, and each dimension individually tends to have a higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
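For concreteness, the naive baseline just described might look like this (thresholds are made up for illustration):

import math

SCORE_THRESHOLD = 0.05    # made-up value: below this, answer 'none'
MARGIN_THRESHOLD = 0.05   # made-up value: top two too close, answer 'none'

def pick_candidate(score_vectors):
    """Return the index of the best candidate by product-of-scores, or None."""
    meta = [math.prod(v) for v in score_vectors]
    ranked = sorted(range(len(meta)), key=lambda i: meta[i], reverse=True)
    best = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else None
    if meta[best] < SCORE_THRESHOLD:
        return None
    if runner_up is not None and meta[best] - meta[runner_up] < MARGIN_THRESHOLD:
        return None
    return best

print(pick_candidate([[0.2, 0.45, 1.37], [5.9, 0.02, 2.0]]))   # -> 1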
So for each trial, my input is a set of these score vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing, where a pool of candidates is interviewed by a few people who might have differing opinions but in general each tends to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the index of the actual correct candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each cell representing one candidate). Note that here your number of candidates must be fixed.
You can use your entire set of candidate vectors as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate; in case of no correct candidate the output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest 0.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
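For illustration, a minimal sketch of this setup (assuming Keras, a fixed pool of 5 candidates, 3 scoring functions, and random stand-in data):

import numpy as np
from tensorflow import keras

N_CANDIDATES, N_SCORES = 5, 3   # made-up sizes

model = keras.Sequential([
    keras.Input(shape=(N_CANDIDATES * N_SCORES,)),           # concatenated score vectors
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(N_CANDIDATES, activation="sigmoid"),  # one sigmoid per candidate
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Targets: one-hot rows such as 00100 (candidate 3 correct), or all zeros for 'none'.
X = np.random.rand(200, N_CANDIDATES * N_SCORES)             # random stand-in data
y = keras.utils.to_categorical(np.random.randint(N_CANDIDATES, size=200),
                               num_classes=N_CANDIDATES)
model.fit(X, y, epochs=3, verbose=0)

probs = model.predict(X[:1])[0]   # take the candidate with the highest sigmoid output,
best = int(np.argmax(probs))      # or answer 'none' if every probability is low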
I hope my answer was able to help you.:)

Debugging Neural Network for (Natural Language) Tagging

I've been coding a neural network for recognizing and tagging parts of speech in English (written in Java). The code itself has no 'errors' or apparent flaws. Nevertheless, it is not learning: training it more does not change its ability to predict the testing data. The following is information about what I've done; please ask me to update this post if I left something important out.
I wrote the neural network and tested it on several different problems to make sure that the network itself worked. I trained it to learn how to double numbers, XOR, cube numbers, and learn the sin function to a decent accuracy. So, I'm fairly confident that the actual algorithm is working.
The network uses the sigmoid activation function. The learning rate is 0.3 and momentum is 0.6. The weights are initialized to (rng.nextFloat() - .5) * 4.
I then got the Brown Corpus data-set and simplified the tagset to 'universal' with NLTK. I used NLTK for generating and saving all the corpus and dictionary data. I cut the last 15,000 sentences out of the corpus for testing purposes. I used the rest of the corpus (about 40,000 sentences of tagged words) for training.
The neural network layout is as follows. Input layer: there is a group of input neurons, one per tag, for each of 3 words: first, the word coming before the word we want to tag; second, the word that needs to be tagged; third, the word that follows the second word. So the total number of inputs is 3 x (total number of possible tags), and the input values are numbers between 0 and 1. Output layer: there is one output neuron for each tag. Each of the 3 words being fed into the input layer is looked up in a dictionary (built from the 40,000-sentence corpus, the same corpus used for training). The dictionary holds the number of times each word has been tagged in the corpus as each part of speech.
For instance, the word 'cover' is tagged as a noun 1 time and a verb 3 times.
The percentage of times the word received each tag is computed for every part of speech the word is associated with, and this is what is fed into the network for that particular word. So the input neuron designated as NOUN would receive .25 and VERB would receive .75. The other input neurons that hold tags for that word receive an input of 0.0. This is done for each of the 3 words to be input. If a word is the first word of a sentence, the first group of tags is all 0; if a word is the last word of a sentence, the final group of input neurons, which holds the tag probabilities for the following word, is left as 0s.
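Here is a small sketch of that input encoding in Python (the question's code is Java; the tag set and counts are made up for illustration):

TAGS = ["NOUN", "VERB", "ADJ", "ADV", "DET"]

# dictionary: word -> number of times it received each tag in the training corpus
tag_counts = {"cover": {"NOUN": 1, "VERB": 3}}

def word_to_tag_probs(word):
    """Vector of P(tag | word) from corpus counts; all zeros for unknown/boundary words."""
    counts = tag_counts.get(word, {})
    total = sum(counts.values())
    return [counts.get(t, 0) / total if total else 0.0 for t in TAGS]

def encode_window(prev_word, word, next_word):
    """Concatenate the three per-word probability vectors: 3 * len(TAGS) inputs."""
    return (word_to_tag_probs(prev_word)
            + word_to_tag_probs(word)
            + word_to_tag_probs(next_word))

x = encode_window(None, "cover", None)   # boundary positions contribute all zeros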
I've been using 10 hidden nodes (I've read a number of papers and this seems to be a good place to start testing with)
None of the 15,000 testing sentences were used to make the 'dictionary.' So, when testing the network with this partial corpus there will be some words the network has never seen. Words that are not recognized have their suffix stripped, and their suffix is searched for in another 'dictionary.' Whatever is most probable for that word is then used as inputs for it.
This is my set-up, and I started trying to train the network. I've been training the network with all 40,000 sentences. 1 epoch = 1 forward and backpropagation of every word in each sentence of the 40,000 training-set. So, just doing 1 epoch takes quite a few seconds. Just by knowing the word probabilities the network did pretty well, but the more I train it, nothing happens. The numbers that follow the epochs are the number of correctly tagged words divided by the total number of words.
First run 50 epochs: 0.928218786
100 epochs: 0.933130661
500 epochs: 0.928614499 (took around 30 minutes to train)
Tried 10 epochs: 0.928953683
Using only 1 epoch had results that pretty much varied between .92 and .93
So, it doesn't appear to be working...
I then took 55 sentences from the corpus and used the same dictionary that had probabilities for all 40,000 words. For this one, I trained it in the same way I trained my XOR -- I only used those 55 sentences and I only tested the trained network weights on those 55 sentences. The network was able to learn those 55 sentences quite easily. With 120 epochs (taking a couple seconds) the network went from tagging 3768 incorrectly and 56 correctly (on the first few epochs) to tagging 3772 correctly and 52 incorrectly on the 120th epoch.
This is where I'm at, I've been trying to debug this for over a day now, and haven't figured anything out.

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents each; in each set, ~10 documents are predictive of a binary result and roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag-of-words model. The class proportions are irrelevant. Each word gets mapped to a unique index; make the value at that index equal to the number of times that token occurs (there are smarter things to do). The number of features/dimensions is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here. But it works well enough for many things.
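A minimal bag-of-words sketch with scikit-learn's CountVectorizer and toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())   # the unique tokens = the feature space
print(X.toarray())                          # value = how often each token occurs
# (older scikit-learn versions use get_feature_names() instead)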
I would want to approach it as a two stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 101, in document 3 it will be in position 201, etc., and there will be no way to know that those features are all related. Furthermore, an alternate presentation of the order of the documents would change the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid presentations of the document order will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that's not afflicted with this problem, which might ultimately be necessary, but it's not the thing I'd start with).
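A rough sketch of the two-stage idea (hypothetical helper names; it assumes you can label individual documents as relevant or not for stage 1):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer()
relevance_clf = LogisticRegression()

def fit_stage1(docs, doc_relevance_labels):
    """Stage 1: probabilistic relevance of each individual document."""
    X = vectorizer.fit_transform(docs)
    relevance_clf.fit(X, doc_relevance_labels)

def set_features(docs):
    """Stage 2 features summarising one set of ~1000 documents."""
    probs = relevance_clf.predict_proba(vectorizer.transform(docs))[:, 1]
    return [probs.max(),               # probability of the most relevant document
            np.percentile(probs, 99),  # 99th percentile of those probabilities
            probs.var()]               # spread of the probabilities

# Stage 2 classifier: any model on the per-set feature vectors and binary set labels,
# e.g. LogisticRegression().fit([set_features(s) for s in sets], set_labels)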
