Inputs and Outputs for Neural Turing Machine Copy Task - machine-learning

I'm trying to understand the copy task from the NTM paper by Graves et al.
I have experience with language modeling using LSTMs: there the network is usually fed a sequence of words, one word at a time, and the output at each timestep is the predicted next word.
For the copy task of NTMs, however, the output seems to be delayed (which is the whole point, I guess):
Source: https://blog.wtf.sg/2014/11/11/neural-turing-machines-copy-task/
So how exactly does this work in code during training? Are the target output vectors for the first half of the sequence, and the input vectors for the second half, just zero vectors, with the network expected to output zero vectors for the first half and then the correct sequence during the second half?
That part has me confused.
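For what it's worth, here is a minimal NumPy sketch of how the copy-task tensors are usually laid out in public implementations (my own construction, not the paper's or the linked blog's code): the input holds the random bit vectors, then a delimiter flag, then zeros; the target holds zeros during the presentation phase and the original sequence during the recall phase; and the loss is typically computed only over the recall half.

```python
import numpy as np

def make_copy_example(seq_len=5, num_bits=8):
    """Build one (input, target) pair for the NTM copy task.

    Layout (one common convention, not necessarily the paper's exact code):
      inputs : [random bit vectors | delimiter flag | zeros]
      targets: [zeros              | zero           | the same bit vectors]
    The loss is usually computed only over the second half (the recall phase).
    """
    seq = np.random.randint(0, 2, size=(seq_len, num_bits)).astype(np.float32)

    # One extra input channel is reserved for the end-of-sequence delimiter.
    inputs = np.zeros((2 * seq_len + 1, num_bits + 1), dtype=np.float32)
    targets = np.zeros((2 * seq_len + 1, num_bits), dtype=np.float32)

    inputs[:seq_len, :num_bits] = seq   # presentation phase
    inputs[seq_len, num_bits] = 1.0     # delimiter flag
    targets[seq_len + 1:, :] = seq      # recall phase: reproduce the sequence

    return inputs, targets
```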

Related

What is used to train a self-attention mechanism?

I've been trying to understand self-attention, but nothing I've found explains the concept well at a high level.
Let's say we use self-attention in a NLP task, so our input is a sentence.
Then self-attention can be used to measure how "important" each word in the sentence is for every other word.
The problem is that I do not understand how that "importance" is measured. Important for what?
What exactly is the goal vector the weights in the self-attention algorithm are trained against?
Connecting language with its underlying meaning is called grounding. A sentence like “The ball is on the table” results in an image which can be reproduced with multimodal learning. Multimodal means that different kinds of words are available, for example events, action words, subjects and so on. A self-attention mechanism maps input vectors to output vectors, with a neural network between them. The output vector of the neural network refers to the grounded situation.
Let us make a short example. We need a 300x200 pixel image, a sentence in natural language, and a parser. The parser works in both directions: it can convert text to an image, meaning the sentence “The ball is on the table” gets converted into the 300x200 image, and it can also parse a given image and extract the natural-language sentence back. Self-attention learning is a bootstrapping technique to learn and use this grounded relationship, that is, to verify existing language models, to learn new ones, and to predict future system states.
This question is old now but I came across it so I figured I should update others as my own understanding has increased.
Attention simply refers to some operation that takes the output and combines it with some other information. Typically this just happens by taking the dot product of the output with some other vector so it can "attend" to it in some way.
Self-attention combines the output with other parts of the input (hence the "self" part). Again, the combination usually occurs via the dot product between the vectors.
Finally how is attention (or self-attention) trained?
Let's call Z our output, W our weight matrix and X our input (we'll use # as matrix multiplication symbol).
Z = X^T # W^T # X
In NLP we will compare Z to whatever we want the resulting output to be; in machine translation, for example, that is the sentence in the other language. We can compare the two with the average cross-entropy loss over each predicted word. Finally, we can update W with backpropagation.
How do we see what is important? We can look at the magnitudes of Z to see after the attention what words were most "attended" to.
This is a slightly simplified example as it only has one weight matrix and typically the inputs are embedded but I think it still highlights some of the necessary details concerning attention.
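To make the toy formula above concrete, here is a minimal NumPy sketch of single-weight-matrix self-attention (my own illustration; real transformer layers use separate query/key/value projections, scaling, and multiple heads):

```python
import numpy as np

def toy_self_attention(X, W):
    """Single-weight-matrix self-attention, matching Z = X^T # W^T # X above.

    X: (d, n) matrix whose columns are the n input word vectors (dimension d).
    W: (d, d) learned weight matrix.
    Returns the (n, n) attention scores and the re-weighted word vectors.
    """
    scores = X.T @ W.T @ X                             # (n, n): how much word i attends to word j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the "attended" axis
    attended = X @ weights.T                           # (d, n): each word as a mixture of all words
    return scores, attended

# Usage: 4 words embedded in 8 dimensions, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
W = rng.normal(size=(8, 8))
scores, attended = toy_self_attention(X, W)
```

Looking at the rows of `scores` (or the softmaxed `weights`) is the "what was attended to" inspection mentioned above.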
Here is a useful resource with visualizations for more information about attention.
Here is another resource with visualizations for more about attention in transformers, specifically self-attention.

LSTM: choice of batch size when State reused for generation later

I am building an LSTM model that generates symbols step-by-step. The task is to train the model up to some point of the data sequence and then to use the trained model to process the remaining pieces of the sequence in the test phase -- these remaining pieces weren't seen during Training.
For this task, I am attempting to re-use the latest state from the Training phase for the subsequent Prediction phase (i.e. not to start predicting with clean zero-state, but to sort-of continue where things were left off during training).
In this context, I am wondering how to best choose the Batch size for training.
My Training data is one long sequence of time-ordered observations. If that sequence is chopped up into N batches for Training, then my understanding is that the State tensor will be of shape [N, Network_Size] during Training, and [1, Network_Size] during Prediction. So for Prediction, I simply take the last element of the [N, Network_Size] tensor, which is of shape [1, Network_Size].
That seems to work in terms of mechanics, but this means that the value of N determines how many observations that last vector of the original State has seen during Training.
Is there a best practice for determining how to choose N? The network trains much faster with a larger batch size, but I am concerned that this way the last part of the State tensor may not have seen enough. Obviously I'm trying various combinations, but I'm curious how others have dealt with it.
Also, I have seen a few examples where parameters like this (or Cell size, etc.) are set as powers of two (e.g. 64, 128). Is there any theoretical reason behind that vs. simply 50/100/etc., or is it just a quirky choice?
First, for your last question: for computers, powers of two are simpler than powers of 10 (memory sizes and alignment constraints, for example, are likelier to be powers of two).
It is unclear from your question what you mean by training: updating parameters, or just computing RNN forward steps. Updating parameters in that setting doesn't make much sense, because for RNNs (including LSTMs) you'd ideally update parameters only after seeing an entire batch of sequences (and you usually need many updates before the model is at all reasonable). The forward-step concern doesn't make much sense to me either, because the state for each example is independent of the batch size (ignoring any batch normalization you might be doing).
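As an illustration of that last point, here is a small PyTorch sketch (my own, with made-up sizes) of slicing out the last element of a batched hidden state and reusing it for single-sequence prediction, as described in the question:

```python
import torch
import torch.nn as nn

hidden_size, num_layers, batch_n = 128, 1, 32
lstm = nn.LSTM(input_size=10, hidden_size=hidden_size,
               num_layers=num_layers, batch_first=True)

# Training-style forward pass over N chunks of the long sequence.
train_chunks = torch.randn(batch_n, 50, 10)   # (N, time, features)
_, (h_n, c_n) = lstm(train_chunks)            # h_n, c_n: (num_layers, N, hidden)

# Keep only the state of the last chunk for the prediction phase.
h_last = h_n[:, -1:, :].contiguous()          # (num_layers, 1, hidden)
c_last = c_n[:, -1:, :].contiguous()

test_chunk = torch.randn(1, 50, 10)
out, _ = lstm(test_chunk, (h_last, c_last))   # continue from where training left off
```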

Can a recurrent neural network learn slightly different sequences at once?

Can a recurrent neural network be used to learn a sequence with slightly different variations? For example, could I train an RNN so that it can produce a sequence of consecutive integers, or of alternate integers, given enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs: a common pre-processing step is to normalise each input column to mean 0 and standard deviation 1. While it is possible for a network to accept a larger range of numbers as inputs, that will reduce the effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variation of sequences.
An RNN will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train on 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for regression) or a 50% chance of 4 or 5 (for a classifier) when shown the sequence 1,2,3 and asked to predict the next item. A small sketch of preparing such training data follows this list.
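As a small illustration of the scaling and ambiguity points above, here is a sketch (hypothetical setup, not from the question) of building training pairs with mixed step sizes and standardising the inputs:

```python
import numpy as np

def make_sequences(step, start_values, length=4):
    """Arithmetic sequences such as 1,2,3,4 (step=1) or 1,3,5,7 (step=2)."""
    return np.array([[s + step * i for i in range(length)] for s in start_values])

# Mix step-1 and step-2 sequences; the first 3 numbers are the input, the 4th is the target.
data = np.vstack([make_sequences(1, range(1, 21)), make_sequences(2, range(1, 21))])
X, y = data[:, :3].astype(float), data[:, 3].astype(float)

# Standardise inputs (roughly zero mean, unit variance), as suggested above.
mu, sigma = X.mean(), X.std()
X_scaled = (X - mu) / sigma

# A step-1 prefix like 1,2,3 can never collide with a step-2 prefix like 1,3,5,
# so the targets here are unambiguous. If 1,2,3,4 and 1,2,3,5 both occurred,
# a regression model would average the two targets and predict 4.5.
```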

LSTM neural network for a chemical process?

I have the following dataset for a chemical process in a refinery. It consists of 5x5 input blocks, where each 5-dimensional input vector is sampled every minute. The output is the result of the whole process and is sampled every 5 minutes.
I concluded that the output (yellow) depends strongly on the recent history of input vectors. I recently started looking at LSTMs and am trying to learn a bit about them in Python and Torch.
However, I don't have any idea how I should prepare my dataset so that my LSTM can process it and produce future predictions when tested with new input vectors.
Is there a straightforward way to preprocess my dataset accordingly?
EDIT1: I found this great blog post about training LSTMs for natural language processing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/. Long story short, an LSTM takes a character as input and tries to generate the next character. Eventually, it can be trained on Shakespeare poems to generate new Shakespeare-like poems! GPU acceleration is recommended, though.
EDIT2: Based on EDIT1, the best way to format my dataset seems to be to export my Excel sheet to a text file with TAB-separated columns. I'll post the results of the LSTM predictions on the dataset above as soon as possible.
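For what it's worth, here is a minimal sketch (my own, with made-up file and column names) of turning a tab-separated export like the one described in EDIT2 into (5-minute input window, single output) training pairs that an LSTM can consume:

```python
import numpy as np
import pandas as pd

# Hypothetical file: 5 input columns sampled every minute, 1 output column
# that is assumed to have a value only every 5th row (the 5-minute process result).
df = pd.read_csv("process_data.txt", sep="\t")
inputs = df[["in1", "in2", "in3", "in4", "in5"]].to_numpy(dtype=np.float32)
outputs = df["output"].to_numpy(dtype=np.float32)

X, y = [], []
for t in range(4, len(df), 5):        # every 5th minute has a measured output
    X.append(inputs[t - 4 : t + 1])   # the 5 most recent 5-dim input vectors (5x5 window)
    y.append(outputs[t])

X = np.stack(X)                       # shape: (num_samples, 5 timesteps, 5 features)
y = np.array(y)                       # shape: (num_samples,)
# X and y can now be fed to an LSTM (e.g. torch.nn.LSTM with batch_first=True).
```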

Debugging Neural Network for (Natural Language) Tagging

I've been coding a neural network for recognizing and tagging parts of speech in English (written in Java). The code itself has no 'errors' or apparent flaws. Nevertheless, it is not learning: no matter how much I train it, its ability to predict the testing data does not change. The following is information about what I've done; please ask me to update this post if I've left something important out.
I wrote the neural network and tested it on several different problems to make sure that the network itself worked. I trained it to learn how to double numbers, XOR, cube numbers, and learn the sin function to a decent accuracy. So, I'm fairly confident that the actual algorithm is working.
The network uses the sigmoid activation function. The learning rate is 0.3 and the momentum is 0.6. The weights are initialized to (rng.nextFloat() - .5) * 4.
I then got the Brown Corpus data-set and simplified the tagset to 'universal' with NLTK. I used NLTK for generating and saving all the corpus and dictionary data. I cut the last 15,000 sentences out of the corpus for testing purposes. I used the rest of the corpus (about 40,000 sentences of tagged words) for training.
The neural network layout is as follows. Input layer: the network takes inputs for 3 words -- first, the word coming before the word we want to tag; second, the word that needs to be tagged; third, the word that follows the second word -- with one input neuron per tag for each of those 3 positions, so the total number of inputs is 3 x (total number of possible tags). Output layer: there is one output neuron for each tag. The input values are numbers between 0 and 1. Each of the 3 input words is looked up in a dictionary (built from the same 40,000-sentence corpus used for training), which holds the number of times each word has been tagged in the corpus as each part of speech.
For instance, the word 'cover' is tagged as a noun 1 time and as a verb 3 times. Percentages are computed for each part of speech the word has been tagged as, and these are what is fed into the network for that particular word. So the input neuron designated NOUN would receive 0.25 and VERB would receive 0.75, while the other input neurons holding tags for that word receive 0.0. This is done for each of the 3 input words. If a word is the first word of a sentence, the first group of tag inputs is all 0s; if a word is the last word of a sentence, the final group of input neurons (which hold the tag probabilities for the following word) is left as 0s.
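To make the encoding concrete, here is a small Python sketch (illustrative only; the original implementation is in Java, and the tag list and names here are my own) of turning a word's tag counts into the per-tag input values described above:

```python
# NLTK's 'universal' tagset, as used for the simplified Brown corpus tags.
TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "NUM", "CONJ", "PRT", ".", "X"]

def tag_probability_vector(word, tag_counts):
    """Build one input group of len(TAGS) values in [0, 1] for a single word.

    tag_counts: dict mapping word -> {tag: count}, built from the training corpus.
    Unknown (or out-of-sentence) words yield an all-zero group, as described above.
    """
    counts = tag_counts.get(word, {})
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TAGS)
    return [counts.get(tag, 0) / total for tag in TAGS]

# 'cover' seen 1x as NOUN and 3x as VERB -> NOUN input 0.25, VERB input 0.75.
example = tag_probability_vector("cover", {"cover": {"NOUN": 1, "VERB": 3}})
```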
I've been using 10 hidden nodes (I've read a number of papers and this seems to be a good place to start testing with)
None of the 15,000 testing sentences were used to build the 'dictionary', so when testing the network on this held-out part of the corpus there will be some words the network has never seen. Words that are not recognized have their suffix stripped, and the suffix is looked up in another 'dictionary'; whatever is most probable is then used as the inputs for that word.
This is my set-up, and I started trying to train the network. I've been training with all 40,000 sentences; 1 epoch = one forward and backpropagation pass for every word in each sentence of the 40,000-sentence training set, so just doing 1 epoch takes quite a few seconds. Just by knowing the word probabilities the network did pretty well, but the more I train it, nothing improves. The numbers that follow the epoch counts are the number of correctly tagged words divided by the total number of words.
First run, 50 epochs: 0.928218786
100 epochs: 0.933130661
500 epochs: 0.928614499 (took around 30 minutes to train)
10 epochs: 0.928953683
With only 1 epoch, the results varied between roughly 0.92 and 0.93.
So, it doesn't appear to be working...
I then took 55 sentences from the corpus and used the same dictionary (the one built from the full 40,000-sentence training corpus). For this experiment, I trained the network the same way I trained my XOR test: I used only those 55 sentences and only tested the trained weights on those same 55 sentences. The network was able to learn them quite easily. Over 120 epochs (taking a couple of seconds) it went from tagging 3768 words incorrectly and 56 correctly (on the first few epochs) to tagging 3772 correctly and 52 incorrectly on the 120th epoch.
This is where I'm at, I've been trying to debug this for over a day now, and haven't figured anything out.
