Label alignment in RNN Transducer training - machine-learning

I am trying to understand how an RNN Transducer is trained with ground-truth labels. In the case of CTC, I know that the model is trained with a loss function that sums the scores of all possible alignments of the ground-truth labels.
But in RNN-T, the prediction network receives the output from the previous step as input, similar to the "teacher-forcing" method. My doubt is: should the ground-truth labels be converted into all possible alignments with the blank label, with each alignment fed to the network by teacher forcing?

RNN-T has a transcription network (analogous to an acoustic model), a prediction network (a language model) and a joint network (or joint function, depending on the implementation) that combines the outputs of the prediction network and the transcription network.
During training, you process each utterance as follows:
1. Propagate all T acoustic frames through the transcription network and store the outputs (the transcription network hidden states).
2. Propagate the ground-truth label sequence, of length U, through the prediction network, passing in an all-zero vector at the beginning of the sequence. Note that you do not need to worry about blank states at this point.
3. Propagate all T*U combinations of transcription and prediction network hidden states through the joint network, whether that be a simple sum and exponential as per Graves (2012) or a feed-forward network as per the more recent Google ASR publications (e.g. He et al., 2019).
The T*U outputs from the joint network can be viewed as a grid, as per Figure 1 of Graves 2012. The loss function can then be efficiently realised using the forward-backward algorithm (Section 2.4, Graves 2012). Only horizontal (consuming acoustic frames) and vertical (consuming labels) transitions are permitted. Stepping from t to t+1 is analogous to the blank state in CTC, whilst non-blank symbols are output when making vertical transitions, i.e. from output label u to u+1. Note that you can consume multiple time frames without outputting a non-blank symbol (as per CTC), but you can also output multiple labels without advancing through t.
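To make the recursion concrete, here is a minimal numpy sketch of the forward pass over that T x (U+1) lattice. The function and array names are my own, not from the paper, and a real implementation would also need the backward pass, batching and more numerical care:

```python
import numpy as np

def rnnt_log_likelihood(log_blank, log_label):
    """Forward recursion over the T x (U+1) RNN-T lattice (cf. Graves 2012, Sec. 2.4).

    log_blank[t, u]: log-prob of blank at node (t, u)   -> horizontal move to (t+1, u)
    log_label[t, u]: log-prob of label y[u+1] at (t, u) -> vertical move to (t, u+1)
    Both arrays have shape (T, U+1) and come from the joint network outputs.
    """
    T, U1 = log_blank.shape
    alpha = np.full((T, U1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:  # arrived horizontally: blank emitted at (t-1, u)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrived vertically: label y[u] emitted at (t, u-1)
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t, u - 1] + log_label[t, u - 1])
    # finish at node (T-1, U) and emit one final blank to terminate
    return alpha[-1, -1] + log_blank[-1, -1]
```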
To more directly answer your question, note that only non-blank outputs are passed back to the input of the prediction network, and that the transcription and prediction networks are operating asynchronously.
References:
Sequence Transduction with Recurrent Neural Networks, Graves 2012
Streaming End-to-end Speech Recognition For Mobile Devices, He et al. 2019

Related

Is it better to make neural network to have hierarchical output?

I'm quite new to neural networks and I recently built a neural network for number classification on vehicle license plates. It has 3 layers: an input layer for 16*24 (384 neurons) number images at 150 dpi, a hidden layer (199 neurons) with a sigmoid activation function, and a softmax output layer (10 neurons), one for each number 0 to 9.
I'm trying to expand my neural network to also classify the letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it will be hard for the network to separate the features of each class. I also think it might cause a problem where the input is a digit but the network wrongly classifies it as a letter with the biggest probability, even though the sum of the probabilities over all digit outputs exceeds it.
So I wonder if it is possible to build a hierarchical neural network in the following manner:
There are 3 neural networks: 'Item', 'Number' and 'Letter'.
The 'Item' network classifies whether the input is a number or a letter.
If the 'Item' network classifies the input as a number (letter), the input then goes through the 'Number' ('Letter') network.
Return the final output from the 'Number' ('Letter') network.
And the learning mechanism for each network is as follows:
The 'Item' network learns from all images of numbers and letters, so it has 2 outputs.
The 'Number' ('Letter') network learns from images of numbers (letters) only.
Which method should I pick for better classification: simply add 10 more classes, or build hierarchical neural networks as above?
I'd strongly recommend training a single neural network with outputs for all the kinds of images you want to detect (so one output node per letter and one output node per digit you want to be able to recognize).
The main reason is that recognizing digits and recognizing letters is essentially the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between the hidden and output layers, it may learn to recognize combinations of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recognized as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (that vertical line which may indicate an L may also indicate a 1 when combined with other shapes). So, there are useful things to learn that are relevant for both ''tasks'', and it will probably be able to learn these things more easily if it can learn them all in the same network.
See also this answer I gave to a related question in the past.
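If it helps to see the shape of it, here is a minimal PyTorch sketch of the single-network option for the 20 classes from your example (10 digits + 10 letters). The layer sizes are simply taken from your description, and the data below is dummy data:

```python
import torch
import torch.nn as nn

# One shared network for all 20 classes (10 digits + 10 letters). The hidden
# layer can learn strokes/curves that are useful for both groups of classes.
model = nn.Sequential(
    nn.Linear(16 * 24, 199),  # 384 input pixels -> 199 hidden units
    nn.Sigmoid(),
    nn.Linear(199, 20),       # one logit per digit/letter class
)
loss_fn = nn.CrossEntropyLoss()   # applies softmax + negative log-likelihood

x = torch.rand(8, 16 * 24)        # dummy batch of 8 flattened images
y = torch.randint(0, 20, (8,))    # dummy class labels
loss = loss_fn(model(x), y)
loss.backward()                   # gradients for a single shared model
```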
I'm trying to expand my neural network to also classify the letters on license plates. But I'm worried that if I simply add more classes to the output, for example 10 letters for a total of 20 classes, it will be hard for the network to separate the features of each class.
You're far from where it becomes problematic. ImageNet has 1000 classes and is commonly handled by a single network; see the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of Convolutional Neural Network Architectures", and while you're at it, see chapter 4 for hierarchical classification. You can read the summary for ... well, a summary of it.

Can a recurrent neural network learn slightly different sequences at once?

Can a recurrent neural network be used to learn a sequence with slightly different variations? For example, could I get an RNN trained so that it could produce a sequence of consecutive integers or alternate integers if I have enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs - a common pre-processing step is to normalise each input column to mean 0 and standard deviation 1. Whilst it is possible for a network to accept a larger range of numbers as inputs, that will reduce the effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variation of sequences.
For an RNN, it will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train on 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for regression) or a 50% chance of 4 or 5 (for a classifier) when it is shown the sequence 1,2,3 and asked to predict. The toy sketch below illustrates the basic setup.
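As a minimal illustration of the points above (scaled inputs, regression on the next value), here is a toy PyTorch sketch that trains a small RNN on both step-1 and step-2 sequences. All sizes and constants are arbitrary choices of mine:

```python
import torch
import torch.nn as nn

# Toy setup: predict the fourth number from the first three, for both step-1
# and step-2 sequences. Inputs are scaled (here /10) because networks train
# poorly on raw, unbounded integers.
seqs = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6],
        [1, 3, 5, 7], [2, 4, 6, 8], [3, 5, 7, 9]]
x = torch.tensor([[[v / 10.0] for v in s[:3]] for s in seqs])  # shape (6, 3, 1)
y = torch.tensor([[s[3] / 10.0] for s in seqs])                # shape (6, 1)

rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

for _ in range(2000):
    out, _ = rnn(x)                  # out: (6, 3, 16), one state per time step
    pred = head(out[:, -1, :])       # regress the next value from the last state
    loss = ((pred - y) ** 2).mean()  # ambiguous prefixes push this to a compromise
    opt.zero_grad()
    loss.backward()
    opt.step()

print(head(rnn(x)[0][:, -1, :]) * 10)  # rescaled predictions on the training set
```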

Where do dimensions in Word2Vec come from?

I am using the word2vec model to train a neural network and build a neural embedding for finding similar words in the vector space. But my question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, as here: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all we have to build a matrix with dimensions 4x2, 4x3, 4x4, etc. I know the first dimension of the matrix is the size of our vocabulary |v|. But what about the second dimension of the matrix (the number of the vector's dimensions)? For example, if this is the vector for the word "book": [0.3, 0.01, 0.04], what are these numbers? Do they have any meaning? E.g., is 0.3 related to the relation between the words "book" and "paper" in the vocabulary, 0.01 to the relation between "book" and "notebook", etc.?
Just like in TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning: it's a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
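To make the two matrices concrete, here is a small numpy sketch for the 4-word example from the question. The uniform initialisation range mirrors a common word2vec convention, but treat all specific numbers here as assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 4, 3   # vocabulary {book, paper, notebook, novel}; D is a free hyper-parameter

# Word and context matrices, randomly initialised. No single column of W_in has
# a pre-assigned meaning; the D dimensions only become meaningful jointly, after training.
W_in = rng.uniform(-0.5 / D, 0.5 / D, size=(V, D))    # one D-dim vector per word
W_out = rng.uniform(-0.5 / D, 0.5 / D, size=(D, V))   # context (output) matrix

book = W_in[0]                                 # e.g. [0.3, 0.01, 0.04] after training
scores = book @ W_out                          # unnormalised score per context word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(probs)
```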
I have wondered the same thing and put in a vector like (1 0 0 0 0 0...) to see what terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning, but were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

Artificial Neural Network with unbalanced weights

I have been reading the concept of ANN for applying it on my project (credit card fraud detection). Given a set of inputs to the network, say:
A1 - Time to input PIN
A2 - Amount to be withdrawn
A3 - ATM location
A4 - Global behavior (Time & date, and sequence in performing a transaction)
The more any of these inputs deviates from the "norm", the greater the weight of that input to the network. Here comes my question: how does the neural network treat a situation where one input's weight, say A1, is high whilst all the other weights are low?
The input probability density functions combine to form a multidimensional probability distribution (usually an ellipsoid in that many dimensions). The combination of the inputs is a vector, and the probability value at that point in the N-space tells you how likely it is to be real or fake. This works along each of the axes, where all but one input would be zero, as well as out where all the variables have significant values. If all of your inputs have smooth gaussian probability distributions your resulting probability distribution is a hyperellipsoid and you don't really need a neural net.
Using a neural net gets economical when you have a complicated probability density in one or more of the variables, or if combining the variables creates unexpected features (holes and bumps) in the probability density. Then the training of the neural net over a large number of real input combinations and known results tells it what regions of the input space are interesting and what regions are mundane. Again, you could just map them yourself in a big N-dimensional array with high resolution, if you have enough memory, but where's the fun in that? The neural net will also interpolate smoothly between regions, which may make its decisions more fuzzy than the actual probability space (i.e., that's where the accuracy metric drops below 100%).
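For a concrete (hypothetical) version of the "probability value at that point in N-space" idea, here is a numpy sketch that scores a transaction by its Mahalanobis distance under a Gaussian fitted to normal behaviour. The data and threshold are invented purely for illustration:

```python
import numpy as np

# Fit a Gaussian to "normal" transactions over the four inputs A1..A4 and
# flag low-density (high-distance) points as suspicious.
rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 4))        # stand-in for historical normal behaviour
mu = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def mahalanobis_sq(x):
    d = x - mu
    return d @ cov_inv @ d                 # grows when any combination of inputs deviates

tx = np.array([3.5, 0.1, -0.2, 0.0])       # A1 far from the norm, the rest typical
print("suspicious" if mahalanobis_sq(tx) > 9.0 else "ok")  # threshold is arbitrary
```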

Tic Tac Toe Neural Network as Evaluation Function

I've been trying to program an AI for tic tac toe using a multilayer perceptron and backpropagation. My idea was to train the neural network to be an accurate evaluation function for board states, but the problem is even after analyzing thousands of games, the network does not output accurate evaluations.
I'm using 27 input neurons; each square on the 3x3 board is associated with three input neurons that receive values of 0 or 1 depending on whether the square has an x, o or is blank. These 27 input neurons send signals to 10 hidden neurons (I chose 10 arbitrarily, but I have tried with 5 and 15 as well).
For training, I've had the program generate a series of games by playing against itself using the current evaluation function to select what are deemed optimal moves for each side. After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence. I think this is what Gerald Tesauro did when programming TD-Gammon, but I might have misinterpreted the article. (note: I put the specific mechanism for updating weights at the bottom of this post).
I have tried various values for the learning rate, as well as varying numbers of hidden neurons, but nothing seems to work. Even after hours of "learning," there is no discernible improvement in strategy and the evaluation function is not anywhere close to accurate.
I realize that there are much easier ways to program tic tac toe, but I want to do it with a multilayer perceptron so that I may apply it to connect 4 later on. Is this even possible? I'm starting to think that there is no reliable evaluation function for a tic tac toe board with a reasonable amount of hidden neurons.
I assure you that I am not looking for some quick code to turn in for a homework assignment. I've been working unsuccessfully for a while now and would just like to know what I'm doing wrong. All advice is appreciated.
This is the specific mechanism I used for the NN:
Each of the 27 input neurons receives a 0 or 1, which passes through the differentiable sigmoid function 1/(1+e^(-x)). Each input neuron i sends this output (i.output), multiplied by some weight (i.weights[h]) to each hidden neuron h. The sum of these values is taken as input by the hidden neuron h (h.input), and this input passes through the sigmoid to form the output for each hidden neuron (h.output). I denote the lastInput to be the sum of (h.output * h.weight) across all of the hidden neurons. The outputted value of the board is then sigmoid(lastInput).
I denote the learning rate by alpha, and err is the correct output minus the actual output. Also, I let dSigmoid(x) equal the derivative of the sigmoid at the point x.
The weight of each hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.output) and the weight of the signal from a given input neuron i to a given hidden neuron h is incremented by the value: (alpha*err*dSigmoid(lastInput)*h.weight*dSigmoid(h.input)*i.output).
I got these formulas from this lecture on backpropagation: http://www.youtube.com/watch?v=UnWL2w7Fuo8
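In case it helps, here is roughly how the mechanism above reads in numpy (a transcription of the description, not my actual code; note that it also squashes the 0/1 inputs through the sigmoid, exactly as described):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # derivative of the sigmoid at pre-activation x

rng = np.random.default_rng(0)
W_ih = rng.normal(scale=0.1, size=(27, 10))   # i.weights[h], input -> hidden
w_ho = rng.normal(scale=0.1, size=10)         # h.weight, hidden -> output

def train_step(board, target, alpha=0.1):
    i_out = sigmoid(board)            # the 0/1 inputs are squashed too, as described
    h_in = i_out @ W_ih               # h.input for each hidden neuron
    h_out = sigmoid(h_in)             # h.output
    last_input = h_out @ w_ho         # sum of h.output * h.weight
    value = sigmoid(last_input)       # the network's evaluation of the board
    err = target - value              # correct output minus actual output
    # the two update rules, computed before either weight set is modified
    d_ho = alpha * err * dsigmoid(last_input) * h_out
    d_ih = alpha * err * dsigmoid(last_input) * (w_ho * dsigmoid(h_in)) * i_out[:, None]
    w_ho += d_ho
    W_ih += d_ih
    return value

train_step(np.zeros(27), 0.5)         # one update on an empty-board example
```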
Tic tac toe has 3^9 = 19683 states (actually, some of them aren't legal, but the order of magnitude is right). The output function isn't smooth, so I think the best a backpropagation network can do is "rote learning" a look-up table for all these states.
With that in mind, 10 hidden neurons seems very small, and there's no way you can train 20k different look-up-table entries by teaching a few thousand games. For that, the network would have to "extrapolate" from states it has been taught to states it has never seen, and I don't see how it could do that.
You might want to consider more than one hidden layer, as well as upping the size of the hidden layer. For comparison purposes, Fogel and Chellapilla used two layers of 40 and 10 neurons to program up a checkers player, so if you need something more than that, something is probably going terribly wrong.
You might also want to use bias inputs, if you're not already.
Your basic methodology seems sound, although I'm not 100% sure what you mean by this:
After generating a game, the NN compiles training examples (which comprise a board state and the correct output) by taking the correct output for a given board state to be the value (using the evaluation function) of the board state that follows it in the game sequence.
I think you mean that you're using some known-good method (like a minimax game tree) to determine the "correct" answers for the training examples. Can you explain that a little bit? Or, if I'm correct, it seems like there's a subtlety to deal with, in terms of symmetric boards, which might have more than one equally good best response. If you're only treating one of those as correct, that might lead to problems. (Or it might not, I'm not sure.)
Just to throw in another thought: have you thought about using reinforcement learning for this task? It would be much easier to implement and much more effective. For example, you could use Q-learning, which is often used for games.
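A minimal tabular sketch of what that could look like (states as board tuples, actions as empty-square indices; the environment loop and reward scheme are left out, and the hyper-parameters are arbitrary):

```python
import random
from collections import defaultdict

# Tabular Q-learning skeleton for tic tac toe.
Q = defaultdict(float)                 # Q[(state, action)], defaults to 0.0
alpha, gamma, eps = 0.1, 0.9, 0.1      # learning rate, discount, exploration rate

def choose(state, actions):
    if random.random() < eps:                          # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])   # exploit

def update(state, action, reward, next_state, next_actions):
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```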
Here you can find an implementation for training a neural network on Tic Tac Toe (variable board size) using self-play. The gradient is back-propagated through the whole game, employing a simple gradient-copy trick.