Meaning of parameters in RNNlib - machine-learning

I'm new to recurrent neural networks and am confused by the parameters in RNNLib. Specifically, I don't understand the hidden Block, hidden size, input Block, subsample size and the settings related to mdl. In my experience, I have only had input vectors, one LSTM hidden layer and a softmax output layer. Why does the block look like a matrix?

RNNLib implements a novel type of RNN, the so-called "multidimensional recurrent neural network" (MDRNN). The following reference on the RNNLib page explains it: Alex Graves, Santiago Fernández and Jürgen Schmidhuber. Multidimensional Recurrent Neural Networks. International Conference on Artificial Neural Networks, September 2007, Porto. This extension is designed for processing images, video and so on. As explained in the paper:
"The basic idea of MDRNNs is to replace the single recurrent connection found in standard
RNNs with as many recurrent connections as there are dimensions in the data.
During the forward pass, at each point in the data sequence, the hidden layer of the network
receives both an external input and its own activations from one step back along
all dimensions"
I think that is why you are able to use multidimensional input. If you want to use RNNLib as an ordinary one-dimensional RNN, just specify one dimension for the input and the LSTM block.
MDL stands for "minimum description length", a cost function used to approximate Bayesian inference (a method for regularizing NNs). If you want to use it, it's best to read the original references provided on the RNNLib website. Otherwise, I think it can simply be ignored.
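To make the quoted idea concrete, here is a minimal sketch of a two-dimensional forward pass. It is not RNNLib's implementation: it assumes a single scalar hidden unit with a plain tanh activation instead of an LSTM block, and the function name and weight names are made up for illustration.

```python
import math

def mdrnn_forward_2d(x, w_in, w_rec_rows, w_rec_cols, bias=0.0):
    """Forward pass of a toy 2D RNN with one tanh hidden unit (illustrative only).

    x is a 2D grid (list of rows) of scalar inputs. The hidden activation at
    (i, j) depends on x[i][j] and on the activations one step back along
    each dimension: h[i-1][j] and h[i][j-1].
    """
    rows, cols = len(x), len(x[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            prev_row = h[i - 1][j] if i > 0 else 0.0  # one step back along dimension 0
            prev_col = h[i][j - 1] if j > 0 else 0.0  # one step back along dimension 1
            h[i][j] = math.tanh(
                w_in * x[i][j] + w_rec_rows * prev_row + w_rec_cols * prev_col + bias
            )
    return h

# A 2x3 "image" of made-up values.
grid = [[1.0, 0.5, -0.5],
        [0.0, 1.0, 0.25]]
hidden = mdrnn_forward_2d(grid, w_in=0.5, w_rec_rows=0.3, w_rec_cols=0.7)
```

If the grid has a single row, the row-wise recurrent term is always zero and the loop reduces to an ordinary one-dimensional RNN over the columns, which matches the advice above about specifying only one dimension.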

Related

Use Feed Forward Neural Networks instead of LSTM?

Can an LSTM problem be expressed as an FFNN one?
LSTM neural networks simply look in the past. But I can also take some (or many) past values and use them as input features of a FFNN.
In this way, could FFNN replace LSTM Networks? Why should I prefer LSTM over FFNN if I can take past values and use them as input features?
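An LSTM is a recurrent network rather than a feed-forward one, although unrolled through time it can be viewed as a feed-forward network with shared weights, memory cells and recurrent connections. The LSTM architecture was designed to mitigate the vanishing gradient problem (exploding gradients are usually handled separately, for example by gradient clipping), which lets it learn long-term dependencies. You can certainly use an FFNN by feeding past values into the input layer of a valid architecture, but that gives the network only a fixed-size window over the past, so it is not a replacement for an LSTM.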
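The "past values as input features" idea from the question can be sketched as a sliding window over the series. This is only an illustration; the helper name make_windows and the toy series are assumptions, not part of any library.

```python
def make_windows(series, window):
    """Turn a 1D series into (features, target) pairs for a feed-forward model.

    Each feature vector holds `window` consecutive past values and the target
    is the value that immediately follows them.
    """
    if window < 1 or window >= len(series):
        raise ValueError("window must be in [1, len(series) - 1]")
    features, targets = [], []
    for t in range(window, len(series)):
        features.append(list(series[t - window:t]))
        targets.append(series[t])
    return features, targets

X, y = make_windows([1, 2, 3, 4, 5, 6], window=3)
# X == [[1, 2, 3], [2, 3, 4], [3, 4, 5]] and y == [4, 5, 6]
```

The window length has to be chosen in advance, and anything older than the window is invisible to the FFNN, whereas a recurrent network can in principle carry information forward in its state.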

What do non-linear activation functions do at a fundamental level in neural networks?

I've been trying to find out what exactly non-linear activation functions do when implemented in a neural network.
I know they modify the output of a neuron, but how and for what purpose?
I know they add non-linearity to otherwise linear neural networks, but for what purpose?
What exactly do they do to the output of each layer? Is it some form of classification?
I want to know what exactly their purpose is within neural networks.
Wikipedia says that "the activation function of a node defines the output of that node given an input or set of inputs." This article states that the activation function checks whether a neuron has "fired" or not. I've looked at a bunch more articles and other questions on Stack Overflow as well, but none of them gave a satisfying answer as to what is occurring.
The main reason for using non-linear activation functions is to be able to learn non-linear target functions, i.e. to learn a non-linear relationship between the inputs and outputs. If a network consists only of linear activation functions, it can only model a linear relationship between the inputs and outputs, which is insufficient for almost all applications.
I am by no means an ML expert, so maybe this video can explain it better: https://www.coursera.org/lecture/neural-networks-deep-learning/why-do-you-need-non-linear-activation-functions-OASKH
Hope this helps!
First of all, it helps to have a clear idea of why we use activation functions.
We use activation functions to propagate the output of one layer's nodes to the next layer. Activation functions are scalar-to-scalar functions, and we apply them to the hidden neurons of a neural network to introduce non-linearity into the network's model. Put simply, activation functions are used to introduce non-linearity into the network.
So what is the use of introducing non-linearity? First, non-linearity means that the output cannot be reproduced from a linear combination of the inputs. Without a non-linear activation function, a neural network would therefore still behave like a single-layer perceptron even if it had hundreds of hidden layers. The reason is that a composition of linear functions is itself linear, so however you stack the layers, the output is still a linear function of the input.
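The collapse of stacked linear layers can be checked numerically. The following is a small sketch with made-up weights and no biases (with biases the stack collapses to a single affine map in the same way); the helper names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]     # 3 hidden units -> 2 outputs
x = [0.5, -2.0]

# Two linear layers applied one after the other ...
two_layers = matvec(W2, matvec(W1, x))
# ... give the same result as one layer whose weight matrix is W2 * W1.
one_layer = matvec(matmul(W2, W1), x)

# With a tanh between the layers, the result no longer matches that single matrix.
with_tanh = matvec(W2, [math.tanh(z) for z in matvec(W1, x)])
```

Here two_layers and one_layer agree, while with_tanh differs from both.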
For a deeper understanding, I suggest looking at this Medium post as well as this video by Andrew Ng himself.
Here are some important parts from Andrew Ng's video:
...if you don't have an activation function, then no matter how many
layers your neural network has, all it's doing is just computing a
linear activation function. So you might as well not have any hidden
layers.
...it turns out that if you have a linear activation function here and
a sigmoid function here, then this model is no more expressive than
standard logistic regression without any hidden layer.
...so unless
you throw a non-linear in there, then you're not computing more
interesting functions even as you go deeper in the network.

How does Fine-tuning Word Embeddings work?

I've been reading some papers on NLP with deep learning and found that fine-tuning seems to be a simple yet confusing concept. The same question has been asked here, but the answer is still not quite clear to me.
Fine-tuning pre-trained word embeddings to task-specific word embeddings as mentioned in papers like Y. Kim, “Convolutional Neural Networks for Sentence Classification,” and K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” had only a brief mention without getting into any details.
My question is:
Word embeddings generated with word2vec or GloVe are used as pre-trained word vectors, i.e. as input features (X), for downstream tasks like parsing or sentiment analysis. In other words, those input vectors are plugged into a new neural network model for some specific task, and while training this new model we can somehow get updated, task-specific word embeddings.
But as far as I know, what back-propagation does during training is update the weights (W) of the model; it does not change the input features (X). So how exactly do the original word embeddings get fine-tuned, and where do these fine-tuned vectors come from?
Yes, if you feed the embedding vector as your input, you can't fine-tune the embeddings (at least not easily). However, all the frameworks provide some sort of EmbeddingLayer that takes as input an integer, the class ordinal of the word/character/other input token, and performs an embedding lookup. Such an embedding layer is very similar to a fully connected layer that is fed a one-hot encoded class, but it is far more efficient, as it only needs to fetch or change one row of the matrix on the forward and backward passes. More importantly, it allows the weights of the embedding to be learned.
So the classic way is to feed the actual classes to the network instead of embeddings, and to prepend to the network an embedding layer that is initialized with word2vec / GloVe vectors and continues learning the weights. It might also be reasonable to freeze the embeddings for several iterations at the beginning, until the rest of the network starts doing something reasonable with them, before you start fine-tuning them.
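The equivalence between an embedding lookup and a fully connected layer fed a one-hot vector, and the fact that only one row changes on the backward pass, can be sketched in plain Python. The table values, gradient and function names below are made up for illustration; real frameworks do the same thing with their own embedding layers.

```python
def one_hot(index, size):
    """One-hot encode a class ordinal."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_via_matmul(one_hot_vec, table):
    """Fully connected layer without bias: one_hot_vec times table."""
    dim = len(table[0])
    return [sum(one_hot_vec[r] * table[r][c] for r in range(len(table)))
            for c in range(dim)]

def embed_via_lookup(index, table):
    """Embedding layer: just fetch the row for this token."""
    return list(table[index])

def sgd_update_row(table, index, grad_wrt_embedding, lr):
    """Backward step: only the looked-up row receives a gradient update."""
    table[index] = [w - lr * g for w, g in zip(table[index], grad_wrt_embedding)]

# Hypothetical "pre-trained" vectors for a 3-word vocabulary (made-up numbers).
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
word_id = 1

# Both routes produce the same embedding vector.
same = embed_via_lookup(word_id, table) == embed_via_matmul(one_hot(word_id, 3), table)

# Fine-tuning: a gradient arriving from the task loss moves row 1 only.
sgd_update_row(table, word_id, grad_wrt_embedding=[1.0, -1.0], lr=0.1)
```

After the update, rows 0 and 2 are unchanged and row 1 holds the fine-tuned vector, which is where the task-specific embeddings come from.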
One-hot encoding is the basis for constructing the initial embedding layer. Once the network is trained, the one-hot encoding essentially serves as a table lookup. In the fine-tuning step you can select data for your specific task and specify which variables should be fine-tuned when you define the optimizer, using something like this (TensorFlow 1.x API; mean_loss is assumed to be your already-defined loss tensor):
import tensorflow as tf  # TensorFlow 1.x
# Collect only the trainable variables under the embedding layer's scope.
embedding_variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="embedding_variables/kernel")
ft_optimizer = tf.train.AdamOptimizer(learning_rate=0.001, name='FineTune')
# Restrict the update to those variables; all other weights stay fixed.
ft_op = ft_optimizer.minimize(mean_loss, var_list=embedding_variables)
where "embedding_variables/kernel" is the name of the layer that follows the one-hot encoding.

Graphically, how does the non-linear activation function project the input onto the classification space?

I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.
Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and how would this look if one plotted the process graphically?
I've looked through numerous sources, but I still cannot grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture it in my mind.
The mathematical result behind neural networks is the universal approximation theorem. Basically, sigmoidal functions (those that saturate at both ends, like tanh) are smooth, almost piecewise-constant approximators. The more neurons you have, the better your approximation is.
This picture was taken from this article: A visual proof that neural nets can compute any function. Make sure to check out that article; it has other examples and interactive applets.
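The "almost piecewise-constant" behaviour can be sketched by pairing steep sigmoid units into bumps and summing the bumps. This is only an illustration of the idea under assumed values: the steepness, edges and heights below are arbitrary, and the function names are made up.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bump(x, left, right, height=1.0, steepness=100.0):
    """Difference of two steep sigmoid units: roughly `height` on (left, right), ~0 elsewhere."""
    return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

def staircase(x, edges, heights):
    """Sum of adjacent bumps: an almost piecewise-constant function."""
    intervals = zip(edges, edges[1:])
    return sum(bump(x, a, b, h) for (a, b), h in zip(intervals, heights))

edges = [0.0, 0.5, 1.0]   # two intervals: (0, 0.5) and (0.5, 1)
heights = [0.2, 0.8]      # desired value on each interval
values = [staircase(x, edges, heights) for x in (0.25, 0.75)]
```

Using more, narrower bumps (that is, more hidden neurons) with heights taken from a target function gives a finer approximation of that function.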
At each layer, NNs actually create new features by distorting the input space. Non-linear functions let you change the "curvature" of the target function, so that later layers have a chance to make it linearly separable. Without non-linear functions, any combination of linear functions is still linear, so there is no benefit from having multiple layers. As a graphical example, consider this animation.
These pictures were taken from this article. Also check out that cool visualization applet.
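A tiny worked case of "distorting the input space" is XOR, which is not linearly separable in the original inputs. The sketch below uses hand-picked weights and ReLU units rather than tanh (a substitution made here because it keeps the arithmetic exact); nothing in it is learned, and the function names are made up.

```python
def relu(z):
    return max(0.0, float(z))

def hidden(x1, x2):
    """Hand-picked hidden layer: maps the input plane to a new feature space."""
    return relu(x1 + x2), relu(x1 + x2 - 1.0)

def classify(x1, x2):
    """A single linear threshold on the hidden features separates XOR."""
    h1, h2 = hidden(x1, x2)
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

# Show where each input point lands in hidden space and how it is classified.
for a in (0, 1):
    for b in (0, 1):
        print((a, b), "->", hidden(a, b), "->", classify(a, b))
```

In hidden space the two points labelled 1 coincide, and a single straight line separates them from the two points labelled 0, which no line can do in the original input space.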
Activation functions have very little to do with classifying non-linearly separable sets of data.
Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
You can think of this as a way of enforcing equality across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it is not as simple as regular feature scaling (as for linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.
Back-propagation is a process that applies error to the neural network. You can think of this as a positive reward for the neurons that contributed to the correct classification and a negative reward for the neurons that contributed to an incorrect classification. This contribution is often known as the gradient of the neural network. The gradient is, effectively, a multi-variable derivative.
When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute from the value already obtained in the forward pass. Specifically, s'(x) = s(x)(1 - s(x)).
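For reference, the derivative of the logistic sigmoid can be written in terms of its own output as s(x)(1 - s(x)). The short sketch below checks that identity against a finite-difference estimate; the helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative, reusing the forward value: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Largest disagreement between the analytic and numerical derivative.
max_gap = max(abs(sigmoid_grad(x) - numeric_grad(sigmoid, x)) for x in (-2.0, 0.0, 1.5))
```

Because the backward pass can reuse the activation computed in the forward pass, the derivative costs only one extra multiplication and subtraction per neuron.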
Here is an example image (found by google image searching: neural network classification) that demonstrates how a neural network might be superimposed on top of your data set:
I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

extrapolation with recurrent neural network

I wrote a simple recurrent neural network (7 neurons, each initially connected to all the others) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried using more than 20, but the results did not change dramatically).
The network can learn this range pretty well, and when given other points within this range, it can predict the value of the function. However, it cannot extrapolate correctly and predict the values of the function outside the range [-5,5]. What are the reasons for that, and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (whether recurrent or not); extrapolation is completely outside their capabilities. They are used to fit a function to the provided data, and they are completely free to build any model outside the subspace populated with training points. So, in a not very strict sense, one should think of them as an interpolation method.
To make things clear, a neural network should be capable of generalizing the function inside the subspace spanned by the training samples, but not outside of it.
A neural network is trained only for consistency with the training samples, while extrapolation is something completely different. A simple example from "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NNs behave in this context.
All of these networks are consistent with the training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation. If it can be expressed as a regression or classification problem, then you can use an NN; otherwise you should think about a completely different approach.
The only things that can be done to somehow "correct" what is happening outside the training set are to:
add artificial training points in the desired subspace (but this simply grows the training set, and again, outside of this new set the network's behaviour is "random")
add strong regularization, which will force the network to create a very simple model; but a model's simplicity does not guarantee any extrapolation strength, as two models of exactly the same complexity can have, for example, completely different limits at -/+ infinity.
Combining the above two steps can help build a model which to some extent "extrapolates", but this, as stated before, is not the purpose of a neural network.
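The point that training data says nothing about behaviour outside its range can be illustrated without any training at all. In the sketch below, model_a and model_b are hand-written stand-ins for two fitted networks (they are assumptions for illustration, not trained models): both match the question's function 1/(1+x^2) at all 20 training points in [-5, 5], yet they disagree arbitrarily outside that interval.

```python
def target(x):
    return 1.0 / (1.0 + x * x)

def model_a(x):
    """Stand-in for one fitted model: follows the target everywhere."""
    return target(x)

def model_b(x):
    """Stand-in for another fitted model: identical on [-5, 5], grows quadratically outside."""
    excess = max(0.0, abs(x) - 5.0)
    return target(x) + excess ** 2

train_x = [-5.0 + 10.0 * i / 19 for i in range(20)]  # 20 points in [-5, 5]

# Both models have zero error on the training set ...
train_error_a = max(abs(model_a(x) - target(x)) for x in train_x)
train_error_b = max(abs(model_b(x) - target(x)) for x in train_x)

# ... but they give very different answers at x = 7, outside the training range.
gap_outside = model_b(7.0) - model_a(7.0)
```

A training procedure that only measures error on train_x has no way to prefer one of these two models over the other.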
As far as I know, this is only possible with networks that have the echo state property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable of remembering their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately described as "sequence recognition and reproduction." Training networks to recognize a data sequence, with or without a time series (dt), is pretty much the purpose of a recurrent neural network (RNN).
The training function shown in your post has outputs bounded by 0 and 1, and it is even: since x is squared, x is effectively abs(x) in the context of that function. So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc. and aid in adjusting the network weights to match the sequence. Feedback can also take the form of a forward-feed, depending on the type of network used to create the RNN.
Producing an 'observable' network for the decaying function 1/(1+x^2) should be a decent exercise to cut your teeth on RNNs. 'Observable' here means that the network is capable of producing results for any input value(s) even though its training data is (far) smaller than the set of all possible inputs. I can only assume that this was your actual objective, as opposed to "extrapolation."
