In the CBOW model, do we need to take an average at the hidden layer?

I have searched and read some articles about CBOW, but they seem to differ from one another.
As I understand it:
The input is a batch of context vectors, which we feed to the hidden layer, so we get another batch of vectors H at the hidden layer.
In an article (part 2.2.1), they say that no activation function is used at the hidden layer; instead, the batch of vectors H is averaged into a single vector (no longer a batch), which is then fed to the output layer where softmax is applied.
However, in this Coursera video, they do not average the batch H. They feed the batch H directly to the output layer, apply softmax to the batch of output vectors, and then compute the cost function on that.
Also, in the Coursera video, they say we can use ReLU as the activation function at the hidden layer. Is this a new method? Many articles I have read say there is no activation function at the hidden layer.
Can you please help me answer this?

In actual implementations, whose source code you can review, the set of context-word vectors is averaged together before being fed as the "input" to the neural network.
Then, any back-propagated adjustments to the input are also applied to all the vectors contributing to that average.
(For example, in the word2vec.c code released with Google's original word2vec paper, you can see the vectors being tallied into neu1 and then averaged via division by the context-window count cw, at:
https://github.com/tmikolov/word2vec/blob/master/word2vec.c#L444-L448
)
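To make this concrete, here is a toy sketch in Python (an illustration only, not the word2vec.c source; the vocabulary size, dimensionality, gradient, and context indices are made-up assumptions) of averaging the context vectors and then applying the same back-propagated adjustment to every vector that contributed to the average:

import numpy as np

vocab_size, dim = 1000, 100
W_in = np.random.uniform(-0.5 / dim, 0.5 / dim, (vocab_size, dim))  # input (context) word vectors

context_ids = [12, 57, 803, 4]   # indices of the context words around the target
cw = len(context_ids)            # context-window count, as in word2vec.c

# Forward pass: the hidden layer is the plain average of the context vectors
# (no activation function is applied here)
neu1 = W_in[context_ids].sum(axis=0) / cw

# Backward pass: the error accumulated at the averaged vector is applied to
# every contributing context vector (a random stand-in gradient is used here)
neu1e = np.random.randn(dim) * 0.01
for idx in context_ids:
    W_in[idx] += neu1e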

Related

Time series Autoencoder with Metadata

At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than in this simple tutorial.
I have two different time series from two sensors, plus some metadata, such as which machine the time series was recorded on.
With a normal MLP network you could have one subnetwork for the time series and one for the metadata and merge them in higher layers. But how can you use this kind of data as input to an autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (though not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, plus the original input (in this case a time series). Because he has so many arrays in the input, he needs a corresponding number of Input layers. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers, each directly stacked with an embedding layer for the categorical data. Once he has all the data in the corresponding Input layers, he can concatenate them as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (with high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
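For reference, here is a rough sketch (assuming Keras/TensorFlow; the layer sizes, the sequence length, and names such as machine_id are illustrative guesses, not the notebook's exact code) of the wiring described above: separate Input layers, an embedding for the categorical feature, concatenation with the raw series, and the same categorical information fed to the decoder through a high dropout:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_features = 100, 2        # two sensor channels
n_machines, emb_dim = 10, 4         # categorical metadata: machine id

# Separate inputs, one per data source
ts_in  = keras.Input(shape=(seq_len, n_features), name="time_series")
cat_in = keras.Input(shape=(1,), name="machine_id")

# Embed the categorical feature and repeat it along the time axis
cat_emb = layers.Embedding(n_machines, emb_dim)(cat_in)     # (batch, 1, emb_dim)
cat_emb = layers.Reshape((emb_dim,))(cat_emb)
cat_seq = layers.RepeatVector(seq_len)(cat_emb)             # (batch, seq_len, emb_dim)

# Concatenate the raw series with the embedded metadata, then encode
enc_in = layers.Concatenate()([ts_in, cat_seq])
z = layers.LSTM(16)(enc_in)                                 # latent representation

# Decode: upsample the latent code back into a sequence and concatenate the
# metadata again, passed through a high dropout as suggested above
dec_seq  = layers.RepeatVector(seq_len)(z)
cat_drop = layers.Dropout(0.6)(cat_seq)
dec_in   = layers.Concatenate()([dec_seq, cat_drop])
dec_out  = layers.LSTM(16, return_sequences=True)(dec_in)
recon    = layers.TimeDistributed(layers.Dense(n_features))(dec_out)

model = keras.Model(inputs=[ts_in, cat_in], outputs=recon)
model.compile(optimizer="adam", loss="mse")

# .fit() takes a list of arrays, one per Input layer; the target is the series itself
x_ts  = np.random.rand(32, seq_len, n_features).astype("float32")
x_cat = np.random.randint(0, n_machines, size=(32, 1))
model.fit([x_ts, x_cat], x_ts, epochs=1, verbose=0)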
hope I could help you =)

How to initialize the kernel weights of the input layer of a CNN with high pass filters?

I am implementing a CNN model for detection of forged images. The paper I am referring to says to initialize the kernel weights of the first layer with 30 basic high-pass filters (the ones used in the calculation of residual maps in SRM). What are these high-pass filters, and how do I do this?
Also, is there a way to apply these filters to a batch of images at once rather than to a single image at a time, similar to ImageDataGenerator?
Research Paper Reference: https://ieeexplore.ieee.org/document/7823911
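As a starting point, a hedged sketch (assuming Keras/TensorFlow; srm_filters below is a random placeholder standing in for the actual 30 SRM residual kernels from the paper) of loading fixed filters into the first Conv2D layer might look like this:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

srm_filters = np.random.randn(5, 5, 1, 30).astype("float32")  # placeholder for the 30 real 5x5 SRM kernels

inputs = keras.Input(shape=(256, 256, 1))
first  = layers.Conv2D(30, kernel_size=5, padding="same", use_bias=False)(inputs)
x      = layers.Conv2D(16, 3, activation="relu")(first)
x      = layers.GlobalAveragePooling2D()(x)
output = layers.Dense(1, activation="sigmoid")(x)
model  = keras.Model(inputs, output)

# Overwrite the first layer's kernel with the high-pass filters after the model is built
model.layers[1].set_weights([srm_filters])
# model.layers[1].trainable = False   # optionally keep the filters fixed during training

# Because the filters live inside a Conv2D layer, they are applied to whole batches
# automatically; no per-image loop (or ImageDataGenerator workaround) is needed.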

Neural Network Developing

I am trying to write a neural network class, but I don't fully understand some aspects of it. I have two questions about the following design.
Am I doing this correctly? Does the bias neuron need to connect to all neurons (except those in the input layer), or just those in the hidden layer?
My second question is about calculating the output value. I'm using the equation below to calculate the output value of the neurons.
HiddenLayerFirstNeuron.Value =
(input1.Value * weight1) + (input2.Value * weight2) + (Bias.Value * biasWeight)
After this equation, I calculate the activation and send the result to the output. The output neurons do the same.
I'm not sure whether what I'm doing is right, and I want to clear up these problems.
Take a look at http://deeplearning.net/tutorial/contents.html in Theano. It explains everything you need to know about the multilayer perceptron using Theano (a symbolic math library).
The bias is usually connected to all hidden and output units.
Yes, you compute the input to the activation function as the sum of weight * output over the neurons of the previous layer.
Good luck with development ;)
There should be a separate bias neuron for each hidden layer and for the output layer. Think of a layer as a function applied to a first-order polynomial, such as f(m*x + b) = y, where y is your output and f(x) your activation function. If you look at the linear term you will recognize b. This represents the bias, and it behaves in a neural network much as it does in this simplification: it shifts the hyperplane up and down in the space. Keep in mind that you will have one bias per layer, connected to all neurons of that layer, as in f(w1*x1 + ... + wn*xn + b), typically with an initial value of 1. When it comes to gradient descent, you will have to train this bias like a normal weight.
In my opinion you should apply the activation function to the output layer as well. This is how it's usually done with multilayer perceptrons, but it actually depends on what you want. If, for example, you use the logistic function as the activation function and you want an output in the interval (0, 1), then you have to apply the activation function to the output as well, since a plain linear combination, as in your example, can theoretically go outside the boundaries of that interval.
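As a concrete illustration of both answers, here is a minimal forward pass in Python (the shapes, initial values, and choice of sigmoid are assumptions, not the asker's class design): each layer has its own bias, and the activation function is applied to the hidden layer and to the output layer alike:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.5, -0.2])     # two input neurons
W1 = np.random.randn(3, 2)     # hidden layer: 3 neurons, each with one weight per input
b1 = np.ones(3)                # one bias per hidden neuron (initial value 1)
W2 = np.random.randn(1, 3)     # output layer: 1 neuron
b2 = np.ones(1)                # bias for the output neuron

hidden = sigmoid(W1 @ x + b1)       # weighted sum plus bias, then activation
output = sigmoid(W2 @ hidden + b2)  # same pattern for the output layer
print(output)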

Where do dimensions in Word2Vec come from?

I am using the word2vec model to train a neural network and build a neural embedding for finding similar words in the vector space. But my question is about the dimensions of the word and context embeddings (matrices), which we initialise with random numbers (vectors) at the beginning of training, like this: https://iksinc.wordpress.com/2015/04/13/words-as-vectors/
Let's say we want to display the words {book, paper, notebook, novel} on a graph. First of all we should build a matrix with dimensions 4x2, 4x3, 4x4, etc. I know the first dimension of the matrix is the size of our vocabulary |V|. But what about the second dimension (the number of dimensions of each vector)? For example, if this is the vector for the word "book", [0.3, 0.01, 0.04], what are these numbers? Do they have any meaning? Is the 0.3 related to the relation between "book" and "paper" in the vocabulary, the 0.01 to the relation between "book" and "notebook", and so on?
Just like TF-IDF or co-occurrence matrices, where each dimension (column) Y has a meaning: it's a word or document related to the word in row X.
The word2vec model uses a network architecture to represent the input word(s) and most likely associated output word(s).
Assuming there is one hidden layer (as in the example linked in the question), the two matrices introduced represent the weights and biases that allow the network to compute its internal representation of the function mapping the input vector (e.g. “cat” in the linked example) to the output vector (e.g. “climbed”).
The weights of the network are a sub-symbolic representation of the mapping between the input and the output – any single weight doesn’t necessarily represent anything meaningful on its own. It’s the connection weights between all units (i.e. the interactions of all the weights) in the network that gives rise to the network’s representation of the function mapping. This is why neural networks are often referred to as “black box” models – it can be very difficult to interpret why they make particular decisions and how they learn. As such, it's very difficult to say what the vector [0.3,0.01,0.04] represents exactly.
Network weights are traditionally initialised to random values for two main reasons:
It prevents a bias being introduced to the model before training begins
It allows the network to start from different points in the search space after initialisation (helping reduce the impact of local minima)
A network’s ability to learn can be very sensitive to the way its weights are initialised. There are more advanced ways of initialising weights today e.g. this paper (see section: Weights initialization scaling coefficient).
The way in which weights are initialised and the dimension of the hidden layer are often referred to as hyper-parameters and are typically chosen according to heuristics and prior knowledge of the problem space.
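To make the shapes concrete, here is a toy illustration (the numbers are arbitrary and only meant to show the dimensions) of the two randomly initialised matrices from the question: |V| rows, one per vocabulary word, and d columns, where d is a freely chosen hyper-parameter whose individual columns carry no predefined meaning:

import numpy as np

vocab = ["book", "paper", "notebook", "novel"]   # |V| = 4
d = 3                                            # embedding dimension, chosen freely

rng = np.random.default_rng(0)
W_word    = rng.uniform(-0.5 / d, 0.5 / d, size=(len(vocab), d))   # word embedding matrix, 4 x 3
W_context = rng.uniform(-0.5 / d, 0.5 / d, size=(len(vocab), d))   # context embedding matrix, 4 x 3

print(dict(zip(vocab, W_word.round(2))))   # e.g. "book" -> some 3-dimensional vector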
I have wondered the same thing and put in a vector like (1, 0, 0, 0, 0, ...) to see which terms it was nearest to. The answer is that the results returned didn't seem to cluster around any particular meaning; they were just kind of random. This was using Mikolov's 300-dimensional vectors trained on Google News.
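A rough sketch of that experiment (assuming the gensim library and a locally available copy of the pretrained Google News vectors; the file path is a placeholder) would be:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

probe = np.zeros(kv.vector_size, dtype=np.float32)
probe[0] = 1.0   # "turn on" only the first dimension

# Words nearest to this single axis; in practice they tend not to share any obvious meaning
print(kv.similar_by_vector(probe, topn=10))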
Look up NNSE semantic vectors for a vector space where the individual dimensions do seem to carry specific human-graspable meanings.

How does function approximation (say, of a sine) in a neural network really work?

I am learning about neural networks for the first time. I was trying to understand how function approximation can be performed using a single hidden layer. I saw this example on StackExchange, but I had some questions after going through one of the answers.
Suppose I want to approximate a sine function between 0 and 3.14 radians. Will I have 1 input neuron? If so, suppose I have K neurons in the hidden layer, each of which uses a sigmoid transfer function. Then, in the output neuron (say it just uses a linear sum of the results from the hidden layer), how can the output be anything other than sigmoid-shaped? Shouldn't the linear sum be sigmoid-shaped as well? In short, how can a sine function be approximated using this architecture in a neural network?
It is possible, and this is formally stated as the universal approximation theorem. It holds for any non-constant, bounded, and monotonically increasing continuous activation function.
I actually don't know the formal proof, but to get an intuitive idea that it is possible, I recommend the following chapter: A visual proof that neural nets can compute any function.
It shows that with enough hidden neurons and the right parameters you can create step functions as the summed output of the hidden layer. With step functions it is easy to argue how you can approximate any function, at least coarsely. Now, to get the final output correct, the weighted sum coming out of the hidden layer has to be sig^-1(sin(x)), since the final (sigmoid) neuron then outputs sig(sig^-1(sin(x))) = sin(x). And, as already said, we are able to approximate this at least to some accuracy.
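To see this in practice, here is a small sketch (assuming Keras/TensorFlow; the number of hidden neurons, learning rate, and epoch count are arbitrary choices) that fits one sigmoid hidden layer plus a linear output neuron, the architecture described in the question, to sin(x) on [0, 3.14]:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x = np.linspace(0.0, 3.14, 200).reshape(-1, 1).astype("float32")
y = np.sin(x)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    layers.Dense(20, activation="sigmoid"),   # K = 20 hidden sigmoid neurons
    layers.Dense(1, activation="linear"),     # plain weighted sum of the hidden outputs
])
model.compile(optimizer=keras.optimizers.Adam(0.01), loss="mse")
model.fit(x, y, epochs=500, verbose=0)

# The weighted sum of many shifted and scaled sigmoids is no longer sigmoid-shaped;
# it can bend up and back down to follow the sine curve.
print(model.predict(x[::50], verbose=0).ravel())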

Resources