I'm very new to dl4j and I'm struggling to preform basic operations within a graph.
As an example how would I preform a multiplication of the output of 2 layers. For example in the code below what should the ??? be substituted with? Given that I want to matrix multiply document_output and question_output.
builder.addLayer(document_output,
RnnProvider().makeLaterLayer(), second_layer_document_rnn)
builder.addLayer(question_output,
RnnProvider().makeLaterLayer(), second_layer_question_rnn)
builder.addVertex(attention,
???,
document_output, question_output
)
builder.setOutputs(attention)
Related
I am working on a machine learning project to learn more about this field. The project is about image classification. I want to use the EffnetB0 architecure and they mention in this architecure they use in the fisrt stage the following layer: "Conv3X3" and the following layers they use "MBConv1".
I tried to understand the difference between these two layers but I can't seem to find the answer. These two layers are both convolutional layers right ?
But what exactly is the difference between "Conv" and "MBConv"?
Thank you for helping me!
A conv means that there is a convolution core to scan the matrix corresponding to the target image line by line and convolution, the result of each convolution constitutes a value of the output matrix.
About the MBConv,i think you means mobile inverted bottleneck convolution,it's more of an encapsulated module than a single conv layer. A MBConv's structure can be expressed as follows:
MBConv = 1x1conv(ascending dimension) + Depthwise Convolution + SENet + 1x1conv(dimensionality reduction) + add
By the way, you may notice the new names Depthwise Convolution and SENet, which are also a kind of modules(honestly, it's like a nesting doll)
If you just want to use it, you don't necessarily need to fully understand it until you need to improve your model structure. So my answer to your question
What is the difference between these two layers : CONV and MBConv?
is : the former is a simple layer, and the latter is a complex module made up of many simple layers
Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work, for a few reasons. Essentially, computing all inputs in this way is like running a network with a batch size equal to the number of inputs. The weights do not get updated between inputs, but rather all at once. And so while it seems that calculating together would be valid, it makes it so that each input does not individually influence the training step by step. However, with a reasonable batch size, you can do 2d matrix multiplications, where the input is batch_size by input_size in order to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
I'm working on a project which tries to "learn" a relationship between a set of around 10 k complex-valued input images (amplitude/phase; real/imag) and a real-valued output-vector with 48 entries. This output-vector is not a set of labels, but a set of numbers which represents the best parameters to optimize the visual impression of the given complex-valued image. These parameters are generated by an algorithm. It's possible, that there is some noise in the data (comming from images and from the algorithm which generates the parameter-vector)
Those parameters more-less depends on the FFT (fast-fourier-transform) of the input image. Therfore I was thinking of feeding the network (5 hidden-layers, but architecture shouldn't matter right now) with a 1D-reshaped version of the FFT(complexImage) - some pseudocode:
// discretize spectrum
obj_ft = fftshift(fft2(object));
obj_real_2d = real(obj_ft);
obj_imag_2d = imag(obj_ft);
// convert 2D in 1D rows
obj_real_1d = reshape(obj_real_2d, 1, []);
obj_imag_1d = reshape(obj_imag_2d, 1, []);
// create complex variable for 1d object and concat
obj_complx_1d(index, :) = [obj_real_1d obj_imag_1d];
opt_param_1D(index, :) = get_opt_param(object);
I was wondering if there is a better approach for feeding complex-valued images into a deep-network. I'd like to avoid the use of complex gradients, because it's not really necessary?! I "just" try to find a "black-box" which outputs the optimized parameters after inserting a new image.
Tensorflow gets the input: obj_complx_1d and output-vector opt_param_1D for training.
There are several ways you can treat complex signals as input.
Use a transform to make them into 'images'. Short Time Fourier Transforms are used to make spectrograms which are 2D. The x-axis being time, y-axis being frequency. If you have complex input data, you may choose to simply look at the magnitude spectrum, or the power spectral density of your transformed data.
Something else that I've seen in practice is to treat the in-phase and quadrature (real/imaginary) channels separate in early layers of the network, and operate across both in higher layers. In the early layers, your network will learn characteristics of each channel, in higher layers it will learn the relationship between the I/Q channels.
These guys do a lot with complex signals and neural nets. In particular check out 'Convolutional Radio Modulation Recognition Networks'
https://radioml.com/research/
The simplest way to feed complex valued numbers with out using complex gradients in your models is to represent the complex values in a different representation. The two main ways are:
Magnitude/Angle components
Real/Imaginary components
I'll show this idea using magnitude/angle components. Assuming you have a 2d numpy array representing an image with shape = (WIDTH, HEIGHT)
import numpy as np
kSpace = np.fft.ifftshift(np.fft.fft2(img))
This would give you a 2D complex array. You can then transform the array into a
data = np.dstack((np.abs(kSpace), np.angle(kSpace)))
This array will be a numpy array with shape = (WIDTH, HEIGHT, 2). This array represents one complex valued image. For a set of images, make sure to concatenate them together to get an array of shape = (NUM_IMAGES, WIDTH, HEIGHT, 2)
I made a simple example of using tensorflow to learn an Fourier Transform with a simple neural network. You can find this example at https://github.com/michaelmendoza/learning-tensorflow
I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as a one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library does the similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.
Can anyone explain me the similarities and differences, of the Correlation and Convolution ? Please explain the intuition behind that, not the mathematical equation(i.e, flipping the kernel/impulse).. Application examples in the image processing domain for each category would be appreciated too
You will likely get a much better answer on dsp stack exchange but... for starters I have found a number of similar terms and they can be tricky to pin down definitions.
Correlation
Cross correlation
Convolution
Correlation coefficient
Sliding dot product
Pearson correlation
1, 2, 3, and 5 are very similar
4,6 are similar
Note that all of these terms have dot products rearing their heads
You asked about Correlation and Convolution - these are conceptually the same except that the output is flipped in convolution. I suspect that you may have been asking about the difference between correlation coefficient (such as Pearson) and convolution/correlation.
Prerequisites
I am assuming that you know how to compute the dot-product. Given two equal sized vectors v and w each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2]
There is a lot of theory behind the dot product in terms of what it represents etc....
Notice the dot product is a single number (scalar) representing the mapping between these two vectors/points v,w In geometry frequently one computes the cosine of the angle between two vectors which uses the dot product. The cosine of the angle between two vectors is between -1 and 1 and can be thought of as a measure of similarity.
Correlation coefficient (Pearson)
Correlation coefficient between equal length v,w is simply the dot product of two zero mean signals (subtract mean v from v to get zmv and mean w from w to get zmw - here zm is shorthand for zero mean) divided by the magnitudes of zmv and zmw.
to produce a number between -1 and 1. Close to zero means little correlation, close to +/- 1 is high correlation. it measures the similarity between these two vectors.
See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient for a better definition.
Convolution and Correlation
When we want to correlate/convolve v1 and v2 we basically are computing a series of dot-products and putting them into an output vector. Let's say that v1 is three elements and v2 is 10 elements. The dot products we compute are as follows:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
output[6] = v1[0]*v2[8]+v1[1]*v2[9]+v1[2]*v2[10] #note this is
#mathematically valid but might give you a run time error in a computer implementation
The output can be flipped if a true convolution is needed.
output[5] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[4] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[3] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[2] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[1] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[0] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
Notice that we have less than 10 elements in the output as for simplicity I am computing the convolution only where both v1 and v2 are defined
Notice also that the convolution is simply a number of dot products. There has been considerable work over the years to be able to speed up convolutions. The sweeping dot products are slow and can be sped up by first transforming the vectors into the fourier basis space and then computing a single vector multiplication then inverting the result, though I won't go into that here...
You might want to look at these resources as well as googling: Calculating Pearson correlation and significance in Python
The best answer I got were from this document:http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(GI) = (FG)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
Convolution is just like correlation, except that we flip over the filter before correlating