If you have 32 input channels and 16 filters, do channels 17-32 get dropped? Or do 16 random channels get dropped, or do all of the channels pass through? What if there are more filters than input channels?
I read that each filter/kernel convolves with the corresponding channel in a convolution layer -- the first channel with the first filter, the second channel with the second filter, etc. -- but what happens if the number of input channels is not equal to the number of filters?
(Source: http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html)
If you have 32 input channels, then the input has shape (samples, W, H, 32), and your 16 filters each have shape (F_W, F_H, 32). As you can see, each filter has 32 channels, so the convolution is compatible: the number of channels in each filter equals the number of channels in the input. No channels are dropped; every filter sees all 32 input channels.
And since there are 16 filters, after doing all 16 convolutions you will have a 16-channel output feature map, because each convolution operation produces one feature map.
Here, "input channels" means depth; in an RGB image, for example, we have 3 channels.
What these 16 filters do is compute 16 different feature maps (FMs), and these 16 FMs become the input for the next layer.
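A minimal shape check, assuming a Keras-style (samples, W, H, channels) layout as in the answer above (the 64x64 spatial size is an arbitrary choice):

from tensorflow.keras import Input, Model, layers

inp = Input(shape=(64, 64, 32))         # 32 input channels
out = layers.Conv2D(16, (3, 3))(inp)    # 16 filters, each of shape (3, 3, 32)
model = Model(inp, out)
print(model.output_shape)               # (None, 62, 62, 16): one map per filter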
Hope this helps!
Related
I read an article, and the authors use a CNN with the following scheme:
-----------------
Input image 30*30 --(filtering 5*3*3)--> Feature maps 28*28 --(downsampling 2*2)--> Feature maps 14*14 --(filtering 5*3*3)--> Feature maps 12*12 --(downsampling 2*2)--> Feature maps 6*6
-----------------
From my understanding, we have two filters of size 5 * 3, and the last 3 corresponds to the RGB channels -- is that correct?
It means you have 5 channels (i.e., 5 filters/kernels) of size 3x3.
What they are trying to say is this:
First:
the convolution is done using 5 different 3x3 2D kernels
input 30x30 ==> output: 5 different 28x28 maps
Second:
max pool 2x2, i.e., the output dimensions are halved
input 28x28 ==> output 14x14
Third:
the convolution is done using 5 different 3x3 2D kernels
input 14x14 ==> output: 5 different 12x12 maps
Lastly:
max pool 2x2, i.e., the output dimensions are halved
input 12x12 ==> output 6x6
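A hedged sketch of that pipeline in Keras (only the layer sizes come from the scheme above; the single input channel, valid padding, and absence of activations are assumptions):

from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Conv2D(5, (3, 3), input_shape=(30, 30, 1)),  # 30x30 -> 5 maps of 28x28
    layers.MaxPooling2D((2, 2)),                        # 28x28 -> 14x14
    layers.Conv2D(5, (3, 3)),                           # 14x14 -> 5 maps of 12x12
    layers.MaxPooling2D((2, 2)),                        # 12x12 -> 6x6
])
model.summary()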
Below is the code I have been writing to try to create a Mel triangular filter bank.
I start with the 300 to 8000 Hz range, convert the frequencies to mels, and then convert the mels back into frequencies to get the FFT bin numbers.
clear all;
g = [300 8000];                             % low frequency and fs/2 for the highest frequency
freq2mel = 1125*log(1+(g/700));             % converting the frequency endpoints to the mel scale
% answer: [401.25 2834.99]
f = linspace(freq2mel(1), freq2mel(2), 12); % for 10 filter banks, take the two
                                            % endpoints and put 10 banks between them
% answer: [401.25 622.50 843.75 1065.0 1286.25 1507.50 1728.74 1949.99 2171.24 2392.49 2613.74 2834.99]
mel2freq = 700*(exp(f/1125)-1);             % converting the mels back into frequency
% answer: [300 517.33 781.90 1103.97 1496.04 1973.32 2554.33 3261.62 4122.63 5170.76 6446.70 8000]
fft_bins = floor((mel2freq/16000)*512);     % creating the FFT bin numbers (fs = 16000, NFFT = 512)
% answer: [9 16 25 35 47 63 81 104 132 165 206 256]
My issue is that I am stuck after this point. I keep seeing the filter bank piecewise function below come up, but I do not understand what $k$ is in this function. Is $k$ the array of $|\mathrm{FFT}|^2$ values from the Hamming-windowed frame? And how do I get the actual triangular filters with peak magnitude 1, so I can pass $|\mathrm{FFT}|^2$ through them to get my MFCCs? Can someone please help me out?
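(For reference, the piecewise function being referred to is presumably the standard triangular Mel filter definition, where $f(m)$ is the FFT bin containing the $m$-th band edge:)

$$H_m(k) = \begin{cases} 0 & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)} & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)} & f(m) \le k \le f(m+1) \\ 0 & k > f(m+1) \end{cases}$$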
Normally when you do this kind of filtering, you have your spectrum in a 1D array and your Mel filter bank in a 2D matrix, with one dimension matching the FFT bins of your spectrum array and the other dimension being your target Mel bands. You multiply them and get your 1D Mel spectrum.
The $H_m$ function really just describes a triangle centered around $m$, where $m$ is the center Mel band and $k$ is the frequency from 0 to Fs/2. In theory, the $k$ parameter is continuous. You can assume that $k$ is an FFT bin and it will kind of work, but you will not get great results at low frequencies, where your entire Mel band covers only 1 or 2 FFT bins. If you need better resolution than that, you will have to consider how much of the triangle a particular FFT bin covers.
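A minimal sketch (NumPy; fs = 16000, NFFT = 512, and the 10 bands between 300 and 8000 Hz are taken from the question) of building the filter bank matrix and applying it to a power spectrum:

import numpy as np

fs, nfft, n_bands = 16000, 512, 10
mel = lambda f: 1125 * np.log(1 + f / 700)       # frequency -> mel
imel = lambda m: 700 * (np.exp(m / 1125) - 1)    # mel -> frequency

edges = imel(np.linspace(mel(300), mel(8000), n_bands + 2))  # 12 edge frequencies
bins = np.floor(edges / fs * nfft).astype(int)               # matching FFT bins

H = np.zeros((n_bands, nfft // 2 + 1))
for m in range(1, n_bands + 1):
    lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
    H[m - 1, lo:c + 1] = (np.arange(lo, c + 1) - lo) / (c - lo)  # rising edge, 0 -> 1
    H[m - 1, c:hi + 1] = (hi - np.arange(c, hi + 1)) / (hi - c)  # falling edge, 1 -> 0

# power_spectrum = |FFT|^2 of one Hamming-windowed frame, length nfft//2 + 1;
# mel_spectrum = H @ power_spectrum gives one value per Mel band, and taking
# the log followed by a DCT yields the MFCCs.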
I am trying to parameterise a 1D conv net via Torch.
Let's say I have a Tensor called data of dimensions 10 x 512, i.e., 10 rows and 512 columns. I want to implement a single 3-layer stack of a TemporalConvolution layer, followed by ReLU, followed by TemporalMaxPooling. My classification problem is binary, and there is a corresponding labels tensor of size 10 x 1. Let us assume that a feval to iterate through each row of both data and labels has already been written.
As such, the problem is to construct a net that can map from 512 columns down to 1 column.
Adapted from the documentation:
...
model = nn.Sequential()
model:add(nn.TemporalConvolution(inputFrameSize, outputFrameSize, kW, [dW]))
model:add(nn.ReLU())
model:add(nn.TemporalMaxPooling(kW2, [dW2]))
...
criterion = nn.BCECriterion()
...
I have parameterised it as follows, but the following doesn't work:
TemporalConvolution(512, 1, 3, 1)
ReLU()
TemporalMaxPooling(3, 1)
It throws the error: 2D or 3D(batch mode) tensor expected. As a result, I tried to reshape data before passing it to the net:
data = data:resize(1, 100, 512)
But this throws the error: invalid input frame size.
I can see that the error concerns the shape of the data coming into the conv net, and of course the parameterisation too. I am further confused by this post here, which seems to suggest that the inputFrameSize of TemporalConvolution should be set to 10, not 512.
Any guidance as to how to build a 1D conv net would be appreciated.
P.S. I have tested the script with a logisticRegression model, and that runs, so the issue is purely with the conv net architecture / the shape of the data coming into it.
I guess you misunderstand the meaning of inputFrameSize, which is not the sequence length of your input but the number of channels (e.g., for 512*512 RGB images in 2D convolution, the inputFrameSize should be 3, not 512).
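To illustrate, a hedged sketch in PyTorch rather than Lua Torch (Conv1d's in_channels plays the same role as inputFrameSize; the 16 output channels are an arbitrary choice):

import torch
import torch.nn as nn

x = torch.randn(10, 1, 512)          # 10 samples, 1 channel, sequence length 512
net = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3),  # channels, not 512
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=3, stride=1),
)
print(net(x).shape)                  # torch.Size([10, 16, 508])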
I am using BOW in OpenCV for clustering features of variable size. However, one thing is not clear from the OpenCV documentation, and I am also unable to find the reason for this behaviour:
assume: dictionary size = 100.
I use SURF to compute the features, and each image has variable-size descriptors, e.g. 128 x 34, 128 x 63, etc. Now in BOW each of them is clustered and I get a fixed descriptor size of 128 x 100 for an image. I know the 100 is the cluster centers created using k-means clustering.
But I am confused by this: if an image has 128 x 63 descriptors, then how come it clusters into 100 clusters, which is impossible using k-means UNLESS I convert the descriptor matrix to 1D? Won't converting to 1D lose the valid 128-dimensional information of the individual keypoints?
I need to know how the descriptor matrix is manipulated to get 100 cluster centers from only 63 features.
Think of it like this.
You have 10 cluster means in total and 6 features for the current image. The first 3 of those features are closest to the 5th mean, and the remaining 3 are closest to the 7th, 8th, and 9th means respectively. Then your feature vector will be [0, 0, 0, 0, 3, 0, 1, 1, 1, 0], or a normalized version of it. That is 10-dimensional, equal to the number of cluster means. So you can create a 100000-dimensional vector from 63 features if you want.
But I still think there is something wrong, because after you apply BOW, your features should be 1x100, not 128x100. Your cluster means are 128x1 and you are assigning your 128x1-sized features (you have 34 128x1 features for the first image, 63 for the second, etc.) to those means. So basically you are assigning 34 or 63 features to 100 means, and your result should be 1x100.
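A minimal sketch (NumPy; descriptors laid out as n_keypoints x 128 rows and cluster centers as 100 x 128, both assumptions matching the description above) of turning a variable number of descriptors into one fixed-length histogram:

import numpy as np

def bow_histogram(descriptors, centers):
    # distance from every descriptor to every cluster center
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                  # closest center per descriptor
    hist = np.bincount(nearest, minlength=len(centers)).astype(float)
    return hist / hist.sum()                    # normalized, one value per center

centers = np.random.rand(100, 128)              # stand-in for the k-means vocabulary
image_descriptors = np.random.rand(63, 128)     # 63 SURF descriptors
print(bow_histogram(image_descriptors, centers).shape)  # (100,) whether 34 or 63 features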
I have an array of 240 data points sampled at 600 Hz, representing 400 ms. I need to resample this data to 512 data points sampled at 1024 Hz, representing 500 ms. I assume that since I'm starting with 400 ms of data, the last 100 ms will just need to be padded with 0s.
Is there a best approach to take to accomplish this?
If you want to avoid interpolation, then you need to upsample to a 76.8 kHz sample rate (i.e., insert 127 zeros after every input sample), low-pass filter, then decimate (drop 74 out of every 75 samples), since the ratio 1024/600 reduces to 128/75.
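A hedged sketch using SciPy's polyphase resampler, which implements exactly this upsample/filter/decimate chain:

import numpy as np
from scipy.signal import resample_poly

x = np.random.randn(240)               # 240 samples at 600 Hz (400 ms)
y = resample_poly(x, up=128, down=75)  # 1024/600 = 128/75; now at 1024 Hz
y = np.pad(y, (0, 512 - len(y)))       # zero-pad the missing 100 ms out to 512
print(len(y))                          # 512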
You can use windowed sinc interpolation, which will give you the same result as upsampling and downsampling using a linear-phase FIR low-pass filter with a windowed-sinc impulse response. When using an FIR filter, one normally has to pad the signal with zeros for the length of the FIR filter kernel on both sides.
Added:
Another possibility is to zero-pad the 240 samples with 60 zeros (giving 300 samples, i.e., the full 500 ms at 600 Hz), apply a non-power-of-2 FFT of length 300, "center" zero-pad the FFT result with 212 complex zeros to make it 512 long but with an identical spectrum, and do an IFFT of length 512 to get the resampled result.
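A minimal NumPy sketch of that approach (rfft is used so the "center" padding reduces to extending the positive-frequency half; the amplitude scaling by the length ratio is an assumption to compensate for NumPy's IFFT normalization):

import numpy as np

x = np.pad(np.random.randn(240), (0, 60))   # 300 samples = 500 ms at 600 Hz
X = np.fft.rfft(x)                          # 151 positive-frequency bins
X = np.pad(X, (0, 512 // 2 + 1 - len(X)))   # extend to 257 bins: same spectrum
y = np.fft.irfft(X, n=512) * (512 / 300)    # 512 samples = 500 ms at 1024 Hz
print(len(y))                               # 512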
Yes to endolith's response: if you want to interpolate x[n] by simply computing the FFT, zero-stuffing, and then taking the IFFT, you'll get errors if x[n] is not periodic. See this reference: http://www.embedded.com/design/other/4212939/Time-domain-interpolation-using-the-Fast-Fourier-Transform-
FFT-based resampling/upsampling is pretty easy...
If you can use Python, scipy.signal.resample should work.
For C/C++, there is a simple FFTW trick to upsample if you have real (as opposed to complex) data.
#include <string.h>
#include <fftw3.h>

int nfft = 240;    // the original data length
int upnfft = 512;  // the new data length

// allocate (fill `data` with the original samples before executing the plans)
double * data = (double*)fftw_malloc(nfft*sizeof(double));
fftw_complex * tmp_fd = (fftw_complex*)fftw_malloc((upnfft/2+1)*sizeof(fftw_complex));
double * result = (double*)fftw_malloc(upnfft*sizeof(double));
// create fftw plans
fftw_plan fft_plan = fftw_plan_dft_r2c_1d(nfft, data, tmp_fd, FFTW_ESTIMATE);
fftw_plan ifft_plan = fftw_plan_dft_c2r_1d(upnfft, tmp_fd, result, FFTW_ESTIMATE);
// zero out tmp_fd so the high-frequency bins beyond nfft/2 stay empty
memset(tmp_fd, 0, (upnfft/2+1)*sizeof(fftw_complex));
// execute the plans (the forward r2c fills only the first nfft/2+1 bins)
fftw_execute_dft_r2c(fft_plan, data, tmp_fd);
fftw_execute_dft_c2r(ifft_plan, tmp_fd, result);
// FFTW transforms are unnormalized: divide by nfft to restore the amplitude
for (int i = 0; i < upnfft; i++) result[i] /= nfft;
// cleanup
fftw_destroy_plan(fft_plan);
fftw_destroy_plan(ifft_plan);
fftw_free(tmp_fd);
fftw_free(result);
fftw_free(data);