Someone should add "net#" as a tag. I'm trying to improve my neural network in Azure Machine Learning Studio by turning it into a convolutional neural net using this tutorial:
https://gallery.cortanaintelligence.com/Experiment/Neural-Network-Convolution-and-pooling-deep-net-2
The difference between mine and the tutorial is that I'm doing regression with 35 features and 1 label, while they're doing classification with 28x28 features and 10 labels.
I started with the basic and second examples and got them working with:
input Data [35];
hidden H1 [100]
from Data all;
hidden H2 [100]
from H1 all;
output Result [1] linear
from H2 all;
Now, the transformation to convolution is what I don't understand. Neither the tutorial nor the documentation here: https://learn.microsoft.com/en-us/azure/machine-learning/machine-learning-azure-ml-netsharp-reference-guide explains how the node tuple values are calculated for the hidden layers. The tutorial says:
hidden C1 [5, 12, 12]
from Picture convolve {
InputShape = [28, 28];
KernelShape = [ 5, 5];
Stride = [ 2, 2];
MapCount = 5;
}
hidden C2 [50, 4, 4]
from C1 convolve {
InputShape = [ 5, 12, 12];
KernelShape = [ 1, 5, 5];
Stride = [ 1, 2, 2];
Sharing = [ F, T, T];
MapCount = 10;
}
It seems like the [5, 12, 12] and [50, 4, 4] pop out of nowhere, along with the KernelShape, Stride, and MapCount. How do I know what values are valid for my example? I tried using the same values, but it didn't work, and I have a feeling that since the tutorial has a [28, 28] input and I have a [35] input, I need tuples with 2 integers, not 3.
I just tried some values that seem to follow the tutorial's pattern:
const { T = true; F = false; }
input Data [35];
hidden C1 [7, 23]
from Data convolve {
InputShape = [35];
KernelShape = [7];
Stride = [2];
MapCount = 7;
}
hidden C2 [200, 6]
from C1 convolve {
InputShape = [ 7, 23];
KernelShape = [ 1, 7];
Stride = [ 1, 2];
Sharing = [ F, T];
MapCount = 14;
}
hidden H3 [100]
from C2 all;
output Result [1] linear
from H3 all;
Right now it seems impossible to debug because the only error code Azure Machine Learning Studio ever gives is:
Exception":{"ErrorId":"LibraryException","ErrorCode":"1000","ExceptionType":"ModuleException","Message":"Error 1000: TLC library exception: Exception of type 'Microsoft.Numerics.AFxLibraryException' was thrown.","Exception":{"Library":"TLC","ExceptionType":"LibraryException","Message":"Exception of type 'Microsoft.Numerics.AFxLibraryException' was thrown."}}}Error: Error 1000: TLC library exception: Exception of type 'Microsoft.Numerics.AFxLibraryException' was thrown. Process exited with error code -2
Lastly, my setup is: (screenshot omitted)
Thanks for the help!
The correct network definition for a 35-column input with the given kernels and strides would be the following:
const { T = true; F = false; }
input Data [35];
hidden C1 [7, 15]
from Data convolve {
InputShape = [35];
KernelShape = [7];
Stride = [2];
MapCount = 7;
}
hidden C2 [14, 7, 5]
from C1 convolve {
InputShape = [ 7, 15];
KernelShape = [ 1, 7];
Stride = [ 1, 2];
Sharing = [ F, T];
MapCount = 14;
}
hidden H3 [100]
from C2 all;
output Result [1] linear
from H3 all;
First, the C1 = [7,15]. The first dimension is simply the MapCount. For the second dimension, the kernel shape defines the length of the "window" that's used to scan the input columns, and the stride defines how much it moves at each step. So the kernel windows would cover columns 1-7, 3-9, 5-11,...,29-35, yielding the second dimension of 15 when you tally the windows.
Next, the C2 = [14, 7, 5]. The first dimension is again the MapCount. For the second and third dimensions, the 1-by-7 kernel "window" has to cover the 7-by-15 input using steps of 1 and 2 along the corresponding dimensions, giving 7 and 5 positions respectively.
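As a sanity check (plain Python, not part of Net# itself): with no padding, each output dimension is floor((input - kernel) / stride) + 1, and the total node count of the layer is MapCount times the product of those sizes, which has to match the total implied by whatever shape you declare for the hidden layer:
def convolve_nodes(input_shape, kernel_shape, stride, map_count):
    # per-dimension output size of a convolve bundle with no padding
    dims = [(i - k) // s + 1 for i, k, s in zip(input_shape, kernel_shape, stride)]
    total = map_count
    for d in dims:
        total *= d
    return dims, total
print(convolve_nodes([35], [7], [2], map_count=7))            # ([15], 105)   -> C1 [7, 15]
print(convolve_nodes([7, 15], [1, 7], [1, 2], map_count=14))  # ([7, 5], 490) -> C2 [14, 7, 5]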
Note that you could specify a C2 hidden layer shape of [98, 5] or even [490] if you wanted to flatten the outputs.
I would like to train an ML model in caret based on training data.
I have training data with the following structure:
df <- data.frame(
  Label      = c("A","A","A","B","A", "A","A","B","B","A", "B","B","A","A","A"),
  EXPERIMENT = c("X","X","X","X","X", "Y","Y","Y","Y","Y", "Z","Z","Z","Z","Z"),
  VALUE1     = c(  1,  2,  1,  5,  1,   3,  1,  5,  6,  1,   7,  5,  1,  2,  2),
  VALUE2     = c(  9,  7,  8,  1,  8,   2,  1,  9,  8,  2,   7,  7,  2,  1,  1)
)
I want to use train and split the data according to experiments for cross-validation (in this example, 3 cross-validation splits).
That is:
Split1: training = X,Y and validation = Z
Split2: training = X,Z and validation = Y
Split3: training = Y,Z and validation = X
How can I do that? With trainControl?
I found an index option in trainControl, but I don't understand whether it can do that.
I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
'Deep learning is not easy.',
'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768]) <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.
For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can zero out the padded positions, sum along the token axis, and divide by the number of non-padded tokens:
mean_pooled = (token_embeddings * attention_mask.unsqueeze(-1)).sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)
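Either way you get one 768-D vector per sentence. A quick shape check with the [3, 11, 768] example from the question (note that torch.max with a dimension returns a (values, indices) named tuple, so take .values):
print(pooled.values.shape)  # torch.Size([3, 768])
print(mean_pooled.shape)    # torch.Size([3, 768])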
In addition to @Quang's answer, you can have a look at the sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
The max pooling presented in the accepted answer will suffer when the true max is negative (multiplying by the mask turns the padded positions into zeros, which then win the max), and the implementation from sentence_transformers modifies token_embeddings in place, which throws an error when you want to use the embeddings for backpropagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested in anything backprop-related, you can do something like this:
input_mask_expanded = torch.where(attention_mask == 0, 1e9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings - input_mask_expanded, 1)[0] # padded positions become a large negative value
It's the same idea of making all masked tokens very small, but it doesn't modify token_embeddings in the process.
Alex is right.
Look at the hidden states for the strings that go into the tokenizer. For different strings, the padding will have different embeddings.
So, in order to properly pool embeddings, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as it yields the best classification results):
# iterate over the last 4 layers and get embeddings for
# strings without including embeddings from PAD tokens
# (for bert-base, hidden_states has 13 entries: the embedding layer plus
# 12 encoder layers, so indices 9..12 are the last 4 layers)
m = []
for i in range(len(hidden_states[0])):
    m.append([hidden_states[j + 9][i, :, :][tokens["attention_mask"][i] != 0] for j in range(4)])
# average over all token embeddings
means = []
for i in range(len(hidden_states[0])):
    means.append(torch.stack(m[i]).mean(dim=1))
# stack embeddings for all strings
pooled = torch.stack(means).reshape(-1, 1, 3072)
I am trying to filter a single channel 2D image of size 256x256 using unfold to create 16x16 blocks with an overlap of 8. This is shown below:
# I = [256, 256] image
kernel_size = 16
stride = bx / 2  # bx = 16, so stride = 8
patches = I.unfold(1, kernel_size, int(stride)).unfold(0, kernel_size, int(stride))  # size = [31, 31, 16, 16]
I have started to attempt to put the image back together with fold but I’m not quite there yet. I’ve tried to use view to get the image to ‘fit’ the way it’s supposed to but I don’t see how this would preserve the original image. Perhaps I’m overthinking this.
# patches.shape = [31, 31, 16, 16]
patches = filt_data_block.contiguous().view(-1, kernel_size*kernel_size)  # [961, 256]
patches = patches.permute(1, 0)  # size = [256, 961]
Any help would be greatly appreciated. Thanks very much.
I believe you will benefit from using torch.nn.functional.fold and torch.nn.functional.unfold in this case, as these functions are built specifically for images (or any 4D tensor, that is, with shape B x C x H x W).
Let's start with unfolding the image:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image #Used to load a sample image
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
#Load a flower image from sklearn.datasets, crop it to shape 1 X 3 X 256 X 256:
I = torch.from_numpy(load_sample_image('flower.jpg')).permute(2,0,1).unsqueeze(0).type(dtype)[...,128:128+256,256:256+256]
kernel_size = 16
stride = kernel_size//2
I_unf = F.unfold(I, kernel_size, stride=stride)
Here we obtain all the 16x16 image patches with a stride of 8 by using the F.unfold function. This results in a 3D tensor of shape torch.Size([1, 768, 961]), i.e. 961 patches with 768 = 16 x 16 x 3 values in each.
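As a quick sanity check on those numbers (plain arithmetic, not part of the pipeline): a 16-wide window with stride 8 fits (256 - 16) // 8 + 1 = 31 times along each spatial dimension, so there are 31 * 31 = 961 patches, each holding 16 * 16 * 3 = 768 values:
n_windows = (256 - 16) // 8 + 1
print(n_windows, n_windows ** 2, 16 * 16 * 3)  # 31 961 768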
Now, say we wish to fold it back to I:
I_f = F.fold(I_unf,I.shape[-2:],kernel_size,stride=stride)
norm_map = F.fold(F.unfold(torch.ones(I.shape).type(dtype),kernel_size,stride=stride),I.shape[-2:],kernel_size,stride=stride)
I_f /= norm_map
We use F.fold where we tell it the original shape of I, the kernel_size we used to unfold and the stride used. After folding I_unf we will obtain a summation with overlaps. This means that the resulting image will appear saturated. As a result, we need to compute a normalization map which will normalize multiple summation of pixels due to overlaps. A way to do this efficiently is to take a ones tensor and use unfold followed by fold - to mimic the summation with overlaps. This gives us the normalization map by which we normalize I_f to recover I.
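To convince yourself numerically that the content is preserved (my own check, with a small tolerance for float32 accumulation error), the following should print True:
print(torch.allclose(I_f, I, atol=1e-3))  # reconstruction matches the original image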
Now, we wish to plot I_f and I to prove content is preserved:
#Plot I:
plt.imshow(I[0,...].permute(1,2,0).cpu()/255)
#Plot I_f:
plt.imshow(I_f[0,...].permute(1,2,0).cpu()/255)
This whole process also works for single-channel images. One thing to note is that if the spatial dimensions of the image are not divisible by the stride, you will get a norm_map with zeros (at the edges) because some pixels are not reachable, but you can easily handle this case as well, as shown below.
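One simple way to handle that edge case (my own sketch, not part of the original code) is to clamp the normalization map before dividing, so the unreachable border pixels come out as zero instead of NaN:
# clamp zeros in the normalization map so 0/0 at unreachable border pixels
# becomes 0/1e-8 = 0 instead of NaN
norm_map = torch.clamp(norm_map, min=1e-8)
I_f = F.fold(I_unf, I.shape[-2:], kernel_size, stride=stride) / norm_map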
A slightly less elegant solution than that proposed by Gil:
I took inspiration from this post on the Pytorch forums, formatting my image tensor to be of standard shape B x C x H x W (1 x 1 x 256 x 256). Unfolding:
# CREATE THE UNFOLDED IMAGE SLICES
I = image                         # shape [256, 256]
kernel_size = bx                  # 16
stride = int(bx/2)                # 8
I2 = I.unsqueeze(0).unsqueeze(0)  # shape [1, 1, 256, 256]
patches2 = I2.unfold(2, kernel_size, stride).unfold(3, kernel_size, stride)
# shape [1, 1, 31, 31, 16, 16]
Following this, I do some transforms and filtering to my tensor stack. Before doing this I apply a cosine window and normalise:
# NORMALISE AND WINDOW
Pvv = torch.mean(torch.pow(win, 2))*torch.numel(win)*(noise_std**2)
Pvv = Pvv.double()
mean_patches = torch.mean(patches2, (4, 5), keepdim=True)
mean_patches = mean_patches.repeat(1, 1, 1, 1, 16, 16)
window_patches = win.unsqueeze(0).unsqueeze(0).unsqueeze(0).unsqueeze(0).repeat(1, 1, 31, 31, 1, 1)
zero_mean = patches2 - mean_patches
windowed_patches = zero_mean * window_patches
#SOME FILTERING ....
#ADD MEAN AND WINDOW BEFORE FOLDING BACK TOGETHER.
filt_data_block = (filt_data_block + mean_patches*window_patches) * window_patches
The above code works for me, but a mask would be simpler. Next, I prepare my tensor of shape [1, 1, 31, 31, 16, 16] to be transformed back into the original [1, 1, 256, 256]:
# REASSEMBLE THE IMAGE USING FOLD
patches = filt_data_block.contiguous().view(1, 1, -1, kernel_size*kernel_size)
patches = patches.permute(0, 1, 3, 2)
patches = patches.contiguous().view(1, kernel_size*kernel_size, -1)
IR = F.fold(patches, output_size=(256, 256), kernel_size=kernel_size, stride=stride)
IR = IR.squeeze()
This allowed me to create an overlapping sliding window and seamlessly stitch the image back together. Cutting out the filtering makes for an identical image.
I'm not sure whether my question is TensorFlow-specific or about NNs in general, but I have created a CNN using TensorFlow and I'm having trouble understanding why the size of the input to my fully connected layer is what it is.
X = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)
# define model
def complex_model(X, y, is_training):
    # conv layer
    wconv_1 = tf.get_variable('wconv_1', [7, 7, 3, 32])
    bconv_1 = tf.get_variable('bconv_1', [32])
    # affine layer 1
    w1 = tf.get_variable('w1', [26*26*32//4, 1024]) #LINE 13
    b1 = tf.get_variable('b1', [1024])
    # batchnorm params
    bn_gamma = tf.get_variable('bn_gamma', shape=[32]) #scale
    bn_beta = tf.get_variable('bn_beta', shape=[32])   #shift
    # affine layer 2
    w2 = tf.get_variable('w2', [1024, 10])
    b2 = tf.get_variable('b2', [10])
    c1_out = tf.nn.conv2d(X, wconv_1, strides=[1, 1, 1, 1], padding="VALID") + bconv_1
    activ_1 = tf.nn.relu(c1_out)
    mean, var = tf.nn.moments(activ_1, axes=[0, 1, 2], keep_dims=False)
    bn = tf.nn.batch_normalization(activ_1, mean, var, bn_gamma, bn_beta, 1e-6)
    mp = tf.nn.max_pool(bn, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
    affine_in_flat = tf.reshape(mp, [-1, 26*26*32//4])
    affine_1 = tf.matmul(affine_in_flat, w1) + b1
    activ_2 = tf.nn.relu(affine_1)
    affine_2 = tf.matmul(activ_2, w2) + b2
    return affine_2
    #print(affine_2.shape)
In line 13, where I set the value of w1 (marked #LINE 13 above), I would have expected to just put:
w1 = tf.get_variable('w1', [26*26*32, 1024])
However, if I run the code with the line shown above and with
affine_in_flat = tf.reshape(mp, [-1, 26*26*32])
my output size is (16, 10) instead of (64, 10), which is what I would expect given the initialisation below:
x = np.random.randn(64, 32, 32, 3)
with tf.Session() as sess:
    with tf.device("/cpu:0"):  # "/cpu:0" or "/gpu:0"
        tf.global_variables_initializer().run()
        #print("train", x.size, is_training, y_out)
        ans = sess.run(y_out, feed_dict={X: x, is_training: True})
        %timeit sess.run(y_out, feed_dict={X: x, is_training: True})
        print(ans.shape)
        print(np.array_equal(ans.shape, np.array([64, 10])))
Can anybody tell me why I need to divide the size of w1[0] by 4?
Adding print statements for bn and mp I get:
bn: <tf.Tensor 'batchnorm/add_1:0' shape=(?, 26, 26, 32) dtype=float32>
mp: <tf.Tensor 'MaxPool:0' shape=(?, 13, 13, 32) dtype=float32>
Which would seem to be due to the strides=[1, 2, 2, 1] on the max pooling (but to maintain 26, 26 you'd also need padding='SAME').
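To spell out the arithmetic behind the //4 (a quick check in plain Python, using the shapes above):
conv_out = 32 - 7 + 1               # 26: VALID 7x7 conv on a 32x32 input
pool_out = conv_out // 2            # 13: 2x2 max pool with stride 2
print(pool_out * pool_out * 32)     # 5408 nodes feeding the affine layer
print(26 * 26 * 32 // 4)            # 5408, hence the //4 in w1
print(64 * 5408 // (26 * 26 * 32))  # 16: reshaping with the un-pooled size folds
                                    # 4 examples into each row, which is why the
                                    # batch dimension shrank from 64 to 16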
I'm trying this very simple neural net that tells whether a number is odd or even.
Labels: [1, 0] means it's even. I'm using two output neurons because I'm using the softmax function.
My code:
import tensorflow as tf
data_in = [
[1],
[2],
[3]
]
data_lbl = [
[0, 1],
[1, 0],
[0, 1]
]
# HP
learning_rate = 0.1
epochs = 10000
ip = tf.placeholder('float', [None, 1])
labels = tf.placeholder('float', [None, 2])
w1 = tf.Variable(tf.random_normal([1, 2]))
w2 = tf.Variable(tf.random_normal([2, 2]))
l1 = tf.matmul(ip, w1)
l2 = tf.matmul(l1, w2)
l2 = tf.nn.softmax(l2)
loss = tf.reduce_mean((labels - l2)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):
    _, err = sess.run([train, loss], feed_dict={ip: data_in, labels: data_lbl})
print(err)
print(sess.run(l2, feed_dict={ip: [[2], [5], [7]]}))
# [it is, it's not]
# 1 = even
sess.close()
My error is not changing and I'm getting wrong answers. Suggestions?
You have multiple issues here; fixing them should at least give you something that learns (see the sketch after this list):
You don't have any nonlinearities in your network other than the final softmax. You need nonlinearities, as parity is not a linear function.
Your intermediate layers are quite small.
Your training samples are very limited.
You don't have biases.
In addition, parity is a concept that is very hard to learn in a way that generalizes to numbers not seen in the training set.
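For reference, here is a minimal sketch of the question's code with those fixes applied (same TF 1.x API as the question; the hidden-layer size, learning rate, and training range 1..20 are my own arbitrary choices). Even with these fixes, don't expect it to generalize parity far beyond the numbers it was trained on.
import tensorflow as tf
# numbers 1..20, label [1, 0] = even, [0, 1] = odd (same convention as the question)
data_in = [[float(n)] for n in range(1, 21)]
data_lbl = [[1.0, 0.0] if n % 2 == 0 else [0.0, 1.0] for n in range(1, 21)]
ip = tf.placeholder('float', [None, 1])
labels = tf.placeholder('float', [None, 2])
# hidden layer with a bias and a ReLU nonlinearity
w1 = tf.Variable(tf.random_normal([1, 32]))
b1 = tf.Variable(tf.zeros([32]))
l1 = tf.nn.relu(tf.matmul(ip, w1) + b1)
# output layer with a bias; softmax is folded into the loss
w2 = tf.Variable(tf.random_normal([32, 2]))
b2 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(l1, w2) + b2
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(10000):
        _, err = sess.run([train, loss], feed_dict={ip: data_in, labels: data_lbl})
    print(err)
    print(sess.run(tf.nn.softmax(logits), feed_dict={ip: [[2], [5], [7]]}))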