Obtaining intermediate layers' activation values given input example(s) - machine-learning

Suppose I have defined my sequential model as follow:
require 'nn'
net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 6, 5, 5)) -- 1 input image channel, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2)) -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5)) -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120)) -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(120, 84))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(84, 10)) -- 10 is the number of outputs of the network (in this case, 10 digits)
net:add(nn.LogSoftMax()) -- converts the output to a log-probability. Useful for classification problems
And here's the model printed:
net
nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output]
  (1): nn.SpatialConvolution(1 -> 6, 5x5)
  (2): nn.ReLU
  (3): nn.SpatialMaxPooling(2x2, 2,2)
  (4): nn.SpatialConvolution(6 -> 16, 5x5)
  (5): nn.ReLU
  (6): nn.SpatialMaxPooling(2x2, 2,2)
  (7): nn.View(400)
  (8): nn.Linear(400 -> 120)
  (9): nn.ReLU
  (10): nn.Linear(120 -> 84)
  (11): nn.ReLU
  (12): nn.Linear(84 -> 10)
  (13): nn.LogSoftMax
}
Simply calling net:forward(input) returns the last layer's output, i.e. the result after LogSoftMax has been applied, which is not what I want. Instead, I would like to get the activations of some of the intermediate layers (e.g. module 6).
So, how can I get the activations of the intermediate layers when feeding an input? That is, I feed an input example to the network and want to extract the activations of the 6th module (the max-pooling layer in this model), not just the final layer.
Thanks

Via net:get(6).output: run a forward pass first with net:forward(input), and then net:get(6).output holds the activations of module 6 from that pass (see the documentation for get and output).

Related

CoreML: exception Espresso exception: "Invalid state": Null output blobs

I run a Keras model inside my iOS application using Core ML, and I have a problem I can't completely understand. I think it is related to the Keras model rather than to the app itself. I'm using 48x48 images; is this size supported by Core ML?
[Espresso::handle_ex_plan] exception=Espresso exception: "Invalid state": Null output blobs [Exception from Layer: 5: sequential/conv2d_3/BiasAdd]
2020-12-15 01:06:31.245711+0100 TSD[41213:1753543] [coreml] Error computing NN outputs -1
2020-12-15 01:06:31.245849+0100 TSD[41213:1753543] [coreml] Failure in -executePlan:error:. Error computing NN outputs.
def cnn_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Conv2D(128, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(128, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(keras.layers.InputLayer(input_shape=(X.shape[1],)))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(CLASSNAME_SIZE, activation='softmax'))
    return model
EDIT 1
This is the problematic layer according to the error message (sequential/conv2d_3/BiasAdd).
I would remove this line,
model.add(keras.layers.InputLayer(input_shape=(X.shape[1],)))
from your Keras model and try again. Everything else seems fine, but that line (an InputLayer added after Flatten) might be messing things up during the model conversion to Core ML.
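If the converter still needs a fixed input shape, one common alternative is to declare it on the first convolutional layer instead of inserting an InputLayer after Flatten. Below is a minimal sketch of that variant, assuming 48x48 single-channel inputs and a hypothetical num_classes argument standing in for CLASSNAME_SIZE; only the first convolution block is shown, the remaining blocks stay as in the question.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def cnn_model(input_shape=(48, 48, 1), num_classes=7):
    model = Sequential()
    # Declare the input shape once, on the first layer; no InputLayer needed later.
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    # ... remaining Conv2D / MaxPooling2D / Dropout blocks as in the question ...
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(num_classes, activation='softmax'))
    return model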

What does "vector augmented with 1" mean?

I am new to machine learning and statistics (well, I studied math at university, but that was about 10-12 years ago).
Could you please explain the meaning of the following passage from page 4 of the paper (page 5 of the book) here ( https://www.researchgate.net/publication/227612766_An_Empirical_Comparison_of_Machine_Learning_Models_for_Time_Series_Forecasting ):
The multilayer perceptron (often simply called neural network) is perhaps the most
popular network architecture in use today both for classification and regression (Bishop
[5]). The MLP is given as follows:
\hat{y} = v_0 + \sum_{j=1}^{N_H} v_j g(w_j^T x′)    (1)
where x′ is the input vector x augmented with 1, i.e. x′ = (1, x^T)^T, w_j is the weight vector for the j-th hidden node, v_0, v_1, ..., v_{N_H} are the weights for the output node, and \hat{y} is the network output. The function g represents the hidden node output, and it is given in terms of a squashing function, for example (and that is what we used) the logistic function: g(u) = 1/(1 + exp(−u)). A related model in the econometrics literature is ...
For instance, we have a vector x = [0.2, 0.3, 0.4, 0.5].
How do I transform it to get the x′ vector augmented with 1?
You just prepend a 1: x′ = (1, x) = [1, 0.2, 0.3, 0.4, 0.5].
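A minimal NumPy sketch of that augmentation, using the vector from the question:

import numpy as np

x = np.array([0.2, 0.3, 0.4, 0.5])
x_aug = np.concatenate(([1.0], x))   # prepend the constant 1 that multiplies the bias weight
print(x_aug)                         # [1.  0.2 0.3 0.4 0.5]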
This is part of the isomorphism between matrices and systems of equations. What you have at the moment is a row that corresponds to the right-hand side of an expression, such as
w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
w2 = ...
w3 = ...
w4 = ...
When we want to solve the system, we need to augment the matrix. This requires adding the coefficient of each w[n] variable. They are trivially all ones:
1*w1 = 0.2*x1 + 0.3*x2 + 0.4*x3 + 0.5*x4
1*w2 = ...
1*w3 = ...
1*w4 = ...
... and that's where we get the augmented matrix. With the variables implied by their position (w by row, x by column), what remains is the coefficients alone, in a nice matrix.

What is the equation for multivariate kernel density estimation techniques?

I was reading about non-parametric kernel density estimation.
http://en.wikipedia.org/wiki/Kernel_density_estimation
For the univariate case, where d = 1, we can write the estimator as in the article.
For multivariate kernel density estimation (KDE), more precisely for d = 3 and X = (x, y, z), can we write the analogous form?
Is this technically correct? Can anyone help with this?
This is very difficult to do on your own, and you really should do this through some package. Nevertheless, the definition is:
f_H(x) = (1/n) \sum_{i=1}^{n} K_H(x - x_i), where
x = (x_1, x_2, ..., x_d)^T and x_i = (x_{i1}, x_{i2}, ..., x_{id})^T, i = 1, 2, ..., n, are d-vectors;
H is the d×d bandwidth (or smoothing) matrix, which is symmetric and positive definite;
K is the kernel function, which is a symmetric multivariate density;
K_H(x) = |H|^{-1/2} K(H^{-1/2} x).
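In practice a package such as scipy.stats.gaussian_kde handles the multivariate case for you, but a minimal NumPy sketch of the definition above (with made-up 3-D sample data and a diagonal bandwidth matrix, purely for illustration) looks like this:

import numpy as np

def kde(x, samples, H):
    # f_H(x) = (1/n) * sum_i |H|^{-1/2} * K(H^{-1/2} (x - x_i)),
    # with K the standard multivariate normal density.
    d = samples.shape[1]
    vals, vecs = np.linalg.eigh(H)                      # H must be symmetric positive definite
    H_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    det_H = np.prod(vals)
    u = (x - samples) @ H_inv_sqrt                      # shape (n, d)
    K = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (len(samples) * np.sqrt(det_H))

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))                        # n = 500 sample points, d = 3
H = np.diag([0.2, 0.2, 0.2])                            # bandwidth matrix
print(kde(np.zeros(3), data, H))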

Logistic Regression and optimal parameters w

When learning logistic regression, we use the negative log likelihood to optimize the parameters w.
So the loss function (negative log likelihood) is L(w).
There is an assertion that the magnitude of the optimal w can go to infinity when the training samples are linearly separable.
I am very confused:
1. What does the magnitude of the optimal w mean?
2. Could you explain why w can go to infinity?
The magnitude of a vector is usually understood to be its norm (Euclidean, for example).
Assume that we do binary classification and the classes are linearly separable. That means there exists w' such that (x1, w') ≥ 0 for x1 from one class and (x2, w') < 0 otherwise. Then consider z = a·w' for some positive a. Clearly (x1, z) ≥ 0 and (x2, z) < 0 (we can multiply the inequalities for w' by a and use the linearity of the dot product), so there are separating hyperplanes (the z's) of unbounded norm (magnitude). Moreover, on separable data, scaling w up only pushes every predicted probability closer to 0 or 1, which keeps decreasing the negative log likelihood, so no finite w is optimal and the optimizer keeps growing the norm of w.
That's why one should add a regularization term.
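A small NumPy sketch of this effect (the toy 1-D data below is made up and clearly separable): scaling a separating weight by larger and larger factors keeps lowering the negative log likelihood, so no finite w minimizes it.

import numpy as np

# Toy 1-D separable data: x < 0 -> class 0, x > 0 -> class 1
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
s = 2 * y - 1                                  # labels recoded as -1 / +1

def neg_log_likelihood(w):
    # Numerically stable form of -sum(y*log(p) + (1-y)*log(1-p))
    # for p = sigmoid(w * x): sum of log(1 + exp(-s * w * x))
    return np.sum(np.logaddexp(0.0, -s * w * X))

for w in (1.0, 10.0, 100.0):
    print(w, neg_log_likelihood(w))
# The loss keeps shrinking toward 0 as w grows; the infimum is never attained.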
Short answer: this is a fundamental characteristic of the log function.
Consider log(x), where x spans (0, 1). The range of values log(x) can take is (-Inf, 0).
More specifically to your question, the log likelihood is given by:
l(w) = y * log(h(x)) + (1 - y) * log(1 - h(x))
where h(x) is the sigmoid function parameterized by w:
h(x) = (1 + exp(-w·x))^-1
For simplicity, consider the case of a training example where y = 1. The likelihood becomes:
l = y * log(h(x)) = log(h(x))
In logistic regression, h(x) is represented by the sigmoid function, which has range (0, 1). Hence the range of l is (log(0), log(1)) = (-Inf, 0), i.e. l spans (-Inf, 0).
The above simplification only considered the y = 1 case. If you consider the entire log likelihood function (i.e. for y = 1 and y = 0), you will see an inverted bowl-shaped cost function. Hence there is an optimum weight that will maximize the log likelihood l, or equivalently minimize the negative log likelihood -l.
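A tiny NumPy check of the range claim, evaluating l = log(h) for y = 1 as h(x) sweeps over (0, 1) (the probe values are arbitrary):

import numpy as np

h = np.array([1e-10, 0.01, 0.5, 0.99, 1.0 - 1e-12])   # possible sigmoid outputs in (0, 1)
print(np.log(h))   # values range from about -23 up toward 0, i.e. within (-Inf, 0)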

Fast array access using Racket FFI

I am trying to write an OpenCV FFI in Racket and have arrived at a point where arrays need to be manipulated efficiently. However, all my attempts to access arrays through the Racket FFI have resulted in very inefficient code. Is there a way to access C arrays quickly using the FFI?
In Racket, this type of manipulation is reasonably fast, i.e.:
(define a-vector (make-vector (* 640 480 3)))
(time (let loop ([i (- (* 640 480 3) 1)])
        (when (>= i 0)
          ;; invert each pixel channel-wise
          (vector-set! a-vector i (- 255 (vector-ref a-vector i)))
          (loop (- i 1)))))
-> cpu time: 14 real time: 14 gc time: 0
Now, in OpenCV, there is a struct called IplImage that looks like this:
typedef struct _IplImage
{
    int imageSize;     /* sizeof(IplImage) */
    ...
    char *imageData;   /* Pointer to aligned image data. */
} IplImage;
The struct is defined in Racket as follows:
(define-cstruct _IplImage
  ([imageSize _int]
   ...
   [imageData _pointer]))
Now we load an image using cvLoadImage function as follows:
(define img
  (ptr-ref
   (cvLoadImage "images/test-image.png" CV_LOAD_IMAGE_COLOR)
   _IplImage))
The pointer imageData can be accessed with: (define data (IplImage-imageData img))
Now, we want to manipulate data, and the most efficient way I could come up with was by using pointers:
(time (let loop ([i (- (* width height channels) 1)]) ;; same 640 480 3
        (when (>= i 0)
          ;; invert each pixel channel-wise
          (ptr-set! data _ubyte i (- 255 (ptr-ref data _ubyte i)))
          (loop (- i 1)))))
-> cpu time: 114 real time: 113 gc time: 0
This is very slow, compared to the speed of native Racket vectors.
I also tried other ways, such as _array and _cvector, which don't even come close to the speed of using pointers. The one exception was writing a first-class function in C that takes a function and runs it over the whole array; this C function is compiled into a library and bound in Racket using the FFI, and Racket procedures can then be passed to it and applied to all elements of the array. Its speed was the same as with pointers, but still not sufficient to continue porting the OpenCV library to Racket.
Is there a better way to do this?
I tried the approach suggested by Eli and it worked out! The idea is to use a bytestring. Since in this case the size of the array is known, (make-sized-byte-string cptr length) can be used:
(define data (make-sized-byte-string (IplImage-imageData img)
(* width height channels)))
This results in run times close to Racket's native vectors:
(time (let loop ([i (- (* 640 480 3) 1)])
        (when (>= i 0)
          ;; invert each pixel channel-wise
          (bytes-set! data i (- 255 (bytes-ref data i)))
          (loop (- i 1)))))
-> cpu time: 18 real time: 18 gc time: 0
Thank you, Eli.
It would probably be better to set the whole thing using a bytestring (via _bytes), but that's a very rough guess. It would be much better to ask this question on the mailing list...
