How does correlation work for an even-sized filter in this example? - image-processing

(I know a question like this exists, but I wanted help with a specific example)
If the linear filter has even dimensions, how is the "center" defined? i.e. in the following scenario:
filter = np.array([[a, b],
[c, d]])
and the image was:
image = np.array([[0, 1, 0],
[1, 1, 1],
[0, 1, 0]])
what would be the result of correlation of the image with the linear filter?

Which element of an even-sized filter is treated as the origin is an arbitrary choice, and implementations differ. That said, a and d are the two most likely choices, because they treat the two image dimensions the same way.
For example, MATLAB's imfilter (which implements correlation, not convolution) does the following:
f = [1,2;4,8];
img = [0,1,0;1,1,1;0,1,0];
imfilter(img,f,'same')
ans =
14 13 4
11 7 1
2 1 0
meaning that a is the origin of the kernel in this case. Other implementations might make a different choice.
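The effect of the origin choice can be sketched in NumPy. This is a minimal illustration, not MATLAB's implementation: the helper name correlate_same and its origin parameter are assumptions made here, and zero padding is assumed at the borders.

```python
import numpy as np

def correlate_same(image, kernel, origin=(0, 0)):
    """Zero-padded correlation with a 'same'-sized output.

    `origin` is the (row, col) index of the kernel element that is
    aligned with each output pixel (a hypothetical parameter).
    """
    kh, kw = kernel.shape
    oy, ox = origin
    h, w = image.shape
    # Pad so that kernel[oy, ox] sits on top of image[i, j].
    padded = np.pad(image, ((oy, kh - 1 - oy), (ox, kw - 1 - ox)))
    out = np.zeros((h, w), dtype=np.result_type(image, kernel))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

f = np.array([[1, 2], [4, 8]])
img = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

result_a = correlate_same(img, f, origin=(0, 0))  # origin at a (top-left)
result_d = correlate_same(img, f, origin=(1, 1))  # origin at d (bottom-right)
print(result_a)
print(result_d)
```

With origin=(0, 0) this reproduces the imfilter output above; with origin=(1, 1) the same values appear shifted by one pixel down and to the right.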

How to deal with array of string features in traditional machine learning?

Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
Age is pretty straightforward since it is already numerical.
Job can be hashed / indexed since it is a categorical variable.
Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right hand side. If friends size is larger than 5, then the first 5 elements are selected.
Approach 1: Hash and Stack
dataframe after feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third record:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0, 0] 1
Both records have the same set of friends, but in a different order, which results in different feature vectors even though they should be the same.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying feature transform with ordering we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [42, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? Well, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4. This would interfere with training.
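The transform used in Approach 2 can be sketched as follows. The hash values are the made-up ones from the mapping above, and the function name hash_order_stack is an assumption for illustration only.

```python
# Hypothetical hash values copied from the mapping in the question.
HASHES = {
    "engineer": 42069,
    "manager": 250,
    "World of Warcraft": 9001,
    "Netflix": 14,
    "9gag": 9,
}

def hash_order_stack(age, job, friends, max_len=5):
    """Hash the friends, keep the first max_len, sort descending, zero-pad."""
    codes = [HASHES[f] for f in (friends or [])][:max_len]
    codes.sort(reverse=True)
    codes += [0] * (max_len - len(codes))
    return [age, HASHES[job]] + codes

row1 = hash_order_stack(23, "engineer", ["World of Warcraft", "Netflix", "9gag"])
row3 = hash_order_stack(26, "engineer", ["Netflix", "9gag", "World of Warcraft"])
row4 = hash_order_stack(42, "manager", ["Netflix", "9gag"])

# Netflix (14) lands at index 3 in rows 1 and 3, but at index 2 in row 4.
print(row1.index(14), row3.index(14), row4.index(14))
```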
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns where each column is responsible for the hash of the respective vocab element.
Approach 4: One-Hot-Encoding and then Sum
How about this? Convert each hash to one-hot-vector, and add up all these vectors.
In this case, the feature in row one for example would look like this:
[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size-8987)]
Where 01x8 denotes a row of 8 zeros.
The problem with this approach is that these vectors will be very large and sparse.
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
Limitation: requires using deep learning frameworks. I want something which works with traditional machine learning. I want to do logistic regression or decision tree.
What are your thoughts on this?
As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.
For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
I'll also throw a hashing variant into the mix:
Bloom Filter
You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
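A minimal sketch of the Bloom filter idea, using only the standard library. The function name bloom_bits and the sizes num_bits and num_hashes are assumptions chosen for illustration; in practice they would be tuned to the vocabulary size and the acceptable collision rate.

```python
import hashlib

def bloom_bits(items, num_bits=64, num_hashes=3):
    """Return a fixed-length 0/1 list: the Bloom filter of `items`."""
    bits = [0] * num_bits
    for item in items or []:  # NULL / None gives an all-zero filter
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits

row1 = bloom_bits(["World of Warcraft", "Netflix", "9gag"])
row3 = bloom_bits(["Netflix", "9gag", "World of Warcraft"])
row4 = bloom_bits(["Netflix", "9gag"])
empty = bloom_bits(None)

print(row1 == row3)  # the order of friends does not matter
```

Each bit can be fed to a logistic regression or a decision tree as one binary feature. Every bit set for row 4 is also set for rows 1 and 3, so shared friends map to shared features.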
First, there is no definitive answer to this problem. You presented five alternatives and all five are valid; it all depends on the dataset you are using.
Considering this, I will list the options that I find most advantageous. For me, option 5 is the best, but since you want to use traditional machine learning techniques, I will discard it. So I would go for option 4, provided you have the hardware to deal with it. If not, I would try approach 2. As you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4; however, that won't be a problem if you have enough training data (again, it all depends on the data available). Even if I ran into problems with this approach, I would apply a data augmentation technique before discarding it.
Option 1 seems to me the worst: it carries a high risk of overfitting and certainly uses a lot of computational resources.
Hope this helps!
The limitations of Approach 1 (Hash and Stack) and Approach 2 (Hash, Order, and Stack) disappear if the output of the hashing function is used as an index into a sparse vector whose entries are set to 1, rather than as the value stored at each position of the vector.
Then, whenever "World of Warcraft" is in friends array, the feature vector will have a value of 1 in position 9001, regardless of the position of "World of Warcraft" in friends array (limitation of approach 1) and regardless of the existence of other elements in friends array (limitation of approach 2). If "World of Warcraft" is not in friends array, then the value of features vector in position 9001 will most likely be 0 (look up hashing trick collisions to learn more).
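The indexing idea described above can be sketched like this. The bucket count and the use of zlib.crc32 as the hashing function are assumptions for illustration; any stable hash would do, and a dense list stands in for what would normally be a sparse vector.

```python
import zlib

def hashed_multi_hot(friends, num_buckets=1024):
    """Set position hash(friend) % num_buckets to 1 for every friend."""
    vec = [0] * num_buckets
    for friend in friends or []:
        index = zlib.crc32(friend.encode("utf-8")) % num_buckets
        vec[index] = 1  # collisions simply share a position
    return vec

v1 = hashed_multi_hot(["World of Warcraft", "Netflix", "9gag"])
v3 = hashed_multi_hot(["Netflix", "9gag", "World of Warcraft"])
v4 = hashed_multi_hot(["Netflix", "9gag"])

print(v1 == v3)  # same set of friends, same vector
```

Here the position of "Netflix" depends only on its hash, so it is the same in every row whether or not other friends are present.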
Using a word2vec representation as the feature values and then doing supervised classification can also be a good idea.

How to efficiently find correspondences between two point sets without nested for loop in Pytorch?

I have two point sets (tensors) A and B with the following shapes:
A.size() >>(50, 3) , example: [ [0, 0, 0], [0, 1, 2], ..., [1, 1, 1]]
B.size() >>(10, 3)
where the first dimension is the number of points and the second dimension holds the coordinates (x, y, z).
To some extent, the question can be simplified to "finding common elements between two tensors". Is there a quick way to do this without a nested loop?
You can quickly compute all the 50x10 distances using:
d2 = ((A[:, None, :] - B[None, ...])**2).sum(dim=2)
Once you have all the pairwise squared distances, you can select "similar" pairs whose squared distance is below a threshold thr:
(d2 < thr).nonzero()
This returns pairs (a-idx, b-idx) of "similar" points.
If you want to match the points exactly, you can instead do:
((A[:, None, :] == B[None, ...]).all(dim=2)).nonzero()
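To make the broadcasting shapes concrete, here is a small worked example. It is written with NumPy rather than PyTorch so that it runs without a deep learning framework. The indexing is the same; axis=2 and np.argwhere play the roles of dim=2 and .nonzero(). The point values are made up for illustration.

```python
import numpy as np

A = np.array([[0, 0, 0], [0, 1, 2], [1, 1, 1], [2, 2, 2]])  # 4 points
B = np.array([[1, 1, 1], [5, 5, 5], [0, 1, 2]])             # 3 points

# A[:, None, :] has shape (4, 1, 3) and B[None, :, :] has shape (1, 3, 3),
# so the difference broadcasts to (4, 3, 3): one row per (a, b) pair.
d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)      # shape (4, 3)

# Exact matches: all three coordinates must be equal.
exact = np.argwhere((A[:, None, :] == B[None, :, :]).all(axis=2))
print(exact)  # each row is (index into A, index into B)
```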

Simple RNN example showing numerics

I'm trying to understand RNNs and I would like to find a simple example that actually shows the one-hot vectors and the numerical operations. Preferably conceptual, since actual code may make it even more confusing. Most examples I find on Google just show boxes with loops coming out of them, and it's really difficult to understand what exactly is going on. In the rare cases where they do show the vectors, it's still difficult to see how they are getting the values.
For example, I don't know where the values are coming from in this picture: https://i1.wp.com/karpathy.github.io/assets/rnn/charseq.jpeg
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y, while a hidden sequence h stores the network's dynamic state. At timestep i, x(i) ∊ ℝ^M, h(i) ∊ ℝ^N and y(i) ∊ ℝ^P are real-valued vectors of dimension M, N and P, corresponding to the input, hidden and output values respectively. The RNN changes its state and emits output based on the state equations:
h(t) = tanh(Wxh * [x(t); h(t-1)]), where Wxh is a linear map ℝ^(M+N) ↦ ℝ^N, * is matrix multiplication and ; is concatenation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), multiply the concatenated vector (of shape M+N) by Wxh (of shape (M+N, N)), and apply a tanh non-linearity to each element of the resulting vector (of shape N).
y(t) = sigmoid(Why * h(t)), where Why is a linear map ℝ^N ↦ ℝ^P. Concretely, you multiply h(t) (of shape N) by Why (of shape (N, P)) to obtain a P-dimensional vector, to which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with h(-1), we get [x(t); h(t-1)] = [1, 0, 0, 0, 0, 0, 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e. multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v1, v2, v3], where vi = Σj uj Wji = u1 W1i + u2 W2i + ... + u7 W7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v1), tanh(v2), tanh(v3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process: multiply h(0) with Why (i.e. a 3-dimensional vector with a 3 by 4 matrix), get a vector s = [s1, s2, s3, s4], apply sigmoid on s and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make the notation harder to follow.
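The steps above can be sketched in a few lines of NumPy. The weights are random here (an assumption, since the real values behind the picture are unknown), so the numbers will not match the picture; only the shapes and the order of operations are the point. Bias terms are left out, as in the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, P = 4, 3, 4                       # input, hidden and output sizes
W_xh = rng.normal(size=(M + N, N))      # random stand-in for Wxh
W_hy = rng.normal(size=(N, P))          # random stand-in for Why

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_forward(xs, W_xh, W_hy):
    """Run the two state equations over a sequence of input vectors."""
    h = np.zeros(W_xh.shape[1])         # h(t) = 0 for t < 0
    hs, ys = [], []
    for x in xs:
        u = np.concatenate([x, h])      # [x(t); h(t-1)], length M + N
        h = np.tanh(u @ W_xh)           # new hidden state, length N
        y = sigmoid(h @ W_hy)           # output, length P
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# One-hot inputs for a four-step sequence over a 4-symbol vocabulary.
xs = np.eye(M)[[0, 1, 2, 2]]
hs, ys = rnn_forward(xs, W_xh, W_hy)
print(hs.shape, ys.shape)               # (4, 3) (4, 4)
```

Because x(0) is one-hot and the initial state is zero, h(0) is simply tanh of the first row of W_xh, which is where values such as [0.3, -0.1, 0.9] would come from for a trained matrix.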

How to correctly implement dropout for convolution in TensorFlow

According to the original paper on Dropout, this regularisation method can be applied to convolution layers, often improving their performance. The TensorFlow function tf.nn.dropout supports this through a noise_shape parameter, which lets the user choose which parts of the tensor drop out independently. However, neither the paper nor the documentation gives a clear explanation of which dimensions should be kept independently, and the TensorFlow explanation of how noise_shape works is rather unclear:
only dimensions with noise_shape[i] == shape(x)[i] will make independent decisions.
I would assume that for a typical CNN layer output of the shape [batch_size, height, width, channels] we don't want individual rows or columns to drop out by themselves, but rather whole channels (which would be equivalent to a node in a fully connected NN) independently of the examples (i.e. different channels could be dropped for different examples in a batch). Am I correct in this assumption?
If so, how would one go about implementing dropout with such specificity using the noise_shape parameter? Would it be:
noise_shape=[batch_size, 1, 1, channels]
or:
noise_shape=[1, height, width, 1]
From the tf.nn.dropout documentation:
For example, if shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n], each batch and channel component will be kept independently and each row and column will be kept or not kept together.
The code may help explain this.
noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
# uniform [keep_prob, 1.0 + keep_prob)
random_tensor = keep_prob
random_tensor += random_ops.random_uniform(noise_shape,
                                           seed=seed,
                                           dtype=x.dtype)
# 0. if [keep_prob, 1.0) and 1. if [1.0, 1.0 + keep_prob)
binary_tensor = math_ops.floor(random_tensor)
ret = math_ops.div(x, keep_prob) * binary_tensor
ret.set_shape(x.get_shape())
return ret
The line random_tensor += ... supports broadcasting. When noise_shape[i] is set to 1, all elements along that dimension receive the same random value in [0, 1). So with noise_shape=[k, 1, 1, n], each row and column in a feature map will be kept or dropped together. On the other hand, each example in the batch and each channel receives a different random value, so each of them is kept or dropped independently.
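The broadcasting behaviour can be checked with a small NumPy imitation of the snippet above. This is a sketch that mirrors the quoted logic, not TensorFlow itself; the function name dropout_with_noise_shape is an assumption.

```python
import numpy as np

def dropout_with_noise_shape(x, keep_prob, noise_shape, rng):
    """NumPy imitation of the quoted dropout logic."""
    # Uniform in [keep_prob, 1.0 + keep_prob), drawn with shape noise_shape.
    random_tensor = keep_prob + rng.uniform(size=noise_shape)
    # 0.0 where the draw is below 1.0, otherwise 1.0.
    binary_tensor = np.floor(random_tensor)
    # The mask broadcasts over every dimension where noise_shape is 1.
    return x / keep_prob * binary_tensor

rng = np.random.default_rng(0)
batch, height, width, channels = 2, 4, 4, 3
x = np.ones((batch, height, width, channels))

out = dropout_with_noise_shape(x, 0.5, (batch, 1, 1, channels), rng)
# Each (example, channel) feature map is either all 0 or all 1 / keep_prob.
print(out[0, :, :, 0])
```

With noise_shape=(batch, 1, 1, channels), whole channels drop out per example, which is the first of the two options in the question.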

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item, given the similarities between items and the user's ratings of items.
I know that in the item-based collaborative filtering equation, the predicted rating is determined by the overall rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r_i} + \frac{\sum S_{i,j} (r_{u,j} - \bar{r_j})}{\sum S_{i,j}}
My question is,
If I obtained the similarities using another approach (e.g. a content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not the actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there better ways or equations to solve this problem?
Another problem: I wrote some Python code to test the above equation. The code is:
import numpy
from scipy import spatial
# Rows are users, columns are items.
mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print(mat)
def prediction(u, i):
    # Mean rating of item i (zeros are counted as ratings here).
    r = numpy.mean(mat[:, i])
    a = 0.0  # similarity-weighted sum of rating deviations
    b = 0.0  # sum of similarities
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b
for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one contains the predicted values. They do not look close. Is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
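As a sketch of how the equation from the question can take its similarities from any source, the prediction below accepts a precomputed item-item matrix S. The content feature matrix F is entirely hypothetical, and the helper names are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Cosine similarity between the columns of X (one column per item)."""
    unit = X / np.linalg.norm(X, axis=0, keepdims=True)
    return unit.T @ unit

def predict_rating(R, S, u, i):
    """Item-based prediction for user u and item i with similarities S."""
    item_means = R.mean(axis=0)
    others = np.arange(R.shape[1]) != i      # every item except i
    weights = S[i, others]
    deviations = R[u, others] - item_means[others]
    # Assumes the similarities to item i do not sum to zero.
    return item_means[i] + (weights @ deviations) / weights.sum()

# Ratings matrix from the question: rows are users, columns are items.
R = np.array([[0, 5, 5, 5, 0],
              [5, 0, 5, 0, 5],
              [5, 0, 5, 5, 0],
              [5, 5, 0, 5, 0]], dtype=float)

# Similarities from ratings, as in the question's code.
S_ratings = cosine_similarity_matrix(R)

# Hypothetical content features: rows are tags, columns are items.
F = np.array([[1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]], dtype=float)
S_content = cosine_similarity_matrix(F)

pred_ratings = predict_rating(R, S_ratings, 0, 0)
pred_content = predict_rating(R, S_content, 0, 0)
print(round(pred_ratings, 2), round(pred_content, 2))
```

Only the matrix S changes between the two calls; the prediction formula itself is unchanged.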
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
What you are referring to is a typical case of implicit ratings (i.e. users do not give explicit ratings to items; say you only have likes and dislikes).
As for the approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning-based solution to the problem.
