Co-occurrence for set prediction - machine-learning

I'm having trouble with set prediction. I thought I wanted to use co-occurrence to solve this, but now that I've attempted it I'm not sure it's the right tool to use.
I have a database with some data (each column corresponds to a specific item, each row to one set), e.g.:
data:
[[1 0 1 1]
[1 1 1 1]
[1 0 1 1]]
I calculate the co-occurrence matrix:
cooccur_matrix:
[[0 1 3 3]
[1 0 1 1]
[3 1 0 3]
[3 1 3 0]]
And now I have an incomplete set:
target:
[1 0 1 1]
The dot product of my co-occurrence matrix and this is:
prediction:
[6 3 6 6]
But that's not at all what I want. What I'm trying to get back is something like this:
prediction:
[1 0.33 1 1]
Or:
prediction:
[0 0.33 0 0]
Any thoughts on what I'm doing wrong? I'm fairly new to ML concepts, and this seems like a pretty simple problem.

It seems like what I really needed was cosine similarity applied to a co-occurrence matrix. This seems to work well (but was very hard to solve).
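A minimal sketch of one way to read "cosine similarity applied to a co-occurrence matrix", using the data from the question. The normalization and the averaging step are assumptions, not necessarily what the original poster ended up with, and the score for the second item comes out at about 0.58 rather than the 0.33 hoped for above.

```python
import numpy as np

data = np.array([[1, 0, 1, 1],
                 [1, 1, 1, 1],
                 [1, 0, 1, 1]])

# Co-occurrence counts; the diagonal holds how often each item appears.
cooccur = data.T @ data
counts = np.diag(cooccur).astype(float)

# Cosine similarity between item columns: C[i, j] / sqrt(C[i, i] * C[j, j]).
# Assumes every item occurs at least once, otherwise this divides by zero.
item_sim = cooccur / np.sqrt(np.outer(counts, counts))

target = np.array([1, 0, 1, 1])

# Average similarity of each item to the items already in the target set.
scores = item_sim @ target / target.sum()
print(np.round(scores, 2))
```

Unlike the raw dot product, these scores stay between 0 and 1, so items that are already present score 1 and the missing item gets a comparable relative score.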

Related

How does CountVectorizer deal with new words in test data?

I understand how CountVectorizer works in general. It takes word tokens and creates a sparse count matrix of documents (rows) and token counts (columns), that we can use for ML modeling.
However, how does it deal with new words that can presumably show up in test data, that weren't in the training data? Does it just ignore them?
Also, from a modeling standpoint, should the assumption be that words so rare that they never showed up in the training data aren't relevant for any modeling you might perform?
I am assuming you are referring to the scikit-learn CountVectorizer; I don't know of any other myself.
Yes, when new documents are encoded, words that are not part of the vocabulary (created from the training data) are ignored by the count vectorizer.
Example of creating vocabulary: (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
Now use transform on a new document and you can see that the out-of-vocabulary words are ignored:
>>> print(vectorizer.transform(['not in any of the document second second']).toarray())
[[0 1 0 0 0 2 1 0 0]]
With respect to rare words that are not part of the training data, I would agree with your statement that they are not significant for modeling, since we would want to believe that the words most relevant for building a model that generalizes well are already part of the training data.
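If it helps to see which words are being dropped, here is a minimal sketch that lists the out-of-vocabulary tokens of a new document. It reuses the corpus from the example above; the variable names `new_doc` and `unseen` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# build_analyzer() returns the same preprocessing and tokenization
# callable that fit() and transform() use internally.
analyzer = vectorizer.build_analyzer()

new_doc = 'not in any of the document second second'
tokens = analyzer(new_doc)

# Tokens missing from the fitted vocabulary are the ones transform() ignores.
unseen = sorted({t for t in tokens if t not in vectorizer.vocabulary_})
print(unseen)
```

Counting how many tokens fall into `unseen` on a held-out set is one rough way to judge whether the training vocabulary covers the test data well enough.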

OpenCV - how to pass a pattern-matching kernel over binary image?

I am implementing a contour-finding algorithm for pixel-wide contours in binary images. It needs to be robust against deletion of individual pixels (i.e. pixel-wide gaps).
Various attempts at dilation & erosion kernels have not yielded a reliable solution.
Instead, the reliable solution I want to implement is to pass a pattern-matching kernel over the image, which can directly fill in the gaps based on surrounding pixels. For example, when the exact pattern on the left is observed at a location, it is replaced with the pattern on the right (where * means wildcard):
[1 * *] [1 * *]
[* 0 *] ==> [* 1 *]
[* * 1] [* * 1]
[1 0 *] [1 0 *]
[* 0 1] ==> [* 1 1]
[* * *] [* * *]
[* 1 *] [* 1 *]
[* 0 *] ==> [* 1 *]
[* 1 *] [* 1 *]
I would then define the ~14 or so replacements necessary to fill in the possible gaps in each 3x3 window.
This could be implemented in raw Python, but it is likely to be extremely slow without low-level vectorized operations.
Can this be done through OpenCV or some other fast operation?
Thanks to @beaker's comment above, I implemented a solution. Design a kernel with the neighbouring pixels of interest as 0.5 and the center as 1; filtering with it fills in the center with a 1 if it is missing, although some other pixels will become 2. Then clip the values to 1 and you get the desired result.
It needs to be applied independently for each direction of gap, which isn't ideal but still works.
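import cv2
import numpy as np
img_with_gap = np.array(
    [[1,0,0,0,0],
     [0,1,0,0,0],
     [0,0,0,0,0],
     [0,0,0,1,0],
     [0,0,0,0,1]], dtype=np.uint8)
# 0.5 on the two neighbours of interest, 1 on the center
kernel = np.array(
    [[0.5, 0, 0],
     [0  , 1, 0],
     [0  , 0, 0.5]])
# existing contour pixels can reach 2, so clip back to 1
connected_img = np.minimum(cv2.filter2D(img_with_gap, -1, kernel), 1)
print(connected_img)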
An even tighter implementation is to do an exact pattern match by penalizing the pixels that must be zero, clipping to {0,1}, and ensuring that nothing is deleted from the original image:
kernel = np.array([[0.5, -10.0, -10.0],
                   [-10.0, 1, -10.0],
                   [-10.0, -10.0, 0.5]])
connected_img = np.maximum(img_with_gap, np.clip(cv2.filter2D(img_with_gap, -1, kernel), 0, 1))
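The "one pass per direction" idea can also be sketched without OpenCV, using shifted NumPy slices in place of `cv2.filter2D`. This is a swapped-in technique, not the answer's code: the function name `fill_single_pixel_gaps` is hypothetical, and it only covers gaps whose two neighbours sit directly opposite each other (horizontal, vertical, and both diagonals), not the bent patterns from the question.

```python
import numpy as np

def fill_single_pixel_gaps(img):
    """Set a pixel to 1 when two directly opposite neighbours are both 1."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="constant")  # zero border, so edges never match
    out = img.copy()
    # One neighbour offset per line direction; its partner is the negated offset.
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        first = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        second = padded[1 - dy:1 - dy + h, 1 - dx:1 - dx + w]
        out |= first & second  # never removes pixels, only adds them
    return out

img_with_gap = np.array(
    [[1, 0, 0, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]], dtype=np.uint8)

filled = fill_single_pixel_gaps(img_with_gap)
print(filled)
```

Because each direction is a pair of array slices, all four passes stay vectorized, and the remaining bent patterns could be added as further offset pairs.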

What is the first parameter (gradients) of the backward method, in pytorch?

We have the following code from the pytorch documentation:
import torch
from torch.autograd import Variable
x = torch.randn(3)
x = Variable(x, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
    y = y * 2
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)
What exactly is the gradients parameter that we pass into the backward method and based on what do we initialize it?
To fully answer your question would require a somewhat longer explanation that revolves around the details of how backprop or, more fundamentally, the chain rule works.
The short programmatic answer is that the backward method of a Variable computes the gradient with respect to all Variables in the computation graph attached to that Variable (to clarify: if you have a = b + c, then the computation graph recursively points first to b, then to c, then to how those are computed, and so on) and cumulatively stores (sums) these gradients in the .grad attribute of these Variables. When you then call opt.step(), i.e. a step of your optimizer, it updates the values of these Variables by a fraction of that gradient (plain gradient descent subtracts the learning rate times the gradient).
That said, there are two answers when you look at it conceptually. If you want to train a machine learning model, you normally want the gradient with respect to some loss function. In this case, the gradients computed will be such that the overall loss (a scalar value) decreases when applying the step function. In this special case, the gradient passed in at the root is simply 1, a unit step, so that the learning rate alone determines what fraction of the gradients is applied. This means that if you have a loss function and you call loss.backward(), this computes the same as loss.backward(torch.tensor(1.)).
While this is the common use case for backprop in DNNs, it is only a special case of general differentiation of functions. More generally, automatic differentiation packages (autograd in this case, as part of pytorch) can be used to compute gradients of earlier parts of the computation graph with respect to any gradient at a root of whatever subgraph you choose. This is when the keyword argument gradient comes in useful, since you can provide this "root-level" gradient there, even for non-scalar functions!
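Written out, and assuming y = f(x) with Jacobian J (notation introduced here for illustration, not taken from the answer), the tensor v passed to backward weights the rows of J before they are accumulated into x.grad:

```latex
x.\mathrm{grad} \mathrel{+}= J^{\top} v,
\qquad
\left(J^{\top} v\right)_j = \sum_i v_i \,\frac{\partial y_i}{\partial x_j}
```

For a scalar loss, J has a single row and v = 1, which recovers the ordinary gradient of the loss.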
To illustrate, here's a small example:
import torch
import torch.nn as nn
a = nn.Parameter(torch.FloatTensor([[1, 1], [2, 2]]))
b = nn.Parameter(torch.FloatTensor([[1, 2], [1, 2]]))
c = torch.sum(a - b)
c.backward(None) # same as c.backward(); for a scalar c this equals passing torch.tensor(1.)
print(a.grad, b.grad)
prints:
Variable containing:
1 1
1 1
[torch.FloatTensor of size 2x2]
Variable containing:
-1 -1
-1 -1
[torch.FloatTensor of size 2x2]
While
a = nn.Parameter(torch.FloatTensor([[1, 1], [2, 2]]))
b = nn.Parameter(torch.FloatTensor([[1, 2], [1, 2]]))
c = a - b # keep c non-scalar: current PyTorch only accepts a gradient whose shape matches c
c.backward(torch.FloatTensor([[1, 2], [3, 4]]))
print(a.grad, b.grad)
prints:
Variable containing:
1 2
3 4
[torch.FloatTensor of size 2x2]
Variable containing:
-1 -2
-3 -4
[torch.FloatTensor of size 2x2]
and
a = nn.Parameter(torch.FloatTensor([[0, 0], [2, 2]]))
b = nn.Parameter(torch.FloatTensor([[1, 2], [1, 2]]))
c = torch.matmul(a, b)
c.backward(torch.FloatTensor([[1, 1], [1, 1]])) # we compute w.r.t. a non-scalar variable, so the gradient supplied cannot be scalar, either!
print(a.grad, b.grad)
prints
Variable containing:
3 3
3 3
[torch.FloatTensor of size 2x2]
Variable containing:
2 2
2 2
[torch.FloatTensor of size 2x2]
and
a = nn.Parameter(torch.FloatTensor([[0, 0], [2, 2]]))
b = nn.Parameter(torch.FloatTensor([[1, 2], [1, 2]]))
c = torch.matmul(a, b)
c.backward(torch.FloatTensor([[1, 2], [3, 4]])) # we compute w.r.t. a non-scalar variable, so the gradient supplied cannot be scalar, either!
print(a.grad, b.grad)
prints:
Variable containing:
5 5
11 11
[torch.FloatTensor of size 2x2]
Variable containing:
6 8
6 8
[torch.FloatTensor of size 2x2]
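The numbers in the last example can be checked by hand. For c = a @ b, the standard matrix-calculus result is that the gradient reaching a is the supplied gradient times b transposed, and the gradient reaching b is a transposed times the supplied gradient. A small NumPy sketch of that arithmetic (the name `upstream` is just a label for the tensor passed to backward):

```python
import numpy as np

a = np.array([[0., 0.], [2., 2.]])
b = np.array([[1., 2.], [1., 2.]])
upstream = np.array([[1., 2.], [3., 4.]])  # the tensor passed to c.backward(...)

# Gradients of c = a @ b, weighted by the upstream gradient.
grad_a = upstream @ b.T
grad_b = a.T @ upstream

print(grad_a)
print(grad_b)
```

Both results agree with the a.grad and b.grad values printed above.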

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item, given the similarities between items and the user's ratings of items.
I know the equation used in item-based collaborative filtering, where the predicted rating is determined by the item's average rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r}_i + \frac{\sum_j S_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_j S_{i,j}}
My questions are:
If I got the similarities using other approaches (e.g. a content-based approach), can I still use this equation?
Besides, for each user, I only have a list of the user's favourite items, not the actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there any better ways or equations to solve this problem?
Another problem: I wrote some Python code to test the above equation:
import numpy
from scipy import spatial
mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print(mat)
def prediction(u, i):
    r = numpy.mean(mat[:, i])  # average rating of item i
    a = 0.0  # similarity-weighted sum of rating deviations
    b = 0.0  # sum of similarities
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b
for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one is the predicted values. They do not look close; is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
What you are referring to is a typical situation of implicit ratings (i.e. users do not give explicit ratings to items, let's say you just have likes and dislikes).
As for the approaches, you can use neighbourhood models or latent factor models.
I suggest reading this paper, which proposes a well-known machine-learning-based solution to the problem.
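A neighbourhood model for favourites-only data can be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the paper mentioned above: the `likes` matrix is the question's input with every rating replaced by 1, and `item_cosine` and `score_items` are hypothetical helper names. The similarity matrix is passed in as an argument, so a content-based similarity could be substituted for the cosine one.

```python
import numpy as np

# Rows are users, columns are items; 1 means the item is in the user's favourites.
likes = np.array([[0, 1, 1, 1, 0],
                  [1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0]], dtype=float)

def item_cosine(m):
    """Cosine similarity between item columns (assumes no all-zero column)."""
    norms = np.linalg.norm(m, axis=0)
    return (m.T @ m) / np.outer(norms, norms)

def score_items(likes, sim):
    """Score each (user, item) pair as the similarity-weighted share of liked items."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)  # an item should not vote for itself
    return (likes @ sim) / sim.sum(axis=0)

sim = item_cosine(likes)
scores = score_items(likes, sim)
print(np.round(scores, 2))
```

Each score lies between 0 and 1 and can be used to rank the items a user has not yet marked as a favourite.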

mean image filter

I'm starting to learn image filtering and am stumped on a question found on a website: Applying a 3×3 mean filter twice does not produce quite the same result as applying a 5×5 mean filter once. However, a 5×5 convolution kernel can be constructed which is equivalent. What does this kernel look like?
Would appreciate help so that I can understand the subject better. Thanks.
Marcelo's answer is right. Another way of seeing it (easier to think about first in one dimension): we know that the mean filter is equivalent to a convolution with a rectangular window. And we know that convolution is a linear operation, which is also associative.
Now, applying a mean filter M to a signal X can be written as
Y = M * X
where * denotes convolution. Applying the filter twice would then give
Y = M * (M * X) = (M * M) * X = M2 * X
This says that filtering a signal twice with a mean filter is the same as filtering it once with an equivalent filter given by M2 = M * M. This amounts to applying the mean filter to itself, which gives a "smoother" filter (a triangular filter in this case).
The process can be repeated (see first graph here), and it can be shown that the equivalent filter for many repetitions of a mean filter (N convolutions of the rectangular filter with itself) tends to a Gaussian filter. Further, it can be shown that the Gaussian filter has the property you didn't find in the rectangular (mean) filter: two passes of a Gaussian filter are equivalent to another Gaussian filter.
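The M2 = M * M step can be checked numerically. A minimal sketch, relying on the fact that the 3x3 mean filter is separable, so the 2D kernel is the outer product of the 1D result with itself (the variable names are chosen here for illustration):

```python
import numpy as np

box_1d = np.ones(3) / 3                 # 1D mean filter of width 3

# Convolving the box with itself gives the triangular filter [1, 2, 3, 2, 1] / 9.
tri_1d = np.convolve(box_1d, box_1d)

# The equivalent 5x5 kernel for two passes of the 3x3 mean filter.
kernel_5x5 = np.outer(tri_1d, tri_1d)

print(np.round(kernel_5x5 * 81).astype(int))
print(kernel_5x5.sum())
```

The weights sum to 1, as any averaging kernel should.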
3x3 mean:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
3x3 mean twice:
[1 2 3 2 1]
[2 4 6 4 2]
[3 6 9 6 3] * 1/81
[2 4 6 4 2]
[1 2 3 2 1]
How? Each cell contributes indirectly via one or more intermediate 3x3 windows. Consider the set of stage 1 windows that contribute to a given stage 2 computation. The number of such 3x3 windows that contain a given source cell determines the contribution by that cell. The middle cell, for instance, is contained in all nine windows, so its contribution is 9 * 1/9 * 1/9. I don't know if I've explained it that well, so I hope it makes sense to you.
Actually I believe that 3x3 twice should give:
[1 2 3 2 1]
[2 4 6 4 2]
[3 6 9 6 3] * 1/81
[2 4 6 4 2]
[1 2 3 2 1]
The reason is that the sum of all values must be equal to 1.

Resources