Just a simple question about image convolution boundary conditions. As everyone knows, there are several ways to handle image boundaries during convolution, among which the symmetric (mirror) condition is widely used. My question is: where do we put the "mirror" when performing convolution? More specifically, consider the following example:
The image matrix is [1 2 3] and the kernel is [1 1 1 1 1]. Should the mirrored image for this kernel be [3 2 1 2 3 2 3] or [2 1 1 2 3 3 2]?
Each edge needs to be mirrored independently, with the edge pixel itself repeated, so your second example, [2 1 1 2 3 3 2], is the correct one.
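For reference, NumPy implements both conventions in np.pad, so you can check them directly; this minimal sketch pads the example image by two pixels on each side (the kernel radius):

import numpy as np

img = np.array([1, 2, 3])

# Edge pixel repeated ("half-sample" symmetry) -- what the answer describes:
print(np.pad(img, 2, mode='symmetric'))  # [2 1 1 2 3 3 2]

# Edge pixel not repeated ("whole-sample" symmetry):
print(np.pad(img, 2, mode='reflect'))    # [3 2 1 2 3 2 3]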
(I know a question like this exists, but I wanted help with a specific example)
If the linear filter has even dimensions, how is the "center" defined? i.e. in the following scenario:
filter = np.array([[a, b],
                   [c, d]])
and the image was:
image = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]])
what would be the result of correlation of the image with the linear filter?
Which element of an even-sized filter is considered the origin is an arbitrary choice, and each implementation makes its own. That said, a and d are the two most likely choices, since they treat the two image dimensions in the same way.
For example, MATLAB's imfilter (which implements correlation, not convolution) does the following:
f = [1,2;4,8];
img = [0,1,0;1,1,1;0,1,0];
imfilter(img,f,'same')
ans =
14 13 4
11 7 1
2 1 0
meaning that a is the origin of the kernel in this case. Other implementations might make a different choice.
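As a cross-check in Python, scipy.ndimage makes the other choice by default (for an even-sized kernel the origin falls on d, at index [1, 1]), and its origin parameter shifts it; if that is right, the following sketch reproduces the MATLAB output above:

import numpy as np
from scipy import ndimage

f = np.array([[1, 2], [4, 8]])
img = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

# origin=-1 shifts the kernel origin from d (scipy's default for even sizes)
# to a, matching MATLAB's imfilter; mode='constant' zero-pads the borders.
print(ndimage.correlate(img, f, mode='constant', cval=0, origin=-1))
# [[14 13  4]
#  [11  7  1]
#  [ 2  1  0]]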
I am working on a dataset with a feature that has multiple categories for a single example.
The feature looks like this:
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn.
Now I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to the similar question above. But there are around 2000 categories, which results in sparse, very high-dimensional encoded data.
Is there any other encoding that can be used, or any other possible solution to this problem? Thanks.
Given an incredibly sparse array, one could use a dimensionality reduction technique such as PCA (principal component analysis) to reduce the feature space to the top k components that best describe the variance.
Assuming X is the 2000-column output of MultiLabelBinarizer:
from sklearn.decomposition import PCA

k = 5
model = PCA(n_components=k, random_state=666)
model.fit(X)
components = model.transform(X)  # PCA has no predict(); transform() projects X onto the top k components
You can then use the top k components as a lower-dimensional feature space that explains a large portion of the variance of the original one.
If you want to see how well the smaller feature space describes the variance, inspect
model.explained_variance_ratio_
which gives the fraction of the total variance captured by each component.
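For context, here is a minimal end-to-end sketch on toy data (the category names and the choice of two components are made up for illustration):

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.decomposition import PCA

rows = [
    ["Category1", "Category2"],
    ["Category2", "Category9"],
    ["Category1", "Category9"],
    ["Category2"],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(rows)        # dense 0/1 matrix, one column per category

pca = PCA(n_components=2, random_state=666)
X_reduced = pca.fit_transform(X)   # rows projected onto the top 2 components

print(pca.explained_variance_ratio_.sum())  # fraction of variance retained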
In many cases where a column with many categories generated too many features, I opted for binary encoding, and it worked out fine most of the time; it is worth a shot for you.
Imagine you have 9 categories, labeled 1 to 9. Binary encoding them gives:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Since 2^11 = 2048 and you have around 2000 categories, you can reduce them to just 11 feature columns instead of many more (for example, 1999 in the case of one-hot)!
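For illustration, here is a minimal NumPy sketch of the idea (the category_encoders package also ships a BinaryEncoder for single-valued columns). Note that for a list of categories per row you would additionally have to combine the per-category codes somehow, which loses information:

import numpy as np

def binary_encode(codes, n_bits):
    # Map integer category ids (1..N) to n_bits-wide binary rows,
    # most significant bit first, as in the table above.
    codes = np.asarray(codes)
    return (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1

print(binary_encode([1, 2, 9], 4))
# [[0 0 0 1]
#  [0 0 1 0]
#  [1 0 0 1]]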
I also encountered the same problem, and I solved it using CountVectorizer from sklearn.feature_extraction.text with binary=True, i.e. CountVectorizer(binary=True).
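A minimal sketch of that approach, using a callable analyzer so each row's list of categories is treated as an already-tokenized document (the category names are made up):

from sklearn.feature_extraction.text import CountVectorizer

rows = [
    ["Category1", "Category2", "Category4"],
    ["Category2", "Category9"],
]

vec = CountVectorizer(analyzer=lambda row: row, binary=True)
X = vec.fit_transform(rows)  # scipy sparse 0/1 matrix, one column per category

print(vec.get_feature_names_out())  # ['Category1' 'Category2' 'Category4' 'Category9']
print(X.toarray())                  # [[1 1 1 0], [0 1 0 1]]

Unlike MultiLabelBinarizer's default, the result stays in sparse format, which keeps memory use manageable with 2000 categories.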
I am trying to compute the similarity between n entities described by entity_id, type_of_order, and total_value.
An example of the data might look like:
NR  entity_id  type_of_order  total_value
 1      1            A             10
 2      1            B             90
 3      1            C             70
 4      2            B             20
 5      2            C             40
 6      3            A             10
 7      3            B             50
 8      3            C             20
 9      4            B             50
10      4            C             80
My question is: what is a good way of measuring the similarity between, for example, entity_id 1 and 2, with regard to the type_of_order and the total_value for that type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
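To make the two points above concrete, here is a skeleton in Python; every constant in it is a placeholder for a domain decision only you can make (the distances between order types, the linearity of total_value, and the weights combining the two features are all assumptions, not recommendations):

# All numbers below are placeholders for domain decisions.
ORDER_TYPE_DIST = {
    ("A", "B"): 1.0,
    ("B", "C"): 1.0,
    ("A", "C"): 2.0,  # could just as well be 1.0, or |1 - c|, etc.
}
W_TYPE, W_VALUE = 1.0, 0.1  # relative weights of the two features

def order_type_distance(t1, t2):
    if t1 == t2:
        return 0.0
    return ORDER_TYPE_DIST[tuple(sorted((t1, t2)))]

def row_distance(row1, row2):
    # row = (type_of_order, total_value)
    d_type = order_type_distance(row1[0], row2[0])
    d_value = abs(row1[1] - row2[1])  # assumes total_value is on a linear scale
    return W_TYPE * d_type + W_VALUE * d_value

print(row_distance(("A", 10), ("B", 20)))  # 2.0 under these placeholder choices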
I do regression analysis with multiple features. The number of features is 20-23. For now, I check each feature's correlation with the output variable. Some features show a correlation coefficient close to 1 or -1 (highly correlated), while others show a coefficient near 0. My question is: do I have to remove a feature if its correlation coefficient is close to 0? Or can I keep it, with the only downside being that it will have little or no noticeable effect on the regression model? Or is removing such features obligatory?
In short
High (absolute) correlation between a feature and the output implies that this feature should be valuable as a predictor
Lack of correlation between a feature and the output implies nothing
More details
Pairwise correlation only shows you how one variable relates to another; it says nothing about how a feature interacts with the rest. So if your model is not trivial, you should not drop variables just because they are not correlated with the output. I will give you an example that shows why.
Consider the following sample: we have two features (X, Y) and one output value (Z, which is either 1 or 0):
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... should we drop both features? One of them? If we drop either variable, our problem becomes completely impossible to model! The "magic" lies in the fact that there is actually a "hidden" relation in the data. Compute |X-Y| for each row:
|X-Y| = [0, 1, 2, 1, 0, 1, 2, 1, 0]
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)
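As a sanity check, all of this is easy to verify with NumPy (a small sketch reproducing the table above):

import numpy as np

X = np.repeat([1, 2, 3], 3)  # [1 1 1 2 2 2 3 3 3]
Y = np.tile([1, 2, 3], 3)    # [1 2 3 1 2 3 1 2 3]
Z = (X == Y).astype(float)   # the output column of the table

print(np.corrcoef(X, Z)[0, 1])              # 0.0
print(np.corrcoef(Y, Z)[0, 1])              # 0.0
print(np.corrcoef(np.abs(X - Y), Z)[0, 1])  # -0.8528...
print(np.array_equal(Z, 1 - np.sign(np.abs(X - Y))))  # True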
I am starting to learn image filtering and am stumped on a question found on a website: applying a 3×3 mean filter twice does not produce quite the same result as applying a 5×5 mean filter once. However, a 5×5 convolution kernel can be constructed which is equivalent. What does this kernel look like?
Would appreciate help so that I can understand the subject better. Thanks.
Marcelo's answer is right. Here is another way of seeing it (it is easier to think about it in one dimension first): we know that the mean filter is equivalent to convolution with a rectangular window, and we know that convolution is a linear operation, which is also associative.
Now, applying a mean filter M to a signal X can be written as
Y = M * X
where * denotes convolution. Applying the filter twice would then give
Y = M * (M * X) = (M * M) * X = M2 * X
This says that filtering a signal twice with a mean filter is the same as filtering it once with the equivalent filter M2 = M * M. This amounts to applying the mean filter to itself, which gives a "smoother" filter (a triangular filter in this case).
The process can be repeated (see the first graph here), and it can be shown that the equivalent filter for many repetitions of a mean filter (N convolutions of the rectangular filter with itself) tends to a Gaussian filter. Further, it can be shown that the Gaussian filter has the property you did not find in the rectangular (mean) filter: two passes of a Gaussian filter are equivalent to another Gaussian filter.
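A quick way to see the one-dimensional case with NumPy (a minimal sketch):

import numpy as np

m = np.ones(3) / 3      # 1-D mean (rectangular) filter
m2 = np.convolve(m, m)  # M * M: the equivalent filter for two passes
print(m2 * 9)           # [1. 2. 3. 2. 1.] -- a triangular filter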
3x3 mean:
[1 1 1]
[1 1 1] * 1/9
[1 1 1]
3x3 mean twice:
[1 2 3 2 1]
[2 4 6 4 2]
[3 6 9 6 3] * 1/81
[2 4 6 4 2]
[1 2 3 2 1]
How? Each cell contributes indirectly via one or more intermediate 3x3 windows. Consider the set of stage-1 windows that contribute to a given stage-2 computation. The number of such 3x3 windows that contain a given source cell determines that cell's contribution. The middle cell, for instance, is contained in all nine windows, so its contribution is 9 * 1/9 * 1/9 = 9/81. I don't know if I've explained it that well, so I hope it makes sense to you.
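The same kernel can be obtained directly by convolving the 3x3 mean filter with itself; a short sketch using scipy.signal:

import numpy as np
from scipy.signal import convolve2d

m = np.ones((3, 3)) / 9  # 3x3 mean filter
m2 = convolve2d(m, m)    # full convolution -> the 5x5 equivalent kernel
print(np.round(m2 * 81).astype(int))
# [[1 2 3 2 1]
#  [2 4 6 4 2]
#  [3 6 9 6 3]
#  [2 4 6 4 2]
#  [1 2 3 2 1]]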
Just to confirm the 1/81 factor in the kernel above: it is forced by normalization, since the 25 matrix entries sum to 81 and the weights of a mean filter must sum to 1.