Why is the VC dimension of a 2D perceptron 3?

If the three points lie on one line, as below, how can a 2D perceptron classify these 3 points?

The VC dimension of a classifier is determined in the following way:

VC = 1
while True:
    some_placement_shattered = False
    for point_placement in all possible placements of VC + 1 points:
        all_correct = True
        for class_assignment in every way the classes could be assigned to the points:
            if the classifier can't classify everything correctly:
                all_correct = False
                break
        if all_correct:
            some_placement_shattered = True
            break
    if some_placement_shattered:
        VC += 1
    else:
        break
So there only has to be one way to place three points such that every possible class assignment for that placement can be classified correctly.
If you don't place the three points on a line, the perceptron gets it right. But there is no way to get the perceptron to classify all possible class assignments of 4 points correctly, no matter how you place the points.
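For concreteness, here is a small brute-force check of the same idea (a sketch, not part of the original answer; all names are illustrative). It tests strict linear separability of a labelling via a linear-programming feasibility problem in scipy:

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # A separating line w.x + b exists iff y_i * (w.x_i + b) >= 1 is feasible.
    # Variables are [w1, w2, b]; rewrite as -y_i * (w.x_i + b) <= -1.
    A = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0  # status 0 = feasible, i.e. separable

def shatters(points):
    # A placement is shattered if every possible labelling is separable.
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shatters([(0, 0), (1, 0), (0, 1)]))  # True: general position works
print(shatters([(0, 0), (1, 0), (2, 0)]))  # False: collinear placement fails

No placement of 4 points passes this test, which is why the VC dimension is 3.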

Related

Can we use a combination of two polynomials (degree n and n+1) for regression to fit a curve for application in a wall filter?

I am trying to remove clutter (noise, tissue interference) from color flow ultrasound images of blood vessels, which is technically called a wall filter. What I tried is to use a single polynomial regression of degree 2 to fit the clutter space and remove it from the original data (the color flow matrix data); I also used Legendre polynomials to achieve the same goal. I am getting different results for degree 2 and degree 3: degree 2 gives a somewhat good result, but degree 3 gives a messed-up result. I heard there is a way to combine these two polynomials with certain weights to achieve a better fit.
Here is what I tried in MATLAB:
for k = 1:m*n                % m and n being the dimensions of the data
    i = mod(k-1, m) + 1;     % row index
    j = (k-i)/m + 1;         % column index
    s = squeeze(IQ_sample(i,j,:));             % slow-time signal at pixel (i,j) of my 3D data
    p = polyfit(t(:), s, 2);                   % polynomial curve fitting, degree 2 used here
    IQ_filtered(i,j,:) = s - polyval(p, t(:)); % remove the clutter estimate from the data (filtering)
end
I did something similar with the Legendre polynomial. I am wondering if there is a way to fit two polynomials, hoping to get a better fit and thereby a better filter. I may be mixing up some concepts as I am a newcomer to this area; I appreciate your insight on this and related tips on wall filters using polynomial regression. Thank you.
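One simple way to combine the two fits (a sketch only, not a standard wall-filter recipe; alpha and all names here are illustrative) is a convex combination of the degree-2 and degree-3 clutter estimates, with the weight chosen on validation data:

import numpy as np

def blended_clutter_filter(t, s, alpha=0.7):
    # Blend the degree-2 and degree-3 polynomial clutter estimates,
    # then subtract the blend from the slow-time signal s.
    c2 = np.polyval(np.polyfit(t, s, 2), t)
    c3 = np.polyval(np.polyfit(t, s, 3), t)
    return s - (alpha * c2 + (1 - alpha) * c3)

t = np.linspace(0.0, 1.0, 64)
s = 3 * t**2 + 0.1 * np.random.randn(64)  # toy slow-time signal with clutter
filtered = blended_clutter_filter(t, s)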

Perceptron training rule, why multiply by x

I was reading Tom Mitchell's machine learning book, and he gives the perceptron training rule as

w_i <- w_i + delta_w_i, with delta_w_i = eta * (t - o) * x_i

where

eta: training rate
t: expected (target) output
o: actual output
x_i: the ith input

This implies that if x_i is very large then so is delta_w_i, but I don't understand the purpose of a large update when x_i is large.
On the contrary, I feel like if there is a large x_i then the update should be small, since a small fluctuation in w_i will result in a big change in the final output (due to the large x_i).
The adjustments are vector additions and subtractions, which can be thought of as rotating a hyperplane so that class 0 falls on one side and class 1 falls on the other side.
Consider a 1xd weight vector w holding the weights of the perceptron model, and a 1xd datapoint x. Then the predicted value of the perceptron model, considering a linear threshold without loss of generality, will be 1 if w . x > 0 and 0 otherwise, where

w . x -- Eq. 1

Here '.' is the dot product: w . x = w_1*x_1 + ... + w_d*x_d.
The hyperplane of the above equation is w . x = 0.
(Ignoring the iteration indices for the weight updates for simplicity.)
Let us say we have two classes 0 and 1; again without loss of generality, datapoints labelled 0 fall on the side of the hyperplane where Eq. 1 <= 0, and datapoints labelled 1 fall on the other side, where Eq. 1 > 0.
The vector normal to this hyperplane is w itself. The angle between w and a datapoint with label 0 should be more than 90 degrees, and the angle between w and a datapoint with label 1 should be less than 90 degrees.
There are three possibilities for the error (t - o), ignoring the training rate:
(t - o) = 0: this example is classified correctly by the present set of weights, so no change is needed for this datapoint.
(t - o) = 1: the target was 1, but the present set of weights classified it as 0. Eq. 1, which was supposed to be > 0, is <= 0 in this case, indicating that the angle between w and x is greater than 90 degrees when it should have been less. The update rule is w <- w + x. If you imagine a vector addition in 2D, this rotates the hyperplane so that the angle between w and x is smaller than before and moves below 90 degrees.
(t - o) = -1: the target was 0, but the present set of weights classified it as 1. Eq. 1, which was supposed to be <= 0, is > 0 in this case, indicating that the angle between w and x is less than 90 degrees when it should have been greater. The update rule is w <- w - x. Similarly, this rotates the hyperplane so that the angle between w and x becomes greater than 90 degrees.
This is iterated over and over, and the hyperplane is rotated and adjusted so that its normal makes an angle of less than 90 degrees with the datapoints labelled 1 and of more than 90 degrees with the datapoints labelled 0.
If the magnitude of x is huge, each update makes big changes, which can disturb the process and may require more iterations to converge, depending on the magnitude of the initial weights. Therefore it is a good idea to normalise or standardise the datapoints. From this perspective it is easy to visualise exactly what the update rules are doing (consider the bias as part of the hyperplane in Eq. 1). Now extend this to more complicated networks and/or other thresholds.
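To make the rotation picture concrete, here is a minimal sketch of the rule described above (illustrative names; the bias is folded into the inputs as suggested):

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    Xb = np.hstack([X, np.ones((len(X), 1))])  # fold the bias into the inputs
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, t in zip(Xb, y):
            o = 1 if np.dot(w, x) > 0 else 0   # threshold on Eq. 1
            w += eta * (t - o) * x             # no change when t == o
    return w

# Toy linearly separable data: class 1 iff both inputs are 1 (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)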
Recommended reading and reference: Neural Networks: A Systematic Introduction by Raúl Rojas, Chapter 4.

Handling zero rows/columns in the covariance matrix during the EM algorithm

I tried to implement GMMs, but I have a few problems during the EM algorithm.
Let's say I've got 3D samples (stat1, stat2, stat3) which I use to train the GMMs.
One of my training sets for one of the GMMs has a "0" for stat1 in nearly every sample. During training I get really small numbers (like 1.4456539880060609E-124) in the first row and column of the covariance matrix, which in the next iteration of the EM algorithm leads to 0.0 in the first row and column.
I get something like this:
0.0 0.0 0.0
0.0 5.0 6.0
0.0 2.0 1.0
I need the inverse of the covariance matrix to calculate the density, but since one column is zero I can't compute it.
I thought about falling back to the old covariance matrix (and mean), or replacing every 0 with a really small number.
Or is there another simple solution to this problem?
Simply put, your data lies in a degenerate subspace of your actual input space, and the GMM in its most generic form is not well suited to such a setting. The problem is that the empirical covariance estimator you use simply fails for such data (as you said, you cannot invert it). What do you usually do? You change the covariance estimator to a constrained/regularized one, such as:
Constant-based shrinking: instead of using Sigma = Cov(X) you use Sigma = Cov(X) + eps * I, where eps is a predefined small constant and I is the identity matrix. Consequently you never have zero values on the diagonal, and it is easy to prove that for a reasonable epsilon this matrix is invertible (see the sketch below).
Nicely fitted shrinking, like the Oracle Approximating Shrinkage estimator or the Ledoit-Wolf covariance estimator, which find the best epsilon based on the data itself.
Constraining your Gaussians to, for example, the spherical family, thus N(m, sigma * I), where sigma = avg_i( cov( X[:, i] ) ) is the mean covariance per dimension. This limits you to spherical Gaussians, and it also solves the above issue.
There are many more possible solutions, but all are based on the same idea: change the covariance estimator in such a way that you have a guarantee of invertibility.
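As a concrete illustration of the first option above (a minimal sketch with illustrative names):

import numpy as np

def regularized_cov(X, eps=1e-6):
    # Empirical covariance plus eps on the diagonal: Sigma = Cov(X) + eps * I.
    sigma = np.cov(X, rowvar=False)  # rows are samples, columns are features
    return sigma + eps * np.eye(sigma.shape[0])

X = np.zeros((100, 3))
X[:, 1:] = np.random.randn(100, 2)   # first feature is constant (degenerate)
sigma = regularized_cov(X)
inv = np.linalg.inv(sigma)           # now well-defined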

K means clustering for multidimensional data

Say the dataset has 440 objects and 8 attributes (the dataset is taken from the UCI machine learning repository: the Wholesale customers data). How do we calculate centroids for such a dataset?
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
If I calculate the mean of the values of each row, will that be the centroid?
And how do I plot the resulting clusters in MATLAB?
OK, first of all: in the dataset, 1 row corresponds to a single example, and you have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for one specific feature (or attribute, as you call it); e.g. column 1 in your dataset contains the values for the feature Channel, column 2 the values for the feature Region, and so on.
K-Means
Now for K-Means Clustering, you need to specify the number of clusters (the K in K-Means). Say you want K=3 clusters, then the simplest way to initialise K-Means is to randomly choose 3 examples from your dataset (that is 3 rows, randomly drawn from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by the Euclidean distance; check the function norm in MATLAB).
After the first round of putting all examples into the closest bin, you recalculate the centroids by calculating the mean of all examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
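The whole loop is short; here is a compact sketch of it (in Python for brevity, with illustrative names; the MATLAB building blocks follow below):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random rows
    for _ in range(iters):
        # Assign every example to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        bins = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the examples in its bin.
        new = np.array([X[bins == j].mean(axis=0) if np.any(bins == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # stop when no example changes bin
            break
        centroids = new
    return centroids, bins

X = np.random.rand(440, 8)  # stand-in for the 440x8 wholesale-customers data
centroids, assignments = kmeans(X, 3)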
Some Matlab starting points
You load the data by X = load('path/to/the/dataset', '-ascii');
In your case X will be a 440x8 matrix.
You can calculate the Euclidean distance from an example to a centroid by
distance = norm(example - centroid1);
where both example and centroid1 have dimensionality 1x8.
Recalculating the centroids would work as follows, suppose you have done 1 iteration of K-Means and have put all examples into their respective closest bin. Say Bin1 now contains all examples that are closest to centroid1 and therefore Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration you can then do centroid1 = mean(Bin1);. You would do similar things to your other bins.
As for plotting, note that your dataset contains 8 features, which means 8 dimensions; that is not directly visualisable. I'd suggest you create or look for a (dummy) dataset with only 2 features, which would then be visualisable with MATLAB's plot() function.

How to implement a better sliding window algorithm?

So I have been writing my own code for HOG and a variant of it to work with depth images. However, I am stuck testing my trained SVM in the detection-window part.
All I've done right now is first create image pyramids out of the original image, and run a sliding window of 64x128 size from the top-left corner to the bottom-right.
Here's a video capture of it: http://youtu.be/3cNFOd7Aigc
Now the issue is that I'm getting more false positives than I expected.
Is there a way to remove all these false positives (besides training with more images)? So far I can get the 'score' from the SVM, which is the signed distance to the margin. How can I use that to improve my results?
Does anyone have any insight into implementing a good sliding window algorithm?
What you could do is add a processing step to find the locally strongest response from the SVM. Let me explain.
What you appear to be doing right now:
for each sliding window W, record category[W] = SVM.hardDecision(W)
A hard decision means it returns a boolean or an integer; for 2-category classification it could be written like this:
hardDecision(W) = bool( softDecision(W) > 0 )
Since you mentioned OpenCV, in CvSVM::predict you should set returnDFVal to true:
returnDFVal – Specifies a type of the return value. If true and the problem is 2-class classification then the method returns the decision function value that is signed distance to the margin, else the function returns a class label (classification) or estimated function value (regression).
from the documentation.
What you could do is:
for each sliding window W, record score[W] = SVM.softDecision(W)
for each W, compute and record:
neighbors = max(score[W_left], score[W_right], score[W_up], score[W_bottom])
local[W] = score[W] > neighbors
powerful[W] = score[W] > threshold.
for each W, you have a positive if local[W] && powerful[W]
Since your classifier will have a positive response for windows close (in space and/or appearance) to a true positive, the idea is to record the scores for each window, and then only keep positives which:
have a locally maximal score (greater than their neighbors) --> local
are strong enough --> powerful
You could set threshold to 0 and adjust it until you get satisfying results. Or you could calibrate it automatically using your training set.
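Here is a hedged sketch of the local-maximum plus threshold filtering described above (illustrative names; scores is assumed to be a 2D grid holding one SVM decision value per window position):

import numpy as np

def keep_local_maxima(scores, threshold=0.0):
    # Pad with -inf so border windows compare against 'empty' neighbors.
    s = np.pad(scores, 1, constant_values=-np.inf)
    neighbors = np.maximum.reduce([
        s[:-2, 1:-1], s[2:, 1:-1],  # up, down
        s[1:-1, :-2], s[1:-1, 2:],  # left, right
    ])
    local = scores > neighbors      # locally strongest response
    powerful = scores > threshold   # strong enough in absolute terms
    return local & powerful

scores = np.random.randn(20, 30)    # stand-in for per-window SVM scores
detections = np.argwhere(keep_local_maxima(scores, threshold=1.0))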
