spectral clustering eigenvectors and eigenvalues - machine-learning

What do the eigenvalues and eigenvectors in spectral clustering physically mean. I see that if λ_0 = λ_1 = 0 then we will have 2 connected components. But, what does λ_2,...,λ_k tell us. I don't understand the algebraic connectivity by multiplicity.
Can we draw any conclusions about the tightness of the graph or in comparison to two graphs?

The smaller the eigenvalue, the less connected. 0 just means "disconnected".
Consider this a value of what share of edges you need to cut to produce separate components. The cut is orthogonal to the eigenvector - there is supposedly some threshold t, such that nodes below t should go into one component, above t to the other.

That depends somewhat on the algorithm. For several of the spectral algorithms, the eigenstuff can be easily run through Principal Component Analysis to reduce the display dimensionality for human consumption. Power iteration clustering vectors are more difficult to interpret.
As Mr.Roboto already noted, the eigenvector is normal to the division brane (a plane after a Gaussian kernel transformation). Spectral clustering methods are generally not sensitive to density (is that what you mean by "tightness"?) per se -- they find data gaps. For instance, it doesn't matter whether you have 50 or 500 nodes within a unit sphere forming your first cluster; the game changer is whether there's clear space (a nice gap) instead of a thin trail of "bread crumb" points (a sequence of tiny gaps) leading to another cluster.

Related

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

I have some geographical trajectories sampled to analyze, and I calculated the histogram of data in spatial and temporal dimension, which yielded a time domain based feature for each spatial element. I want to perform a discrete FFT to transform the time domain based feature into frequency domain based feature (which I think maybe more robust), and then do some classification or clustering algorithms.
But I'm not sure using what descriptor as frequency domain based feature, since there are amplitude spectrum, power spectrum and phase spectrum of a signal and I've read some references but still got confused about the significance. And what distance (similarity) function should be used as measurement when performing learning algorithms on frequency domain based feature vector(Euclidean distance? Cosine distance? Gaussian function? Chi-kernel or something else?)
Hope someone give me a clue or some material that I can refer to, thanks~
Edit
Thanks to #DrKoch, I chose a spatial element with the largest L-1 norm and plotted its log power spectrum in python and it did show some prominent peaks, below is my code and the figure
import numpy as np
import matplotlib.pyplot as plt
sp = np.fft.fft(signal)
freq = np.fft.fftfreq(signal.shape[-1], d = 1.) # time sloth of histogram is 1 hour
plt.plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
And I have several trivial questions to ask to make sure I totally understand your suggestion:
In your second suggestion, you said "ignore all these values."
Do you mean the horizontal line represent the threshold and all values below it should be assigned to value zero?
"you may search for the two, three largest peaks and use their location and probably widths as 'Features' for further classification."
I'm a little bit confused about the meaning of "location" and "width", does "location" refer to the log value of power spectrum (y-axis) and "width" refer to the frequency (x-axis)? If so, how to combine them together as a feature vector and compare two feature vector of "a similar frequency and a similar widths" ?
Edit
I replaced np.fft.fft with np.fft.rfft to calculate the positive part and plot both power spectrum and log power spectrum.
code:
f, axarr = plt.subplot(2, sharex = True)
axarr[0].plot(freq, np.abs(sp) ** 2)
axarr[1].plot(freq, np.log10(np.abs(sp) ** 2))
plt.show()
figure:
Please correct me if I'm wrong:
I think I should keep the last four peaks in first figure with power = np.abs(sp) ** 2 and power[power < threshold] = 0 because the log power spectrum reduces the difference among each component. And then use the log spectrum of new power as feature vector to feed classifiers.
I also see some reference suggest applying a window function (e.g. Hamming window) before doing fft to avoid spectral leakage. My raw data is sampled every 5 ~ 15 seconds and I've applied a histogram on sampling time, is that method equivalent to apply a window function or I still need apply it on the histogram data?
Generally you should extract just a small number of "Features" out of the complete FFT spectrum.
First: Use the log power spec.
Complex numbers and Phase are useless in these circumstances, because they depend on where you start/stop your data acquisiton (among many other things)
Second: you will see a "Noise Level" e.g. most values are below a certain threshold, ignore all these values.
Third: If you are lucky, e.g. your data has some harmonic content (cycles, repetitions) you will see a few prominent Peaks.
If there are clear peaks, it is even easier to detect the noise: Everything between the peaks should be considered noise.
Now you may search for the two, three largest peaks and use their location and probably widths as "Features" for further classification.
Location is the x-value of the peak i.e. the 'frequency'. It says something how "fast" your cycles are in the input data.
If your cycles don't have constant frequency during the measuring intervall (or you use a window before caclculating the FFT), the peak will be broader than one bin. So this widths of the peak says something about the 'stability' of your cycles.
Based on this: Two patterns are similar if the biggest peaks of both hava a similar frequency and a similar widths, and so on.
EDIT
Very intersiting to see a logarithmic power spectrum of one of your examples.
Now its clear that your input contains a single harmonic (periodic, oscillating) component with a frequency (repetition rate, cycle-duration) of about f0=0.04.
(This is relative frquency, proprtional to the your sampling frequency, the inverse of the time beetween individual measurment points)
Its is not a pute sine-wave, but some "interesting" waveform. Such waveforms produce peaks at 1*f0, 2*f0, 3*f0 and so on.
(So using an FFT for further analysis turns out to be very good idea)
At this point you should produce spectra of several measurements and see what makes a similar measurement and how differ different measurements. What are the "important" features to distinguish your mesurements? Thinks to look out for:
Absolute amplitude: Height of the prominent (leftmost, highest) peaks.
Pitch (Main cycle rate, speed of changes): this is position of first peak, distance between consecutive peaks.
Exact Waveform: Relative amplitude of the first few peaks.
If your most important feature is absoulute amplitude, you're better off with calculating the RMS (root mean square) level of our input signal.
If pitch is important, you're better off with calculationg the ACF (auto-correlation function) of your input signal.
Don't focus on the leftmost peaks, these come from the high frequency components in your input and tend to vary as much as the noise floor.
Windows
For a high quality analyis it is importnat to apply a window to the input data before applying the FFT. This reduces the infulens of the "jump" between the end of your input vector ant the beginning of your input vector, because the FFT considers the input as a single cycle.
There are several popular windows which mark different choices of an unavoidable trade-off: Precision of a single peak vs. level of sidelobes:
You chose a "rectangular window" (equivalent to no window at all, just start/stop your measurement). This gives excellent precission of your peaks which now have a width of just one sample. Your sidelobes (the small peaks left and right of your main peaks) are at -21dB, very tolerable given your input data. In your case this is an excellent choice.
A Hanning window is a single cosine wave. It makes your peaks slightly broader but reduces side-lobe levels.
The Hammimg-Window (cosine-wave, slightly raised above 0.0) produces even broader peaks, but supresses side-lobes by -42 dB. This is a good choice if you expect further weak (but important) components between your main peaks or generally if you have complicated signals like speech, music and so on.
Edit: Scaling
Correct scaling of a spectrum is a complicated thing, because the values of the FFT lines depend on may things like sampling rate, lenght of FFT, window, and even implementation details of the FFT algorithm (there exist several different accepted conventions).
After all, the FFT should show the underlying conservation of energy. The RMS of the input signal should be the same as the RMS (Energy) of the spectrum.
On the other hand: if used for classification it is enough to maintain relative amplitudes. As long as the paramaters mentioned above do not change, the result can be used for classification without further scaling.

Why Gaussian radial basis function maps the examples into an infinite-dimensional space?

I've just run through the Wikipedia page about SVMs, and this line caught my eyes:
"If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimensions." http://en.wikipedia.org/wiki/Support_vector_machine#Nonlinear_classification
In my understanding, if I apply Gaussian kernel in SVM, the resulting feature space will be m-dimensional (where m is the number of training samples), as you choose your landmarks to be your training examples, and you're measuring the "similarity" between a specific example and all the examples with the Gaussian kernel. As a consequence, for a single example you'll have as many similarity values as training examples. These are going to be the new feature vectors which are going to m-dimensional vectors, and not infinite dimensionals.
Could somebody explain to me what do I miss?
Thanks,
Daniel
The dual formulation of the linear SVM depends only on scalar products of all training vectors. Scalar product essentially measures similarity of two vectors. We can then generalize it by replacing with any other "well-behaved" (it should be positive-definite, it's needed to preserve convexity, as well as enables Mercer's theorem) similarity measure. And RBF is just one of them.
If you take a look at the formula here you'll see that RBF is basically a scalar product in a certain infinitely dimensional space
Thus RBF is kind of a union of polynomial kernels of all possible degrees.
The other answers are correct but don't really tell the right story here. Importantly, you are correct. If you have m distinct training points then the gaussian radial basis kernel makes the SVM operate in an m dimensional space. We say that the radial basis kernel maps to a space of infinite dimension because you can make m as large as you want and the space it operates in keeps growing without bound.
However, other kernels, like the polynomial kernel do not have this property of the dimensionality scaling with the number of training samples. For example, if you have 1000 2D training samples and you use a polynomial kernel of <x,y>^2 then the SVM will operate in a 3 dimensional space, not a 1000 dimensional space.
The short answer is that this business about infinite dimensional spaces is only part of the theoretical justification, and of no practical importance. You never actually touch an infinite-dimensional space in any sense. It's part of the proof that the radial basis function works.
Basically, SVMs are proved to work the way they do by relying on properties of dot products over vector spaces. You can't just swap in the radial basis function and expect it necessarily works. To prove that it does, however, you show that the radial basis function is actually like a dot product over a different vector space, and it's as if we're doing regular SVMs in a transformed space, which works. And it happens that infinite dimensioal-ness is OK, and that the radial basis function does correspond to a dot product in such a space. So you can say SVMs still work when you use this particular kernel.

Accuracy in depth estimation - Stereo Vision

I am doing a research in stereo vision and I am interested in accuracy of depth estimation in this question. It depends of several factors like:
Proper stereo calibration (rotation, translation and distortion extraction),
image resolution,
camera and lens quality (the less distortion, proper color capturing),
matching features between two images.
Let's say we have a no low-cost cameras and lenses (no cheap webcams etc).
My question is, what is the accuracy of depth estimation we can achieve in this field?
Anyone knows a real stereo vision system that works with some accuracy?
Can we achieve 1 mm depth estimation accuracy?
My question also aims in systems implemented in opencv. What accuracy did you manage to achieve?
Q. Anyone knows a real stereo vision system that works with some accuracy? Can we achieve 1 mm depth estimation accuracy?
Yes, you definitely can achieve 1mm (and much better) depth estimation accuracy with a stereo rig (heck, you can do stereo recon with a pair of microscopes). Stereo-based industrial inspection systems with accuracies in the 0.1 mm range are in routine use, and have been since the early 1990's at least. To be clear, by "stereo-based" I mean a 3D reconstruction system using 2 or more geometrically separated sensors, where the 3D location of a point is inferred by triangulating matched images of the 3D point in the sensors. Such a system may use structured light projectors to help with the image matching, however, unlike a proper "structured light-based 3D reconstruction system", it does not rely on a calibrated geometry for the light projector itself.
However, most (likely, all) such stereo systems designed for high accuracy use either some form of structured lighting, or some prior information about the geometry of the reconstructed shapes (or a combination of both), in order to tightly constrain the matching of points to be triangulated. The reason is that, generally speaking, one can triangulate more accurately than they can match, so matching accuracy is the limiting factor for reconstruction accuracy.
One intuitive way to see why this is the case is to look at the simple form of the stereo reconstruction equation: z = f b / d. Here "f" (focal length) and "b" (baseline) summarize the properties of the rig, and they are estimated by calibration, whereas "d" (disparity) expresses the match of the two images of the same 3D point.
Now, crucially, the calibration parameters are "global" ones, and they are estimated based on many measurements taken over the field of view and depth range of interest. Therefore, assuming the calibration procedure is unbiased and that the system is approximately time-invariant, the errors in each of the measurements are averaged out in the parameter estimates. So it is possible, by taking lots of measurements, and by tightly controlling the rig optics, geometry and environment (including vibrations, temperature and humidity changes, etc), to estimate the calibration parameters very accurately, that is, with unbiased estimated values affected by uncertainty of the order of the sensor's resolution, or better, so that the effect of their residual inaccuracies can be neglected within a known volume of space where the rig operates.
However, disparities are point-wise estimates: one states that point p in left image matches (maybe) point q in right image, and any error in the disparity d = (q - p) appears in z scaled by f b. It's a one-shot thing. Worse, the estimation of disparity is, in all nontrivial cases, affected by the (a-priori unknown) geometry and surface properties of the object being analyzed, and by their interaction with the lighting. These conspire - through whatever matching algorithm one uses - to reduce the practical accuracy of reconstruction one can achieve. Structured lighting helps here because it reduces such matching uncertainty: the basic idea is to project sharp, well-focused edges on the object that can be found and matched (often, with subpixel accuracy) in the images. There is a plethora of structured light methods, so I won't go into any details here. But I note that this is an area where using color and carefully choosing the optics of the projector can help a lot.
So, what you can achieve in practice depends, as usual, on how much money you are willing to spend (better optics, lower-noise sensor, rigid materials and design for the rig's mechanics, controlled lighting), and on how well you understand and can constrain your particular reconstruction problem.
I would add that using color is a bad idea even with expensive cameras - just use the gradient of gray intensity. Some producers of high-end stereo cameras (for example Point Grey) used to rely on color and then switched to grey. Also consider a bias and a variance as two components of a stereo matching error. This is important since using a correlation stereo, for example, with a large correlation window would average depth (i.e. model the world as a bunch of fronto-parallel patches) and reduce the bias while increasing the variance and vice versa. So there is always a trade-off.
More than the factors you mentioned above, the accuracy of your stereo will depend on the specifics of the algorithm. It is up to an algorithm to validate depth (important step after stereo estimation) and gracefully patch the holes in textureless areas. For example, consider back-and-forth validation (matching R to L should produce the same candidates as matching L to R), blob noise removal (non Gaussian noise typical for stereo matching removed with connected component algorithm), texture validation (invalidate depth in areas with weak texture), uniqueness validation (having a uni-modal matching score without second and third strong candidates. This is typically a short cut to back-and-forth validation), etc. The accuracy will also depend on sensor noise and sensor's dynamic range.
Finally you have to ask your question about accuracy as a function of depth since d=f*B/z, where B is a baseline between cameras, f is focal length in pixels and z is the distance along optical axis. Thus there is a strong dependence of accuracy on the baseline and distance.
Kinect will provide 1mm accuracy (bias) with quite large variance up to 1m or so. Then it sharply goes down. Kinect would have a dead zone up to 50cm since there is no sufficient overlap of two cameras at a close distance. And yes - Kinect is a stereo camera where one of the cameras is simulated by an IR projector.
I am sure with probabilistic stereo such as Belief Propagation on Markov Random Fields one can achieve a higher accuracy. But those methods assume some strong priors about smoothness of object surfaces or particular surface orientation. See this for example, page 14.
If you wan't to know a bit more about accuracy of the approaches take a look at this site, although is no longer very active the results are pretty much state of the art. Take into account that a couple of the papers presented there went to create companies. What do you mean with real stereo vision system? If you mean commercial there aren't many, most of the commercial reconstruction systems work with structured light or directly scanners. This is because (you missed one important factor in your list), the texture is a key factor for accuracy (or even before that correctness); a white wall cannot be reconstructed by a stereo system unless texture or structured light is added. Nevertheless, in my own experience, systems that involve variational matching can be very accurate (subpixel accuracy in image space) which is generally not achieved by probabilistic approaches. One last remark, the distance between cameras is also important for accuracy: very close cameras will find a lot of correct matches and quickly but the accuracy will be low, more distant cameras will find less matches, will probably take longer but the results could be more accurate; there is an optimal conic region defined in many books.
After all this blabla, I can tell you that using opencv one of the best things you can do is do an initial cameras calibration, use Brox's optical flow to find find matches and reconstruct.

How do I make a U-matrix?

How exactly is an U-matrix constructed in order to visualise a self-organizing-map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained), how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input data dimension space. Namely you calculate the distance between adjacent neurons, using their trained vector. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distance between neuron x and y, and the values in {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distance you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually Euclidian distance). So, the values of the U-matrix are only numbers (not vectors). Then you can assign a light gray colour to the largest of these values and a dark gray to the smallest and the other values to corresponding shades of gray. You can use these colours to paint the cells of the U-matrix and have a visualized representation of the distances between neurons.
Have also a look at this web article.
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that's true, secondly, it is a deep mis-understanding of the SOM, thirdly it is also a mis-understanding of the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independent of the purpose, ie inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-Matrix -, because the underlying argument is already flawed. No matter which method you then use to determine the cluster, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it is loosing information regarding the true intention of building a SOM. So, why do we build SOM's from data?
Let us start with some basics:
Any SOM is a representative model of a data space, for it reduces the dimensionality of the latter. For it is a model it can be used as a diagnostic as well as a predictive tool. Yet, both cases are not justified by some universal objectivity. Instead, models are deeply dependent on the purpose and the accepted associated risk for errors.
Let us assume for a moment the U-Matrix (or similar) would be reasonable. So we determine some clusters on the map. It is not only an issue how to justify the criterion for it (outside of the purpose itself), it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing on a SOM is the accuracy itself viz the classification error, not some estimation of it. Thus, the estimation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion "allowed divergence within clusters", so it is masking the most important thing of the structure of the data. It is also dependent on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number classes on a SOM? If there is no exterior apriori reasoning available, the only feasible way would be an a-posteriori check of the goodness-of-fit. On a given SOM, impose different numbers of classes and measure the deviations in terms of mis-classification cost, then choose (subjectively) the most pleasing one (using some fancy heuristics, like Occam's razor)
Taken together, the U-matrix is pretending objectivity where no objectivity can be. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.

Why do we maximize variance during Principal Component Analysis?

I'm trying to read through PCA and saw that the objective was to maximize the variance. I don't quite understand why. Any explanation of other related topics would be helpful
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as #jazibjamil pointed out), so you want to "squeeze" the most information in each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus you're vectors are as distant from each other as possible. That way if you only use the first N component vectors you're going to capture more space with highly varying vectors than with like vectors. Think about what Principal Component actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
max variance is basically setting these axis that occupy the maximum spread of the datapoints, why? because the direction of this axis is what really matters as it kinda explains correlations and later on we will compress/project the points along those axis to get rid of some dimensions

Resources