JUNG Pagerank result: Some vertices has value NaN - jung

I have a sparse graph with directed weighted edges. I have normalized the sum of outgoing weight edges of each vertex to be 1.
I run JUNG pagerank on it, and the score of some of the vertices is NaN. These vertices do not have outgoing edges. But this should not be an issue as JUNG api 2 states
If a vertex has no outgoing edges, then the probability of taking a random jump from that vertex is (by default) effectively 1.
I have tried a smaller subset of the graph and I do not encounter the NaN problem. Do you guys have any suggestions? Thank you.

Two possibilities occur to me: you've set alpha to something peculiar (unlikely) or some of your edge weights are zero (or NaN).
Try running it without specifying your own edge weights. If that works then there's something weird with your edge weights.

Related

What is the meaning of pixel 'value' when interpreting as inverse-fourier transform?

I’m working on a image signal project using C++ with DFT, IDEF.
I major in physics and have lots of experience dealing with 1d fourier transform..
HOWEVER, 2d dft is really not intuitive.
I studied a lot and now have a little understanding of what is 2d dft.
THIS IS WHAT I REALLY WANT TO KNOW.
In 1d, assume you have 2 functions each having frequency 30, 60(ignore unit).
then I can have a sine function with frequency 30, 60.(spatial domain)
When I take DFT to each sine function, I got value of 30,60 in frequency domain.
*** If I reduced the value of frequency (f = 30), then I get low amplitude in spatial domain, which means Asin(2pi30x), coefficient A reduce.
alright then.
when I have a image of 100x100 pixels and take 2d dft.
Then I also have 2d frequency domain(only magnitude).
*** What happen to pixels in spatial domain when I reduce the value of specific frequency?
suppose we have two frequency (10,10), (20,20) in frequency domain(u,v)
this means image in spatial domain is composed of these two frequency sinusoidal functions.
Also same as 1D, when I reduced the value of the specific frequency, I thinks it should reduce the amplitude of 2d sinusoidal function,, right?
Then How can i interpret pixel?
*** what can I interpret pixel in regard to sinusoidal function.
This question arises because I and colleague are working on project,
we are inserting specific frequency like (30,30) in frequency domain to original 1 image.
then after, when I idft, I got image what i want.
But my colleague is trying to insert frequency not in frequency domain, but in spatial domain, dealing with pixel value, which I can’t understand…

Shape of ROC curve

I did a prediction analysis on a dataset and drew the ROC curve.
The ROC curve looks like below,
Im not very much sure about the shape of the curve. Doesn't it need to be a wavy curve. But looking at the cure, can we decide, that there is an issue with this. I got arount 71% accuracy, that is ok for me. But I'm worrying about the shape of the curve, which is not wavy. For an example doesn't look like below. (taken from internet.)
It looks like you only plotted three points. The idea of a ROC curve is to show how the FP/TP ratio varies when you tweak the decision threshold in order to establish the performance at every point. Without information about how you plotted this or what parameters you have, it's hard to say anything more.
A typical example would be to tweak aggressivity level -- if you have a spam scanner which will classify as spam at a particular score, how does changing the score threshold change the TP/FP rate? So effectively the X axis will also reveal the threshold setting (but possibly stretched in a manner) and the curve at every point will show how many of the samples in your clean collection will be FPs at that threshold, and how many in your spam collection will be correctly blocked.
("Stretching" means that the threshold setting might not map linearly onto the FP rate. If nothing happens between thresholds 0.950 and 0.975, you don't plot that interval on the x axis at all. The points on the x axis are the threshold values where the TP/FP rate changes; some could be very close to each other in terms of threshold value, and other adjacent points could correspond to a large jump in the threshold value.)
A good ROC curve has a large area underneath it. An ideal ROC goes from 0 to 1.00 and stays there, but then you don't need the plot to help you decide how to deploy your solution anyway. But in reality, they will come in all kinds of shapes, from vaguely asymptotic towards the upper left (very good) to straight diagonal (pretty lousy) and even asymptotic towards the lower right (extremely poor; random verdicts would be better). The interesting points are the "knee" where the TP rate's growth slows down and the FP rate starts growing quicker (that's where you should stop increasing the threshold) and any irregularities, especially any which break monotony.
(In your example from the net, there is a spot around TP 0.6 where increasing the threshold will only increase FPs. Why is that? Is there a skew in the samples, or a problem in the implementation? Could it be fixed?)
It looks like you have plotted points using the predicted class of a classifier (.predict function in python's sklearn package) rather than the predicted class probability (.predict_proba function in python's sklearn package). This means there is only one threshold change, when the class switches from 0 to 1, rather than a range of values that would give you the smooth curve.
Replace your predict class with your prediction probability and this should fix your problem.

spectral clustering eigenvectors and eigenvalues

What do the eigenvalues and eigenvectors in spectral clustering physically mean. I see that if λ_0 = λ_1 = 0 then we will have 2 connected components. But, what does λ_2,...,λ_k tell us. I don't understand the algebraic connectivity by multiplicity.
Can we draw any conclusions about the tightness of the graph or in comparison to two graphs?
The smaller the eigenvalue, the less connected. 0 just means "disconnected".
Consider this a value of what share of edges you need to cut to produce separate components. The cut is orthogonal to the eigenvector - there is supposedly some threshold t, such that nodes below t should go into one component, above t to the other.
That depends somewhat on the algorithm. For several of the spectral algorithms, the eigenstuff can be easily run through Principal Component Analysis to reduce the display dimensionality for human consumption. Power iteration clustering vectors are more difficult to interpret.
As Mr.Roboto already noted, the eigenvector is normal to the division brane (a plane after a Gaussian kernel transformation). Spectral clustering methods are generally not sensitive to density (is that what you mean by "tightness"?) per se -- they find data gaps. For instance, it doesn't matter whether you have 50 or 500 nodes within a unit sphere forming your first cluster; the game changer is whether there's clear space (a nice gap) instead of a thin trail of "bread crumb" points (a sequence of tiny gaps) leading to another cluster.

accuracy of dense optical flow

Currently I am learning dense optical flow by myself. To understand it, I conduct one experiment. I produce one image using Matlab. One box with a given grays value is placed under one uniform background and the box is translated two pixels in x and y directions in another image. The two images are input into the implementation of the algorithm called TV-L1. The generated motion vector outer of the box is not zero. Is the reason that the gradient outer of the box is zero? Is the values filled in from the values with large gradient value?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or the experiment is not practical. In practical applications, this doesn't happen.
Yes, in so called homogenous image regions with very small gradients no information where a motion can dervided from exists. That's why the motion from your rectangle is propagated outer the border. If you give your background a texture this effect will be less dominant. I know such problem when it comes to estimate the ego-motion of a car. Then the streat makes a lot of problems cause of here homogenoutiy.
Two pioneers in this field Lukas&Kanade (LK) and Horn&Schunch (HS) are developed methods for computing Optical Flow (OF). Both rely on brightness constancy assumption which feature location pixel values between two sequence frames not change. This constraint may be expressed as two equations: I(x+dx,y+dy,t+dt)=I(x,y,t) and ∂I/∂x dx+∂I/∂y dy+∂I/∂t dt=0 by using a Taylor series expansion I(x+dx,y+dy,t+dt) , we get (x+dx,y+dy,t+dt)=I(x,y,t)+∂I/∂x dx+∂I/∂y dy+∂I/∂t dt… letting ∂x/∂t=u and ∂y/∂t=v and combining these equations we get the OF constraint equation: ∂I/∂t=∂I/∂t u+∂I/∂t v . The OF equation has more than one solution, so the different techniques diverge here. LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, to catch large motions with a small window size (to keep the “same local velocity” assumption).

Kmeans clustering on different distance function in Lab space

Problem:
To cluster the similar colour pixels in CIE LAB using K means.
I want to use CIE 94 for distance between 2 pixels
Formula of CIE94
What i read was Kmeans work in "Euclidean space" where the positional cordinates are minimised by cost function which is (sum of squared difference)
The reason of not Using kmeans in space other than euclidean is
"""algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the within-cluster sum of squares (WCSS) objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than (squared) Euclidean distance may stop the algorithm from converging""(source wiki)
So how to use distance CIE 94 in LAB SPACE for similar colour clustering ?
So how to approach the problem ? What should be the minimisation function here ? HOW to map euclidean space to lab space if for the k mean euclidean formula to work ? Any other approach here ?
The reason that CIE LAB is often used for clustering is because it reduces the color to 2 dimensions (as opposed to RGB with 3 color channels). You can easily think of the color for each pixel in a Cartesian coordinate system, instead of points (x,y) you have points (a,b) From here you simply perform a 2d kmeans.
Exactly how you implement kmeans is up to you. The nice thing about reducing colors to a 2d space is we can imagine the data on a grid, and now we can use any regular distance measure we want. Mahalonobis, euclidean, 1 norm, city block, etc. The possibilities are really endless here.
You don't have to use CIELAB, you can just as easily use YCbCr, YUV, or any other colorspace that represents color in 2 dimensions. IF you wanted to try a 3d kmeans you could use rgb, hsv, etc. One problem with higher dimensionality is sparsity of clusters (large variance) and most importantly, increased computation time.
Just for fun I've included two images clustered using kmeans, one in LAB and one in YCbCr, you can see the clustering is nearly identical (except that the labels are different), just proving that the exact color space is irrelevant, the main point is to match the dimensionality of your kmeans with that of your data
EDIT
You made some good points in your comments. I was merely demonstrating that by abstracting the problem you can imagine many variations for the same basic clustering algorithm. But you are right, there are advantages to using CIELAB
Back to the distance measure. Kmeans has two steps, assignment, and update (it is very similar to the Expectation Maximization algorithm). This distance is used in assignment step of k-means. Here is some psuedo code
for each pixel 1 to rows*cols
for each cluster 1 to k
dist[k] = calculate_distance(pixel, mu[k])
pixel_id = index k of minimum dist
you would create a function calculate_distance that uses the delta_e calculation from cielab94. This formula uses all 3 channels to calculate distance. Hopefully this answers your questions
NOTE
My examples only use the 2 color channels, ignoring the luminance channel. I used this technique since often the goal is group colors despite lighting disparities(such as shadows). The delta_E measure is not lighting invariant. This may or may not be a concern for your application, but it is something to keep in mind.
results using square euclidean distance
results using cityblock distance
There are k-means variations for other distance functions.
In particular k-medoids (PAM) works with arbitrary distance functions.

Resources