K-means initialization with furthest-first traversal and k-means++ - machine-learning

I am confused about k-means++ initialization. I understand that k-means++ chooses the furthest data point as the next cluster center. But what about outliers? What is the difference between initialization with furthest-first traversal and k-means++?
I saw someone explain in this way:
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2 = 4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
What if the array or list is [0,1,2,4,5,6,100]? Obviously, 100 is the outlier in this case, and it will be chosen as a center at some point. Can someone give a better explanation?

k-means++ chooses points with a probability (proportional to their squared distance from the nearest center chosen so far), rather than deterministically taking the furthest point.
But yes, with extreme outliers it is likely to choose the outlier.
That is fine, because so will k-means: most likely the best SSQ (sum-of-squares) solution has a one-element cluster containing only that point.
If you have such data, the k-means solutions tend to be rather useless, and you should probably choose another algorithm such as DBSCAN instead.
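To make the arithmetic concrete, here is a minimal sketch (not part of the original answer, written in Swift purely for illustration) that computes the k-means++ selection probabilities for the asker's example [0, 1, 2, 4, 5, 6, 100] once c1 = 0 has been chosen:
// Illustrative sketch: k-means++ selection probabilities for the next center,
// given the points from the question and the first center c1 = 0.
let points: [Double] = [0, 1, 2, 4, 5, 6, 100]
let c1 = points[0]
// D(x)^2: squared distance from each point to the nearest already-chosen center.
let d2 = points.map { ($0 - c1) * ($0 - c1) }
let total = d2.reduce(0, +)   // 0 + 1 + 4 + 16 + 25 + 36 + 10000 = 10082
// Each point becomes the next center with probability D(x)^2 / total.
for (x, d) in zip(points, d2) {
    print("P(c2 = \(x)) = \(d / total)")
}
// The outlier 100 gets probability 10000/10082 ≈ 0.99, so, as the answer says,
// an extreme outlier is almost certain to be picked as a center at some point.
By contrast, furthest-first traversal would pick 100 deterministically (it always takes the point farthest from the centers chosen so far); k-means++ merely picks it with very high probability here.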

Related

How to rotate a non-square image in the frequency domain

I want to rotate an image in the frequency domain. Inspired by the answers in "Image rotation and scaling the frequency domain?", I managed to rotate square images. (See the following Python script using OpenCV.)
import cv2
import numpy as np
from numpy.fft import fftshift, ifftshift

angle = 30.0  # rotation angle in degrees (cv2.getRotationMatrix2D expects degrees)

# cv2.dft works on single-channel (or two-channel) float arrays, so read as grayscale
M = cv2.imread("lenna.png", cv2.IMREAD_GRAYSCALE)
M = np.float32(M)
hanning = cv2.createHanningWindow((M.shape[1], M.shape[0]), cv2.CV_32F)
M = hanning * M
sM = fftshift(M)
rotation_center = (M.shape[1] / 2, M.shape[0] / 2)
rot_matrix = cv2.getRotationMatrix2D(rotation_center, angle, 1.0)
FsM = fftshift(cv2.dft(sM, flags=cv2.DFT_COMPLEX_OUTPUT))
rFsM = cv2.warpAffine(FsM, rot_matrix, (FsM.shape[1], FsM.shape[0]),
                      flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
IrFsM = ifftshift(cv2.idft(ifftshift(rFsM), flags=cv2.DFT_REAL_OUTPUT))
This works fine with square images. (Better results could be achieved by padding the image.)
However, when using only a non-square portion of the image, the rotation in the frequency domain shows some kind of shearing effect.
Any idea on how to achieve this? Obviously I could pad the image to make it square, but the final purpose of all this is to rotate FFTs as fast as possible for an iterative image registration algorithm, and padding would slightly slow down the algorithm.
Following the suggestion of @CrisLuengo, I found the affine transform needed to avoid padding the image. Obviously it will depend on the image size and the application, but for my case avoiding the padding is very interesting.
The modified script now looks like this:
# rot_matrix = cv2.getRotationMatrix2D(rotation_center, angle, 1.0)
# NOTE: np.cos/np.sin below take the angle in radians,
# whereas cv2.getRotationMatrix2D above expects degrees.
kx = 1.0
ky = 1.0
if M.shape[0] > M.shape[1]:
    kx = float(M.shape[0]) / M.shape[1]
else:
    ky = float(M.shape[1]) / M.shape[0]
affine_transform = np.zeros([2, 3])
affine_transform[0, 0] = np.cos(angle)
affine_transform[0, 1] = np.sin(angle) * ky / kx
affine_transform[0, 2] = (1 - np.cos(angle)) * rotation_center[0] - ky / kx * np.sin(angle) * rotation_center[1]
affine_transform[1, 0] = -np.sin(angle) * kx / ky
affine_transform[1, 1] = np.cos(angle)
affine_transform[1, 2] = kx / ky * np.sin(angle) * rotation_center[0] + (1 - np.cos(angle)) * rotation_center[1]
FsM = fftshift(cv2.dft(sM, flags=cv2.DFT_COMPLEX_OUTPUT))
rFsM = cv2.warpAffine(FsM, affine_transform, (FsM.shape[1], FsM.shape[0]),
                      flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
IrFsM = ifftshift(cv2.idft(ifftshift(rFsM), flags=cv2.DFT_REAL_OUTPUT))

Dynamic Time Warping in Swift

I translated a DTW MATLAB function to Swift. The code looks as follows:
private func dtw(x1: [Double], x2: [Double]) -> Double {
    let n1 = x1.count
    let n2 = x2.count
    var table = [[Double]](repeating: [Double](repeating: 0, count: n2 + 1), count: 2)
    table[0][0] = 0
    for i in 1...n2 { table[0][i] = Double.infinity }
    for i in 1...n1 {
        table[1][0] = Double.infinity
        for j in 1...n2 {
            let cost = abs(x1[i - 1] - x2[j - 1])
            var min = table[0][j - 1]
            if min > table[0][j] {
                min = table[0][j]
            }
            if min > table[1][j - 1] { min = table[1][j - 1] }
            table[1][j] = cost + min
        }
        let swap = table[0]
        table[0] = table[1]
        table[1] = swap
    }
    return table[0][n2]
}
This function takes an average of 16 ms to complete on an iPhone 11. For my use case, this is very slow. I want to investigate ways to improve the speed. I recently read these two articles: DTW in Swift (O'Reilly) and Parallel Programming with Swift. In the first article, there is a good quote:
Our implementation of DTW is naïve, and can be accelerated using parallel computing. To calculate the new row/column in a distance matrix, you don't need to wait until the previous one is finished; you only need it to be filled one cell ahead of your row/column
This would make the for j in 1 ... n2 loop an ideal candidate, I think. Looking at the code, only these two operations should need to be made thread-safe, due to the read/write:
table[1][j - 1]
table[1][j]
The problem I am currently experiencing in introducing parallel computing (from article 2) is that I cannot figure out how to tell Swift to run everything in parallel, except when I come to the two lines below, as they depend on their predecessor:
if (min > table[1][j - 1]) { min = table[1][j - 1]; }
table[1][j] = cost + min;
I suspect I could solve this issue with DispatchQueue.concurrentPerform and an NSLock(), if I implemented it correctly (I have not). It could also be the wrong tool for the job, which brings me back to my question:
What can I do to improve the speed of my DTW function, where the only constraint on performing a step is that the previous step over the array has to have completed (parallelization, concurrency, etc.)? A code example would go a long way.
Your first problem is that you're creating an array of arrays. This is not an efficient data structure, and is not a "2-dimensional array" in the way most people mean (i.e., a matrix). It is an array made up of other arrays, all of which can have arbitrary sizes, and this can be very expensive to mutate. As a rule, if you want a matrix, you should back it with a flat array and use multiplication to find its offsets, particularly if you're mutating it. Instead of table[i][j] you would use table[i * width + j].
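For illustration (this is my own sketch, not part of the answer), a minimal flat-array-backed matrix in Swift might look like this:
// Minimal sketch of the flat-array layout described above.
struct Matrix {
    let rows: Int
    let cols: Int
    private var storage: [Double]

    init(rows: Int, cols: Int, fill: Double = 0) {
        self.rows = rows
        self.cols = cols
        self.storage = Array(repeating: fill, count: rows * cols)
    }

    // table[i][j] becomes storage[i * cols + j]
    subscript(i: Int, j: Int) -> Double {
        get { storage[i * cols + j] }
        set { storage[i * cols + j] = newValue }
    }
}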
But in your case it's even easier, since there are exactly two rows. So you don't need a multi-dimensional array at all. You can just use two variables, and it'll be much more efficient. (In my tests, just making this change is about 30% faster than the original code.)
The major thing that slows you down is contention. You read and write to the same array in the loop. That gets in the way of various reordering and caching optimizations. In particular, it happens here:
if (min > table[1][j - 1]) { min = table[1][j - 1]; }
table[1][j] = cost + min;
If you rewrite that using two row variables rather than an array, it still looks like this:
if (min > row1[j - 1]) { min = row1[j - 1] }
row1[j] = cost + min
This forces the previous write to row1 to be fully completed before the next minimum can be computed, and then requires an array lookup to get the value back. But that's not really necessary. You can just cache the previous value between loops. Doing that means the loop only performs reads on row0 and only performs writes on row1. That's good for memory contention.
Putting those together, I wrote it this way. I changed the offsets to run from 0 rather than 1; it just made the code a little simpler to understand IMO. In my tests, this is about 3x faster than the original code for two arrays of 10k elements each.
func dtw(x1: [Double], x2: [Double]) -> Double {
    let n1 = x1.count
    let n2 = x2.count
    var row0 = Array(repeating: Double.infinity, count: n2 + 1)
    row0[0] = 0
    var row1 = Array(repeating: 0.0, count: n2 + 1)
    for i in 0 ..< n1 {
        row1[0] = .infinity
        // Keep track of the last value so we never have to read from row1.
        var lastValue = Double.infinity
        for j in 0 ..< n2 {
            let cost = abs(x1[i] - x2[j])
            // Don't be tempted to use the 3-value version of `min` here. It's much slower.
            var minimum = min(row0[j], row0[j + 1])
            minimum = min(minimum, lastValue)
            lastValue = cost + minimum
            row1[j + 1] = lastValue
        }
        swap(&row0, &row1)
    }
    return row0[n2]
}
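For reference, a usage sketch (my own addition, matching the 10k-element timing test mentioned above; the random inputs are just placeholders):
// Usage sketch: two random 10k-element series, as in the timing comparison above.
let a = (0..<10_000).map { _ in Double.random(in: 0...1) }
let b = (0..<10_000).map { _ in Double.random(in: 0...1) }
let distance = dtw(x1: a, x2: b)
print("DTW distance: \(distance)")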
This code is somewhat hard to make parallel, because the operations are not independent. Each row depends on the other rows. The key to good queue-based parallelism is the ability to split up fairly large chunks of independent work, and then efficiently combine them at the end. The cost of coordination will eat your benefits if the work units are too small. In many cases, vectorization (SIMD) is much more efficient than dispatching to multiple queues.
The cost function is independent, and I explored computing it with Accelerate (the main vectorization framework), but this generally made things slower. The compiler is very good at optimizing simple math in loops, and will do quite a lot of vectorizing for you if you let it. Accelerate is best when you need to do an expensive, consistent, and independent computation on a lot of values. And this loop isn't expensive or independent.

How to estimate a "simple" Nonlinear Regression + Parameter Constraints + AR residuals

I am new to this site, so please bear with me. I want to estimate the nonlinear model shown in the link https://i.stack.imgur.com/cNpWt.png, imposing the constraints a > 0, b > 0, and gamma1 in [0,1].
In the nonlinear model [1], the dependent variable is X(t), the independent variables are R(t) and F(t), and ξ(t) is the error term.
An example of the dataset can be seen here: https://i.stack.imgur.com/2Vf0j.png (68 rows of a time series).
To estimate the nonlinear regression I use the nls() function with no problem as shown below:
NLM1 = nls(Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), start = list(a = 10, b = 10, gamma1 = 0.5), algorithm = "port", lower = c(0, 0, 0), upper = c(Inf, Inf, 1), data = temp2)
I want to estimate NLM1 while also allowing for an AR(1) structure on the residuals.
Basically I want the same step as going from lm() to gls(). My problem is that in the gnls() function I don't know how to put constraints on the model parameters a, b, and gamma1, and the model estimates wrong values for them.
nls() has options for lower and upper bounds; I can't do the same with gnls().
In gnls() I need to add the constraints, something like lower = c(0,0,0), upper = c(Inf,Inf,1) in nls().
NLM1_AR1 = gnls(model = Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), data = temp2, start = list(a = 13, b = 10, gamma1 = 0.5), correlation = corARMA(p = 1))
Does anyone know how to do this?
Thank you

How to generate a Random Floating point Number in range, Swift

I'm fairly new to Swift, only having used Python and Pascal before. I was wondering if anyone could help with generating a floating point number in range. I know that cannot be done straight up. So this is what I've created. However, it doesn't seem to work.
func location() {
    // let DivisionConstant = UInt32(1000)
    let randomIntHeight = arc4random_uniform(1000000) + 12340000
    let randomIntWidth = arc4random_uniform(1000000) + 7500000
    XRandomFloat = Float(randomIntHeight / UInt32(10000))
    YRandomFloat = Float(randomIntWidth / UInt32(10000))
    randomXFloat = CGFloat(XRandomFloat)
    randomYFloat = CGFloat(YRandomFloat)
    self.Item.center = CGPointMake(randomXFloat, randomYFloat)
}
By the looks of it, when I run it, it is not dividing by the value of DivisionConstant, so I commented it out and replaced it with a raw value. However, self.Item still appears off screen. Any advice would be greatly appreciated.
This division probably isn't what you intended:
XRandomFloat = Float(randomIntHeight / UInt32(10000))
This performs integer division (discarding any remainder) and then converts the result to Float. What you probably meant was:
XRandomFloat = Float(randomIntHeight) / Float(10000)
This is a floating point number with a granularity of approximately 1/10000.
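For illustration (not from the original answer), a quick sketch of the difference with made-up values:
// Integer division first (what the original code does):
let h: UInt32 = 12_345_678
let truncated = Float(h / 10_000)   // 1234.0 — the remainder is discarded before conversion
// Float division (what was intended):
let scaled = Float(h) / 10_000      // ≈ 1234.5678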
Your initial code:
let randomIntHeight = arc4random_uniform(1000000) + 12340000
generates a random number between 12340000 and (12340000+1000000-1). Given your final scaling, that means a range of 1234 to 1333. This seems odd for your final goals. I assume you really meant just arc4random_uniform(12340000), but I may misunderstand your goal.
Given your comments, I think you've over-complicated this. The following should give you a random point on the screen, assuming you want an integral (i.e. non-fractional) point, which is almost always what you'd want:
let bounds = UIScreen.mainScreen().bounds
let x = arc4random_uniform(UInt32(bounds.width))
let y = arc4random_uniform(UInt32(bounds.height))
let randomPoint = CGPoint(x: CGFloat(x), y: CGFloat(y))
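As a side note (not part of the original answer), on Swift 4.2 and later the same random point can be generated with the standard random(in:) API instead of arc4random_uniform:
import UIKit

// Swift 4.2+ alternative using the built-in random number API.
let screenBounds = UIScreen.main.bounds
let point = CGPoint(x: CGFloat.random(in: 0..<screenBounds.width),
                    y: CGFloat.random(in: 0..<screenBounds.height))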
Your problem is that you're adding the maximum value to your random value, so of course it's always going to be offscreen.
I'm not sure what numbers you're hoping to generate, but what you're getting are results like:
1317.0, 764.0
1237.0, 795.0
1320.0, 814.0
1275.0, 794.0
1314.0, 758.0
1300.0, 758.0
1260.0, 809.0
1279.0, 768.0
1315.0, 838.0
1284.0, 763.0
1273.0, 828.0
1263.0, 770.0
1252.0, 776.0
1255.0, 848.0
1277.0, 847.0
1236.0, 847.0
1320.0, 772.0
1268.0, 759.0
You're then using this as the center of a UI element. Unless it's very large, it's likely to be off-screen.

Improving detection of the orange colour in MATLAB

One of my tasks is to detect some colours of ant colonies in 16000 images. I've already done this very well for blue, pink, and green, but now I need to improve detection of the orange colour. It's a bit tricky for me, since I am new to the field of image processing. I've put up some examples of what I have done and what my problem was.
Raw image:http://img705.imageshack.us/img705/2257/img4263u.jpg
Detection of the orange colour:http://img72.imageshack.us/img72/8197/orangedetection.jpg
Detection of the green colour:http://img585.imageshack.us/img585/1347/greendetection.jpg
I used selectPixelsAndGetHSV.m to get the HSV value, and then used colorDetectHSV.m to detect pixels with the same HSV value.
Could you give me any suggestions on how to improve detection of the orange colour without also detecting whole ants and the brood around them?
Thank you in advance!
function [K] = colorDetectHSV(RGB, hsvVal, tol)
HSV = rgb2hsv(RGB);
% find the difference between required and real H value:
diffH = abs(HSV(:,:,1) - hsvVal(1));
[M, N, t] = size(RGB);
I1 = zeros(M, N); I2 = zeros(M, N); I3 = zeros(M, N);
T1 = tol(1);
I1(find(diffH < T1)) = 1;
if (length(tol) > 1)
    % find the difference between required and real S value:
    diffS = abs(HSV(:,:,2) - hsvVal(2));
    T2 = tol(2);
    I2(find(diffS < T2)) = 1;
    if (length(tol) > 2)
        % find the difference between required and real V value:
        diffV = abs(HSV(:,:,3) - hsvVal(3));
        T3 = tol(3);
        I3(find(diffV < T3)) = 1;
        I = I1.*I2.*I3;
    else
        I = I1.*I2;
    end
else
    I = I1;
end
K = ~I;
subplot(2,1,1), imshow(RGB); title('Original Image');
subplot(2,1,2), imshow(~I, []); title('Detected Areas');
You don't show what you are using as target HSV values. These may be the problem.
In the example you provided, a lot of areas are wrongly selected whose hue ranges from 30 to 40. These areas correspond to ants body parts. The orange parts you want to select actually have a hue ranging from approximately 7 to 15, and it shouldn't be difficult to differentiate them from ants.
Try adjusting your target values (especially hue) and you should get better results. Actually, you can probably also disregard brightness and saturation; hue seems to be sufficient in this case.
