Kmeans going exceptionally slow when clustering more than 3 documents [closed] - machine-learning

Closed 11 years ago.
I'm trying to use k-means to cluster similar documents together. I am using NLTK's KMeans.
When I cluster only 3 documents, it takes less than 5 seconds. But once I add a fourth document, it doesn't finish (I killed it after 10 minutes).
With 4 documents, the vector size is about 1000. The vectors are sparse, but I have 8 GB of RAM, so I'm not worried about memory; 1000 dimensions shouldn't be that much.
Does anyone have any idea why it solves 3 documents in 5 seconds but can't solve 4 documents in the 10 minutes I gave it before killing it? When I go into production, it will theoretically have to cluster 300 or 400 documents at a time.
I was thinking of trying a different k-means library to see if the NLTK implementation is weak, but I don't want to waste the effort if I'm the problem. A sketch of a typical invocation follows.
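For reference, a typical NLTK invocation looks roughly like this (a minimal sketch with random placeholder vectors; the repeats and avoid_empty_clusters parameters may matter for convergence):

import numpy as np
from nltk.cluster import KMeansClusterer, euclidean_distance

# Placeholder document vectors: 4 documents, ~1000 dimensions each.
vectors = [np.random.rand(1000) for _ in range(4)]

# repeats restarts clustering from several random means and keeps the best run;
# avoid_empty_clusters guards against a failure mode where a cluster loses all members.
clusterer = KMeansClusterer(2, euclidean_distance, repeats=5, avoid_empty_clusters=True)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)  # cluster index for each document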
Thanks all.

I switched to the Pycluster library and it works now.
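A minimal sketch of the kind of Pycluster call that does the job (placeholder data; the parameter choices are illustrative):

import numpy as np
from Pycluster import kcluster

# Placeholder document-term matrix: one row per document.
data = np.random.rand(4, 1000)

# npass runs k-means several times from random starts and keeps the best solution.
clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)
print(clusterid)  # cluster index assigned to each document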

Related

What is so special about 10^9+7? [duplicate]

Closed 10 years ago.
In many programming problems (e.g. some Project Euler problems) we are asked to report the answer as the remainder left after dividing the answer by 1,000,000,007.
Why not any other number?
Edit:
2 years later, here's what I know: the number is a big prime, and any answer to such a question is so large that it makes sense to report a remainder instead (as the number may be too large for a native datatype to handle).
Let me guess at the reasoning. Numbers of the form 1000...7 are often prime, and 1000000007 is the largest such prime that fits in a signed 32-bit integer. Since prime numbers are used to calculate hashes (by taking the remainder of a division by the prime), 1000000007 is a good choice for calculating a 32-bit hash.
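To make the overflow point concrete (a small illustrative sketch, not part of the original answer): reducing modulo 10^9 + 7 after every operation keeps all intermediate results within a fixed-width integer range:

MOD = 10**9 + 7  # prime, and small enough to fit in a signed 32-bit integer

def factorial_mod(n, mod=MOD):
    # n! mod p; reducing at each step keeps every intermediate value below mod**2,
    # which is why a modulus that fits comfortably in a machine word is convenient.
    result = 1
    for k in range(2, n + 1):
        result = (result * k) % mod
    return result

print(factorial_mod(100000))  # instant, even though 100000! has 456,574 digits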

Strange movement latency on other systems, on my own it's fine [closed]

Closed 9 years ago.
I've been developing a platformer engine in XNA for a short time now, and I decided to see how what I had so far ran on my laptop (not the machine I'm developing the game on). I've been getting some strange latency (when changing direction and walking/jumping), as well as a bit of lag. I've asked a friend to run it on his computer and he gets the same result, but on my desktop it's fine. Any thoughts?
I initially thought it may be something to do with different graphics cards, because my desktop has a relatively powerful one and the computers with these problems have integrated graphics, but as it's a small 2D game I wouldn't think there would be problems like this.
Profiling
You will want to profile your code to see which parts of it are taking the longest. There are many profilers you can use, for example:
SlimTune (Free)
ANTS (Free trial, very nice)
There are other questions dedicated to this.
Once you have one up and running, you will probably want to run it on two computers (one where the game runs well and one where it doesn't) to see what causes it to run slowly on each.
Timesteps / Framerate
Another thing you may want to play around with is the fixed-timestep and vsync settings:

IsFixedTimeStep = true;  // or false: variable timestep
graphics.SynchronizeWithVerticalRetrace = true;  // or false: disable vsync

I have toggled both of these, and depending on the situation it can make the game smoother and less laggy.
Elapsed Time
And lastly, you may already be doing this, but always apply elapsed time to your movement.

DON'T do this:

position += velocity;

DO this:

// Seconds since the last update; scaling by it makes movement framerate-independent.
float elapsed = (float)gameTime.ElapsedGameTime.TotalSeconds;
position += velocity * elapsed * speed;

How to Classify Data In Opencv [closed]

Closed 9 years ago.
I have 130 objects. Each object is defined by 13 2-D points; these 13 points form a data_unit, so there are 130 data_units. I want to classify these data_units into 4 classes. How can I do this? k-means is not possible in this scenario; what are the alternatives?
There is a whole family of classification methods based on a technique called machine learning. The ones implemented in OpenCV are described here. You can try, for example, Support Vector Machines. It's a nice and fairly easy-to-use method, with the kernel trick to handle data that cannot be linearly separated.
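For illustration, a minimal sketch with OpenCV's Python bindings (the cv2.ml module from OpenCV 3+; the arrays here are random placeholders). Note that an SVM is supervised, so you need class labels for training; if the data_units are unlabeled, a clustering method such as EM or hierarchical clustering is the alternative:

import numpy as np
import cv2

# Placeholder data: 130 data_units, each 13 2-D points flattened into 26 floats.
samples = np.random.rand(130, 26).astype(np.float32)
labels = np.random.randint(0, 4, (130, 1)).astype(np.int32)  # 4 classes

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_RBF)  # the kernel trick for non-linearly-separable data
svm.train(samples, cv2.ml.ROW_SAMPLE, labels)

_, predictions = svm.predict(samples)  # predicted class for each data_unit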

50 People All over the World hit web app versus 50 people on the Same LAN? Any difference? [closed]

Closed 11 years ago.
At my office we use a LAN (connected to the internet through Comcast), and 30-50 people all accessed a web app (on Heroku) simultaneously. The server responded as if it had been hit by 50K people. Am I barking up the wrong tree, or would it make a difference if 50 people on an office network hit an app at the same time versus 50 people spread across the globe?
Apologies for such a vague question, but it only just occurred to me as a possibility.
Thanks in advance.
Having 50 people access your app simultaneously will have roughly the same effect, no matter where they come from. There is no more effort required to serve local requests than global requests (or vice versa). If your app has a performance problem, then consider yourself lucky that you've exposed it before turning on public access!

Image processing with Hadoop MapReduce [closed]

Closed 11 years ago.
I am doing a project on motion estimation between two frames of a video sequence, using a block-matching algorithm with the SAD (sum of absolute differences) metric. It involves computing the SAD between each block of the reference frame and each block of the candidate frame within a search window, to get the motion vector between the two frames.
I want to implement the same thing using MapReduce, splitting the frames into key-value pairs, but I am not able to figure out the logic, because everywhere I look I find the WordCount or query-search problem, which is not analogous to mine.
I would also appreciate any further MapReduce examples you can point me to.
Hadoop is used in situations where the computation can happen in parallel and a single machine might take a lot of time to do the processing. There is nothing stopping you from using Hadoop for video processing. Check this and this for more information on where Hadoop can be used; some of these uses are related to video processing.
Start by understanding the WordCount example and Hadoop in general. Run the example on Hadoop, and then work from there. I would also suggest buying the book Hadoop: The Definitive Guide. Hadoop and its ecosystem are changing at a very fast pace and it's tough to stay up to date, but the book will definitely give you a start. A conceptual sketch of how block matching maps onto key-value pairs follows.
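As a conceptual sketch (plain Python, not a runnable Hadoop job; the function and variable names are hypothetical): the map step emits one (block_id, (displacement, SAD)) record per candidate displacement, and the reduce step keeps the displacement with the minimum SAD, which is the motion vector for that block. This key-value decomposition is the part that differs from WordCount.

import numpy as np

def map_block(block_id, ref_block, search_window):
    # Emit (key, value) = (block_id, ((dy, dx), sad)) for every candidate
    # displacement of the reference block inside the search window.
    bh, bw = ref_block.shape
    wh, ww = search_window.shape
    for dy in range(wh - bh + 1):
        for dx in range(ww - bw + 1):
            candidate = search_window[dy:dy + bh, dx:dx + bw]
            sad = int(np.abs(ref_block.astype(np.int64) - candidate.astype(np.int64)).sum())
            yield block_id, ((dy, dx), sad)

def reduce_block(block_id, values):
    # All SAD records for one block arrive grouped by key; keep the smallest.
    best_displacement, _ = min(values, key=lambda v: v[1])
    return block_id, best_displacement

In a real Hadoop Streaming job, the mapper would read the block and search-window pixels from its input split, and the framework would group all values by block_id before the reducer runs.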
