Calculating a lot of Lat/Lngs to a set of 2000 Lat/Lngs in Ruby - ruby-on-rails

I am trying to find the best way to solve the problem below:
Problem
I have (up to) 100,000 Lat/Lng points in Set A
I have (up to) 2000 Lat/Lng points in Set B
I need to find the nearest neighbour of points in set B to points in Set A.
Once they have been paired - I then need to calculate their distance which will be:
2000 Set A points to 2000 Set B Points.
These points are "in memory"; they do not come from a database. They are the result of other calculations done in the system.
Current Solution
Using a KDTree implementation in Ruby I can create a KDTree lookup that will match the points I have. I then use a haversine method in Ruby to calculate the distance of the points when they are paired.
KDtree code: Ruby KDTree Code
haversine Code: Haversine Code
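The linked haversine code is not reproduced here. As a minimal sketch of what such a method usually looks like (the method name, the constant names and the 6371 km mean Earth radius are assumptions, not the asker's code):

```ruby
# Great-circle distance via the haversine formula.
# EARTH_RADIUS_KM and DEG_TO_RAD are assumed helper constants.
EARTH_RADIUS_KM = 6371.0
DEG_TO_RAD = Math::PI / 180.0

def haversine_km(lat1, lng1, lat2, lng2)
  d_lat = (lat2 - lat1) * DEG_TO_RAD
  d_lng = (lng2 - lng1) * DEG_TO_RAD
  a = Math.sin(d_lat / 2)**2 +
      Math.cos(lat1 * DEG_TO_RAD) * Math.cos(lat2 * DEG_TO_RAD) * Math.sin(d_lng / 2)**2
  # Clamp to 1.0 so rounding error cannot push asin out of its domain.
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt([a, 1.0].min))
end
```

Since only the 2000 matched pairs need a distance, this part is called 2000 times and is unlikely to dominate the run time on its own.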
Platform
I am running jruby - with rails as the web framework.
Issue
It's slow! Like 30 to 40 seconds slow... I think the main bottleneck is in the KDTree, but the point lookup takes a long time too (I think). With a small number of points in Set B it's quick, but the higher the number of points in Set B, the slower it gets.
The Question
Would anyone think of doing this differently? Is there something I am missing? I think a Java library might be a lot quicker, but how would I implement this, and which one would I use? (I'm not strong in Java; I use JRuby for multithreading Ruby code on the JVM.)

Is it possible to persist the information to a database? Because then you can use GeoKit, which leverages a geo-aware database (MySQL, Postgres > 8.1, etc) so that you can do this:
Location.find(:all, :origin =>[37.792,-122.393], :within=>10, :order=>"distance asc")
Also, you can find the distance between two points, etc. The response time will be more on par with a DB query, and much faster than what you're seeing.

Just an idea. If you round your lat/lngs to two decimal places, then all the points within the same cell of roughly 1.11 km will round to the same value. See this for more details. I'm not 100% sure about it, but maybe it works for you. Of course, for areas near the poles this will not work, as longitude shrinks there.
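A small sketch of the rounding idea, assuming points are plain [lat, lng] arrays (the method names grid_key and bucket_points are made up for illustration). Note that two points can be very close and still land in different cells if they sit on either side of a cell boundary:

```ruby
# Bucket points by their coordinates rounded to `decimals` places.
def grid_key(lat, lng, decimals = 2)
  [lat.round(decimals), lng.round(decimals)]
end

def bucket_points(points, decimals = 2)
  points.group_by { |lat, lng| grid_key(lat, lng, decimals) }
end

points = [[51.5012, -0.1234], [51.5049, -0.1201], [48.8566, 2.3522]]
buckets = bucket_points(points)
buckets.each { |key, members| puts "#{key.inspect}: #{members.size} point(s)" }
```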
To speed up the distance calculation between two lat/lngs, you can calculate the Euclidean distance using the simple distance formula rather than the geographical distance. This distance will not be accurate, of course, but it will speed up your process.
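One common way to do this is the equirectangular approximation, which is the plain distance formula with the longitude difference scaled by the cosine of the mean latitude. That scaling is my addition, not something stated above; the names below are assumptions, and the result is only reasonable for short distances away from the poles:

```ruby
# Equirectangular approximation: treat a small patch of the globe as flat.
APPROX_EARTH_RADIUS_KM = 6371.0 # assumed mean Earth radius
APPROX_DEG_TO_RAD = Math::PI / 180.0

def approx_distance_km(lat1, lng1, lat2, lng2)
  mean_lat = (lat1 + lat2) / 2.0 * APPROX_DEG_TO_RAD
  x = (lng2 - lng1) * APPROX_DEG_TO_RAD * Math.cos(mean_lat) # east-west component
  y = (lat2 - lat1) * APPROX_DEG_TO_RAD                      # north-south component
  APPROX_EARTH_RADIUS_KM * Math.sqrt(x * x + y * y)
end
```

It needs one cosine and one square root per pair instead of several trigonometric calls, and if you only need to rank candidates you can drop the square root as well.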

Related

Cluster Analysis for crowds of people

I have location data from a large number of users (hundreds of thousands). I store the current position and a few historical data points (minute data going back one hour).
How would I go about detecting crowds that gather around natural events like birthday parties etc.? Even smaller crowds (let's say starting from 5 people) should be detected.
The algorithm needs to work in almost real time (or at least once a minute) to detect crowds as they happen.
I have looked into many cluster analysis algorithms, but most of them seem like a bad choice. They either take too long (I have seen O(n^3) and O(2^n)) or need to know how many clusters there are beforehand.
Can someone help me? Thank you!
Let each user be its own cluster. When she gets within distance R of another user, form a new cluster, and separate again when the person leaves. You have your event when:
Number of people is greater than N
They are in the same place for a time greater than T
The party is not moving (might indicate a public transport)
It's not located in public service buildings (hospital, school etc.)
(good number of other conditions)
One minute is plenty of time to get it done, even with hundreds of thousands of people. A naive implementation would be O(n^2), but there is no point in comparing the location of every individual with every other, only with those in the close neighbourhood. As a first approximation you can divide the "world" into sectors, which also makes the task easy to parallelise and, in turn, easy to scale. More users? Just add a few more nodes, and downscale at times of limited activity.
How to avoid the "single-linkage problem" mentioned by the author in the comments? One idea would be to think in terms of 'mass' and centre of gravity. First of all, do not mark something as an event until the mass is greater than e.g. 15 units. Sure, location is imprecise, but in the case of events it should average around the centre of the event. If your cluster grows in any direction without adding substantial mass, then most likely it isn't right. Look at methods like DBSCAN (density-based clustering); good inspiration can also be taken from physical systems, even the Ising model (here you think in terms of temperature and "flipping" someone to join the crowd). It is not a novel problem and I am sure there are papers that cover it (partially), e.g. Is There a Crowd? Experiences in Using Density-Based Clustering and Outlier Detection.
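The sector idea from this answer can be sketched roughly as follows. This is only an illustration under assumptions: positions are taken to be planar [x, y] coordinates in metres (already projected from lat/lng), the sector size equals the radius R, and crowded_users is a made-up name. It covers only the "more than N people within R" condition, not the time or movement checks:

```ruby
# Count neighbours within `radius` using a uniform grid of sectors, so each
# user is compared only with users in the same or an adjacent sector.
def crowded_users(positions, radius, min_people)
  cells = Hash.new { |hash, key| hash[key] = [] }
  positions.each_with_index do |(x, y), id|
    cells[[(x / radius).floor, (y / radius).floor]] << id
  end

  positions.each_index.select do |id|
    x, y = positions[id]
    cx = (x / radius).floor
    cy = (y / radius).floor
    nearby = 0
    (-1..1).each do |dx|
      (-1..1).each do |dy|
        next unless cells.key?([cx + dx, cy + dy])
        cells[[cx + dx, cy + dy]].each do |other|
          ox, oy = positions[other]
          nearby += 1 if (ox - x)**2 + (oy - y)**2 <= radius**2
        end
      end
    end
    nearby >= min_people # the count includes the user itself
  end
end
```

Because every sector only looks at its eight neighbours, sectors can be handed to different workers, which is where the easy parallelism comes from.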
There is little use in doing a full clustering.
Just use a good database index.
Keep a database of the current positions.
Whenever you get a new coordinate, query the database with the desired radius, say 50 meters. A good index will do this in O(log n) for a small radius. If you get enough results, this may be an event, or someone joining an ongoing event.
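If the database has no spatial index, one way to let an ordinary index narrow the search is to turn the radius into a latitude/longitude bounding box first and then check the exact distance only for the rows inside it. A rough sketch, assuming a spherical Earth and a location away from the poles and the 180° meridian (bounding_box and the constants are illustrative names):

```ruby
BOX_EARTH_RADIUS_M = 6_371_000.0 # assumed mean Earth radius in metres
BOX_DEG_TO_RAD = Math::PI / 180.0
BOX_RAD_TO_DEG = 180.0 / Math::PI

# Returns [min_lat, max_lat, min_lng, max_lng] approximately enclosing
# a circle of radius_m metres around (lat, lng).
def bounding_box(lat, lng, radius_m)
  d_lat = radius_m / BOX_EARTH_RADIUS_M * BOX_RAD_TO_DEG
  d_lng = d_lat / Math.cos(lat * BOX_DEG_TO_RAD) # longitude degrees shrink with latitude
  [lat - d_lat, lat + d_lat, lng - d_lng, lng + d_lng]
end
```

The four values can then be used in a range condition on (hypothetical) lat and lng columns, with the 50 metre check done on the handful of rows that come back.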

idea: getting locations nearby

Currently, I have a table with locations (latitude, longitude). I find nearby locations using sin and cos, as described here.
This seems rather slow. My idea is to pre-calculate the distance to a fixed point f and store it along with the locations. When I then want to find nearby locations, I just calculate the distance to the same fixed point and find them with a less-than-or-equal comparison.
Does my idea make sense? Is there a standard way to do that? I am in the thinking phase, so I do not have any code to show yet.
Your idea won't work unless all your locations are collinear, which most probably is not the case.
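A tiny counterexample makes this concrete. It uses flat x/y coordinates instead of lat/lng to keep the arithmetic obvious, and the names are made up for illustration:

```ruby
# Two points with the same distance to the fixed point can still be far apart.
def planar_distance(a, b)
  Math.hypot(a[0] - b[0], a[1] - b[1])
end

fixed = [0.0, 0.0]
p1 = [10.0, 0.0]
p2 = [-10.0, 0.0]

puts planar_distance(p1, fixed) # 10.0
puts planar_distance(p2, fixed) # 10.0
puts planar_distance(p1, p2)    # 20.0
```

So a similar stored distance is necessary for two locations to be close, but it is not sufficient: every point on the same circle around f has the same value.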
Are you using SQL to do the calculations? Are you properly using indexes? Maybe you could share a bit of your code with us.

Is there a cleverer Ruby algorithm than brute-force for finding correlation in multidimensional data?

My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
MATCHES = {}
for each (PERSON in (people except USER)) do
  for each (RATING that PERSON has made) do
    if (USER has rated the item that RATING refers to) do
      MATCHES[PERSON's id] += difference between PERSON's rating and USER's rating
    end
  end
end
lowest values in MATCHES are the best matches for USER
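For reference, a runnable version of that pseudo-code might look like the sketch below. It assumes ratings are held as a nested hash of the form { person_id => { item_id => rating } } and that "difference" means the absolute difference; both are my assumptions, and best_matches is an illustrative name:

```ruby
# Brute-force matching: compare the user's ratings with everyone else's.
def best_matches(ratings, user_id)
  user_ratings = ratings.fetch(user_id)
  matches = Hash.new(0)
  ratings.each do |person_id, person_ratings|
    next if person_id == user_id
    person_ratings.each do |item_id, rating|
      next unless user_ratings.key?(item_id) # only items both have rated
      matches[person_id] += (rating - user_ratings[item_id]).abs
    end
  end
  matches.sort_by { |_person_id, score| score } # lowest score first
end
```

People who share no rated items with the user never get an entry, so they do not appear in the result at all.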
The problem here being that as the number of items, ratings, and people increase, this code will take a very significant time to run, and ignoring caching for now, this is code that has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically and as such allowing me to keep everything in MySQL or PostgreSQL would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
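To give an idea of the structure, here is a bare-bones sketch of a k-d tree with a nearest-neighbour search. It assumes every person is a complete numeric vector (no missing ratings), which is a simplification of the actual problem, and all names are illustrative:

```ruby
KdNode = Struct.new(:point, :axis, :left, :right)

# Build the tree by splitting on the median, cycling through the axes.
def build_kd_tree(points, depth = 0)
  return nil if points.empty?
  axis = depth % points.first.size
  sorted = points.sort_by { |p| p[axis] }
  mid = sorted.size / 2
  KdNode.new(sorted[mid], axis,
             build_kd_tree(sorted[0...mid], depth + 1),
             build_kd_tree(sorted[(mid + 1)..-1], depth + 1))
end

def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

def nearest(node, target, best = nil)
  return best if node.nil?
  if best.nil? || squared_distance(target, node.point) < squared_distance(target, best)
    best = node.point
  end
  diff = target[node.axis] - node.point[node.axis]
  near, far = diff < 0 ? [node.left, node.right] : [node.right, node.left]
  best = nearest(near, target, best)
  # Search the far side only if the splitting plane is closer than the best match.
  best = nearest(far, target, best) if diff**2 < squared_distance(target, best)
  best
end
```

The pruning step in nearest is what saves work; it becomes less effective as the number of dimensions grows.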
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically your task is matching long strings made out of characters of a 5 letter alphabet. This kind of stuff is researched extensively in the area of computational biology. (Typically with 4 letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198 then you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With rating, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the number of operations. However, a key trick of all the Hadoop solutions is to transpose the matrix, and work only on the non-zero values in the columns. Assuming that your sparsity is s=0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, your computation will be theoretically 10000 times faster.
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem is still quadratic.
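The transposition trick does not need Hadoop to try out. A small in-memory sketch, assuming ratings are stored sparsely as { person_id => { item_id => rating } } and using the absolute difference as the score (the names and the data layout are assumptions):

```ruby
# Transpose person => items into item => [[person, rating], ...].
def build_item_index(ratings)
  index = Hash.new { |hash, key| hash[key] = [] }
  ratings.each do |person_id, person_ratings|
    person_ratings.each { |item_id, rating| index[item_id] << [person_id, rating] }
  end
  index
end

# Only people who rated at least one of the user's items are ever touched.
def scores_for(user_id, ratings, index)
  scores = Hash.new(0)
  ratings.fetch(user_id).each do |item_id, user_rating|
    index.fetch(item_id, []).each do |person_id, rating|
      next if person_id == user_id
      scores[person_id] += (rating - user_rating).abs
    end
  end
  scores
end
```

The work per user is proportional to the number of ratings on the items that user has rated, rather than to the total number of people times items.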
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.

Blackberry cache reverse geocode address info with proximity

Most people are limited to about 5 or 6 locations on a daily basis (work, home, school, store, etc). I want to speed up address display by caching a few of these most visited locations. I've been able to get the address info using both google maps GPS and JSON and Locator.reverseGeocode. What would be the best way to cache this information and to check proximity quickly? I found this GPS distance calculation example and have it working. Is there a faster way to check for proximity?
Please see similar question first: Optimization of a distance calculation function
There are several things we can change in distance calculations to improve performance:
Measure device speed and decrease or increase period of proximity test accordingly
Trigonometric calculations take most of the processing time, but they can be done much faster. First make a rough distance calculation using the lookup table method; then, if the distance is less than the proximity limit + uncertainty limit, use the CORDIC method for a more precise calculation.
Use constants for Math.PI/180.0 and 180.0/Math.PI
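As an illustration of the lookup table and constant suggestions, here is a sketch written in Ruby to match the rest of this page (on BlackBerry it would be Java). The table resolution of 0.1 degrees and all names are assumptions, and whether a table actually beats the built-in functions depends on the platform, so measure before adopting it:

```ruby
# Precomputed conversion constant and a sine table in 0.1 degree steps.
TABLE_DEG_TO_RAD = Math::PI / 180.0
STEPS_PER_DEGREE = 10
SIN_TABLE = Array.new(360 * STEPS_PER_DEGREE + 1) do |i|
  Math.sin(i * TABLE_DEG_TO_RAD / STEPS_PER_DEGREE)
end

# Coarse sine: nearest table entry, accurate to roughly 1e-3.
def fast_sin(degrees)
  SIN_TABLE[((degrees % 360) * STEPS_PER_DEGREE).round]
end

def fast_cos(degrees)
  fast_sin(degrees + 90)
end
```

The coarse values are only meant for the first rough check; anything that passes it would still go through the precise calculation.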
Several links that may be helpful:
Very useful explanations of CORDIC, especially doc from Parallax for dummies
Fast transcendent / trigonometric functions for Java
Cordic.java at Trac by Thomas B. Preusser
Cordic.java at seng440 proj
Sin/Cos look-up table source at processing.org by toxi

Which Improvements can be done to AnyTime Weighted A* Algorithm?

Firstly, for those of you who don't know: an anytime algorithm is an algorithm that takes as input the amount of time it may run and should give the best solution it can in that time.
Weighted A* is the same as A* with one difference in the f function:
(where g is the path cost up to the node, and h is the heuristic estimate of the remaining cost from the node to a goal)
Original = f(node) = g(node) + h(node)
Weighted = f(node) = (1-w)g(node) +h(node)
My anytime algorithm runs Weighted A* with a decreasing weight from 1 to 0.5 until it reaches the time limit.
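To make the setup concrete, here is a rough sketch of that loop on a small grid, using the f function exactly as written above, f = (1-w)g + h. In this formulation w = 1 is a greedy search on h alone and w = 0 would be plain A*. The grid representation, the Manhattan heuristic, the array-based open list and the names are all assumptions for illustration, not the asker's implementation:

```ruby
# grid: array of strings, '#' marks a wall. Returns the path cost or nil.
def weighted_a_star(grid, start, goal, w)
  h = ->(cell) { (cell[0] - goal[0]).abs + (cell[1] - goal[1]).abs }
  best_g = { start => 0 }
  open = [start] # a plain array keeps the sketch short; a heap would be faster
  until open.empty?
    current = open.min_by { |cell| (1 - w) * best_g[cell] + h.call(cell) }
    return best_g[current] if current == goal
    open.delete(current)
    [[1, 0], [-1, 0], [0, 1], [0, -1]].each do |dr, dc|
      r = current[0] + dr
      c = current[1] + dc
      next if r < 0 || c < 0 || r >= grid.size || c >= grid[r].size
      next if grid[r][c] == '#'
      g = best_g[current] + 1
      neighbour = [r, c]
      # Skip states already reached at least as cheaply (duplicate check).
      next if best_g.key?(neighbour) && best_g[neighbour] <= g
      best_g[neighbour] = g
      open << neighbour unless open.include?(neighbour)
    end
  end
  nil
end

# Runs the search once per weight and keeps the cheapest solution found.
# The deadline is only checked between runs, so a single run can overshoot.
def anytime_weighted_a_star(grid, start, goal, weights, time_limit)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + time_limit
  best = nil
  weights.each do |w|
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
    cost = weighted_a_star(grid, start, goal, w)
    best = cost if cost && (best.nil? || cost < best)
  end
  best
end
```

Because the first run uses w = 1, a solution (if one exists) is normally found quickly, and the later, slower runs can only improve on it.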
My problem is that most of the time it takes a long time to reach a solution, and if given something like 10 seconds it usually doesn't find a solution, while other algorithms like anytime beam search find one in 0.0001 seconds.
Any ideas what to do?
If I were you I'd throw the unbounded heuristic away. Admissible heuristics are much better in that given a weight value for a solution you've found, you can say that it is at most 1/weight times the length of an optimal solution.
A big problem when implementing A* derivatives is the data structures. When I implemented a bidirectional search, just changing from array lists to a combination of hash augmented priority queues and array lists on demand, cut the runtime cost by three orders of magnitude - literally.
The main problem is that most of the papers only give pseudo-code for the algorithm using set logic - it's up to you to actually figure out how to represent the sets in your code. Don't be afraid of using multiple ADTs for a single list, i.e. your open list. I'm not 100% sure on Anytime Weighted A*, I've done other derivatives such as Anytime Dynamic A* and Anytime Repairing A*, not AWA* though.
Another issue is when you set the g-value too low: sometimes it can take far longer to find any solution than it would with a higher g-value. A common pitfall is forgetting to check your closed list for duplicate states, thus ending up in a loop (an infinite one if your g-value gets reduced to 0). I'd try starting with something reasonably higher than 0 if you're getting quick results with a beam search.
Some pseudo-code would likely help here! Anyhow these are just my thoughts on the matter, you may have solved it already - if so good on you :)
Beam search is not complete since it prunes unfavorable states whereas A* search is complete. Depending on what problem you are solving, if incompleteness does not prevent you from finding a solution (usually many correct paths exist from origin to destination), then go for Beam search, otherwise, stay with AWA*. However, you can always run both in parallel if there are sufficient hardware resources.
