What spatial indexing algorithm should I use? - ios

I want to implement some kind of spatial indexing data structure for my MKAnnotations.
Currently it's horribly slow when I try to filter them based on distance criteria (3-4k locations, extremely slow with a simple double for loop).
I'd like to create clusters of MKAnnotations, to decide if it is close to another. Also, these locations are in a somewhat (creation) order and a "previous"/"next" functionality would be needed to "jump" between (this is not a must).
I've read about kd-tree and r-tree structures and they both seem to meet the fast distance/neighbor obtaining option for filtering/clustering, but I'm not sure which is the best for me or if there are other options too.
What algorithm/data structure should I use ?
Update: I store these locations in a Core Data database, they represent a path. When the map is opened they are fetched into an array and then I just use that array for distance calculations and annotation creation.
When the user moves/zooms the map, I loop through them and decide what needs to be changed on map, so it is kinda static the whole stuff. As I understood, if I'd be using a tree, I could store the locations there and when a zoom/move happens I just search through it and obtain the ones in the new region. Is this true ?
Even in the dynamic case, when I can add new locations to this array, it would be a single insertion and it's happening rarely.

It depends a lot on what your usage patterns are (how many writes, for example, and whether you work in-memory or on-disk) and what your data looks like (that is, how it is distributed).
R-trees are good because they are balanced, and allow updating. The R*-tree in my experience is clearly better than the other variants because of the split strategy it has. The benefit is that it produces more square pages than the other strategies, so that for many queries you will need to scan fewer pages.
kd-trees are good if you are in-memory and static. Updating them is expensive: you will need to rebuild the index quite often.
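To make the "in-memory and static" case concrete, here is a minimal sketch of a 2D kd-tree that is built once from a list of points and then answers rectangle queries (for example, the visible map region). It is written in Python purely for illustration; the names build_kdtree and range_query are made up for this example, and points are assumed to be plain (x, y) tuples.

```python
def build_kdtree(points, depth=0):
    """Build a static 2D kd-tree by splitting on the median, alternating axes."""
    if not points:
        return None
    axis = depth % 2  # 0 = split on x, 1 = split on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),       # coordinates <= median
        "right": build_kdtree(pts[mid + 1:], depth + 1),  # coordinates >= median
    }


def range_query(node, lo, hi, out=None):
    """Collect all points p with lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        out.append(p)
    # Only descend into a subtree if the query rectangle can overlap it.
    if lo[axis] <= p[axis]:
        range_query(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:
        range_query(node["right"], lo, hi, out)
    return out


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
found = sorted(range_query(tree, (3, 1), (8, 5)))
```

Note that this sketch has no insert or delete operation at all, which matches the point above: with changing data you would rebuild the tree.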
And if your data does not change very often, bulk-loading for the R-tree works very well. You can do Sort-Tile-Recursive bulk loading, which essentially requires (partially) sorting your data on X and Y alternately, so it takes a low O(n log n) to build the tree; very similar to bulk-loading a kd-tree, except that you multi-split instead of binary splitting. This is very popular.
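As a rough illustration of the Sort-Tile-Recursive idea, the sketch below packs points into leaf pages only: sort by X, cut into vertical strips, sort each strip by Y, and fill pages of a fixed capacity. Building the upper levels would repeat the same step on the page bounding boxes, which is left out here. The function names and the page capacity are assumptions for this example, not part of any particular R-tree library.

```python
import math


def str_pack_leaves(points, capacity):
    """Pack (x, y) points into leaf pages using one Sort-Tile-Recursive pass."""
    n = len(points)
    if n == 0:
        return []
    leaf_count = math.ceil(n / capacity)       # number of pages needed
    strips = math.ceil(math.sqrt(leaf_count))  # number of vertical strips
    per_strip = strips * capacity              # points per vertical strip
    by_x = sorted(points, key=lambda p: p[0])
    leaves = []
    for i in range(0, n, per_strip):
        # Within a strip, order by y and cut into pages of `capacity` points.
        strip = sorted(by_x[i:i + per_strip], key=lambda p: p[1])
        for j in range(0, len(strip), capacity):
            leaves.append(strip[j:j + capacity])
    return leaves


def bounding_box(page):
    """Return (min_x, min_y, max_x, max_y) of a page of points."""
    xs = [p[0] for p in page]
    ys = [p[1] for p in page]
    return (min(xs), min(ys), max(xs), max(ys))


points = [(i * 7 % 10, i * 3 % 10) for i in range(10)]
leaves = str_pack_leaves(points, capacity=3)
boxes = [bounding_box(page) for page in leaves]
```

The multi-way split is visible in the inner loop: each strip is cut into several pages at once instead of being halved as in a kd-tree.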
Furthermore, you can keep track of the number of objects in each page. When displaying things on a map, you may want to stop early when a page would display too small on the screen (i.e. smaller than a marker). At this point, you would not scan that page, but only take the number of objects and display that as a clustered marker until the user zooms in.
For 2D data, with a limited value domain, do not overlook the simple things. Quadtrees can work really well, too! Simplicity can make it a lot easier to optimize things. Or a classic grid approach. If your users tend to spread their annotations in an area (and not put them all into one place), you can just compute integer x,y grid coordinates, and then hash them and make a list for each grid cell.
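A minimal sketch of the quadtree option, combined with the per-page counting idea from the previous paragraph, could look like the following. It uses a quadtree rather than an R-tree, and everything in it (the QuadNode class, the capacity of 4, the min_size parameter standing in for "smaller than a marker") is an assumption made for illustration.

```python
class QuadNode:
    """Square quadtree node covering [x, x+size] x [y, y+size], with a point count."""

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []      # only used while this node is a leaf
        self.children = None  # four sub-squares once split
        self.count = 0        # number of points stored in this subtree

    def _child_for(self, px, py):
        half = self.size / 2
        ix = 1 if px >= self.x + half else 0
        iy = 1 if py >= self.y + half else 0
        return self.children[iy * 2 + ix]

    def insert(self, px, py):
        self.count += 1
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        # Split an overfull leaf; the size guard stops endless splitting on duplicates.
        if len(self.points) > self.capacity and self.size > 1e-9:
            half = self.size / 2
            self.children = [
                QuadNode(self.x + dx * half, self.y + dy * half, half, self.capacity)
                for dy in (0, 1) for dx in (0, 1)
            ]
            old, self.points = self.points, []
            for qx, qy in old:
                self._child_for(qx, qy).insert(qx, qy)

    def query(self, x0, y0, x1, y1, min_size, out=None):
        """Return ("point", x, y, 1) or ("cluster", cx, cy, count) entries."""
        if out is None:
            out = []
        if (self.count == 0 or self.x > x1 or self.x + self.size < x0
                or self.y > y1 or self.y + self.size < y0):
            return out
        if self.size < min_size and self.count > 1:
            # Node would be drawn smaller than a marker: report only its count.
            # The count covers the whole node, even if it only partly overlaps the query.
            half = self.size / 2
            out.append(("cluster", self.x + half, self.y + half, self.count))
        elif self.children is None:
            out.extend(("point", px, py, 1) for px, py in self.points
                       if x0 <= px <= x1 and y0 <= py <= y1)
        else:
            for child in self.children:
                child.query(x0, y0, x1, y1, min_size, out)
        return out


pts = [(10, 10), (12, 11), (80, 80), (81, 79), (50, 50), (20, 70)]
root = QuadNode(0, 0, 100)
for px, py in pts:
    root.insert(px, py)
detailed = root.query(0, 0, 100, 100, min_size=0.0)
coarse = root.query(0, 0, 100, 100, min_size=200.0)
```

When the map moves or zooms, you would call query with the visible rectangle and a min_size derived from the marker size at the current zoom level.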

I am no iOS developer, but I looked over the docs and found this:
MKMapView.annotationsInMapRect:
Returns the annotation objects located in the specified map rectangle.
(NSSet *)annotationsInMapRect:(MKMapRect)mapRect
Parameters
mapRect: The portion of the map that you want to search for annotations.
Return Value
The set of annotation objects located in mapRect.
Discussion
This method offers a fast way to retrieve the annotation objects in a particular portion of the map. This method is much faster than doing a linear search of the objects in the annotations property yourself.
This suggests that MKMapView already organizes annotations in a spatial index structure. Would this method meet your needs?
If not, I would look for existing open source implementations of any 2D spatial indexing structure and pick the one with the best documentation, cleanest interfaces, etc. rather than worrying about efficiency. If you need to write the code from scratch, I think a quadtree would be the easiest to implement. On the other hand, the Wikipedia article on the R-tree seems more specifically targeted towards mapping than the articles on the k-d tree or quadtree.

Related

Coredata performance: is there a penalty for loading many individual entities?

I'm working on an app that will include a set of points drawn from CLLocationManager and draw them on a map. I'll never really have a need for each point as an individual entity, they only have meaning in the context of the path.
Instead of creating a model representing the points, I could just store the path as a big JSON (or other more efficient string format) and thereby read only the single entity when it's time to pull the data out. It seems to me this could save overhead, is that true?
This is something that would need some testing. Fetching the single path that directly contains the points is probably faster than fetching all the point entities that belong to a certain path, but writing them into strings seems a bit off: JSON is a string format, and parsing those strings will be slow.
To store the points of a path, I would suggest either adding a point entity that is linked to the path through a relationship, or using transformable data: each point is represented by 2 or 3 double values, which can be put directly into a buffer (NSData, for instance). The length of the saved data then defines the number of points as data.length/(sizeof(double)*dimensions). This is extremely easy in Objective-C, while in Swift you may lose some hair when working with raw data and unsafe pointers.
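The buffer layout is simple enough to sketch. The example below uses Python's struct module in place of NSData, purely to illustrate the byte layout; pack_points and unpack_points are hypothetical helper names, and little-endian 8-byte doubles are an assumption.

```python
import struct

DOUBLE_SIZE = struct.calcsize("<d")  # 8 bytes per little-endian double


def pack_points(points, dimensions=2):
    """Flatten (lat, lon[, alt]) tuples into one contiguous buffer of doubles."""
    flat = []
    for point in points:
        if len(point) != dimensions:
            raise ValueError("every point must have exactly %d values" % dimensions)
        flat.extend(point)
    return struct.pack("<%dd" % len(flat), *flat)


def unpack_points(data, dimensions=2):
    """Recover the points; the count is len(data) / (sizeof(double) * dimensions)."""
    point_size = DOUBLE_SIZE * dimensions
    if len(data) % point_size != 0:
        raise ValueError("buffer length is not a whole number of points")
    count = len(data) // point_size
    flat = struct.unpack("<%dd" % (count * dimensions), data)
    return [tuple(flat[i * dimensions:(i + 1) * dimensions]) for i in range(count)]


path = [(46.0569, 14.5058), (46.0571, 14.5061), (46.0574, 14.5066)]
blob = pack_points(path)
restored = unpack_points(blob)
```

Doubles survive the round trip bit for bit, so no precision is lost, unlike with a decimal string format.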
It really depends on what you are implementing but if you plan to have very many paths in the database you can still expect a large delay when fetching the data. You might want to consider creating sectors. Each sector would be represented with the same data as the region (MKCoordinateRegion) where on database initialize you would iterate to create sectors for the whole earth. Then when you are inserting paths you check what regions the path intersects with and assign the path to those regions (many-to-many relation). Now when you show the map you check what regions are visible and fetch only those regions and then extract paths from them.

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such a system. I choose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use-case: (1) It suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision. (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything - for each of the N features calculate the distribution of values. However as M and N are potentially large, I wonder if there is a more compact way to represent the data or more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such task it would be even better.
If I understand correctly your question,
you need to go over all the data anyway, so why not use a hash table?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
This way, the counts stored in a feature's inner hash table indicate how common the feature is in your data, and the number of distinct values (the size of the inner table) together with their counts indicates how much the values differ from one another. Another thing to notice is that you go over your data only once, and that (almost) every operation on a hash table takes O(1) time on average if you allocate enough space from the beginning.
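A minimal sketch of the two nested hash tables in Python, assuming each sample is a dict of {key: value} features with hashable values. The function names and the min_support threshold are made up for this example.

```python
from collections import Counter, defaultdict


def feature_distributions(samples):
    """Outer table: feature key -> inner table (value -> number of occurrences)."""
    dist = defaultdict(Counter)
    for sample in samples:  # single pass over the data
        for key, value in sample.items():
            dist[key][value] += 1
    return dist


def summarize(dist, total_samples, min_support=0.5):
    """For features present in at least min_support of samples, report
    (support, dominant value, share of the dominant value among occurrences)."""
    summary = {}
    for key, counts in dist.items():
        seen = sum(counts.values())
        support = seen / total_samples
        if support >= min_support:
            value, n = counts.most_common(1)[0]
            summary[key] = (support, value, n / seen)
    return summary


samples = [
    {"os": "linux", "port": 80},
    {"os": "linux"},
    {"os": "mac", "port": 80},
    {"os": "linux", "user": "a"},
]
dist = feature_distributions(samples)
summary = summarize(dist, len(samples))
```

Because the tables are updated one sample at a time, the same structure also works online: call the inner loop for each new sample as it arrives.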
Hope it helps

Neo4j Structure for GPS coordinates log

I'm using neo4j for a, let's call it, social network where users will have the ability to log their position during workouts (think Runkeeper and Strava).
I'm thinking about how I want to save the coordinates.
Is it a good idea to have it like node(user)-has->node(workouts)<-is a-node(workout)-start->node(coord)-next->node(coord)-next->.... i.e. a linked list with coordinates for every workout?
I will never query the db for individual points, the workout will always be retrieved as a whole.
Is there a better way to solve this?
I can imagine that a graph db isn't the ideal db to store this type of data, but I don't want to add the complexity of adding another db right now.
Can someone give me any insight on this?
I would suggest you store it as:
user --has--> workout --positionedAt--> coord
This design feels more natural to me as the linked list design you mentioned in your question just produces a really deep traversal which might be annoying to query. In this way you can easily find all the coordinates for a particular workout by simply iterating edges on the workout vertex. I would recommend storing a datetime stamp on the positionedAt edge so that you can sort your coordinates easily.
The downside is that depending on how many coord vertices you intend to have you might end up with some fat workout vertices, but that may not really affect your use case. I can't think of a workout that would generate something like 100000 coordinates (and hence 100000 edges), but perhaps you can. If so, I suppose I could amend my answer a bit.

Tinkerpop Blueprints Vertex Query

I've been researching the Tinkerpop stack for quite a while. I think I have a good idea of what it can do and what databases it works well with. I've got a couple of different databases I'm thinking about right now, but haven't decided on a definite. So I've decided to write my code purely to the interfaces, and not take into account any implementation right now. Out of the databases I'm looking at, they implement TransactionalGraph and KeyIndexableGraph. I think that's good enough for what I need, but I have just one question.
I have different 'classes' of vertices. Using Blueprints, I believe that's best representable by having a field in each vertex containing the class name. Doing that, I can do something like graph.getVertices("classname", "User") and it would give me all of the user vertices. And since the getVertices function specifies that an implementation should make use of indexes, I'm guaranteed to get a fast lookup (if I index that field).
But let's say that I wanted to retrieve a vertex based on two properties. The vertex must have className=Users and username=admin. What's the best way to go about finding that single vertex? And is it possible to index over both of those properties, even though not all vertices will have a username field?
FYI - The databases I'm currently thinking of are OrientDB, Neo4j and Titan, but I haven't decided for sure yet. I'm also currently planning to use Gremlin if that helps at all.
Using a "class" or a "type" for vertices is a good way to segment them. Doing:
graph.createKeyIndex("classname",Vertex.class);
graph.getVertices("classname", "User");
is a pretty common pattern and should generally yield a fast lookup, though iterating an index of tens of millions of users might not be so great (if you intend to grow a particular classname to very big size). I think that leads to the second part of your question, in regards to doing a two property lookup.
Taking your example on the surface, the two element lookup would be something like (using Gremlin):
g.V('classname',"User").has('username','admin')
So, you narrow the vertices to just "User" vertices with a key index and then filter those for "admin". But, I'd model this differently. It would be even less expensive to simply do:
graph.createKeyIndex("username",Vertex.class);
graph.getVertices("username", "admin");
or in Gremlin:
g.V('username','admin')
If you know the username you want, there's no better/faster way to model this. You really only need the classname if you want to iterate over all "User" vertices. If you just want to find one (or a set of vertices with that username) then key indexing on that property is the better way.
Even if I don't create a key index on it, I still include a type or classname property on all vertices. I find it helpful in global operations where I may or may not care about speed, but just need an answer.
graph.getVertices() will iterate through all vertices and look for ones with that property if you do not have the auto-index turned on in your graph implementation. If you already have data and cannot just turn on the auto-indexer, you should use index = indexableGraph.getIndex() and then index.get('classname', 'User')
It's possible to perform a query over multiple objects, but without specifics, it's hard to say. For Neo4j they use Lucene, which means that query() will take a lucene query, such as className:Users AND username:admin, but I cannot speak for the others.
Any of those DBs is good for playing with. I personally found Neo4j to be the easiest, and as long as you understand its licensing structure, you shouldn't have any problems using it.

Architecture of finding movable geotagged objects

I currently have a Postgres DB filled with approx. 300.000 data-sets of moving vehicles all over the world. My very frequently repeated query is: Give me all vehicles in a 5/10/20 mile radius. Currently I spend around 600 to 1200 ms in the DB to prepare the set of located vehicle-objects.
I am looking to vastly improve this time by ideally one or two orders of magnitude if possible. I am working in a Ruby on Rails 3.0beta environment if this is relevant.
Any ideas how to architect the whole system to accelerate this query? Any NoSQL database able to deliver this kind of geolocation performance? I know of MongoDB working on an extension to facilitate this scenario but haven't tried it yet. Any intelligent use of Redis to achieve this?
One problem with SQL DBs here seems to be that I can't really use indexes because my vehicles are mostly moving around, meaning I would have to constantly rebuild the DB indexes, which by itself is probably more expensive than just searching without an index.
Looking forward to your thoughts. Thanks!
If you use the right algorithm for organizing your data, you will be able to use a spatial index which can dramatically speed up your queries.
The best practice for the geolocation domain is to use a geohash, quad-tree, R-tree or similar data structure (R-trees are the most generic, but it sounds like you're querying point data, so that may not matter). In each case, you can create a spatial index that uses a single, linear column where each value represents a bounding box of varying size and shape. This should let you answer most queries with a single range query in your database. Spatial indices can be implemented in SQL (PostGIS, MS SQL, MySQL all have spatial datatypes and spatial indices which use one of these techniques) or NoSQL (popular for its horizontal scalability; AppEngine has geomodel, SimpleGeo uses Cassandra, Foursquare uses MongoDB).
Using an index can be complicated by constantly moving points, but I would suspect that writes, even slightly heavier writes that update indices, wouldn't be your bottleneck.
Even though your vehicles are moving around all the time, I assume they have some kind of speed limit. What you can do is to create some kind of discrete coordinate system, one example would be the integer part of the lat/long coordinate. Then you put those values in separate columns, keeping the exact location in another column. You should then be able to index the integer columns, as the vehicles won't move so much that they change those values very often.
When doing a search, you first find out what "squares" are interesting, and restrict your query to the vehicles within those squares, using the indexed columns. Then you have to do a full search of all vehicles within each square. The number of vehicles you have to do a full search over should now only be a small fraction of all vehicles. The efficiency of this strategy of course depends on the distribution of your vehicles. If 50% of them are in a certain city somewhere this will not work, but assuming the largest group of vehicles in one place is 5-10% it should improve performance.
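The two-step search described above can be sketched in memory as follows. In the database, cell_of would correspond to the two indexed integer columns and the inner loop to the exact distance check on the remaining rows. All names here are hypothetical, and the sketch ignores wrap-around at the antimeridian and behaviour near the poles.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def cell_of(lat, lon):
    """Discrete "square": the integer part (floor) of each coordinate."""
    return (math.floor(lat), math.floor(lon))


def build_index(vehicles):
    """vehicles: dict of id -> (lat, lon). Returns cell -> list of ids."""
    index = defaultdict(list)
    for vid, (lat, lon) in vehicles.items():
        index[cell_of(lat, lon)].append(vid)
    return index


def vehicles_within(vehicles, index, lat, lon, radius_miles):
    # Step 1: find the squares that can contain matches (approximate bounding box).
    dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    cos_edge = math.cos(math.radians(min(abs(lat) + dlat, 89.9)))
    dlon = dlat / cos_edge
    hits = []
    for clat in range(math.floor(lat - dlat), math.floor(lat + dlat) + 1):
        for clon in range(math.floor(lon - dlon), math.floor(lon + dlon) + 1):
            # Step 2: exact distance check, only for vehicles in those squares.
            for vid in index.get((clat, clon), []):
                if haversine_miles(lat, lon, *vehicles[vid]) <= radius_miles:
                    hits.append(vid)
    return sorted(hits)


vehicles = {"a": (40.0, -74.0), "b": (40.05, -74.0), "c": (41.0, -74.0), "d": (39.99, -73.99)}
index = build_index(vehicles)
near = vehicles_within(vehicles, index, 40.0, -74.0, 5)
far = vehicles_within(vehicles, index, 40.0, -74.0, 100)
```

A vehicle only needs its cell updated when it crosses an integer degree boundary, which is the reason the indexed columns change rarely even though the exact position changes constantly.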

Resources