Get access to street data in iOS - ios

I have a graph G with n nodes. The graph is embedded in 2D space (so that there are well-defined angles and distances between each pair of nodes). Some nodes might be connected with edges to other nodes. Given a location L, this graph needs to be laid out on top of a map close to L, such that each node becomes a marker on the map, and such that there is a walkable path between each pair of connected nodes. Since this will not be possible most of the time, I will allow the graph to be scaled/rotated and I will allow the distances and angles between nodes to be flexible within a certain range.
In order to write this algorithm, I need information about the streets near L. Does anybody know how to get street data as a graph structure (so that I can get walkable paths)? I know the Google Maps API allows you to get directions between two points, but I'm sure I cannot just keep requesting directions without incurring any cost.
Edit: I've been reading a bit about OpenStreetMap API. It looks like this could be interesting. Maybe people can comment on this as well.

You can also check out these sources for OSM and open GIS data:
OpenStreetMap extracts
OpenStreetMap derived data
GIS datasets
In addition to these:
Planet.osm/diffs
I hope it helps.
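Once you have OSM data, turning it into a graph is fairly mechanical, because OSM describes streets as "ways", each an ordered list of node references, and every node carries a latitude and longitude. Here is a minimal sketch of that step. It assumes you have already parsed the data into a `nodes` dict and a `ways` list (the toy data and the names `build_street_graph` and `haversine_m` are made up for illustration), and that you have already kept only the ways you consider walkable.

```python
import math
from collections import defaultdict

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def build_street_graph(nodes, ways):
    """nodes: {node_id: (lat, lon)}; ways: list of node-id sequences.

    Returns an undirected adjacency map {node_id: {neighbour_id: metres}}.
    """
    graph = defaultdict(dict)
    for way in ways:
        # Consecutive nodes of a way are joined by a street segment.
        for u, v in zip(way, way[1:]):
            d = haversine_m(nodes[u], nodes[v])
            graph[u][v] = d
            graph[v][u] = d
    return graph

# Hypothetical toy data: two streets that meet at node 2.
nodes = {1: (52.5200, 13.4050), 2: (52.5205, 13.4050),
         3: (52.5210, 13.4050), 4: (52.5205, 13.4060)}
ways = [[1, 2, 3], [2, 4]]
graph = build_street_graph(nodes, ways)
```

Nodes shared by two ways become intersections automatically, so any shortest-path routine can then be run on `graph` to check whether two markers are connected by a walkable path.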

The TIGER datasets offer a shape/line file for the USA, free. Lots of options including roads. https://www.census.gov/geo/maps-data/data/tiger-line.html

Related

Which graph algorithm to solve this?

Given is an undirected, weighted graph. Each node is either a "city" or a "village"; nodes are connected via the edges (= roads).
I start at Node 1 and have to bring a message to all of the cities.
In each city that is reached by someone, an arbitrary amount of people can help to spread the message (so they can start driving to another Node).
I'm now looking for the minimum total distance that has to be travelled on roads so that all cities get the message. (When x people use a specific road, it counts x times the edge weight.)
Solution: I thought it would be helpful to get a minimum spanning tree between all the cities first. But the problem is that there are villages as well (--> "Steiner tree problem").
So I guess it might be possible to solve it somehow with Dijkstra or Bellman-Ford; or a combination of shortest path and MST?
Do you guys have an idea? I don't need code, but just a basic idea how to approach this would be very helpful:) Thanks.
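One way to make the "combination of shortest path and MST" idea concrete, offered as a sketch rather than a proven solution: compute shortest-path distances between the start node and all cities with Dijkstra, then build a minimum spanning tree over those distances. This relies on two assumptions about the problem statement: helpers can only set out from a city that already has the message (so a road shared by two trips is paid for twice), and Node 1 is treated like a city. The function names below are invented for this example.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from source in a graph {u: {v: weight}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def message_cost(adj, start, cities):
    """Weight of an MST over shortest-path distances between start and cities."""
    terminals = set(cities) | {start}
    closure = {t: dijkstra(adj, t) for t in terminals}
    in_tree = {start}
    total = 0.0
    # Prim's algorithm on the complete graph of terminals.
    while in_tree != terminals:
        d, nxt = min((closure[u].get(v, float("inf")), v)
                     for u in in_tree for v in terminals - in_tree)
        if d == float("inf"):
            raise ValueError("some city is unreachable")
        total += d
        in_tree.add(nxt)
    return total

# Start 1 and cities 3 and 4 all hang off village 2 with unit-length roads.
adj = {1: {2: 1.0}, 2: {1: 1.0, 3: 1.0, 4: 1.0}, 3: {2: 1.0}, 4: {2: 1.0}}
cost = message_cost(adj, 1, [3, 4])  # 4.0
```

In the toy graph a Steiner tree would have weight 3, but since nobody can branch off at the village, two trips of length 2 are needed, which is what the sketch returns.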

Logic for selecting best nearby venues for display on a map

I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1):
score = 0.25*(1 - (distance - distance of closest venue)/(distance of furthest venue - distance of closest venue)) + 0.75*(quality score/highest quality score in remaining set)
Sort them by score and take the top n
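The scoring and sorting steps above can be sketched in a few lines. This is only an illustration: the dict keys, the function name `rank_venues` and the sample venues are made up, the venues are assumed to be already filtered, and the distance component uses min-max normalisation so that the closest venue gets 1 and the furthest gets 0.

```python
def rank_venues(venues, n, w_distance=0.25, w_quality=0.75):
    """venues: list of dicts with 'name', 'distance' and 'rating' keys."""
    if not venues:
        return []
    d_min = min(v["distance"] for v in venues)
    d_max = max(v["distance"] for v in venues)
    r_max = max(v["rating"] for v in venues)
    span = d_max - d_min

    def score(v):
        # Closest venue scores 1, furthest scores 0.
        closeness = 1.0 if span == 0 else 1.0 - (v["distance"] - d_min) / span
        # Rating relative to the best rating in the remaining set.
        quality = 0.0 if r_max == 0 else v["rating"] / r_max
        return w_distance * closeness + w_quality * quality

    return sorted(venues, key=score, reverse=True)[:n]

venues = [
    {"name": "A", "distance": 0.5, "rating": 60},
    {"name": "B", "distance": 2.0, "rating": 95},
    {"name": "C", "distance": 5.0, "rating": 100},
    {"name": "D", "distance": 0.2, "rating": 20},
]
top = [v["name"] for v in rank_venues(venues, 2)]  # ['B', 'C']
```

Keeping the weights as parameters makes it easy to run the A/B tests described above with different weightings.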
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewers' scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, then look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First I suggest that you use a Bayesian average to maintain an overall rating for all the venues; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues ordered by distance and then by rating, i.e. with two sort terms in your query.
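For reference, a Bayesian average pulls a venue's mean rating towards a prior mean until it has collected enough ratings. A minimal sketch of the usual formula follows; it is not the API of the gem linked above, and `prior_mean` and `prior_weight` are tuning values you would have to choose (for example the global mean rating and a typical number of ratings per venue).

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Mean of ratings, shrunk towards prior_mean as if prior_weight
    extra ratings equal to prior_mean had been cast."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))

# One perfect rating barely moves the venue away from the prior of 60 ...
newcomer = bayesian_average([100], prior_mean=60, prior_weight=5)
# ... while fifty ratings of 90 dominate the prior.
established = bayesian_average([90] * 50, prior_mean=60, prior_weight=5)
```

This keeps a venue with a single 100 from outranking one with many consistently high ratings.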

how to cluster users based on tags

I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?
Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using say pig/hadoop/mortar or perhaps on neo4j?
In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).
At the end I would like to have k clusters (maybe a dozen?), i.e. broad buckets which I can use to classify and group my users, and also to gain some insight into how they are divided, with a set of tags representing each cluster.
I've seen some posts out there suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be a distance between two users, or between a user and a set of tags, etc.?
You basically want to cluster the users according to their tags.
To keep it simple, assume that you only have 10 tags (instead of 20,000 ones). Assume that a user, say user_34, has the 2nd and 7th tag. For this clustering task, user_34 can be represented as a point in the 10-dimensional space, and his corresponding coordinates are: [0,1,0,0,0,0,1,0,0,0].
In your own case, each user can be similarly represented as a point in a 20,000-dimensional space.
You can use Apache Mahout which contains many effective clustering algorithms, such as K-means.
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy! It can be computed using any distance function, but the Euclidean distance is the de-facto standard.
Note: Mahout and many other data-mining programs support formats suitable for SPARSE features, i.e. you do not need to insert ...,0,0,0,0,... in the file, but only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
Note: I assumed you only want to cluster your users. Extracting representative info from clusters is somewhat tricky. For example, for each cluster you may select the tags that are more common between the users of the cluster. Alternatively, you may use concepts from information theory, such as information gain to find out which tags contain more information about the cluster.
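To make the sparse representation and the "most common tags" idea concrete, here is a small sketch in plain Python (not Mahout code; the function names and sample users are made up). Each user is stored as the set of tag numbers they have, so user_34 from the example above is simply {2, 7}. For 0/1 vectors the squared Euclidean distance is the number of tags on which two users differ.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two users given as sets of tag numbers.

    For binary vectors, each tag present in exactly one of the two sets
    contributes 1 to the squared distance.
    """
    return math.sqrt(len(a ^ b))

def representative_tags(cluster_users, top=3):
    """Tags that occur most often among the users of one cluster."""
    counts = Counter(tag for tags in cluster_users for tag in tags)
    return [tag for tag, _ in counts.most_common(top)]

user_34 = {2, 7}
other_user = {2, 5}
d = euclidean(user_34, other_user)  # differs on tags 5 and 7

cluster = [{2, 7}, {2, 5}, {2, 7, 9}]
common = representative_tags(cluster, top=2)  # [2, 7]
```

With 20,000 tags and roughly 10 tags per program, the sets stay small, which is the point of the sparse formats mentioned above.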
You should consider using neo4j. You can model your data using the following node labels and relationship types.
If you are not familiar with neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. neo4j efficiently traverses relationships in both directions.
(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)
You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.
With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.

Neo4j Structure for GPS coordinates log

I'm using neo4j for a, let's call it, social network where users will have the ability to log their position during workouts (think Runkeeper and Strava).
I'm thinking about how I want to save the coordinates.
Is it a good idea to have it like node(user)-has->node(workouts)<-is a-node(workout)-start->node(coord)-next->node(coord)-next->.... i.e. a linked list with coordinates for every workout?
I will never query the db for individual points, the workout will always be retrieved as a whole.
Is there a better way to solve this?
I can imagine that a graph db isn't the ideal db to store this type of data, but I don't want to add the complexity of adding another db right now.
Can someone give me any insight on this?
I would suggest you store it as:
user --has--> workout --positionedAt--> coord
This design feels more natural to me as the linked list design you mentioned in your question just produces a really deep traversal which might be annoying to query. In this way you can easily find all the coordinates for a particular workout by simply iterating edges on the workout vertex. I would recommend storing a datetime stamp on the positionedAt edge so that you can sort your coordinates easily.
The downside is that depending on how many coord vertices you intend to have you might end up with some fat workout vertices, but that may not really affect your use case. I can't think of a workout that would generate something like 100000 coordinates (and hence 100000 edges), but perhaps you can. If so, I suppose I could amend my answer a bit.

What spatial indexing algorithm should I use?

I want to implement some kind of spatial indexing data structure for my MKAnnotations.
Currently it's horribly slow when I try to filter them based on distance criteria ( 3-4k of locations, currently extremely slow with a simple double for ... ).
I'd like to create clusters of MKAnnotations, to decide if it is close to another. Also, these locations are in a somewhat (creation) order and a "previous"/"next" functionality would be needed to "jump" between (this is not a must).
I've read about kd-tree and r-tree structures and they both seem to meet the fast distance/neighbor obtaining option for filtering/clustering, but I'm not sure which is the best for me or if there are other options too.
What algorithm/data structure should I use ?
Update: I store these locations in a Core Data database, they represent a path. When the map is opened they are fetched into an array and then I just use that array for distance calculations and annotation creation.
When the user moves/zooms the map, I loop through them and decide what needs to be changed on map, so it is kinda static the whole stuff. As I understood, if I'd be using a tree, I could store the locations there and when a zoom/move happens I just search through it and obtain the ones in the new region. Is this true ?
Even in the dynamic case, when I can add new locations to this array, it would be a single insertion and it's happening rarely.
It depends a lot on what your usage patterns are (how many writes, for example, and whether in-memory or on-disk) and what your data looks like (that is, how it is distributed).
R-trees are good because they are balanced, and allow updating. The R*-tree in my experience is clearly better than the other variants because of the split strategy it has. The benefit is that it produces more square pages than the other strategies, so that for many queries you will need to scan fewer pages.
kd-trees are good if you are in-memory and static. Updating them is very bad, you will need to rebuild the index quite often.
And if your data does not change very often, bulk-loading for the R-tree works very well. You can do Sort-Tile-Recursive bulk loading, which essentially requires (partially) sorting your data on X and Y alternately, so it takes a low O(n log n) to build the tree; very similar to bulk-loading a kd-tree, except that you multi-split instead of binary splitting. This is very popular.
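As a rough illustration of Sort-Tile-Recursive, here is a sketch that packs 2D points into leaf pages only: sort by X, cut into vertical slices, sort each slice by Y, then cut into pages. The names `str_pack` and `bounding_box` are invented for this example; a full R-tree would repeat the same packing on the page bounding boxes to build the upper levels.

```python
import math

def str_pack(points, capacity):
    """Group (x, y) points into leaf pages of at most `capacity` points."""
    if not points:
        return []
    n_pages = math.ceil(len(points) / capacity)
    n_slices = math.ceil(math.sqrt(n_pages))
    slice_size = n_slices * capacity
    by_x = sorted(points, key=lambda p: p[0])
    pages = []
    for i in range(0, len(by_x), slice_size):
        # Within one vertical slice, tile along Y.
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(vertical_slice), capacity):
            pages.append(vertical_slice[j:j + capacity])
    return pages

def bounding_box(page):
    """(x_min, y_min, x_max, y_max) of the points in a page."""
    xs = [p[0] for p in page]
    ys = [p[1] for p in page]
    return (min(xs), min(ys), max(xs), max(ys))

# 16 points on a 4x4 grid, 4 points per page.
points = [(x, y) for x in range(4) for y in range(4)]
pages = str_pack(points, capacity=4)
boxes = [bounding_box(p) for p in pages]
```

On the grid example this yields four square, non-overlapping pages, which is the kind of layout that keeps the number of scanned pages low.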
Furthermore, you can keep track of the number of objects in each page. When displaying things on a map, you may want to stop early when a page would display too small on the screen (i.e. smaller than a marker). At this point, you would not scan that page, but only take the number of objects and display that as a clustered marker until the user zooms in.
For 2D data, with a limited value domain, do not overlook the simple things. Quadtrees can work really well, too! Simplicity can make it a lot easier to optimize things. Or a classic grid approach. If your users tend to spread their annotations in an area (and not put them all into one place), you can just compute integer x,y grid coordinates, and then hash them and make a list for each grid cell.
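The grid approach is short enough to sketch in full. This is an illustrative version, not production code: `GridIndex` and its methods are made-up names, coordinates are assumed to be planar (e.g. projected map points rather than raw latitude/longitude), and `cell_size` is something you would tune to your data.

```python
import math
from collections import defaultdict

class GridIndex:
    """Hash of integer grid cells, each holding a list of points."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def query_rect(self, x_min, y_min, x_max, y_max):
        """Items whose point lies inside the rectangle (borders included)."""
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        found = []
        # Only cells overlapping the rectangle are visited.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for x, y, item in self.cells.get((cx, cy), ()):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        found.append(item)
        return found

index = GridIndex(cell_size=10.0)
index.insert(1, 1, "a")
index.insert(25, 5, "b")
index.insert(12, 18, "c")
visible = index.query_rect(0, 0, 15, 20)  # "a" and "c"
```

Insertion is a single list append, which fits the case where new locations are added only rarely.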
I am no iOS developer, but I looked over the docs and found this:
MKMapView.annotationsInMapRect:
Returns the annotation objects located in the specified map rectangle.
- (NSSet *)annotationsInMapRect:(MKMapRect)mapRect
Parameters
mapRect: The portion of the map that you want to search for annotations.
Return Value
The set of annotation objects located in mapRect.
Discussion
This method offers a fast way to retrieve the annotation objects in a particular portion of the map. This method is much faster than doing a linear search of the objects in the annotations property yourself.
This suggests that the MKMapView already organizes annotations in a spatial index structure. Would this method meet your needs?
If not, I would look for existing open source implementations of any 2D spatial indexing structure and pick the one with the best documentation, cleanest interfaces, etc. rather than worrying about efficiency. If you need to write the code from scratch, I think a quadtree would be the easiest to implement. On the other hand, the Wikipedia article on R-tree seems more specifically targeted towards mapping than the K-D Tree or Quadtree.
