Which graph algorithm to solve this?

Given is an undirected, weighted graph. Each node is either a "city" or a "village"; nodes are connected via the edges (= roads).
I start at node 1 and have to bring a message to all of the cities.
In each city that is reached by someone, an arbitrary number of people can help to spread the message (so they can start driving to other nodes).
I'm now looking for the minimum total distance that has to be travelled on roads so that all cities get the message. (When x people use a specific road, it also counts x times the edge weight.)
Solution: I thought it would be helpful to build a minimum spanning tree between all the cities first. But the problem is that there are villages as well (--> "Steiner tree problem").
So I guess it might be possible to solve it somehow with Dijkstra or Bellman-Ford, or with a combination of shortest paths and an MST?
Do you guys have an idea? I don't need code; just a basic idea of how to approach this would be very helpful :) Thanks.
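For concreteness, one standard way to realize the "shortest paths + MST" combination mentioned above: build the metric closure over the cities (all-pairs shortest paths, with villages allowed as intermediate hops) and take an MST of that complete graph. This is the classic 2-approximation for the Steiner tree problem. A minimal Python sketch, where the adjacency-list format and the tiny demo graph are illustrative:

import heapq

def dijkstra(adj, src):
    # Shortest-path distances from src; adj maps node -> list of (neighbor, weight).
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def steiner_2approx_weight(adj, cities):
    # MST weight of the "metric closure": the complete graph on the cities,
    # where each edge weight is the shortest-path distance in the real graph.
    dist = {c: dijkstra(adj, c) for c in cities}
    cities = list(cities)
    in_tree = {cities[0]}
    total = 0
    while len(in_tree) < len(cities):  # Prim's algorithm on the closure
        w, v = min((dist[u][x], x) for u in in_tree
                   for x in cities if x not in in_tree)
        in_tree.add(v)
        total += w
    return total

# Demo: cities 1 and 3, village 2; going through the village costs 2 vs. 4 direct.
adj = {1: [(2, 1), (3, 4)], 2: [(1, 1), (3, 1)], 3: [(1, 4), (2, 1)]}
print(steiner_2approx_weight(adj, [1, 3]))  # -> 2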

NEO4J How to make graph with relationships

I am completely new to Neo4j and am using it for the first time, for my master's program. I've read the documentation and watched tutorials online but can't seem to figure out how to represent my nodes the way I want.
I have a dataframe with 3 columns: the first represents a page name, the second also represents a page name, and the third represents a similarity score between those two pages. How can I create a graph in Neo4j where the nodes are my unique page names and relationships between nodes are drawn only if there is a similarity score between them (so if the sim-score is 0, no relationship is drawn)? I want to show the similarity score as the text of the relationship.
Furthermore, I want to know if there is an easy way to figure out which node has the most relationships to other nodes?
I've added a screenshot of the header of my DF for clarity: https://imgur.com/a/pg0knh6. I hope someone can help me, thanks in advance!
Edit: What I have tried
LOAD CSV WITH HEADERS FROM 'file:///wiki-small.csv' AS line
MERGE (p:Page {name: line.First})
MERGE (p2:Page {name: line.Second})
MERGE (p)-[r:SIMILAR]->(p2)
ON CREATE SET r.similarity = toFloat(line.Sim)
The next block removes the SIMILAR relationships whose similarity is 0:
MATCH ()-[r:SIMILAR]->() WHERE r.similarity = 0.0
DELETE r
This works partially: it gives me the correct structure of the nodes, but it doesn't show the similarity scores as relationship labels. I also still need to figure out how to find the node with the most connections.
For the first question:
How can I create a graph in Neo4j where the nodes are my unique page names and relationships between nodes are drawn only if there is a similarity score between them (so if the sim-score is 0, no relationship is drawn)?
I think a better approach is to remove the rows with similarity = 0.0 in advance, before ingesting them into Neo4j. Would that be feasible? If your dataset is not too big, this is very fast to do in Python. Otherwise, the solution you describe of deleting after inserting the data is an option.
In the case of a big dataset, it may be better to load the data using apoc.periodic.iterate or USING PERIODIC COMMIT.
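If you go the pre-filtering route, a minimal pandas sketch (the column names First/Second/Sim are taken from your LOAD CSV query; the file names are illustrative):

import pandas as pd

# Drop the zero-similarity rows before ingestion, so the SIMILAR
# relationships for them are never created in the first place.
df = pd.read_csv("wiki-small.csv")
df = df[df["Sim"].astype(float) > 0.0]
df.to_csv("wiki-small-filtered.csv", index=False)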
Second question
I want to know if there is an easy way to figure out which node has the most relationships to other nodes?
This is an easy query. Again, you can do it with plain Cypher or with the APOC library:
// Plain Cypher
MATCH (n:Page)-[r:SIMILAR]->()
RETURN n.name, count(*) AS cnt
ORDER BY cnt DESC
// APOC
MATCH (n:Page)
RETURN n.name, apoc.node.degree(n, "SIMILAR>") AS degree
ORDER BY degree DESC;
EDIT
To display the similarity scores in Neo4j Desktop or the other web interfaces:
1. Click on a SIMILAR arrow; the labels are shown at the top of the result panel.
2. Click on the SIMILAR label marker; at the bottom of the panel, next to Caption, select the property you want to show (similarity in your case).
Then all the arrows are displayed with their similarity score.
To the second question: I think you should keep a clear separation between the way you store data and the way you visualize it. Having the similarity score (a property of the SIMILARITY edge) as a "label" is something that is best dealt with by using an adequate viz library or platform. Ours (Graphileon) could be such a platform, although there are also others.
We offer the possibility to "style" the edges with so-called selectors like
"label":"(%).property.simScore" that would use the simScore as a label. On top of that you could do thing like
"width":"evaluate((%).properties.simScore < 0.500 ? 3 : 10)"
or
"fillColor":"evaluate((%).properties.simScore < 0.500 ? grey : red)"
to distinguish visually high simScores.
Full disclosure: I work for Graphileon.

Get access to street data in iOS

I have a graph G with n nodes. The graph is embedded in 2D space (so there are well-defined angles and distances between each pair of nodes). Some nodes might be connected with edges to other nodes. Given a location L, this graph needs to be laid out on top of a map close to L, such that each node becomes a marker on the map, and such that there is a walkable path between each pair of connected nodes. Since this will not be possible most of the time, I will allow the graph to be scaled/rotated and I will allow the distances and angles between nodes to be flexible within a certain range.
In order to write this specific algorithm, I would need some specific information about the streets near L. Does anybody know how to get street data as a graph structure (so that I can extract walkable paths)? I know the Google Maps API allows you to get directions between two points, but I'm sure I cannot just keep requesting directions without incurring costs.
Edit: I've been reading a bit about the OpenStreetMap API. It looks like this could be interesting. Maybe people can comment on this as well.
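For illustration, one way to pull a walkable street graph around a location from OpenStreetMap data, using the third-party osmnx package (not iOS-native, but it shows the kind of graph structure OSM can give you; the coordinates and radius are illustrative):

import osmnx as ox  # pip install osmnx; wraps the OSM Overpass API

# Walkable street network within ~800 m of a location L.
L = (52.5200, 13.4050)
G = ox.graph_from_point(L, dist=800, network_type="walk")

# G is a networkx graph: nodes are junctions carrying x/y coordinates,
# edges are street segments carrying a 'length' attribute in meters.
for u, v, data in list(G.edges(data=True))[:5]:
    print(u, v, data.get("length"))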
You can also check out these sites for OSM and open GIS data:
- OpenStreetMap Extracts
- OpenStreetMap derived data
- GIS Datasets
and, in addition to these:
- Planet.osm/diffs
I hope it helps you...
The TIGER datasets offer free shapefiles/line files for the USA, with lots of options including roads: https://www.census.gov/geo/maps-data/data/tiger-line.html

Logic for selecting best nearby venues for display on a map

I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest-rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea of which ones yield the best results on some outcome metric. What that metric should be is up to you: it could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere about this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1; a short sketch implementing this follows the list):
score = 0.25 * (1 - (distance - min_distance) / (max_distance - min_distance)) + 0.75 * (quality_score / max_quality_score)
Sort them by score and take the top n
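For what it's worth, a minimal Python sketch of steps 3 and 4 (the 'distance' and 'quality' field names, and the 0.25/0.75 weights, are illustrative):

def score_venues(venues, w_dist=0.25, w_quality=0.75):
    # venues: list of dicts with 'distance' (miles) and 'quality' (0-100),
    # already filtered to the wide radius from step 2.
    d_min = min(v["distance"] for v in venues)
    d_max = max(v["distance"] for v in venues)
    q_max = max(v["quality"] for v in venues) or 1.0
    span = (d_max - d_min) or 1.0  # guard against division by zero
    for v in venues:
        nearness = 1 - (v["distance"] - d_min) / span  # 1 = closest
        goodness = v["quality"] / q_max                # 1 = best rated
        v["score"] = w_dist * nearness + w_quality * goodness
    return sorted(venues, key=lambda v: v["score"], reverse=True)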
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at, say, 7.
Discard all venues with scores in the lowest quartile of reviewers' scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First, I suggest that you use a Bayesian average to maintain an overall rating for all the venues; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues ordered by distance and then by rating (two ORDER BY statements in your query).
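For reference, a sketch of the standard Bayesian-average formula behind that approach (the parameter names are illustrative):

def bayesian_average(rating_sum, rating_count, global_mean, prior_count=10):
    # Pull a venue's mean rating toward the global mean; venues with few
    # reviews move the most. prior_count is a tunable confidence weight.
    return (prior_count * global_mean + rating_sum) / (prior_count + rating_count)

# A venue with a single 100 rating, against a global mean of 60:
print(bayesian_average(100, 1, 60))  # -> ~63.6, not a runaway 100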

Most efficient way to pull items from database based on their x,y location in a grid

I'm working on an iOS game where players' villages are displayed on a large kingdom map. Each village has an x,y location on that map, and each village is stored as an object in a database on a server (Parse.com).
What I want to be able to do is pull down all the villages around the current player's village. Usually this would be straightforward: you would just use a shortest-distance algorithm. But to use that here, I would need to download all of the villages in the database, run the algorithm on each one, and then sort them by distance from the player, which is not exactly a quick or efficient way of doing it. Does anyone know of a more refined/efficient way? What would be great is to pull down the villages around the current player in the query to the database itself, killing two birds with one stone, but I can't see any way to do that. I suspect the answer lies in storing more information about the location of the village in the database, so a query could pull the closest ones down without having to run an algorithm afterwards.
Any ideas?
I'll leave this question up as I'm still interested in how to do it with basic math; the Manhattan distance approach should be OK. But for anyone using Parse.com, it might be possible to use geoPoints. It's a nutty idea, but I'm going to try it.
You can easily get all villages within D units of Manhattan distance of point (X,Y) by querying for all villages matching the constraints X - D < x and x < X + D and Y - D < y and y < Y + D. (Strictly, this bounding box also returns a few villages just outside the Manhattan-distance diamond, but it never misses one.)
You can then do further filtering based on Euclidean distance on the client if you want to.
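A minimal sketch of that two-step scheme (in a real backend the box test becomes the database query itself; the 'x'/'y' keys are illustrative):

import math

def villages_near(villages, X, Y, D, limit=None):
    # Cheap bounding-box prefilter using the four inequalities above...
    box = [v for v in villages
           if X - D < v["x"] < X + D and Y - D < v["y"] < Y + D]
    # ...then the exact Euclidean sort on the (much smaller) survivor set.
    box.sort(key=lambda v: math.hypot(v["x"] - X, v["y"] - Y))
    return box[:limit]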

Cheapest cost traversal on Complete graph

I was wondering if there is an algorithm which:
given a complete (fully connected) graph of n nodes (with different weights)... will give me the cheapest cycle to go from node A (a start node) to all other nodes and return to node A? Is there a way to alter an algorithm like Prim's to accomplish this?
Thanks for your help.
EDIT: I forgot to mention I'm dealing with an undirected graph, so the in-degree = out-degree for each vertex.
Can you not modify Dijkstra to find the shortest path to all other nodes, and then, once you have found it, the shortest path back to A?
You can try the iterative-deepening A* (IDA*) search algorithm, which is optimal provided the heuristic is admissible. You need to define a heuristic, though, and this will depend on the problem you are trying to solve.
What you want is the cheapest Hamiltonian cycle: a tour that starts at A, visits every node, and returns to A. The problem of finding it is the Traveling Salesman Problem. No fast (polynomial-time) algorithm for it is known, and none is expected unless P = NP. (The in-degree/out-degree condition you mention applies to Eulerian circuits, which visit every edge; that is a different problem.)
Edit:
On second thought: the Traveling Salesman Problem asks for a tour that visits every node exactly once, whereas you're asking for a tour that visits every node at least once. That relaxation doesn't make it easier, though: it is equivalent to TSP on the metric closure of the graph (replace each edge weight with the shortest-path distance between its endpoints), so it remains NP-hard.
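If exact answers for small n are enough, a minimal sketch of the Held-Karp dynamic program (exponential but exact; node A is node 0 here, and dist is an n x n weight matrix):

from itertools import combinations

def held_karp(dist):
    # Cheapest cycle visiting every node exactly once, in O(n^2 * 2^n) time:
    # best[(S, j)] = cheapest path from node 0 through the set S, ending at j.
    n = len(dist)
    best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = frozenset(subset)
            for j in S:
                best[(S, j)] = min(best[(S - {j}, k)] + dist[k][j]
                                   for k in S if k != j)
    full = frozenset(range(1, n))
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))

For larger n you would fall back to heuristics or approximations (e.g., the MST-based 2-approximation for metric weights), and since your tour may revisit nodes, run them on the metric closure (shortest-path distances), as noted above.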
