How to use Neo4J for temporary graph calculations? - neo4j

I'm completely new to Neo4J and I am struggling with a design/architecture question.
Setup
I have a given Graph with different nodes. That could be the a company graph with customer, products, projects, sales and so on (like in the movie example https://neo4j.com/developer/get-started/). This graph can change from time to time.
In my use case I would like to take this graph, adapt it and test some scenarios. E. g. I would add a new product, define a new sales person with responsibilities or increase the price of a product. To the extended graph I will "ask questions" or in other words, I would use graph algorithms to extract information. The changes I made, shouldn't affect the original graph.
Requirements
I do not wanna write my changes to the original graph, because every time the original graph should be the base for the analysis. Also for the reason that changing and analysing the graph can happen concurrently from different users.
I still wanna use the power of Cypher to make the analysis, so having the graph only in the memory wouldn't do it.
Problem
On the one hand I do not wanna change the original graph, on the other I wanna add and change information temporarily for a specific user. Using a relational DB I would just point with an ID to the "static" part of the data or I would do the calculation in Code instead of SQL.
Questions
Any best practices for that?
Can I use Cypher directly in code (none-persistent, directly on the data in the memoty)?
Should I make a copy of the Graph, whenever I use it (not really,
right?)?
Is there a concept to link user specific data to a static graph?
I am happy about all ideas, concepts and tricks! It's more about graph databases in general....Neo4J was so far my first choice.
Cheers
Chris

What about using feature flags in your graph by using different relationship types ?
For example, let's say you have a User that likes 10 movies in your original graph.
(user)-[:LIKES]->(movies)
Then for your experiments, you can have
(user)-[:LIKES_EXPERIMENT]->(othermovies)
This offers you the possibility to traverse the graph in the original way without loosing performance by just enforcing the relationship types. On the other hand it also offers you the possibility to use only the experiments or combining original data with experiments by specifying both relationship types in your traversals.
The same goes for properties, you could prefix properties with experiment_ for eg. And finally you could also play with different labels. There are tons of possibilities before having to use different graph data stores.
Another possibility is to use some kind of versioning like described here :
http://iansrobinson.com/2014/05/13/time-based-versioned-graphs/
But without the time factor.
There is also a nice plugin for it https://github.com/h-omer/neo4j-versioner-core

My suggestion is:
Copy the data folder of the original database to a new location: sudo cp /path/to/original/data/folder ~/neo4j
Run a Docker container mapping the copy of data folder as the container data folder.
Something like this:
docker run \
--publish=7474:7474 --publish=7687:7687 \
--volume=$HOME/neo4j/data:/data \
neo4j
You can specify another ports if :7474 and :7686 are being used.
Work over this copy.
You can transform these instructions in a .sh file to automate the process.

Related

How to update parsed data in rdflib ConjunctiveGraphs?

I am merging RDF data from various remote sources using ConsecutiveGraph.parse().
Now I would like to have a way to update the data of individual sources, without affecting the other ones and the relations between them.
The triples from the various sources might overlap, so it has to be ensured that only the triples coming from a specific source get deleted before the update.
Each graph in a ConjunctiveGraph has its own ID - whether you explicitly set it or not. Just update the particular graph you want and export them individually.
If you want to do something more complex than this, such as keeping track of where new data you’ve created perhaps in the default graph (the unnamed graph you get automatically), you’re going to need to use some other method of tracking triples. Look up “reification” for how to annotate triples with more information.

Neo4j to grafana

I want to present release data complexity which is associated with each node like at epic, userstory etc in grafana in form of charts but grafana do not support neo4j database.Is there any way Directly or indirectly to present neo4j database in grafana?
I'm having the same issues and found this question among others. From my research I cannot agree with this answer completely, so I felt I should point some things out, here.
Just to clarify: a graph database may seem structurally different from a relational or time series database, but it is possible to build Cypher queries that basically return graph data as tables with proper columns as it would be with any other supported data source. Therefore this sentence of the above mentioned answer:
So what you want to do is just not possible.
is not absolutely true, I'd say.
The actual problem is, there is no datasource plugin for Neo4j available at the moment. You would need to implement one on your own, which will be a lot of work (as far as I can see), but I suspect it to be possible. For me at least, this will be too much work to do, so I won't use any approach to read data directly from Neo4j into Grafana.
As a (possibly dirty) workaround (in my case), a service will regularly copy relevant portions of the Neo4j graph into a relational database (or a time series database, if the data model is sufficiently simple for that), which Grafana is aware of (see datasource plugins), so I can query it from there. This is basically the replication idea also given in the above mentioned answer. In this case you obviously end up with at least 2 different database systems and an additional service, which is not so insanely great, but at the moment it seems to be the quickest way to resolve the problem with the missing datasource plugin. Maybe this is applicable in your case, too.
Using neo4j's graphite metrics you can actually configure data to be sent to grafana, and from there build whichever dashboards you like.
Up until recently, graphite/grafana wasn't supported, but it is now (in the recent 3.4 series releases), along with prometheus and other options.
Update July 2021
There is a new plugin called Node Graph Panel (currently in beta) that can visualise graph structures in Grafana. A prerequisite for displaying your graph is to make sure that you have an API that exposes two data frames, one for nodes and one for edges, and that you set frame.meta.preferredVisualisationType = 'nodeGraph' on both data frames. See the Data API specification for more information.
So, one option would be to setup an API around your Neo4j instance that returns the nodes and edges according to the specifications above. Note that I haven't tried it myself (yet), but it seems like a viable solution to get Neo4j data into Grafana.
Grafana support those databases, but not Neo4j : Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch
So what you want to do is just not possible.
You can replicate your Neo4j data inside of those database, but the datamodel is really different ... (timeseries vs graph).
If you just want to have some charts, you can use Apache Zeppeline for that.

database solution for multiple isolated graphs

I have an interesting problem that I don't know how to solve.
I have collected a large dataset of 80 million graphs (they are CFG as in Control Flow Graph produced by programs I have analysed from Github) which I need to be able to search efficiently.
I looked into existing solutions like Neo4j but they are all designed to store a global single graph.
In my case this is the opposite all graphs are independent -like rows in a table - but I need to search through all of them efficiently.
For example I want to find all CFGs that has a particular IF condition or a WHILE loop with a particular condition.
What's the best database for this use case?
I don't think that there's a reason not to simply store all those graphs in a single graph, whether it's Neo4j or a different graph database. It's not a problem to have many disparate graphs in a single graph where the disparate graphs are disconnected from one another.
As for searching them efficiently, you would either (1) identify properties in your CFGs that you want to search on and convert them to some indexed value of the graph or (2) introduce some graph structure (additional vertices/edges) between the CFGs that will allow you to do the searches you want via graph traversal.
Depending on what you need to search on approach 1 may not be flexible enough for you especially, if what you intend to search on is not completely known at the time of loading the data. Also, it is important to note that with approach 2 you do not really lose the fact that you have 80 million distinct graphs just because you provided some connection between them. Those physical connections don't change that basic logical fact. You just need to consider those additional connections when you write traversals that you expect to occur only within a single CFG.
I'm not sure what Neo4j supports in this area, but with Apache TinkerPop (an open source graph processing framework that lets you write vendor agnostic code over different graph databases, including Neo4j), you might consider doing some form of graph partitioning to help with approach 2. Or you might subgraph() the larger graph to only contain the CFG and then operate with that purely in memory when querying. Both of these approaches will help you to blind your query to just the individual CFG you want to traverse.
Ultimately, however, I see this issue as a modelling problem. You will just need to make some choices on how to best establish the schema for your use case and virtually any graph database should be able to support that.

Time Based Graph Data Modeling

I have a data modeling question. The data that I have is basically nodes with relations to other nodes. Nodes have properties. Edges are directional and have properties. I am exploring if a Graph DB like Neo4j will be appropriate or not.
The doubt is because: The data that I have is time based. It changes on the basis of time, and I need to keep track of the historical data as well. For example, I should be able to query:
What was the graph like on a particular date?
Who all did a given node depend on at a particular time?
What were the properties of the edge between two given nodes at a particular time?
I searched but couldn't find a satisfactory resource where I could understand how time can be factored into a Graph DB. Do you think my requirement can be inherently met using a Graph DB? Is there an example/resource/article which describes this for Neo4j or any other graph db?
I want to make sure that the database is scalable to about 100K nodes, and millions of edges. I am optimizing for time over space.
Is there an example/resource/article which describes this for Neo4j or
any other graph db?
Here is an excellent article from Ian Robinson blog about time-based versioned graphs.
Basically the article describes a way to represent a time-based versioned graphs adding some extra nodes and timestamped relationships to represent the state of the graph in a given timestamp.
The following image from the referenced article shows:
The price of produc_id : 1 has changed from 1.00 to 2.00. This is a state change.
The product_id : 1 is now sold by shop_id : 2 (and not by shop_id : 1). This is a structural change.
Do you think my requirement can be inherently met using a Graph DB?
Yes, but not in an easy or "natural" way. Versioning a time based model with a database that don't offers this functionality natively can be hard and expensive. From the article:
Neo4j doesn’t provide intrinsic support either at the level of its
labelled property graph model or in its Cypher query language for
versioning. Therefore, to version a graph we need to make our
application graph data model and queries version aware.
and
versioning necessarily creates a lot more data – both more nodes and
more relationships. In addition, queries will tend to be more complex,
and slower, because every MATCH must take account of one or more
versioned elements. Given these overheads, apply versioning with care.
Perhaps not all of your graph needs to be versioned. If that’s the
case, version only those portions of the graph that require it.
EDIT:
A few words from the book Graph Databases (by Ian Robinson, Jim Webber and Emil Eifrem) about versioning in graph databases. This book is available for download at Neo4J page:
Versioning:
A versioned graph enables us to recover the state of the
graph at a particular point in time. Most graph databases don’t
support versioning as a first-class concept. It is possible, however,
to create a versioning scheme inside the graph model. With this scheme
nodes and relationships are timestamped and archived whenever they are
modified The downside of such versioning schemes is that they leak
into any queries written against the graph, adding a layer of
complexity to even the simplest query.
This paragraph links the article indicated in the beginning of this answer.

Efficient way to build an indoor graph of a building to implement path finding?

Lets say, my data set is a shopping mall.
I have to build a graph for it. Whenever asked, I have to generate a path (shortest path) from one shop to another.
Now my question is,
Is it efficient to build a graph of the whole building and generate
the path?
Or build a graph (something like a subgraph) between
only the 2 nodes and all its connectors (edges) when a user needs to
find the path?
I have to implement this for a mobile application where all the data is loaded from a server.
My current code builds the whole graph. But I want to use this as a library for future use.
If it is only for the current building, then it works fine.
But assuming that in the future another type of data set is used which is way too big that the current one, then which one of these methods is more efficient?
These are the only 2 ways I can think of implementing it. If there is any other solution then that would be highly appreciated!
Secondly, I am using Dijkstra's Algorithm for path finding, is that suitable for this kind of a case?
Any help would be highly appreciated,
Thanks.
Is it efficient to build a graph of the whole building and generate the path?
Or build a graph (something like a subgraph) between only the 2 nodes and all its connectors (edges) when a user needs to find the
path?
If the graph is known a priori, the most efficient solution, in regards to query times, will be to generate the whole graph and preprocess it. Then, you will query the contracted graph and have a very fast query time. Look for example at Contraction hierarchies, since it is one of the most widely used techniques. Otherwise, when the graph has to be built in runtime, I think it is what you mean with your second point, you could use A* or bidirectional Dijkstra. In the first one I guess the best heuristic you can come up is the straight line distance, so probably not very helpful.
Secondly, I am using Dijkstra's Algorithm for path finding, is that
suitable for this kind of a case?
Yes it is, but I would always use bidirectional Dijkstra, it's not difficult to implement and, generally, a great improvement in time requirements over unidirectional Djikstra. Some related questions in SO: 1, 2

Resources