Queries on 200 GB graph - neo4j

I need a scalable solution to create a geohash-connected graph.
I found Cypher for Apache Spark, a project that lets you run Cypher on Spark DataFrames to create a graph. However, it can only create immutable graphs by mapping the different DataFrames, so I couldn't get the graph that I need.
I can get the graph I need by running some other Cypher queries in the Neo4j Browser, but my stored graph is about 200 GB.
So I'm asking: is it reasonable and fast to run queries on 200 GB of graph data using the Neo4j Browser and APOC functions?

If you're asking if Neo4j can handle databases of this size, then the answer is yes. But you'll see different results depending on how your data is modeled and the kind of queries you want to run.
Performance correlates not with the total size of the graph, but with the portion of the graph touched and traversed by your queries. Graph-wide analytical queries must touch the entire graph, while tightly defined queries that touch a small, local part of the graph will be quite quick.
Anything you can do in your queries to constrain the portion of the graph you have to traverse or filter will help your query speed, so good modeling and use of indexes and constraints are key.
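For example, here is a minimal sketch of the difference between an anchored query and a graph-wide one; the :Geohash label, code property, and :NEIGHBOR relationship type are hypothetical names, not from the question:

    // Schema index (Neo4j 3.x syntax) so starting nodes are found without a scan
    CREATE INDEX ON :Geohash(code)

    // Anchored query: index lookup plus a one-hop traversal; touches only
    // a tiny neighborhood of the 200 GB graph, so it stays fast
    MATCH (g:Geohash {code: 'u4pruyd'})-[:NEIGHBOR]->(n:Geohash)
    RETURN n.code

    // Graph-wide query: must touch every node, regardless of database size
    MATCH (n:Geohash)
    RETURN count(n)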

Related

Neo4j persistent named graph

I'm coming from the RDF world, where named graphs are persistent and can be used like a collection of triples. Moreover, you can query against one single named graph or over the whole triplestore. I'm looking for the same features (or a workaround to achieve them) in Neo4j.
Neo4j's Graph Catalog is well documented. As I understand it, named graphs in Neo4j are stored entirely in memory (so they are lost after a restart) and hold a subset of nodes you define for analytic purposes.
Is there a way to create persistent named graphs in Neo4j?
That is, a graph stored on disk with the data, permitting fast access to a subset of nodes (where nodes can be added to or removed from the named graph).
You could give every node in the same "named graph" the same label. Since a node can have multiple labels, this does not prevent you from using other labels for other purposes as well.
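A minimal sketch of that idea; the :MyGraph label and the dataset and id properties are hypothetical:

    // Treat a label as a persistent "named graph". Labels are stored on disk
    // with the nodes, so membership survives restarts.
    MATCH (n) WHERE n.dataset = 'experiment-1'
    SET n:MyGraph

    // Query only "inside" the named graph by filtering on the label:
    MATCH (a:MyGraph)-[r]->(b:MyGraph)
    RETURN a, r, b

    // Remove a node from the named graph without deleting the node:
    MATCH (n:MyGraph {id: 42})
    REMOVE n:MyGraph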

database solution for multiple isolated graphs

I have an interesting problem that I don't know how to solve.
I have collected a large dataset of 80 million graphs (they are CFGs, as in Control Flow Graphs, produced from programs I have analysed on GitHub) which I need to be able to search efficiently.
I looked into existing solutions like Neo4j, but they are all designed to store a single global graph.
In my case it is the opposite: all graphs are independent, like rows in a table, but I need to search through all of them efficiently.
For example, I want to find all CFGs that have a particular IF condition or a WHILE loop with a particular condition.
What's the best database for this use case?
I don't think that there's a reason not to simply store all those graphs in a single graph, whether it's Neo4j or a different graph database. It's not a problem to have many disparate graphs in a single graph where the disparate graphs are disconnected from one another.
As for searching them efficiently, you would either (1) identify properties in your CFGs that you want to search on and convert them to some indexed value of the graph or (2) introduce some graph structure (additional vertices/edges) between the CFGs that will allow you to do the searches you want via graph traversal.
Depending on what you need to search on, approach 1 may not be flexible enough for you, especially if what you intend to search on is not completely known at the time of loading the data. Also, it is important to note that with approach 2 you do not really lose the fact that you have 80 million distinct graphs just because you provided some connection between them. Those physical connections don't change that basic logical fact. You just need to consider those additional connections when you write traversals that you expect to occur only within a single CFG.
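As a minimal Cypher sketch of approach 1 (the :Stmt label and the kind, condition, and cfgId properties are all hypothetical names):

    // Tag every node with the id of the CFG it belongs to, and index
    // the properties you want to search on (Neo4j 3.x index syntax)
    CREATE INDEX ON :Stmt(kind)
    CREATE INDEX ON :Stmt(cfgId)

    // Find all CFGs containing a WHILE loop with a particular condition:
    MATCH (s:Stmt {kind: 'WHILE', condition: 'i < n'})
    RETURN DISTINCT s.cfgId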
I'm not sure what Neo4j supports in this area, but with Apache TinkerPop (an open source graph processing framework that lets you write vendor-agnostic code over different graph databases, including Neo4j), you might consider doing some form of graph partitioning to help with approach 2. Or you might subgraph() the larger graph to only contain the CFG and then operate with that purely in memory when querying. Both of these approaches will help you confine your query to just the individual CFG you want to traverse.
Ultimately, however, I see this issue as a modelling problem. You will just need to make some choices on how to best establish the schema for your use case and virtually any graph database should be able to support that.

What is the time complexity of search query in Graph database?

What is the time complexity of a search query in a graph database (especially Neo4j)?
I have relational data. I'm unsure whether to use a relational database or a graph database to store it, and I want to base the decision on the performance and time complexity of the queries in each database. But I'm unable to find the performance and time complexity of queries for graph databases.
Can anyone help me out?
The answer isn't so simple, because the time complexity typically depends upon what you're doing in the query (and on the plan produced by the query planner); there isn't a one-size-fits-all time complexity for all queries.
I can speak some for Neo4j (disclaimer: I am a Neo4j employee).
I won't say much on Lucene index lookups in Neo4j, as these are typically performed only to find starting nodes by index, and they represent a fraction of the execution time of a query. Relationship traversals tend to be where the real differences show themselves.
Once starting nodes are found, Neo4j walks the graph through relationship traversal, which for Neo4j is basically pointer chasing through memory; this tends to be where native graph DBs outperform relational DBs, since the cost of chasing a pointer is constant per traversal.
For relational DBs (including graph layers built on top of relational DBs), relationship traversal is usually accomplished by table join algorithms, which have their own time complexity, typically O(log n) when the join is backed by a B-tree index. This can be quite efficient, but we are in the age of big data and data lakes, so data keeps growing, and the cost of a join grows with the size of the tables being joined. That is fine for a small number of joins, but some queries require many joins (sometimes with no upper bound on when to stop joining), and we may want to traverse through many kinds of tables/nodes (sometimes without caring what kind they are), so we may need the ability to join to or through any table arbitrarily, which isn't easily expressed or performed in a relational DB.
This makes native graph databases like Neo4j well-positioned for handling queries over connected data, especially with a non-trivial or growing number of relationship traversals (or if the traversals are unbounded, such as for reachability queries, shortest-path, and others). The cost for queries is associated with the part of the graph touched or walked by the query, not the total number of nodes in the database, so it helps when you can adequately constrain the query to touch the smallest possible subgraph in the DB.
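As a concrete illustration (the :Person label, name property, and :KNOWS relationship are hypothetical), a variable-length Cypher traversal expresses what would otherwise be one self-join per hop in SQL:

    // Each hop is a constant-cost pointer dereference in a native graph store;
    // the relational equivalent needs one join per hop, up to five here
    MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..5]->(b:Person)
    RETURN DISTINCT b.name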
As far as your question of whether to use a relational or graph database, that depends upon the connectedness of your data and the queries you plan to run.
If you do decide upon a graph database, then you have choices here as well, and a different set of criteria, such as native vs non-native implementation (Neo4j is a native graph database and takes advantage of index-free adjacency for relationship traversals), whether you need ACID (Neo4j is an ACID database), and whether a rich and expressive query language is desired (Cypher is Neo4j's query language; feel free to learn it and compare against others).
For more in-depth info, here's a DZone article on Why Graph Databases Outperform RDBMS on Connected Data, and an article on Explaining Graph Databases to a Developer by Neo4j's Chief Scientist Jim Webber.
Actually, the most probable scenario is to use both Neo4j and some other DBMS (relational, or NoSQL like MongoDB), because it is often impractical to store the whole dataset in Neo4j.
Speed-wise, traversing nodes in a conventional DBMS is 10-100+ times slower than in Neo4j. Neo4j also has built-in shortestPath (and many other) algorithms.
There are also hybrid solutions, like ArangoDB, which has a graph engine plus a document-based engine. But under the hood these are two separate stores, so it is almost as inconvenient as Neo4j plus a separate DBMS.
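For illustration, shortestPath in Cypher looks like this (the :Person/:KNOWS schema is hypothetical):

    // Unbounded shortest-path search between two anchored nodes,
    // something plain SQL can only approximate with recursive joins
    MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
          p = shortestPath((a)-[:KNOWS*]-(b))
    RETURN length(p)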
It actually depends upon the size and complexity of your data.
In graph databases like Neo4j, the time complexity depends on the kind of query and on the planners (executors) behind it. Graph databases, particularly Neo4j, handle join-like traversals much more easily, which gives a clearer view of the data.
For more information please visit this reference blog by Neo4j.
And you can also refer to this question as it is similar to yours.
Hope this helps!

"Resultset too large (over 1000 rows)" in neo4j browser

I'm using Neo4j 2.1.2 Community Edition. I have loaded a CSV file with 2500 rows and created nodes and relationships from the columns. When I run the cypher query below
match (n) return count(*);
I get a node count of 17275. So when I match the nodes with match (n) return n and try to see the corresponding graph in the Neo4j Browser, it says
Resultset too large (over 1000 rows)
I know it's because more than 1000 nodes were requested. So if I want to see the complete graph in the Neo4j Browser, how can I do it?
With the same query in the Neo4j web admin I was able to get the data in tabular format, but I wanted to see the data as a graph.
Also, I'm not able to find neo4j-shell in my Neo4j installation's bin directory. Why is that?
Thanks
Update 1
The Neo4j web UI is built on top of D3.js using SVG: given SVG performance in a browser, when you have more than 500 nodes in a network the user experience starts to degrade quite quickly.
Handling more than 1000 nodes adds to the technical challenge: in fact, with so many nodes, what happens most of the time is the "hairball" effect.
This blog post might be useful (disclaimer: I am a developer for KeyLines); it is about visualizing big networks, with some design hints.
As you can imagine, visualizing more than 1000 nodes is not that easy, which is why companies such as Cambridge Intelligence (KeyLines), Tom Sawyer (Perspectives) or Linkurious came up with specific products for it.
You can of course build the visualization yourself with open source libraries, but keep in mind that it can take a very long time.
If your Neo4j project is not commercial, I suggest having a look at Gephi to visualize it: it is a desktop application and it has a Neo4j adapter plugin. It can easily handle huge datasets, but of course it lacks the portability of a webapp.
Of course, if you only need storage for your graph/data, then no visualization is required.
Original Answer
I think you might have to implement a custom visualization to see such a graph in the browser, using one of the options on this page: http://www.neo4j.org/develop/visualize .
Alternatively, have a look at this more extensive list: Big data visualization using "search, show context, and expand on demand" concept.
Or take a different visualization approach with one of the following: Data Visualization libraries.
Look at the settings in the Neo4j Browser; you can change the graph visualization limits to what you like. But the browser can get much slower if you want to see the complete graph.
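If you mainly need to sanity-check the data rather than render everything, a simple workaround is to ask for a bounded subset that the browser can draw comfortably:

    // Stay under the browser's 1000-row display limit
    MATCH (n) RETURN n LIMIT 500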

BigData Vs Neo4J

I've been looking for a triple store for my project. In this project I want to store my data according to certain ontologies (OWL).
From my research I ended up with two technologies, Neo4j and BigData, that seem to fit well in this case.
I want to know if either of these two is more appropriate to use with RDF, RDFS, OWL and SPARQL queries.
Neo4j can be used to store data in entity-relationship-entity form. In the case of big data, you should not upload your whole dataset into Neo4j, because it will become very heavy and processing will be very slow. You should use a complementary DB to store the actual data, and store IDs and some parameters in Neo4j so you can perform graph traversals and graph analytics. Neo4j is built mainly for graph analytics; that is its power. Alternatively, you can use a graph engine such as GraphX (Spark).
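A minimal sketch of that pattern; the :Entity label, extId property, and :PLACED relationship are hypothetical, and the full records would live in the complementary store, keyed by extId:

    // Keep only external IDs and traversal-relevant params in Neo4j;
    // look up the heavy payloads in the complementary DB by extId
    CREATE (a:Entity {extId: 'user-42', kind: 'user'})
    CREATE (b:Entity {extId: 'order-9001', kind: 'order'})
    CREATE (a)-[:PLACED]->(b)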
You might want to try the SPARQL plugin for Neo4j; see here for an HTTP-based test, and this Berlin Dataset Test for embedded usage.
Neo4j is a specific technology, while "big data" is more of a generic term. I think what you're really asking about is OLAP vs. OLTP. As data gets bigger, there are differences between use cases for RDF-style graph databases, which are often used for OLAP (On-Line Analytical Processing) style analytics. In short, OLAP is designed for analytics that look across a big data set, while OLTP (On-Line Transaction Processing) is more aimed at INSERTs/DELETEs (on potentially big data).
OLAP-based traversals tend to process the entire graph, while OLTP-based traversals tend to process smaller data sets by starting with one or a handful of vertices and traversing from there.
For example, let's say you wanted to calculate the average age of the friends of one particular user. That's a great use case for OLTP, since the query data set is small. However, if you wanted to calculate the average age of everyone in the database, OLAP is the preferred technology.
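In Cypher terms (the :User label, id and age properties, and :FRIEND relationship are hypothetical), the two workloads look like this:

    // OLTP-style: anchored on one user, touches only a small neighborhood
    MATCH (u:User {id: 42})-[:FRIEND]-(f:User)
    RETURN avg(f.age)

    // OLAP-style: must scan every user in the database
    MATCH (u:User)
    RETURN avg(u.age)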
OLAP is optimal for deep analysis of a lot of data, while OLTP is better suited for fast-running queries and a lot of INSERTs. If you're trying to achieve an SLA where the analytics must complete within a certain timeframe, consider the type of analytics and which one is better suited. Or maybe you need both.
