I'm working with a large data set that really warrants a graph db. My goal is to visualize identify trends in the data set to make decisions.
I'm currently using neo4j and i really like the tool, however the nodes returned are capped at 300. This number is only a fraction of my data, and doesnt really allow me to gain the insight i've been looking for, even with queries to filter out portions. Additionally, I'd really like to add node weights and color per conditions, which isn't possible using just neo4j.
Has anybody found a solution to this problem. I'd imagine there may be some client side libraries designed for these sorts of problems. Alternatively, I wouldn't be opposed to switching to some other graph db better suited to solve these problems.
I would suggest using Neo4j Bloom. This will provide you better visualization of your Neo4j data.
Related
Most of the reasons for using a graph database seem to be that relational databases are slow when making graph like queries.
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins. I might even be using a No-SQL database which is usually pretty fast at these kinds of flat queries.
If this is the case, is there a use case for Graph databases anymore when combined with GraphQL? Neo4j seems to be promoting GraphQL. I'd like to understand the advantages if any.
GraphQL doesn't negate the need for graph databases at all, the connection is very powerful and makes GraphQL more performant and powerful.
You mentioned:
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins.
This is a curious point, because if you do a lot of SELECT * FROM X and the data is connected by a graph loader, you're still doing the joins, you're just doing them in software outside of the database, at another layer, by another means. If even that software layer isn't joining anything, then what you gain by not doing joins in the database you're losing by executing many queries against the database, plus the overhead of the additional layer. Look into the performance profile of sequencing a series of those individual "easy selects". By not doing those joins, you may have lost 30 years value of computer science research...rather than letting the RDMBS optimize the query execution path, the software layer above it is forcing a particular path by choosing which selects to execute in which order, at which time.
It stands to reason that if you don't have to go through any layer of formalism transformation (relational -> graph) you're going to be in a better position. Because that formalism translation is a cost you must pay every time, every query, no exceptions. This is sort of equivalent to the obvious observation that XML databases are going to be better at executing XPath expressions than relational databases that have some XPath abstraction on top. The computer science of this is straightforward; purpose-built data structures for the task typically outperform generic data structures adapted to a new task.
I recommend Jim Webber's article on the motivations for a native graph database if you want to go deeper on why the storage format and query processing approach matters.
What if it's not a native graph database? If you have a graph abstraction on top of an RDBMS, and then you use GraphQL to do graph queries against that, then you've shifted where and how the graph traversal happens, but you still can't get around the fact that the underlying data structure (tables) isn't optimized for that, and you're incurring extra overhead in translation.
So for all of these reasons, a native graph database + GraphQL is going to be the most performant option, and as a result I'd conclude that GraphQL doesn't make graph databases unnecessary, it's the opposite, it shows where they shine.
They're like chocolate and peanut butter. Both great, but really fantastic together. :)
Yes GraphQL allows you to make some kind of graph queries, you can start from one entity, and then explore its neighborhood, and so on.
But, if you need performances in graph queries, you need to have a native graph database.
With GraphQL you give a lot of power to the end-user. He can make a deep GraphQL query.
If you have an SQL database, you will have two choices:
to compute a big SQL query with a lot of joins (really bad idea)
make a lot of SQL queries to retrieve the neighborhood of the neighborhood, ...
If you have a native graph database, it will be just one query with good performance! It's a graph traversal, and native graph database are made for this.
Moreover, if you use GraphQL, you consider your data model as a graph. So to store it as graph seems obvious and gives you less headache :)
I recommend you to read this post: The Motivation for Native Graph Databases
Answer for Graph Loader
With Graph loader you will do a lot of small queries (it's the second choice on my above answer) but wait no, ... there is a cache record.
Graph loaders just do batch and cache.
For comparaison:
you need to add another library and implement the logic (more code)
you need to manage the cache. There is a lot of documentation about this topic. (more memory and complexity)
due to SELECT * in loaders, you will always get more data than needed Example: I only want the id and name of a user not his email, birthday, ... (less performant)
...
The answer from FrobberOfBits is very good. There are many reasons to add (or avoid) using GraphQL, whether or not a graph database is involved. I wanted to add a small consideration against putting GraphQL in front of a graph. Of course, this is just one of what ought to be many other considerations involved with making a decision.
If the starting point is a relational database, then GraphQL (in front of that datbase) can provide a lot of flexibility to the caller – great for apps, clients, etc. to interact with data. But in order to do that, GraphQL needs to be aligned closely with the database behind it, and specifically the database schema. The database schema is sort of "projected out" to apps, clients, etc. in GraphQL.
However, if the starting point is a native graph database (Neo4j, etc.) there's a world of schema flexibility available to you because it's a graph. No more database migrations, schema updates, etc. If you have new things to model in the data, just go ahead and do it. This is a really, really powerful aspect of graphs. If you were to put GraphQL in front of a graph database, you also introduce the schema concept – GraphQL needs to be shown what is / isn't allowed in the data. While your graph database would allow you to continue evolving your data model as product needs change and evolve, your GraphQL interactions would need to be updated along the way to "know" about what new things are possible. So there's a cost of less flexibility, and something else to maintain over time.
It might be great to use a graph + GraphQL, or it might be great to just use a graph by itself. Of course, like all things, this is a question of trade-offs.
I have an interesting problem that I don't know how to solve.
I have collected a large dataset of 80 million graphs (they are CFG as in Control Flow Graph produced by programs I have analysed from Github) which I need to be able to search efficiently.
I looked into existing solutions like Neo4j but they are all designed to store a global single graph.
In my case this is the opposite all graphs are independent -like rows in a table - but I need to search through all of them efficiently.
For example I want to find all CFGs that has a particular IF condition or a WHILE loop with a particular condition.
What's the best database for this use case?
I don't think that there's a reason not to simply store all those graphs in a single graph, whether it's Neo4j or a different graph database. It's not a problem to have many disparate graphs in a single graph where the disparate graphs are disconnected from one another.
As for searching them efficiently, you would either (1) identify properties in your CFGs that you want to search on and convert them to some indexed value of the graph or (2) introduce some graph structure (additional vertices/edges) between the CFGs that will allow you to do the searches you want via graph traversal.
Depending on what you need to search on approach 1 may not be flexible enough for you especially, if what you intend to search on is not completely known at the time of loading the data. Also, it is important to note that with approach 2 you do not really lose the fact that you have 80 million distinct graphs just because you provided some connection between them. Those physical connections don't change that basic logical fact. You just need to consider those additional connections when you write traversals that you expect to occur only within a single CFG.
I'm not sure what Neo4j supports in this area, but with Apache TinkerPop (an open source graph processing framework that lets you write vendor agnostic code over different graph databases, including Neo4j), you might consider doing some form of graph partitioning to help with approach 2. Or you might subgraph() the larger graph to only contain the CFG and then operate with that purely in memory when querying. Both of these approaches will help you to blind your query to just the individual CFG you want to traverse.
Ultimately, however, I see this issue as a modelling problem. You will just need to make some choices on how to best establish the schema for your use case and virtually any graph database should be able to support that.
I am wondering how to use Neo4j to find the MST? Most examplesI found was using Hadoop to find it.
I don't think that this is possible in Cypher, given how current algorithms determine an MST (if I'm wrong on this, I'd love to know).
Instead, I'd recommend implementing one of the algorithms used for determining an MST, e.g. Prim's Algorithm. It's quite straight forward and, with the help of heaps and adjacency lists, is relatively performant.
A quick search for the algorithm will turn up many links.
I'm sure leveraging Neo4j's Core API or Traversal API might even help things integrate even more closely, possibly without needing to represent the entire graph as an adjacency list first. And of course you can do that with Neo4j in Embedded Mode or turn it into a Server Plugin in case you're running Neo4j in Server Mode.
Let us know what you come up with!
After taking part in a very interesting tutorial with a focus on Cypher, I was pleasantly surprised by the declarativeness of the Cypher query language. It's a very natural way of retrieving data from Neo4J in my opinion.
Before that, I had only used the native API. And while that is less declarative, you sort of get used to it after a while. The complex constructions are all very similar and vary only in the details for my specific project.
Still, Cypher looked more natural to me and so I am contemplating on building the second version of my application with mainly Cypher queries to interact with my database. But I encountered an issue.
There are numerous ways to convert my application into Cypher and after having tried several possible queries, all with the desired result, it appears even the fastest query is still about 20 times slower than the native API.
Now, I don't mind giving up some performance for declarativeness, but times 20 is a little bit to much for me in an application that's already struggling with performance. Is there a workaround for this issue, or do I just have to stick with the native API?
Your conclusion sounds very familiar to me. I've also had performance issues when I used Neo4j and Spring Data Neo4j together. In the parts where performance really mattered, I switched to the core Traversal API which right now is significantly faster than an average Cypher query. This has a lot to do with the fact that there's no processing of a query and the fact that you control every aspect of the traversal. Cypher can only guess what the most optimal strategy is. I'm convinced that it will gain speed in the (near) future, but if performance really matters, I'd say stick with the core API.
Also, If you would be using java and spring data neo4j, consider using the advanced mapping mode (AspectJ) which is a lot faster than the simple mapping mode.
I´ve been looking for a triple store for my project. In this project i want to store my data according to certain ontologies (OWL).
From my research i ended up with two tecnologies Neo4J and BigData that seems to fit well in this case.
I want to know if any of this two is more apropriated to use with RDF, RDFS, OWL and SPARQL Queries.
Neo4j can be used to store as entity-relationship-entity form. In case of Bigdata, you should not be upload your whole data into Neo4j because it will become very heavy and process will be very much slow. You should use complimentary db for storing actual data and store ids and some params into Neo4j for Graph traversal to perform sort of Graph Analytics. Neo4j is mainly build up for Graph Analytics that its power or you have to use Graph engine e.g GraphX (Spark).
Thanks,
You might want to try out the SparQL plugin for Neo4j, see here for a HTTP based test, and this Berlin Dataset Test for embedded usage.
Neo4J is a specific technology, while big data is more a generic term. I think what you're asking about OLAP and OLTP. As data gets bigger, there are differences between use cases for RDF style graph databases, which are often used for OLAP (On-line Analytical Processing) style analytics. In short, OLAP is designed for analytics that look across an big data set, while OLTP is more aimed at INSERT/DELETEs (on potentially big data).
OLAP-based traversals tend to process the entire graph, while OLTP based traversals tend to process smaller data sets by starting with one or a handful of vertices and traversing from there.
For example, let’s say you wanted to calculate the average age of friends of one particular user. Great use case for OLTP, since the query data set is small. However, if you wanted to calculate the average age of everyone on the database, OLAP is the preferred technology.
OLAP is optimal for deep analysis of a lot of data, while OLTP is better suited for fast running queries and a lot of INSERTs. If you’re trying to achieve a SLA where the analytics must complete within a certain timeframe, consider the type of analytics and which one is better suited. Or maybe you need both.