I'm using Spring Data Neo4j in my project and I've noticed that saving my node entity classes takes too long (>300ms/node), even though they are pretty simple (they contain only one property, a simple long id). The relationships between nodes are also quite simple (I'm just trying to represent a social network). For everything else I use Cypher queries, and those timings are much faster and acceptable (~3-30ms).
This turns out to be a huge problem, since a basic part of my project is populating the graph and only then running queries against it. Any suggestions as to what the reason(s) could be? The Spring Data Neo4j version I'm using is 2.1.0.RELEASE and I'm using the repositories approach.
Thank you in advance!
It depends on the mapping mode you use. Simple mapping is much slower because it has to merge your object graph back into Neo4j. Advanced mapping is much faster because it is a thin layer on top of Neo4j (read-through and write-through).
You should create the transaction at a higher level anyway, so that it spans a whole business operation rather than a single save.
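To make the transaction point concrete, here is a minimal sketch in Python rather than Java, purely for brevity. `FakeGraphStore` is a hypothetical in-memory stand-in that only counts commits; it is not the Spring Data Neo4j or Neo4j API, and it says nothing about real timings. It only shows how one transaction around the whole populate step replaces one commit per saved node.

```python
from contextlib import contextmanager


class FakeGraphStore:
    """Hypothetical stand-in for a database; it only counts commits."""

    def __init__(self):
        self.nodes = {}
        self.commits = 0
        self._in_tx = False

    @contextmanager
    def transaction(self):
        # Everything saved inside this block is committed once, at the end.
        self._in_tx = True
        try:
            yield self
            self.commits += 1
        finally:
            self._in_tx = False

    def save(self, node_id):
        self.nodes[node_id] = {"id": node_id}
        if not self._in_tx:
            # No surrounding transaction: each save pays for its own commit.
            self.commits += 1


# One implicit transaction per save.
per_save = FakeGraphStore()
for i in range(1000):
    per_save.save(i)

# One transaction spanning the whole business operation.
batched = FakeGraphStore()
with batched.transaction():
    for i in range(1000):
        batched.save(i)
```

In this toy model the first loop commits 1000 times and the second commits once; whether that accounts for the 300ms/node in the question is something only a measurement against the real database can tell.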
Most of the reasons for using a graph database seem to be that relational databases are slow at graph-like queries.
However, if I am using GraphQL with a data loader, all my queries are flattened and combined by the data loader, so I end up making simpler SELECT * FROM X type queries instead of doing any heavy joins. I might even be using a NoSQL database, which is usually pretty fast at these kinds of flat queries.
If this is the case, is there a use case for Graph databases anymore when combined with GraphQL? Neo4j seems to be promoting GraphQL. I'd like to understand the advantages if any.
GraphQL doesn't negate the need for graph databases at all. Pairing the two works very well and makes GraphQL both more performant and more capable.
You mentioned:
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins.
This is a curious point, because if you do a lot of SELECT * FROM X and the data is connected by a data loader, you're still doing the joins; you're just doing them in software outside of the database, at another layer, by another means. If even that software layer isn't joining anything, then what you gain by not doing joins in the database you lose by executing many queries against the database, plus the overhead of the additional layer. Look into the performance profile of sequencing a series of those individual "easy selects". By not doing those joins, you may have thrown away 30 years' worth of computer science research: rather than letting the RDBMS optimize the query execution path, the software layer above it forces a particular path by choosing which selects to execute, in which order, and when.
It stands to reason that if you don't have to go through any layer of formalism transformation (relational -> graph), you're going to be in a better position, because that translation is a cost you must pay on every query, no exceptions. This is sort of equivalent to the obvious observation that XML databases are going to be better at executing XPath expressions than relational databases that have some XPath abstraction on top. The computer science of this is straightforward: purpose-built data structures for the task typically outperform generic data structures adapted to a new task.
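As a rough illustration of "doing the join in software", the sketch below uses Python's built-in `sqlite3` module with a made-up `users`/`posts` schema (both tables and all data are assumptions for the example). `join_in_application` issues two flat selects and stitches the rows together by hand, which is roughly what a loader-based resolver chain ends up doing; `join_in_database` leaves the same work to the query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO posts VALUES (10, 1, 'first'), (11, 2, 'second'), (12, 1, 'third');
""")


def join_in_application(conn):
    """Two flat selects; the join itself happens in Python."""
    posts = conn.execute(
        "SELECT id, author_id, title FROM posts ORDER BY id"
    ).fetchall()
    author_ids = sorted({author_id for _, author_id, _ in posts})
    placeholders = ",".join("?" * len(author_ids))
    users = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", author_ids
    ).fetchall()
    # This lookup table is the build side of a hash join, rebuilt by hand.
    name_by_id = dict(users)
    return [(title, name_by_id[author_id]) for _, author_id, title in posts]


def join_in_database(conn):
    """One query; the database chooses the join strategy."""
    return conn.execute(
        "SELECT p.title, u.name FROM posts p "
        "JOIN users u ON u.id = p.author_id ORDER BY p.id"
    ).fetchall()
```

Both functions return the same rows. The difference is where the join runs and who decides the order of operations, not whether a join happens.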
I recommend Jim Webber's article on the motivations for a native graph database if you want to go deeper on why the storage format and query processing approach matters.
What if it's not a native graph database? If you have a graph abstraction on top of an RDBMS, and then you use GraphQL to do graph queries against that, then you've shifted where and how the graph traversal happens, but you still can't get around the fact that the underlying data structure (tables) isn't optimized for that, and you're incurring extra overhead in translation.
So for all of these reasons, a native graph database + GraphQL is going to be the most performant option. As a result I'd conclude that GraphQL doesn't make graph databases unnecessary. Quite the opposite: it shows where they shine.
They're like chocolate and peanut butter. Both great, but really fantastic together. :)
Yes, GraphQL lets you make a kind of graph query: you can start from one entity, then explore its neighborhood, and so on.
But if you need performance in graph queries, you need a native graph database.
With GraphQL you give a lot of power to the end user. They can write a deep GraphQL query.
If you have an SQL database, you will have two choices:
build one big SQL query with a lot of joins (a really bad idea)
make a lot of SQL queries to retrieve the neighborhood, then the neighborhood of the neighborhood, ...
If you have a native graph database, it will be just one query with good performance! It's a graph traversal, and native graph databases are made for this.
Moreover, if you use GraphQL, you already think of your data model as a graph. So storing it as a graph seems the obvious choice and gives you fewer headaches :)
I recommend reading this post: The Motivation for Native Graph Databases
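To show what "it's a graph traversal" means in practice, here is a small sketch over a made-up friendship graph held as adjacency lists in a plain Python dict. It is not how any particular graph database is implemented; it only illustrates that once each node points directly at its neighbors, "friends of friends" is a breadth-first walk with no join step.

```python
from collections import deque

# Hypothetical social graph: each person maps directly to their friends.
FRIENDS = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dee"],
    "cid": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}


def neighborhood(graph, start, max_depth):
    """Return {node: hops} for every node within max_depth hops of start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]
    return hops
```

For example, `neighborhood(FRIENDS, "ann", 2)` returns `{"bob": 1, "cid": 1, "dee": 2}`. A deeper GraphQL query corresponds to a larger `max_depth`, not to more joins.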
Answer regarding the data loader
With a data loader you will make a lot of small queries (the second choice in my answer above), although to be fair the loader also keeps a cache of records it has already fetched.
Data loaders just batch and cache.
For comparison:
you need to add another library and implement the logic (more code)
you need to manage the cache. There is a lot of documentation about this topic. (more memory and complexity)
due to SELECT * in loaders, you will always fetch more data than needed. Example: I only want the id and name of a user, not their email, birthday, ... (less performant)
...
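The batch-and-cache behaviour described above can be sketched in a few lines. This is a simplified, synchronous illustration and not the API of the DataLoader library (which is asynchronous and normally scoped per request); `BatchLoader`, `fetch_users` and the `USERS` table are all hypothetical names made up for the example.

```python
class BatchLoader:
    """Fetches uncached keys in one batch and remembers the results."""

    def __init__(self, batch_fetch):
        # batch_fetch: callable taking a list of keys, returning {key: row}
        self._batch_fetch = batch_fetch
        self._cache = {}

    def load_many(self, keys):
        # De-duplicate while keeping order, then skip anything already cached.
        missing = [k for k in dict.fromkeys(keys) if k not in self._cache]
        if missing:
            self._cache.update(self._batch_fetch(missing))
        return [self._cache.get(k) for k in keys]


# Hypothetical backing table; a real loader would run SELECT ... WHERE id IN (...).
USERS = {
    1: {"id": 1, "name": "ann", "email": "ann@example.com"},
    2: {"id": 2, "name": "bob", "email": "bob@example.com"},
    3: {"id": 3, "name": "cid", "email": "cid@example.com"},
}

fetched_batches = []  # records which keys actually hit the backend


def fetch_users(ids):
    fetched_batches.append(list(ids))
    return {i: USERS[i] for i in ids if i in USERS}


loader = BatchLoader(fetch_users)
first = loader.load_many([1, 2, 1])  # one backend call for keys 1 and 2
second = loader.load_many([2, 3])    # only key 3 is fetched, 2 comes from cache
```

Note that the cache lives in application memory and every row comes back whole, email included, even if the caller only wanted the name. That is the extra memory and over-fetching cost listed above.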
The answer from FrobberOfBits is very good. There are many reasons to add (or avoid) using GraphQL, whether or not a graph database is involved. I wanted to add a small consideration against putting GraphQL in front of a graph. Of course, this is just one of what ought to be many other considerations involved with making a decision.
If the starting point is a relational database, then GraphQL (in front of that database) can provide a lot of flexibility to the caller, which is great for apps, clients, etc. that need to interact with data. But in order to do that, GraphQL needs to be aligned closely with the database behind it, and specifically with the database schema. The database schema is sort of "projected out" to apps, clients, etc. in GraphQL.
However, if the starting point is a native graph database (Neo4j, etc.) there's a world of schema flexibility available to you because it's a graph. No more database migrations, schema updates, etc. If you have new things to model in the data, just go ahead and do it. This is a really, really powerful aspect of graphs. If you were to put GraphQL in front of a graph database, you also introduce the schema concept – GraphQL needs to be shown what is / isn't allowed in the data. While your graph database would allow you to continue evolving your data model as product needs change and evolve, your GraphQL interactions would need to be updated along the way to "know" about what new things are possible. So there's a cost of less flexibility, and something else to maintain over time.
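A toy sketch of that trade-off, with everything in it assumed for illustration: the node is a plain dict standing in for a schema-free graph node, and `PERSON_FIELDS` stands in for a GraphQL type definition. Neither is a real Neo4j or GraphQL API.

```python
# Hypothetical schema-free node: a new property needs no migration.
node = {"labels": {"Person"}, "properties": {"name": "ann"}}
node["properties"]["nickname"] = "annie"

# Hypothetical API-side schema, standing in for a GraphQL type definition.
PERSON_FIELDS = {"name"}


def select_fields(node, requested, allowed):
    """Return the requested properties, rejecting any the schema does not list."""
    unknown = set(requested) - set(allowed)
    if unknown:
        raise KeyError(f"fields not in schema: {sorted(unknown)}")
    return {field: node["properties"].get(field) for field in requested}
```

The graph side accepted `nickname` immediately, but `select_fields(node, ["nickname"], PERSON_FIELDS)` raises until someone adds the field to the schema. That second update is the extra maintenance step described above.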
It might be great to use a graph + GraphQL, or it might be great to just use a graph by itself. Of course, like all things, this is a question of trade-offs.
What are the comparative advantages of querying a neo4j DB via
REST API
JDBC
as a Spring Data plugin
Performance will be better within Java using JDBC as opposed to a REST API. Here's a good explanation of why:
When you add complexity the code will run slower. Introducing a REST
service if it's not required will slow the execution down as the
system is doing more.
Abstracting the database is good practice. If you're worried about
speed you could look into caching the data in memory so that the
database doesn't need to be touched to handle the request.
Before optimizing performance though I'd look into what problem you're
trying to solve and the architecture you're using, I'm struggling to
think of a situation where the database options would be direct access
vs REST.
Regarding using neo4j as a plugin you can certainly do so, but I have to imagine the performance would not be as good as using JDBC.
From the book "Graph Databases" - Ian Robinson
Queries run fastest when the portions of the graph needed to satisfy
them reside in main memory (that is, in the filesystem cache and the
object cache). A single graph database instance today can hold many
billions of nodes, relationships, and properties, meaning that some
graphs will be just too big to fit into main memory.
If you add another layer to the app, this will be reflected in performance. So the more directly you can consume your data, the better the performance, but also the greater the complexity of the code and the effort needed to understand it.
I'm currently doing some R&D on moving some business functionality from an Oracle RDBMS to Neo4j to reduce join complexity in the application queries. Due to the maintenance and visibility requirements for the data, I believe the standalone server is the best option.
My thought is that within a Java program I would pull the relevant data out of the Oracle tables, map it to a node object and persist it to Neo4j (creating the appropriate relationships in the process).
I'm curious: with SDN over REST not being an optimal solution, what options are available for persistence? Are server plugins or unmanaged extensions the preferred method, or am I overcomplicating the issue, as tends to happen from time to time?
Thank you!
REST refers to a way to query the data over a network, not a way to store the data. Typically, you're going to store the data on some machine; you then have the option of either making it accessible via RESTful services with the Neo4j server, or just using Java applications to access the data.
I assume by SDN you're referring to Spring Data Neo4j. Spring is a framework used for Java applications, and SDN is a plugin, if you will, for Spring that allows Java programmers to store models in Neo4j. One could indeed use Spring Data Neo4j to read data in and then store it in Neo4j, but again this is a method of getting the data into Neo4j; it's not storage by itself.
The storage model is pretty much always the same. This link describes aspects of how storage actually happens.
Now, to your larger business objective. In order to do this with Neo4j, you're going to need to take a look at your Oracle data and decide how it is best modeled as a graph. There's a big difference between an Oracle RDBMS and Neo4j in terms of how the data is represented. Once you've settled on a graph design, you can then load your data into Neo4j (there are many different options for doing that).
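As a very small sketch of the modeling step, assume two made-up relational tables, `customers` and `orders`, already read into lists of dicts (the table names, columns and the `PLACED` relationship type are all hypothetical). One common starting point is to turn each row into a node and each foreign key into an explicit relationship; whether that is the right graph design depends on your queries.

```python
# Hypothetical rows as they might come out of two relational tables.
customers = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [
    {"id": 100, "customer_id": 1, "total": 25.0},
    {"id": 101, "customer_id": 1, "total": 40.0},
    {"id": 102, "customer_id": 2, "total": 10.0},
]


def rows_to_graph(customers, orders):
    """Map rows to nodes keyed by (label, id) and foreign keys to relationships."""
    nodes = {}
    relationships = []
    for row in customers:
        nodes[("Customer", row["id"])] = {"name": row["name"]}
    for row in orders:
        nodes[("Order", row["id"])] = {"total": row["total"]}
        # The customer_id column disappears; it becomes a relationship instead.
        relationships.append(
            (("Customer", row["customer_id"]), "PLACED", ("Order", row["id"]))
        )
    return nodes, relationships


nodes, relationships = rows_to_graph(customers, orders)
```

The resulting `nodes` and `relationships` could then be written to the database with whichever loading option you choose.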
Will all of this "reduce join complexity in the application queries"? Well, yes, in the sense that Neo4j doesn't do joins. Will it improve the speed/performance of your application? There's just no way to tell. The answer to that depends on what your app is, what the queries are, how you model the data as a graph, and how you express the resulting queries over that graph.
I'm hoping to hear from any of you who have architected and implemented a decent-sized Neo4j app (tens of millions of nodes/rels), and what your recommendations are, particularly w.r.t. modelling and the various APIs (vanilla Java/Groovy Neo4j vs Spring-Data-Neo4j vs Grails GORM/Neo4j).
I'm interested if it actually pays off to add the extra OGM (object-graph-mapping) layer and associated abstractions?
Has anyone's experience been that it is best to stick to 'plain' graph-modelling with nodes+properties, relationships+properties, traversals and (e.g.) Cypher to model and store their data?
My concern is that 'forcing' a particular OGM abstraction onto a graph database will affect future flexibility in adapting/changing the domain model and/or flexibility in querying the data.
We're a Grails shop, and I have experimented with GORM/Neo4J and also with spring-data-neo4j.
The primary purpose of the dataset will be to model and query relationships amongst very large numbers of people, their aliases, their associates and all sorts of criminal activity and history. There will be more than 50 main domain classes. There must be flexibility in the model (which will need to evolve rapidly in the early phases of the project) and in speed and flexibility of querying.
I have to confess, I'm struggling to find a compelling reason to use an OGM layer when I can use (e.g.) POJOs or POGOs, a little Groovy magic and some simple hand-rolled domain object <-> node/relationship mapping code. As far as I can tell, I think I would be happy just dealing with nodes & traversals & Cypher (aka KISS). But I would be very happy to hear others' experiences and recommendations.
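To give an idea of how little code the hand-rolled mapping might be, here is a rough sketch. It is written in Python instead of Groovy only to keep it short, and the `Person` class, `to_node` and `from_node` are hypothetical names; a node is represented as a plain dict, not a Neo4j API object.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class Person:
    id: int
    name: str
    alias: str = ""


def to_node(obj):
    """Domain object -> label plus property map."""
    return {"label": type(obj).__name__, "properties": asdict(obj)}


def from_node(cls, node):
    """Property map -> domain object, ignoring properties the class lacks."""
    known = {f.name for f in fields(cls)}
    # Unknown properties are skipped so the graph can evolve ahead of the classes.
    props = {k: v for k, v in node["properties"].items() if k in known}
    return cls(**props)
```

A round trip such as `from_node(Person, to_node(Person(1, "ann")))` gives back an equal object, and extra properties on the node do not break loading.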
Thanks for your time & thoughts,
TP
Since I'm the author of the Grails Neo4j plugin, I might be biased. The main reason for creating the plugin was to bring the ease of Grails domain classes, with their powerful out-of-the-box scaffolding, to Neo4j for ~80% of the use cases. For the other 20%, where specific requirements call for things like traversals, we use the Neo4j APIs directly (traversals/Cypher) and do not use the GORM API.
The current version of the Neo4j plugin suffers from a supernode issue, since each domain instance is connected to a subreference node. If multiple concurrent requests (aka threads) add new domain instances, there is a chance of getting a locking exception. I'm about to fix that, either with a sub-subreference approach or by using indexing.
Cypher can also be used in the Neo4j Grails plugin.
Spring-Data-Neo4j, on the other hand, is a more advanced approach with finer control over mapping details, but it requires the use of specific annotations. And I found no easy way to integrate it into Grails in a way that keeps scaffolding working.
We're using the predecessor version of the plugin in a production application with ~60k users and ~10^6 rels. Due to an NDA I cannot provide more details on that.
We do not use Grails, but we do use a hybrid plain Neo4j / Spring-Data-Neo4j solution. The reason is that some of our domain data has a fixed schema and some doesn't. SDN takes a lot of the burden away and can be mixed with plain Neo4j if the need arises.
We have classes that describe a data model; we persist the objects of these classes using SDN with no additional tricks, just the basics. Then we have classes that contain the data for the model that is not known beforehand. These are stored in nodes that contain special properties describing which model type the data refers to. When Neo4j 2 is released, we will probably move that info into labels. Between these nodes there can be relations, which are also described by the aforementioned data model managed by SDN. We also have relations from the generic nodes to SDN nodes, which works fine, as everything ends up being the same thing: nodes.
We have not encountered any issues with this approach so far. What we love most is that data whose model we do not know in advance is stored the same way we would have stored it had we known the model in advance, so the data actually matches the chosen model, which is very hard to achieve with any other (non-graph) type of database.
I am brand new to NoSQL databases (or any kind of database) and I need to build a graph database in Java. I have never used SpringSource before either. Will using Spring Data Neo4j make the process of creating a graph database easier, or will it complicate things? Should I just try to work with Neo4j directly?
Thank you very much.
It depends on your use-case. SDN is a good fit when you are already working in a Spring Environment and have a rich domain model which you want to map in the graph.
SDN is a good fit in all the cases where you mostly work with result sets of a few hundred or a few thousand POJOs that have to interact with existing libraries, UI layers or other application parts that deal with POJOs.
If you're not working in a Spring environment, it is up to you; SDN adds some complexity in setup and dependencies. There are also other solutions, like jo4neo or Tinkerpop Frames, that work on top of Neo4j.
It is slower than the native Neo4j API because of the indirection it introduces.
For the highest performance you can always fall back on the Neo4j API.
In general the core API is fastest; a good middle ground is the Cypher query language, which is very expressive.