Is there a way to set the fetch size of a Gremlin traversal? I have a very complicated traversal that is expected to return a large number of nodes. The iteration fetches these nodes in batches, so a long time is spent on the network. Is there a way to provide a fetch size to Gremlin so that this time can be minimized?
Gremlin itself has no option for that. Control over such a thing is handled by whatever you use to execute your Gremlin. For example, if you are using Titan with Cassandra, you could change this setting:
storage.cassandra.thrift.frame_size_mb
which controls the maximum frame size used by Thrift for transport. You can increase this value when retrieving very large result sets. You can read more about other such settings here and in other implementation-specific configuration wiki pages.
Another example is issuing your Gremlin to Rexster. In this case, you have fewer options that work out of the box. Rexster generically works best for fast request/response. If you know ahead of time that your traversals will return large result sets, it might be better to write your own Rexster Extension. A good example to look at is how the FaunusRexsterInputFormatExtension works. This extension provides a way to stream back specified portions of the entire graph. These are very long-running operations over HTTP (on a large graph, of course). You might find that to be a good model.
Related
I'm researching graph databases for a work project. Since our data is highly connected, it appears that a graph database would be a good option for us.
One of the first graph DB options I've run into is neo4j, and for the most part, I like it. However, I have one question about neo4j to which I cannot find the answer: Can I get neo4j to store the entire graph in-memory? If so, how does one configure this?
The application I'm designing needs to be lightning-fast. I can't afford to wait for the db to go to disk to retrieve the data I'm searching for. I need the entire DB to be held in-memory to reduce the query time.
Is there a way to hold the entire neo4j DB in-memory?
Thanks!
Further to Bruno Peres' answer, if you want to run a regular server instance, Neo4j will load the entire graph into memory when resources are sufficient. This does indeed improve performance.
The Manual has a chapter on configuring memory.
The page cache portion holds graph data and indexes - this is configured via the dbms.memory.pagecache.size property in neo4j.conf. If it is large enough, the whole graph will be stored in memory.
The heap space portion is for query execution, state management, etc. This is set via the dbms.memory.heap.initial_size and dbms.memory.heap.max_size properties. Generally these two properties should be set to the same value, so that the whole heap is allocated on startup.
If the sole purpose of the server is to run Neo4j, you can allocate most of the memory to the heap and page cache, leaving enough left over for operating system tasks.
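As a rough illustration of the split described above, here is a minimal sketch that turns a memory budget into neo4j.conf lines. The function name and the example figures (64 GB total, 4 GB for the OS, 16 GB heap) are assumptions made up for the example, not sizing advice; only the three property names come from the answer.

```python
def neo4j_memory_settings(total_ram_gb, os_reserve_gb, heap_gb):
    """Return neo4j.conf lines for a server dedicated to Neo4j.

    All three inputs are choices made by the operator (assumptions here);
    whatever is left after the OS reserve and the heap goes to the page cache.
    """
    page_cache_gb = total_ram_gb - os_reserve_gb - heap_gb
    if page_cache_gb <= 0:
        raise ValueError("no memory left for the page cache")
    return [
        # Same value for both heap properties, as the answer recommends.
        f"dbms.memory.heap.initial_size={heap_gb}g",
        f"dbms.memory.heap.max_size={heap_gb}g",
        # The page cache holds graph data and indexes.
        f"dbms.memory.pagecache.size={page_cache_gb}g",
    ]


if __name__ == "__main__":
    for line in neo4j_memory_settings(total_ram_gb=64, os_reserve_gb=4, heap_gb=16):
        print(line)
```

For the whole graph to stay in memory, the resulting page cache figure would need to be at least as large as the store files on disk.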
Holding Very Large Graphs In Memory
At GraphConnect in San Francisco, 2016, Neo4j's CTO, Jim Webber, in his typical entertaining fashion, gave details on servers that have a very large amount of high-performance memory, capable of holding an entire large graph in memory. He seemed suitably impressed by them. I forget the name of the machines, but if you're interested, the video archive should have details.
Neo4j isn't designed to hold the entire graph in main memory. This leaves you with a couple of options. You can either play around with the config parameters (as Jasper Blues already explained in more detail) or you can configure Neo4j to use a RAM disk.
The first option probably won't give you the best performance as only the cache is held in memory.
The challenge with the second approach is that everything is in-memory which means that the system isn't durable and the writes are inefficient.
You can take a look at Memgraph (DISCLAIMER: I'm the co-founder and CTO). Memgraph is a high-performance, in-memory transactional graph database and it's openCypher and Bolt compatible. The data is first stored in main memory before being written to disk. In other words, you can choose to make a tradeoff between write speed and safety.
I've been reading Neo4j's Operations Manual on cache sharding, and posts all over the web, but I can hardly find any detailed example of how to configure HAProxy for cache sharding (yes, the one in the Operations Manual is rather brief) on a real-world graph, which may contain multiple node labels.
Has anyone ever done this before? Would be lovely if you could share your experience.
Moreover, I'm a bit confused about how sharding the graph with HAProxy actually works. How do sub-graphs get cached on certain slaves merely by providing rules in HAProxy? It surprised me to learn that cache sharding isn't handled by Neo4j.
The goal is to always send queries that hit the same region of your graph to the same instance. Each instance then ends up caching mostly the regions it is asked about, which is why routing rules alone are enough. This of course means that the request data must indicate the region. What to use as the "region indicator" depends heavily on the structure and shape of your graph.
In many customer-facing applications, people have successfully used the current user ID, setting it as an additional HTTP header that HAProxy then evaluates.
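The routing idea itself is small enough to sketch outside HAProxy. The snippet below is only an illustration of "same user, same instance"; the instance names are hypothetical, and a real deployment would express the equivalent rule in the HAProxy configuration rather than in application code.

```python
import zlib


def pick_instance(user_id, instances):
    """Map a user ID to one instance so that repeated requests land on the same cache.

    crc32 is used because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    index = zlib.crc32(user_id.encode("utf-8")) % len(instances)
    return instances[index]


if __name__ == "__main__":
    # Hypothetical instance names, for illustration only.
    slaves = ["neo4j-slave-1", "neo4j-slave-2", "neo4j-slave-3"]
    for uid in ["alice", "bob", "alice"]:
        print(uid, "->", pick_instance(uid, slaves))
```

One consequence of plain modulo hashing is that adding or removing an instance remaps most users, so caches need to warm up again after a topology change.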
I want to run as many instances of Neo4j (using the Enterprise version) on a single VM as possible. What are the real minimum RAM requirements to fire up an instance?
Right now TaskManager is telling me that Java.exe is taking about 70,000K (70 Meg). Does that sound right?
I'm not worried about the performance, I just want to stuff as many instances as possible on a single box so people can do some low demand search of their graph.
There is what is recommended, and then there is "it depends".
Neo4j is able to run on the Raspberry Pi, but you shouldn't expect great performance. I also use an AWS t2.micro for testing, and it's enough.
The size is bound to fluctuate as the graph is loaded into memory to perform traversals and paged back to disk when memory is running out.
If I may offer up a suggestion, you could run only one database instance and have unconnected graphs for each of your users. This would very likely be far more efficient in terms of server resources.
For example, if you have, say, (:Item) nodes which make up a graph for each user, you could instead label them (:Item-User1), with a unique prefix or postfix for each user.
Then, when you want to run a query for a particular user, you just add that unique element to the label and search the graph.
The idea is to have a separate sub-graph for each user, unconnected to other users' sub-graphs, instead of a separate database instance for each individual user. As long as each user's sub-graph is unconnected from other users' sub-graphs, there should be no security vulnerability where a user is given access to another user's data.
This way you could potentially have a practically unlimited number of users (within reason, quite possibly millions), each with their own sub-graph and with potentially no loss in performance, instead of the handful of database instances you could spin up on a single VM, which are likely to compete for resources and choke each other out.
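A minimal sketch of the per-user label idea, assuming the application builds its Cypher strings itself. The helper names are hypothetical. The sketch joins the parts with an underscore because a hyphenated label would need backtick quoting in Cypher, and it validates the user ID because labels cannot be passed as query parameters and therefore end up in the query text.

```python
import re


def user_label(base_label, user_id):
    """Build a per-user label such as Item_User1.

    The user ID goes into the query text, so only plain alphanumerics
    are accepted here.
    """
    if not re.fullmatch(r"[A-Za-z0-9]+", user_id):
        raise ValueError("user id must be alphanumeric")
    return f"{base_label}_{user_id}"


def items_query(user_id):
    """Return a Cypher query restricted to one user's sub-graph."""
    label = user_label("Item", user_id)
    return f"MATCH (n:{label}) RETURN n"


if __name__ == "__main__":
    print(items_query("User1"))
```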
For neo4j 3.x, the documented minimum memory requirement is 2GB.
I am new to Neo4j and graph databases. That said, I have around 40,000 independent graphs uploaded into a Neo4j database using batch insertion, and so far everything has gone well. My current database folder size is 180 GB. The problem is querying, which is too slow: just counting the number of nodes takes forever. I am using a server with 1 TB of RAM and 40 cores, so I would like to load the entire database into memory and run queries on it.
I have looked into the configuration but am not sure what changes I should make to cache the entire database in memory. Please suggest which properties I should modify.
I also noticed that most of the time Neo4j is using only one or two cores. How can I increase that?
I am using the free version for a university research project, so I am unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database, and it is now 400 GB with more than a billion nodes. I took Stefan's comments on board, used the Java API to access my database, and moved my database to a RAM disk. It takes up to 3 hours to walk through all the nodes and collect information from each node.
The RAM disk and the Java API gave a big boost in performance.
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If caches are not populated (or not configured according to your dataset), the speed of your disk is the most influential factor.
To speed things up, be sure to have caches configured efficiently, see http://neo4j.com/docs/stable/configuration-caches.html.
With current versions of Neo4j, a Cypher query traverses the graph in single-threaded mode. Since most graph applications are used concurrently by multiple users, this model normally saturates the available cores anyway.
If you want to run a single query multithreaded, you need to use the Java API.
In general, Neo4j Community Edition has some limitations in scaling beyond 4 cores, because the more performant lock manager implementation is only in Enterprise Edition. The HPC (high performance cache) in Enterprise Edition also significantly reduces the impact of full garbage collections.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). You can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients or write Java code that uses more cores during traversal with thread pools.
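The fan-out pattern behind that advice can be sketched as follows. This is only an illustration of splitting work into chunks and summing partial results from a pool; the per-chunk function is a stand-in that iterates over plain integers, not a real graph traversal, and all names are hypothetical. In Java the same shape is typically built on an ExecutorService, and note that Python threads do not run pure-Python code on several cores at once, so the sketch shows the structure rather than the speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, parts):
    """Split [start, stop) into at most `parts` contiguous chunks."""
    size = max(1, -(-(stop - start) // parts))  # ceiling division
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def count_matching(chunk, predicate):
    """Stand-in for per-chunk work: count IDs in the chunk that satisfy predicate."""
    lo, hi = chunk
    return sum(1 for node_id in range(lo, hi) if predicate(node_id))


def parallel_count(start, stop, predicate, workers=4):
    """Fan the chunks out to a thread pool and add up the partial counts."""
    chunks = split_range(start, stop, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda c: count_matching(c, predicate), chunks))


if __name__ == "__main__":
    print(parallel_count(0, 1000, lambda node_id: node_id % 2 == 0))
```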
This is in the context of a small data-center setup where the number of servers to be monitored is only in the double digits and may grow only slowly to a few hundred (if at all). I am a Ganglia newbie and have just finished setting up a small Ganglia test bed (and have been reading about and playing with it). A couple of things I have realised:
gmetad supports interactive queries on port 8652, through which I can get subsets of the metric data, say the data for a particular metric family in a specific cluster
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
In my setup, I don't want to use gmetad or RRD. I want to fetch data directly from the multiple gmond clusters and store it in a single data store. There are a couple of reasons not to use gmetad and RRD:
I don't want multiple data stores in the whole setup. I can have one dedicated machine fetch data from the few clusters and store it
I don't plan to use gweb as the data front end. The data from Ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad could add. That is, if gmetad polls, say, every minute and my management tool polls gmetad every minute, that adds up to 2 minutes of delay, which I feel is unnecessary for a relatively small or medium-sized setup
There are a couple of problems with this approach for which I need help:
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric or metric-group information from gmond (since different metrics are collected at different intervals)?
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
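Absent server-side filtering, one workaround is to read the whole dump and filter it on the client. The sketch below assumes the usual gmond XML layout of CLUSTER, HOST and METRIC elements with NAME and VAL attributes; the function names and the example host name are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump that gmond sends before it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def filter_metrics(xml_bytes, wanted_names):
    """Return (cluster, host, metric, value) tuples for the wanted metric names."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") in wanted_names:
                    rows.append((cluster.get("NAME"), host.get("NAME"),
                                 metric.get("NAME"), metric.get("VAL")))
    return rows


if __name__ == "__main__":
    # "gmond-host" is a placeholder; replace it with a real gmond address.
    dump = read_gmond_xml("gmond-host")
    for row in filter_metrics(dump, {"load_one", "cpu_idle"}):
        print(row)
```

This still transfers the full dump over the network on every poll, so it addresses only the storage and parsing side of the problem.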
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of in doing so, from a data collection standpoint?
Thanks in advance.