This is my first project using Neo4j and the associated Spatial plugin. I am experiencing performance well below what I was expecting and below what's needed for this project. As a noob I may be missing or misunderstanding something. Help is appreciated and needed.
I am seeing very slow response times from Neo4j and the Spatial plugin when trying to find the OSM ways surrounding a point specified by lat/lon, in order to process GPS readings from a driven trip. I am calling spatial.closest('layer', {lon: lon, lat: lat}, 0.01), which takes 6-11 seconds to process and return approximately 25-100 nodes.
I am running Neo4j Community edition 3.0.4 and Spatial 0.20 on a MacBook Pro with 16 GB RAM and a 512 GB SSD. The OSM data is massachusetts-latest.osm (Massachusetts, USA). I am accessing it via Bolt and Cypher. Instrumented testing has been done from the browser client, a Python client, a Java client, as well as a custom version of Spatial that reports timing for the spatial stored procedure. The Neo4j database is approximately 44 GB in size and contains 76.5M nodes and 118.2M relationships. The schema and data are as-is from the OSMImport.
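The instrumented call from the Java client is essentially just a timed CALL to the procedure over Bolt. A minimal sketch of what that looks like, assuming the 1.x Java driver against Neo4j 3.0; the connection details, layer name, and coordinates here are placeholders:

```java
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.Values;

public class ClosestWaysTimer {
    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password")); // placeholder credentials
        try (Session session = driver.session()) {
            long start = System.currentTimeMillis();
            // Same call as described above: find the OSM geometries closest to a GPS reading.
            StatementResult result = session.run(
                    "CALL spatial.closest('layer', {lon: {lon}, lat: {lat}}, 0.01) YIELD node RETURN node",
                    Values.parameters("lon", -71.0589, "lat", 42.3601)); // Boston-area point
            int count = 0;
            while (result.hasNext()) {
                result.next();
                count++;
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(count + " nodes in " + elapsed + " ms");
        }
        driver.close();
    }
}
```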
To isolate the performance I added a custom version of spatial.closest() named spatial.timedClosest(). The timedClosest() stored procedure takes the same input and makes the same calls as spatial.closest(), but returns a stream of timing results instead of a stream of nodes; the stream carries timing information for the stored procedure.
The stored procedure execution time is split evenly between the internal call to getLayerOrThrow() and SpatialTopologyUtils.findClosestEdges().
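Roughly, the wrapper just times those two internal steps and streams the numbers back. A sketch of how such a procedure can be structured (the spatial-library calls are left as comments, since the exact internal API depends on the Spatial version):

```java
import java.util.Map;
import java.util.stream.Stream;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;

public class TimedSpatialProcedures {

    @Context
    public GraphDatabaseService db;

    public static class TimingResult {
        public long getLayerMs;    // time spent resolving the layer
        public long findClosestMs; // time spent in findClosestEdges
        public long resultCount;
    }

    // Hypothetical wrapper: same arguments as spatial.closest, but it reports timings.
    @Procedure("spatial.timedClosest")
    public Stream<TimingResult> timedClosest(@Name("layerName") String layerName,
                                             @Name("coordinate") Map<String, Object> coordinate,
                                             @Name("distanceInKm") double distanceInKm) {
        TimingResult r = new TimingResult();

        long t0 = System.currentTimeMillis();
        // Layer layer = ... getLayerOrThrow(layerName) ...  // same lookup spatial.closest performs
        long t1 = System.currentTimeMillis();
        // List<...> edges = SpatialTopologyUtils.findClosestEdges(point, layer, distanceInKm);
        long t2 = System.currentTimeMillis();

        r.getLayerMs = t1 - t0;
        r.findClosestMs = t2 - t1;
        r.resultCount = 0; // would be edges.size()
        return Stream.of(r);
    }
}
```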
1) Why does getLayer(layerName) take so long to execute? I am very surprised to see it take 2.5-5 seconds: there is only one layer, the OSM layer, directly off the root node. I see the same hit on calls to spatial.getLayer(). Since the layer is an argument to many of the spatial procedures, this is a big deal. Does anyone have insight into this?
2) Is there a way to speed up SpatialTopologyUtils.findClosestEdges()? Are there additional indexes that could be added to speed up the spatial proximity search?
My understanding is that Neo4j is capable of handling billions of nodes/relationships. For this project I am planning to load the North America OSM data. From my understanding of the Spatial plugin, its spatial management and search capabilities should provide a good starting foundation.
@Bo Guo, sorry for the delayed response; I've been away from Neo4j for a bit. I replaced the existing indexing with geohash indexing (https://en.wikipedia.org/wiki/Geohash). As the OSM data was loaded, the roadways and boundaries were tested for intersections with geohash regions. Geohash worked nicely for lookup. Loading the OSM data was still a bear: North America from OSM data on an 8-core mid-range AMD server with SATA SSDs took several days to a week.
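The geohash scheme boils down to storing a base32 prefix per node and doing prefix lookups. A self-contained sketch of the standard encoder; the label and property names in the comments are placeholders, not the actual schema:

```java
public class Geohash {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encode a lat/lon into a geohash string of the given character precision. */
    public static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // bits alternate: lon, lat, lon, lat, ...
        int bit = 0, ch = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; } else { ch = ch << 1; lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch = ch << 1; latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // every 5 bits becomes one base32 character
                hash.append(BASE32.charAt(ch));
                bit = 0;
                ch = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        // Boston, MA at 6 characters of precision (cells roughly 1.2 km x 0.6 km).
        System.out.println(encode(42.3601, -71.0589, 6));
        // With the hash stored on each node and indexed, nearby candidates can be
        // fetched with a prefix query, e.g. (placeholder label/property):
        //   MATCH (n:OSMNode) WHERE n.geohash STARTS WITH $prefix RETURN n
    }
}
```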
I am currently creating a PoC using Cosmos DB Graph. The data itself is around 100k nodes and 630k edges.
In one subset of this data (1.7k nodes and 3.8k edges) I am trying to find the shortest path from A to B with Gremlin.
Somehow this is not possible.
I get a query timeout (30 seconds) or a loop error (cannot exceed 32 loops)!?
There must be something wrong (on my side or on the Cosmos side) - can you please help or give a hint?
I have already tried a lot of query variants, but the errors are still there...
One of the basic queries I tried
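For context, a typical bounded shortest-path traversal in Gremlin, submitted as a string through the TinkerPop Java driver, looks roughly like the sketch below. This is an illustration, not necessarily the exact query from the post; the endpoint, key, and vertex ids 'A' and 'B' are placeholders:

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.Result;

public class ShortestPathQuery {
    public static void main(String[] args) {
        // Placeholder endpoint and credentials; Cosmos DB typically also needs a
        // GraphSON serializer configured on the builder.
        Cluster cluster = Cluster.build()
                .addContactPoint("your-account.gremlin.cosmos.azure.com")
                .port(443)
                .enableSsl(true)
                .credentials("/dbs/yourdb/colls/yourgraph", "yourPrimaryKey")
                .create();
        Client client = cluster.connect();

        // Standard bounded shortest-path pattern: walk outgoing edges, never revisit
        // a vertex (simplePath), stop at the target, keep the first path found.
        String gremlin = "g.V('A').repeat(out().simplePath()).until(hasId('B')).path().limit(1)";

        for (Result result : client.submit(gremlin)) {
            System.out.println(result);
        }

        client.close();
        cluster.close();
    }
}
```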
The limits of the Gremlin API service are documented here: https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin-limits
You may be looking for an OLAP engine to process such a large shortest-path query. You could consider Spark and its GraphFrames support to process it. Here is a sample: https://github.com/Azure/azure-cosmosdb-spark/blob/2.4/samples/graphframes/main.scala
I have two controllers accessing a distributed database. I receive data from a device at the controllers and store it in a Cassandra database. I use Docker to install Cassandra.
Node 1 is on controller 1 and node 2 is on controller 2. I would like to know whether it is possible to measure the time it takes for node 2 to be updated when I receive data at node 1.
I would like to draw a graph with this data, so could someone tell me how to measure it?
Thanks
Cassandra exposes this internal information through the nodetool gossipinfo command and cqlsh tracing.
In the scenario you are proposing, I'm inferring that you are using a replication factor of 2 and that you are interested in the exact time it takes for the information to be written to all the nodes. You can measure the time required to do a write with the consistency level set to ALL and compare it with similar writes using a consistency level of ONE; the difference between the times will be the propagation from one node to the other.
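A sketch of that comparison with the DataStax Java driver (3.x API assumed; the contact point, keyspace, and table are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public class WriteLatencyProbe {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("controller1").build();
        Session session = cluster.connect("sensors"); // placeholder keyspace
        PreparedStatement insert = session.prepare(
                "INSERT INTO readings (device_id, value) VALUES (?, ?)"); // placeholder table

        long oneMs = timedWrite(session, insert, ConsistencyLevel.ONE); // ack from one replica
        long allMs = timedWrite(session, insert, ConsistencyLevel.ALL); // ack from all replicas
        // With RF = 2, the difference approximates the node 1 -> node 2 propagation cost.
        System.out.printf("ONE: %d ms, ALL: %d ms, difference: %d ms%n", oneMs, allMs, allMs - oneMs);

        cluster.close();
    }

    private static long timedWrite(Session session, PreparedStatement insert, ConsistencyLevel cl) {
        Statement stmt = insert.bind("device-1", 42.0).setConsistencyLevel(cl);
        long start = System.nanoTime();
        session.execute(stmt);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

In practice you would repeat each write many times and compare averages, since a single measurement is dominated by network jitter.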
Finally, if you are interested in measuring the performance of queries in Cassandra, there are several tools that enhance the tracing functionality; on our team we have been using Zipkin with good results.
Note: I have done a little reformatting and added some additional information.
Please take a look at this: Question_Answer
I want to ask - with DSE 5.0 and the upcoming changes that were mentioned at C* Summit this year for 5.1 and 5.2, will the same advice be useful?
Our use case is:
The platform MUST be available at all times. (Cassandra)
The data must be searchable. (SOLR / Lucene)
The platform MUST provide analytics / Data Warehousing / BI etc (Graph / Spark)
All of that is possible in a single product offering thanks to DSE! Thank you DataStax!
But our amount of data stored and our transaction count are VERY modest.
Our specification is for 100 concurrent sessions within the application - which of course doesn't even translate to 100 concurrent DB requests / operations.
For the most part our application resembles an everyday enterprise CRUD application.
While not ridiculous, AWS instances aren't exactly free.
Having a separate cluster for each workload (with enough replication for continuous availability) will be a cost issue for us.
While I understand a proof of concept can offer some help, without a real workload and real users passing through the services/applications, it can't provide the kind of insight that only a "production" system and rogue users really can. The best you can do is "loaded" functional testing.
In short, we're a little stuck here from a platform perspective.
We're initially thinking of having (see the keyspace sketch after this list):
2 data centres for geographic isolation
2 racks per DC
2 nodes per Rack
RF of 3
CL of local_quorum
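A sketch of how that topology maps onto a keyspace definition and per-request consistency (DataStax Java driver 3.x assumed; the contact point, DC names, keyspace, and table are placeholders):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class TopologySketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build(); // placeholder
        Session session = cluster.connect();

        // RF of 3 in each of the two data centres (placeholder DC names).
        session.execute(
                "CREATE KEYSPACE IF NOT EXISTS app WITH replication = "
                        + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}");

        // LOCAL_QUORUM: 2 of the 3 replicas in the local DC must acknowledge.
        Statement read = new SimpleStatement("SELECT * FROM app.users WHERE id = ?", "u1")
                .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(read);

        cluster.close();
    }
}
```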
If we find we're hitting performance issues, we can scale out - add an extra rack or extra nodes to the initial 2 racks.
As for V-nodes or number of tokens, we have no idea.
The documentation for DSE Search says vnodes add a 30% overhead, so it sounds like you shouldn't use vnodes, but then a table in the documentation also says to use 16 or 32. How can it be both?
If we can successfully run all workloads on a single node (our requirements are genuinely minimal), do we run with vnodes (16 or 32) or do we run with a single token?
Lastly, is there another alternative?
Can you have nodes with different workloads in the same data centre, where individual nodes are set up with the RAM/CPU requirements for a specific workload?
Assuming our 4 nodes per data centre (as a starting place only; we have no idea whether you can successfully run Search or Spark on a single node):
Node 1: Just Cassandra
Node 2: Cassandra and Search
Node 3: Cassandra and Graph
Node 4: Cassandra and Spark
If Search needs 64 GB of RAM, so be it... but the Cassandra-only node could well work with just 8 or 16.
That way we can cater for CPU and memory per workload type but still have only a single DC. (We'll have 2 for redundancy, but effectively it is a single-DC installation, mirrored.)
Thanks in advance for your help.
Vnodes add additional overhead for the scatter-gather part of the search solution. In some benchmarks that has been as high as 30%. Some customers are willing to live with that overhead and want to use vnodes for the benefits of dynamic scaling.
If you have or are planning a small cluster - and won't need to scale it on the fly - then I would definitely recommend sticking with single tokens. The hidden benefit of that approach is that your repairs will be slightly faster too. This helps with Search, as you are reading at the equivalent of CL.ONE.
It is possible to run all the features in the same DC (Search, Analytics and now Graph), but you will find that the overheads go up. You will need larger nodes with more memory and CPU resources to cope with the processing load. I'd probably start with 128 GB of RAM and go from there. I guess if your load is really light you might get away with less. As with everything, benchmarking at the scale you're intending to run is key.
As an aside I'm not totally clear on your intentions re RF. You kind of imply 2 nodes and RF=3. I'm guessing it's just phrasing, but if not - it's worth noting you want at least as many nodes as the RF for best coverage!
I need to build and analyze a complex network using neo4j and would like to know what is the recommended hardware for the following setup:
There are three types of nodes.
There are three types of relationships.
At steady state, the network will contain about 1M nodes of each type and about the same number of edges.
Every day, about 500K relationships are updated and 100K nodes and edges are added. Approximately the same number of nodes/edges are also removed.
Network update will be done in daily batches and we can tolerate update times of 1-2 hours
Once the system is up, we will query the database for shortest paths between different nodes, no more than 500K times per day. We can live with batch queries (a sketch of such a query follows this list).
Most probably, I'll use the REST API.
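To make that concrete, a shortest-path call over Neo4j's transactional REST endpoint would look roughly like the sketch below; the label, id property, path-length bound, and credentials are all placeholders rather than our actual model:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Scanner;

public class ShortestPathRest {
    public static void main(String[] args) throws Exception {
        // Cypher shortestPath between two nodes, parameterized by their ids.
        String payload = "{\"statements\":[{\"statement\":"
                + "\"MATCH (a:Entity {id:{from}}), (b:Entity {id:{to}}), "
                + "p = shortestPath((a)-[*..15]-(b)) RETURN p\","
                + "\"parameters\":{\"from\":\"A\",\"to\":\"B\"}}]}";

        URL url = new URL("http://localhost:7474/db/data/transaction/commit");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Authorization", "Basic "
                + Base64.getEncoder().encodeToString("neo4j:password".getBytes(StandardCharsets.UTF_8)));
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
            while (s.hasNextLine()) {
                System.out.println(s.nextLine());
            }
        }
    }
}
```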
I think you should take a look at Neo4j Hardware requirements.
For the server you're talking about, I think the first thing needed will obviously be plenty of network bandwidth, especially if your requests need to complete in a short time.
Apart from that, a "normal" server should be enough:
8 or more cores
At least 24 GB of RAM
At least 1 TB of SSD storage (this one is important and expensive)
Good bandwidth (like 1 Gbps)
By the way, it's not a programming question, so I think you should have asked Neo4j directly.
You can use the Neo4j hardware sizing calculator for a rough estimate of the hardware needs.
I am new to Neo4j and graph databases. That said, I have around 40,000 independent graphs loaded into a Neo4j database using batch insertion, and so far everything has gone well. My current database folder is 180 GB in size; the problem is querying, which is too slow. Just counting the number of nodes takes forever. I am using a server with 1 TB of RAM and 40 cores, so I would like to load the entire database into memory and run queries against it.
I have looked into the configuration options but am not sure what changes I should make to cache the entire database in memory, so please suggest which properties I should modify.
I also noticed that most of the time Neo4j uses only one or two cores. How can I increase that?
I am using the free version for a university research project, so I am unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database and now my database size is 400 GB with more than a billion nodes. I took Stefan's advice, used the Java API to access my database, and moved my database to a RAM disk. It takes about 3 hours to walk through all the nodes and collect information from each node.
The RAM disk and the Java API gave a big boost in performance.
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If caches are not populated (or not configured for your dataset), the speed of your disk is the most influential factor.
To speed things up, make sure your caches are configured efficiently; see http://neo4j.com/docs/stable/configuration-caches.html.
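If you happen to be on a 3.x embedded setup, the main knob is the page cache; a minimal sketch (the store path and size are placeholders, and the heap itself is still sized with the usual -Xmx JVM flag):

```java
import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class CacheConfigSketch {
    public static void main(String[] args) {
        // Size the page cache so that (most of) the 180 GB store fits in memory.
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder(new File("/path/to/graph.db"))
                .setConfig("dbms.memory.pagecache.size", "200g")
                .newGraphDatabase();
        // ... run queries ...
        db.shutdown();
    }
}
```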
With current versions of Neo4j, a Cypher query traverses the graph in single threaded mode. Since most graph applications out there are concurrently used by multiple users, this model saturates the available cores.
If you want to run a single query multithreaded, you need to use the Java API.
In general, Neo4j Community edition has some limitations in scaling beyond 4 cores (Enterprise edition has a more performant lock manager implementation). The HPC (high-performance cache) in Enterprise edition also significantly reduces the impact of full garbage collections.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). Also, you can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients, or write Java code that utilizes more cores during traversal with thread pools.
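A rough sketch of that pattern with the embedded Java API (Neo4j 3.x assumed; the store path is a placeholder, and for a graph of your size you would batch node ids rather than submit one task per node):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class ParallelNodeScan {
    public static void main(String[] args) throws InterruptedException {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("/path/to/graph.db")); // placeholder path

        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicLong counter = new AtomicLong();

        // One thread streams node ids; the pool does the per-node work, each task
        // in its own transaction, so every core has something to chew on.
        try (Transaction tx = db.beginTx()) {
            for (Node node : db.getAllNodes()) {
                final long id = node.getId();
                pool.submit(() -> {
                    try (Transaction inner = db.beginTx()) {
                        Node n = db.getNodeById(id);
                        // ... inspect n's properties/relationships here ...
                        counter.incrementAndGet();
                        inner.success();
                    }
                });
            }
            tx.success();
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        System.out.println("visited " + counter.get() + " nodes");
        db.shutdown();
    }
}
```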