Cosmos DB Gremlin query timeout

I am currently creating a PoC using Cosmos DB Graph. The data set is around 100k nodes and 630k edges.
In one subset of this data (1.7k nodes and 3.8k edges) I am trying to find the shortest path from A to B with Gremlin. Somehow this is not possible: I either get a query timeout (30 seconds) or a loop error ("cannot exceed 32 loops").
There must be something wrong (on my side or on the Cosmos side). Can you please help or give a hint?
I have already tried a lot of query variants, but the errors are still there.
One of the basic queries I tried:
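(The query itself is not reproduced here. For reference, the shortest-path pattern usually behind this kind of question is a repeat()/until() traversal, sketched below with the TinkerPop Java API and hypothetical vertex ids "A" and "B"; Cosmos DB also accepts the equivalent Gremlin string. It is exactly this pattern that runs into the 30-second timeout and the 32-iteration repeat limit on a dense subgraph.)

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*;
import org.apache.tinkerpop.gremlin.process.traversal.Path;

// g: a GraphTraversalSource already connected to the graph (assumption).
// simplePath() avoids revisiting vertices; limit(1) stops at the first
// (hop-count shortest) path found.
Path shortest = g.V("A")
        .repeat(out().simplePath())
        .until(hasId("B"))
        .path()
        .limit(1)
        .next();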

The limits of the Gremlin API service are documented here: https://learn.microsoft.com/en-us/azure/cosmos-db/gremlin-limits
You may need an OLAP engine to process a shortest-path query of this size. You could consider Spark and its GraphFrames support. Here is a sample: https://github.com/Azure/azure-cosmosdb-spark/blob/2.4/samples/graphframes/main.scala

Related

Database synchronization time in Cassandra

I have two controllers accessing a distributed database. I receive data from a device on the controllers and store it in a Cassandra database. I use Docker to install Cassandra.
Node 1 is on controller 1 and node 2 is on controller 2. I would like to know if there is a way to measure the time it takes to update node 2 when I receive data at node 1.
I would like to draw a graph with it, so could someone tell me how to measure it?
Thanks
Cassandra exposes this internal information through the nodetool gossipinfo command and cqlsh tracing.
In the scenario you are proposing, I'm inferring that you are using a replication factor of 2 and that you are interested in the exact time it takes for the information to be written to all nodes. You can measure the time required to do a write with the consistency level set to ALL and compare it with similar writes using a consistency level of ONE; the difference between the two times is the propagation time from one node to the other.
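A rough sketch of that ALL-versus-ONE comparison with the DataStax Java driver (3.x-style API; keyspace, table, and values are hypothetical):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

Cluster cluster = Cluster.builder().addContactPoint("controller1").build();
Session session = cluster.connect("sensors");   // hypothetical keyspace

// Write that must be acknowledged by both replicas (RF = 2).
SimpleStatement writeAll = new SimpleStatement(
        "INSERT INTO readings (device_id, ts, value) VALUES (?, ?, ?)",
        42, System.currentTimeMillis(), 3.14);
writeAll.setConsistencyLevel(ConsistencyLevel.ALL);
long t0 = System.nanoTime();
session.execute(writeAll);
long allMicros = (System.nanoTime() - t0) / 1_000;

// Same kind of write acknowledged by a single replica.
SimpleStatement writeOne = new SimpleStatement(
        "INSERT INTO readings (device_id, ts, value) VALUES (?, ?, ?)",
        42, System.currentTimeMillis(), 2.71);
writeOne.setConsistencyLevel(ConsistencyLevel.ONE);
long t1 = System.nanoTime();
session.execute(writeOne);
long oneMicros = (System.nanoTime() - t1) / 1_000;

// The difference approximates the time spent propagating to the second node.
System.out.printf("ALL: %d us, ONE: %d us, delta: %d us%n",
        allMicros, oneMicros, allMicros - oneMicros);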
Finally, if you are interested in measuring the performance of your queries in Cassandra, there are several tools that enhance the tracing functionality; in our team we have been using zipkin with good results.

Performance issues with Neo4j Spatial and OSM data

This is my first project using Neo4j and the associated Spatial plug-in. I am experiencing performance well below what I was expecting and below what is needed for this project. As a noob I may be missing something or have misunderstood something. Help is appreciated and needed.
I am experiencing very slow response times from Neo4j and the Spatial plugin when trying to find the OSM ways surrounding a point specified by lat/lon, in order to process GPS readings from a driven trip. I am calling spatial.closest('layer', {lon, lat}, 0.01), which takes 6-11 seconds to process and returns approximately 25-100 nodes.
I am running Neo4j Community Edition 3.0.4 and Spatial 0.20 on a MacBook Pro (16GB RAM / 512GB SSD). The OSM data is massachusetts-latest.osm (Massachusetts, USA). I am accessing it via Bolt and Cypher. Instrumented testing has been done from the browser client, a Python client, a Java client, as well as a custom version of Spatial that reports timing for the spatial stored procedure. The Neo4j database is approximately 44GB in size and contains 76.5M nodes and 118.2M relationships. The schema and data are as-is from OSMImport.
To isolate the performance I added a custom version of spatial.closest() named spatial.timedClosest(). The timedClosest() stored procedure takes the same input and makes the same calls as spatial.closest(), but returns a stream of timing results instead of the regular result stream, so the timing information for the stored procedure comes back with the results.
The stored procedure execution time is split evenly between the internal call to getLayerOrThrow( ) and SpatialTopologyUtils.findClosestEdges( ).
1) Why does getLayer(layerName) take so long to execute? I am very surprised to observe getLayer(layerName) takes so long: 2.5 - 5 seconds. There is only one layer, the OSM layer, directly off the root node. I see the same hit on calls to spatial.getLayer(). Since the layer is an argument to many of the spatial procedures, this is a big deal. Anyone have insight into this?
2) Is there a way to speed up SpatialTopologyUtils.findClosestEdges()? Are there additional indexes that could be added to speed up the spatial proximity search?
My understanding is that Neo4j is capable of handling billions of nodes/relationships. For this project I am planning to load the North America OSM data. From my understanding of the Spatial plug-in, it has spatial management and searching capabilities that would provide a good starting foundation.
@Bo Guo, sorry for the delayed response. I've been away from Neo4j for a bit. I replaced the existing indexing with geohash indexing (https://en.wikipedia.org/wiki/Geohash). As the OSM data was loaded, the roadways and boundaries were tested for intersections with geohash regions. Geohash worked nicely for lookup. Loading the OSM data was still a bear: North America OSM data on an 8-core mid-range AMD server with SATA SSDs would take several days to a week.
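For illustration, a minimal geohash encoder of the kind used for such lookups (standard base32 geohash; this is a sketch, not the actual project code). Ways and boundaries can be indexed by the geohash cells they intersect, so a point lookup becomes a prefix match:

// Minimal geohash encoder (standard base32 alphabet).
static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

static String geohash(double lat, double lon, int precision) {
    double[] latRange = {-90.0, 90.0};
    double[] lonRange = {-180.0, 180.0};
    StringBuilder hash = new StringBuilder();
    boolean useLon = true;            // bits alternate, starting with longitude
    int bits = 0, ch = 0;
    while (hash.length() < precision) {
        double[] range = useLon ? lonRange : latRange;
        double value  = useLon ? lon : lat;
        double mid = (range[0] + range[1]) / 2.0;
        ch <<= 1;
        if (value >= mid) { ch |= 1; range[0] = mid; } else { range[1] = mid; }
        useLon = !useLon;
        if (++bits == 5) {            // 5 bits per base32 character
            hash.append(BASE32.charAt(ch));
            bits = 0;
            ch = 0;
        }
    }
    return hash.toString();
}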

Neo4j Huge database query performance configuration

I am new to Neo4j and graph databases. That said, I have around 40000 independent graphs uploaded into a Neo4j database using batch insertion, and so far everything went well. My current database folder size is 180GB. The problem is querying, which is too slow: just counting the number of nodes takes forever. I am using a server with 1TB of RAM and 40 cores, so I would like to load the entire database into memory and perform queries on it.
I have looked into the configuration but am not sure what changes I should make to cache the entire database in memory, so please suggest the properties I should modify.
I also noticed that most of the time Neo4j uses only one or two cores. How can I increase that?
I am using the free version for a university research project, so I am unable to use the High-Performance Cache. Is there an alternative in the free version?
My Solution:
I added more graphs to my database and now my database size is 400GB with more than a billion nodes. I took Stefan's comments, used the Java APIs to access my database, and moved the database to a RAM disk. It takes around 3 hours to walk through all the nodes and collect information from each node.
The RAM disk and the Java APIs gave a big boost in performance.
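For reference, a minimal sketch of that walk with the embedded Java API (Neo4j 2.x-era calls; the store path and the property read are hypothetical):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.tooling.GlobalGraphOperations;

GraphDatabaseService db = new GraphDatabaseFactory()
        .newEmbeddedDatabase("/mnt/ramdisk/graph.db");   // hypothetical path on the RAM disk

long count = 0;
try (Transaction tx = db.beginTx()) {
    for (Node node : GlobalGraphOperations.at(db).getAllNodes()) {
        // collect whatever is needed from each node, e.g. a property (hypothetical name)
        Object value = node.getProperty("name", null);
        count++;
    }
    tx.success();
}
System.out.println("nodes visited: " + count);
db.shutdown();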
Counting nodes in a graph is a global operation that obviously needs to touch each and every node. If caches are not populated (or not configured according to your dataset), the speed of your hard drive is the most influential factor.
To speed things up, be sure to have the caches configured efficiently; see http://neo4j.com/docs/stable/configuration-caches.html.
With current versions of Neo4j, a Cypher query traverses the graph in single-threaded mode. Since most graph applications out there are used concurrently by multiple users, this model still saturates the available cores.
If you want to run a single query multithreaded, you need to use the Java API.
In general, Neo4j Community Edition has some limitations in scaling beyond 4 cores (the Enterprise Edition has a more performant lock manager implementation). The HPC (high-performance cache) in the Enterprise Edition also significantly reduces the impact of full garbage collections.
What Neo4j version are you using?
Please share your current config (conf/* and data/graph.db/messages.log). You can use the personal edition of Neo4j Enterprise.
What kinds of use cases do you want to run?
Counting all nodes is probably not your main operation (there are ways in the Java API that make it faster).
For efficient multi-core usage, run multiple clients, or write Java code that utilizes more cores during traversal with thread pools, as in the sketch below.
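A rough sketch of that thread-pool approach with the embedded Java API (Neo4j 2.x-era calls; the store path, pool size, and id range are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.NotFoundException;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

final GraphDatabaseService db = new GraphDatabaseFactory()
        .newEmbeddedDatabase("/mnt/ramdisk/graph.db");   // hypothetical path

long highestNodeId = 1_100_000_000L;   // illustrative upper bound on node ids
int workers = 32;                      // illustrative pool size
long slice = highestNodeId / workers + 1;
ExecutorService pool = Executors.newFixedThreadPool(workers);
List<Future<Long>> results = new ArrayList<>();

for (int w = 0; w < workers; w++) {
    final long start = w * slice;
    final long end = Math.min(start + slice, highestNodeId);
    Callable<Long> task = () -> {
        long visited = 0;
        try (Transaction tx = db.beginTx()) {            // each worker uses its own transaction
            for (long id = start; id < end; id++) {
                try {
                    Node node = db.getNodeById(id);
                    visited++;                           // collect per-node information here
                } catch (NotFoundException ignored) {
                    // id not in use; skip
                }
            }
            tx.success();
        }
        return visited;
    };
    results.add(pool.submit(task));
}

long total = 0;
for (Future<Long> f : results) {
    try { total += f.get(); } catch (Exception e) { throw new RuntimeException(e); }
}
pool.shutdown();
System.out.println("nodes visited: " + total);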

Cypher queries work on the local machine but do not work on the server; they also work in embedded mode but not in REST mode

The Cypher queries that work fine on the local machine (Windows) do not work on the Linux instance. The queries also run great in embedded mode on the server and locally, but the same query does not work using the REST mode (0 rows returned). The database sizes on the local machine and the server are hugely different, so are there any parameters we need to change to accommodate this difference in DB size?
I get a
com.sun.jersey.api.client.ClientHandlerException:
java.net.SocketTimeoutException: Read timed out
Example queries are simple ones like: match n where n:LABEL_BRANDS return n
The properties in the neo4j.properties file are:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
The Neo4j version I use is 2.0.0-RC1.
I also very frequently get a "Disconnected from Neo4j. Please check if the cord is unplugged." error when opening the browser interface.
Could there be a mistake in how some properties are set in the config files? Could you identify the mistake here? Thanks.
Upgrade to Neo4j 2.0
How big is the machine you run your Neo4j server on?
Try to configure a sensible amount of heap (8-16GB) and the rest of your RAM as memory-mapping, sized according to the store-file sizes on disk (see the illustrative config sketch below).
The query you've shown is a global scan, which will return a lot of data over the wire on a large database. What are the actual graph queries/use-cases you want to run?
The error messages from the browser also indicate that either your network setup is flaky or your server has issues. Please upload your messages.log, as Stefan indicated, to an accessible place and add the link to your question.
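For illustration only, on a machine with, say, 16GB of RAM and Neo4j 2.0-era config files, the heap/memory-mapping suggestion above might translate into something like the following; the exact values depend on your store-file sizes, and these numbers are assumptions, not recommendations:

# conf/neo4j-wrapper.conf -- JVM heap (values in MB)
wrapper.java.initmemory=8192
wrapper.java.maxmemory=8192

# conf/neo4j.properties -- memory-mapping sized to the store files on disk
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=100M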

Setting the fetch size of a Gremlin traversal

Is there a way to set the fetch size of a Gremlin traversal? I have a very complicated traversal that I am doing in Gremlin. The traversal is expected to return a large number of nodes, and the iteration fetches these nodes in batches, so a long time is spent on the network. Is there a way to provide a fetch size to Gremlin so that this time can be minimized?
You don't have an option in Gremlin itself to do that. Control over such a thing is handled by whatever is executing your Gremlin. For example, if you are using Titan with Cassandra you could change this setting:
storage.cassandra.thrift.frame_size_mb
which controls the maximum frame size used by Thrift for transport. You can increase this value when retrieving very large result sets. You can read more about other such settings here and in the other implementation-specific configuration wiki pages.
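As a hedged illustration (assuming Titan with the Cassandra Thrift backend; the host and frame size are arbitrary), the setting can be supplied in the configuration used to open the graph:

import org.apache.commons.configuration.BaseConfiguration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

BaseConfiguration conf = new BaseConfiguration();
conf.setProperty("storage.backend", "cassandrathrift");           // Titan on Cassandra via Thrift
conf.setProperty("storage.hostname", "127.0.0.1");                // hypothetical Cassandra host
conf.setProperty("storage.cassandra.thrift.frame_size_mb", 128);  // illustrative larger frame size
TitanGraph graph = TitanFactory.open(conf);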
Another example would be issuing your Gremlin to Rexster. In this case, you have fewer options that work out of the box. Rexster generally works best for fast request/response. If you have traversals that you know ahead of time will produce large result sets, it might be better to write your own Rexster Extension. A good example to look at is how the FaunusRexsterInputFormatExtension works. This extension provides a way to stream back specified portions of the entire graph. These are very long-running operations over HTTP (on a large graph, of course). You might find that to be a good model.
