DSE Graph - How to see the underlying Cassandra queries from Gremlin queries? - datastax-enterprise

If I execute a gremlin query in the gremlin-console, is there a way to see the Cassandra queries DSE Graph generates?

DSE Graph extends the results of TinkerPop's profile() step with an additional set of attributes - here is an example of the output:
gremlin> g.V().has('recipe','name','spaghetti').profile()
==>Traversal Metrics
Step                                                   Count  Traversers  Time (ms)  % Dur
===========================================================================================
DsegGraphStep([~label.=(recipe), name.=(spaghet...         1           1     97.087  81.00
  query-optimizer                                                            22.802
    \_condition=(((label = recipe) & (true)) & name = spaghetti)
  query-setup                                                                 1.134
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  index-query                                                                19.838
    \_indexType=Secondary
    \_usesCache=false
    \_statement=SELECT "community_id", "member_id" FROM "junk"."recipe_p" WHERE "name" = ? LIMIT ?; with params (java.lang.String) spaghetti, (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Optional.empty, pagingState=null, pageSize=-1, user=Optional.empty, waitForSchemaAgreement=true, async=true}
DsegPropertyLoadStep                                       1           1     22.772  19.00
                                               >TOTAL     -           -    119.860      -
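The \_statement attribute is the CQL that DSE Graph issued against Cassandra. Substituting the bound parameters listed after "with params", you can replay it directly in cqlsh to inspect it (a sketch based on the output above):
SELECT "community_id", "member_id" FROM "junk"."recipe_p" WHERE "name" = 'spaghetti' LIMIT 50000;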

Stephen is correct: this feature was added in 5.1.2, and the corresponding JIRA (DSP-13293) is listed in the 5.1.2 release notes. What version are you using?

Related

Best practices for improving Full Text Search in Neo4j, Apollo Server and Graphql

I am currently trying to build a search application on top of the SNOMED CT International Full RF2 Release. The database is quite large, so I decided to use a full-text search index for optimal results. There are primarily 3 types of nodes in a SNOMED CT database:
ObjectConcept
Descriptions
Role Group
There are multiple relationships between the nodes, but I'm focusing on those later.
Currently I'm focusing on searching for ObjectConcept nodes by a String property called FSN, which stands for "fully specified name". For this I tried two things:
Create text indexes on FSN: after using this with MATCH queries, the results were rather slow, even when using the CONTAINS predicate and limiting the return value to 15.
Create full-text indexes: according to the docs, the FT indexes are powered by Apache Lucene. I created one for FSN. After using the FT index with AND clauses in the search term, for example:
Search term: head pain
Query term: head AND pain
I observed quite impressive improvements in query time using the profiler in the Neo4j browser (from around 43 ms down to 10 ms for some queries); however, once I start querying the DB through the Apollo server, query times go as high as 2-3 s, sometimes up to 8-10 s.
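For context, creating and querying such a full-text index looks roughly like this (a minimal sketch; the index name searchIndex mirrors my resolver below, and the procedure syntax assumes Neo4j 3.5/4.x):
// One-time setup: full-text index over ObjectConcept.FSN
CALL db.index.fulltext.createNodeIndex('searchIndex', ['ObjectConcept'], ['FSN']);
// Query it with a Lucene-style compound term
CALL db.index.fulltext.queryNodes('searchIndex', 'head AND pain') YIELD node, score
RETURN node.FSN AS fsn, score
ORDER BY score DESC
LIMIT 15;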
The query is as follows, implemented by a custom resolver in neo4j/graphql and apollo-server:
const session = context.driver.session();
// Build a Lucene query string: "head pain" -> "head AND pain"
let words = args.name.split(" ");
let compoundQuery = words.length === 1 ? words[0] : words.join(" AND ");
console.log(compoundQuery);
// Restrict results to the requested node type as well
compoundQuery += ` AND (${args.type})`;
return session
  .run(
    `CALL db.index.fulltext.queryNodes('searchIndex', $name) YIELD node, score
     RETURN node
     LIMIT 10`,
    { name: compoundQuery }
  )
  .then((res) => {
    session.close();
    return res.records.map((record) => record.get('node').properties);
  });
I have the following questions:
Am I utilising FT indexes as much as I can, or am I missing important optimisations?
I was considering Elasticsearch with Neo4j, but I read that Elasticsearch and the FT indexes are both powered by Lucene, so am I likely to gain improvements from using Elasticsearch? If so, how should I go about it, considering that I am using Neo4j Aura DB and my GraphQL server is on EC2? I am confused as to how to use Elasticsearch overall with the GRANDStack. Any help would be appreciated.
Any other Suggestions for optimising search would be greatly appreciated.
Thanks in advance!

Influxdb querying values from 2 measurements and using SUM() for the total value

select SUM(value)
from /measurment1|measurment2/
where time > now() - 60m and host = 'hostname' limit 2;
Name: measurment1
time sum
---- ---
1505749307008583382 4680247
name: measurment2
time sum
---- ---
1505749307008583382 3004489
But is it possible to get the value of SUM(measurment1+measurment2), so that I see only one output row?
This is not possible in the Influx query language; it does not support functions across measurements.
If this is something you require, you may be interested in layering another API on top of Influx that can do this, like Graphite via Influxgraph.
For the above, the configuration would look something like this.
/etc/graphite-api.yaml:
finders:
  - influxgraph.InfluxDBFinder
influxdb:
  db: <your database>
  templates:
    # Produces metric paths like 'measurement1.hostname.value'
    - measurement.host.field*
Start the graphite-api/influxgraph webapp.
A query /render?from=-60min&target=sum(*.hostname.value) then produces the sum of value on tag host='hostname' for all measurements.
{measurement1,measurement2}.hostname.value can be used instead to limit it to specific measurements.
NB: performance-wise (for Influx), it is best to have multiple values in the same measurement rather than the same value field name across multiple measurements, as sketched below.
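With that layout the sum stays inside one measurement; a minimal sketch (the measurement name requests and the field names value1/value2 are hypothetical, and field arithmetic needs InfluxDB 0.12 or later):
SELECT SUM(value1) + SUM(value2) AS total
FROM requests
WHERE time > now() - 60m AND host = 'hostname'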

Simple Neo4j query is very slow on large database

I have a Neo4J database with the following properties:
Array Store          8.00 KiB
Logical Log          16 B
Node Store           174.54 MiB
Property Store       477.08 MiB
Relationship Store   3.99 GiB
String Store         174.34 MiB
Total Store Size     5.41 GiB
There are 12M nodes and 125M relationships.
So you could say this is a pretty large database.
My OS is Windows 10 64-bit, running on an Intel i7-4500U CPU @ 1.80GHz with 8GB of RAM.
This isn't a complete powerhouse, but it's a decent machine, and in theory the total store could even fit in RAM.
However when I run a very simple query (using the Neo4j Browser)
MATCH (n {title:"A clockwork orange"}) RETURN n;
I get a result:
Returned 1 row in 17445 ms.
I also sent a POST request with the same query to http://localhost:7474/db/data/cypher; this took 19 seconds.
A direct node lookup like this:
http://localhost:7474/db/data/node/15000
is, however, executed in 23 ms...
And I can confirm there is an index on title:
Indexes
ON :Page(title) ONLINE
So anyone have ideas on why this might be running so slow?
Thanks!
This has to scan all nodes in the db - if you re-run your query using n:Page instead of just n, it'll use the index on those nodes and you'll get better results.
To expand this a bit more - INDEX ON :Page(title) is only for nodes with a :Page label, and in order to take advantage of that index your MATCH() needs to specify that label in its search.
If a MATCH() is specified without a label, the query engine has no "clue" what you're looking for, so it has to do a full db scan to find all the nodes with a title property and check their values.
That's why
MATCH (n {title:"A clockwork orange"}) RETURN n;
is taking so long - it has to scan the entire db.
If you tell the MATCH() you're looking for a node with a :Page label and a title property -
MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
the query engine knows you're looking for nodes with that label, it also knows that there's an index on that label it can use - which means it can perform your search with the performance you're looking for.
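You can confirm which plan the engine chooses by prefixing each query with PROFILE; a quick sketch (the operator names are typical Cypher planner output):
PROFILE MATCH (n {title:"A clockwork orange"}) RETURN n;
// plan starts with AllNodesScan - every node is inspected
PROFILE MATCH (n:Page {title:"A clockwork orange"}) RETURN n;
// plan starts with NodeIndexSeek on :Page(title) - the index is used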

How to get CPU usage percentage measured by Collectd in InfluxDB

I'm collecting cpu usage measured in jiffies by Collectd 5.4.0 and then storing the results in InfluxDB 0.9.4. I use the following query to get cpu percentage from InfluxDB:
SELECT MEAN(value) FROM cpu_value WHERE time >= '' and time <= '' GROUP BY type,type_instance
But when I plot the result it makes no sense; there is no pattern in the CPU usage. Please let me know if I am doing something wrong.
Thanks
Since Collectd 5.5 you can get values in percentage instead of jiffies:
<Plugin cpu>
  ReportByState = true
  ReportByCpu = true
  ValuesPercentage = true
</Plugin>
Then you can write a query like:
SELECT mean("value") FROM "cpu_value" WHERE
"type_instance" =~ /user|system|nice|irq/
AND "type" = 'percent' AND $timeFilter
GROUP BY time($interval), "host"
If you can upgrade, that is probably the easiest option. Otherwise you can:
precompute the percentage on the client
use another client for reporting stats (such as statsd, telegraf, etc.)
With InfluxDB 0.12 you can perform arithmetic operations between fields like:
SELECT MEAN(usage_system) + MEAN(usage_user) AS cpu_total
FROM cpu
However, to use this you would have to report user|system|nice|irq from collectd as FIELDS, not TAGS.
This is my query; I use it with the percent unit (on the Axes tab), but stack+percent (on the Display tab) makes sense as well:
SELECT non_negative_derivative(mean("value"), 1s) FROM "cpu_value" WHERE "type_instance" =~ /(idle|system|user|wait)/ AND $timeFilter GROUP BY time($interval), "type_instance" fill(null)
The non_negative_derivative(..., 1s) can be replaced with derivative(..., 1s), but I saw some negative values with it when data points were missing.

How to measure neo4j database size for community edition?

I understand that the Neo4j Community edition does not provide a dump. What is the best way to measure the size of the underlying Neo4j database? Will doing du provide a good estimate?
Thank you.
What do you mean by:
the Neo4j Community edition does not provide a dump
The Enterprise edition doesn't provide anything like this either. Are you looking for statistics on the size of the DB in terms of raw disk or a count of Nodes/Relationships?
If Disk: use du -sh on Linux, or check the folder size on Windows.
If Node/Relationship: you'll have to write Java code to evaluate the true size, as the count in the Web Console is not always accurate. You could also do a rough count by taking the pure size on disk and dividing by 9 for the node store and 33 for the relationship store.
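For example, under that record-size heuristic a node store of roughly 108 MiB would work out to about 113,246,208 / 9 ≈ 12.6 million node records.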
Java code would look like this:
// Assumes a Neo4j 2.x embedded GraphDatabaseService `db` and an SLF4J `logger`.
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;

long relationshipCounter = 0;
long nodeCounter = 0;
// All read operations must run inside a transaction
try (Transaction tx = db.beginTx()) {
    GlobalGraphOperations ggo = GlobalGraphOperations.at(db);
    for (Node n : ggo.getAllNodes()) {
        nodeCounter++;
        try {
            // Count outgoing relationships only, so each relationship is counted once
            for (Relationship relationship : n.getRelationships(Direction.OUTGOING)) {
                relationshipCounter++;
            }
        } catch (Exception e) {
            logger.error("Error with node: {}", n, e);
        }
    }
    tx.success();
}
System.out.println("Number of Relationships: " + relationshipCounter);
System.out.println("Number of Nodes: " + nodeCounter);
The reason the Web Console isn't always accurate is that it reads the highest ID from a store file, and Neo4j uses a delete marker for nodes, so a range of "deleted" nodes can inflate the reported total. Eventually Neo4j will compact the store and remove these records, but it doesn't do so in real time.
The file size can mislead for the same reason. The only reliable way is to walk all nodes and relationships and check for the "isActive" marker, as the code above does.
You can use Cypher to get the number of nodes:
MATCH ()
RETURN COUNT(*) AS node_count;
and the number of relationships:
MATCH ()-->()
RETURN COUNT(*) AS rel_count;
Some ways to view the database size in terms of bytes:
du -hc $NEO4J_HOME/data/databases/graph.db/*store.db*
From the dashboard:
http://localhost:7474/webadmin/
And view the 'database disk usage' indicator.
'Server Info' from the dashboard:
http://localhost:7474/webadmin/#/info/org.neo4j/Store%20file%20sizes/
And view the 'TotalStoreSize' row.
Finally, the database drawer; scroll all the way down and view the 'Database' section.
