Cypher function call changes return value depending on number of results - neo4j

I'm using a query to run a Lucene search over a property in a Neo4j database. As I want to query by several different strings, I've come up with the following query:
CALL db.index.fulltext.queryNodes('descs', 'abc') YIELD node
with collect(node) as matches1
CALL db.index.fulltext.queryNodes('descs', 'def') YIELD node
with matches1, collect(node) as matches2
RETURN apoc.coll.intersection(matches1,matches2) AS res
The query works fine sometimes but behaves strangely when any of the calls returns too many results (I still don't know the actual limit). More precisely, if any of the calls returns a big list, the query returns just "[]". It seems to work perfectly if each call returns a list with few nodes (or no nodes, which yields no results because the intersection with an empty list is empty). Is there any configurable (or non-configurable) limit on the apoc.coll.intersection function or on any of the other calls? Things get worse when using the '~' operator, as it usually returns more results. This means that similar queries will or will not work as expected depending on how many matches the queryNodes call returns.
Also, as there could be any number of words to search for, is there a way to generalize this kind of query? queryNodes does not seem to work well with spaces inside the search text. Regex could be an option, but it doesn't really work well when dealing with accents and/or searching for multiple words in any given order.
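One way to picture the generalization asked about here is to run one full-text call per word and intersect the per-word results on the client. The following is only a minimal Python sketch, assuming the matches of each call have already been fetched as lists of node ids; the `intersect_all` helper and the sample ids are hypothetical and are not part of APOC or the Neo4j driver.

```python
from functools import reduce

def intersect_all(result_lists):
    """Return the ids present in every list, in the order of the first list."""
    if not result_lists:
        return []
    # Fold a set intersection over all remaining lists.
    common = reduce(lambda acc, ids: acc & set(ids),
                    result_lists[1:], set(result_lists[0]))
    return [node_id for node_id in result_lists[0] if node_id in common]

# Hypothetical per-word results (node ids) for three search words.
matches = [[1, 2, 3, 4], [2, 4, 6], [4, 2, 9]]
print(intersect_all(matches))  # [2, 4]
```

Because the fold works on any number of lists, the same call covers two words or twenty; an empty result for any single word makes the whole intersection empty, just as in the Cypher version.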

Related

Get start and end nodes of specific path in a large graph

I have a large graph (1,068,029 nodes and 2,602,897 relationships). I work with it via the Python API and send requests to the graph during my program flow.
I have the following queries -
First query
MATCH
(start_node)--(o:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
Second query
MATCH
(start_node)--(o1:observed_data)--(h:MD5)--(o2:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
When I try to perform the first query with an id_list of size 75,000, it passes OK and returns the wanted output, but when I try to perform the second query, the graph gets stuck, even when I decrease the id_list to 20,000.
The full id_list is even larger than 75,000, but I split it into chunks in order to make the graph's response time faster. However, if I split it into too many chunks, I will increase the number of requests to the graph and increase the program's run time.
My question is: is there a library function of some sort (APOC or something like that) that performs the same action in less time? Or maybe you have another solution that solves this problem without decreasing the id_list below 50,000?
The (start_node) in your MATCH patterns should specify a label (like (start_node:Foo)), to avoid having to scan every node in the DB. Also, you should create an index (or uniqueness constraint) for that start node.
You should make all the relationships in your MATCH patterns directional, if appropriate. That is, put an arrow on either end.
You should specify the relationship types in your patterns as well (like ()-[:BAR]->()), so that the query would not be forced to evaluate all relationship types.
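The chunking described in the question, combined with the advice to label the start node, could be sketched in Python roughly as follows. This is only an illustration under assumptions: the `Foo` label is the placeholder used in the answer, the `chunked` and `build_requests` helpers are hypothetical, and the id list is passed as a `$ids` query parameter rather than pasted into the query text.

```python
def chunked(ids, size):
    """Yield consecutive slices of ids holding at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# `Foo` is a placeholder label; replace it with the real label of the start nodes.
# Relationship types and directions are omitted because the real schema is unknown.
QUERY = (
    "MATCH (start_node:Foo)--(o:observed_data)--(i:indicator)"
    "--(m:malware)--(end_node:attack_pattern) "
    "WHERE start_node.id IN $ids "
    "RETURN start_node.id, end_node.name"
)

def build_requests(id_list, chunk_size):
    """Pair the query text with one parameter dict per chunk."""
    return [(QUERY, {"ids": chunk}) for chunk in chunked(id_list, chunk_size)]

requests = build_requests(list(range(10)), 4)
print(len(requests))            # 3
print(requests[-1][1]["ids"])   # [8, 9]
```

Each `(query, parameters)` pair can then be handed to whatever driver call the program already uses. Keeping the query text constant and varying only the parameter means the chunk size can be tuned without touching the query.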

What is the difference between the filter and search query parameters in Microsoft Graph Mail API?

While I was looking at the documentation for query parameters here, I noticed that there were two query parameters that seemingly did the exact same thing: filter and search.
I'm just wondering what the difference is between them and when is one used over the other.
While they're similar, they operate a little differently.
$search uses Keyword Query Language (KQL) and is only supported by message and person collections (i.e. you can't use $search on most endpoints). By default, it searches multiple properties. Most importantly, $search is a "contains" search, meaning it will look for your search word/phrase anywhere within a string.
For example, /messages?$search="bacon" will search for the word "bacon" anywhere in the from, subject, or body properties.
Unlike $search, the $filter parameter only searches the specified property and does not support "contains" search. It also works with just about every endpoint. In most places, it supports the following operators: equals (eq), not equals (ne), greater than (gt), greater than or equals (ge), less than (lt), less than or equals (le), and (and), or (or), not (not), and (on some endpoints) starts with (startsWith).
For example, /messages?$filter=subject eq 'bacon' will return only messages where the subject is "bacon".
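To make the two request shapes concrete, here is a small Python sketch that only builds the request URLs (it sends nothing). It assumes the `/me/messages` endpoint of Microsoft Graph v1.0; the helper names `search_url` and `filter_url` are made up for illustration.

```python
from urllib.parse import quote

# Assumed base endpoint for the signed-in user's messages.
BASE = "https://graph.microsoft.com/v1.0/me/messages"

def search_url(term):
    """Build a $search request; the term is wrapped in double quotes."""
    return BASE + "?$search=" + quote('"' + term + '"')

def filter_url(prop, value):
    """Build a $filter request testing one property for equality."""
    return BASE + "?$filter=" + quote(prop + " eq '" + value + "'")

print(search_url("bacon"))
print(filter_url("subject", "bacon"))
```

The first URL asks for messages containing "bacon" in any of the searched properties, while the second matches only messages whose subject is exactly "bacon".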
Both search and filter reduce the result set that you ultimately receive; however, they operate in different ways.
Search operates on the query against the entire graph and reduces the amount of information a search query returns. It is often optimized for the kinds of queries that search is good at, e.g. searches for items that can be indexed.
Filter operates on the much smaller result set returned by the search to provide more fine-grained filtering. Separating this out allows filtering to perform tasks that would not be performant against the full collection.
This is indicated in Microsoft's documentation:
Search: Returns results based on search criteria.
Filter: Filters results (rows). (results that could be returned by search)
For performance purposes, it's good to use both if you can: search to narrow the results (e.g. using search indexes), and then filter for fine-grained filtering of the returned results.

Neo4j- incorrect count in multiple match query

When I am trying to execute this query
match(u:User)-[ro:OWNS]->(p:PushDevice) where p.type='gcm'
match(com:Comment)
return count(com) as total_comments,count(ro) as device
it returns the same number in both total_comments and device, namely the total number of comments.
I feel like your query should work, though I'm more confident that this will work:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice) WHERE p.type='gcm'
WITH count(ro) AS device
MATCH (com:Comment)
RETURN count(com) as total_comments, device
Your query is generating a row for every combination of your MATCH results. If you just returned the ro and com values, this would be more clear. See this console for an example. That console has 2 comments and a single OWNS relationship, but the result shows 2 rows (both rows have the same OWNS relationship). So, your query is essentially counting the number of rows -- not what you expected.
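The row multiplication described above can be imitated in plain Python. This is only a sketch of the semantics, not of how Neo4j executes the query; it uses the same sizes as the console example (one OWNS relationship, two comments), and the names `ro1`, `c1`, `c2` are invented.

```python
from itertools import product

owns = ["ro1"]           # one matching OWNS relationship
comments = ["c1", "c2"]  # two Comment nodes

# Two unconnected MATCH clauses yield every combination of their results.
rows = list(product(owns, comments))

# count(ro) and count(com) both count rows, so they agree.
count_ro = sum(1 for ro, com in rows)
count_com = sum(1 for ro, com in rows)
print(count_ro, count_com)        # 2 2

# Counting distinct values recovers the intended numbers.
distinct_ro = len({ro for ro, _ in rows})
distinct_com = len({com for _, com in rows})
print(distinct_ro, distinct_com)  # 1 2
```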
Here is an example of a query that would work as you expected:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice {type:'gcm'})
WITH COUNT(ro) AS device
MATCH (com:Comment)
RETURN count(com) AS total_comments, device;
[EDITED]
This would also work logically, but is less performant (as it takes a cartesian product and then filters out duplicates):
MATCH (u:User)-[ro:OWNS]->(p:PushDevice { type: 'gcm' })
MATCH (com:Comment)
RETURN COUNT(DISTINCT com), COUNT(DISTINCT ro);
Observation
The power of neo4j comes from its efficient handling of relationships. So, the most efficient queries tend to be for connected subgraphs (where all nodes are connected by relationships).
Since your query is not for a single connected subgraph, getting the answer you want is naturally going to be a bit more convoluted and can be inefficient.
If you determine that the suggested queries are too slow, you can try making 2 separate queries instead. That may also make your code easier to understand.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts to count. I would rather go through every node, check the pattern, and update the counter if necessary. That way such queries would not require a lot of memory. So how is such a query actually executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many other relationships between nodes other than LIKES relationship, this can be quite an extensive search.
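The walk described above can be mimicked in Python to show that the count itself needs only a running counter. This is a sketch of the logic, not of Neo4j's execution plan; it assumes every node is a User and represents the graph as hypothetical `(source, type, target)` tuples.

```python
def count_unreciprocated_likes(edges):
    """Count LIKES edges n->k for which no edge of any type goes k->n."""
    # Index the targets reachable from each node by any relationship type.
    outgoing = {}
    for src, _rel_type, dst in edges:
        outgoing.setdefault(src, set()).add(dst)

    count = 0
    for src, rel_type, dst in edges:
        # WHERE NOT (k)-->(n): keep the match only if k has no edge to n.
        if rel_type == "LIKES" and src not in outgoing.get(dst, ()):
            count += 1
    return count

edges = [
    ("a", "LIKES", "b"),
    ("b", "LIKES", "a"),    # reciprocates a->b
    ("a", "LIKES", "c"),
    ("c", "FOLLOWS", "a"),  # any type counts as a link back
    ("d", "LIKES", "a"),    # nothing points from a to d
]
print(count_unreciprocated_likes(edges))  # 1
```

The more outgoing relationships each `k` has, the more work the back-link check does, which matches the remark about densely connected graphs.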
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.

No START clause VS. n = node(*)

I've read in the Neo4j 2.0 docs that the START clause is optional and
Cypher will try and infer start points from your query
I have experimentally found that
START user = node(*)
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
gives the same results as
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
for small data sets.
My question is: are they semantically the same? Will they always return the same result set (I'm not talking about the order)? Even for large data sets? Does skipping START guarantee traversing all nodes? If they are semantically equal, why would one ever use node(*)?
Your queries are not semantically the same, but they will always return the same result. The reason is that in your first query, having stated the 'universal node pattern' node(*), you immediately limit it with a further pattern in your MATCH clause. In your second query you state this narrower pattern from the start. Since the two MATCH clauses are equivalent and are the narrowest pattern declared in each query (and since the RETURN clauses are the same), the two queries return the same results.
The START clause used to be the way to state the initial pattern for a query, and it was tied up with indexing. Using node(*) or relationship(*) was rarely recommended or useful, but the clause was used for index retrievals, as in START user=node:userIndex(name="Maciej Ziarko"). With 2.0, labels and label indexing were introduced, and this is now the preferred way to bind nodes in a query.
Skipping START will not guarantee traversing all nodes (or perhaps more accurately: binding all nodes), but neither do you need a START clause to do so. Using MATCH user (without limiting what is bound to user with labels or relationships) you can still bind every node in your database. It is still rarely recommended or useful.
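The argument that the narrower MATCH pattern decides the result, whatever the starting set, can be checked on a toy graph in Python. The node ids, labels, and the `match_pattern` helper below are all hypothetical; the sketch models only which nodes get bound, not how Neo4j plans the query.

```python
# Toy graph: node id -> labels, plus undirected KNOWS pairs.
nodes = {1: {"User"}, 2: {"User"}, 3: {"Group"}}
knows = [(1, 2), (2, 3)]

def match_pattern(candidates):
    """Bind (user:User)-[:KNOWS]-(user2:User), drawing user from candidates."""
    rows = []
    for a, b in knows:
        for user, user2 in ((a, b), (b, a)):  # the pattern is undirected
            if (user in candidates
                    and "User" in nodes[user]
                    and "User" in nodes[user2]):
                rows.append((user, user2))
    return sorted(rows)

# START user = node(*): every node is a candidate before MATCH narrows it.
with_start = match_pattern(set(nodes))

# No START: candidates are found through the label alone.
by_label = {n for n, labels in nodes.items() if "User" in labels}
without_start = match_pattern(by_label)

print(with_start == without_start)  # True
print(with_start)                   # [(1, 2), (2, 1)]
```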

Resources