Specifically, does the Flatten PTransform in Beam perform any sort of:
Deduplication
Filtering
Purging of existing elements
Or does it just "merge" two different PCollections?
The Flatten transform does not do any sort of deduplication, or filtering of any kind. As mentioned, it simply merges the multiple PCollections into one that contains the elements of each of the inputs.
This means that:
import apache_beam as beam

with beam.Pipeline() as p:
    c1 = p | "Branch1" >> beam.Create([1, 2, 3, 4])
    c2 = p | "Branch2" >> beam.Create([4, 4, 5, 6])
    result = (c1, c2) | beam.Flatten()
In this case, the result PCollection contains the following elements (in no guaranteed order, since a PCollection is unordered): [1, 2, 3, 4, 4, 4, 5, 6].
Note how the element 4 appears once in c1, and twice in c2. This is not deduplicated, filtered or removed in any way.
As a curious fact about Flatten, some runners optimize it away and simply apply the downstream transform to each input branch. In short: no filtering and no deduplication, just a merge of PCollections.
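To make the "just a merge" semantics concrete, here is a small plain-Python analogy. It does not use Beam at all: the `flatten` helper below is a made-up stand-in that models the inputs as lists and the result as their multiset union.

```python
from collections import Counter
from itertools import chain


def flatten(*collections):
    """Hypothetical stand-in for Flatten: concatenate inputs, keep every element."""
    return list(chain.from_iterable(collections))


c1 = [1, 2, 3, 4]
c2 = [4, 4, 5, 6]
merged = flatten(c1, c2)

# Per-element counts simply add up across the inputs; nothing is collapsed.
counts = Counter(merged)
print(sorted(merged))  # [1, 2, 3, 4, 4, 4, 5, 6]
print(counts[4])       # 3
```

If you do want duplicates removed, that has to be a separate, explicit step after the merge (in the Python SDK, `beam.Distinct()` is one option).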
After having created a collection of nodes, some of the nodes should also have a relation attached based on a condition. In the example below the condition is simulated with WHERE n.number > 3 and the nodes are simple numbers:
WITH [2, 3, 4] as numbers
UNWIND numbers AS num
CREATE(n:Number {number: num})
WITH collect(n) AS nodes
UNWIND nodes AS n
WITH nodes, n WHERE n.number > 3
CREATE (n)-[:IM_SPECIAL]->(n)
RETURN nodes
Which returns:
╒════════════════════════════════════════╕
│"nodes" │
╞════════════════════════════════════════╡
│[{"number":2},{"number":3},{"number":4}]│
└────────────────────────────────────────┘
Added 3 labels, created 3 nodes, set 3 properties, created 1 relationship, started streaming 1 records in less than 1 ms and completed after 1 ms.
My problem is that nothing is returned unless I have at least one of these "special" nodes that is caught by the filter. The problem can be simulated by changing the input numbers to [1, 2, 3] which returns an empty result (no nodes) even though the nodes are created (as they should):
<empty result>
Added 3 labels, created 3 nodes, set 3 properties, completed after 2 ms.
I might be approaching the problem totally wrong but I've exhausted my Google skills... what Neo4J Cypher magic am I missing?
The documentation about Conditional Cypher Execution - Using correlated subqueries in 4.1+ describes how to solve this without the need for APOC:
WITH [2, 3, 4] AS numbers
UNWIND numbers AS num
CREATE(n:Number {number: num})
WITH n
CALL {
WITH n
WITH n WHERE n.number > 3
CREATE (n)-[:IM_SPECIAL]->(n)
RETURN count(n)
}
RETURN collect(n) AS nodes
Thanks to Sanjay Singh and Jose Bacoy for putting me on the right track.
WITH nodes, n WHERE n.number > 3
Each clause of a Cypher query must yield at least one row for the subsequent clauses to have anything to consume. The line above yields no rows if you start with [1,2,3], so nothing reaches the final RETURN.
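A rough way to picture this is to treat each clause as a function from a list of rows to a list of rows. The sketch below is only a plain-Python model of the original query, with nodes reduced to bare numbers and made-up helper names (`unwind`, `where`, `run`); it is not how Neo4j is implemented.

```python
def unwind(rows, list_key, as_key):
    """One output row per list element; no input rows means no output rows."""
    return [{**row, as_key: item} for row in rows for item in row[list_key]]


def where(rows, predicate):
    """Keep only the rows that pass the predicate."""
    return [row for row in rows if predicate(row)]


def run(numbers):
    rows = [{"nodes": numbers}]               # WITH collect(n) AS nodes
    rows = unwind(rows, "nodes", "n")         # UNWIND nodes AS n
    rows = where(rows, lambda r: r["n"] > 3)  # WITH nodes, n WHERE n.number > 3
    # CREATE would run once per remaining row; it does not change the row count.
    return [row["nodes"] for row in rows]     # RETURN nodes


print(run([2, 3, 4]))  # [[2, 3, 4]]
print(run([1, 2, 3]))  # []
```

With [1, 2, 3] the filter leaves zero rows, so RETURN has nothing to emit even though the nodes were created earlier in the query.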
For your purpose, this will work.
WITH [1,2,3,4] as numbers
UNWIND numbers AS num
CREATE(n:Number {number: num})
WITH n
CALL apoc.do.when(n.number>3,
'CREATE (n)-[:IM_SPECIAL]->(n) RETURN n',
'RETURN n',
{n:n}
)
YIELD value as m
WITH collect(m) AS nodes
RETURN nodes
I have an application where nodes and relations are shown. After a result is shown, nodes and relations can be added through the GUI. When the user is done, I would like to fetch all the data from the database again (because at that point I don't have all of it in the front end), based on the Neo4j IDs of all nodes and links. The difficult part for me is that there are "floating" nodes that have no relation in the GUI result (they do have relations in the database, but I don't want those). Worth mentioning: each of my relations carries its start and end node ID. I was thinking of starting from there, but then I would miss these floating nodes.
Let's take a look at this poorly drawn example image:
As you can see:
node 1 is linked (no direction) to node 2.
node 2 is linked to node 3 (from 2 to 3)
node 3 is linked to node 4 (from 3 to 4)
node 3 is also linked to node 5 (no direction)
node 6 is a floating node, without relations
Let's assume that:
id(relation between 1 and 2) = 11
id(relation between 2 and 3) = 12
id(relation between 3 and 4) = 13
id(relation between 3 and 5) = 14
Keeping in mind that behind the real data, there are way more relations between all these nodes, how can I recreate this very image again via Neo4j? I have tried doing something like:
match path=(n)-[rels*]-(m)
where id(n) in [1, 2, 3, 4, 5]
and all(rel in rels where id(rel) in [11, 12, 13, 14])
and id(m) in [1, 2, 3, 4, 5]
return path
However, this doesn't work properly because of multiple reasons. Also, just matching on all the nodes doesn't get me the relations. Do I need to union multiple queries? Can this be done in 1 query? Do I need to write my own plugin?
I'm using Neo4j 3.3.5.
You don't need to keep a list of node IDs. Every relationship points to its 2 end nodes. Since you always want both end nodes, you get them for free using just the relationship ID list.
This query will return every single-relationship path from a relationship ID list. If you are using the neo4j Browser, its visualization should knit together these short paths and display your original full paths.
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
By the way, all neo4j relationships have a direction. You may choose not to specify the direction when you create one (using MERGE) and/or query for one, but it still has a direction. And the neo4j Browser visualization will always show the direction.
[UPDATED]
If you also want to include "floating" nodes that are not attached to a relationship in your relationship list, then you could just use a separate floating node ID list. For example:
MATCH p=()-[r]-()
WHERE ID(r) IN [11, 12, 13, 14]
RETURN p
UNION
MATCH p=(n)
WHERE ID(n) IN [6]
RETURN p
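The idea behind the UNION query can be sketched in plain Python. The dictionary below is a hypothetical in-memory stand-in for the stored graph: the stored directions for the "no direction" links and relationship 99 are assumptions added for illustration only.

```python
# Hypothetical data: relationship id -> (start node id, end node id).
RELATIONSHIPS = {
    11: (1, 2),
    12: (2, 3),
    13: (3, 4),
    14: (3, 5),
    99: (6, 1),  # assumed extra relation in the database, not part of the GUI result
}


def subgraph(rel_ids, floating_node_ids):
    """Return (nodes, edges) for the chosen relationships plus floating nodes."""
    edges = {rid: RELATIONSHIPS[rid] for rid in rel_ids}
    nodes = set(floating_node_ids)
    for start, end in edges.values():
        # Both end nodes come for free with each relationship.
        nodes.update((start, end))
    return nodes, edges


nodes, edges = subgraph([11, 12, 13, 14], [6])
print(sorted(nodes))  # [1, 2, 3, 4, 5, 6]
print(sorted(edges))  # [11, 12, 13, 14]
```

Node 6 shows up only because it is listed explicitly; its relation 99 is left out because that ID is not in the relationship list.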
I have a big neo4j db with info about celebs, all of them have relations with many others, they are linked, dated, married to each other. So I need to get random path from one celeb with defined count of relations (5). I don't care who will be in this chain, the only condition I have I shouldn't have repeated celebs in chain.
To be more clear: I need to get "new" chain after each query, for example:
I try to get a chain starting with Rita Ora.
She has relations with Drake, Jay Z and Justin Bieber.
The query takes a random one of these, for example Jay Z.
Then the query takes the relations of Jay Z: Karrine Steffans, Rosario Dawson and Rita Ora.
The query can't take Rita Ora because she is already in the chain, so it takes a random one of the other two, for example Rosario Dawson.
...
And at the end we should have a chain Rita Ora - Jay Z - Rosario Dawson - other celeb - other celeb 2
Is it possible to do this with a query?
This is doable in Cypher, but it's quite tricky. You mention that:
"the only condition I have I shouldn't have repeated celebs in chain."
This condition could be captured by using node-isomorphic pattern matching, which requires all nodes in a path to be unique. Unfortunately, this is not yet supported in Cypher. It is proposed as part of the openCypher project, but is still work-in-progress. Currently, Cypher only supports relationship uniqueness, which is not enough for this use case as there are multiple relationship types (e.g. A is married to B, but B also collaborated with A, so we already have a duplicate with only two nodes).
APOC solution. If you can use the APOC library, take a look at the path expander, which supports various uniqueness constraints, including NODE_GLOBAL.
Plain Cypher solution. To work around this limitation, you can capture the node uniqueness constraint with a filtering operation:
// A chain of 5 celebrities is connected by 4 relationships
MATCH p = (c1:Celebrity {name: 'Rita Ora'})-[*4]-(c2:Celebrity)
UNWIND nodes(p) AS node
WITH p, count(DISTINCT node) AS countNodes
WHERE countNodes = 5
RETURN p
LIMIT 1
Performance-wise this should be okay as long as you limit its results because the query engine will basically keep enumerating new paths until one of them passes the filtering test.
The goal of the UNWIND nodes(p) AS node WITH count(DISTINCT node) ... construct is to remove duplicates from the list of nodes by first UNWIND-ing it to separate rows, then aggregating them to a unique collection using DISTINCT. We then check whether the list of unique nodes still has 5 elements - if so, the original list was also unique and we RETURN the results.
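The same "enumerate, then keep only paths whose nodes are all distinct" idea can be sketched in plain Python. The graph below is hypothetical: the first two entries come from the question, everything else ("Celeb A", "Celeb B", the reverse links) is made up. Note that a chain of `size` nodes corresponds to `size - 1` hops.

```python
# Hypothetical adjacency list; only Rita Ora's and Jay Z's relations come from the question.
GRAPH = {
    "Rita Ora": ["Drake", "Jay Z", "Justin Bieber"],
    "Jay Z": ["Karrine Steffans", "Rosario Dawson", "Rita Ora"],
    "Drake": ["Rita Ora"],
    "Justin Bieber": ["Rita Ora"],
    "Karrine Steffans": ["Jay Z"],
    "Rosario Dawson": ["Jay Z", "Celeb A"],
    "Celeb A": ["Rosario Dawson", "Celeb B"],
    "Celeb B": ["Celeb A"],
}


def walks(start, hops):
    """Yield every walk with the given number of hops (nodes may repeat)."""
    if hops == 0:
        yield [start]
        return
    for neighbour in GRAPH[start]:
        for rest in walks(neighbour, hops - 1):
            yield [start] + rest


def first_chain(start, size):
    """Return the first walk with `size` nodes that are all distinct, or None."""
    for walk in walks(start, size - 1):
        if len(set(walk)) == size:  # plays the role of count(DISTINCT node) = size
            return walk
    return None


print(first_chain("Rita Ora", 5))
# ['Rita Ora', 'Jay Z', 'Rosario Dawson', 'Celeb A', 'Celeb B']
```

Because `walks` is a generator, enumeration stops as soon as one walk passes the test, which is the same reason the LIMIT 1 keeps the Cypher query affordable. Shuffling each neighbour list before iterating would be one way to get a different chain per call, though that is not something the query above does.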
Note. Instead of UNWIND and count(DISTINCT ...), getting unique elements from a list could be expressed in other ways:
(1) Using a list comprehension and ranges:
WITH [1, 2, 2, 3, 2] AS l
RETURN [i IN range(0, size(l)-1) WHERE NOT l[i] IN l[0..i] | l[i]]
(2) Using reduce:
WITH [1, 2, 2, 3, 2] AS l
RETURN reduce(acc = [], i IN l | acc + CASE NOT i IN acc WHEN true THEN [i] ELSE [] END)
However, I believe both forms are less readable than the original one.
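For readers more at home in Python, both forms translate almost literally. This is only an illustration of the logic, not something Cypher executes.

```python
from functools import reduce

l = [1, 2, 2, 3, 2]

# (1) Keep l[i] only if it does not already occur earlier in the list.
by_range = [l[i] for i in range(len(l)) if l[i] not in l[:i]]

# (2) Fold over the list, appending each item not yet in the accumulator.
by_reduce = reduce(lambda acc, i: acc + ([i] if i not in acc else []), l, [])

print(by_range)   # [1, 2, 3]
print(by_reduce)  # [1, 2, 3]
```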
We have a bunch of nodes with properties that are converted from BigDecimals to string during insert and vice versa during load.
This leads to typical problems during sorting. Values 1, 2, 3, 10 get sorted as 1, 10, 2, 3.
Does Cypher have any means of doing natural sorting on strings? Or do we have to convert these properties to doubles or something like that?
Guess the best way is to store them as integers in your db. Also, in the current milestone release, there's a toInt() function which you could use to sort.
START n=node(*)
// n must be carried through WITH, otherwise RETURN n fails
WITH n, toInt(n.stringValue) AS nbr
RETURN n
ORDER BY nbr
Can you add a primary sort on string length?
CREATE ({val:"3"}),({val:"6"}),({val:"9"}),({val:"12"}),({val:"15"}),({val:"18"}),({val:"21"})
MATCH (n) RETURN n.val ORDER BY n.val
// 12, 15, 18, 21, 3, 6, 9
MATCH (n) RETURN n.val ORDER BY length(n.val), n.val
// 3, 6, 9, 12, 15, 18, 21
http://console.neo4j.org/r/kb0obm
If you keep converting them back and forth it sounds like it would be better to store them as their proper types in the database.
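To see how the options compare, here is a small Python sketch using the same values as above plus a made-up `mixed` list. It suggests that the length-then-value trick only holds for non-negative integers without leading zeros, which matters if the strings really come from BigDecimals.

```python
from decimal import Decimal

values = ["3", "6", "9", "12", "15", "18", "21"]

# Plain string sort: lexicographic, so "12" comes before "3".
print(sorted(values))
# ['12', '15', '18', '21', '3', '6', '9']

# Primary sort on length, secondary on the string itself.
print(sorted(values, key=lambda s: (len(s), s)))
# ['3', '6', '9', '12', '15', '18', '21']

# The length trick breaks down for decimals and negative numbers ...
mixed = ["10", "2.5", "1", "-3"]
print(sorted(mixed, key=lambda s: (len(s), s)))
# ['1', '-3', '10', '2.5']

# ... whereas converting to a numeric type sorts them correctly.
print(sorted(mixed, key=Decimal))
# ['-3', '1', '2.5', '10']
```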
I have a question about extracting specific elements from array-valued properties in Neo4j. For example, suppose each node in our database has a property 'Scores', which is an integer array of length 4. Is there a way to extract the first and fourth elements for every node in a path, i.e. can we do something along the lines of -
start src=node(1), end =node(7)
match path=src-[*..2]-end
return extract(n in nodes(path)| n.Scores[1], n.Scores[4]);
p.s. I am using Neo4j 2.0.0-RC1
Does this work for you?
START src=node(1), end=node(7)
MATCH path=src-[*..2]-end
RETURN extract(n in nodes(path)| [n.Scores[0], n.Scores[3]] )
Basically, for each node that creates a two-element collection holding its 1st and 4th score (indexes start at 0, so these are Scores[0] and Scores[3]). See 8.2.1. Expressions in general
An expression in Cypher can be:
...
A collection of expressions:
["a", "b"], [1,2,3],["a", 2, n.property, {param}], [ ].
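The same shape in Python, as a rough analogy: one small list per node, built with zero-based indexes. The `path_scores` data is invented for illustration.

```python
# Hypothetical Scores arrays for the three nodes on a path.
path_scores = [
    [10, 20, 30, 40],
    [11, 21, 31, 41],
    [12, 22, 32, 42],
]

# First element is index 0, fourth element is index 3.
picked = [[scores[0], scores[3]] for scores in path_scores]
print(picked)  # [[10, 40], [11, 41], [12, 42]]
```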