Is there a guarantee that all exploded rows in a stream will get updated into the table at once? - ksqldb

If I have a stream s1 with a messages column of type Array<Map<VARCHAR, VARCHAR>> like below
ROWTIME key messages
-------------------------------
t1 1 [{id: 1, k1: v1, k2: v2}, {id: 2, k1: v3, k2: v4}]
t2 2 [{id: 1, k1: v5, k2: v6}, {id: 2, k1: v7, k2: v8}]
.......
.......
I am creating another stream s2 using
create stream s2 as select explode(messages) as message from s1 emit changes;
ROWTIME message
-----------------------------
t1 {id: 1, k1: v1, k2: v2}
t1 {id: 2, k1: v3, k2: v4}
t2 {id: 1, k1: v5, k2: v6}
t2 {id: 2, k1: v7, k2: v8}
...........
...........
My aim is to create a table with id, k1, and k2 columns. I am publishing the messages as an array in s1 to make sure that they are all applied to the table together.
create stream s3 as select message['id'] as id, message['k1'] as k1, message['k2'] as k2 from s2 emit changes;
create table table1 as select id, latest_by_offset(k1) as k1, latest_by_offset(k2) as k2 from s3 group by id emit changes;
With the above, is there any guarantee that all the messages exploded from a single array (whatever their count; currently it is 2) will be applied to table1 at once? In other words, is there a guarantee that the state below is never possible, where only id 1 from timestamp t2 has been applied to table1 but id 2 from t2 has not?
ROWTIME id k1 k2
----------------------------------------
t1 2 v3 v4
t2 1 v5 v6

This isn't currently guaranteed by ksqlDB, though it would potentially be possible to enhance ksqlDB to support it. It's probably worth raising a feature request.

Related

How to use Cypher to create weighted relationships from a node to a list/array of nodes?

I would like to create nodes using a list/array of relationships (from an imported CSV). Here, in the table, "ID" is the node that should be linked to the list of nodes in "Relationships", ultimately using "Distances" as the weights of the relationships.
| ID | Relationships | Distances |
| -- | ------------- | ------------- |
| 1 | [1, 3, 5] | [0, 0.8, 0.3] |
| 2 | [2, 3, 5] | [0, 0.4, 0.1] |
| 3 | [3, 2, 4] | [0, 0.2, 0.6] |
| 4 | [4, 3, 5] | [0, 0.8, 0.6] |
| 5 | [5, 3, 4] | [0, 0.1, 0.8] |
Note, the most similar (zero distance) item refers to the node itself.
A file with 100 entities can be found at:
https://github.com/SebastiaanK97/NetworkSimilarity.git
So far, I haven't managed to set the relationships to the right nodes:
LOAD CSV WITH HEADERS FROM 'file:///cc.csv' AS cc
WITH cc WHERE cc.id IS NOT NULL
CREATE (n:Nodes {id:toInteger(cc.id)})
CREATE (t:Targets {ids:apoc.convert.fromJsonList(cc.Relationships)})
WITH n, t
UNWIND t.ids as id
CREATE (t)-[:SIMILAR_TO]->(n)
RETURN t, n
Thank you for the help.
First, you can use UNWIND with a list of indexes using range, and then access each of your lists (Relationships and Distances) with [] indexing like this:
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/SebastiaanK97/NetworkSimilarity/main/cc.csv' AS cc
WITH cc WHERE cc.ID IS NOT NULL
CREATE (n:Node { id: toInteger(cc.ID) })
WITH cc, n
UNWIND range(0, size(cc.Distances)-1) as idx
WITH n, cc.Distances[idx] as dist, cc.Relationships[idx] as targetId
WHERE targetId <> n.id // exclude relationship to "self"
CREATE (t:Node {id: targetId})
CREATE (t)-[:SIMILAR_TO {distance: dist}]->(n)
RETURN t, n
However, this query fails with your dataset because of the quotes around the lists in the CSV, which makes Cypher interpret them as strings rather than lists. I've fixed it by using split like this:
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/SebastiaanK97/NetworkSimilarity/main/cc.csv' AS cc
WITH cc WHERE cc.ID IS NOT NULL
MERGE (n:Node { id: toInteger(cc.ID) })
WITH cc, n,
split(cc.Relationships, ",") as rels,
split(cc.Distances, ",") as dists
UNWIND range(0, size(rels)-1) as idx
WITH n, dists[idx] as dist, rels[idx] as targetId
WHERE targetId <> n.id
MERGE (t:Node {id: targetId})
CREATE (t)-[:SIMILAR_TO {distance: dist}]->(n)
RETURN t, n
Finally, you'll have noticed that, in order to prevent creating multiple nodes with the same ID, I've replaced your CREATE statements with MERGE. But be aware that it can lead to performance issues with large datasets due to the so-called "Eager operator". Read more about it here and here, for instance.
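If the Eager operator does become a bottleneck, a common workaround is to split the import into two passes: one LOAD CSV that only creates the nodes, and a second one that only MATCHes them and creates the relationships. A rough sketch, reusing the same file and the same split-based parsing as above; the trim/toInteger/toFloat conversions are my assumption about how the list entries parse, so adjust them to the actual CSV contents:
// Pass 1: create each node exactly once, nothing else
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/SebastiaanK97/NetworkSimilarity/main/cc.csv' AS cc
WITH cc WHERE cc.ID IS NOT NULL
MERGE (:Node { id: toInteger(cc.ID) });
// Pass 2: only MATCH existing nodes and create the relationships
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/SebastiaanK97/NetworkSimilarity/main/cc.csv' AS cc
WITH cc WHERE cc.ID IS NOT NULL
MATCH (n:Node { id: toInteger(cc.ID) })
WITH n, split(cc.Relationships, ",") AS rels, split(cc.Distances, ",") AS dists
UNWIND range(0, size(rels)-1) AS idx
WITH n, toInteger(trim(rels[idx])) AS targetId, toFloat(trim(dists[idx])) AS dist
WHERE targetId IS NOT NULL AND targetId <> n.id // skip self-references and entries that don't parse
MATCH (t:Node { id: targetId })
CREATE (t)-[:SIMILAR_TO { distance: dist }]->(n);
Because the second pass never MERGEs nodes, Cypher doesn't need the Eager safeguard, and every relationship attaches to a node created in the first pass.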

QUERY results with count, group, order functions

I have 5 columns of data I want to return from a query, plus a count of the first column.
A couple of other things I want are to only include listings that are active (stored as the tag "Include" in column M) and to have the data randomized (I do this with a random number generator in column P). Neither of these last two should be displayed. The data I want returned is located in columns Q, R, S, T, and U.
My data looks like this:
| M (Active) | N (Text) | O (Text) | P (RN) | Q (Phone#) | R (ID) | S (Name) | T (Level) | U (Location) |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| Include | text | text | 0.51423 | 10000001 | 1223 | Bob | Level 2 | Boston |
| Include | text | text | 0.34342 | 10000005 | 2234 | Dylan | Level 3 | San Francisco |
| Exclude | text | text | 0.56453 | 10000007 | 2311 | Janet | Level 8 | Des Moines |
| Include | text | text | 0.23122 | 10000008 | 2312 | Gina | Level 8 | Houston |
| Include | text | text | | 10000001 | 1225 | Ronda | Level 3 | Boston |
| Include | text | text | | 10000001 | 1236 | Nathan | Level 2 | Boston |
So, ideally, results would look like:
| count Phone# | Phone# | ID | Name | Level | Location |
| -- | -- | -- | -- | -- | -- |
| 3 | 10000001 | 1223 | Bob | Level 2 | Boston |
| 1 | 10000005 | 2234 | Dylan | Level 3 | San Francisco |
| 1 | 10000008 | 2312 | Gina | Level 8 | Houston |
I don't care what ID or Name shows up behind the phone number so long as it's one of the numbers on the list.
Now, I have been able to get each piece to work separately (ORDER and COUNT), but I can't get both to work in one formula:
Worked:
=QUERY(Function!M:U, "SELECT count (Q), Q where O = 'Include' group by Q")
=QUERY(Function!M:U, "SELECT Q, R, S, T, U where O = 'Include' ORDER BY P DESC")
Did not work:
=QUERY(Function!M:U, "SELECT count (Q), Q group by Q, R, S, T, U where O = 'Include' group by Q ORDER BY P DESC, R, S, T, U")
=QUERY(Function!M:U, "SELECT count (Q), Q, R, S, T, U group by Q where O = 'Include' group by Q ORDER BY P DESC")
=QUERY(Function!M:U, "SELECT count (Q), Q group by Q where O = 'Include' group by Q ORDER BY P DESC, R, S, T, U")
Maybe someone has an idea of where I'm going wrong with combining the two different types of syntax? Help is much appreciated! :)
=ARRAYFORMULA({"count Phone#", "Phone#", "ID", "Name", "Level", "Location";
QUERY(Function!M3:U,
"select count(Q),Q where P is not null group by Q label count(Q)''", 0),
IFERROR(VLOOKUP(INDEX(QUERY(Function!M3:U,
"select Q,count(Q) where P is not null group by Q label count(Q)''", 0),,1),
QUERY(Function!M3:U,
"select Q,R,S,T,U where P is not null order by P desc", 0), {2, 3, 4, 5}, 0))})
cell P2:
=ARRAYFORMULA({"RN"; IF(M3:M="Include", RANDBETWEEN(ROW(A3:A),99^9), )})

Neo4j Rounding Relationship

Hi everybody,
I have two node labels (S1, S2).
S1 is loaded with:
USING PERIODIC COMMIT
LOAD CSV with HEADERS FROM "file:/S1.csv" AS line
CREATE (a:S1 {ID: TOINT(line.ID)})
SET a.Depth_m = TOFLOAT(line.depth);
The S1 node property values are:
ID Depth_m
1 100.06
2 100.20
3 100.37
4 101.29
5 101.50
6 101.88
7 102.42
8 102.70
9 102.92
S2 is loaded with:
USING PERIODIC COMMIT
LOAD CSV with HEADERS FROM "file:/S2.csv" AS line
CREATE (b:S2 {ID: TOINT(line.ID)})
SET b.Depth_m = TOFLOAT(line.depth);
The S2 node property values are:
ID Depth_m
1 100.25
2 101.55
3 102.75
So, I want to establish a relationship between nodes of the two labels whenever their Depth_m values are approximately the same (within a small difference, ~0.5).
E.g., result should be:
S1 S2
ID Depth_m ID Depth_m
1 100.20 =======>> 1 100.25
2 101.50 =======>> 2 101.55
3 102.70 =======>> 3 102.75
Can ROUND solve this issue? If it can help, how can I use it?
Thanks)
This query (for handling the S2.csv file) should do what you want:
USING PERIODIC COMMIT
LOAD CSV with HEADERS FROM "file:/S2.csv" AS line
CREATE (b:S2 {ID: TOINT(line.ID), Depth_m: TOFLOAT(line.depth)})
WITH b
MATCH (a:S1) WHERE ABS(a.Depth_m - b.Depth_m) <= 0.5
CREATE (a)-[:SIMILAR]->(b);
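As for the ROUND question: ROUND isn't really needed here. Comparing the absolute difference against the tolerance, as above, is more direct and avoids cases where two depths round to different integers even though they are within 0.5 of each other (e.g. 100.49 and 100.51). If both node sets are already loaded (and assuming Depth_m was stored as a float on the S1 nodes as well), the same check can be run as a standalone query:
// pair up every S1/S2 combination whose depths differ by at most 0.5
MATCH (a:S1), (b:S2)
WHERE ABS(a.Depth_m - b.Depth_m) <= 0.5
MERGE (a)-[:SIMILAR]->(b);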

Merging tracks in Neo4j

(Using Neo4j 3.x and neo4j.v1 Python driver)
I have two tracks T1 and T2 to the same target. Somewhere before reaching the target, the two tracks meet at node X and become one until the target is reached.
Track T1: T----------X-----------A
Track T2: '-----Q
I use the following Cypher query to generate each one of the tracks:
UNWIND {coords} AS coordinates
UNWIND {pax} AS pax
CREATE (n:Node)
SET n = coordinates
SET n.pax = pax
RETURN n
using the parameter list, e.g. {'pax': 'A', 'coords': [{'id': 0, 'lon': '8.553095', 'lat': '47.373146'}, etc.]}
and then link the nodes using the id only for the purpose of keeping the sequence of the trackpoints:
UNWIND {pax} AS pax
MATCH (n:Node {pax: pax})
WITH n
ORDER BY n.id
WITH COLLECT(n) AS nodes
UNWIND RANGE(0, SIZE(nodes) - 2) AS idx
WITH nodes[idx] AS n1, nodes[idx+1] AS n2
MERGE (n1)-[:NEXT]->(n2)
From the (unknown) point X (CS1 in the picture above) on, both tracks have identical trackpoints. I can match those using:
MATCH (n:Node), (m:Node)
WHERE m <> n AND n.id < m.id AND n.lat = m.lat AND n.lon = m.lon
MERGE (n)-[:IS]->(m)
with lat, lon being the (identical) coordinates. This is just my clumsy way to determine the first joint trackpoint. What I really need is to have one (linked) track from point X onward with the pax property updated, e.g. as ['A', 'B']
Question 1 (generalized):
How can I merge two nodes with a relationship into one node with an updated property? C3 and S3 merge into a new node CS3.
Question 2:
How can I do this if I have two linked lists with a set of pairwise identical properties?
(Ax)-[:NEXT]-> (A1)-[:NEXT]->(A2)-[:NEXT]->(A3)
(Ax)-[:NEXT]-> (B1)-[:NEXT]->(B2)-[:NEXT]->(B3)
where Ax.x <> Bx.x but A1.x = B1.x and A2.x = B2.x etc.
Thank you all for your hints and helpful ideas.
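For Question 1, one possible direction, assuming the APOC library is available (the earlier question in this list already uses APOC), is apoc.refactor.mergeNodes: it collapses a list of nodes into one and, with properties: "combine", turns differing property values into a list, which would give the desired pax: ['A', 'B']. A sketch that reuses the :IS relationship from the question to find the pairs; this is an illustration of the idea, not a verified solution:
// merge every pair previously linked by :IS into a single node,
// combining differing property values (e.g. pax) into a list
MATCH (n:Node)-[:IS]->(m:Node)
CALL apoc.refactor.mergeNodes([n, m], {properties: "combine"})
YIELD node
RETURN node;
The relationships of both nodes are moved onto the merged node, so the :NEXT chain stays linked; running this over each pairwise-identical pair is one way to approach Question 2 as well.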

Pig Join - How to join two tables with multiple fields where one of the fields in the key is optional?

Problem: Pig beginner here. Below are my two input tables.
Table 1: contains 3 columns (VID, TID and USID)
v1 TID101 US101
v2 TID102
v3 TID103
v4 TID104 US104
v5 US105
v6 US106
Table 2: contains 3 columns (PID, TID, USID)
p1 TID101 US101
p2 TID102 US102
p3 TID103 US103
p4 TID104 US104
p5 TID105 US105
I would like to join tables 1 and 2 and get the output below:
Expected Output:
v1 TID101 US101 p1
v2 TID102 p2
v3 TID103 p3
v4 TID104 US104 p4
v5 US105 p5
I tried an inner join as below:
a= JOIN table1 BY (TID, USID), table2 BY (TID, USID);
b= FOREACH a GENERATE table1::vID, table1::TID, table1::USID, table2::PID;
But I only get the output below:
Actual Output:
v1 TID101 US101 p1
v4 TID104 US104 p4
I could try a left outer join, but I feel that when I join by multiple keys, both keys are considered mandatory for the join and I cannot have an "OR" condition. All I am trying to do is get the PID from table2 if the table1 record contains either a USID or a TID. I am not sure what I am missing and would be interested to understand the best approach to arrive at the expected output. Please help!
Join on a single column, union the results, and apply DISTINCT to the final relation.
PigScript
A = LOAD 'test1.txt' USING PigStorage('\t') as (a1:chararray,a2:chararray,a3:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') as (b1:chararray,b2:chararray,b3:chararray);
A2 = JOIN A BY (a2),B by (b2);
A3 = JOIN B BY (b3),A by (a3);
C = FOREACH A2 GENERATE A::a1,A::a2,A::a3,B::b1;
D = FOREACH A3 GENERATE A::a1,B::b2,B::b3,B::b1;
E = UNION C,D;
E1 = DISTINCT E;
DUMP E1;
Output
