How to get the min and max of a split value? - ksqlDB

I have a stream that captures all of the data from my topic, in this form:
CREATE STREAM all_content (
my_key_col STRUCT < `schema` STRUCT< `type` VARCHAR, `optional` BOOLEAN>, `payload` VARCHAR > KEY,
`schema` VARCHAR,
`payload` VARCHAR
) WITH (KAFKA_TOPIC = 'cb_bench_products-get_purge', FORMAT = 'JSON');
Then I create a new stream, backed by a new topic, that takes only part of the data:
CREATE STREAM HISTORY_CONTENT
WITH (KAFKA_TOPIC='history_content', PARTITIONS=10, REPLICAS=1, FORMAT='JSON')
AS select * from ALL_CONTENT
WHERE my_key_col->`payload` LIKE 'history%';
Then when I do:
select my_key_col->`payload` from HISTORY_CONTENT;
It returns:
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|payload_1 |schema |payload |
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|history::05000228023411_RO_RO11219082::1 |null |some_data_in_base_64 |
|history::05000228023411_RO_RO11219082::3 |null |some_data_in_base_64 |
|history::05000228023411_RO_RO11219082::8 |null |some_data_in_base_64 |
|history::04532000053245_FR_RO11219082::76 |null |some_data_in_base_64 |
|history::04532000053245_FR_RO11219082::45 |null |some_data_in_base_64 |
|history::04532000053245_FR_RO11219082::3 |null |some_data_in_base_64 |
|history::09999999999911_UA_RO11219082::5 |null |some_data_in_base_64 |
|history::09999999999911_UA_RO11219082::1 |null |some_data_in_base_64 |
|history::09999999999911_UA_RO11219082::8 |null |some_data_in_base_64 |
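The key of interest here is the string of the form history::<document id>::<counter> in the first column. As a quick sanity check (a sketch only; ksqlDB array indexing is 1-based, and the doc_id / doc_version aliases are mine, not from the original topic), a push query can show the two tokens that the later aggregation needs:
SELECT
  SPLIT(my_key_col->`payload`, '::')[2] AS doc_id,                      -- middle token
  CAST(SPLIT(my_key_col->`payload`, '::')[3] AS BIGINT) AS doc_version  -- trailing counter
FROM HISTORY_CONTENT
EMIT CHANGES
LIMIT 10;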
I would like to create another stream that gives me, for each id in "payload_1", only the row with the minimum trailing value, i.e.:
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|payload_1 |schema |payload |
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|history::05000228023411_RO_RO11219082::1 |null |some_data_in_base_64 |
|history::04532000053245_FR_RO11219082::3 |null |some_data_in_base_64 |
|history::09999999999911_UA_RO11219082::1 |null |some_data_in_base_64 |
And a stream with the maximum value:
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|payload_1 |schema |payload |
+--------------------------------------------+--------------------------------------------+--------------------------------------------+
|history::05000228023411_RO_RO11219082::8 |null |some_data_in_base_64 |
|history::04532000053245_FR_RO11219082::76 |null |some_data_in_base_64 |
|history::09999999999911_UA_RO11219082::8 |null |some_data_in_base_64 |
I tried this but without success:
CREATE TABLE HISTORY_MIN
  WITH (KAFKA_TOPIC='history_min', PARTITIONS=10, REPLICAS=1, FORMAT='JSON') AS
SELECT
  SPLIT(my_key_col->`payload`, '::')[2],
  MIN(CAST(SPLIT(my_key_col->`payload`, '::')[3] AS BIGINT))
FROM HISTORY_CONTENT
GROUP BY SPLIT(my_key_col->`payload`, '::')[2];

CREATE TABLE HISTORY_MAX
  WITH (KAFKA_TOPIC='history_max', PARTITIONS=10, REPLICAS=1, FORMAT='JSON') AS
SELECT
  SPLIT(my_key_col->`payload`, '::')[2],
  MAX(CAST(SPLIT(my_key_col->`payload`, '::')[3] AS BIGINT))
FROM HISTORY_CONTENT
GROUP BY SPLIT(my_key_col->`payload`, '::')[2];
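For reference, one way to keep the full key shape in the aggregated output is to group on the middle token, aggregate the numeric suffix, and rebuild the string. The following is only a sketch, not a verified fix: it assumes a ksqlDB version whose CONCAT accepts more than two arguments and which allows scalar functions to wrap an aggregate in the projection; the names HISTORY_MIN_KEYED, doc_id, min_version and min_payload_key are illustrative.
CREATE TABLE HISTORY_MIN_KEYED
  WITH (KAFKA_TOPIC='history_min_keyed', PARTITIONS=10, REPLICAS=1, FORMAT='JSON') AS
SELECT
  -- the middle token becomes the grouping key (ksqlDB arrays are 1-based)
  SPLIT(my_key_col->`payload`, '::')[2] AS doc_id,
  -- smallest trailing counter seen for that id
  MIN(CAST(SPLIT(my_key_col->`payload`, '::')[3] AS BIGINT)) AS min_version,
  -- rebuild the original key string for the winning counter (illustrative)
  CONCAT('history::',
         SPLIT(my_key_col->`payload`, '::')[2],
         '::',
         CAST(MIN(CAST(SPLIT(my_key_col->`payload`, '::')[3] AS BIGINT)) AS VARCHAR)) AS min_payload_key
FROM HISTORY_CONTENT
GROUP BY SPLIT(my_key_col->`payload`, '::')[2];
A mirror-image table with MAX in place of MIN would give the second desired output. Note that an aggregation only returns the grouped and aggregated columns, so carrying the base64 payload of the winning row along would need a further join of this table back against HISTORY_CONTENT.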

Related

Google Sheets - Return unique IDs with the maximum associated mark

Good day. I'm trying to return the maximum mark for each student. If a student fails the training, they can make a new attempt, and my summary sheet should include only one row per student, with the highest mark obtained.
Example of data:
|    | A            | B     |
| -- | ------------ | ----- |
| 1  | email        | score |
| 2  | abc#mail.com | 1     |
| 3  | abd#mail.com | 3     |
| 4  | abc#mail.com | 3     |
| 5  | abc#mail.com | 4     |
| 6  | abe#mail.com | 5     |
| 7  | abe#mail.com | 4     |
| 8  | abe#mail.com | 7     |
| 9  | jvr#mail.com | 1     |
| 10 | jvr#mail.com | 7     |
And I would like to return this table:
|   | D            | E     |
| - | ------------ | ----- |
| 1 | email        | score |
| 2 | abc#mail.com | 4     |
| 3 | abd#mail.com | 3     |
| 4 | abe#mail.com | 7     |
| 5 | jvr#mail.com | 7     |
Code used in COL D2:
=UNIQUE(A2:A,FALSE,FALSE)
Code used in COL E2:
=if(G2<>"", ARRAYFORMULA(VLOOKUP(G2,D2:E,2,false)),"")
Code used in COL E3:
=if(G3<>"", ARRAYFORMULA(VLOOKUP(G3,D3:E,2,false)),"")
Is there any way to optimize this?
In D2 paste
=UNIQUE(A2:A)
In E2 paste this formula
=MAX(TRANSPOSE(FILTER($B$2:$B, $A$2:$A=D2)))
use:
=SORTN(SORT(A2:B, 2, 0), 9^9, 2, 1, 1)
Inner SORT arguments:
A2:B - the range
2 - sort by the 2nd column (score)
0 - in descending order
Outer SORTN arguments:
9^9 - return all rows
2 - remove duplicates in the grouping column (keep the first row of each group)
1 - group by the first column (email)
1 - in ascending order

How to combine 2 query results into a single query in Google Sheets

I would like to combine the two results into one, so they display in two columns.
What I have:
+------------+------------+------------+------------+------------+------------+------------+
| B | C | D | E | F | G | H |
+------------+------------+------------+------------+------------+------------+------------+
| Supplier A | 40ft | 19-0201 | 02/09/2019 | 05/09/2019 | 05/09/2019 | $2,590.60 |
| Supplier B | 20ft | 19-0206 | 04/09/2019 | 06/09/2019 | 07/09/2019 | $7,198.10 |
| Supplier C | 40ft | 19-0208 | 04/09/2019 | 06/09/2019 | 07/09/2019 | $3,673.40 |
| Supplier B | 20ft | 19-0207 | 04/09/2019 | 07/09/2019 | 08/09/2019 | $5,592.20 |
| Supplier C | 20ft | 19-0203 | 06/09/2019 | 05/09/2019 | 06/09/2019 | $863.30 |
| Supplier B | 20ft | 19-0204 | 05/09/2019 | 05/09/2019 | 06/09/2019 | $4,190.20 |
| Supplier D | 28ft | 19-0205 | 05/09/2019 | 07/09/2019 | 08/09/2019 | $1,390.60 |
| Supplier E | 14ft | 19-0209 | 07/09/2019 | 09/09/2019 | 09/09/2019 | $180.30 |
| Supplier B | 10ft | 19-0211 | 08/09/2019 | 08/09/2019 | 09/09/2019 | $12,392.80 |
| Supplier C | 40ft | 19-0210 | 07/09/2019 | 10/09/2019 | 11/09/2019 | $6,591.30 |
| Supplier B | 20ft | 19-0202 | 03/09/2019 | 12/09/2019 | 13/09/2019 | $1,380.50 |
| Supplier F | 14ft | 19-0213 | 09/09/2019 | 12/09/2019 | 12/09/2019 | $4,576.30 |
This is the first query:
=ARRAYFORMULA(TEXTJOIN(CHAR(10),TRUE,SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(QUERY(B16:H34,"SELECT B, D, SUM(H) GROUP BY B, D ORDER BY B ASC LABEL SUM(H)'' FORMAT SUM(H) '$##,##0.00' ")),,COLUMNS(QUERY(B16:H34,"SELECT B, D, SUM(H) GROUP BY B, D ORDER BY B ASC LABEL SUM(H)'' FORMAT SUM(H) '$##,##0.00' ")))))," → "," → ")))&CHAR(10)&CHAR(10)&"Total Costing : "&TEXT(SUM(H16:H34),"$0,000.00")
+-----------------------------------+
| Supplier A 19-0201 $2,590.60 |
| Supplier B 19-0202 $1,380.50 |
| Supplier B 19-0204 $4,190.20 |
| Supplier B 19-0206 $7,198.10 |
| Supplier B 19-0207 $5,592.20 |
| Supplier B 19-0211 $12,392.80 |
| Supplier C 19-0203 $863.30 |
| Supplier C 19-0208 $3,673.40 |
| Supplier C 19-0210 $6,591.30 |
| Supplier D 19-0205 $1,390.60 |
| Supplier E 19-0209 $180.30 |
| Supplier F 19-0213 $4,576.30 |
| |
| Total Costing $50,618.60 |
And my second query:
={QUERY({B16:H34},"SELECT Col1, SUM(Col7)/"& SUM(H16:H34)&" WHERE Col1 IS NOT NULL GROUP BY Col1 LABEL SUM(Col7)/"& SUM(H16:H34)&"'Scale Of Amount' FORMAT SUM(Col7)/"& SUM(H16:H34)&"'(0.00%)'");"Total Costing Scale","(100%)"}
+---------------+--------------+
| | SUM of Amount|
+---------------+--------------+
| Supplier A | (5.12%) |
| Supplier B | (60.75%) |
| Supplier C | (21.98%) |
| Supplier D | (2.75%) |
| Supplier E | (0.36%) |
| Supplier F | (9.04%) |
| Total Costing | (100.00%) |
How do I make it show:
+-------------------------+-------------------------+
| | SUM of Amount |
+-------------------------+-------------------------+
| Supplier A 19-0201 | $2,590.60 (5.12%) |
| Supplier B 19-0202 | $1,380.50 |
| Supplier B 19-0204 | $4,190.20 |
| Supplier B 19-0206 | $7,198.10 |
| Supplier B 19-0207 | $5,592.20 |
| Supplier B 19-0211 | $12,392.80 (60.75%) |
| Supplier C 19-0203 | $863.30 |
| Supplier C 19-0208 | $3,673.40 |
| Supplier C 19-0210 | $6,591.30 (21.98%) |
| Supplier D 19-0205 | $1,390.60 (2.75%) |
| Supplier E 19-0209 | $180.30 (0.36%) |
| Supplier F 19-0213 | $4,576.30 (9.04%) |
| | |
| Total Costing | $50,618.60 (100.00%) |
This is the formula that can be used:
=arrayformula({query( sort ( unique({B2:B,D2:D,sumif(B2:B&":"&D2:D,"=" &
B2:B&":"&D2:D,H2:H)}),1,true,2,true),"Select * where Col1 is not null"),
iferror(vlookup(transpose(split(join(",",rept("0,",query(filter(B2:B,B2:B<>""),
"Select Count(Col1) group by Col1 label Count(Col1) ''")-1) &
sequence(counta(unique(B2:B)))),",",true,false)),
{sequence(counta(unique(B2:B))),
query( unique({B2:B,sumif(B2:B,"="&B2:B,H2:H)/sum(H2:H)}),
"Select Col1, Col2 where Col1 is not null")},3,false),"");
{"Total","",sum(filter(H2:H,H2:H<>"")),1}})
Update 1:
= arrayformula
(
{
query (sort (unique({B2:B,D2:D,sumif(B2:B&":"&D2:D,"=" & B2:B&":"&D2:D,H2:H)}),1,true,2,true),"Select * where Col1 is not null"),
iferror (
vlookup(transpose(split(join(",",
rept
(
"0,",query(unique(filter({B2:B,B2:B&":"&D2:D},B2:B<>"")),"Select Count(Col1) group by Col1 label Count(Col1) ''")-1
) & sequence (counta(unique(B2:B)))),",",true,false)),
{
sequence(counta(unique(B2:B))),
query (unique ({B2:B,sumif(B2:B,"="&B2:B,H2:H)/sum(H2:H)}),"Select Col1, Col2 where Col1 is not null")
},3,false
),""
) ; {"Total","",sum(filter(H2:H,H2:H<>"")),1}
}
)

DISTINCT, MAX and list values from joined tables

How can I combine DISTINCT and MAX across the joined tables below?
Table_details_usage
UID | VE_NO | START_MILEAGE | END_MILEAGE
------------------------------------------------
1 | ASD | 410000 | 410500
2 | JWQ | 212000 | 212350
3 | WYS | 521000 | 521150
4 | JWQ | 212360 | 212400
5 | ASD | 410520 | 410600
Table_service_schedule
SID | VE_NO | SV_ONMILEAGE | SV_NEXTMILEAGE
------------------------------------------------
1 | ASD | 400010 | 410010
2 | JWQ | 212120 | 222120
3 | WYS | 511950 | 521950
4 | JWQ | 212300 | 222300
5 | ASD | 410510 | 420510
How do I get the output below (only the max values)? That is, for each vehicle, the max SV_NEXTMILEAGE from Table_service_schedule and the max END_MILEAGE from Table_details_usage:
SID | VE_NO | SV_NEXTMILEAGE | END_MILEAGE
--------------------------------------------
5 | ASD | 420510 | 410600
4 | JWQ | 222300 | 212400
3 | WYS | 521950 | 521150
Something along the lines of:
SELECT
  SID,
  VE_NO,
  SV_NEXTMILEAGE,
  (SELECT MAX(END_MILEAGE) FROM Table_details_usage d WHERE d.VE_NO = s.VE_NO) AS END_MILEAGE
FROM Table_service_schedule s
WHERE SID = (SELECT MAX(SID) FROM Table_service_schedule s2 WHERE s2.VE_NO = s.VE_NO)
You would probably also need to change the direct SV_NEXTMILEAGE value to a MAX as well, if the IDs aren't in order...
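The same result can also be written as a join-based sketch (same tables and columns as above; the derived-table aliases last_s and d are illustrative, and it assumes SID increases along with SV_NEXTMILEAGE, as in the sample data):
SELECT s.SID, s.VE_NO, s.SV_NEXTMILEAGE, d.END_MILEAGE
FROM Table_service_schedule s
JOIN (SELECT VE_NO, MAX(SID) AS max_sid              -- latest schedule row per vehicle
        FROM Table_service_schedule
       GROUP BY VE_NO) last_s
  ON last_s.VE_NO = s.VE_NO
 AND last_s.max_sid = s.SID
JOIN (SELECT VE_NO, MAX(END_MILEAGE) AS END_MILEAGE  -- highest recorded usage per vehicle
        FROM Table_details_usage
       GROUP BY VE_NO) d
  ON d.VE_NO = s.VE_NO;
If the IDs cannot be trusted to be in order, taking MAX(SV_NEXTMILEAGE) per VE_NO in the first derived table (and selecting that instead of s.SV_NEXTMILEAGE) addresses the caveat above.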

Joining two tables, A and B, where the column names of A should be joined with B based on the values of a column in B

The title is a little confusing, so here is an example of the problem I'm facing.
Table:
FORM_QUESTION
Fields:
student_id
q1_ans
q2_ans
q3_ans
q4_ans
q5_ans
x--------------x----------x----------x----------x----------x----------x
| student_id | q1_ans | q2_ans | q3_ans | q4_ans | q5_ans |
x--------------x----------x----------x----------x----------x----------x
| 1 | A | D | B | B | E |
| 2 | D | C | B | A | D |
| 3 | B | C | D | A | B |
x--------------x----------x----------x----------x----------x----------x
The FORM_QUESTION table stores a student's answers for each question.
Here is information on the second table:
Table:
FORM_VALID_ANS
Fields:
question_id
valid_answer
x---------------x----------------x
| question_id | valid_answer |
x---------------x----------------x
| q1_ans | A |
| q1_ans | B |
| q2_ans | A |
| q2_ans | B |
| q2_ans | C |
| q2_ans | D |
| q3_ans | A |
| q4_ans | A |
| q4_ans | B |
| q5_ans | A |
x---------------x----------------x
The second table, FORM_VALID_ANS, stores the valid, acceptable answers for a particular question. So, according to the above table, here is the acceptable list of values for each question:
q1_ans: A, B
q2_ans: A, B, C, D
q3_ans: A
q4_ans: A, B
q5_ans: A
As you can see, the values in FORM_VALID_ANS.question_id are exactly the question column names of FORM_QUESTION ("q1_ans", "q2_ans", "q3_ans", "q4_ans", and "q5_ans"). I need to check each student's answer to make sure it is one of the valid values for that question, as specified in FORM_VALID_ANS.valid_answer.
The desired output, where XXX represents an invalid value and every other cell shows the answer as given:
x--------------x----------x----------x----------x----------x----------x
| student_id | q1_ans | q2_ans | q3_ans | q4_ans | q5_ans |
x--------------x----------x----------x----------x----------x----------x
| 1 | A | D | XXX | B | XXX |
| 2 | XXX | C | XXX | A | XXX |
| 3 | B | C | XXX | A | XXX |
x--------------x----------x----------x----------x----------x----------x
Is it possible to join these two tables together and produce these results (or similar results)?
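One straightforward shape is a CASE/EXISTS check per question column, as sketched below in standard SQL; the literal 'XXX' marker and the alias names are illustrative, and the pattern repeats for all five question columns:
SELECT q.student_id,
       CASE WHEN EXISTS (SELECT 1 FROM FORM_VALID_ANS v
                          WHERE v.question_id = 'q1_ans'
                            AND v.valid_answer = q.q1_ans)
            THEN q.q1_ans ELSE 'XXX' END AS q1_ans,
       CASE WHEN EXISTS (SELECT 1 FROM FORM_VALID_ANS v
                          WHERE v.question_id = 'q2_ans'
                            AND v.valid_answer = q.q2_ans)
            THEN q.q2_ans ELSE 'XXX' END AS q2_ans
       -- ... same pattern for q3_ans, q4_ans and q5_ans
  FROM FORM_QUESTION q;
The same output can also be produced with one LEFT JOIN of FORM_VALID_ANS per question column and COALESCE(v.valid_answer, 'XXX') over each joined value; the EXISTS form simply keeps the repetition in one place.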

Neo4j/Cypher effective pagination with order by over large sub-graph

I have the following simple relationship between (:User) nodes.
(:User)-[:FOLLOWS {timestamp}]->(:User)
When I paginate followers ordered by FOLLOWS.timestamp, I run into performance problems once someone has millions of followers.
MATCH (u:User {Id:{id}})<-[f:FOLLOWS]-(follower)
WHERE f.timestamp <= {timestamp}
RETURN follower
ORDER BY f.timestamp DESC
LIMIT 100
What is the suggested approach for paginating large data sets when ordering is required?
UPDATE
follower timestamp
---------------------------------------
id(1000000) 1455967905
id(999999) 1455967875
id(999998) 1455967234
id(999997) 1455967123
id(999996) 1455965321
id(999995) 1455964123
id(999994) 1455963645
id(999993) 1455963512
id(999992) 1455961343
....
id(2) 1455909382
id(1) 1455908432
I want to slice this list down using the timestamp set on the :FOLLOWS relationship. If I want to return batches of 4 followers, I first take the current timestamp and return the 4 most recent, then use 1455967123 (the timestamp of the last one returned) and return the next 4 most recent, and so on. To do this, the whole list has to be ordered by timestamp, which causes performance issues over millions of records.
If you're looking for the most recent followers, i.e. where the timestamp is greater than a given time, the query only has to traverse the most recent relationships. Query (2) below does this in about 20 ms.
If you are really looking for the oldest (first) followers, it makes sense to skip ahead instead of checking the timestamp of every one of the million followers, which takes about 1 s on my system (see (3)). With the skip, the time drops to roughly 240 ms (see (1)).
In general, on my laptop this works out to about 2M db operations per core per second.
(1) Look at first / oldest followers
PROFILE
> MATCH (u)<-[f:FOLLOWS]-(follower) WHERE id(u) = 0
> // skip ahead
> WITH f,follower SKIP 999000
> // do the actual check
> WITH f,follower WHERE f.ts < 500
> RETURN f, follower
> ORDER BY f.ts
> LIMIT 10;
+---------------------------------+
| f | follower |
+---------------------------------+
| :FOLLOWS[0]{ts:1} | Node[1]{} |
...
+---------------------------------+
10 rows
243 ms
Compiler CYPHER 2.3 Planner COST Runtime INTERPRETED
+-----------------+----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------+----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +ProduceResults | 1 | 10 | 0 | f, follower | f, follower |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Projection | 1 | 10 | 0 | anon[142], anon[155], anon[158], anon[178], f, follower, f, follower, u | anon[155]; anon[158] |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Top | 1 | 10 | 0 | anon[142], anon[155], anon[158], anon[178], f, follower, u | Literal(10); |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Projection | 1 | 499 | 499 | anon[142], anon[155], anon[158], anon[178], f, follower, u | anon[155]; anon[158]; anon[155].ts |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Projection | 1 | 499 | 0 | anon[142], anon[155], anon[158], f, follower, u | f; follower |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Filter | 1 | 499 | 0 | anon[142], f, follower, u | anon[142] |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Projection | 1 | 1000 | 1000 | anon[142], f, follower, u | f; follower; f.ts < { AUTOINT2} |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Skip | 1 | 1000 | 0 | f, follower, u | { AUTOINT1} |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +Expand(All) | 1 | 1000000 | 1000001 | f, follower, u | (u)<-[ f#12:FOLLOWS]-( follower#24) |
| | +----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
| +NodeByIdSeek | 1 | 1 | 1 | u | |
+-----------------+----------------+---------+---------+-------------------------------------------------------------------------+---------------------------------------+
Total database accesses: 1001501
(2) Look at most recent followers
PROFILE
> MATCH (u)<-[f:FOLLOWS]-(follower) WHERE id(u) = 0
> AND f.ts > 999500
> RETURN f, follower
> LIMIT 10;
+----------------------------------------------+
| f | follower |
+----------------------------------------------+
| :FOLLOWS[999839]{ts:999840} | Node[999840]{} |
...
+----------------------------------------------+
10 rows
23 ms
Compiler CYPHER 2.3 Planner COST Runtime INTERPRETED
+-----------------+----------------+-------+---------+----------------+---------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------+----------------+-------+---------+----------------+---------------------------------------------------------------+
| +ProduceResults | 1 | 10 | 0 | f, follower | f, follower |
| | +----------------+-------+---------+----------------+---------------------------------------------------------------+
| +Limit | 1 | 10 | 0 | f, follower, u | Literal(10) |
| | +----------------+-------+---------+----------------+---------------------------------------------------------------+
| +Filter | 1 | 10 | 16394 | f, follower, u | AndedPropertyComparablePredicates(f,f.ts,f.ts > { AUTOINT1}) |
| | +----------------+-------+---------+----------------+---------------------------------------------------------------+
| +Expand(All) | 1 | 16394 | 16395 | f, follower, u | (u)<-[f:FOLLOWS]-(follower) |
| | +----------------+-------+---------+----------------+---------------------------------------------------------------+
| +NodeByIdSeek | 1 | 1 | 1 | u | |
+-----------------+----------------+-------+---------+----------------+---------------------------------------------------------------+
Total database accesses: 32790
(3) Find oldest followers without skipping ahead
PROFILE
> MATCH (u)<-[f:FOLLOWS]-(follower) WHERE id(u) = 0
> AND f.ts < 500
> RETURN f, follower
> LIMIT 10;
+-------------------------------------+
| f | follower |
+-------------------------------------+
...
| :FOLLOWS[491]{ts:492} | Node[492]{} |
+-------------------------------------+
10 rows
1008 ms
Compiler CYPHER 2.3 Planner COST Runtime INTERPRETED
+-----------------+----------------+--------+---------+----------------+---------------------------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------+----------------+--------+---------+----------------+---------------------------------------------------------------+
| +ProduceResults | 1 | 10 | 0 | f, follower | f, follower |
| | +----------------+--------+---------+----------------+---------------------------------------------------------------+
| +Limit | 1 | 10 | 0 | f, follower, u | Literal(10) |
| | +----------------+--------+---------+----------------+---------------------------------------------------------------+
| +Filter | 1 | 10 | 999498 | f, follower, u | AndedPropertyComparablePredicates(f,f.ts,f.ts < { AUTOINT1}) |
| | +----------------+--------+---------+----------------+---------------------------------------------------------------+
| +Expand(All) | 1 | 999498 | 999499 | f, follower, u | (u)<-[f:FOLLOWS]-(follower) |
| | +----------------+--------+---------+----------------+---------------------------------------------------------------+
| +NodeByIdSeek | 1 | 1 | 1 | u | |
+-----------------+----------------+--------+---------+----------------+---------------------------------------------------------------+
Total database accesses: 1998998
