Stream-table join in ksqlDB giving no output

The issue is that after joining a stream with a table I get no results.
I have already checked the partition count on both sides; it is 1.
The key type for both entities is VARCHAR, and the key names on both sides are the same.
Below is the data I have.
Step 0. SET 'auto.offset.reset' = 'earliest';
Step 1. Stream I have:
ksql> describe L_EMPLOYEE1_KEYED_STREAM extended;
Name : L_EMPLOYEE1_KEYED_STREAM
Type : STREAM
Timestamp field : Not set - using <ROWTIME>
Key format : KAFKA
Value format : JSON
Kafka topic : L_EMPLOYEE1_KEYED_STREAM (partitions: 1, replication: 1)
Statement : CREATE STREAM L_EMPLOYEE1_KEYED_STREAM WITH (KAFKA_TOPIC='L_EMPLOYEE1_KEYED_STREAM', PARTITIONS=1, REPLICAS=1) AS SELECT * FROM L_EMPLOYEE1 L_EMPLOYEE1 PARTITION BY L_EMPLOYEE1.L_EID EMIT CHANGES;
Field | Type
-----------------------------------
L_EID | VARCHAR(STRING) (key)
NAME | VARCHAR(STRING)
LNAME | VARCHAR(STRING)
L_ADD_ID | VARCHAR(STRING)
-----------------------------------
Queries that write from this STREAM
-----------------------------------
CSAS_L_EMPLOYEE1_KEYED_STREAM_37 (RUNNING) : CREATE STREAM L_EMPLOYEE1_KEYED_STREAM WITH (KAFKA_TOPIC='L_EMPLOYEE1_KEYED_STREAM', PARTITIONS=1, REPLICAS=1) AS SELECT * FROM L_EMPLOYEE1 L_EMPLOYEE1 PARTITION BY L_EMPLOYEE1.L_EID EMIT CHANGES;
For query topology and execution plan please run: EXPLAIN <QueryId>
Runtime statistics by host
-------------------------
Host | Metric | Value | Last Message
-----------------------------------------------------------------------------
ksql-server:8088 | messages-per-sec | 0 | 2023-02-14T08:20:44.489Z
ksql-server:8088 | total-messages   | 2 | 2023-02-14T08:20:44.489Z
-----------------------------------------------------------------------------
(Statistics of the local KSQL server interaction with the Kafka topic L_EMPLOYEE1_KEYED_STREAM)
Consumer Groups summary:
Consumer Group :
_confluent-ksql-default_query_CSAS_L_EMPLOYEE1_KEYED_STREAM_37
Kafka topic : L_EMPLOYEE1
Max lag : 0
Partition | Start Offset | End Offset | Offset | Lag
------------------------------------------------------
0         | 0            | 2          | 2      | 0
------------------------------------------------------
Step 2. The table I have:
ksql> describe ID_MAP_KEYED_TABLE extended;
Name : ID_MAP_KEYED_TABLE
Type : TABLE
Timestamp field : Not set - using <ROWTIME>
Key format : KAFKA
Value format : JSON
Kafka topic : ID_MAP_KEYED_STREAM (partitions: 1, replication: 1)
Statement : CREATE TABLE ID_MAP_KEYED_TABLE (L_EID STRING PRIMARY KEY, R_EID STRING, L_ADD_ID STRING, R_ADD_ID STRING) WITH (KAFKA_TOPIC='ID_MAP_KEYED_STREAM', KEY_FORMAT='KAFKA', VALUE_FORMAT='JSON');
Field | Type
-------------------------------------------
L_EID | VARCHAR(STRING) (primary key)
R_EID | VARCHAR(STRING)
L_ADD_ID | VARCHAR(STRING)
R_ADD_ID | VARCHAR(STRING)
-------------------------------------------
Runtime statistics by host
-------------------------
Host | Metric | Value | Last Message
-----------------------------------------------------------------------------
ksql-server:8088 | messages-per-sec | 0 | 2023-02-14T08:13:11.214Z
ksql-server:8088 | total-messages | 2 | 2023-02-14T08:13:11.214Z
-----------------------------------------------------------------------------
(Statistics of the local KSQL server interaction with the Kafka topic ID_MAP_KEYED_STREAM)
Step 3. Data in both entities:
ksql> select * from L_EMPLOYEE1_KEYED_STREAM;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|L_EID |NAME |LNAME |L_ADD_ID |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|101 |Dhruv |S |201 |
|102 |Dhruv1 |S1 |202 |
Query Completed
Query terminated
ksql> select * from ID_MAP_KEYED_TABLE emit changes;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|L_EID |R_EID |L_ADD_ID |R_ADD_ID |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|101 |1001 |201 |2001 |
|102 |1002 |202 |2002 |
^CQuery terminated
Step 4. Join operation and its result:
ksql> select map.R_eid, L_emp.name, L_emp.lname, map.R_add_id from L_EMPLOYEE1_KEYED_STREAM L_emp inner join ID_MAP_KEYED_TABLE map on map.L_eid=L_emp.L_eid emit changes;
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|R_EID |NAME |LNAME |R_ADD_ID |
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
The output is completely blank; I am not sure why this is happening.
When I query each entity individually with a WHERE clause, both return results just fine (see the sketch below).
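For example (a hypothetical reconstruction of the WHERE-clause checks mentioned above, not the exact queries I ran; the key value '101' comes from the sample rows):
-- per-entity lookups that do return rows, while the join above stays empty
SELECT * FROM L_EMPLOYEE1_KEYED_STREAM WHERE L_EID = '101' EMIT CHANGES;
SELECT * FROM ID_MAP_KEYED_TABLE WHERE L_EID = '101' EMIT CHANGES;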
Please help me complete my POC in time.

Related

Mapping timeseries+static information into an ML model (XGBoost)

So let's say I have multiple problems, where one problem has two input DataFrames:
Input:
One constant stream of data (e.g. from a sensor); in a second step, multiple streams from multiple sensors.
> df_prob1_stream1
timestamp | ident | measure1 | measure2 | total_amount |
----------------------------+--------+--------------+----------+--------------+
2019-09-16 20:00:10.053174 | A | 0.380 | 0.08 | 2952618 |
2019-09-16 20:00:00.080592 | A | 0.300 | 0.11 | 2982228 |
... (1 million more rows - until a pre-defined ts) ...
One static DataFrame of information, mapped to a unique identifier called ident, which needs to be joined to the ident column in each df_probX_streamX so the system recognizes that this data is related.
> df_global
ident | some1 | some2 | some3 |
--------+--------------+----------+--------------+
A | LARGE | 8137 | 1 |
B | SMALL | 1234 | 2 |
Output:
A binary classifier [0,1]
So how can I suitably train XGBoost to make the best use of one timeseries DataFrame in combination with one static DataFrame (containing additional context information) in one problem? Any help would be appreciated.
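One common way to frame this (a minimal sketch, not a definitive recipe: the column names follow the tables above, the aggregation choices are placeholders, and labels is assumed to be an existing 0/1 target with one entry per ident) is to collapse each stream into per-ident features, merge the static df_global on ident, and train a binary XGBoost classifier:

import pandas as pd
import xgboost as xgb

# Collapse the timeseries into one row of features per ident
# (mean/max/last are placeholder aggregations; pick ones that fit the signal).
agg = df_prob1_stream1.groupby("ident").agg(
    measure1_mean=("measure1", "mean"),
    measure1_max=("measure1", "max"),
    measure2_mean=("measure2", "mean"),
    total_amount_last=("total_amount", "last"),
).reset_index()

# Attach the static context information via the shared ident key.
features = agg.merge(df_global, on="ident", how="left")

# 'some1' is categorical in the example table; encode it numerically for XGBoost.
features["some1"] = features["some1"].astype("category").cat.codes

X = features.drop(columns=["ident"])
y = labels  # assumed: one 0/1 label per ident, aligned with X

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

For multiple streams per problem, the same aggregate-then-merge-on-ident step can be repeated for each df_probX_streamX before training.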

How to create a percentage / ratio column in Grafana / InfluxDB?

I have the data about the errors written to InfluxDB (example is simplified).
time | error | some_unique_data
--------|---------|--------------------
<time> | hello 1 | some unique data...
<time> | hello 2 | some unique data...
<time> | hello 2 | some unique data...
<time> | hello 3 | some unique data...
I can write the following query to see the sorted list of the most frequent errors in Grafana:
SELECT COUNT("some_unique_data") FROM "my_measument" WHERE $timeFilter GROUP BY error
which gives:
| error | count
|---------|-------
| hello 2 | 2
| hello 1 | 1
| hello 3 | 1
What I am missing is the column that would show me the impact of each error like this:
| error   | count | impact
|---------|-------|--------
| hello 2 | 2     | 50%
| hello 1 | 1     | 25%
| hello 3 | 1     | 25%
What should I add to my query to get this impact field working?
Here is a useful answer, but unfortunately it doesn't solve your issue.
In your example:
| error | count
|---------|-------
| hello 2 | 2
| hello 1 | 1
| hello 3 | 1
you've used GROUP BY "error" to get the count of each error, but once you do that you no longer have access to the full count, so you would have to compute it before the GROUP BY.
You can't do:
SELECT COUNT(*) AS full_count, "some_unique_data" FROM "my_measument" WHERE $timeFilter
in order to get the full number of records because you can't use SELECT COUNT(), "field_name" FROM ...
So, getting the full count before doing GROUP BY isn't possible.
Well, let's try something else:
SELECT "fields_count" / SUM("fields_count") AS not_able_to_use_SUM_with_field , "error" FROM (
SELECT COUNT("some_unique_data") AS fields_count AS fields_sum FROM "my_measument" WHERE $timeFilter GROUP BY "error"
)
This previous query doesn't work either, apparently. So what can you do?
Sorry, you can do nothing :/
Here is another link to the documentation.

Building Activerecord / SQL query for jsonb value search

Currently, for a recurring search with different parameters, I have this ActiveRecord query built:
current_user.documents.order(:updated_at).reverse_order.includes(:groups,:rules)
Now, usually I tack a where clause onto this to perform the search. However, I now need to search through the jsonb field for all rows that have a certain value in a key:value pair. I've been able to do something similar to that in raw SQL with this syntax (the data field will only ever be exactly two levels nested):
SELECT
*
FROM
(SELECT
*
FROM
(SELECT
*
FROM
documents
) A,
jsonb_each(A.data)
) B,
jsonb_each_text(B.value) AS C
WHERE
C.value = '30';
However, I want to use the current ActiveRecord search to make this query (which includes the groups/rules eager loading).
I'm struggling with the use of the comma, which I understand is an implicit join, which is executed before explicit joins, so when I try something like this:
select * from documents B join (select * from jsonb_each(B.data)) as A on true;
ERROR: invalid reference to FROM-clause entry for table "b"
LINE 1: ...* from documents B join (select * from jsonb_each(B.data)) a...
^
HINT: There is an entry for table "b", but it cannot be referenced from this part of the query.
But I don't understand how to reference the complete "table" that my ActiveRecord query creates before I make a joins call, nor how to make use of the comma syntax for implicit joins.
Also, I'm an SQL amateur, so if you see some improvements or other ways to do this, please do tell.
EDIT: Description of documents table:
Table "public.documents"
Column | Type | Modifiers | Storage | Stats target | Description
------------+-----------------------------+--------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('documents_id_seq'::regclass) | plain | |
document_id | character varying | | extended | |
name | character varying | | extended | |
size | integer | | plain | |
last_updated| timestamp without time zone | | plain | |
user_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
updated_at | timestamp without time zone | | plain | |
kind | character varying | | extended | |
uid | character varying | | extended | |
access_token_id | integer | | plain | |
data | jsonb | not null default '{}'::jsonb | extended | |
Indexes:
"documents_pkey" PRIMARY KEY, btree (id)
Sample rows, first would match a search for '30' (data is the last field):
2104 | 24419693037 | LsitHandsBackwards.jpg | | | 1 | 2017-06-25 21:45:49.121686 | 2017-07-01 21:32:37.624184 | box | 221607127 | 15 | {"owner": {"born": "to make history", "price": 30}}
2177 | /all-drive/uml flows/typicaluseractivity.svg | TypicalUserActivity.svg | 12375 | 2014-08-11 02:21:14 | 1 | 2017-07-07 14:00:11.487455 | 2017-07-07 14:00:11.487455 | dropbox | 325694961 | 20 | {"owner": {}}
You can use a query similar to the one you already showed:
SELECT
d.id, d.data
FROM
documents AS d
INNER JOIN jsonb_each(d.data) AS x ON TRUE
INNER JOIN jsonb_each(x.value) AS y ON TRUE
WHERE
cast(y.value as text) = '30';
Assuming your data would be the following one:
INSERT INTO documents
(data)
VALUES
('{"owner": {"born": "to make history", "price": 30}}'),
('{"owner": {}}'),
('{"owner": {"born": "to make history", "price": 50}, "seller": {"worth": 30}}')
;
The result you'd get is:
id | data
-: | :---------------------------------------------------------------------------
1 | {"owner": {"born": "to make history", "price": 30}}
3 | {"owner": {"born": "to make history", "price": 50}, "seller": {"worth": 30}}
You can check it (together with some step-by-step looks at the data) at dbfiddle here
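As a side note, the "invalid reference to FROM-clause entry for table "b"" error in the question happens because a plain subquery in a join cannot reference another FROM item; PostgreSQL 9.3+ lifts that restriction with LATERAL. A rough sketch of that variant (the aliases top_level and nested are illustrative, not from the question):
SELECT d.*
FROM documents d
CROSS JOIN LATERAL jsonb_each(d.data) AS top_level(key, value)
CROSS JOIN LATERAL jsonb_each_text(top_level.value) AS nested(key, value)
WHERE nested.value = '30';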

Formatting JSON table from Postgresql request

I'm trying to create a JSON format from a PostgreSQL request.
At first I used Rails to query my database in the format.json block of my controller and then used a json.builder file to format the JSON view. That worked until my requests started returning hundreds of thousands of rows, so I looked into optimizing the JSON creation while avoiding the whole ActiveRecord stack.
To do this I am using PostgreSQL 9.6 JSON functions to get my data directly in the right format, for example:
SELECT array_to_json('{{1157241840,-1.95},{1157241960,-1.96}}'::float[]);
[[1157241840, -1.95], [1157241960, -1.96]]
But using data from this kind of request :
SELECT date,value FROM measures;
The best I could obtain was something like this :
SELECT array_to_json(array_agg(t)) FROM (SELECT date,value FROM measures) t;
Resulting in :
[
{"date":"1997-06-13T19:12:00","value":1608.4},
{"date":"1997-06-13T19:12:00","value":-0.6}
]
which is quite different... How would you build this SQL request?
Thanks for your help!
My measures table looks like this:
id | value | created_at | updated_at | parameter_id | quality_id | station_id | date | campain_id | elevation | sensor_id | comment_id
--------+-------+----------------------------+----------------------------+--------------+------------+------------+---------------------+------------+-----------+-----------+------------
799634 | -1.99 | 2017-02-21 09:41:09.062795 | 2017-02-21 09:41:09.118807 | 2 | | 1 | 2006-06-26 23:24:00 | 1 | -5.0 | |
1227314 | -1.59 | 2017-02-21 09:44:12.032576 | 2017-02-21 09:44:12.088311 | 2 | | 1 | 2006-11-30 19:48:00 | 1 | -5.0 | |
1227315 | 26.65 | 2017-02-21 09:44:12.032576 | 2017-02-21 09:44:12.088311 | 3 | | 1 | 2006-11-30 19:48:00 | 1 | -5.0 | |
If you need an array of arrays, you need to use json_build_array:
SELECT json_agg(json_build_array(date,value)) FROM measures;
If you want to convert the timestamp to an epoch:
SELECT json_agg(json_build_array(extract(epoch FROM date)::int8, value)) FROM measures;
For test:
WITH measures AS (
SELECT 1157241840 as date, -1.95 as value
UNION SELECT 1157241960, -1.96
UNION SELECT 1157241980, NULL
)
SELECT json_agg(json_build_array(date,value)) FROM measures;
json_agg
----------------------------------------------------------------
[[1157241840, -1.95], [1157241960, -1.96], [1157241980, null]]
create table measures (date timestamp, value float);
insert into measures (date, value) values
(to_timestamp(1157241840),-1.95),
(to_timestamp(1157241960),-1.96);
select array_to_json(array_agg(array[extract(epoch from date), value]::float[]))
from measures
;
array_to_json
-----------------------------------------
[[1157241840,-1.95],[1157241960,-1.96]]
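To tie this back to the Rails controller mentioned in the question, one way to run such a query while bypassing ActiveRecord model instantiation is select_value (a minimal sketch; the measures table and its columns are from the question, everything else is illustrative):
# Run the aggregate directly and hand the resulting JSON string to the response,
# skipping ActiveRecord object instantiation entirely.
json = ActiveRecord::Base.connection.select_value(<<~SQL)
  SELECT json_agg(json_build_array(extract(epoch FROM date)::int8, value))
  FROM measures
SQL

render json: json  # already valid JSON text; Rails passes strings through as-is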

Copying a property from a node is slow with a lot of nodes

I'm migrating some properties in a labeled node and the query performance is very poor.
The old property is callerRef and the new property is code. There are 17m nodes that need to be updated that I want to process in batches. Absence of the code property on the entity indicates that it has not yet been upgraded.
profile match (e:Entity) where not has(e.code) with e limit 1000000 set e.key = e.callerKeyRef, e.code = e.callerRef;
There is one index in the Entity label and that is for code.
schema -l :Entity
Indexes
ON :Entity(code) ONLINE
No constraints
The heap has 8 GB allocated, running Neo4j 2.2.4. The problem, if I'm reading the plan right, is that ALL nodes with the label are being hit even though a LIMIT clause is specified. I would have thought that in an unordered query with a LIMIT, processing would stop once the limit criterion is met.
+-------------------+
| No data returned. |
+-------------------+
Properties set: 2000000
870891 ms
Compiler CYPHER 2.2
+-------------+----------+----------+-------------+--------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+-------------+----------+----------+-------------+--------------------------+
| EmptyResult | 0 | 0 | | |
| UpdateGraph | 1000000 | 6000000 | e | PropertySet; PropertySet |
| Eager | 1000000 | 0 | e | |
| Slice | 1000000 | 0 | e | { AUTOINT0} |
| Filter | 1000000 | 16990200 | e | NOT(hasProp(e.code)) |
| NodeByLabel | 16990200 | 16990201 | e | :Entity |
+-------------+----------+----------+-------------+--------------------------+
Total database accesses: 39980401
Am I missing something obvious? TIA
Indexes are supported only for = and IN (which are basically the same, because the Cypher compiler transforms all = operations into IN).
Neo4j is a schema-less database, so if the property is absent there is no index entry for it. That is why it needs to scan all nodes.
My suggestions:
First step: add the code property to all necessary nodes with some default "falsy" value.
Then make the update using a WHERE e.code = "none" clause, as in the sketch below.
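A minimal Cypher sketch of that suggestion (the sentinel value "none" is an assumption and must not collide with real codes; the first pass is still a one-off label scan, but the batched updates can then use the :Entity(code) index):
// one-off pass: give every un-migrated node a sentinel value so it appears in the :Entity(code) index
MATCH (e:Entity) WHERE NOT has(e.code)
SET e.code = "none";

// repeat until no properties are set: each batch finds sentinel nodes via the index and migrates them
MATCH (e:Entity) WHERE e.code = "none"
WITH e LIMIT 1000000
SET e.key = e.callerKeyRef, e.code = e.callerRef;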
It might be faster to first assign a new label, say ToDo, to all the nodes that have yet to be migrated:
MATCH (e:Entity)
WHERE NOT HAS (e.code)
SET e:ToDo;
Then, you can iteratively match 1000000 (or whatever) ToDo nodes at a time, removing the ToDo label after migrating each node:
MATCH (e:ToDo)
WITH e
LIMIT 1000000
SET e.key = e.callerKeyRef, e.code = e.callerRef
REMOVE e:ToDo;
