How to delete a value from a ksqlDB table or insert a tombstone value? - ksqldb

How is it possible to mark a row in a ksqlDB table for deletion via the REST API, or at least as a statement in ksqldb-cli?
CREATE TABLE movies (
  title VARCHAR PRIMARY KEY,
  id INT,
  release_year INT
) WITH (
  KAFKA_TOPIC='movies',
  PARTITIONS=1,
  VALUE_FORMAT='JSON'
);
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, 'Aliens', 1986);
The following doesn't work, for obvious reasons, but a DELETE statement doesn't exist in ksqlDB:
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, null, null);
Is there a recommended way to produce a tombstone (null) value, or do I need to write it directly to the underlying topic?

There is a way to do this that's a bit of a workaround. The trick is to use the KAFKA value format to write a tombstone to the underlying topic: with the KAFKA format, a single NULL value column is written as a genuinely null message value, i.e. a tombstone.
Here's an example, using your original DDL.
-- Insert a second row of data
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (42, 'Life of Brian', 1986);
-- Query table
ksql> SET 'auto.offset.reset' = 'earliest';
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
|Aliens |48 |1986 |
Limit Reached
Query terminated
Now declare a new stream that will write to the same Kafka topic using the same key:
CREATE STREAM MOVIES_DELETED (title VARCHAR KEY, DUMMY VARCHAR)
WITH (KAFKA_TOPIC='movies',
VALUE_FORMAT='KAFKA');
Insert a tombstone message:
INSERT INTO MOVIES_DELETED (TITLE,DUMMY) VALUES ('Aliens',CAST(NULL AS VARCHAR));
Query the table again:
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
Examine the underlying topic:
ksql> print movies;
Key format: KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2021/02/22 11:01:05.966 Z, key: Aliens, value: {"ID":48,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:02:00.194 Z, key: Life of Brian, value: {"ID":42,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:04:52.569 Z, key: Aliens, value: <null>, partition: 0
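As a side note, the MOVIES_DELETED stream can be reused for any further deletes, and since ksqlDB sets any column omitted from an INSERT to null, the CAST can be skipped by leaving DUMMY out of the column list. A minimal sketch (this would tombstone the remaining row):
INSERT INTO MOVIES_DELETED (TITLE) VALUES ('Life of Brian');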

Related

KSQLDB: Using CREATE STREAM AS SELECT with Differing KEY SCHEMAS

Here is the description of the problem statement:
STREAM_SUMMARY: A stream with one of the value columns as an ARRAY-of-STRUCTS.
Name : STREAM_SUMMARY
Field | Type
------------------------------------------------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<asessment_id VARCHAR(STRING), institution_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
responses | ARRAY<STRUCT<student_id INTEGER, question_id INTEGER, response VARCHAR(STRING)>>
------------------------------------------------------------------------------------------------------------------------------------------------
STREAM_DETAIL: This is a stream to be created from STREAM_SUMMARY, by "exploding" the array-of-structs into separate rows. Note that the key schema is also different.
Below is the Key and Value schema I want to achieve (end state)...
Name : STREAM_DETAIL
Field | Type
-------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<asessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
student_id | INTEGER
question_id | INTEGER
response | VARCHAR(STRING)
My objective is to create the STREAM_DETAIL from the STREAM_SUMMARY.
I tried the below:
CREATE STREAM STREAM_DETAIL WITH (
KAFKA_TOPIC = 'stream_detail'
) AS
SELECT
STRUCT (
`assessment_id` := "assessment_id",
`student_id` := EXPLODE("responses")->"student_id",
`question_id` := EXPLODE("responses")->"question_id"
)
, "assessment_id"
, "institution_id"
, EXPLODE("responses")->"student_id"
, EXPLODE("responses")->"question_id"
, EXPLODE("responses")->"response"
FROM STREAM_SUMMARY
EMIT CHANGES;
While the SELECT query works fine, the CREATE STREAM returned the following error:
"Key missing from projection."
If I add the ROWKEY column to the SELECT clause in the above statement, things work; however, the key schema of the resulting stream is the same as the original stream's key.
The "Key" schema that I want in the new STREAM is : STRUCT<asessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
Alternatively, I tried creating STREAM_DETAIL by hand (using a plain CREATE STREAM statement, providing key and value SCHEMA_IDs) and later tried the INSERT INTO approach...
INSERT INTO STREAM_DETAIL
SELECT ....
FROM STREAM_SUMMARY
EMIT CHANGES;
The errors were the same.
Can you please guide me on how I can enrich a STREAM but with a different key schema? Note that a new/different key schema is important to me because the underlying topic is synced to a database via a Kafka sink connector, and the sink connector requires the key in this shape for me to be able to do an UPSERT.
I am not able to get past this. Appreciate your help.
You can't change the key of a stream when it is created from another stream.
But there is a different approach to the problem.
What you want is a re-key, and to do that you can use a ksqlDB table. It can be solved like this:
CREATE STREAM IF NOT EXISTS INTERMEDIATE_STREAM_SUMMARY_FLATTNED AS
SELECT
ROWKEY,
EXPLODE(responses) as response
FROM STREAM_SUMMARY;
CREATE TABLE IF NOT EXISTS STREAM_DETAIL AS -- This also creates an underlying topic
SELECT
ROWKEY -> `assessment_id` as `assessment_id`,
response -> `student_id` as `student_id`,
response -> `question_id` as `question_id`,
LATEST_BY_OFFSET(ROWKEY -> `institution_id`, false) as `institution_id`,
LATEST_BY_OFFSET(response -> `response`, false) as `response`
FROM INTERMEDIATE_STREAM_SUMMARY_FLATTNED
GROUP BY ROWKEY -> `assessment_id`, response -> `student_id`, response -> `question_id`;
The key schema will be STRUCT<asessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER>; you can check the Schema Registry or print the topic to validate that. In ksqlDB, DESCRIBE will show you a flat key, but don't panic.
I have used a similar approach and synced the final topic to a database.
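For a quick validation along those lines, a sketch (assuming the CTAS created its backing topic with the default name, STREAM_DETAIL):
-- The declared schema; ksqlDB lists the three key columns flat here
DESCRIBE STREAM_DETAIL;
-- The records actually written; the printed key should carry the three grouping columns
PRINT 'STREAM_DETAIL' FROM BEGINNING LIMIT 5;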

ksqlDB - join streams without nulls

I want to join 2 streams adding only non-null values.
Stream 1:
CREATE STREAM S1 (id BIGINT, rel_id BIGINT) WITH (KAFKA_TOPIC='T1', VALUE_FORMAT='AVRO', PARTITIONS=1);
Stream 2:
CREATE STREAM S2 (s1_rel_id BIGINT, started_at VARCHAR, ended_at VARCHAR) WITH (KAFKA_TOPIC='T2', VALUE_FORMAT='AVRO', PARTITIONS=1);
Next I create new stream with joining:
CREATE STREAM J1 AS
  SELECT S1.id AS ID, S1.rel_id AS REL_ID, started_at, ended_at
  FROM S2
  JOIN S1 WITHIN 1 SECONDS ON S1.rel_id = S2.s1_rel_id
  WHERE S2.ended_at IS NOT NULL
  EMIT CHANGES;
On
SELECT * FROM J1 EMIT CHANGES LIMIT 10;
I receive a table with records containing 'null' in the field 'ended_at'. I tried using INNER JOIN and LEFT JOIN, but as I understand it, an INNER JOIN (or simple JOIN) should ignore events with null values in the field 'ended_at' - https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/#semantics-of-stream-stream-joins
Why do I still see nulls in the records? Thanks for any comments.

Error "Invalid join condition: table-table joins require to join on the primary key of the right input table" on joining two tables on Kafka ksqlDB

I need to create a Kafka topic from a combination of nine other topics, all of them produced by the Debezium PostgreSQL source connector in Avro format. To start, I'm trying (so far unsuccessfully) to combine fields from only two topics.
So, first I create a ksqlDB table based on the "REQUEST" topic:
ksql> CREATE TABLE TB_REQUEST (ID STRUCT<REQUEST_ID BIGINT> PRIMARY KEY)
WITH (KAFKA_TOPIC='REQUEST', FORMAT='AVRO');
And everything seems fine to me:
ksql> DESCRIBE TB_REQUEST;
Name : TB_REQUEST
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<REQUEST_ID BIGINT> (primary key)
BEFORE | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
AFTER | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
Then I create another table from the "EMPLOYEE" topic:
ksql> CREATE TABLE TB_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> PRIMARY KEY)
WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');
Again, everything seems ok.
ksql> DESCRIBE TB_EMPLOYEE;
Name : TB_EMPLOYEE
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<EMPLOYEE_ID INTEGER> (primary key)
BEFORE | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
AFTER | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
But when I try to create my target table by joining the previous ones on the employee ID:
ksql> CREATE TABLE REQUEST_EMPLOYEE AS
SELECT RQ.ID->REQUEST_ID, RQ.AFTER->REQUESTER_ID, RQ.AFTER->STATUS_ID, EM.ID->EMPLOYEE_ID, EM.AFTER->NAME AS REQUESTER
FROM TB_REQUEST RQ
JOIN TB_EMPLOYEE EM ON RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID;
I got the following error:
Could not determine output schema for query due to error: Invalid join condition: table-table joins require to join on the primary key of the right input table. Got RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID.
Statement: CREATE TABLE REQUEST_EMPLOYEE WITH (KAFKA_TOPIC='REQUEST_EMPLOYEE', PARTITIONS=1, REPLICAS=1) AS SELECT
RQ.ID->REQUEST_ID REQUEST_ID,
RQ.AFTER->REQUESTER_ID REQUESTER_ID,
RQ.AFTER->STATUS_ID STATUS_ID,
EM.ID->EMPLOYEE_ID EMPLOYEE_ID,
EM.AFTER->NAME REQUESTER
FROM TB_REQUEST RQ
INNER JOIN TB_EMPLOYEE EM ON ((RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID))
EMIT CHANGES;
Looking at the output of the "DESCRIBE TB_EMPLOYEE" command, it looks to me like "EM.ID->EMPLOYEE_ID" is the right choice. What am I missing?
Thanks in advance.
PS: ksqlDB version is 0.21.0
I think you should use at least one row key in your join statement. In previous versions of ksqlDB the only way to join tables was by row keys; in your current version, 0.21.0, it is also possible using a foreign key.
Check the following example:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.u_id EMIT CHANGES;
Here u_id is defined as the primary key and thus is the row key:
CREATE TABLE users (
  u_id VARCHAR PRIMARY KEY,
  name VARCHAR
) WITH (
  kafka_topic = 'users',
  partitions = 3,
  value_format = 'json'
);
The statement below is equivalent:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.ROWKEY EMIT CHANGES;
Another observation: ksqlDB is treating the key of your TB_EMPLOYEE as STRUCT<EMPLOYEE_ID INTEGER>, not just an INTEGER, so it expects a comparison between structs with the same schema.
You can therefore perform the following steps before creating your table:
CREATE STREAM STREAM_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> KEY)
WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');
CREATE STREAM STREAM_REKEY_EMPLOYEE
AS SELECT ID->EMPLOYEE_ID employee_id, * FROM STREAM_EMPLOYEE
PARTITION BY ID->EMPLOYEE_ID
EMIT CHANGES;
CREATE TABLE TB_EMPLOYEE (employee_id INT PRIMARY KEY)
WITH (KAFKA_TOPIC='STREAM_REKEY_EMPLOYEE', FORMAT='AVRO');
Then use the employee_id field to join; try to use primitive types for your primary keys.
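To complete the picture, here is an untested sketch of the final join, with assumed column names: TB_REQUEST is flattened in the same way, so that the right-hand side of the join is the plain primary key of the re-keyed TB_EMPLOYEE and the left-hand side is a plain value column.
-- Flatten/re-key the request side as well (sketch; column names taken from the DESCRIBE output above)
CREATE STREAM STREAM_REQUEST (ID STRUCT<REQUEST_ID BIGINT> KEY)
WITH (KAFKA_TOPIC='REQUEST', FORMAT='AVRO');
CREATE STREAM STREAM_REKEY_REQUEST
AS SELECT ID->REQUEST_ID request_id, AFTER->REQUESTER_ID requester_id, AFTER->STATUS_ID status_id
FROM STREAM_REQUEST
PARTITION BY ID->REQUEST_ID
EMIT CHANGES;
CREATE TABLE TB_REQUEST_FLAT (request_id BIGINT PRIMARY KEY)
WITH (KAFKA_TOPIC='STREAM_REKEY_REQUEST', FORMAT='AVRO');
-- Foreign-key table-table join on primitive columns; the left table's primary key stays in the projection
CREATE TABLE REQUEST_EMPLOYEE AS
SELECT RQ.request_id, RQ.requester_id, RQ.status_id, EM.employee_id, EM.AFTER->NAME AS requester
FROM TB_REQUEST_FLAT RQ
JOIN TB_EMPLOYEE EM ON RQ.requester_id = EM.employee_id;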

ksql table adds extra characters to rowkey

I have some Kafka topics in Avro format. I created a stream and a table to be able to join them with KSQL, but the result of the join always comes back as null.
While troubleshooting, I found that the key is prepended with some character which depends on the length of the string. I suppose it has something to do with Avro, but I can't find where the problem is.
CREATE TABLE entity_table (Id VARCHAR, Info info)
  WITH (KAFKA_TOPIC = 'pisos',
        VALUE_FORMAT = 'avro',
        KEY = 'Id');
select * from entity_table;
1562839624583 | $99999999999.999999 | 99999999999.510136 | 1
1562839631250 | &999999999990.999999 | 99999999999.510136 | 2
How are you populating the Kafka topic? KSQL currently only supports string keys, so an Avro-serialized key gets read as a raw string, length prefix and all, which is consistent with the extra leading character you're seeing. If you can't change how the topic is populated you could do:
CREATE STREAM entity_src WITH (KAFKA_TOPIC = 'pisos', VALUE_FORMAT='avro');
CREATE STREAM entity_rekey AS SELECT * FROM entity_src PARTITION BY ID;
CREATE TABLE entity_table with (KAFKA_TOPIC='entity_rekey', VALUE_FORMAT='AVRO');
BTW you don't need to specify the schema if you are using Avro.

mass inserting into model in rails, how to auto increment id field?

I have a model for stocks and a model for stock_price_history.
I want to mass insert with this
sqlstatement = "INSERT INTO stock_histories SELECT datapoint1 AS id,
datapoint2 AS `date` ...UNION SELECT datapoint9,10,11,12,13,14,15,16,
UNION SELECT datapoint 17... etc"
ActiveRecord::Base.connection.execute sqlstatement
However, I don't actually want to use datapoint1 AS id. If I leave it blank I get an error that my model has 10 fields and I'm inserting only 9 and that it is missing the primary key.
Is there a way to force an auto increment on the id when inserting by SQL?
Edit: Bonus question because I'm a noob. I am developing on SQLite3 and deploying to Postgres (i.e. Heroku). Will I need to modify the above mass insert statement so it works on a Postgres database?
2nd edit: my initial question had Assets and AssetHistory instead of Stocks and Stock_Histories. I changed it to Stocks / Stock price histories because I thought it was more intuitive to understand. That is why some answers refer to Asset Histories.
You can change your SQL and be more explicit about which fields you're inserting, and leave id out of the list:
insert into asset_histories (date) select datapoint2 as `date` ...etc
Here's a long real example:
jim=# create table test1 (id serial not null, date date not null, name text not null);
NOTICE: CREATE TABLE will create implicit sequence "test1_id_seq" for serial column "test1.id"
CREATE TABLE
jim=# create table test2 (id serial not null, date date not null, name text not null);
NOTICE: CREATE TABLE will create implicit sequence "test2_id_seq" for serial column "test2.id"
CREATE TABLE
jim=# insert into test1 (date, name) values (now(), 'jim');
INSERT 0 1
jim=# insert into test1 (date, name) values (now(), 'joe');
INSERT 0 1
jim=# insert into test1 (date, name) values (now(), 'bob');
INSERT 0 1
jim=# select * from test1;
id | date | name
----+------------+------
1 | 2013-03-14 | jim
2 | 2013-03-14 | joe
3 | 2013-03-14 | bob
(3 rows)
jim=# insert into test2 (date, name) select date, name from test1 where name <> 'jim';
INSERT 0 2
jim=# select * from test2;
id | date | name
----+------------+------
1 | 2013-03-14 | joe
2 | 2013-03-14 | bob
(2 rows)
As you can see, only the selected rows were inserted, and they were assigned new id values in table test2. You'll have to be explicit about all the fields you want to insert, and ensure that the ordering of the insert and the select match.
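Applied to the question's mass insert, a minimal sketch: name the columns explicitly and leave id out so the sequence assigns it. Here closing_price is a made-up column name, and a multi-row VALUES list stands in for the UNION SELECT chain (it works on both SQLite 3.7.11+ and Postgres):
insert into stock_histories (date, closing_price)
values ('2013-03-14', 10.5),
       ('2013-03-15', 11.2);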
Having said all that, you might want to look into the activerecord-import gem, which makes this sort of thing a lot more Railsy. Assuming you have a bunch of new AssetHistory objects (not persisted yet), you could insert them all with:
asset_histories = []
asset_histories << AssetHistory.new date: some_date
asset_histories << AssetHistory.new date: some_other_date
AssetHistory.import asset_histories
That will generate a single efficient insert into the table, and handle the id for you. You'll still need to query some data and construct the objects, which may not be faster than doing it all with raw SQL, but may be a better alternative if you've already got the data in Ruby objects.
