ksqlDB: Using CREATE STREAM AS SELECT with Differing KEY SCHEMAS

Here is the description of the problem statement:
STREAM_SUMMARY: A stream with one of the value columns as an ARRAY-of-STRUCTS.
Name : STREAM_SUMMARY
Field | Type
------------------------------------------------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<assessment_id VARCHAR(STRING), institution_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
responses | ARRAY<STRUCT<student_id INTEGER, question_id INTEGER, response VARCHAR(STRING)>>
------------------------------------------------------------------------------------------------------------------------------------------------
STREAM_DETAIL: This is a stream to be created from STREAM_SUMMARY by "exploding" the array-of-structs into separate rows. Note that the KEY schema is also different.
Below is the Key and Value schema I want to achieve (end state)...
Name : STREAM_DETAIL
Field | Type
-------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
student_id | INTEGER
question_id | INTEGER
response | VARCHAR(STRING)
My objective is to create the STREAM_DETAIL from the STREAM_SUMMARY.
I tried the below:
CREATE STREAM STREAM_DETAIL WITH (
    KAFKA_TOPIC = 'stream_detail'
) AS
SELECT
    STRUCT(
        `assessment_id` := "assessment_id",
        `student_id`    := EXPLODE("responses")->"student_id",
        `question_id`   := EXPLODE("responses")->"question_id"
    )
    , "assessment_id"
    , "institution_id"
    , EXPLODE("responses")->"student_id"
    , EXPLODE("responses")->"question_id"
    , EXPLODE("responses")->"response"
FROM STREAM_SUMMARY
EMIT CHANGES;
While the SELECT query works fine, the CREATE STREAM returned the following error:
"Key missing from projection."
If I add the ROWKEY column to the SELECT clause in the above statement, things work; however, the KEY schema of the resultant STREAM is the same as the original STREAM's key.
The key schema that I want in the new STREAM is: STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
Alternatively, I tried creating STREAM_DETAIL by hand (using a plain CREATE STREAM statement, providing key and value SCHEMA_IDs) and then tried the INSERT INTO approach...
INSERT INTO STREAM_DETAIL
SELECT ....
FROM STREAM_SUMMARY
EMIT CHANGES;
The errors were the same.
Can you please guide me on how to enrich a STREAM while giving it a different key schema? Note that a new/different key schema is important for me, since the underlying topic is synced to a database via a Kafka sink connector. The sink connector requires the key in this shape for me to be able to do an UPSERT.
I am not able to get past this. Appreciate your help.

You can't change the key of a stream when it is created from another stream.
But there is a different approach to the problem.
What you want is a re-key, and for that you need a ksqlDB table. It can be done like this:
CREATE STREAM IF NOT EXISTS INTERMEDIATE_STREAM_SUMMARY_FLATTENED AS
SELECT
    ROWKEY,
    EXPLODE(responses) AS response
FROM STREAM_SUMMARY;

CREATE TABLE IF NOT EXISTS STREAM_DETAIL AS -- this also creates an underlying topic
SELECT
    ROWKEY -> `assessment_id` AS `assessment_id`,
    response -> `student_id` AS `student_id`,
    response -> `question_id` AS `question_id`,
    LATEST_BY_OFFSET(ROWKEY -> `institution_id`, false) AS `institution_id`,
    LATEST_BY_OFFSET(response -> `response`, false) AS `response`
FROM INTERMEDIATE_STREAM_SUMMARY_FLATTENED
GROUP BY ROWKEY -> `assessment_id`, response -> `student_id`, response -> `question_id`;
The key schema will be STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER>; you can check Schema Registry or print the topic to validate that. In ksqlDB, DESCRIBE will show you a flat key, but don't panic.
I have used a similar approach and synced the final topic to a database.
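For reference, here is a sketch (not part of the original answer) of how to validate the result and how the downstream sink could be configured. It assumes the table's backing topic defaults to the table name, STREAM_DETAIL.

-- Check the key columns and print the backing topic to inspect the key:
DESCRIBE STREAM_DETAIL;
PRINT 'STREAM_DETAIL' FROM BEGINNING;

-- If the topic is synced to a database with the Confluent JDBC sink connector,
-- an UPSERT keyed on the three key fields could be configured roughly as below.
-- The connection URL is a placeholder, and the Avro key/value converters are
-- assumed to be configured on the Connect worker.
CREATE SINK CONNECTOR STREAM_DETAIL_SINK WITH (
    'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
    'connection.url'  = 'jdbc:postgresql://db:5432/mydb',
    'topics'          = 'STREAM_DETAIL',
    'insert.mode'     = 'upsert',
    'pk.mode'         = 'record_key',
    'pk.fields'       = 'assessment_id,student_id,question_id'
);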

Related

Error "Invalid join condition: table-table joins require to join on the primary key of the right input table" on joining two tables on Kafka ksqlDB

I need to create a Kafka topic from a combination of nine other topics, all of them produced by the Debezium PostgreSQL source connector in AVRO format. To start, I'm trying (so far unsuccessfully) to combine fields from only two topics.
So, first I create a ksqlDB table based on the "REQUEST" topic:
ksql> CREATE TABLE TB_REQUEST (ID STRUCT<REQUEST_ID BIGINT> PRIMARY KEY)
WITH (KAFKA_TOPIC='REQUEST', FORMAT='AVRO');
And everything seems fine to me:
ksql> DESCRIBE TB_REQUEST;
Name : TB_REQUEST
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<REQUEST_ID BIGINT> (primary key)
BEFORE | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
AFTER | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
Then I create another table from the "EMPLOYEE" topic:
ksql> CREATE TABLE TB_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> PRIMARY KEY)
WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');
Again, everything seems ok.
ksql> DESCRIBE TB_EMPLOYEE;
Name : TB_EMPLOYEE
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<EMPLOYEE_ID INTEGER> (primary key)
BEFORE | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
AFTER | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
But when I try to create my target table by joining the previous ones on the employee id:
ksql> CREATE TABLE REQUEST_EMPLOYEE AS
SELECT RQ.ID->REQUEST_ID, RQ.AFTER->REQUESTER_ID, RQ.AFTER->STATUS_ID, EM.ID->EMPLOYEE_ID, EM.AFTER->NAME AS REQUESTER
FROM TB_REQUEST RQ
JOIN TB_EMPLOYEE EM ON RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID;
I got the following error:
Could not determine output schema for query due to error: Invalid join condition: table-table joins require to join on the primary key of the right input table. Got RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID.
Statement: CREATE TABLE REQUEST_EMPLOYEE WITH (KAFKA_TOPIC='REQUEST_EMPLOYEE', PARTITIONS=1, REPLICAS=1) AS SELECT
RQ.ID->REQUEST_ID REQUEST_ID,
RQ.AFTER->REQUESTER_ID REQUESTER_ID,
RQ.AFTER->STATUS_ID STATUS_ID,
EM.ID->EMPLOYEE_ID EMPLOYEE_ID,
EM.AFTER->NAME REQUESTER
FROM TB_REQUEST RQ
INNER JOIN TB_EMPLOYEE EM ON ((RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID))
EMIT CHANGES;
Looking at the output from the DESCRIBE TB_EMPLOYEE command, it looks to me like "EM.ID->EMPLOYEE_ID" is the right choice. What am I missing?
Thanks in advance.
PS: the ksqlDB version is 0.21.0.
I think you should use at least one row key in your join statement. In previous versions of ksqlDB the only way to join tables was on their row keys; in your current version, 0.21.0, it is also possible to join on a foreign key.
Check the following example:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.u_id EMIT CHANGES;
where u_id is defined as the primary key and is therefore the row key:
CREATE TABLE users (
    u_id VARCHAR PRIMARY KEY,
    name VARCHAR
) WITH (
    kafka_topic = 'users',
    partitions = 3,
    value_format = 'json'
);
The statement below is equivalent:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.ROWKEY EMIT CHANGES;
Another observation: ksqlDB is treating the key of your TB_EMPLOYEE as STRUCT<EMPLOYEE_ID INTEGER>, not just an INTEGER, so it expects a comparison between two structs with the same schema.
You can therefore perform the following steps before creating your table:
CREATE STREAM STREAM_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> KEY)
    WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');

CREATE STREAM STREAM_REKEY_EMPLOYEE AS
    SELECT ID->EMPLOYEE_ID employee_id, *
    FROM STREAM_EMPLOYEE
    PARTITION BY ID->EMPLOYEE_ID
    EMIT CHANGES;

CREATE TABLE TB_EMPLOYEE (employee_id INT PRIMARY KEY)
    WITH (KAFKA_TOPIC='STREAM_REKEY_EMPLOYEE', FORMAT='AVRO');
Then use the employee_id field to join; in general, try to keep your primary keys as primitive types.
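Putting it together, the join from the question could then be rewritten against the re-keyed table. This is only a sketch: it assumes TB_EMPLOYEE is now keyed by the primitive employee_id, and that your ksqlDB version accepts the struct dereference RQ.AFTER->REQUESTER_ID as the foreign-key expression; if it does not, flatten/re-key TB_REQUEST in the same way first.

-- Sketch only: same projection as in the question, joining on the
-- re-keyed table's primitive primary key (employee_id).
CREATE TABLE REQUEST_EMPLOYEE AS
    SELECT
        RQ.ID->REQUEST_ID      AS REQUEST_ID,
        RQ.AFTER->REQUESTER_ID AS REQUESTER_ID,
        RQ.AFTER->STATUS_ID    AS STATUS_ID,
        EM.EMPLOYEE_ID         AS EMPLOYEE_ID,
        EM.AFTER->NAME         AS REQUESTER
    FROM TB_REQUEST RQ
    JOIN TB_EMPLOYEE EM ON RQ.AFTER->REQUESTER_ID = EM.EMPLOYEE_ID;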

Can't insert rows into a SQL Server table that contains triggers (F# SQLProvider)

I am using the SQLProvider library and hitting an issue when creating new records that appears to make the driver unusable.
let foundProductMaybe = query {
    for p in ctx.Dbo.Products do
    where (p.DefaultSupplierSku.Value = pl.supplierSku)
    select (Some p)
    exactlyOneOrDefault
}

match foundProductMaybe with
| Some foundProduct -> updateProduct(foundProduct, pl, ctx)
| None -> addProduct(pl, ctx)

product.Id <- Guid.NewGuid()
product.Code <- "some code"
// .... etc
ctx.SubmitUpdates()
I get the error:
System.Data.SqlClient.SqlException: 'The target table 'dbo.Products' of the DML statement cannot have any enabled triggers if the statement contains an OUTPUT clause without INTO
Is there a workaround for this?
It seems to me this is an issue related to SQL Server itself, not necessarily SQLProvider. Here's an article that discusses the mechanism of this behaviour: https://techcommunity.microsoft.com/t5/sql-server/update-with-output-clause-8211-triggers-8211-and-sqlmoreresults/ba-p/383457
The code that generates the OUTPUT statements in SQLProvider appears to be here: https://github.com/fsprojects/SQLProvider/blob/8afaad203efe2b3b900a2ad1a6d8a35d66ebe40a/src/SQLProvider.Runtime/Providers.MsSqlServer.fs#L370
The OUTPUT clause is generated only if the table has a primary key.
Perhaps you can change the table and replace the primary key with a UNIQUE constraint, which is pretty close in functionality to the PK constraint and should not affect your case.
https://learn.microsoft.com/en-us/sql/relational-databases/tables/create-unique-constraints?view=sql-server-ver15
Unique constraints (and primary keys) are implemented as indexes on the table. Since you use a non-sequential GUID, you might consider ensuring that these indexes are created as NONCLUSTERED.
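A minimal sketch of that change follows; the constraint name and the Id column are assumptions, not taken from the question, so adjust them to your schema.

-- Sketch only: swap the assumed primary key for a NONCLUSTERED UNIQUE constraint.
ALTER TABLE dbo.Products DROP CONSTRAINT PK_Products;

ALTER TABLE dbo.Products
    ADD CONSTRAINT UQ_Products_Id UNIQUE NONCLUSTERED (Id);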

How to delete a value from a ksqlDB table or insert a tombstone value?

How is it possible to mark a row in a ksqlDB table for deletion via the REST API, or at least as a statement in the ksqlDB CLI?
CREATE TABLE movies (
title VARCHAR PRIMARY KEY,
id INT,
release_year INT
) WITH (
KAFKA_TOPIC='movies',
PARTITIONS=1,
VALUE_FORMAT = 'JSON'
);
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, 'Aliens', 1986);
The following doesn't work, for obvious reasons, but a DELETE statement doesn't exist in ksqlDB:
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, null, null);
Is there a recommended way to produce a tombstone (null) value, or do I need to write it directly to the underlying topic?
There is a way to do this that's a bit of a workaround. The trick is to use the KAFKA value format to write a tombstone to the underlying topic.
Here's an example, using your original DDL.
-- Insert a second row of data
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (42, 'Life of Brian', 1986);
-- Query table
ksql> SET 'auto.offset.reset' = 'earliest';
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
|Aliens |48 |1986 |
Limit Reached
Query terminated
Now declare a new stream that will write to the same Kafka topic using the same key:
CREATE STREAM MOVIES_DELETED (title VARCHAR KEY, DUMMY VARCHAR)
WITH (KAFKA_TOPIC='movies',
VALUE_FORMAT='KAFKA');
Insert a tombstone message:
INSERT INTO MOVIES_DELETED (TITLE,DUMMY) VALUES ('Aliens',CAST(NULL AS VARCHAR));
Query the table again:
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
Examine the underlying topic
ksql> print movies;
Key format: KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2021/02/22 11:01:05.966 Z, key: Aliens, value: {"ID":48,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:02:00.194 Z, key: Life of Brian, value: {"ID":42,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:04:52.569 Z, key: Aliens, value: <null>, partition: 0
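Once the tombstone has been written, the helper stream can be dropped if you no longer need it; this is just an optional cleanup step, not part of the original answer. Without the DELETE TOPIC clause, dropping the stream keeps the underlying movies topic intact.

DROP STREAM MOVIES_DELETED;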

ksql table adds extra characters to rowkey

I have some Kafka topics in Avro format. I created a stream and a table to be able to join them with KSQL, but the result of the join always comes back as null.
While troubleshooting, I found that the key is prepended with some character, which depends on the length of the string. I suppose it has something to do with Avro, but I can't find where the problem is.
CREATE TABLE entity_table ( Id VARCHAR, Info info )
WITH
(
KAFKA_TOPIC = 'pisos',
VALUE_FORMAT='avro',
KEY = 'Id');
select * from entity_table;
1562839624583 | $99999999999.999999 | 99999999999.510136 | 1
1562839631250 | &999999999990.999999 | 99999999999.510136 | 2
How are you populating the Kafka topic? KSQL currently only supports string keys. If you can't change how the topic is populated, you could do:
CREATE STREAM entity_src WITH (KAFKA_TOPIC = 'pisos', VALUE_FORMAT='avro');
CREATE STREAM entity_rekey AS SELECT * FROM entity_src PARTITION BY ID;
CREATE TABLE entity_table with (KAFKA_TOPIC='entity_rekey', VALUE_FORMAT='AVRO');
BTW you don't need to specify the schema if you are using Avro.
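To confirm that the re-keyed topic now carries a plain string key, you can print it. This is a sketch; the backing topic name is assumed to default to the upper-cased stream name created above.

PRINT 'ENTITY_REKEY' FROM BEGINNING;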

Newly assigned Sequence is not working

In PostgreSQL, I created a new table and assigned a new sequence to the id column. If I insert a record from the PostgreSQL console it works, but when I try to import a record from Rails, it raises an exception saying it is unable to find the associated sequence.
Here is the table:
\d+ user_messages;
Table "public.user_messages"
Column | Type | Modifiers | Storage | Description
-------------+-----------------------------+------------------------------------------------------------+----------+-------------
id | integer | not null default nextval('new_user_messages_id'::regclass) | plain |
But when I try to get the sequence with the SQL query which Rails uses, it returns NULL:
select pg_catalog.pg_get_serial_sequence('user_messages', 'id');
pg_get_serial_sequence
------------------------
(1 row)
The error being raised by Rails is:
UserMessage.import [UserMessage.new]
NoMethodError: undefined method `split' for nil:NilClass
from /app/vendor/bundle/ruby/1.9.1/gems/activerecord-3.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:910:in `default_sequence_name'
This problem only occurs when I use the ActiveRecord extension for importing records in bulk; single records get saved fine through ActiveRecord.
How do I fix it?
I think your problem is that you set all this up by hand rather than by using a serial column. When you use a serial column, PostgreSQL will create the sequence, set up the appropriate default value, and ensure that the sequence is owned by the table and column in question. From the fine manual:
pg_get_serial_sequence(table_name, column_name)
get name of the sequence that a serial or bigserial column uses
But you're not using serial or bigserial so pg_get_serial_sequence won't help.
You can remedy this by doing:
alter sequence new_user_messages_id owned by user_messages.id
I'm not sure if this is a complete solution and someone (hi Erwin) will probably fill in the missing bits.
You can save yourself some trouble here by using serial as the data type of your id column. That will create and hook up the sequence for you.
For example:
=> create sequence seq_test_id;
=> create table seq_test (id integer not null default nextval('seq_test_id'::regclass));
=> select pg_catalog.pg_get_serial_sequence('seq_test','id');
pg_get_serial_sequence
------------------------
(1 row)
=> alter sequence seq_test_id owned by seq_test.id;
=> select pg_catalog.pg_get_serial_sequence('seq_test','id');
pg_get_serial_sequence
------------------------
public.seq_test_id
(1 row)
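For comparison, here is the serial shortcut mentioned above, shown on a fresh table. This is a sketch; the generated sequence simply follows PostgreSQL's table_column_seq naming convention.
=> create table seq_test2 (id serial primary key);  -- serial creates and owns the sequence automatically
=> select pg_catalog.pg_get_serial_sequence('seq_test2', 'id');
pg_get_serial_sequence
------------------------
public.seq_test2_id_seq
(1 row)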

Resources