ksql table adds extra characters to rowkey - avro

I have some Kafka topics in Avro format. I created a stream and a table so that I could join them with KSQL, but the result of the join always comes back as null.
While troubleshooting, I found that the key is prepended with some character, which depends on the length of the string. I suppose it has something to do with Avro, but I can't find where the problem is.
CREATE TABLE entity_table (Id VARCHAR, Info info)
WITH (
  KAFKA_TOPIC = 'pisos',
  VALUE_FORMAT = 'avro',
  KEY = 'Id'
);
select * from entity_table;
1562839624583 | $99999999999.999999 | 99999999999.510136 | 1
1562839631250 | &999999999990.999999 | 99999999999.510136 | 2

How are you populating the Kafka topic? KSQL currently supports only string keys. If you can't change how the topic is populated, you could do:
CREATE STREAM entity_src WITH (KAFKA_TOPIC = 'pisos', VALUE_FORMAT='avro');
CREATE STREAM entity_rekey AS SELECT * FROM entity_src PARTITION BY ID;
CREATE TABLE entity_table with (KAFKA_TOPIC='entity_rekey', VALUE_FORMAT='AVRO');
BTW you don't need to specify the schema if you are using Avro.
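To verify, you can print the re-keyed topic and check that the key now shows up as a plain string (assuming the stream's backing topic got the default name ENTITY_REKEY):
PRINT 'ENTITY_REKEY' FROM BEGINNING LIMIT 3;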

Related

KSQLDB: Using CREATE STREAM AS SELECT with Differing KEY SCHEMAS

Here is the description of the problem statement:
STREAM_SUMMARY: A stream with one of the value columns as an ARRAY-of-STRUCTS.
Name : STREAM_SUMMARY
Field | Type
------------------------------------------------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<assessment_id VARCHAR(STRING), institution_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
responses | ARRAY<STRUCT<student_id INTEGER, question_id INTEGER, response VARCHAR(STRING)>>
------------------------------------------------------------------------------------------------------------------------------------------------
STREAM_DETAIL: This is a stream to be created from STREAM_SUMMARY by "exploding" the array-of-structs into separate rows. Note that the KEY schema is also different.
Below is the Key and Value schema I want to achieve (end state)...
Name : STREAM_DETAIL
Field | Type
-------------------------------------------------------------------------------------------------------
ROWKEY | STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
assessment_id | VARCHAR(STRING)
institution_id | INTEGER
student_id | INTEGER
question_id | INTEGER
response | VARCHAR(STRING)
My objective is to create the STREAM_DETAIL from the STREAM_SUMMARY.
I tried the below:
CREATE STREAM STREAM_DETAIL WITH (
KAFKA_TOPIC = 'stream_detail'
) AS
SELECT
STRUCT (
`assessment_id` := "assessment_id",
`student_id` := EXPLODE("responses")->"student_id",
`question_id` := EXPLODE("responses")->"question_id"
)
, "assessment_id"
, "institution_id"
, EXPLODE("responses")->"student_id"
, EXPLODE("responses")->"question_id"
, EXPLODE("responses")->"response"
FROM STREAM_SUMMARY
EMIT CHANGES;
While the SELECT query works fine, the CREATE STREAM returned the following error:
"Key missing from projection."
If I add the ROWKEY column to the SELECT clause in the above statement, things work; however, the KEY schema of the resultant STREAM is the same as the original STREAM's key.
The key schema that I want in the new STREAM is: STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER> (key)
Alternatively, I tried creating STREAM_DETAIL by hand (using a plain CREATE STREAM statement, providing key and value SCHEMA_IDs). Later I tried the INSERT INTO approach...
INSERT INTO STREAM_DETAIL
SELECT ....
FROM STREAM_SUMMARY
EMIT CHANGES;
The errors were the same.
Can you please guide me on how I can enrich a STREAM but with a different key schema? Note that a new/different key schema is important for me, since I use the underlying topic to sync to a database via a Kafka sink connector. The sink connector requires the key schema in this form for me to be able to do an UPSERT.
I am not able to get past this. Appreciate your help.
You can't change the key of a stream when it is created from another stream, but there is a different approach to the problem.
What you want is to re-key, and to do that you need a ksqlDB table. It can be solved like this:
CREATE STREAM IF NOT EXISTS INTERMEDIATE_STREAM_SUMMARY_FLATTENED AS
SELECT
ROWKEY,
EXPLODE(responses) as response
FROM STREAM_SUMMARY;
CREATE TABLE IF NOT EXISTS STREAM_DETAIL AS -- this also creates an underlying topic
SELECT
ROWKEY -> `assessment_id` as `assessment_id`,
response -> `student_id` as `student_id`,
response -> `question_id` as `question_id`,
LATEST_BY_OFFSET(ROWKEY -> `institution_id`, false) as `institution_id`,
LATEST_BY_OFFSET(response -> `response`, false) as `response`
FROM INTERMEDIATE_STREAM_SUMMARY_FLATTENED
GROUP BY ROWKEY -> `assessment_id`, response -> `student_id`, response -> `question_id`;
The key schema will be STRUCT<assessment_id VARCHAR(STRING), student_id INTEGER, question_id INTEGER>; you can check the Schema Registry or print the topic to validate that. In ksqlDB, DESCRIBE on the table will show you a flat key, but don't panic.
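For example, to print the topic (assuming the table's backing topic took the default name STREAM_DETAIL):
PRINT 'STREAM_DETAIL' FROM BEGINNING LIMIT 5;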
I have used a similar approach and synced the final topic to a database.

Error "Invalid join condition: table-table joins require to join on the primary key of the right input table" on joining two tables on Kafka ksqlDB

I need to create a Kafka topic from a combination of nine other topics, all of them produced by the Debezium PostgreSQL source connector, in Avro format. To start, I'm trying (so far unsuccessfully) to combine fields from only two topics.
So, first I create a ksqlDB table based on the "REQUEST" topic:
ksql> CREATE TABLE TB_REQUEST (ID STRUCT<REQUEST_ID BIGINT> PRIMARY KEY)
WITH (KAFKA_TOPIC='REQUEST', FORMAT='AVRO');
And everything seems fine to me:
ksql> DESCRIBE TB_REQUEST;
Name : TB_REQUEST
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<REQUEST_ID BIGINT> (primary key)
BEFORE | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
AFTER | STRUCT<REQUEST_ID BIGINT, REQUESTER_ID INTEGER, STATUS_ID>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
Then I create another table from the "EMPLOYEE" topic:
ksql> CREATE TABLE TB_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> PRIMARY KEY)
WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');
Again, everything seems ok.
ksql> DESCRIBE TB_EMPLOYEE;
Name : TB_EMPLOYEE
Field | Type
-----------------------------------------------------------------------------------------------------------------------
ID | STRUCT<EMPLOYEE_ID INTEGER> (primary key)
BEFORE | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
AFTER | STRUCT<EMPLOYEE_ID INTEGER, NAME VARCHAR(STRING), HIRING_DATE DATE>
SOURCE | STRUCT<VERSION VARCHAR(STRING), CONNECTOR VARCHAR(STRING), NAME VARCHAR(STRING), TS_MS BIGINT, SNAPSHOT VARCHAR(STRING), DB VARCHAR(STRING), SEQUENCE VARCHAR(STRING), SCHEMA VARCHAR(STRING), TABLE VARCHAR(STRING), TXID BIGINT, LSN BIGINT, XMIN BIGINT>
OP | VARCHAR(STRING)
TS_MS | BIGINT
TRANSACTION | STRUCT<ID VARCHAR(STRING), TOTAL_ORDER BIGINT, DATA_COLLECTION_ORDER BIGINT>
-----------------------------------------------------------------------------------------------------------------------
For runtime statistics and query details run: DESCRIBE <Stream,Table> EXTENDED;
But when trying to create my target table by joining the previous ones on the employee id:
ksql> CREATE TABLE REQUEST_EMPLOYEE AS
SELECT RQ.ID->REQUEST_ID, RQ.AFTER->REQUESTER_ID, RQ.AFTER->STATUS_ID, EM.ID->EMPLOYEE_ID, EM.AFTER->NAME AS REQUESTER
FROM TB_REQUEST RQ
JOIN TB_EMPLOYEE EM ON RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID;
I got the following error:
Could not determine output schema for query due to error: Invalid join condition: table-table joins require to join on the primary key of the right input table. Got RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID.
Statement: CREATE TABLE REQUEST_EMPLOYEE WITH (KAFKA_TOPIC='REQUEST_EMPLOYEE', PARTITIONS=1, REPLICAS=1) AS SELECT
RQ.ID->REQUEST_ID REQUEST_ID,
RQ.AFTER->REQUESTER_ID REQUESTER_ID,
RQ.AFTER->STATUS_ID STATUS_ID,
EM.ID->EMPLOYEE_ID EMPLOYEE_ID,
EM.AFTER->NAME REQUESTER
FROM TB_REQUEST RQ
INNER JOIN TB_EMPLOYEE EM ON ((RQ.AFTER->REQUESTER_ID = EM.ID->EMPLOYEE_ID))
EMIT CHANGES;
Looking at the output of the "DESCRIBE TB_EMPLOYEE" command, it looks to me like "EM.ID->EMPLOYEE_ID" is the right choice. What am I missing?
Thanks in advance.
PS: the ksqlDB version is 0.21.0
I think you should use at least one row key in your join statement. In previous versions of ksqlDB the only way to join tables was by rowkeys; in your current version, 0.21.0, it is possible to use a foreign key.
Check the following example:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.u_id EMIT CHANGES;
where u_id is defined as the primary key and is thus the rowkey:
CREATE TABLE users (
u_id VARCHAR PRIMARY KEY,
name VARCHAR
) WITH (
kafka_topic = 'users',
partitions = 3,
value_format = 'json'
);
The statement below is equivalent:
CREATE TABLE orders_with_users AS
SELECT * FROM orders JOIN users ON orders.u_id = users.ROWKEY EMIT CHANGES;
Another observation: ksqlDB considers the key of your TB_EMPLOYEE to be STRUCT<EMPLOYEE_ID INTEGER>, not just an INTEGER, so it expects a comparison between structs with the same schema.
You can therefore perform the following steps before creating your table.
CREATE STREAM STREAM_EMPLOYEE (ID STRUCT<EMPLOYEE_ID INT> KEY)
WITH (KAFKA_TOPIC='EMPLOYEE', FORMAT='AVRO');
CREATE STREAM STREAM_REKEY_EMPLOYEE
AS SELECT ID->EMPLOYEE_ID employee_id, * FROM STREAM_EMPLOYEE
PARTITION BY ID->EMPLOYEE_ID
EMIT CHANGES;
CREATE TABLE TB_EMPLOYEE (employee_id INT PRIMARY KEY)
WITH (KAFKA_TOPIC='STREAM_REKEY_EMPLOYEE', FORMAT='AVRO');
And use the employee_id field to join; in general, try to use primitive types for your primary keys.
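With TB_EMPLOYEE keyed by a primitive, the join from the question should now be expressible as a foreign-key join. Below is a sketch, not a tested statement: TB_REQUEST_FLAT is a hypothetical intermediate table, added because some ksqlDB releases expect a plain column (not a struct dereference) on the left side of a foreign-key join.
-- Flatten the requester id into its own column; the primary key (ID) is kept.
CREATE TABLE TB_REQUEST_FLAT AS
SELECT
  ID,
  ID->REQUEST_ID      AS REQUEST_ID,
  AFTER->REQUESTER_ID AS REQUESTER_ID,
  AFTER->STATUS_ID    AS STATUS_ID
FROM TB_REQUEST;
-- Foreign-key join: a plain INT column against the right table's INT primary key.
CREATE TABLE REQUEST_EMPLOYEE AS
SELECT RQ.REQUEST_ID, RQ.REQUESTER_ID, RQ.STATUS_ID, EM.AFTER->NAME AS REQUESTER
FROM TB_REQUEST_FLAT RQ
JOIN TB_EMPLOYEE EM ON RQ.REQUESTER_ID = EM.EMPLOYEE_ID;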

How to delete a value from ksqldb table or insert a tombstone value?

How is it possible to mark a row in a ksqlDB table for deletion via the REST API, or at least with a statement in ksqldb-cli?
CREATE TABLE movies (
title VARCHAR PRIMARY KEY,
id INT,
release_year INT
) WITH (
KAFKA_TOPIC='movies',
PARTITIONS=1,
VALUE_FORMAT = 'JSON'
);
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, 'Aliens', 1986);
A DELETE statement doesn't exist in ksqlDB, and the following doesn't work, for obvious reasons:
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, null, null);
Is there a recommended way to write a tombstone null value, or do I need to write it directly to the underlying topic?
There is a way to do this that's a bit of a workaround. The trick is to use the KAFKA value format to write a tombstone to the underlying topic.
Here's an example, using your original DDL.
-- Insert a second row of data
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (42, 'Life of Brian', 1986);
-- Query table
ksql> SET 'auto.offset.reset' = 'earliest';
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
|Aliens |48 |1986 |
Limit Reached
Query terminated
Now declare a new stream that will write to the same Kafka topic using the same key:
CREATE STREAM MOVIES_DELETED (title VARCHAR KEY, DUMMY VARCHAR)
WITH (KAFKA_TOPIC='movies',
VALUE_FORMAT='KAFKA');
Insert a tombstone message:
INSERT INTO MOVIES_DELETED (TITLE,DUMMY) VALUES ('Aliens',CAST(NULL AS VARCHAR));
Query the table again:
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
Examine the underlying topic:
ksql> print movies;
Key format: KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2021/02/22 11:01:05.966 Z, key: Aliens, value: {"ID":48,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:02:00.194 Z, key: Life of Brian, value: {"ID":42,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:04:52.569 Z, key: Aliens, value: <null>, partition: 0
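Since the question also asks about the REST API: the same tombstone INSERT can be submitted to the server's /ksql endpoint (a sketch, assuming a ksqlDB server on localhost:8088):
curl -X POST http://localhost:8088/ksql \
  -H "Content-Type: application/vnd.ksql.v1+json" \
  -d "{\"ksql\": \"INSERT INTO MOVIES_DELETED (TITLE, DUMMY) VALUES ('Aliens', CAST(NULL AS VARCHAR));\", \"streamsProperties\": {}}"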

KSQL queries have strange keys in front of the key value

I have created a stream like this:
CREATE STREAM TEST1 WITH (KAFKA_TOPIC='TEST_1',VALUE_FORMAT='AVRO');
I then query the stream like this via CLI:
SELECT * FROM TEST1;
The results are looking like this:
1571225518167 | \u0000\u0000\u0000\u0000\u0001\u0006key | 7 | 7 | blue
I wonder why the key is formatted like this. Is my query somehow wrong? The value should be like this:
1571225518167 | key | 7 | 7 | blue
Your key is in Avro format, which KSQL doesn't support yet.
If you have control over the data producer, write the key in string format (e.g. with Kafka Connect, use org.apache.kafka.connect.storage.StringConverter). If not, and you need to use the key, e.g. for driving a KSQL table, you'd need to re-key the data using KSQL:
CREATE STREAM TEST1_REKEY AS SELECT * FROM TEST1 PARTITION BY my_key_col;
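If you then need a table driven by that key, you can declare it over the re-keyed topic (a sketch: TEST1_REKEY is the default topic name for the stream above, my_key_col stands for whichever column you partitioned by, and the schema is omitted because the values are Avro):
CREATE TABLE TEST1_TABLE
  WITH (KAFKA_TOPIC='TEST1_REKEY', VALUE_FORMAT='AVRO', KEY='my_key_col');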

Talend csv to relational db tables : foreign key setting

I'm a Talend beginner and have searched for this simple problem; many have posted about the same problem on the web, but no solution has appeared...
I have a CSV file with 50 fields that I want to load into a three-table relational database with Talend. I did a tMap and everything is OK except for the foreign keys: I don't know how to set them.
Here is my job
Here is my tMap
I hope someone can give me a simple, exact solution.
Cheers
Pascal
You can do it in two passes:

(CSV) -> (tMap) -> (organisation_output)
    |
OnSubjobOk
    |
         (organisation_input)
              |
              v
(CSV) -> (tMap) -> (country and address outputs)
Do an INNER JOIN in the second tMap on a column that has a unique value for each row.
Then load the 'ImpID' of your organisation input into the 'ImpID' column of the two other output tables.
