ksqlDB - join streams without nulls

I want to join 2 streams adding only non-null values.
Stream 1:
CREATE STREAM S1 (id BIGINT, rel_id BIGINT) WITH (KAFKA_TOPIC='T1', VALUE_FORMAT='AVRO', PARTITIONS=1);
Stream 2:
CREATE STREAM S2 (s1_rel_id BIGINT, started_at VARCHAR, ended_at VARCHAR) WITH (KAFKA_TOPIC='T2', VALUE_FORMAT='AVRO', PARTITIONS=1);
Next I create a new stream with a join:
CREATE STREAM J1 AS SELECT S1.id AS ID, S1.rel_id AS REL_ID, started_at, ended_at FROM S2 JOIN S1 WITHIN 1 SECONDS ON S1.rel_id = S2.s1_rel_id WHERE S2.ended_at IS NOT NULL EMIT CHANGES;
On
SELECT * FROM J1 EMIT CHANGES LIMIT 10;
I receive a table with records containing 'null' in the field 'ended_at'. I tried using INNER JOIN and LEFT JOIN. But as I understand it, the INNER JOIN (or a simple JOIN) should ignore events with null values in the field 'ended_at' - https://docs.ksqldb.io/en/latest/developer-guide/joins/join-streams-and-tables/#semantics-of-stream-stream-joins
Why do I still see nulls in the records? Thanks for any comments.

Snowflake: Joining a Table with Effective Dates and older records are showing NULL

Summary:
In Snowflake I have a table that records the maximum quantity of an item, which changes every so often. I want to join in the maximum quantity that was in effect on a given date (EFFECTIVE_DATE). This is the most basic example; in my real table, items also "expire" when they are removed.
CREATE OR REPLACE TABLE ITEM
(
Item VARCHAR(10),
Quantity Number(5,0),
EFFECTIVE_DATE DATE
)
;
CREATE OR REPLACE TABLE REPORT
(
INVOICE_DATE DATE,
ITEM VARCHAR(10)
)
;
INSERT INTO REPORT
VALUES
('2021-02-01', '100'),
('2021-09-10', '100')
;
INSERT INTO ITEM
VALUES
('100', '10', '2021-01-01'),
('101', '15', '2021-01-01'),
('100', '5', '2021-09-01')
;
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
Returns
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,5,2021-09-01
2021-09-10,100,NULL,NULL,NULL
How do I fix this so I no longer get NULL entries on my join?
Thank you for reading this!
I am hoping to get a result like this
INVOICE_DATE,ITEM,ITEM,QUANTITY,EFFECTIVE_DATE
2021-02-01,100,100,10,2021-01-01
2021-09-10,100,100,5,2021-09-01
The issue is with your data and your expectations. Your query is this:
SELECT * FROM REPORT t1
LEFT JOIN
(
SELECT * FROM ITEM
QUALIFY ROW_NUMBER() OVER (PARTITION BY ITEM ORDER BY EFFECTIVE_DATE desc) = 1
) t2 on t1.ITEM = t2.ITEM AND t1.INVOICE_DATE <= t2.EFFECTIVE_DATE
;
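which requires that the INVOICE_DATE be less than or equal to the EFFECTIVE_DATE of the ITEM. This isn't the case, though: 2021-09-10 is greater than 2021-09-01, so you don't get a join hit, which is why you get NULLs. It's also why your other record doesn't match your expectations: the QUALIFY subquery keeps only the latest row per item (2021-09-01), and since 2021-02-01 <= 2021-09-01, that invoice joins to the 2021-09-01 row rather than the 2021-01-01 row.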
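A minimal sketch of one way to get the expected result, run against SQLite through Python's sqlite3 module as a stand-in for Snowflake (an assumption made only so the example is runnable; dates are stored as ISO text here). Instead of keeping just the latest ITEM row, the join picks, for each report row, the most recent ITEM row whose EFFECTIVE_DATE is on or before the INVOICE_DATE. The aliases r, i and i2 are illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ITEM (ITEM TEXT, QUANTITY INTEGER, EFFECTIVE_DATE TEXT);
CREATE TABLE REPORT (INVOICE_DATE TEXT, ITEM TEXT);
INSERT INTO REPORT VALUES ('2021-02-01', '100'), ('2021-09-10', '100');
INSERT INTO ITEM VALUES
  ('100', 10, '2021-01-01'),
  ('101', 15, '2021-01-01'),
  ('100', 5,  '2021-09-01');
""")

# For each report row, join the ITEM row with the latest EFFECTIVE_DATE
# that is not after the INVOICE_DATE.
rows = conn.execute("""
SELECT r.INVOICE_DATE, r.ITEM, i.QUANTITY, i.EFFECTIVE_DATE
FROM REPORT r
LEFT JOIN ITEM i
  ON i.ITEM = r.ITEM
 AND i.EFFECTIVE_DATE = (
       SELECT MAX(i2.EFFECTIVE_DATE)
       FROM ITEM i2
       WHERE i2.ITEM = r.ITEM
         AND i2.EFFECTIVE_DATE <= r.INVOICE_DATE)
ORDER BY r.INVOICE_DATE
""").fetchall()

for row in rows:
    print(row)
```

The comparison direction is the key change: the effective date has to be on or before the invoice date, not the other way around.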

How to delete a value from ksqldb table or insert a tombstone value?

How is it possible to mark a row in a ksqlDB table for deletion via the REST API, or at least with a statement in ksqldb-cli?
CREATE TABLE movies (
title VARCHAR PRIMARY KEY,
id INT,
release_year INT
) WITH (
KAFKA_TOPIC='movies',
PARTITIONS=1,
VALUE_FORMAT = 'JSON'
);
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, 'Aliens', 1986);
This doesn't work for obvious reasons, but a DELETE statement doesn't exist in ksqlDB:
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (48, null, null);
Is there a way to create a recommended tombstone null value or do I need to write it directly to the underlying topic?
There is a way to do this that's a bit of a workaround. The trick is to use the KAFKA value format to write a tombstone to the underlying topic.
Here's an example, using your original DDL.
-- Insert a second row of data
INSERT INTO MOVIES (ID, TITLE, RELEASE_YEAR) VALUES (42, 'Life of Brian', 1986);
-- Query table
ksql> SET 'auto.offset.reset' = 'earliest';
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
|Aliens |48 |1986 |
Limit Reached
Query terminated
Now declare a new stream that will write to the same Kafka topic using the same key:
CREATE STREAM MOVIES_DELETED (title VARCHAR KEY, DUMMY VARCHAR)
WITH (KAFKA_TOPIC='movies',
VALUE_FORMAT='KAFKA');
Insert a tombstone message:
INSERT INTO MOVIES_DELETED (TITLE,DUMMY) VALUES ('Aliens',CAST(NULL AS VARCHAR));
Query the table again:
ksql> select * from movies emit changes limit 2;
+--------------------------------+--------------------------------+--------------------------------+
|TITLE |ID |RELEASE_YEAR |
+--------------------------------+--------------------------------+--------------------------------+
|Life of Brian |42 |1986 |
Examine the underlying topic:
ksql> print movies;
Key format: KAFKA_STRING
Value format: JSON or KAFKA_STRING
rowtime: 2021/02/22 11:01:05.966 Z, key: Aliens, value: {"ID":48,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:02:00.194 Z, key: Life of Brian, value: {"ID":42,"RELEASE_YEAR":1986}, partition: 0
rowtime: 2021/02/22 11:04:52.569 Z, key: Aliens, value: <null>, partition: 0

Join procedures only once on Firebird

I'm trying to left join two stored procedures in a Firebird query.
In my example data the first returns 70 records, the second just 1 record.
select
--...
from MYSP1('ABC', 123) s1
left join MYSP2('DEF', 456) s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB
The problem is performance: it takes 10 seconds, while each procedure takes less than 1 second on its own. I suspect that the procedures are run multiple times instead of just once. It would make sense to execute them just once, because I pass fixed parameters to them.
Is there a way to force Firebird to execute each procedure once and then join their results?
Since it seems there is no way, I solved this issue by running this query inside a new stored procedure, where I cache all results from MYSP2 into a global temporary table and join MYSP1 with the temporary table.
This is the temporary table definition:
create global temporary table MY_TEMP_TABLE
(
FIELDA varchar(3) not null,
FIELDB smallint not null,
FIELDC varchar(10) not null
);
This is the stored procedure body:
--cache MYSP2 results
delete from MY_TEMP_TABLE;
insert into MY_TEMP_TABLE
select *
from MYSP2('DEF', 456)
;
--join data
for
select
--...
from MYSP1('ABC', 123) s1
left join MY_TEMP_TABLE s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB
into
--...
do
suspend;
But if there is another solution without temporary tables it would be great!
Maybe this can help:
with MYSP2W as (select * from MYSP2('DEF', 456))
select
--...
from MYSP1('ABC', 123) s1
left join MYSP2W s2
on s1.FIELDA = s2.FIELDA
and s1.FIELDB = s2.FIELDB

Left outer join with 3 tables and subquery

Sorry for the late response.
For a key in table A, there may be 2 or more records present in tables B and C. That is, another column in these tables holds a date value, which makes the keys unique. I want to extract the record that has the maximum date value, and that's why I am using the MAX function. I know that the subquery I have coded should not be included in the ON clause, since it would do the filtering before the join. So ultimately I want to know how to express the MAX condition in the query.
Example:
Table A
Key - AAAAA
Table B:
Record 1
Key - AAAAA
Date - 2017-10-01
Record 2
Key - AAAAA
Date - 2017-10-05
I want only the record AAAAA/2017-10-05 to be selected from table B.
Basically records from table A where A.c3 = 'Y' should be extracted first (assume it gives 500 records)
Then join these 500 records with tables B and C (left outer, to have all the matching records and the non-matching records should have nulls in the columns from the tables B and C)
In tables B and C, if more than one record is present with different dates, the record with the maximum date should be extracted.
Hence final output should contain 500 records.
This is all you need for what you describe
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y'
These lines are causing your problem, basically turning your outer joins into inner joins:
AND B.C3 = (SELECT MAX(B3) FROM TABLE2 T1
WHERE T1.B1 = B.B1)
AND C.C3 = (SELECT MAX(C3) FROM TABLE3 T1
WHERE T1.C1 = C.C1)
If there's no match in B or C, then B.C3 and/or C.C3 will be NULL, and NULL can't be = to anything (or <> to anything, for that matter), so the WHERE clause discards those rows.
What are you trying to accomplish with the above that you've not included in the question?
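To illustrate the NULL point above, here is a small sketch using SQLite through Python's sqlite3 module (an assumption made only so it can be run; the two-table schema, the B3 date column and the sample rows are hypothetical, loosely modeled on the example in the question). It compares the max-date filter placed in the WHERE clause against the same filter placed in the ON clause.

```python
import sqlite3

# Hypothetical data: key AAAAA has two dated rows in TABLE2, BBBBB has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE1 (A1 TEXT, C3 TEXT);
CREATE TABLE TABLE2 (B1 TEXT, B3 TEXT);
INSERT INTO TABLE1 VALUES ('AAAAA', 'Y'), ('BBBBB', 'Y');
INSERT INTO TABLE2 VALUES ('AAAAA', '2017-10-01'), ('AAAAA', '2017-10-05');
""")

# Filter in WHERE: for BBBBB, B.B3 is NULL after the outer join, the
# comparison is never true, and the row is dropped (acts like an inner join).
where_rows = conn.execute("""
SELECT A.A1, B.B3
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B ON A.A1 = B.B1
WHERE A.C3 = 'Y'
  AND B.B3 = (SELECT MAX(T.B3) FROM TABLE2 T WHERE T.B1 = B.B1)
ORDER BY A.A1
""").fetchall()

# Filter in ON: only the max-date row of TABLE2 can match, but unmatched
# TABLE1 rows are still kept with NULLs.
on_rows = conn.execute("""
SELECT A.A1, B.B3
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
  ON A.A1 = B.B1
 AND B.B3 = (SELECT MAX(T.B3) FROM TABLE2 T WHERE T.B1 = A.A1)
WHERE A.C3 = 'Y'
ORDER BY A.A1
""").fetchall()

print(where_rows)
print(on_rows)
```

With the filter in the ON clause, every qualifying row of TABLE1 survives, which is what the "500 records in, 500 records out" requirement asks for.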
Just do it?
SELECT A.A1, A.A2, B.B1, B.B2, C.C1, C.C2
FROM TABLE1 A
LEFT OUTER JOIN TABLE2 B
ON A.A1 = B.B1
LEFT OUTER JOIN TABLE3 C
ON A.A1 = C.C1
WHERE A.C3 = 'Y' and (B.B1 is null or C.C1 is null)

Hive Join returning zero records

I have two Hive tables and I am trying to join both of them. The tables are not clustered or partitioned by any field. Though the tables contain records for common key fields, the join query always returns 0 records. All the data types are 'string' data types.
The join query is simple and looks something like below
select count(*) cnt
from
fsr.xref_1 A join
fsr.ipfile_1 B
on
(
A.co_no = B.co_no
)
;
Any idea what could be going wrong? I have just one record (same value) in both the tables.
Below are my table definitions
CREATE TABLE xref_1
(
co_no string
)
clustered by (co_no) sorted by (co_no asc) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
CREATE TABLE ipfile_1
(
co_no string
)
clustered by (co_no) sorted by (co_no asc) into 10 buckets
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Hi, you are using a Star Schema Join. Please write your query like this:
SELECT COUNT(*) cnt FROM A a JOIN B b ON (a.key1 = b.key1);
If you still have the issue, then use MAPJOIN:
set hive.auto.convert.join=true;
select count(*) from A join B on (key1 = key2)
Please see Link for more detail.
