Can I PARTITION BY two columns in KSQL - ksqldb

I have two topics: users (contains all the user info) and transactions (contains all the transactions made by the users, including the sender and receiver IDs). All of my topics' data is nested.
The first thing I did was CREATE STREAM; then I CREATED another STREAM to rename those nested fields, because PARTITION BY does not accept nested fields. Everything works great. My question is: I want to partition transactions by sender and receiver ID so I can join it with users. Does KSQL accept PARTITION BY with two columns? Do I need to PARTITION BY two columns to get this working, or do I just need to partition by either sender or receiver?
I have tried this but got an error back; I also tried adding PARTITION BY (sender, receiver) at the end and got another error:
ksql> CREATE STREAM transactions WITH (PARTITIONS=1) AS \
      SELECT * FROM flattentransactions PARTITION BY sender,receiver;
line 1:105: mismatched input ',' expecting ';'
Statement: CREATE STREAM transactions WITH (PARTITIONS=1) AS SELECT * FROM flattentransactions PARTITION BY sender,receiver;
Caused by: line 1:105: mismatched input ',' expecting ';'
Caused by: org.antlr.v4.runtime.InputMismatchException

You need to concatenate the columns first:
CREATE STREAM transactions
  WITH (PARTITIONS=1) AS
  SELECT X, Y, Z, SENDER + RECEIVER AS MSG_KEY
  FROM flattentransactions
  PARTITION BY MSG_KEY;
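Applied to the statement from the question, that would look something like the sketch below (only sender, receiver and flattentransactions are taken from the question; any other value columns you need would go where X, Y, Z are):

CREATE STREAM transactions WITH (PARTITIONS=1) AS
  SELECT sender,
         receiver,
         sender + receiver AS msg_key
  FROM flattentransactions
  PARTITION BY msg_key;

Note that joins are done on the key, so whatever you join this stream against would need to be keyed on the same concatenated value.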

Related

KSQL Group By to drop previous values and only use the LAST

I have a Kafka topic "events" which records user image votes and has json in the following structure:
{"category":"image","action":"vote","label":"amsterdam","ip":"1.1.1.1","value":2}
I need to receive on another topic the sum of all votes for the label (e.g. amsterdam) but drop any votes that came from the same IP address using only the last vote. This topic should have json in this format:
{"label":"amsterdam","SCORE":8,"TOTAL":3}
SCORE is a sum of all votes and TOTAL is the number of votes counted.
The solution I made creates a stream from the topic events:
CREATE STREAM st_events
(CATEGORY STRING, ACTION STRING, LABEL STRING, VALUE BIGINT, IP STRING)
WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON');
Then, I create a table tb_votes which calculates the score and total for each label and IP address:
CREATE TABLE tb_votes WITH (KAFKA_TOPIC='tb_votes', PARTITIONS=1, REPLICAS=1) AS SELECT
st_events.LABEL "label", SUM(st_events.VALUE-1) "score", CAST(COUNT(*) AS BIGINT) "total"
FROM st_events
WHERE
st_events.category='image' AND st_events.action='vote'
GROUP BY st_events.label, st_events.ip
EMIT CHANGES;
The problem is that instead of dropping all the previous votes coming from the same ip address for the same image, Kafka uses all of them. This makes sense as it is a GROUP BY.
Any idea how to "drop" all previous votes and only use the latest values for an image/ IP?
You need a two stage aggregation.
The first stage should build a table with a primary key containing both the ip and label and another column holding the value.
Build a second table from this first table to get the count and sum per-label that you need.
If another vote comes in from the same ip for the same label then the first table will be updated with the new value and the second table will be correctly updated. It will first remove the old value from the count and sum and then apply the new value.
ksqlDB does not yet support multiple primary key columns (though it's coming VERY soon!). So when you group by two columns it just does a funky string concatenation. But we can work with that for now.
CREATE TABLE BY_IP_AND_LABEL AS
  SELECT
    ip + '#' + label AS ipAndLabel,
    -- LATEST_BY_OFFSET keeps only the most recent vote per ip/label pair
    LATEST_BY_OFFSET(value) AS value
  FROM st_events
  GROUP BY ip + '#' + label;
CREATE TABLE BY_LABEL AS
  SELECT
    -- everything after the '#' separator is the label
    SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1) AS label,
    SUM(value - 1) AS score,
    COUNT(*) AS total
  FROM BY_IP_AND_LABEL
  GROUP BY SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1);
The first table creates a composite key using '#' as the separator. The second table uses INSTR and SUBSTRING to find the separator and extract the label.
Note: I've not tested this - I could have some 'off-by-one' errors in the logic.
This should do what you need.
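As a quick check (a hypothetical session; the actual numbers depend on your events), you can watch the per-label results update with a push query:

SELECT label, score, total FROM BY_LABEL EMIT CHANGES;
-- for the example in the question you'd expect a row like: amsterdam | 8 | 3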

Is it possible to use literal data as stream source in Sumologic?

Is it possible for a Sumo Logic user to define data source values inside a query and use them in a subquery condition?
For example, in SQL one can use literal data as a source table.
-- example in MySQL
SELECT * FROM (
SELECT 1 as `id`, 'Alice' as `name`
UNION ALL
SELECT 2 as `id`, 'Bob' as `name`
-- ...
) as literal_table
I wonder if Sumo Logic also has this kind of functionality.
I believe combining such literals with subqueries would make users' lives easier.
I believe the equivalent in a Sumo Logic query would be combining the save operator to create a lookup table in a subquery: https://help.sumologic.com/05Search/Subqueries#Reference_data_from_child_query_using_save_and_lookup
Basically something like this:
_sourceCategory=katta
[subquery:(_sourceCategory=stream explainJSONPlan.ETT) error
| where !(statusmessage="Finished successfully" or statusmessage="Query canceled" or isNull(statusMessage))
| count by sessionId, statusMessage
| fields -_count
| save /explainPlan/neededSessions
| compose sessionId keywords]
| parse "[sessionId=*]" as sessionId
| lookup statusMessage from /explainPlan/neededSessions on sessionid=sessionid
Where /explainPlan/neededSessions is your literal data table that you select from later on in the query (using lookup).
You can define a lookup table with some static map/dictionary that you don't update very often (you can even point to a file on the internet in case you do change the mapping often).
Then you can use the | lookup operator; it is not specific to subqueries.
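A minimal sketch of the lookup-file approach (the source category, URL and field names here are made up for illustration):

_sourceCategory=prod/app
| parse "user_id=*," as user_id
| lookup name, team from https://example.com/static/users.csv on id=user_id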
Disclaimer: I am currently employed by Sumo Logic.

Resulting KSQL join stream shows no data

I am joining a KSQL stream and a KSQL table. Both are keyed on the same field.
But no data is coming into the resulting stream.
create stream kz_yp_loan_join_by_bandid WITH (KAFKA_TOPIC='kz_yp_loan_join_by_bandid',VALUE_FORMAT='AVRO') AS
select ypl.loan_id, ypl.userid, ypk.name as user_band_id_name
FROM kz_yp_loan_stream_partition_by_bandid ypl
INNER JOIN kz_yp_key_table ypk
ON ypl.user_band_id = ypk.id;
No data is in stream kz_yp_loan_join_by_bandid
But if I simply run:
select ypl.loan_id, ypl.userid, ypk.name as user_band_id_name
FROM kz_yp_loan_stream_partition_by_bandid ypl
INNER JOIN kz_yp_key_table ypk
ON ypl.user_band_id = ypk.id;
There is data present.
It seems the stream is not being written to, but why is that?
I have tried doing the entire setup again.
A few things to check:
If you want to process all the existing data as well as new data, make sure that before you run your CREATE STREAM … AS SELECT ("CSAS") you have run SET 'auto.offset.reset' = 'earliest';
If the join returns data when run outside of the CSAS then this may not be relevant, but it's always good to check that your join meets all the join requirements
Check the KSQL server log in case there's an issue with writing to the target topic, creating the schema on the Schema Registry, etc.
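For the first of those checks, a minimal ksqlDB CLI sequence (reusing the statement from the question) looks like this:

SET 'auto.offset.reset' = 'earliest';

CREATE STREAM kz_yp_loan_join_by_bandid WITH (KAFKA_TOPIC='kz_yp_loan_join_by_bandid', VALUE_FORMAT='AVRO') AS
  SELECT ypl.loan_id, ypl.userid, ypk.name AS user_band_id_name
  FROM kz_yp_loan_stream_partition_by_bandid ypl
  INNER JOIN kz_yp_key_table ypk
    ON ypl.user_band_id = ypk.id;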
These references will be useful:
https://www.confluent.io/blog/troubleshooting-ksql-part-1
https://www.confluent.io/blog/troubleshooting-ksql-part-2

Kylin is giving Column 'STATE_NAME' not found in any table

I followed the Kylin tutorial and was able to create a Kylin model and cube successfully. The Kylin cube build also completed successfully.
I created a fact table and a dimension table:
create table sales_fact(product_id int,state_id int,location_id string,sales_count int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;
create table state_details(state_id int,state_name string)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;
I loaded these tables with the following data:
sales_fact
1000,1,AP1,50
1000,2,KA1,100
1001,2,KA1,50
1002,1,AP1,50
1003,3,TL1,100
state_details
1,AP
2,Karnataka
3,Telangana
4,kerala
But when I run a simple query such as:
select sales_count from sales_fact where state_name="Karnataka";
it fails with the error:
Error while executing SQL "select sales_count from sales_fact where state_name="Karnataka" LIMIT 50000": From line 1, column 42 to line 1, column 51: Column 'STATE_NAME' not found in any table
I am not able to find the cause. Does anybody have an idea what is wrong?
state_name is not a column of the sales_fact table; you need to join to state_details. Please try:
select sales_count from sales_fact as f inner join state_details as d on f.state_id = d.state_id where d.state_name='Karnataka';
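With the sample data from the question, state_id 2 maps to Karnataka, so (assuming the cube exposes sales_count for these dimensions) the join should return the two matching rows, i.e. sales_count values 100 and 50.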

Firebird: simulating create table as?

I'm searching for a way to simulate "create table as select" in Firebird from an SP.
We use this statement frequently in another product, because it makes it very easy to build smaller, indexable sets and provides very fast results on the server side.
create temp table a select * from xxx where ...
create indexes on a ...
create temp table b select * from xxx where ...
create indexes on b ...
select * from a
union
select * from b
Or to avoid three or more levels of subqueries:
select *
from a where id in (select id
from b
where ... and id in (select id from c where))
The "create table as select" is very good cos it's provide correct field types and names so I don't need to predefine them.
I can simulate "create table as" in Firebird with Delphi as:
Make select with no rows, get the table field types, convert them to create table SQL, run it, and make "insert into temp table " + selectsql with rows (without order by).
It's ok.
But can I create same thing in a common stored procedure which gets a select sql, and creates a new temp table with the result?
So: can I get query result's field types to I can create field creator SQL from them?
I'm just asking if is there a way or not (then I MUST specify the columns).
Executing DDL inside a stored procedure is not supported by Firebird. You could do it using EXECUTE STATEMENT, but it is not recommended (see the warning at the end of the "No data returned" topic).
One way to have your "temporary sets" would be to use a (transaction-level) Global Temporary Table. Create the GTT as part of the database, with the correct datatypes but without constraints (those would probably get in the way when you fill only some columns, not all); each transaction then only sees its own version of the table and data.
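A minimal sketch of that approach (the table, column names and filter are illustrative; xxx stands for your source table, as in the question):

-- created once, as part of the database schema
CREATE GLOBAL TEMPORARY TABLE tmp_a (
  id   INTEGER,
  name VARCHAR(100)
) ON COMMIT DELETE ROWS;  -- transaction-level GTT: rows disappear at commit/rollback

-- then, inside each transaction (e.g. from a stored procedure):
INSERT INTO tmp_a (id, name)
  SELECT id, name
  FROM xxx
  WHERE id > 0;  -- your filter here

SELECT * FROM tmp_a;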
