How to join a KSQL table and a stream on a non row key column - ksqldb

I am using the community edition of Confluent Platform, version 5.4.1. I did not find any CLI command to print the KSQL server version, but the output I see when I start KSQL is shown in the attached screenshot.
I have a geofence table -
CREATE TABLE GEOFENCE (GEOFENCEID INT,
FLEETID VARCHAR,
GEOFENCECOORDINATES VARCHAR)
WITH (KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON',
KEY= 'GEOFENCEID');
The data comes into the GEOFENCE KSQL table from a Kafka MongoDB source connector whenever an insert or update operation is performed on the geofence MongoDB collection from a web application backed by a REST API. The idea behind making GEOFENCE a table is that tables are mutable, so it holds the up-to-date geofence information: the insert or update operations will not be very frequent, and whenever the geofence MongoDB collection changes, the KSQL table is updated because the key here is GEOFENCEID.
I have a live stream of vehicle position -
CREATE STREAM VEHICLE_POSITION (VEHICLEID INT,
FLEETID VARCHAR,
LATITUDE DOUBLE,
LONGITUDE DOUBLE)
WITH (KAFKA_TOPIC='VEHICLE_POSITION',
VALUE_FORMAT='JSON');
I want to join table and stream like this -
CREATE STREAM VEHICLE_DISTANCE_FROM_GEOFENCE AS
SELECT GF.GEOFENCEID,
GF.FLEETID,
VP.VEHICLEID,
GEOFENCE_UDF(GF.GEOFENCECOORDINATES, VP.LATITUDE, VP.LONGITUDE)
FROM GEOFENCE GF
LEFT JOIN VEHICLE_POSITION VP
ON GF.FLEETID = VP.FLEETID;
But KSQL will not allow me to do this because I am joining on FLEETID, which is not the row key column. This would have been possible in SQL, so how do I achieve it in KSQL?
Note: in my application's business logic, the fleet ID is what associates geofences and vehicles belonging to the same fleet.
Sample data for table -
INSERT INTO GEOFENCE
(GEOFENCEID, FLEETID, GEOFENCECOORDINATES)
VALUES (10, '123abc', '52.4497_13.3096');
Sample data for stream -
INSERT INTO VEHICLE_POSITION
(VEHICLEID, FLEETID, LATITUDE, LONGITUDE)
VALUES (1289, '125abc', 57.7774, 12.7811);

To solve your problem what you need is a table of FLEETID to GEOFENCECOORDINATES. You could join such a table to your VEHICLE_POSITION stream to get the result you need.
So, how do you get a table of FLEETID to GEOFENCECOORDINATES?
The simple answer is that you can't with your current table definition! You declare the table with only GEOFENCEID as the primary key, yet a fleet can have many fences. To be able to model this, both GEOFENCEID and FLEETID would need to be part of the table's primary key.
Consider the example:
INSERT INTO GEOFENCE VALUES (10, 'fleet-1', 'coords-1');
INSERT INTO GEOFENCE VALUES (10, 'fleet-2', 'coords-2');
After running these two inserts the table would contain only a single row, with key 10 and value 'fleet-2', 'coords-2'.
Even if we could somehow capture the above information in a table, consider what happens if there is a tombstone in the topic because the first row had been deleted from the source Mongo collection. A tombstone has the key (10) and a null value. ksqlDB would then remove the row with key 10 from its table, leaving an empty table.
This is the crux of your problem!
First, you'll need to configure the source connector to get both the fence id and fleet id into the key of the messages.
Next, you'll need to access this in ksqlDB. Unfortunately, ksqlDB, as of version 0.10.0 / CP 6.0.0, doesn't support multiple key columns, though that support is actively being worked on.
In the meantime, if your key is a JSON document containing the two key fields, e.g.
{
"GEOFENCEID": 10,
"FLEETID": "fleet-1"
}
Then you can import it into ksqlDB as a STRING:
-- 5.4.1 syntax:
-- ROWKEY will contain the JSON document, containing GEOFENCEID and FLEETID
CREATE TABLE GEOFENCE (
GEOFENCECOORDINATES VARCHAR
)
WITH (
KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON'
);
-- 6.0.0 syntax:
CREATE TABLE GEOFENCE (
JSONKEY STRING PRIMARY KEY,
GEOFENCECOORDINATES VARCHAR
)
WITH (
KAFKA_TOPIC='MONGODB-GEOFENCE',
VALUE_FORMAT='JSON'
);
With the table now correctly defined you can use EXTRACTJSONFIELD to access the data in the JSON key and collect all the fence coordinates using COLLECT_SET. I'm not 100% sure this will work on 5.4.1 (see how you get on), but it will on 6.0.0.
-- 6.0.0 syntax
CREATE TABLE FLEET_COORDS AS
SELECT
EXTRACTJSONFIELD(JSONKEY, '$.FLEETID') AS FLEETID,
COLLECT_SET(GEOFENCECOORDINATES) AS GEOFENCECOORDINATES
FROM GEOFENCE
GROUP BY EXTRACTJSONFIELD(JSONKEY, '$.FLEETID');
This will give you a table of fleetId to a set of fence coordinates. You can use this to join to your vehicle position stream. Of course, your GEOFENCE_UDF will need to accept an ARRAY<STRING> for the fence coordinates, as there may be many.
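As a rough sketch (not part of the original answer, and untested), the downstream stream-table join could then look something like this, assuming GEOFENCE_UDF has been updated to accept an ARRAY<STRING> of coordinates:
-- 6.0.0 syntax; the stream goes on the left of the stream-table join
CREATE STREAM VEHICLE_DISTANCE_FROM_GEOFENCE AS
SELECT
FC.FLEETID,
VP.VEHICLEID,
GEOFENCE_UDF(FC.GEOFENCECOORDINATES, VP.LATITUDE, VP.LONGITUDE) AS DISTANCE_FROM_GEOFENCES
FROM VEHICLE_POSITION VP
LEFT JOIN FLEET_COORDS FC
ON VP.FLEETID = FC.FLEETID;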
Good luck!

Related

Looking up data in an Oracle (12.1) table using keys from a text file

I have a table with approximately 8 million rows in it. It has a uniqueness constraint on a column called Customer_Identifier. This is a varchar(10) field, is not the primary key, but is unique.
I wish to retrieve some customer rows from this table using SQL Developer. I have been given a text file with each record containing a search key value in columns 1-10. This query will need to be reused a few times, with different customer_identifier values. Sometimes I will be given a few customer_identifier values (<1000 of them), sometimes many (between 1000 and 10000 of them). For the times when I want fewer than 1000 values, it's pretty straightforward to use an IN clause; I can edit the text file to wrap the keys in quotes and insert commas as appropriate. But SQL Developer has a hard limit of 1000 values in an IN clause.
I only have read rights to the database, so creating and managing a new physical table is out of the question :-(.
Is there a way that I can treat the text file as a table in Oracle 12.1, and thus use it to join to my customer table on the customer_identifier column?
Brgds
Chris
Yes, you can treat a text file as an external table. But you may need DBA assistance to create a new directory, if you don't have access to a directory defined in the database.
Thanks to Oracle Base
**Create a directory object pointing to the location of the files.**
CREATE OR REPLACE DIRECTORY ext_tab_data AS '/data';
**Create the external table using the CREATE TABLE..ORGANIZATION EXTERNAL syntax. This defines the metadata for the table describing how it should appear and how the data is loaded.**
CREATE TABLE countries_ext (
country_code VARCHAR2(5),
country_name VARCHAR2(50),
country_language VARCHAR2(50)
)
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER
DEFAULT DIRECTORY ext_tab_data
ACCESS PARAMETERS (
RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY ','
MISSING FIELD VALUES ARE NULL
(
country_code CHAR(5),
country_name CHAR(50),
country_language CHAR(50)
)
)
LOCATION ('Countries1.txt','Countries2.txt')
)
PARALLEL 5
REJECT LIMIT UNLIMITED;
**Once the external table created, it can be queried like a regular table.**
SQL> SELECT *
2 FROM countries_ext
3 ORDER BY country_name;
COUNT COUNTRY_NAME COUNTRY_LANGUAGE
----- ---------------------------- -----------------------------
ENG England English
FRA France French
GER Germany German
IRE Ireland English
SCO Scotland English
USA Unites States of America English
WAL Wales Welsh
7 rows selected.
SQL>
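Applied to the original question, a sketch of the same approach might look like the following (the customers table name and customer_keys.txt file name are assumptions, not taken from the question, and you may still need a DBA to create the directory and external table):
CREATE TABLE customer_keys_ext (
customer_identifier VARCHAR2(10)
)
ORGANIZATION EXTERNAL (
TYPE ORACLE_LOADER
DEFAULT DIRECTORY ext_tab_data
ACCESS PARAMETERS (
RECORDS DELIMITED BY NEWLINE
FIELDS TERMINATED BY ','
MISSING FIELD VALUES ARE NULL
(
customer_identifier CHAR(10)
)
)
LOCATION ('customer_keys.txt')
)
REJECT LIMIT UNLIMITED;
-- The external table can then drive the lookup against the customer table:
SELECT c.*
FROM customers c
JOIN customer_keys_ext k
ON c.customer_identifier = k.customer_identifier;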

KSQL Group By to drop previous values and only use the LAST

I have a Kafka topic "events" which records user image votes and has json in the following structure:
{"category":"image","action":"vote","label":"amsterdam","ip":"1.1.1.1","value":2}
I need to receive on another topic the sum of all votes for the label (e.g. amsterdam) but drop any votes that came from the same IP address using only the last vote. This topic should have json in this format:
{"label":"amsterdam","SCORE":8,"TOTAL":3}
SCORE is a sum of all votes and TOTAL is the number of votes counted.
The solution I made creates a stream from the topic events:
CREATE STREAM st_events
(CATEGORY STRING, ACTION STRING, LABEL STRING, VALUE BIGINT, IP STRING)
WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON');
Then, I create a table tb_votes which calculates the score and total for each label and IP address:
CREATE TABLE tb_votes WITH (KAFKA_TOPIC='tb_votes', PARTITIONS=1, REPLICAS=1) AS SELECT
st_events.LABEL "label", SUM(st_events.VALUE-1) "score", CAST(COUNT(*) AS BIGINT) "total"
FROM st_events
WHERE
st_events.category='image' AND st_events.action='vote'
GROUP BY st_events.label, st_events.ip
EMIT CHANGES;
The problem is that instead of dropping all the previous votes coming from the same ip address for the same image, Kafka uses all of them. This makes sense as it is a GROUP BY.
Any idea how to "drop" all previous votes and only use the latest value for an image/IP?
You need a two stage aggregation.
The first stage should build a table with a primary key containing both the ip and label and another column holding the value.
Build a second table from this first table to get the count and sum per-label that you need.
If another vote comes in from the same ip for the same label then the first table will be updated with the new value and the second table will be correctly updated. It will first remove the old value from the count and sum and then apply the new value.
ksqlDB does not yet support multiple primary key columns (though it's coming very soon!). So when you group by two columns it just does a funky string concatenation. But we can work with that for now.
CREATE TABLE BY_IP_AND_LABEL AS
SELECT
ip + '#' + label AS ipAndLabel,
-- LATEST_BY_OFFSET keeps only the most recent vote per ip/label
-- (assumes ksqlDB 0.8.0+ / CP 5.5+; the column must be aggregated in a GROUP BY query)
LATEST_BY_OFFSET(value) AS value
FROM st_events
GROUP BY ip + '#' + label;
CREATE TABLE BY_LABEL AS
SELECT
SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1) AS label,
SUM(value - 1) AS score,
COUNT(*) AS total
FROM BY_IP_AND_LABEL
GROUP BY SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1);
The first table creates a composite key with '#' as the separator. The second table uses INSTR and SUBSTRING to find the separator and extract the label.
Note: I've not tested this - I could have some 'off-by-one' errors in the logic.
This should do what you need.
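As a quick sanity check (assuming the two tables above), a push query against the final table should show the per-label score and total updating as new votes arrive:
SELECT * FROM BY_LABEL EMIT CHANGES;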

Can you aggregate denormalized parse-server query results in one statement using Swift?

My experience is with SQL but I am working on learning parse server data management and in the example below I demonstrate how I would use SQL to represent the data I currently have stored in my parse server classes. I am trying to present all the users, the count of how many images they have uploaded, and a count of how many images they have liked for an app where users can upload images and they can also scroll through and like other people's images. I store the id of the user who uploads the image on the image table and I store an array column in the image table of all the ids that have liked it.
Using SQL I would have normalized this into 3 tables (user, image, user_x_image), joined the tables, and then aggregated that result. But I am trying to learn the right way to do this using parse server, where my understanding is that the best practice is to structure the data the way I have below. What I want to do is produce a "leader board" that presents which users have uploaded the most images or liked the most images, to inspire engagement. Even just links to examples of how to join/aggregate parse data sets would be very helpful. If I wasn't clear in what I am trying to achieve, please let me know in the comments and I will add updates.
-- SQL approximation of data structured in parse
create volatile table users
( user_id char(10)
, user_name char(50)
) on commit preserve rows;
insert into users values('1a','Tom');
insert into users values('2b','Dick');
insert into users values('3c','Harry');
insert into users values('4d','Simon');
insert into users values('5e','Garfunkel');
insert into users values('6f','Jerry');
create volatile table images
( image_id char(10)
, user_id_owner char(10) -- The object Id for the parse user that uploaded
, UsersWhoLiked varchar(100) -- in Parse class this is array of user ids that clicked like
) on commit preserve rows;
insert into images values('img01','1a','["4d","5e"]');
insert into images values('img02','6f','["1a","2b","3c"]');
insert into images values('img03','6f','["1a","6f"]');
-----------------------------
-- DESIRED RESULTS
-- Tom has 1 uploads and 2 likes
-- Dick has 0 uploads and 1 likes
-- Harry has 0 uploads and 1 likes
-- Simon has 0 uploads and 1 likes
-- Garfunkel has 0 uploads and 1 likes
-- Jerry has 2 uploads and 1 likes
-- How to do with normalized data structure
create volatile table user_x_image
( user_id char(10)
, image_id char(10)
, relationship char(10)
) on commit preserve rows;
insert into user_x_image values('4d','img01','liker');
insert into user_x_image values('5e','img01','liker');
insert into user_x_image values('1a','img02','liker');
insert into user_x_image values('2b','img02','liker');
insert into user_x_image values('3c','img02','liker');
insert into user_x_image values('1a','img03','liker');
insert into user_x_image values('6f','img03','liker');
-- Return the image likers/owners
sel
a.user_name
, a.user_id
, coalesce(c.cnt_owned,0) cnt_owned
, sum(case when b.relationship='liker' then 1 else 0 end) cnt_liked
from
users A
left join
user_x_image B
on a.user_id = b.user_id
left join (
sel user_id_owner, count(*) as cnt_owned
from images
group by 1) C
on a.user_id = c.user_id_owner
group by 1,2,3 order by 2
-- Returns desired results
First, I am assuming you are running Parse Server with a MongoDB database (Parse Server also supports Postgres, which can make things a little bit easier for relational queries). Because of this, it is important to note that, although Parse Server implements relational capabilities in its API, we are in fact talking about a NoSQL database behind the scenes. So, let's go through the options.
Option 1 - Denormalized Data
Since it is a NoSQL database, I'd prefer to have a third collection called LeaderBoard. You could add an afterSave trigger to the UserImage class to keep LeaderBoard always up to date. When you need the data, you can do a very simple and fast query. I know it sounds kind of strange for an experienced SQL developer to keep denormalized data, but it is the best option in terms of performance if you have more reads than writes on this collection.
Option 2 - Aggregate
MongoDB supports aggregates (https://docs.mongodb.com/manual/aggregation/) and it has a pipeline stage called $lookup (https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/) that you can use to perform your query in a single API call/database operation. Parse Server supports aggregates as well in its API and JS SDK (https://docs.parseplatform.org/js/guide/#aggregate), but unfortunately not directly from Swift client code, because this operation requires the master key in Parse Server. Therefore, you will need to write a cloud code function that performs the aggregate query for you and then call this cloud function from your Swift client code.

how to join 2 tables and show records in tableview

I have these 2 tables
CREATE TABLE "QuestionWithAnswer" ("Date" DATETIME PRIMARY KEY NOT NULL , "Question" TEXT, "Answer" TEXT, "UserAnswer" TEXT, "IsCorrect" BOOL)
CREATE TABLE "Records" ("id" INTEGER PRIMARY KEY ,"DateWithTime" DATETIME,"UserGivenAnswer" TEXT DEFAULT (null) ,"Correct" TEXT DEFAULT (null) ,"Question_ID" TEXT)
I want to join them on date, retrieve the records, and show them in a table view in iOS.
The following query will get all the columns in the database where the dates match:
SELECT * FROM QuestionWithAnswer AS Q JOIN Records AS R ON Q.Date = R.DateWithTime;
The next thing is up to you: create objects for the two tables and parse the fetched SQLite data into the right fields of the respective objects.
After that, create a UITableView, set the view controller (or a model) as the data source, and use a list (NSArray, NSDictionary, ...) containing the fetched objects to populate the table view. There are lots of good tutorials on the net on how to do this.
For your info, this kind of question is quite broad and you cannot expect a full-featured answer since that means creating a lot of code which really isn't the purpose of SO.

Change Data Capture with table joins in ETL

In my ETL process I am using Change Data Capture (CDC) to discover only rows that have been changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem arises when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example I have the tables Countries and Towns as follows:
Countries:
ID Name
1 France
Towns:
ID Name Country_ID
1 Lyon 1
Now let's say a new row is added to the Towns table:
ID Name Country_ID
1 Lyon 1
2 Paris 2
The Countries table has not been changed, so CDC for these tables shows me only the row from the Towns table. The problem is when I do the join between Countries and Towns: there is no row in the Countries change set, so the join will result in an empty set.
Do you have an idea how to solve it? Of course there might be more difficult cases, involving 3 and more tables, and consequential joins.
This is a typical problem found when doing Realtime Change-Data-Capture, or even Incremental-only daily changes.
There's multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
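As a rough illustration of that trigger idea (MySQL-style syntax; the staging table names stg_towns and stg_countries are assumptions, not from the question):
-- When a changed town lands in staging, stage its country too,
-- skipping countries that are already staged (id is assumed to be the primary key of stg_countries)
CREATE TRIGGER trg_stage_country
AFTER INSERT ON stg_towns
FOR EACH ROW
INSERT IGNORE INTO stg_countries (id, name)
SELECT c.id, c.name
FROM countries c
WHERE c.id = NEW.country_id;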
There is a lot I could add for more information, but I will be specific to what is in your question. I would suggest the following to get the results...
1st Pass is where everything matches via the join...
Union All
2nd Pass gets all towns where there isn't a country
(a left outer join with a where condition that
requires the ID in the countries table to be null/missing).
You would default the Country ID value in that unmatched join to something designated as an "Unmatched Value"; typically 0 or -1 is used, or a series of standard negative numbers that you can assign descriptions to later to identify why the data is bad. For your example, -1 could be "Found Town Without Country".
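Sketched against the example tables above (the -1 default is illustrative):
-- 1st pass: towns whose country is present
SELECT t.ID AS Town_ID, t.Name AS Town_Name, c.ID AS Country_ID, c.Name AS Country_Name
FROM Towns t
JOIN Countries c ON c.ID = t.Country_ID
UNION ALL
-- 2nd pass: towns with no matching country, defaulted to the "unmatched" country -1
SELECT t.ID, t.Name, -1, 'Found Town Without Country'
FROM Towns t
LEFT OUTER JOIN Countries c ON c.ID = t.Country_ID
WHERE c.ID IS NULL;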
