KSQL Group By to drop previous values and only use the LAST - ksqldb

I have a Kafka topic "events" which records user image votes and has json in the following structure:
{"category":"image","action":"vote","label":"amsterdam","ip":"1.1.1.1","value":2}
I need to receive on another topic the sum of all votes for the label (e.g. amsterdam) but drop any votes that came from the same IP address using only the last vote. This topic should have json in this format:
{"label":"amsterdam","SCORE":8,"TOTAL":3}
SCORE is a sum of all votes and TOTAL is the number of votes counted.
The solution I made creates a stream from the topic events:
CREATE STREAM st_events
(CATEGORY STRING, ACTION STRING, LABEL STRING, VALUE BIGINT, IP STRING)
WITH (KAFKA_TOPIC='events', VALUE_FORMAT='JSON');
Then, I create a table tb_votes which calculates the score and total for each label and IP address:
CREATE TABLE tb_votes WITH (KAFKA_TOPIC='tb_votes', PARTITIONS=1, REPLICAS=1) AS SELECT
st_events.LABEL "label", SUM(st_events.VALUE-1) "score", CAST(COUNT(*) AS BIGINT) "total"
FROM st_events
WHERE
st_events.category='image' AND st_events.action='vote'
GROUP BY st_events.label, st_events.ip
EMIT CHANGES;
The problem is that instead of dropping all the previous votes coming from the same ip address for the same image, Kafka uses all of them. This makes sense as it is a GROUP BY.
Any idea how to "drop" all previous votes and only use the latest value for each image/IP pair?

You need a two stage aggregation.
The first stage should build a table with a primary key containing both the ip and label and another column holding the value.
Build a second table from this first table to get the count and sum per-label that you need.
If another vote comes in from the same ip for the same label then the first table will be updated with the new value and the second table will be correctly updated. It will first remove the old value from the count and sum and then apply the new value.
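To make the retract-then-apply behaviour concrete, here is a small Python sketch of the two stages. It is only a simulation of the idea, not ksqlDB code: apply_vote, latest and totals are made-up names, and value - 1 simply mirrors the SUM(VALUE-1) from the question.

```python
from collections import defaultdict

def apply_vote(latest, totals, label, ip, value):
    """Upsert one vote and keep the per-label aggregate in sync."""
    key = (ip, label)
    score_total = totals[label]
    if key in latest:
        # The old row for this (ip, label) is retracted first ...
        score_total[0] -= latest[key] - 1
        score_total[1] -= 1
    # ... then the new value is stored and applied.
    latest[key] = value
    score_total[0] += value - 1
    score_total[1] += 1

latest = {}                            # stage 1: (ip, label) -> last value
totals = defaultdict(lambda: [0, 0])   # stage 2: label -> [score, total]

votes = [
    ("amsterdam", "1.1.1.1", 2),
    ("amsterdam", "2.2.2.2", 5),
    ("amsterdam", "1.1.1.1", 4),  # replaces the first vote from 1.1.1.1
]
for label, ip, value in votes:
    apply_vote(latest, totals, label, ip, value)

print(dict(totals))  # {'amsterdam': [7, 2]}
```

Three votes arrive, but only two are counted, because the second vote from 1.1.1.1 replaces the first.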
ksqlDB does not yet support multiple primary key columns (though it's coming VERY soon!). So when you group by two columns it just does a funky string concatenation. But we can work with that for now.
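-- Stage 1: one row per ip/label pair, holding only the most recent vote.
-- LATEST_BY_OFFSET assumes a ksqlDB version that provides it.
CREATE TABLE BY_IP_AND_LABEL AS
  SELECT
    ip + '#' + label AS ipAndLabel,
    LATEST_BY_OFFSET(value) AS value
  FROM st_events
  GROUP BY ip + '#' + label;
-- Stage 2: re-aggregate per label; the + 1 skips past the '#' separator.
CREATE TABLE BY_LABEL AS
  SELECT
    SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1) AS label,
    SUM(value - 1) AS score,
    COUNT(*) AS total
  FROM BY_IP_AND_LABEL
  GROUP BY SUBSTRING(ipAndLabel, INSTR(ipAndLabel, '#') + 1);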
The first table creates a composite key from ip and label, with # as the separator. The second table uses INSTR and SUBSTRING to find the separator and extract the label.
Note: I've not tested this - I could have some 'off-by-one' errors in the logic.
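One way to sanity-check the separator logic without a running cluster is to mimic it in Python. The helpers instr and substring below are hypothetical stand-ins that assume 1-based positions, as in most SQL dialects; they are not ksqlDB calls. Under that assumption, the start position has to be one past the separator, otherwise the # stays in the label.

```python
def instr(s, sub):
    """1-based position of sub in s, 0 if absent (mimics SQL INSTR)."""
    return s.find(sub) + 1

def substring(s, pos):
    """Suffix of s starting at 1-based position pos (mimics SQL SUBSTRING)."""
    return s[pos - 1:]

key = "1.1.1.1#amsterdam"
with_separator = substring(key, instr(key, "#"))
label_only = substring(key, instr(key, "#") + 1)

print(with_separator)  # #amsterdam
print(label_only)      # amsterdam
```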
This should do what you need.

Related

postgres find IP address between 2 db columns

I have a table of IP addresses. The table has two columns named starting_ip and ending_ip. The table looks like the following:
Now, let's say I have a random IP address. From that IP address, I want to know the city_name. That means I need to find which range the IP address falls into, based on starting_ip and ending_ip, then fetch that one record and get the city_name.
I wrote a query something like this:
class IpToCity < ActiveRecord::Base
  establish_connection :"ip_database_#{Rails.env}"
  scope :search_within_ip_range, -> (ip_address) do
    self.connection.select_all("
      with candidate as (
        select * from ip_cities
        where ending_ip >= '#{ip_address}'::inet
        order by ending_ip asc
        limit 1
      )
      select * from candidate
      where starting_ip <= '#{ip_address}'::inet;
    ")
  end
end
It's a scope where I pass the random IP and get a single record. The query works fine, but it's very slow. Any suggestions on how to make it faster?
Thanks in advance!
Do all the rows match this format?
starting_ip ending_ip
x.y.z.0 x.y.z.255
If so, then you can add another column for "prefix": x.y.z.
Then match the first 3 octets of the target against the prefix column.
When updating the DB, break rows that span more than one prefix into multiple rows.
The max number of rows is 16.8M (256^3), which is small and only slightly bigger than your current 5M.
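A rough Python sketch of the prefix idea, assuming every range starts on x.y.z.0 and ends on x.y.z.255 as asked above. The function names, the dict standing in for the indexed prefix column, and the sample ranges and cities are all made up for illustration.

```python
import ipaddress

def split_into_prefixes(start_ip, end_ip, city):
    """Yield (prefix, city) for every x.y.z block the range covers."""
    first = int(ipaddress.IPv4Address(start_ip)) >> 8
    last = int(ipaddress.IPv4Address(end_ip)) >> 8
    for block in range(first, last + 1):
        block_start = str(ipaddress.IPv4Address(block << 8))
        yield block_start.rsplit(".", 1)[0], city

def prefix_of(ip):
    """First three octets of a dotted IPv4 address."""
    return ip.rsplit(".", 1)[0]

# Hypothetical sample rows: (starting_ip, ending_ip, city_name)
ranges = [
    ("10.0.0.0", "10.0.1.255", "Springfield"),   # spans two prefixes
    ("10.0.2.0", "10.0.2.255", "Shelbyville"),
]

# A dict stands in for the indexed "prefix" column.
by_prefix = {}
for start, end, city in ranges:
    for prefix, name in split_into_prefixes(start, end, city):
        by_prefix[prefix] = name

print(by_prefix.get(prefix_of("10.0.1.77")))  # Springfield
```

The lookup becomes an exact match on one value instead of a range scan, which is the kind of query a plain index handles well.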

How to assign incremental numbers but only if another column is different value

I am trying to assign a unique ID to companies that I have a list of. These companies have multiple products so the company name appears on multiple rows.
=ARRAYFORMULA(IF($B2<>"",IF((COUNTIF($B$1:$B1,$B2)>0),INDEX($A$1:$R2,MATCH($B2,$B$1:$B1,0),12),CONCATENATE("C00",ROW($C2))),""))
The above kind of works: it assigns C001, then sees that later rows have a matching value and reuses that ID, but it then assigns C009 if the next company name is 8 rows down, rather than assigning C002 to that next company.
=ARRAYFORMULA(IF($B2<>"",IF((COUNTIF($B$1:$B1,$B2)>0),INDEX($A$1:$R2,MATCH($B2,$B$1:$B1,0),12),CONCATENATE("RET00",ROW($B2))),""))
I expect each distinct company name to get its own incremental unique ID in the Company ID column.
Here is my data and expected result:
try it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(B2:B,
{UNIQUE(INDIRECT("B2:B"&COUNTA(B2:B)+1)),
TEXT(ROW(INDIRECT("B1:B"&COUNTUNIQUE(B2:B))), "C0#")}, 2, 0)))
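The idea behind the formula is to build a lookup of unique names numbered in order of first appearance, then look every row up in it. A minimal Python sketch of that same logic, in case it helps to see it outside the spreadsheet. assign_company_ids and the sample names are made up, and the C001 format follows the question rather than whatever the TEXT pattern renders.

```python
def assign_company_ids(names, prefix="C"):
    """Give each distinct non-empty name an ID in first-seen order."""
    ids = {}
    result = []
    for name in names:
        if not name:
            result.append("")   # blank rows stay blank
            continue
        if name not in ids:
            # The next number depends on how many names were seen, not on the row.
            ids[name] = f"{prefix}{len(ids) + 1:03d}"
        result.append(ids[name])
    return result

print(assign_company_ids(["Acme", "Acme", "Globex", "Acme", "", "Initech"]))
# ['C001', 'C001', 'C002', 'C001', '', 'C003']
```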

Esper - concatenate values from multiple rows to a list

I have an Esper query that returns multiple rows, but I'd like to instead get one row, where that row has a list (or concatenated string) of all of the values from the (corresponding columns of the) matching rows that my current query returns.
For example:
SELECT Name, avg(latency) as avgLatency
FROM MyStream.win:time(5 min)
GROUP BY Name
HAVING avgLatency / 1000 > 60
OUTPUT last every 5 min
Returns:
Name  avgLatency
----  ----------
A     65
B     70
C     75
What I'd really like:
Name
----
{A, B, C}
Is this possible to do via the query itself? I tried to make this work using subqueries, but I'm not working with multiple streams. I can't find any aggregation functions or enumeration functions in the Esper documentation that fits what I'm trying to do either.
Thanks to anybody that has any insight or direction for me here.
EDIT:
If this can't be done via the query, I'm open to changing the subscriber, or anything else, if necessary.
You can have a subscriber or listener do the concat. There is a "Multi-Row Delivery" for subscribers. Or use a table like below.
// create table to hold aggregation result
create table LatencyTable(name string primary key, avgLatency avg(double));
// update aggregations in table from events coming in
into LatencyTable select name, avg(latency) as avgLatency from MyStream#time(5 min) group by name;
// do a select with the "aggregate" enumeration method
select (select * from LatencyTable where avgLatency > x).aggregate(....) from pattern[every timer:interval(5 min)]
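If you go the subscriber route instead, the concat itself is small. A rough Python sketch of the idea only: this is not Esper API, and concat_names is a hypothetical stand-in for whatever method receives all rows of one output batch at once.

```python
def concat_names(rows):
    """Collapse one batch of (name, avg_latency) rows into a single '{A, B, C}' string."""
    return "{" + ", ".join(name for name, _avg_latency in rows) + "}"

# One batch as the query in the question would deliver it every 5 minutes.
batch = [("A", 65), ("B", 70), ("C", 75)]
print(concat_names(batch))  # {A, B, C}
```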

How to get the number of entries in a measurement

I am a newbie to InfluxDB. I just started to read the Influx documentation.
I can't seem to get the equivalent of 'select count(*) from table' to work in InfluxDB.
I have a measurement called cart:
time                 status  cartid
1456116106077429261  0       A
1456116106090573178  0       B
1456116106095765618  0       C
1456116106101532429  0       D
but when I try to do
select count(cartid) from cart
I get the error
ERR: statement must have at least one field in select clause
I suppose cartid is a tag rather than a field value? count() currently can't be used on tag and time columns. So if your status is a non-tag column (a field), do the count on that.
EDIT:
Reference
This works as long as no field or tag exists with the name count:
SELECT SUM(count) FROM (SELECT *,count::INTEGER FROM MyMeasurement GROUP BY count FILL(1))
If it does, use some other name for the count field. This works by first selecting all entries including an unpopulated field (count), then grouping by the unpopulated field, which does nothing but allows us to use the fill operator to assign 1 to each entry for count. Then we select the sum of the count fields in a super query. The result should look like this:
name: MyMeasurement
----------------
time sum
0 47799
It's a bit hacky but it's the only way to guarantee a count of all entries when no field exists that is always present in all entries.
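To see why counting a real field can undercount while the fill trick does not, here is a toy Python model. It is not InfluxQL: the points and field names are made up, and plain dicts stand in for entries in a measurement.

```python
# Made-up points; the second one has no "status" field.
points = [
    {"time": 1, "status": 0},
    {"time": 2, "temperature": 21.5},
    {"time": 3, "status": 1},
]

# Counting a real field only sees the points where that field is present.
count_status = sum(1 for p in points if "status" in p)

# The workaround: select a field that exists nowhere, fill it with 1, then sum.
filled = [{**p, "count": p.get("count", 1)} for p in points]
total = sum(p["count"] for p in filled)

print(count_status, total)  # 2 3
```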

Change Data Capture with table joins in ETL

In my ETL process I am using Change Data Capture (CDC) to discover only rows that have been changed in the source tables since the last extraction. Then I do the transformation only for these rows. The problem is when I have, for example, 2 tables which I want to join into one dimension, and only one of them has changed. For example, I have tables Countries and Towns as follows:
Countries:
ID Name
1 France
Towns:
ID Name Country_ID
1 Lyon 1
Now let's say a new row is added to the Towns table:
ID Name Country_ID
1 Lyon 1
2 Paris 2
The Countries table has not been changed, so CDC for these tables shows me only the row from Towns table. The problem is when I do the join between Countries and Towns, there is no row in Countries change set, so the join will result in empty set.
Do you have an idea how to solve it? Of course there might be more difficult cases, involving 3 or more tables and consequential joins.
This is a typical problem found when doing Realtime Change-Data-Capture, or even Incremental-only daily changes.
There are multiple ways to solve this.
One way would be to do your joins on the natural keys in the dimension or mapping table, to get the associated country (SELECT distinct country_name, [..other attributes..] from dim_table where country_id = X).
Another alternative would be to do the join as part of the change capture process - when a row is loaded to towns, a trigger goes off that loads the foreign key values into the associated staging tables (country, etc).
There is a lot more I could go into, but I will be specific to what is in your question. I would suggest the following to get the results...
1st Pass is where everything matches via the join...
Union All
2nd Pass Gets all towns where there isn't a country
(left outer join with a where condition that
requires the ID in the countries table to be null/missing).
You would default the Country ID value in that unmatched join to something designated as an "Unmatched Value". Typically 0 or -1 is used, or a series of standard negative numbers that you can assign descriptions to later to identify why the data is bad. For your example, -1 could be "Found Town Without Country".
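The two passes can be sketched with SQLite from Python as below. This is only an illustration under assumptions: the table name towns_changes (the CDC change set) and the extra town "Nice" are made up, and the join runs against the full countries table rather than its change set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE towns_changes (id INTEGER PRIMARY KEY, name TEXT, country_id INTEGER);
    INSERT INTO countries VALUES (1, 'France');
    INSERT INTO towns_changes VALUES (2, 'Paris', 2);  -- row from the question
    INSERT INTO towns_changes VALUES (3, 'Nice', 1);   -- hypothetical extra row
""")

rows = conn.execute("""
    -- 1st pass: changed towns whose country exists in the full table
    SELECT t.name, c.id, c.name
    FROM towns_changes t
    JOIN countries c ON c.id = t.country_id
    UNION ALL
    -- 2nd pass: changed towns with no matching country get the -1 placeholder
    SELECT t.name, -1, 'Found Town Without Country'
    FROM towns_changes t
    LEFT JOIN countries c ON c.id = t.country_id
    WHERE c.id IS NULL
""").fetchall()

for row in rows:
    print(row)
```

Every changed town comes out exactly once, either with its real country or with the placeholder, so nothing is silently dropped by the join.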
