KSQL for aggregating historic data by day while accepting late updates - ksqldb

I've been trying to use ksqlDB to aggregate data that arrives in the following JSON pattern:
{"code": "1234", "type": "1234", "item": "1234", "items_sold": 3, "items_acquired": 0, "timestamp": "yyyyMMddhhmmssSSSSSSSSS"}
So far I've managed to aggregate the net change of items for any given day. I can also output the total current amount for any given combination of item, type, and code without problems.
The following is how I calculate the net change for a given day:
CREATE STREAM daily_change WITH (KAFKA_TOPIC='daily_change', PARTITIONS=1, REPLICAS=1) AS
SELECT
    items_acquired - items_sold AS net_change,
    item, type, code,
    SUBSTRING(CAST(datets AS STRING), 0, 8) AS curr_day
FROM change_event
EMIT CHANGES;
and from that stream I create a table:
CREATE TABLE DAILY_INVENTORY WITH (KAFKA_TOPIC='daily_inventory', KEY_FORMAT='DELIMITED', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='JSON') AS
SELECT
    SUM(net_change),
    curr_day, item, type, code
FROM daily_change
WINDOW TUMBLING ( SIZE 24 HOURS )
GROUP BY curr_day, item, type, code
EMIT CHANGES;
For the total inventory it's the following:
CREATE TABLE current_inventory WITH (KAFKA_TOPIC='current_inventory', KEY_FORMAT='DELIMITED', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='JSON') AS
SELECT
    SUM(items_acquired - items_sold) AS current_stock,
    code, item, type
FROM change_event
GROUP BY code, item, type
EMIT CHANGES;
I'm stuck trying to get a table that shows the total amount of inventory grouped by day, code, item, and type.
On top of that, it also needs to accept changes up to 14 days in the past.
The problem is that I have not managed to separate the inventory by day: I either calculate only the net change per day or the total current amount.
Basically, I either lose the information of the day or the information of the total amount.
Some example data to illustrate the problem:
{"code": "1234", "type": "1234", "item": "1234", "items_sold": 0, "items_acquired": 2, "timestamp": "20220910113045000000000"}
{"code": "1234", "type": "1234", "item": "1234", "items_sold": 1, "items_acquired": 0, "timestamp": "20220910143045000000000"}
At the end of the day, this should result in a total of 1 for 20220910 (2 acquired minus 1 sold).
{"code": "1234", "type": "1234", "item": "1234", "items_sold": 0, "items_acquired": 3, "timestamp": "20220912113045000000000"}
This should give me a total of 4 for 20220912 (the 1 carried over from 20220910 plus 3).
{"code": "1234", "type": "1234", "item": "1234", "items_sold": 0, "items_acquired": 1, "timestamp": "20220902113045000000000"}
This late event should adjust every subsequent total by 1, so 20220910 should now be 2 and 20220912 should now be 5.
I've tried using windowing with retention and grace periods and such, but without success; as mentioned above, I either end up producing only the net_change or have to give up the day information.
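For reference, this is roughly the windowed variant I experimented with (a sketch, not my exact statement; ksqlDB's tumbling windows accept RETENTION and GRACE PERIOD options, and RETENTION must be at least SIZE plus GRACE PERIOD):

CREATE TABLE daily_inventory_windowed WITH (KAFKA_TOPIC='daily_inventory_windowed', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='JSON') AS
SELECT
    SUM(net_change) AS daily_net,
    item, type, code
FROM daily_change
WINDOW TUMBLING (SIZE 24 HOURS, RETENTION 15 DAYS, GRACE PERIOD 14 DAYS)
GROUP BY item, type, code
EMIT CHANGES;

This accepts events up to 14 days late, but it still only yields the per-window net change rather than a running total per day.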
Does anybody have an idea what kind of approach I could try?

Related

Divide a number over 23 columns based on variables

Okay this might be very basic but here goes:
I have 23 "campaign" types, each of which contain a number of locales (UK, India, etc). Each of these, such as "Campaign 1, UK" are either "Open" or "Closed" and have a priority assigned to them. They also have a required client number, so "Campaign 1, UK" has a required client number of 5, "Campaign 1, India" has 7, "Campaign 2, UK" has 14 and so on and so forth.
If I have a list containing different locales client numbers such as 14 for "UK", 54 for "India", is it possible to:
If UK is "open", gather all of the campaign types and assign the 14 UK clients to different campaigns based on the priority and required client number?
I understand this probably makes no sense but can clarify if needed.
Thanks in advance!

Find and sum date ranges with overlapping records in postgresql

I have a large dataset in which I want to sum a count over records with overlapping times. For example, given the data
[
{"id": 1, "name": 'A', "start": '2018-12-10 00:00:00', "end": '2018-12-20 00:00:00', count: 34},
{"id": 2, "name": 'B', "start": '2018-12-16 00:00:00', "end": '2018-12-27 00:00:00', count: 19},
{"id": 3, "name": 'C', "start": '2018-12-16 00:00:00', "end": '2018-12-20 00:00:00', count: 56},
{"id": 4, "name": 'D', "start": '2018-12-25 00:00:00', "end": '2018-12-30 00:00:00', count: 43}
]
You can see there are 2 periods where activities overlap. I want to return the total count of these 'overlaps' based on the activities involved in each overlap. So the above would output something like:
[
{start:'2018-12-16', end: '2018-12-20', overlap_ids:[1,2,3], total_count: 109},
{start:'2018-12-25', end: '2018-12-27', overlap_ids:[2,4], total_count: 62},
]
The question is, how to go about generating this via a Postgres query? I was looking into generate_series and then working out which activity falls into each interval, but that's not quite right, as the data is continuous - I really need to identify the exact overlapping time and then do a sum over the overlapping activities.
EDIT: I have added another example. As #SRack pointed out, since A, B, C overlap, this means B,C, A,B and A,C also overlap. This doesn't matter, since the output I'm looking for is an array of date ranges that contain overlapping activities rather than all the unique combinations of overlaps. Also note the dates are timestamps, so they will have millisecond precision and won't necessarily all be at 00:00:00.
If it helps, there would probably be a WHERE condition on the total count. For example only want to see results where total count > 100
demo:db<>fiddle (uses the old data set with the overlapping A-B-part)
Disclaimer: this works for day intervals, not for timestamps. The requirement for timestamps came later.
SELECT
    s.acts,
    s.sum,
    MIN(a.start) as start,
    MAX(a."end") as "end"
FROM (
    SELECT DISTINCT ON (acts)
        array_agg(name) as acts,
        SUM(count)
    FROM
        activities,
        generate_series(start, "end", interval '1 day') gs
    GROUP BY gs
    HAVING cardinality(array_agg(name)) > 1
) s
JOIN activities a
ON a.name = ANY(s.acts)
GROUP BY s.acts, s.sum
generate_series generates all dates between start and end, so every date on which an activity exists gets one row with that activity's count.
Group by date, aggregating the activities that exist on that date and summing their counts.
HAVING filters out the dates where only one activity exists.
Because there are different days with the same set of activities, we only need one representative: filter out the duplicates with DISTINCT ON.
Join this result against the original table to get the start and end. (Note that "end" is a reserved word in Postgres; you would do better to choose another column name!) It was more convenient to drop these columns earlier, but it is possible to fetch them within the subquery.
Group this join to get the earliest and latest date of each interval.
Here's a version for timestamps:
demo:db<>fiddle
WITH timeslots AS (
    SELECT * FROM (
        SELECT
            tsrange(timepoint, lead(timepoint) OVER (ORDER BY timepoint)),
            lead(timepoint) OVER (ORDER BY timepoint)          -- 2
        FROM (
            SELECT
                unnest(ARRAY[start, "end"]) as timepoint       -- 1
            FROM
                activities
            ORDER BY timepoint
        ) s
    ) s WHERE lead IS NOT NULL                                 -- 3
)
SELECT
    GREATEST(MAX(start), lower(tsrange)),                      -- 6
    LEAST(MIN("end"), upper(tsrange)),
    array_agg(name),                                           -- 5
    sum(count)
FROM
    timeslots t
JOIN activities a
ON t.tsrange && tsrange(a.start, a."end")                      -- 4
GROUP BY tsrange
HAVING cardinality(array_agg(name)) > 1
The main idea is to identify the possible time slots. So I take every known time (both start and end) and put them into a sorted list. Then I can take the first two known times (17:00 from start A and 18:00 from start B) and check which intervals fall within that slot. Then I check the 2nd and 3rd, then the 3rd and 4th, and so on.
In the first time slot only A fits. In the second, from 18 to 19, B fits as well. In the next slot, 19 to 20, C joins in; from 20 to 20:30 A no longer fits, only B and C. The next one is 20:30 to 22, where only B fits; finally in 22 to 23 D is added to B, and last but not least only D fits into 23 to 23:30.
So I take this list of time slots and join it against the activities table wherever the intervals intersect. After that it's just a matter of grouping by time slot and summing up the count.
This puts both timestamps of a row into one array, whose elements are expanded into one row per element with unnest. So I get all times into one column, which can simply be ordered.
Using the lead window function pulls the value of the next row into the current one, so I can create a timestamp range out of these two values with tsrange.
This filter is necessary because the last row has no "next value". That creates a NULL value, which tsrange interprets as infinity, producing a wildly wrong time slot, so we need to filter this row out.
Join the time slots against the original table. The && operator checks whether two range types overlap.
Group by the single time slots, aggregating the names and summing the count. Filter out the time slots with only one activity using the HAVING clause.
Getting the right start and end points is a little tricky: the start point is either the maximum of the activity starts or the beginning of the time slot (which can be obtained with lower()). E.g. take the 20 to 20:30 slot: it begins at 20h, but neither B nor C has its starting point there. The end time works analogously.
As this is tagged Ruby on Rails, I've put together a Rails solution for this too. I've updated the data so they don't all overlap, and worked with the following:
data = [
{"id": 1, "name": 'A', "start": '2017-12-10 00:00:00', "end": '2017-12-20 00:00:00', count: 34},
{"id": 2, "name": 'B', "start": '2018-12-16 00:00:00', "end": '2018-12-21 00:00:00', count: 19},
{"id": 3, "name": 'C', "start": '2018-12-20 00:00:00', "end": '2018-12-29 00:00:00', count: 56},
{"id": 4, "name": 'D', "start": '2018-12-21 00:00:00', "end": '2018-12-30 00:00:00', count: 43}
]
(2..data.length).each_with_object({}) do |n, hash|
  data.combination(n).each do |items|
    combination = items.dup
    first_item = combination.shift
    first_item_range = (Date.parse(first_item[:start])..Date.parse(first_item[:end]))

    if combination.all? { |i| (Date.parse(i[:start])..Date.parse(i[:end])).overlaps?(first_item_range) }
      hash[items.map { |i| i[:name] }.sort] = items.sum { |i| i[:count] }
    end
  end
end
This generates the following results:
# => {["B", "C"]=>75, ["B", "D"]=>62, ["C", "D"]=>99, ["B", "C", "D"]=>118}
... So you can see items B, C and D overlap, with a total count of 118. (Naturally, this also means the pairs B and C, B and D, and C and D overlap.)
Here's what this does in steps:
gets each combination of entries of data, from a length of 2 to 4 (the data's length)
iterates through these and compares the first element of the combination to the others
if these all overlap, store this in a hash
This way, we get unique entries of data names, with a count stored alongside them.
Hope this is useful - happy to take feedback on any way in which this could be improved. Let me know how you get on!

How can I easily label my data in Power BI?

Question
Is there a fast, scalable way to replace number values by mapped text labels in my visualisations?
Background
I often find myself with questionnaire data of the following format:
ID    Sex   Age class   Answer to question
001   1     2           5
002   2     3           2
003   1     3           1
004   2     5           1
The Sex, Age class and Answer column values actually map to text labels. For the example of Sex:
ID   Description
0    Unknown
1    Man
2    Woman
Similar mappings are possible for the other columns.
If I create visualisations of e.g. the distribution of sex in my respondent group I'll get a visual showing that 50% of my data has sex 1 and 50% of my data has sex 2.
The data itself often originates from an Excel or csv file.
What I have tried
To make that visualisation meaningful to other people I:
create a second table containing the mapping between the value and label
create a relationship between the source data and the mapping
use the Description column of my mapping table as a category in my visualisations.
I have to do this for several columns in my dataset, which makes this a tedious process.
Ideal solution
A method that allows me to define, per column, a mapping between values and corresponding text labels. SPSS' VALUE LABELS command comes to mind.
You can simply create a calculated column on your table that defines how you want to map each ID value using a SWITCH function, and then use that column in your visual. For example,
Sex Label =
SWITCH([Sex],
1, "Man",
2, "Woman",
"Unknown"
)
(Here, the last argument is an else condition that gets returned if none of the previous values match.)
If you want to do a whole bunch at a time, you can create a new table from your existing table using ADDCOLUMNS like this:
Test =
ADDCOLUMNS(
Table1,
"Sex Label", SWITCH([Sex], 1, "Man", 2, "Woman", "Unknown"),
"Question 1 Label", SWITCH([Question 1], 1, "Yes", 2, "No", "Don't Know"),
"Question 2 Label", SWITCH([Question 2], 1, "Yes", 2, "No", "Don't Know"),
"Question 3 Label", SWITCH([Question 3], 1, "Yes", 2, "No", "Don't Know")
)

Create/Modify Survey - API v3

I have used V2 of the Survey Monkey API to get details on collectors and surveys. I am now interested in learning how to use the V3 API to create/modify surveys. I'm hoping some useful tips from other users will help me out, as I am relatively new to the API. I will be using Python.
Specifically, my use case is that I want to use a base survey as a template, and modify the answer options per recipient. Here is an example:
Recipient A would get:
Q1. On a scale of 1 (least) to 5 (most), how much do you like eating:
a. Burgers
b. Pizza
c. Hotdogs
Q2. On a scale of 1 (rarely) to 5 (very), in a typical week, how often do you eat:
a. Burgers
b. Pizza
c. Hotdogs
While recipient B would get
Q1. On a scale of 1 (least) to 5 (most), how much do you like eating:
a. Fried chicken
b. French fries
c. Tacos
Q2. On a scale of 1 (rarely) to 5 (very), in a typical week, how often do you eat:
a. Fried chicken
b. French fries
c. Tacos
How do I construct the API call that reads in the various answer options?
I also plan to use pandas to load the table of answer options per recipient, and I want to find out how to pipe the answer options into the API - would it be through a conversion to JSON? I have read the documentation, but it's not always obvious (to a newbie) what needs to be done.
Thanks so much!
As far as I'm aware, there isn't branching logic available to show/hide answer options. If you were sending the survey to one recipient at a time, and you really wanted to have one question with modified answer options, you could theoretically do something like this:
POST /v3/surveys/<id>/pages/<id>/questions
{
    "family": "matrix",
    "subtype": "rating",
    "answers": {
        "rows": [
            {"text": "Burgers", "visible": true, "position": 1},
            {"text": "Pizza", "visible": true, "position": 2},
            {"text": "Hotdogs", "visible": true, "position": 3},
            {"text": "Fried chicken", "visible": false, "position": 4},
            {"text": "French fries", "visible": false, "position": 5},
            {"text": "Tacos", "visible": false, "position": 6}
        ],
        "choices": [
            {"text": "1", "position": 1},
            {"text": "2", "position": 2},
            {"text": "3", "position": 3},
            {"text": "4", "position": 4},
            {"text": "5", "position": 5}
        ]
    },
    "headings": [
        {"heading": "On a scale of 1 (least) to 5 (most), how much do you like eating:"}
    ],
    "forced_ranking": false
}
And then PATCH visible on the answer options between true and false for each recipient; that way you can analyse everything on the same question. But that's not really ideal, as this changes the survey for everyone taking it, limiting you to one recipient taking the survey at a time.
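That PATCH might look roughly like this (a sketch on my part; how existing rows are identified in the update payload is an assumption, so check the v3 docs):

PATCH /v3/surveys/<id>/pages/<id>/questions/<question_id>
{
    "answers": {
        "rows": [
            {"id": "<row_id>", "visible": false}
        ]
    }
}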
Given you are planning on moving the data to pandas anyway, why not just separate this into four different questions? Then just use advanced branching to hide/show questions based on a custom value on the recipient. That way you can have a rule that's something like:
if contact.custom1 is exactly "fried" then hide question 1 and show question 2
Then you can export all your data or fetch your responses through the API
GET /v3/surveys/<id>/responses/bulk
This will give you a JSON of all the responses, which you can move to pandas. There may be other ways to do what you'd like, but given the available functionality, these are a couple of examples that may be helpful.
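Since you mentioned Python and pandas: a minimal sketch of piping a recipient's answer options out of a DataFrame into the create-question call could look like the following. The endpoint and Bearer-token auth come from the v3 docs; the table layout and the build_rows helper are made up for illustration.

import requests
import pandas as pd

# Hypothetical table of answer options per recipient
options = pd.DataFrame({
    "recipient": ["A", "A", "A", "B", "B", "B"],
    "option": ["Burgers", "Pizza", "Hotdogs", "Fried chicken", "French fries", "Tacos"],
})

def build_rows(df, recipient):
    # Turn one recipient's options into the "rows" array of the question payload
    opts = df.loc[df["recipient"] == recipient, "option"].tolist()
    return [{"text": text, "position": i + 1} for i, text in enumerate(opts)]

payload = {
    "family": "matrix",
    "subtype": "rating",
    "answers": {
        "rows": build_rows(options, "A"),
        "choices": [{"text": str(n), "position": n} for n in range(1, 6)],
    },
    "headings": [{"heading": "On a scale of 1 (least) to 5 (most), how much do you like eating:"}],
    "forced_ranking": False,
}

resp = requests.post(
    "https://api.surveymonkey.com/v3/surveys/<survey_id>/pages/<page_id>/questions",
    headers={"Authorization": "Bearer <ACCESS_TOKEN>"},
    json=payload,
)
resp.raise_for_status()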

Group data by timeframe on a datetime Highcharts graph

Consider the following series:
[
    {
        "x": 1447840800000,
        "y": 199214,
        "num_messages": 6
    },
    {
        "x": 1447842600000,
        "y": 27152,
        "num_messages": 3
    },
    {
        "x": 1447844400000,
        "y": 349919,
        "num_messages": 7
    }
    ...
]
The timestamps are each half an hour apart. I'm currently showing two weeks' worth of data like this on a datetime-formatted Highcharts column chart. I would like to give users the ability to view it in different increments of time, not just the default half-hour. When larger increments are chosen, the y and num_messages values should be summed for whichever data points fall within a given increment.
Currently, I am manipulating the series array manually to create a new series with the desired time increments. But can Highcharts essentially do this for me? I tried googling and looking through the API but didn't come up with anything.
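One option worth checking: the Highstock build ships a dataGrouping feature that regroups points into larger time units client-side, with a sum approximation for the y values. A sketch of the relevant configuration, assuming the chart is created with Highstock (I haven't verified it against this exact series):

plotOptions: {
    column: {
        dataGrouping: {
            forced: true,
            approximation: 'sum',
            units: [
                ['hour', [1, 2, 6, 12]],
                ['day', [1]]
            ]
        }
    }
}

Note that dataGrouping only aggregates the standard point values such as y; a custom property like num_messages would still need to be summed manually.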
