KSQL Hopping Window : any way to get only one record in response? - ksqldb

We are using KSQL to perform some aggregations / filtering on real-time data.
One of our use cases is to run an operation over the last N days of a particular activity, continuously.
So this needs to be a hopping window.
When I tried the query with a window size of M days (advancing by one day), KSQL returned M records instead of the single one I was hoping for.
Query :
select PROTO,
TIMESTAMPTOSTRING(WindowStart(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC') as "timestamp",
TIMESTAMPTOSTRING(WindowEnd(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC'),
COUNT(PROTO) AS Count
FROM DATASTREAM
WINDOW HOPPING (SIZE 5 DAYS, ADVANCE BY 1 DAY)
WHERE MSG like '%SOMESTRING%'
AND SPLIT(PROTO, '/')[0] = 'tcp'
GROUP BY PROTO;
tcp/22 | 2020-01-27T00:00:00Z | 2020-02-01T00:00:00Z | 1
tcp/22 | 2020-01-28T00:00:00Z | 2020-02-02T00:00:00Z | 1
tcp/22 | 2020-01-29T00:00:00Z | 2020-02-03T00:00:00Z | 1
tcp/22 | 2020-01-30T00:00:00Z | 2020-02-04T00:00:00Z | 1
tcp/22 | 2020-01-31T00:00:00Z | 2020-02-05T00:00:00Z | 1
Is there any way to get only the first record, or only the records whose window end time is <= the current time, or any other workaround to get one result per window?
Please consider below data records.
{ "time": "2020-01-25 23:36:37 UTC", "msg": "Error"}
{ "time": "2020-01-25 23:36:38 UTC", "msg": "Error"}
{ "time": "2020-01-25 23:36:40 UTC", "msg": "Error"}
{ "time": "2020-01-26 23:36:37 UTC", "msg": "Error"}
{ "time": "2020-01-26 23:36:38 UTC", "msg": "Error"}
{ "time": "2020-01-26 23:36:39 UTC", "msg": "Error"}
{ "time": "2020-01-26 23:36:40 UTC", "msg": "Error"}
{ "time": "2020-01-27 23:36:37 UTC", "msg": "Error"}
{ "time": "2020-01-27 23:36:38 UTC", "msg": "Error"}
{ "time": "2020-01-27 23:36:39 UTC", "msg": "Error"}
{ "time": "2020-01-28 23:36:37 UTC", "msg": "Error"}
{ "time": "2020-01-28 23:36:38 UTC", "msg": "Error"}
{ "time": "2020-01-29 23:36:37 UTC", "msg": "Error"}
{ "time": "2020-01-29 23:36:38 UTC", "msg": "Error"}
{ "time": "2020-01-29 23:36:39 UTC", "msg": "Error"}
{ "time": "2020-01-29 23:36:40 UTC", "msg": "Error"}
I am looking for the count of records whose msg is Error over the past 2 days.
If I fire the KSQL query from the 25th at 23:36:37 onwards, I would expect results like:
2020-01-25T23:36:37Z | 1
2020-01-25T23:36:38Z | 2
2020-01-25T23:36:40Z | 3
2020-01-26T23:36:37Z | 4
2020-01-26T23:36:38Z | 5
2020-01-26T23:36:39Z | 6
2020-01-26T23:36:40Z | 7
2020-01-27T23:36:37Z | 5
2020-01-27T23:36:38Z | 6
2020-01-27T23:36:39Z | 7
2020-01-28T23:36:37Z | 4
2020-01-28T23:36:38Z | 5
2020-01-29T23:36:37Z | 3
2020-01-29T23:36:38Z | 4
2020-01-29T23:36:39Z | 5
2020-01-29T23:36:40Z | 6
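The expected counts above match one particular reading of "past 2 days": for each event, count every Error from midnight of the previous calendar day up to and including that event. A plain-Python check of the numbers (this is an interpretation, not KSQL; the helper name is mine):

```python
from datetime import datetime, timedelta

events = [
    "2020-01-25 23:36:37", "2020-01-25 23:36:38", "2020-01-25 23:36:40",
    "2020-01-26 23:36:37", "2020-01-26 23:36:38", "2020-01-26 23:36:39",
    "2020-01-26 23:36:40", "2020-01-27 23:36:37", "2020-01-27 23:36:38",
    "2020-01-27 23:36:39", "2020-01-28 23:36:37", "2020-01-28 23:36:38",
    "2020-01-29 23:36:37", "2020-01-29 23:36:38", "2020-01-29 23:36:39",
    "2020-01-29 23:36:40",
]
times = [datetime.strptime(e, "%Y-%m-%d %H:%M:%S") for e in events]

def rolling_count(now, times):
    # Window start: midnight of the day before `now`.
    start = (now - timedelta(days=1)).replace(hour=0, minute=0, second=0)
    return sum(1 for t in times if start <= t <= now)

counts = [rolling_count(t, times) for t in times]
print(counts)  # [1, 2, 3, 4, 5, 6, 7, 5, 6, 7, 4, 5, 3, 4, 5, 6]
```

This reproduces exactly the table above, which suggests the window wanted is "current day plus the previous day" rather than a strict 48-hour window.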

I think you need a TUMBLING window instead if you want one window for the five day period. You're instead getting five windows because you've used HOPPING with an advance of 1 DAY - see the WINDOWSTART() changes per day.
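If one row per five-day period is what you want, the same aggregation with a tumbling window is a starting point (a sketch adapted from the query above, not tested against your stream):

```sql
SELECT PROTO,
       TIMESTAMPTOSTRING(WindowStart(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC') AS "timestamp",
       TIMESTAMPTOSTRING(WindowEnd(), 'yyyy-MM-dd''T''HH:mm:ss''Z''', 'UTC'),
       COUNT(PROTO) AS Count
FROM DATASTREAM
  WINDOW TUMBLING (SIZE 5 DAYS)
WHERE MSG LIKE '%SOMESTRING%'
  AND SPLIT(PROTO, '/')[0] = 'tcp'
GROUP BY PROTO;
```

Each event then falls into exactly one window per five-day period, so you get one row per PROTO per window instead of five overlapping ones.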
References:
Hopping windows twitter thread
Tumbling windows twitter thread
https://docs.ksqldb.io/en/latest/concepts/time-and-windows-in-ksqldb-queries/
https://rmoff.net/2020/01/09/exploring-ksqldb-window-start-time/

Related

Problems with same query in mongoid rails db but different parameters

I'm using the mongoid gem ('mongoid', '~> 7.2.4', MongoDB 3.6) with Rails 5, and I have a database with customer and bill collections related like this:
class Bill
...
belongs_to :customer, index: true
...
end
class Customer
....
has_many :bills
...
end
then in a pry console I test with two clients:
[55] pry(main)> c_slow.class
=> Customer
[58] pry(main)> c_slow.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1030 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1030 | db_api_production.aggregate | SUCCEEDED | 0.008s
=> 523
[59] pry(main)> c_fast.bills.count
MONGODB | pro-app-mongodb-05:27017 req:1031 conn:1:1 | db_api_production.aggregate | STARTED | {"aggregate"=>"bills", "pipeline"=>[{"$match"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003')}}, {"$group"=>{"_id"=>1, "n"=>{"$sum"=>1}}}], "cursor"=>{}, "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#...
MONGODB | pro-app-mongodb-05:27017 req:1031 | db_api_production.aggregate | SUCCEEDED | 0.135s
=> 35913
until this moment it seems correct but when I execute this query:
[60] pry(main)> c_slow.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1083 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('60c76b9e21225c002044f6c5'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1083 | db_api_production.find | SUCCEEDED | 10.075s
MONGODB | pro-app-mongodb-05:27017 req:1087 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd7ba5f8 #value=165481790189>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd7a4b90 #seconds=1652511506, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1087 | db_api_production.getMore | SUCCEEDED | 1.181s
[61] pry(main)> c_fast.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference, :_id)
MONGODB | pro-app-mongodb-05:27017 req:1091 conn:1:1 | db_api_production.find | STARTED | {"find"=>"bills", "filter"=>{"deleted_at"=>nil, "customer_id"=>BSON::ObjectId('571636f44a506256d6000003'), "_id"=>{"$ne"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}, "limit"=>1000, "sort"=>{"reference"=>-1}, "projection"=>{"reference"=>1, "_id"=>1...
MONGODB | pro-app-mongodb-05:27017 req:1091 | db_api_production.find | SUCCEEDED | 0.004s
MONGODB | pro-app-mongodb-05:27017 req:1092 conn:1:1 | db_api_production.getMore | STARTED | {"getMore"=>#<BSON::Int64:0x0000558bcd89c4d0 #value=166614148534>, "collection"=>"bills", "$db"=>"db_api_production", "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x0000558bcd88eab0 #seconds=1652511516, #increment=1>, "signature"=>{"hash"=><...
MONGODB | pro-app-mongodb-05:27017 req:1092 | db_api_production.getMore | SUCCEEDED | 0.013s
The slow customer takes 10 seconds and the fast one 0.004s for the same query, yet the slow customer has fewer than 600 documents and the fast one more than 35000. It makes no sense to me.
We ran a reindex on the bills collection and timed the query over all customers; it seemed to work at the beginning, but on the second run it went slow again, and the same customers are always slower than the fastest one:
[1] pry(main)> Customer.all.collect do |c|
[1] pry(main)* starting = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* c.bills.excludes(_id: BSON::ObjectId('62753df4a54d7584e56ea829')).order(reference: :desc).limit(1000).pluck(:reference_string, :id);nil
[1] pry(main)* ending = Process.clock_gettime(Process::CLOCK_MONOTONIC)
[1] pry(main)* [c.acronym, ending - starting]
[1] pry(main)* end
I cannot apply explain to a pluck query. I reviewed the index and it is correctly in place on the collection,
but running explain on the same query is just as slow:
MONGODB | pro-app-mongodb-05:27017 req:1440 | dbapiproduction.explain | SUCCEEDED | 10.841s
MONGODB | pro-app-mongodb-05:27017 req:2005 | dbapiproduction.explain | SUCCEEDED | 0.006s
The difference is obviously the time, but also docsExamined.
The query is the same, apart (obviously) from the ids:
[23] pry(main)> h_slow["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('60c76b9e21225c002044f6c5')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
[24] pry(main)> h_fast["queryPlanner"]["parsedQuery"]
=> {"$and"=>
[{"customer_id"=>{"$eq"=>BSON::ObjectId('571636f44a506256d6000003')}},
{"deleted_at"=>{"$eq"=>nil}},
{"$nor"=>[{"_id"=>{"$eq"=>BSON::ObjectId('62753df4a54d7584e56ea829')}}]}]}
Fast customer plan:
"inputStage": {
"advanced": 1000,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 0,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 0,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 1000,
"multiKeyPaths": {
"reference": []
},
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 1000
},
"invalidates": 0,
"isEOF": 0,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "FETCH",
"works": 1000
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 1000,
"needTime": 0,
"needYield": 0,
"restoreState": 10,
"saveState": 10,
"stage": "LIMIT",
"works": 1001
},
"executionSuccess": true,
"executionTimeMillis": 7,
"nReturned": 1000,
"totalDocsExamined": 1000,
"totalKeysExamined": 1000
}
Slow customer plan:
"inputStage": {
"advanced": 604411,
"direction": "backward",
"dupsDropped": 0,
"dupsTested": 0,
"executionTimeMillisEstimate": 320,
"indexBounds": {
"reference": [
"[MaxKey, MinKey]"
]
},
"indexName": "reference_1",
"indexVersion": 2,
"invalidates": 0,
"isEOF": 1,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"reference": 1
},
"keysExamined": 604411,
"multiKeyPaths": {
"reference": []
},
"nReturned": 604411,
"needTime": 0,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"seeks": 1,
"seenInvalidated": 0,
"stage": "IXSCAN",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "FETCH",
"works": 604412
},
"invalidates": 0,
"isEOF": 1,
"limitAmount": 1000,
"nReturned": 523,
"needTime": 603888,
"needYield": 0,
"restoreState": 6138,
"saveState": 6138,
"stage": "LIMIT",
"works": 604412
},
"executionSuccess": true,
"executionTimeMillis": 9472,
"nReturned": 523,
"totalDocsExamined": 604411,
"totalKeysExamined": 604411
}
Why do these differences happen, and what can I do to correct this collection?
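Both explain outputs show the same plan shape: an IXSCAN over reference_1 walked backwards with bounds [MaxKey, MinKey], filtering by customer_id during the FETCH. That is cheap when a customer's bills happen to sit near the end of the reference index (1000 keys examined) and very expensive when they do not (604411 keys examined to return 523 documents). A compound index covering both the filter and the sort would typically let either customer seek directly to its own bills in reference order. A hypothetical sketch, not from the original post:

```ruby
class Bill
  include Mongoid::Document
  belongs_to :customer, index: true
  # Hypothetical compound index: MongoDB can seek straight to one
  # customer's bills, already ordered by reference, instead of
  # scanning the whole reference_1 index and filtering.
  index({ customer_id: 1, reference: -1 })
end
```

After declaring it, the index still has to be built (e.g. with Mongoid's create_indexes rake task) before the planner can use it.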

Multiple field sort_by combination of reverse and not

So I got this array of objects from the database and I have to sort it in this order:
by score (descending)
by count (descending)
by name (ascending)
I've tried all three in reverse, but the last one should not be reversed. Here's my code:
new_data = data.sort_by{ |t| [t.score, t.matches_count, t.name] }.reverse
RESULT
[
{
"id": null,
"team_id": 939,
"name": "DAV",
"matches_count": 2,
"score": 100.0
},
{
"id": null,
"team_id": 964,
"name": "SAN",
"matches_count": 1,
"score": 100.0
},
{
"id": null,
"team_id": 955,
"name": "PAS",
"matches_count": 1,
"score": 100.0
},
{
"id": null,
"team_id": 954,
"name": "PAR",
"matches_count": 1,
"score": 100.0
},
{
"id": null,
"team_id": 952,
"name": "NUE",
"matches_count": 1,
"score": 100.0
}
]
The expected result should be sorted by name in ASC order, not DESC. I know my code is wrong because t.name is inside the .reverse, but if I reorder by name alone after the first two keys I get the wrong answer: it just sorts everything by name, not by all three. I've also tried .order("name DESC") in the query so that the reverse would make it ASC, but no luck. Thank you!
data = [{:score=>100.0, :matches_count=>2, :name=>"DAV"},
{:score=>100.0, :matches_count=>1, :name=>"SAN"},
{:score=>100.0, :matches_count=>1, :name=>"PAS"},
{:score=>110.0, :matches_count=>1, :name=>"PAR"},
{:score=>100.0, :matches_count=>1, :name=>"NUE"}]
data.sort_by{ |h| [-h[:score], -h[:matches_count], h[:name]] }
#=> [{:score=>110.0, :matches_count=>1, :name=>"PAR"},
# {:score=>100.0, :matches_count=>2, :name=>"DAV"},
# {:score=>100.0, :matches_count=>1, :name=>"NUE"},
# {:score=>100.0, :matches_count=>1, :name=>"PAS"},
# {:score=>100.0, :matches_count=>1, :name=>"SAN"}]
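When the records are objects that respond to methods rather than hashes, as in the question, the same negation trick applies. A sketch using a Struct as a stand-in for the database records:

```ruby
# Stand-in for the records returned by the query in the question.
Team = Struct.new(:score, :matches_count, :name)

data = [Team.new(100.0, 2, "DAV"),
        Team.new(100.0, 1, "SAN"),
        Team.new(110.0, 1, "PAR")]

# Negate the numeric keys to sort them descending; name stays ascending.
new_data = data.sort_by { |t| [-t.score, -t.matches_count, t.name] }
new_data.map(&:name)
#=> ["PAR", "DAV", "SAN"]
```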
You could also use Array#sort, which does not require the values sorted in descending order to be numeric, so long as they are comparable, that is, they respond to the method :<=>.
data.sort do |t1, t2|
case t1[:score] <=> t2[:score]
when -1
1
when 1
-1
else
case t1[:matches_count] <=> t2[:matches_count]
when -1
1
when 1
-1
else
t1[:name] <=> t2[:name]
end
end
end
#=> <as above>
Instead of using - one can also try to reverse a sort using !, as demonstrated here: https://www.ruby-forum.com/t/sort-by-multiple-fields-with-reverse-sort/197361/2
Note, however, that ! only reverses boolean fields: !t.score on a number is always false, so every record would compare equal on that key and the score ordering would be lost. For numeric fields like score and matches_count, stick with negation:
new_data = data.sort_by { |t| [-t.score, -t.matches_count, t.name] }
which gives the result as intended.

How do I perform a “diff” on two set of json buckets using Apache Beam Python SDK?

I would like to compare the run results of my pipeline by getting the diff between JSONs that share a schema but contain different data.
Run1 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
{"doc_id": 1, "entity": "New York", "start": 30, "end": 38} # Missing from Run2
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
Run2 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7} # same as in Run1
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10} # different end span
{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15} # added in Run2, not in Run1
Based on the answer linked below, my approach has been to make a tuple out of some of the JSON values and then cogroup on that large composite key: How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?
Is there a better way to diff jsons with beam?
Code based on linked answer:
def make_kv_pair(x):
    # Records may arrive as JSON strings; parse them first.
    if x and isinstance(x, str):  # basestring is Python 2 only
        x = json.loads(x)
    # Key each record by the composite (doc_id, entity).
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)

class FilterDoFn(beam.DoFn):
    def process(self, element):
        # Tuple parameters in the signature were removed in Python 3.
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        else:  # same length but different content
            yield pvalue.TaggedOutput('changed', key)
Pipeline code:
table_a = (p | 'ReadJSONRun1' >> ReadFromText("run1.json")
| 'SetKeysRun1' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadJSONRun2' >> ReadFromText("run2.json")
| 'SetKeysRun2' >> beam.Map(make_kv_pair))
joined_tables = ({'table_a': table_a, 'table_b': table_b}
| beam.CoGroupByKey())
output_types = ['changed', 'added', 'removed', 'unchanged']
key_collections = (joined_tables
| beam.ParDo(FilterDoFn()).with_outputs(*output_types))
# Now you can handle each output
key_collections.unchanged | "WriteUnchanged" >> WriteToText("unchanged/", file_name_suffix="_unchanged.json.gz")
key_collections.changed | "WriteChanged" >> WriteToText("changed/", file_name_suffix="_changed.json.gz")
key_collections.added | "WriteAdded" >> WriteToText("added/", file_name_suffix="_added.json.gz")
key_collections.removed | "WriteRemoved" >> WriteToText("removed/", file_name_suffix="_removed.json.gz")
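The cogroup-and-compare logic can be sanity-checked outside Beam with plain dictionaries. A minimal sketch using the two runs above (the function names here are illustrative, not part of the pipeline):

```python
import json
from collections import defaultdict

def make_key(record):
    # Same composite key as the pipeline: (doc_id, entity).
    return (record["doc_id"], record["entity"])

def diff_runs(run1_lines, run2_lines):
    # Emulates CoGroupByKey followed by FilterDoFn.
    grouped = defaultdict(lambda: {"table_a": [], "table_b": []})
    for line in run1_lines:
        record = json.loads(line)
        grouped[make_key(record)]["table_a"].append(record)
    for line in run2_lines:
        record = json.loads(line)
        grouped[make_key(record)]["table_b"].append(record)

    out = {"unchanged": [], "added": [], "removed": [], "changed": []}
    for key, values in grouped.items():
        a, b = values["table_a"], values["table_b"]
        if a == b:
            out["unchanged"].append(key)
        elif len(a) < len(b):
            out["added"].append(key)
        elif len(a) > len(b):
            out["removed"].append(key)
        else:
            out["changed"].append(key)
    return out

run1 = ['{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
        '{"doc_id": 1, "entity": "New York", "start": 30, "end": 38}',
        '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}']
run2 = ['{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
        '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}',
        '{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15}']
result = diff_runs(run1, run2)
```

On the sample data this classifies Anthony as unchanged, New York as removed, Istanbul as changed, and Karim as added, matching the inline comments in the question.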

How can I add a background to a pattern in Highcharts?

I am currently working with Highcharts in combination with the pattern-fill module. When I set a pattern for a series in the chart, the pattern is shown, but it has a transparent background. I need to set an additional background because the pattern overlaps another series which I don't want to see behind it. You can check this fiddle. So basically I don't want to see those three columns on the left behind the pattern. Any ideas how I can do that? I haven't seen any option to set an additional background, but maybe you know some trick. This is the code I am using for the pattern:
"color": {
"pattern": {
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11"
},
"width": 10,
"height": 10,
"opacity": 1,
"color": "rgb(84,198,232)"
}
}
You need to set the fill attribute as a path property:
"color": {
"pattern": {
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11",
fill: 'red'
},
"width": 10,
"height": 10,
"opacity": 1,
"color": 'rgb(84,198,232)'
}
}
Live demo: https://jsfiddle.net/BlackLabel/m9rxwej5/
I guess there's been an update: backgroundColor should now be set at the pattern's root level:
"color": {
"pattern": {
"backgroundColor": 'red',
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11",
},
"width": 10,
"height": 10,
"opacity": 1,
"color": 'rgb(84,198,232)',
}
}
https://jsfiddle.net/vL4fqhao/

Eventbrite API sort_by date doesn't work?

I am trying to get a list of events from the Eventbrite API, and sort them by date.
However I seem to get an event list that is not ordered by date.
Here is an example:
$ curl "https://www.eventbrite.com/json/event_search?app_key=MY_API_KEY&country=GB&sort_by=date" | jsonlint | grep start_date
"start_date": "2013-03-23 09:00:00",
"start_date": "2013-03-23 09:00:00",
"start_date": "2013-03-31 13:00:00",
"start_date": "2013-01-18 01:45:00",
"start_date": "2013-03-28 09:00:00",
"start_date": "2013-02-01 05:55:00",
"start_date": "2013-02-01 05:55:00",
"start_date": "2013-02-01 05:55:00",
"start_date": "2013-03-23 10:00:00",
"start_date": "2013-01-01 00:00:00",
"start_date": "2012-12-12 00:00:00",
"start_date": "2013-04-09 19:00:00",
"start_date": "2013-04-13 09:00:00",
"start_date": "2013-04-17 18:15:00",
"start_date": "2013-04-17 19:00:00",
"start_date": "2013-04-17 19:00:00",
(The non-indented start_date values are the ones for events)
I know it is accepting the sort_by=date parameter as it is returned in the results summary:
{
"summary": {
"total_items": 14911,
"first_event": 4673700163,
"last_event": 5844441883,
"filters": {
"country": "GB",
"sort_by": "date"
},
"num_showing": 10
}
},
I've assumed the date it sorts by is the start_date, but manually inspecting the other dates (end_date, created, modified) it doesn't seem to be sorting by that either.
Is this a bug in the API or am I doing something wrong?
afaik, Eventbrite allows you to sort by location OR by date, but not both.
It's possible that since you selected a country / region (GB), the date sorting is not being applied.
