Session windows with Kafka Streams do not behave as expected

I am a bit of a newbie with Kafka Streams, but I have noticed behavior I was not expecting. I have developed an app which consumes from 6 topics. My goal is to group (or join) the events from every topic by an internal field, and that part is working fine. My issue is with the window time: it looks like the end time of each cycle affects all the aggregations in progress at that moment. Is there only one timer for all the aggregations running at the same time? I was expecting each aggregation to leave the aggregation process as soon as its own configured 30 seconds had elapsed. I think that should be possible, because I can see in the Windowed windowedRegion variable that the windowedRegion.window().start() and windowedRegion.window().end() values are different for every stream.
This is my code:
streamsBuilder
    .stream(topicList, Consumed.with(Serdes.String(), Serdes.String()))
    .groupBy(new MyGroupByKeyValueMapper(), Serialized.with(Serdes.String(), Serdes.String()))
    .windowedBy(SessionWindows.with(windowInactivity).until(windowDuration))
    .aggregate(
        new MyInitializer(),
        new MyAggregator(),
        new MyMerger(),
        Materialized.with(new Serdes.StringSerde(), new PaymentListSerde())
    )
    .mapValues(new MyMapper())
    .toStream(new MyKeyValueMapper())
    .to(consolidationTopic, Produced.with(Serdes.String(), Serdes.String()));

I'm not sure if this is what you're asking, but yes: every aggregation (every per-key session window) may be updated multiple times. You will not generally get just one message per window with the final result for that session window on your "consolidation" topic. This is explained in more detail here:
https://stackoverflow.com/a/38945277/7897191
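If you do want exactly one final result per session window, newer Kafka Streams versions (2.1+) provide suppress(). The sketch below is only an illustration, not part of the original answer: it reuses the mappers and serdes from the question, swaps the deprecated Serialized for Grouped, and the 10-second grace period is an arbitrary assumption.

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.*;

// Hedged sketch: forward a single, final result per session window.
streamsBuilder
    .stream(topicList, Consumed.with(Serdes.String(), Serdes.String()))
    .groupBy(new MyGroupByKeyValueMapper(), Grouped.with(Serdes.String(), Serdes.String()))
    // grace() bounds how long out-of-order records may still join a session.
    .windowedBy(SessionWindows.with(Duration.ofSeconds(30)).grace(Duration.ofSeconds(10)))
    .aggregate(
        new MyInitializer(),
        new MyAggregator(),
        new MyMerger(),
        Materialized.with(new Serdes.StringSerde(), new PaymentListSerde())
    )
    // Buffer intermediate updates and emit only when the window closes.
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .mapValues(new MyMapper())
    .toStream(new MyKeyValueMapper())
    .to(consolidationTopic, Produced.with(Serdes.String(), Serdes.String()));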

Related

select from system$stream_has_data returns error - parameter must be a valid stream name... hmm?

I'm trying to see if there is data in a stream, and I provided the exact stream name as follows:
Select SYSTEM$STREAM_HAS_DATA('STRM_EXACT_STREAM_NAME_GIVEN');
But I get an error:
SQL compilation error: Invalid value ['STRM_EXACT_STREAM_NAME_GIVEN'] for function 'SYSTEM$STREAM_HAS_DATA', parameter 1: must be a valid stream name
1) Any idea why? How can this error be resolved?
2) Would it hurt to resume a set of tasks (alter task resume;) without knowing whether the corresponding stream has data in it? I believe that if there is (delta) data in the stream, the task will load it; if not, the task won't do anything.
3) Any idea how to modify/update a stream that shows up as 'STALE'? Or should simply loading fresh data into the table associated with the stream set it back to 'NOT STALE', i.e. stale = false? And what if loading the associated table does not update the state of the task? (That is what currently appears to be happening in my case.)
1) It doesn't look like you have a stream by that name. Try running SHOW STREAMS; to see which streams are active in the database/schema you are currently using.
2) If your task has a WHEN clause that validates against the SYSTEM$STREAM_HAS_DATA result, then resuming the task and letting it run on schedule only hits your global services layer (no warehouse credits are consumed), so there is no harm there.
3) STALE means that the stream's data wasn't consumed by a DML statement for a long time (I think it's 14 days by default, or, if the table's data retention is longer than 14 days, the longer of the two). Loading more data into the stream's source table doesn't help that. Running a DML statement against the stream will, but since the stream is already stale, doing so may have bad consequences. Streams are meant to be used for frequent DML, so not running DML against a stream for longer than 14 days is very uncommon.
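To make the above concrete, here is a minimal SQL sketch. The database, schema, task, warehouse, and target-table names are hypothetical; only the stream name comes from the question:

-- 1) Verify the stream exists and check its name exactly as Snowflake shows it.
SHOW STREAMS;

-- Fully qualifying the name avoids current-database/schema mismatches.
SELECT SYSTEM$STREAM_HAS_DATA('MY_DB.MY_SCHEMA.STRM_EXACT_STREAM_NAME_GIVEN');

-- 2) A task gated by the stream: resuming it is harmless when the stream is empty.
CREATE OR REPLACE TASK MY_DB.MY_SCHEMA.MY_TASK
  WAREHOUSE = MY_WH
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('MY_DB.MY_SCHEMA.STRM_EXACT_STREAM_NAME_GIVEN')
AS
  -- 3) This DML consumes the stream, which is what resets its staleness clock.
  INSERT INTO MY_DB.MY_SCHEMA.MY_TARGET
  SELECT * FROM MY_DB.MY_SCHEMA.STRM_EXACT_STREAM_NAME_GIVEN;

ALTER TASK MY_DB.MY_SCHEMA.MY_TASK RESUME;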

Event Consolidation when there is no defined time window for Event Arrival

We have a single topic called migrationstatus; assume we partition it so that all instances and events for a given MigrationCandidateNumber always end up on the same partition.
The following event arrives on 12-10-2019 at 10:00 AM:
{
"MigrationCandidateNumber": 54545451,
"MigrationStatus":"Final Bill Produced"
}
The following event arrives on 14-10-2019 at 08:00 AM:
{
"MigrationCandidateNumber": 54545451,
"MigrationStatus":"Product Ready"
}
The following event arrives on 17-10-2019 at 12:00 AM:
{
"MigrationCandidateNumber": 54545451,
"MigrationStatus":"Registration Complete"
}
Problem Statement:
Once all 3 of those events have been processed, we need to produce the event below onto the migrationstatus-out topic:
{
"MigrationCandidateNumber": **54545451**,
"MigrationStatus":"**Ready for extract 2**"
}
The wide time window is deliberate, since the first 3 events could arrive days apart.
What is the best way of doing this with no external database?
Solutions tried:
We can't use a windowed aggregation because we are not sure when the events will arrive.
I created 3 streams out of the main stream for the different migration statuses, but stream-stream joins are again windowed.
For this scenario I don't see a way to aggregate the data into a KSQL table and perform a GROUP BY to check whether messages with all the statuses have arrived.
I know it's a wide-open question, mostly about the approach to solving the problem rather than a technical issue, but I couldn't find a better forum to post it.
I have solved this problem and shared the code on GitHub. Please follow the link for the solution.
GitHub link for the solution
Thanks Matthias J. Sax for the heads up.
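Since the linked repository isn't reproduced here, the following is only a sketch of the general approach for this kind of problem: an unwindowed groupByKey/aggregate that accumulates the statuses seen per MigrationCandidateNumber in a state store and forwards a consolidated event once all three have arrived. It assumes the record key is the MigrationCandidateNumber and the value is the MigrationStatus string; StatusSetSerde is a hypothetical serde for Set<String>.

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("migrationstatus", Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    // No window: state lives until all statuses arrive, however far apart.
    .aggregate(
        () -> new HashSet<String>(),
        (candidateNumber, status, seen) -> { seen.add(status); return seen; },
        Materialized.with(Serdes.String(), new StatusSetSerde()))
    .toStream()
    // A Set ignores duplicates; forward only once all 3 distinct statuses are in.
    .filter((candidateNumber, seen) -> seen.size() == 3)
    .mapValues((candidateNumber, seen) ->
        "{\"MigrationCandidateNumber\": " + candidateNumber
            + ", \"MigrationStatus\": \"Ready for extract 2\"}")
    .to("migrationstatus-out", Produced.with(Serdes.String(), Serdes.String()));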

How to call multiple APIs in parallel for load testing (using Gatling)?

I'm currently trying to load test my APIs using Gatling, but there is a very specific test I want to perform. I would like to simulate a virtual user calling all 16 of my APIs simultaneously, and to repeat this multiple times so I can get an idea of the average time it takes my APIs to respond when they are all called together at the same time.
The method I used was:
Creating a scenario for each of my APIs.
Calling every one of the scenarios in my setUp().
Injecting 60 users into every scenario, with a throttle of 1 request per second held for 60 seconds.
The aim was to have 60 iterations of what I wanted.
FYI I'm using Gatling 3.1.2
// This is what all my scenarios look like
val bookmarkScn = scenario("Bookmarks").exec(
  http("Listing bookmarks")
    .get("/bookmarks")
    .check(status.is(200))
)
// My setUp
setUp(
  bookmarkScn.inject(
    atOnceUsers(60)
  ).throttle(
    jumpToRps(1),
    holdFor(60)
  ),
  permissionScn.inject(
    atOnceUsers(60)
  ).throttle(
    jumpToRps(1),
    holdFor(60)
  ),
  // ... all the other scenarios, one after the other
).protocols(httpConfig)
I got some results with this method, but they are not at all what I was expecting, and if I keep the test running for too long, eventually every call just times out.
I expected the calls simply to take longer than usual (e.g. going from 100 ms per API to 300 ms).
My question is: is this method correct? Can you help me achieve my goal?
What you've got should work, but there's probably an easier way to specify this injection. Instead of
bookmarkScn.inject(
atOnceUsers(60)
).throttle(
jumpToRps(1),
holdFor(60)
),
you could use
bookmarkScn.inject(
constantUsersPerSec(1) during (60 seconds)
),
In terms of your results, I'd expect the issue lies somewhere downstream of Gatling: 16 concurrent users making simple GET requests is very straightforward for Gatling to handle. You may have performance issues elsewhere in your app or in the infrastructure in between.
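Putting that together, the whole setUp could be reduced to something like this sketch (it assumes the Gatling 3.1 Scala DSL and the scenario names from the question):

import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Each scenario gets one fresh user per second for 60 seconds,
// i.e. 60 iterations per API without needing an explicit throttle.
setUp(
  bookmarkScn.inject(constantUsersPerSec(1) during (60.seconds)),
  permissionScn.inject(constantUsersPerSec(1) during (60.seconds))
  // ... the other 14 scenarios injected the same way
).protocols(httpConfig)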

XGrabPointer poll till next event or pipe

I was trying to write a mouse event listener. This was my approach; can you please tell me whether it will work before I start writing it? I'm writing it with ctypes, so if I write all the ctypes bindings (a couple of days) and then find out it doesn't work, it's a big loss of time.
My goal is to be able to cancel the poll via a pipe. This was my approach:
In another thread, call XInitThreads()
Open the display with XOpenDisplay()
XGrabPointer on that display
Get the file descriptor with ConnectionNumber(display)
Connect to the pipe that was created on the main thread
Do a pselect() on the pipe and the fd from step 4, with the timeout set to NULL (no timeout)
Is this the right approach?
Thanks
If you are using threads, you are already sharing variables between threads. It would be much simpler to use a global variable that is set when the poll must be aborted; then, in your watch thread, run a tight loop that checks that variable and uses a short timeout in pselect(). This may introduce a short delay, but if you keep the timeout short (say, 100 ms) it will be hardly noticeable and still efficient.
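A minimal C sketch of that loop, assuming the display is already open and the pointer grabbed; the flag name and the 100 ms timeout are illustrative, and error handling is omitted:

#include <stdatomic.h>
#include <stdbool.h>
#include <sys/select.h>
#include <time.h>
#include <X11/Xlib.h>

/* Set from another thread when the poll must be aborted. */
static atomic_bool abort_poll = false;

void watch_loop(Display *display) {
    int fd = ConnectionNumber(display);  /* fd of the X server connection */
    while (!atomic_load(&abort_poll)) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        /* Short timeout so the abort flag is re-checked every 100 ms. */
        struct timespec timeout = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
        if (pselect(fd + 1, &readfds, NULL, NULL, &timeout, NULL) > 0) {
            while (XPending(display)) {
                XEvent ev;
                XNextEvent(display, &ev);
                /* handle mouse events here */
            }
        }
    }
}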

RxSwift: Receive events immediately, unless the last event was processed within a certain interval

New to RxSwift / ReactiveX. Basically, what I'm trying to do is make a server call whenever something happens, while making sure it's not done more often than every 10 seconds, and less often if possible.
For instance, whenever a "needs update" event is generated, I'd like to call the server immediately if more than 10 seconds have passed since my last call. If less time has passed, I'd like to make the call at the 10-second mark after the previous one. It doesn't matter how many events have been generated within those 10 seconds.
I looked at the description of throttle, but it appears to starve if events happen very quickly, which isn't desirable.
How can I achieve this?
There's a proposed new operator for RxSwiftExt that would give you what you're looking for, I think. However, it doesn't exist yet, so you might want to keep an eye on the issue:
https://github.com/RxSwiftCommunity/RxSwiftExt/issues/10
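In the meantime, depending on the exact semantics you need, RxSwift's built-in throttle with latest: true may already come close: unlike debounce, it doesn't starve under a rapid event stream; it emits the first event immediately and at most one trailing event per 10-second window. A minimal sketch; needsUpdate and callServer are hypothetical names:

import RxSwift

func callServer() { /* make the network request */ }

let disposeBag = DisposeBag()
let needsUpdate = PublishSubject<Void>()

needsUpdate
    // Fires immediately when idle; otherwise fires the latest event
    // at the 10-second mark after the previous emission.
    .throttle(.seconds(10), latest: true, scheduler: MainScheduler.instance)
    .subscribe(onNext: { callServer() })
    .disposed(by: disposeBag)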
