I'm new to the RxJS world and I'm kinda lost with this. Hope someone can help me out.
I have a source observable (Firebase through AngularFire) which constantly emits a lot of data (up to 50 or 80 emissions in a 2s window) at random spike times. Since this is slowing down my project's performance, I thought the correct way to handle it would be to group the emissions in an array and afterwards do a single transaction with all the data received and insert it in the store.
The result I'm looking for is something like the following:
Taking into account that I'd place a "hold" amount of time of 3s, I'd like the following result:
        1s        1.5s
--> 30 --> 60 --> 100

      1s              2s
--> 5 --> 1 --> 50 --> 70

[30, 60, 100] --> emitted as one batch, 1.5s after the first value
[5, 1, 50, 70] --> emitted as one batch, 2s after the first value
The values in the array would be the emissions received in that specific window of time, starting from the first emission received. After that amount of time, it would "restart" and initialize on the next batch of emissions (which could actually arrive in 1 second or in 2 hours; the interval would then again collect the emissions received for 2s).
What I've tried so far was using window and buffer; maybe I'm not using these correctly or I'm just dumb, but I can't seem to get the result I just explained.
filter((snapshot) => { if (snapshot.payload.val().reference) { return snapshot; } }),
window(interval(2000)),
mergeAll(),
withTransaction((snapshots: any[]) => {
  snapshots.forEach(snapshot => {
    if (snapshot.type === 'child_added') {
      this.store.add(snapshot.key, snapshot.val());
    } else if (snapshot.type === 'child_changed') {
      this.store.replace(snapshot.key, snapshot.val());
    } else if (snapshot.type === 'child_removed') {
      this.store.remove(snapshot.key);
    }
  });
})
I don't even know if it's possible with RxJS (I guess so, I have seen many cool things around), but any suggestions or a guide for making it through this would be greatly appreciated.
Thanks a lot in advance!
note: withTransaction is a custom operator.
not positive what you're after, but seems like you want bufferTime?
import { timer } from 'rxjs';
import { bufferTime } from 'rxjs/operators';

const source = timer(0, 500);
const buffered = source.pipe(bufferTime(3000));
buffered.subscribe(val => console.log(val));
this will emit all values collected within the buffer period as an array every 3 seconds.
blitz demo: https://stackblitz.com/edit/rxjs-vpu97e?file=index.ts
in your example I THINK you'd just use it as:
filter((snapshot) => snapshot.payload.val().reference), // this is all you need for filter
bufferTime(2000),
withTransaction((snapshots: any[]) => {
  ...
})
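one caveat: bufferTime emits an empty array for every interval in which the source stays silent, so if you only want batches once emissions actually arrive (the "restart on the next batch" behavior you described), filtering out empty buffers gets close. A minimal sketch, with a plain Subject standing in for the AngularFire source:

import { Subject } from 'rxjs';
import { bufferTime, filter } from 'rxjs/operators';

// Stand-in for the AngularFire snapshot stream from the question.
const snapshots$ = new Subject<number>();

snapshots$
  .pipe(
    bufferTime(2000),                   // collect everything emitted in each 2s window
    filter((batch) => batch.length > 0) // drop the empty arrays emitted while the source is idle
  )
  .subscribe((batch) => {
    // One transaction per non-empty batch, e.g. your custom withTransaction logic.
    console.log('batch:', batch);
  });

snapshots$.next(30);
snapshots$.next(60);
snapshots$.next(100);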
I would like to use K6 in order to measure the time it takes to process 1,000,000 requests (in total) made to an API.
Scenario
Execute 1,000,000 (1 million in total) GET requests with 50 concurrent users/threads, so every user/thread executes 20,000 requests.
I've managed to create such a scenario with Artillery.io, but I'm not sure how to create the same one while using K6. Could you point me in the right direction in order to create the scenario? (Most examples are using a pre-defined duration, but in this case I don't know the duration -> this is exactly what I want to measure).
Artillery yml
config:
  target: 'https://localhost:44000'
  phases:
    - duration: 1
      arrivalRate: 50
scenarios:
  - flow:
      - loop:
          - get:
              url: "/api/Test"
        count: 20000
K6 js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  iterations: 1000000,
  vus: 50
};

export default function () {
  let res = http.get('https://localhost:44000/api/Test');
  check(res, { 'success': (r) => r.status === 200 });
}
The iterations + vus you've specified in your k6 script options would result in a shared-iterations executor, where VUs will "steal" iterations from the common pile of 1m iterations. So, the faster VUs will complete slightly more than 20k requests, while the slower ones will complete slightly less, but overall you'd still get 1 million requests. And if you want to see how quickly you can complete 1m requests, that's arguably the better way to go about it...
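Spelled out as an explicit scenario, that default behavior would look something like this (a sketch; 'million_hits_shared' is just a name I picked):

export let options = {
  scenarios: {
    'million_hits_shared': {
      executor: 'shared-iterations', // VUs pull iterations from a common pool of 1m
      vus: 50,
      iterations: 1000000,
      maxDuration: '2h', // the default is only 10 minutes, so raise it
    },
  },
};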
However, if having exactly 20k requests per VU is a strict requirement, you can easily do that with the aptly named per-vu-iterations executor:
export let options = {
  discardResponseBodies: true,
  scenarios: {
    'million_hits': {
      executor: 'per-vu-iterations',
      vus: 50,
      iterations: 20000,
      maxDuration: '2h',
    },
  },
};
In any case, I strongly suggest setting maxDuration to a high value, since the default value is only 10 minutes for either executor. And discardResponseBodies will probably help with the performance, if you don't care about the response body contents.
btw you can also do in k6 what you've done in Artillery: have 50 VUs start a single iteration each and then just loop the http.get() call 20,000 times inside of that one single iteration, as sketched below. You won't get a very nice UX that way; the k6 progress bars will be frozen until the very end, since k6 will have no idea of your actual progress inside of each iteration, but it will also work.
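A sketch of that looped variant (same URL as the question; per-vu-iterations with a single big iteration per VU):

import http from 'k6/http';
import { check } from 'k6';

export let options = {
  scenarios: {
    'million_hits_looped': {
      executor: 'per-vu-iterations',
      vus: 50,
      iterations: 1, // one big iteration per VU...
      maxDuration: '2h',
    },
  },
};

export default function () {
  // ...with the 20k requests looped inside it, like the Artillery config.
  for (let i = 0; i < 20000; i++) {
    let res = http.get('https://localhost:44000/api/Test');
    check(res, { 'success': (r) => r.status === 200 });
  }
}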
I'm getting unexpected results streaming in the cloud.
My pipeline looks like:
SlidingWindow(60min).every(1min)
  .triggering(Repeatedly.forever(
      AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime
          .pastFirstElementInPane()
          .plusDelayOf(Duration.standardSeconds(30)))
      )
  )
  .withAllowedLateness(15sec)
  .accumulatingFiredPanes()
.apply("Get UniqueCounts", ApproximateUnique.perKey(.05))
.apply("Window hack filter", ParDo(
    if (window.maxTimestamp.isBeforeNow())
      c.output(element)
))
.toJSON()
.toPubSub()
If that filter isn't there, I get output for all 60 overlapping windows, apparently because the Pub/Sub sink isn't window-aware.
So in the examples below, if each time period is a minute, I'd expect to see the unique count grow until the 60-minute mark, when the sliding window closes.
Using DirectRunner, I get expected results:
t1: 5
t2: 10
t3: 15
...
tx: growing unique count
In Dataflow, I get weird results:
t1: 5
t2: 10
t3: 0
t4: 0
t5: 2
t6: 0
...
tx: wrong unique count
However, if my unbounded source has older data, I'll get normal-looking results until it catches up, at which point I'll get the wrong results.
I was thinking it had to do with my window filter, but removing that didn't change the results.
If I do a Distinct() then Count().perKey(), it works, but that slows my pipeline considerably.
What am I overlooking?
[Update from the comments]
ApproximateUnique inadvertently resets its accumulated value when the result is extracted. This is incorrect when the value is read more than once, as with windows firing multiple times. Fix (will be in version 2.4): https://github.com/apache/beam/pull/4688
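For reference, the exact-but-slower Distinct() + Count workaround mentioned in the question looks roughly like this (a sketch; keyedValues is a placeholder for the windowed PCollection<KV<String, String>> feeding the count):

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Deduplicate the keyed values first, then count the survivors per key.
// Exact counts, but the extra shuffle makes it slower than ApproximateUnique.
PCollection<KV<String, Long>> uniqueCounts =
    keyedValues
        .apply("Dedup", Distinct.<KV<String, String>>create())
        .apply("CountUniques", Count.<String, String>perKey());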
I'm taking a PCollection of sessions and trying to get the average session duration per channel/connection. My early triggers are firing once for each window produced: with 60-minute windows sliding every 1 minute, an early trigger fires 60 times. Looking at the timestamps on the outputs, there's a window every minute for 60 minutes into the future. I'd like the trigger to fire once for the most recent window, so that every 10 seconds I have an average of session durations for the last 60 minutes.
I've used sliding windows before and had the results I expected. By mixing sliding and sessions windows, I'm somehow causing this.
Let me paint you a picture of my pipeline:
First, I'm creating sessions based on active users:
.apply("Add Window Sessions",
Window.<KV<String, String>> into(Sessions.withGapDuration(Duration.standardMinutes(60)))
.withOnTimeBehavior(Window.OnTimeBehavior.FIRE_ALWAYS)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Group Sessions", Latest.perKey())
Steps after this create a session object, compute the session duration, etc. This ends with a PCollection<Session>.
I create a KV<connection, duration> from the PCollection<Session>.
Then I apply the sliding window and then the mean.
.apply("Apply Rolling Minute Window",
Window. < KV < String, Integer >> into(
SlidingWindows
.of(Duration.standardMinutes(60))
.every(Duration.standardMinutes(1)))
.triggering(
Repeatedly.forever(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime
.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(10)))
)
)
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes()
)
.apply("Get Average", Mean.perKey())
It's at this point that I'm seeing issues. What I'd like to see is a single output per key with the average duration. What I'm actually seeing is 60 outputs for the same key, one for each minute into the next 60 minutes.
With this log in a DoFn, with c being the ProcessContext:
LOG.info(c.pane().getTiming() + " " + c.timestamp());
I get this output 60 times with timestamps 60 minutes into the future:
EARLY 2017-12-17T20:41:59.999Z
EARLY 2017-12-17T20:43:59.999Z
EARLY 2017-12-17T20:56:59.999Z
(cont)
The log was printed at Dec 17, 2017 19:35:19.
The number of outputs is always window size / slide duration. So if I did 60-minute windows every 5 minutes, I would get 12 outputs.
I think I've made sense of this.
Sliding windows create a new window at each .every() interval. Early firings apply to each window, so getting multiple firings makes sense.
In order to fit my use case and only output the "current" window, I'm checking c.pane().isFirst() == true before outputting results (sketched below) and adjusting the .every() to control the frequency.
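A sketch of that check, continuing the pipeline above (the element type assumes the KV<String, Double> coming out of Mean.perKey(); the step name is mine):

// Pass along only the first pane fired in each window, dropping the
// firings that belong to the other overlapping sliding windows.
.apply("Keep First Pane Only", ParDo.of(new DoFn<KV<String, Double>, KV<String, Double>>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        if (c.pane().isFirst()) {
            c.output(c.element());
        }
    }
}))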
We have a DataFlow job that is subscribed to a PubSub stream of events. We have applied sliding windows of 1 hour with a 10 minute period. In our code, we perform a Count.perElement to get the counts for each element and we then want to run this through a Top.of to get the top N elements.
At a high level:
1) Read from pubSub IO
2) Window.into(SlidingWindows.of(windowSize).every(period)) // windowSize = 1 hour, period = 10 mins
3) Count.perElement
4) Top.of(n, comparisonFunction)
What we're seeing is that the window is being applied twice so data seems to be watermarked 1 hour 40 mins (instead of 50 mins) behind current time. When we dig into the job graph on the Dataflow console, we see that there are two groupByKey operations being performed on the data:
1) As part of Count.perElement. Watermark on the data from this step onwards is 50 minutes behind current time which is expected.
2) As part of the Top.of (in the Combine.PerKey). The watermark on this seems to be another 50 minutes behind the current time. Thus, data in steps below this is watermarked 1 hr 40 mins behind.
This ultimately manifests in some downstream graphs being 50 minutes late.
Thus it seems like every time a GroupByKey is applied, windowing kicks in afresh.
Is this expected behavior? Anyway we can make the windowing only be applicable for the Count.perElement and turn it off after that?
Our code is something on the lines of:
final int top = 50;
final Duration windowSize = standardMinutes(60);
final Duration windowPeriod = standardMinutes(10);
final SlidingWindows window = SlidingWindows.of(windowSize).every(windowPeriod);

options.setWorkerMachineType("n1-standard-16");
options.setWorkerDiskType("compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
options.setJobName(applicationName);
options.setStreaming(true);
options.setRunner(DataflowPipelineRunner.class);

final Pipeline pipeline = Pipeline.create(options);

// Get events
final String eventTopic =
    "projects/" + options.getProject() + "/topics/eventLog";
final PCollection<String> events = pipeline
    .apply(PubsubIO.Read.topic(eventTopic));

// Create toplist
final PCollection<List<KV<String, Long>>> topList = events
    .apply(Window.into(window))
    .apply(Count.perElement()) // as eventIds are repeated
    // get top n to get top events
    .apply(Top.of(top, orderByValue()).withoutDefaults());
Windowing is not applied each time there is a GroupByKey. The lag you were seeing was likely the result of two issues, both of which should be resolved.
The first was that data that was buffered for later windows at the first group by key was preventing the watermark from advancing, which meant that the earlier windows were getting held up at the second group by key. This has been fixed in the latest versions of the SDK.
The second was that the sliding windows were causing the amount of data to increase significantly. A new optimization has been added which uses the combine (you mentioned Count and Top) to reduce the amount of data.
In the answer to the question on Stack and in the book here, on page 52, I found that the usual getTickCount/getTickFrequency combination for measuring execution time gives the time in milliseconds. However, the OpenCV website says it gives the time in seconds. I am confused. Please help...
There is no room for confusion; all the references you have given point to the same thing.
getTickCount gives you the number of clock cycles after a certain event, e.g., after the machine is switched on.
A = getTickCount() // A = no. of clock cycles from the beginning, say 100
process(image)     // do whatever processing you want
B = getTickCount() // B = no. of clock cycles from the beginning, say 150
C = B - A          // C = no. of clock cycles taken for processing, 150 - 100 = 50,
                   // it is obvious, right?
Now you want to know how many seconds these clock cycles take. For that, you need to know how long a single clock cycle takes, i.e., the clock_time_period. If you find that, simply multiply it by 50 to get the total time taken.
For that, OpenCV gives a second function, getTickFrequency(). It gives you the frequency, i.e., how many clock cycles occur per second. Take its reciprocal to get the time period of the clock:
time_period = 1 / frequency
Now you have the time_period of one clock cycle; multiply it by 50 to get the total time taken in seconds.
Now read all those references you have given once again; you will get it.
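Putting the arithmetic together, here is a minimal C++ sketch (process(image) is a placeholder for whatever you are timing):

#include <opencv2/core.hpp>
#include <cstdint>
#include <iostream>

int main() {
    int64_t start = cv::getTickCount();   // clock ticks at start

    // process(image);  // ... do whatever work you want to time ...

    int64_t stop = cv::getTickCount();    // clock ticks at end

    // elapsed ticks / (ticks per second) = elapsed seconds
    double seconds = (stop - start) / cv::getTickFrequency();
    std::cout << "Time taken: " << seconds << " s" << std::endl;
    return 0;
}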
(Note that the Win32 GetTickCount() used below is an entirely different function: it returns the number of milliseconds since the system started, which is likely where the milliseconds confusion comes from.)

dwStartTimer = GetTickCount();
dwEndTimer = GetTickCount();
while ((dwEndTimer - dwStartTimer) < wDelay) // delay is 5000 milliseconds
{
    Sleep(200);
    dwEndTimer = GetTickCount();
    if (PeekMessage(&uMsg, NULL, 0, 0, PM_REMOVE) > 0)
    {
        TranslateMessage(&uMsg);
        DispatchMessage(&uMsg);
    }
}