Stream de-duplication on Dataflow | Running services on Dataflow workers - google-cloud-dataflow

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive carries IDs, and we want to remove data with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (Bigtable or something similar) where we look up keys and write them if required, but our QPS is extremely large, which makes maintaining such a service pretty hard. The alternative approach I came up with is to groupBy within a time window so that all data for a user within a time window falls into the same group, and then, within each group, use a separate key-store service to look up duplicates by key. So, I have a few questions about this approach:
[1] If I run a groupBy transform, is there any guarantee that each group will be processed on the same slave? If that is guaranteed, we can group by the userid and then, within each group, compare the sessionid for each user.
[2] If that is feasible, my next question is whether we can run other services on each of the slave machines that run the job - in the example above, I would like to have a local Redis running on each slave, which each group can then use to look up or write an ID to.
The idea seems outside of what Dataflow is meant for, but I believe such use cases should be common - so if there is a better model for approaching this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible, given the amount of data we have.

1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for a key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, which can potentially cause out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
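For reference, a minimal sketch of what that might look like (Dataflow 1.x SDK assumed; the String element type and the ids collection are placeholders for illustration):
// import com.google.cloud.dataflow.sdk.transforms.RemoveDuplicates;
PCollection<String> ids = ...;
PCollection<String> deduped = ids.apply(RemoveDuplicates.<String>create());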
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
    Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap elapses. If you also apply a speculative early-firing trigger of one element together with an AfterWatermark trigger on a downstream GroupByKey, the trigger will fire as soon as it can: once after it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't part of an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
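A minimal sketch of that last filtering stage, assuming the Dataflow 1.x SDK's PaneInfo API and a made-up KeepFirstFiringFn name (imports assumed: DoFn, PaneInfo, KV from the SDK):
// Keep only elements from EARLY (speculative) panes, i.e. the first firing per
// session, and drop the later ON_TIME/LATE re-firings of the same session.
static class KeepFirstFiringFn<T> extends DoFn<KV<String, Iterable<T>>, T> {
  @Override
  public void processElement(ProcessContext c) {
    if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
      for (T element : c.element().getValue()) {
        c.output(element);
      }
    }
  }
}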
It is sort of unclear from your description; can you provide more details?

Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow-up questions related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields. We also know that duplicates occur pretty far apart and, for now, we care about those within a 6-hour window. Regarding the volume of data, we have at least 100K events every second, spanning about a million different users - so within this 6-hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
    Window.<KV<key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where key is a combination of the 3 ids I mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have a very large number of open sessions at any given time, waiting for them to close, and I was worried whether that would scale or whether I would run into any bottlenecks.
[2] Once I perform the sessioning as above, I have already removed the duplicates based on the firings, since I only care about the first firing in each session, which already discards duplicates. I no longer need the RemoveDuplicates transform, which I found is a combination of (WithKeys, Combine.PerKey, Values) transforms, in that order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id and session-id, ignoring the sequence-id, and then run a RemoveDuplicates by sequence-id on top of each resulting window. This might reduce the number of open sessions, but it would still leave a lot of them (#users * #sessions per user), which can easily run into the millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in that scenario becomes infeasible.
Hope my problem is a little clearer this time. Please let me know whether any of my approaches make the best use of Dataflow or if I am missing something.
Thanks

I tried out this solution at a larger scale and, as long as I provide sufficient workers and disks, the pipeline scales well, although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks at c.pane().getTiming() and rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo, and it looks like the on-time panes are actually deduped more thoroughly than the early ones. That is, the count of early firings still contains some duplicates, whereas the count of on-time firings is lower, with more duplicates removed. Is there any reason this could happen? Also, is my approach of deduping using a Combine + ParDo the right one, or could I do something better?
events.apply(
    WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
      @Override
      public String apply(EventInfo input) {
        return input.getUniqueKey();
      }
    })
)
.apply(
    Window.named("sessioner").<KV<String, EventInfo>>into(
        Sessions.withGapDuration(mSessionGap)
    )
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterPane.elementCountAtLeast(1))
    )
    .withAllowedLateness(Duration.ZERO)
    .accumulatingFiredPanes()
);
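For reference, the Combine.perKey + pane-filter ParDo described above might look roughly like the following hedged sketch (Dataflow 1.x SDK assumed). Here sessionedEvents stands for the windowed PCollection<KV<String, EventInfo>> produced above, and the pick-first combiner is an assumption, not necessarily what was actually used:
sessionedEvents
    .apply(Combine.<String, EventInfo>perKey(
        new SerializableFunction<Iterable<EventInfo>, EventInfo>() {
          @Override
          public EventInfo apply(Iterable<EventInfo> values) {
            return values.iterator().next(); // any one element per key will do
          }
        }))
    .apply(ParDo.of(new DoFn<KV<String, EventInfo>, EventInfo>() {
      @Override
      public void processElement(ProcessContext c) {
        // Keep only the speculative (EARLY) pane so each key is emitted once;
        // the ON_TIME pane for the same session is dropped.
        if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
          c.output(c.element().getValue());
        }
      }
    }));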

Related

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream - one that only generates a certain number of items and then stops emitting new items, but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...))) // extends UnboundedSource
    .apply(Window.into(FixedWindows.of(Duration.millis(5000))))
    .apply(new SomeTransform())
    .apply(Combine.globally(new SomeCombineFn()).withoutDefaults());
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5-second window, then a new set of data is polled and therefore generated. Instead, all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner, but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing them on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected, or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5-second windows to your data is not a way to prescribe how the data should be processed, only how the end results of aggregations over that data are grouped. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline, the windowing you applied has no effect.
As to whether it is supposed to work that way: the Beam model doesn't specify any particular processing behavior, so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally, streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to emit elements slowly over a long period of time, you would likely see more processing happen in between reads from the source.
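If it helps to see that interleaving, here is a hedged sketch using Beam's TestStream (supported by the DirectRunner) to advance the watermark between elements, so one window's result can fire before the next "read"; the summed Longs here simply stand in for the question's SomeTransform/SomeCombineFn:
// import org.apache.beam.sdk.testing.TestStream; plus VarLongCoder, TimestampedValue,
// Sum, Combine, Window, FixedWindows, and joda-time Instant/Duration.
TestStream<Long> source = TestStream.create(VarLongCoder.of())
    .addElements(TimestampedValue.of(1L, new Instant(0L)))
    .advanceWatermarkTo(new Instant(6000L))   // past the first 5s window
    .addElements(TimestampedValue.of(1L, new Instant(7000L)))
    .advanceWatermarkToInfinity();

pipeline.apply(source)
    .apply(Window.<Long>into(FixedWindows.of(Duration.millis(5000))))
    .apply(Combine.globally(Sum.ofLongs()).withoutDefaults());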

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values, but how do I go about excluding the host from the original 'global' one?
This is a simple scenario, but it needs to meet the needs of 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens to hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements',100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 AND the hostname tag equals influxdb, OR if value exceeds 100.
There is an example in the documentation handling scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually and directly in Chronograf/Kapacitor is not feasible for a big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer object, and the number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates) and provide an interface to the user where we only expose the relevant variables. A service (written in Python) then combines the values for those variables with a tickscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.

Wrapper for slow indicator for backtesting

If a technical indicator works very slowly and I wish to include it in an EA (using iCustom()), is there some "wrapper" that could cache the indicator results to a file, based on the particular indicator inputs?
That way I could get better speed the next time I backtest it with the same set of parameters, since the "wrapper" could read the result from the file rather than recalculate it from the indicator.
I heard that some developers did that for their needs in order to speed up backtesting, but as far as I know, there's no publicly available solution.
If I had to solve this problem, I would create a class with two fields (datetime and indicator value, or N buffers of the indicator), and a collection class similar to CArrayObj.mqh but with an option to apply binary search, or to start looking for an element from a specific index rather than from the very beginning of the array.
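Purely to illustrate the shape of that data structure, here is a hedged sketch in Java (the class and method names are made up; an MQL4 version would use a CArrayObj-style container instead):
// Indicator-value cache: entries kept sorted by bar time, looked up with
// binary search instead of scanning from the start of the array.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class IndicatorCache {
  private final List<Long> times = new ArrayList<>();    // bar open times, ascending
  private final List<Double> values = new ArrayList<>(); // one buffer; use more lists for N buffers

  void append(long time, double value) {  // bars arrive in time order
    times.add(time);
    values.add(value);
  }

  Double lookup(long time) {              // null if the bar was never cached
    int i = Collections.binarySearch(times, time);
    return i >= 0 ? values.get(i) : null;
  }
}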
Recent MT4 Builds added VERY restrictive conditions for Indicators
In early years of MT4, this was not so cruel as it is these days.
FACT#1: fileIO is 10.000x ~ 100.000x slower than memIO:
This means, there is no benefit from "pre-caching" values to disk.
FACT#2: Processing Performance has HARD CEILING:
All, yes ALL, Custom Indicators, that are being used in MetaTrader4 Terminal ( be it directly in GUI, or indirectly, via Template(s) or called via iCustom() calls & in Strategy Tester via .tpl + iCustom() ) ALL THESE SHARE A SINGLE THREAD ...
FACT#3: Strategy Tester has the most demanding needs for speed:
Thus - eliminate all, indeed ALL, non-core indicators from tester.tpl template and save it as "blank", to avoid any part of such non-core processing.
Next, re-design the Custom Indicator, where possible, so as to avoid any CPU-ops & MEM-allocation(s), that are not necessary.
I remember a Custom Indicator design with indeed deep convolutions, which could be re-engineered so as to keep just a triangular sparse-matrix with the necessary updates; that increased the speed of the Indicator processing by more than 10.000x, so code-revision is the way.
So, rather run a separate MetaTrader4 Terminal, just for BackTesting, than have to wait for many hours just due to the incompressible nature of numerical processing under a traffic-jam congestion in the shared use of the CustomIndicator-solo-Thread, which no scheduling could improve.
FACT#4: O/S can increase a process priority:
Having got to the Devil's zone, it is a common practice to spin-up the PRIO for the StrategyTester MT4, up to the "RealTime PRIO" in the O/S tools.
One may even additionally "lock" this MT4-process onto a certain CPU-core(s) and setup all other processes with adjacent CPU-core-AFFINITY, so that these two distinct groups of processes do not jump one to the other group's CPU-core(s). Hard, but if squeezing the performance to the bleeding edge, this is a must.

Why is “group by” so much slower than my custom CombineFn or asList()?

Mistakenly, many months ago, I used the following logic to essentially implement the same functionality as the PCollectionView's asList() method:
I assigned a dummy key of “000” to every element in my collection
I then did a groupBy on this dummy key so that I would essentially get a list of all my elements in a single array list
The above logic worked fine when I only had about 200 elements in my collection. When I ran it on 5000 elements, it ran for hours until I finally killed it. Then, I created a custom “CombineFn” in which I essentially put all the elements into my own local hash table. That worked, and even on my 5000 element situation, it ran within about a minute or less. Similarly, I later learned that I could use the asList() method, and that too ran in less than a minute. However, what concerns me – and what I don't understand – is why the group by took so long to run (even with only 200 elements it would take more than a few seconds) and with 5000 it ran for hours without seeming to accomplish anything.
I took a look at the group by code, and it seems to be doing a lot of steps I don't quite understand… Is it somehow related to the fact that the group by statement is attempting to run things across a cluster? Or is it maybe related to whether I'm using an efficient coder? Or am I missing something else? The reason I ask is that there are certain situations in which I'm forced to use a group by statement, because the data set is way too large to fit in any single computer's RAM. However, I am concerned that I'm not understanding how to properly use a group by statement, since it seems to be so slow…
There are a couple of things that could be contributing to the slow performance. First, if you are using SerializableCoder, that is quite slow, and AvroCoder will likely work better for you. Second, if you are iterating through the Iterable in your ParDo after the GBK and you have enough elements, you will exceed the cache and end up fetching the same data many times. If you need to iterate multiple times, explicitly putting the data in a container will help. Both the CombineFn and asList approaches do this for you.
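A hedged sketch of both suggestions (Dataflow 1.x SDK assumed; MyRecord and the surrounding pipeline are hypothetical):
// 1) Prefer AvroCoder over SerializableCoder for the element type.
// import com.google.cloud.dataflow.sdk.coders.AvroCoder;
// import com.google.cloud.dataflow.sdk.coders.DefaultCoder;
@DefaultCoder(AvroCoder.class)
static class MyRecord { /* fields plus a no-arg constructor */ }

// 2) After the GroupByKey, copy the grouped Iterable into a container once,
//    so repeated passes don't re-fetch the same data.
.apply(ParDo.of(new DoFn<KV<String, Iterable<MyRecord>>, Long>() {
  @Override
  public void processElement(ProcessContext c) {
    List<MyRecord> records = new ArrayList<>();
    for (MyRecord r : c.element().getValue()) {
      records.add(r);
    }
    // ... multiple passes over 'records' go here ...
    c.output((long) records.size());
  }
}));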

Omniture: Creating Specific Context Variables

Was wondering if anyone out there can help.......
My company works in the travel industry, and one of the products we provide is the ability to buy a flight and hotel together.
One of the advantages of this is that sometimes a visitor can save on a hotel if they buy the package together.
What I want to be able to track is the following:
The hotel which has the saving on it (accommodation code); the saving that they will make; and the price of the package that they will pay.
I am new to implementing but have been told by a colleague that I can use a context variable.
Would anyone be able to tell me how I should write this please?
Kind Regards
Yaser
Here is the documentation entry for Context Data Variables.
For example, in the custom code section of the on-page code, within s_doPlugins or via some wrapper function that ultimately makes a s.t() or s.tl() call, you would have:
s.contextData['package.code'] = "accommodation code";
s.contextData['package.savings'] = "savings";
s.contextData['package.price'] = "price";
Then in the interface you can go to processing rules and map them to whatever props or eVars you want.
Having said that...processing rules are pretty basic at the moment, and to be honest, it's not really worth it IMO. Firstly, you have to get certified (take an exam and pass) to even access processing rules. It's not that big a deal, but it's IMO a pointless hoop to jump through (tip: if you are going to go ahead and take this step, be sure to study up on more than just processing rules. Despite the fact that the exam/certification is supposed to be about processing rules, there are several questions that have little to nothing to do with them)
2nd, context data doesn't show up in reports by itself. You must assign the values to actual props/eVars/events through processing rules (or get ClientCare to use them in a VISTA rule, which is significantly more powerful than a processing rule, but costs lots of money).
3rd, the processing rules are pretty basic. Seriously, you're limited to just simple stuff like straight duping, concatenating values, etc.
4th, processing rules are limited in setting events, and won't let you set the products string. IOW, you can set a basic (counter) event, but not a numeric or currency event (an event with a custom value associated with it). The reason I mention this is that those price and savings values might be good as a numeric or currency event for calculated metrics. Since you can't set an event like that via processing rules, you'd have to set the events in your page code anyway.
The only real benefit here is if you're simply looking to dupe them into a prop/eVar and that prop/eVar varies from report suite to report suite (which, FYI, most people try to keep consistent across report suites anyway, and rarely repurpose).
So if you are already being consistent across multiple report suites (or only have one report suite in the first place), then since you're already having to put some code on the site, there's no real incentive to use context data - you may as well just pop the values into the props/eVars directly in the first place.
I guess the overall point here is that since the overall goal is to get the values into actual props, eVars and possibly events, and processing rules fail on a lot of levels, there's no compelling reason not to just pop them in the first place.
