Set num of output shard in Write.to(Sink) in dataflow - google-cloud-dataflow

I am having a customized sink extending FileBasedSink to which I write to by calling PCollection.apply(Write.to(MySink)) in dataflow (very simpler to XmlSink.java). However it seems by default simply calling Write.to will always result to 3 output shards? Is there any way that I could define the number of output shard (like TextTO.Write.withNumShards) just in customized sink class definition? or I have to define another customized PTransformer like TextIO.Write?

Unfortunately, right now FileBasedSink does not support specifying the number of shards.
In practice, the number of shards you get will be dependent on how the framework chooses to optimize the parts of the pipeline producing the collection you're writing, so there's essentially no control over that.
I've filed a JIRA issue for your request so you can subscribe to the status.

Related

Apache Beam - Parallelize Google Cloud Storage Blob Downloads While Maintaining Grouping of Blobs

I’d like to be able to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS). i.e.PCollection<Iterable<String>> --> PCollection<Iterable<String>> where the starting PCollection is an Iterable of file paths and the resulting PCollection is Iterable of file contents. Alternatively, PCollection<String> --> PCollection<Iterable<String>> would also work and perhaps even be preferable, where the starting PCollection is a glob pattern, and the resulting PCollection is an iterable of file contents which matched the glob.
My use-case is that at a point in my pipeline I have as input PCollection<String>. Each element of the PCollection is a GCS glob pattern. It’s important that files which match the glob be grouped together because the content of the files–once all files in a group are read–need to be grouped downstream in the pipeline. I originally tried using FileIO.matchAll and a subsequently GroupByKey . However, the matchAll, window, and GroupByKey combination lacked any guarantee that all files matching the glob would be read and in the same window before performing the GroupByKey transform (though I may be misunderstanding Windowing). It’s possible to achieve the desired results if a large time span WindowFn is applied, but it’s still probabilistic rather than a guarantee that all files will be read before grouping. It’s also the main goal of my pipeline to maintain the lowest possible latency.
So my next, and currently operational, plan was to use an AsyncHttpClient to fan out fetching file contents via GCS HTTP API. I feel like this goes against the grain in Beam and is likely sub-optimal in terms of parallelization.
So I’ve started investigating SplittableDoFn . My current plan is to allow splitting such that each entity in the input Iterable (i.e. each matched file from the glob pattern) could be processed separately. I've been able to modify FileIO#MatchFn (defined here in the Java SDK) to provide mechanics for PCollection<String> -> PCollection<Iterable<String>> transform between input of GCS glob patterns and output of Iterable of matches for the glob.
The challenge I’ve encountered is: how do I go about grouping/gathering the split invocations back into a single output value in my DoFn? I’ve tried using stateful processing and using a BagState to collect file contents along the way, but I realized part way along that the ProcessElement method of a splittable DoFn may only accept ProcessContext and Restriction tuples, and no other args therefore no StateId args referring to a StateSpec (throws an invalid argument error at runtime).
I noticed in the FilePatternWatcher example in the official SDF proposal doc that a custom tracker was created wherein FilePath Objects kept in a set and presumably added to the set via tryClaim. This seems as though it could work for my use-case, but I don’t see/understand how to go about implementing a #SplitRestriction method using a custom RestrictionTracker.
I would be very appreciative if anyone were able to offer advice. I have no preference for any particular solution, only that I want to achieve the ability to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS).
Would joining the output PCollections help you?
PCollectionList
.of(collectionOne)
.and(collectionTwo)
.and(collectionThree)
...
.apply(Flatten.pCollections())

Apache Beam: read from UnboundedSource with fixed windows

I have an UnboundedSource that generates N items (it's not in batch mode, it's a stream -- one that only generates a certain amount of items and then stops emitting new items but a stream nonetheless). Then I apply a certain PTransform to the collection I'm getting from that source. I also apply the Window.into(FixedWindows.of(...)) transform and then group the results by window using Combine. So it's kind of like this:
pipeline.apply(Read.from(new SomeUnboundedSource(...)) // extends UnboundedSource
.apply(Window.into(FixedWindows.of(Duration.millis(5000))))
.apply(new SomeTransform())
.apply(Combine.globally(new SomeCombineFn()).withoutDefaults())
And I assumed that would mean new events are generated for 5 seconds, then SomeTransform is applied to the data in the 5 seconds window, then a new set of data is polled and therefore generated. Instead all N events are generated first, and only after that is SomeTransform applied to the data (but the windowing works as expected). Is it supposed to work like this? Does Beam and/or the runner (I'm using the Flink runner but the Direct runner seems to exhibit the same behavior) have some sort of queue where it stores items before passing it on to the next operator? Does that depend on what kind of UnboundedSource is used? In my case it's a generator of sorts. Is there a way to achieve the behavior that I expected or is it unreasonable? I am very new to working with streaming pipelines in general, let alone Beam. I assume, however, it would be somewhat illogical to try to read everything from the source first, seeing as it's, you know, unbounded.
An important thing to note is that windows in Beam operate on event time, not processing time. Adding 5 second windows to your data is not a way to prescribe how the data should be processed, only the end result of aggregations for that processing. Further, windows only affect the data once an aggregation is reached, like your Combine.globally. Until that point in your pipeline the windowing you applied has no effect.
As to whether it is supposed to work that way, the beam model doesn't specify any specific processing behavior so other runners may process elements slightly differently. However, this is still a correct implementation. It isn't trying to read everything from the source; generally streaming sources in Beam will attempt to read all elements available before moving on and coming back to the source later. If you were to adjust your stream to stream in elements slowly over a long period of time you will likely see more processing in between reading from the source.

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be setup as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, ie instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
|from()
.database(db)
.retentionPolicy(rp)
.measurement(measurement)
.groupBy(groupBy)
.where(whereFilter)
|eval(lambda: "numMeasurements")
.as('value')
var customized = data
|sideload()
.source('file:///etc/kapacitor/customizations/demo/')
.order('hosts/host-{{.hostname}}.yaml')
.field('maxNumMeasurements',100)
|log()
var trigger = customized
|alert()
.crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert will be triggered if value and hence numMeasurements will exceed 10 AND the hostname tag equals influxdb OR if value exceeds 100.
There is an example in the documentation handling scheduled downtimes using sideload
Furthermore, I have created an example available on github using docker-compose
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually directly in Chronograph/Kapacitor is not feasible for big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, customer_objects. The number can go into the 1000s. We've opted for a custom solution where keep a standard set of template tickscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where only expose relevant variables. After that a service (written in python) combines the values for those variables with a tickscript and using the Kapacitor API deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.

How to access "key" in combine.perKey in beam

In How to create custom Combine.PerKey in beam sdk 2.0, I asked and got a correct answer on how to create a custom Combine.PerKey in the new beam sdk 2.0. However, I now need to create a custom combinePerKey such that within my custom CombinePerKey logic, I need to be able to access the contents of the key. This was easily possible in dataflow 1.x, but in the new beam sdk 2.0, I'm unsure how to do so. Any little code snippet/example would be extremely useful.
EDIT #1 (per Ben Chambers's request)
The real use case is hard to explain, but I'm going to try:
We have a 3d space composed of millions of little hills. We try to determine the apex of these millions of hills as follows: we create billions of "rectangular probes" for the whole 3d space, and then we ask each of these billions of probes to "move" in a greedy way to the apex. Once it hits the apex, it stops. The probe then returns the apex and itself. The apex is the KEY for which we'll do a custom combine by key.
Now, the custom combine function is going to finally return a final object (called a feature) which is derived from all the probes that reach the same apex (ie the same key). When generating this "feature" object, we need to know infomration about the final apex/key (ie the top of the hill). Hence, I need this key info.
One way to solve this is using a group by key, but that was slow (at least in df 1.x); we got it to be fast (in df 1.x) using a custom combine fn. So, we'd like the key. That said, groupbykey works in beam skd 2.0.
Alternatively, we could stick the "apex" information into the "probe" objects itself, but this means that each of our billions of probe objects now needs to be tripled in size just to hold this apex information (and this apex information repeats itself, since there are only say 1 million apexes but 1 billion probes, so this intuitively feels highly inefficient.)
Rather than relying on the CombineFn to compute the entire result, could you instead have the ComibneFn compute some partial result based only on information about the probes? Then your Combine.perKey(...) returns a PCollection<KV<Apex, InfoAboutProbes>> and you can use a ParDo to combine the information about the apex with the summary information about the probes. This allows you to use the CombineFn for efficiently combining information about many probes, while using a ParDo to access the key.

How can I emit summary data for each window even if a given window was empty?

It is really important for my application to always emit a "window finished" message, even if the window was empty. I cannot figure out how to do this. My initial idea was to output an int for each record processed and use Sum.integersGlobally and then emit a record based off that, giving me a singleton per window, I could then simply emit one summary record per window, with 0 if the window was empty. Of course, this fails, and you have to use withoutDefaults which will then emit nothing if the window was empty.
Cloud Dataflow is built around the notion of processing data that is likely to be highly sparse. By design, it does not conjure up data to fill in those gaps of sparseness, since this will be cost prohibitive for many cases. For a use case like yours where non-sparsity is practical (creating non-sparse results for a single global key), the workaround is to join your main PCollection with a heartbeat PCollection consisting of empty values. So for the example of Sum.integersGlobally, you would Flatten your main PCollection<Integer> with a secondary PCollection<Integer> that contains exactly one value of zero per window. This assumes you're using an enumerable type of window (e.g. FixedWindows or SlidingWindows; Sessions are by definition non-enumerable).
Currently, the only way to do this would be to write a data generator program that injects the necessary stream of zeroes into Pub/Sub with timestamps appropriate for the type of windows you will be using. If you write to the same Pub/Sub topic as your main input, you won't even need to add a Flatten to your code. The downside is that you have to run this as a separate job somewhere.
In the future (once our Custom Source API is available), we should be able to provide a PSource that accepts an enumerable WindowFn plus a default value and generates an appropriate unbounded PCollection.

Resources