Flume HDFS sink partition by event time

I use Flume to receive access logs from Kafka and write them to HDFS directories partitioned by time. The "time" here, as far as I understand, is "processing time", and I want to partition by "event time".
For instance, logs like the following should be partitioned into dt=20170627, even if they are received late, on 20170628:
0.123 [2017-06-27 09:08:00] GET /
0.234 [2017-06-27 09:08:01] GET /
I know the Hive sink can partition by a field extracted from records, but it requires the records to be delimited or in JSON format.
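One direction I'm considering is a custom interceptor that parses the bracketed event time out of each log line and puts it into the "timestamp" header, since the HDFS sink's time-based path escapes (%Y, %m, %d) are driven by that header when hdfs.useLocalTimeStamp is not set. A rough sketch (class name and regex are just illustrative):

import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Parses the bracketed event time from each access-log line and stores it in the
// "timestamp" header, which the HDFS sink uses for its time-based path escapes.
public class EventTimeInterceptor implements Interceptor {

    // Matches e.g. "0.123 [2017-06-27 09:08:00] GET /"
    private static final Pattern EVENT_TIME =
            Pattern.compile("\\[(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})\\]");

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        Matcher matcher = EVENT_TIME.matcher(body);
        if (matcher.find()) {
            try {
                // Parsed in the JVM's default time zone; adjust if the logs are not local time.
                long millis = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                        .parse(matcher.group(1)).getTime();
                event.getHeaders().put("timestamp", Long.toString(millis));
            } catch (ParseException ignored) {
                // Leave the event untouched; it falls back to processing-time behaviour.
            }
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new EventTimeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}

The sink path could then be something like hdfs.path = /access-logs/dt=%Y%m%d (path illustrative). Is this the right approach, or is there a built-in way, for example the regex extractor interceptor with a millis serializer, that I should use instead?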
Thanks.

Related

Can input and output topics be the same for a Pulsar Function?

Is it an anti-pattern to have the same input and output topic when using Pulsar functions?
In my case, I have been using just one topic, where I have a Cassandra sink consuming the messages. I was thinking of creating a function that reads messages from this topic and sends the transformed messages back to the same one. The sink will be able to sink into Cassandra only the processed messages, because they will respect the schema.
Is this a bad practice?
I would not recommend it. You'll need to filter the transformed messages in your function, otherwise you get an infinite loop. The sink will also have to filter out the raw messages. These filters are a waste of resources.
It would be far better to have distinct topics for the raw and the transformed messages. Is there something preventing you from doing that?
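For reference, the function code itself is topic-agnostic; the raw and transformed topics are wired up at deploy time (for example with pulsar-admin functions create --inputs raw-topic --output transformed-topic, topic names illustrative), so keeping them distinct costs nothing in code. A minimal sketch:

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Reads from the input topic(s) configured at deploy time and publishes the return
// value to the configured output topic, which the Cassandra sink then consumes.
public class TransformFunction implements Function<String, String> {
    @Override
    public String process(String rawMessage, Context context) {
        // Illustrative transformation: normalize the record so it matches the sink schema.
        return rawMessage.trim().toLowerCase();
    }
}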

Apache BEAM pipeline IllegalArgumentException - Timestamp skew?

I have an existing Beam pipeline that handles the data ingested (from a Google Pub/Sub topic) via two routes. The 'hot' path does some basic transformation and stores the data in Datastore, while the 'cold' path performs fixed hourly windowing for deeper analysis before storage.
So far the pipeline had been running fine, until I started to do some local buffering on the data before publishing to Pub/Sub (so data arriving at Pub/Sub may be a few hours 'late'). The error that gets thrown is as below:
java.lang.IllegalArgumentException: Cannot output with timestamp 2018-06-19T14:00:56.862Z. Output timestamps must be no earlier than the timestamp of the current input (2018-06-19T14:01:01.862Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.checkTimestamp(SimpleDoFnRunner.java:463)
at org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.outputWithTimestamp(SimpleDoFnRunner.java:429)
at org.apache.beam.sdk.transforms.WithTimestamps$AddTimestampsDoFn.processElement(WithTimestamps.java:138)
It seems to be referencing the section of my code (the WithTimestamps call) that performs the hourly windowing, shown below:
Window<KV<String, Data>> window = Window.<KV<String, Data>>into(
        FixedWindows.of(Duration.standardHours(1)))
    .triggering(Repeatedly.forever(pastEndOfWindow()))
    .withAllowedLateness(Duration.standardSeconds(10))
    .discardingFiredPanes();

PCollection<KV<String, List<Data>>> keyToDataList = eData
    .apply("Add Event Timestamp", WithTimestamps.of(new EventTimestampFunction()))
    .apply("Windowing", window)
    .apply("Group by Key", GroupByKey.create())
    .apply("Sort by date", ParDo.of(new SortDataFn()));
I'm not sure I understand exactly what I've done wrong here. Is the error thrown because the data is arriving late? As I understand it, if data arrives later than the allowed lateness it should be discarded, not cause an error like the one I'm seeing.
I'm wondering whether setting an unlimited timestamp skew would resolve this. The data that's late can be exempt from analysis; I just need to ensure that no errors are thrown that would choke the pipeline. There's also nowhere else where I'm adding or changing the timestamps for the data, so I'm not sure why the errors are thrown.
It looks like your DoFn is using "outputWithTimestamp" and you are trying to set a timestamp which is older than the input element's timestamp. Typically, timestamps of output elements are derived from inputs; this is important to ensure the correctness of the watermark computation.
You may be able to work around this by increasing both the timestamp skew and the windowing allowed lateness; however, some data may be lost, and it is for you to determine whether such loss is acceptable in your scenario.
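If you do go the skew route, a rough sketch of that workaround applied to the snippet above (the Duration values are illustrative, and withAllowedTimestampSkew is deprecated precisely because skewed elements can still end up late downstream and be dropped):

import org.apache.beam.sdk.transforms.WithTimestamps;
import org.joda.time.Duration;

// Allow output timestamps to be earlier than the input element's timestamp,
// and give the windowing more allowed lateness so the skewed elements are kept.
PCollection<KV<String, Data>> timestamped = eData
    .apply("Add Event Timestamp",
        WithTimestamps.of(new EventTimestampFunction())
            .withAllowedTimestampSkew(Duration.standardHours(6)))
    .apply("Windowing", window.withAllowedLateness(Duration.standardHours(6)));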
Another alternative is not to use outputWithTimestamp, and instead use the Pub/Sub message timestamp to process each message. Then, output each element as a KV, where the RealTimestamp is computed in the same way you are currently computing the timestamp (just don't use it in "WithTimestamps"), GroupByKey, and write the KVs to Datastore.
Other questions you can ask yourself are:
Why are the input elements associated with a more recent timestamp than the output elements?
Do you really need to buffer that much data before publishing to Pub/Sub?

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB in size (10 GB uncompressed), parse them, and write them into a date-partitioned table in BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
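The Repartition itself is just the usual "attach a random key, GroupByKey, drop the key again" fusion break. A rough sketch of that pattern (written here against the Beam 2.x API; class name and key count are illustrative, not my exact code):

import java.util.concurrent.ThreadLocalRandom;

import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Breaks fusion after the non-splittable gzip read: attach a random key,
// group by it (forcing a shuffle/materialization), then drop the key again.
public class RandomKeyReshuffle<T> extends PTransform<PCollection<T>, PCollection<T>> {

  @Override
  public PCollection<T> expand(PCollection<T> input) {
    return input
        .apply("Attach random key", ParDo.of(new DoFn<T, KV<Integer, T>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of(ThreadLocalRandom.current().nextInt(1024), c.element()));
          }
        }))
        .setCoder(KvCoder.of(VarIntCoder.of(), input.getCoder()))
        .apply("Group by random key", GroupByKey.<Integer, T>create())
        .apply("Drop key", ParDo.of(new DoFn<KV<Integer, Iterable<T>>, T>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            for (T element : c.element().getValue()) {
              c.output(element);
            }
          }
        }))
        .setCoder(input.getCoder());
  }
}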
The pipeline works for a small number of smaller files, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode, and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thus enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" the framework; is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done, the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down, from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-times end up at 0. The picture shows the job slightly before it runs "backwards".
In the worker logs I see some "exception thrown while executing request" messages coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175-day (= 25-week) chunks, running one pipeline after the other, so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and that the startDate is moved forward at the same pace as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
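For illustration, the per-day routing can be expressed like this in Beam 2.0, where rows stands for the PCollection<TableRow> produced by the formatting step above (table naming and the day-suffix scheme are illustrative; partition decorators such as sometable$20170627 work along the same lines):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;

rows.apply("Write to BigQuery",
    BigQueryIO.writeTableRows()
        .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
          @Override
          public TableDestination apply(ValueInSingleWindow<TableRow> value) {
            // One output table per day, derived from the element's timestamp;
            // adjust the time zone to match your data.
            String day = DateTimeFormat.forPattern("yyyyMMdd")
                .withZoneUTC()
                .print(value.getTimestamp());
            return new TableDestination("someproject:somedataset.sometable_" + day, null);
          }
        })
        .withSchema(TableRowConverter.getSchema())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));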
It may be worthwhile to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run it. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" section.
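If you set the options programmatically rather than on the command line, a minimal sketch against the Beam 2.x Dataflow runner (the concrete values are illustrative):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Programmatic equivalent of --numWorkers / --maxNumWorkers / --workerMachineType.
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);
options.setNumWorkers(20);         // workers to start with
options.setMaxNumWorkers(100);     // ceiling for autoscaling
options.setWorkerMachineType("n1-highmem-8");
Pipeline p = Pipeline.create(options);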

Control rate of individual topic consumption in Kafka Streams 0.9.1.0-cp1?

I am trying to backprocess data in Kafka topics using a Kafka Streams application that involves a join. One of the streams to be joined has a much larger volume of data per unit of time in the corresponding topic. I would like to control the consumption from the individual topics so that I get roughly the same event timestamps from each topic in a single consumer.poll(). However, there doesn't appear to be any way to control the behavior of the KafkaConsumer backing the source stream. Is there any way around this? Any insight would be appreciated.
Currently Kafka cannot control the rate limit on either Producers or Consumers.
Refer to:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
But if you are using Apache Spark as the stream processing platform, you can limit the input rate for the Kafka receivers.
On the consumer side, you can use the consume([num_messages=1][, timeout=-1]) function instead of poll().
consume([num_messages=1][, timeout=-1]):
Consumes a list of messages (possibly empty on timeout). Callbacks may be executed as a side effect of calling this method.
The application must check the returned Message object’s Message.error() method to distinguish between proper messages (error() returns None) and errors for each Message in the list (see error().code() for specifics). If the enable.partition.eof configuration property is set to True, partition EOF events will also be exposed as Messages with error().code() set to _PARTITION_EOF.
num_messages (int) – The maximum number of messages to return (default: 1).
timeout (float) – The maximum time to block waiting for message, event or callback (default: infinite (-1)). (Seconds)

Output sorted text file from Google Cloud Dataflow

I have a PCollection<String> in Google Cloud DataFlow and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines of each shard of output are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, then you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
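As a rough illustration of the "sort each shard" idea (the shard key choice and types are illustrative; the per-shard file writing itself still needs a custom Sink along the lines of TextSink, since TextIO does not preserve element order):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Key each line by a coarse shard key, group, then sort each shard in memory.
PCollection<KV<String, List<String>>> sortedShards = lines
    .apply("Key by shard", ParDo.of(new DoFn<String, KV<String, String>>() {
      @Override
      public void processElement(ProcessContext c) {
        String line = c.element();
        String shardKey = line.isEmpty() ? "" : line.substring(0, 1); // illustrative
        c.output(KV.of(shardKey, line));
      }
    }))
    .apply("Group by shard", GroupByKey.<String, String>create())
    .apply("Sort within shard", ParDo.of(
        new DoFn<KV<String, Iterable<String>>, KV<String, List<String>>>() {
      @Override
      public void processElement(ProcessContext c) {
        List<String> sorted = new ArrayList<>();
        for (String line : c.element().getValue()) {
          sorted.add(line);
        }
        Collections.sort(sorted);
        c.output(KV.of(c.element().getKey(), sorted));
      }
    }));
// A custom Sink would then write one sorted file per shard key.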
