Watching for new files matching a filepattern in Apache Beam - google-cloud-dataflow

I have a directory on GCS or another supported filesystem to which new files are being written by an external process.
I would like to write an Apache Beam streaming pipeline that continuously watches this directory for new files and reads and processes each new file as it arrives. Is this possible?

This is possible starting with Apache Beam 2.2.0. Several APIs support this use case:
If you're using TextIO or AvroIO, they support this explicitly via TextIO.read().watchForNewFiles() and the same on readAll(), for example:
PCollection<String> lines = p.apply(TextIO.read()
    .from("gs://path/to/files/*")
    .watchForNewFiles(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Never stop checking for new files
        Watch.Growth.<String>never()));
If you're using a different file format, you may use FileIO.match().continuously() and FileIO.matchAll().continuously() which support the same API, in combination with FileIO.readMatches().
The APIs support specifying how often to check for new files, and when to stop checking (supported conditions are e.g. "if no new output appears within a given time", "after observing N outputs", "after a given time since starting to check" and their combinations).
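For illustration, here is a minimal sketch (not from the original answer) of combining FileIO.match().continuously() with FileIO.readMatches(); the filepattern and polling interval are placeholders:
PCollection<FileIO.ReadableFile> files = p
    .apply(FileIO.match()
        .filepattern("gs://path/to/files/*")
        // Check for new files every 30 seconds and never stop watching
        .continuously(Duration.standardSeconds(30), Watch.Growth.never()))
    .apply(FileIO.readMatches());
// Each ReadableFile can then be parsed for your custom format in a ParDo,
// e.g. via file.readFullyAsBytes() or file.open().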
Note that this feature currently works only in the Direct runner and the Dataflow runner, and only in the Java SDK. In general, it will work in any runner that supports Splittable DoFn (see the capability matrix).

To add to Eugene's excellent answer and the watchForNewFiles option, there are a couple of other choices.
Several options are available to meet this requirement, depending on your latency needs. As of SDK 2.9.0:
Option 1: Continuous read mode:
Java:
FileIO, TextIO, and several other IO sources support continuous reading of the source for new files.
The FileIO class supports watching a single filepattern continuously.
This example matches a single filepattern repeatedly every 30 seconds, continuously returns new matched files as an unbounded PCollection and stops if no new files appear for 1 hour.
PCollection<Metadata> matches = p.apply(FileIO.match()
    .filepattern("...")
    .continuously(
        // Check for new files every 30 seconds
        Duration.standardSeconds(30),
        // Stop if no new files appear for 1 hour
        afterTimeSinceNewOutput(Duration.standardHours(1))));
The TextIO class supports streaming new-file matching using the watchForNewFiles property.
PCollection<String> lines = p.apply(TextIO.read()
    .from("/local/path/to/files/*")
    .watchForNewFiles(
        // Check for new files every minute
        Duration.standardMinutes(1),
        // Stop watching the filepattern if no new files appear within an hour
        afterTimeSinceNewOutput(Duration.standardHours(1))));
It is important to note that the file list is not retained across restarts of the pipeline. To deal with that scenario, you can move the files either through a process downstream of the pipeline or as part of the pipeline itself. Another option would be to store processed file names in an external file and de-dupe the lists at the next transform.
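As a hedged illustration of that last idea (not from the original answer), one could load the previously processed file names as a side input and filter them out of the matches collection from the example above; the bucket path and matching logic here are assumptions:
PCollectionView<List<String>> processedFiles = p
    .apply("ReadProcessedList", TextIO.read().from("gs://my-bucket/processed-files.txt"))
    .apply(View.asList());

PCollection<Metadata> newMatches = matches.apply("DropAlreadyProcessed",
    ParDo.of(new DoFn<Metadata, Metadata>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Emit only files whose resource id is not in the stored list
        if (!c.sideInput(processedFiles).contains(c.element().resourceId().toString())) {
          c.output(c.element());
        }
      }
    }).withSideInputs(processedFiles));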
Python:
The continuously option is not available for Python as of SDK 2.9.0.
Option 2: Stream processing triggered from external source
You can have a Beam pipeline running in streaming mode with an unbounded source, for example Pub/Sub. When new files arrive, a process external to Beam can detect the arrival and publish a Pub/Sub message whose payload is the file's URI. In a DoFn downstream of the Pub/Sub source, you can then use that URI to process the file.
Java :
Use an unbounded source IO (PubsubIO, KafkaIO, etc.); a rough sketch follows.
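A hedged sketch of this pattern in Java (the topic name is a placeholder, and the message payload is assumed to be the URI of the new file):
PCollection<String> lines = p
    .apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/new-files"))
    .apply("ReadEachFile", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(ProcessContext c) throws IOException {
        // The Pub/Sub message payload is the URI of the newly arrived file
        MatchResult.Metadata metadata = FileSystems.matchSingleFileSpec(c.element());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            Channels.newInputStream(FileSystems.open(metadata.resourceId())),
            StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            c.output(line);
          }
        }
      }
    }));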
Python:
Use an unbounded source IO (Pub/Sub IO, etc.).
Option 3: Batch mode processing triggered from external source
This approach introduces latency over Options 1 and 2, as the pipeline needs to start up before processing can begin. Here you can have a triggering event from your source file system schedule or immediately start a Dataflow job. This option is best suited to low-frequency updates of large files.

Related

Storage Event Trigger failed to trigger the function

I am working on a pipeline that copies data from ADLS Gen into an Azure Synapse Dedicated SQL Pool. I used Synapse pipelines and followed the Microsoft docs on how to create a storage event trigger. But when a new file is loaded into ADLS, I get the following error:
" 'The template language expression 'trigger().outputs.body.ContainerName' cannot be evaluated because property 'ContainerName' doesn't exist, available properties are 'RunToken'.
I have set the following pipeline parameters:
The pipeline successfully runs when I manually trigger it and pass the parameters. I would appreciate any solution or guidance to resolve this issue.
Thank you very much
I tried to set up the trigger for the Synapse pipeline and copy the new blob into the dedicated pool, but when I monitored the trigger runs, they failed.
I can trigger the pipeline manually.
According to the storage event trigger documentation:
The storage event trigger captures the folder path and file name of the blob into the properties #triggerBody().folderPath and #triggerBody().fileName.
It does not have a property called container name.
As per the data you provided, it seems the file is stored directly in your container. For this approach, you have to set the container name parameter to #trigger().outputs.body.folderPath, which will return the container name as the folder.
Now pass these pipeline parameters to the dataset properties dynamically.
The pipeline will then run successfully and copy data from ADLS to the Synapse dedicated pool.

save/load thingsboard configuration

Is it possible to somehow serialize the current ThingsBoard (let's call it TBoard) configuration, save it, and then later load the saved configuration on TBoard startup?
I am specifically interested in loading device profiles, rule chains, and dashboards.
I want to save the configuration together with my project in a git repository so that later I could just use docker-compose to start multiple services from the project (let's call them sensors) and a single TBoard instance with the saved configuration, which will be used for collecting telemetry from the sensors and drawing dashboards.
Another reason for saving the configuration: what happens if for some reason the TBoard container crashes or somehow gets corrupted so that it can't be started again? Would I have to click through everything again to recreate all device profiles and dashboards, configure rule chains, etc.?
Regarding this line
I am specifically interested in loading device profiles, rule chains, and dashboards. I want to save configuration together with my project in git repository
I have just recently implemented version control for my ThingsBoard deployment. The way I am doing it is with the Python REST client.
I have written functions to export all dashboards/data converters/integrations/rule chains/widgets into JSON files, which I save in a GitHub repository.
I have also written the reverse script to push the stored files to a fresh environment, essentially "flashing" it. Surprisingly, this works perfectly.
I have an idea to publish this as a package, but it's something I've never done before so I'm unsure if I will get to it.
Just letting you know that it is definitely possible to get source control operational via the API.

Question: BigQueryIO creates one file per input line, is it correct?

I'm new to Apache Beam and I'm developing a pipeline to get rows from JDBCIO and send them to BigQueryIO. I'm converting the rows to Avro files with withAvroFormatFunction, but it is creating a new file for each row returned by JDBCIO. The same happens with withFormatFunction and JSON files.
It is so slow to run locally with DirectRunner because it uploads a lot of files to Google Storage. Is this approach good for scaling on Google Dataflow? Is there a better way to deal with it?
Thanks
In BigQueryIO there is an option, withNumFileShards, which controls the number of files that get generated when using BigQuery load jobs.
From the documentation
Control how many file shards are written when using BigQuery load jobs. Applicable only when also setting withTriggeringFrequency(org.joda.time.Duration).
You can test your process by setting the value to 1 to see if only one large file gets created.
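For illustration, a hedged sketch of how these settings fit together for an unbounded (streaming) write using load jobs; the table name, triggering frequency, and dispositions are placeholders:
rows.apply(BigQueryIO.writeTableRows()
    .to("my-project:my_dataset.my_table")
    // File loads with a triggering frequency are required for withNumFileShards
    .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
    .withTriggeringFrequency(Duration.standardMinutes(5))
    .withNumFileShards(1)  // set to 1 to verify that a single file is produced per load
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));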
BigQueryIO will commit results to BigQuery for each bundle. The DirectRunner is known to be a bit inefficient about bundling. It never combines bundles. So whatever bundling is provided by a source is propagated to the sink. You can try using other runners such as Flink, Spark, or Dataflow. The in-process open source runners are about as easy to use as the direct runner. Just change --runner=DirectRunner to --runner=FlinkRunner and the default settings will run in local embedded mode.

Does apache beam in google cloud dataflow keep track of intermediate files in temp location?

In Dataflow you specify a temp location for data to be parallelized and then aggregated at the end. I am wondering if it keeps track of which temp files it needs to aggregate in a run. If the same bucket is specified for subsequent runs, and other temp files with different names are left over from previous runs, will it just lazily aggregate everything under the temp folder in the bucket, or only the specific temp files associated with the current run?
Only the ones associated with the current run, since Dataflow is fault-tolerant and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and it can have issues with manual side effects (for example, if your code relies upon or creates temporary files with non-unique names).
However, it is recommended to set an individual bucket (or temp path) for every job, as jobs based on templates could end up sharing the same directory, named after the timestamp of when the template was created, e.g.:
.temp-beam-2020-01-12_14-13-30-12/
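A hedged sketch of giving each run its own temp location (the bucket name and naming scheme are assumptions, not from the original answer):
DataflowPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
// Use a per-run suffix so that runs never share a temp directory
options.setTempLocation("gs://my-bucket/temp/run-" + System.currentTimeMillis());
Pipeline p = Pipeline.create(options);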

Is there any GCP Dataflow template for "Pub/Sub to Cloud Spanner"

I am trying to find out if there is any GCP Dataflow template available for data ingestion from "Pub/Sub to Cloud Spanner". I have found that there is already a default GCP Dataflow template for a similar case, "Cloud Pub/Sub to BigQuery".
So I am interested to see whether I can do data ingestion into Spanner in streaming or batch mode, and how that would behave.
There is a Dataflow template to import Avro files in batch mode that you can use by following these instructions. Unfortunately a Cloud Pub/Sub streaming template is not available yet. If you would like, you can file a feature request.
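In the absence of a streaming template, a hedged alternative is a small custom pipeline combining PubsubIO with SpannerIO; the subscription, instance, database, table, and payload format below are all assumptions:
p.apply(PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"))
 .apply("ToMutations", ParDo.of(new DoFn<String, Mutation>() {
   @ProcessElement
   public void process(ProcessContext c) {
     // Assume the payload is "id,value"; adapt the parsing to your actual format.
     String[] parts = c.element().split(",", 2);
     c.output(Mutation.newInsertOrUpdateBuilder("my_table")
         .set("id").to(parts[0])
         .set("value").to(parts[1])
         .build());
   }
 }))
 .apply(SpannerIO.write()
     .withInstanceId("my-instance")
     .withDatabaseId("my-database"));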
Actually, I tried something like using the "projects/pubsub-public-data/topics/taxirides-realtime" topic with the "gs://dataflow-templates/latest/Cloud_PubSub_to_Avro" template to load sample data files into my GCS bucket. Then I stopped that streaming job and created a batch job with the "gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Spanner" template. But the batch job failed with the error below:
java.io.FileNotFoundException: No files matched spec: gs://cardataavi/archive/spanner-export.json
at org.apache.beam.sdk.io.FileSystems.maybeAdjustEmptyMatchResult(FileSystems.java:166)
at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:153)
at org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn.process(FileIO.java:636)
It seems that, right now, the Spanner import supports only Avro data in a Spanner-specific export format. Is that understanding correct?
