I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).
Can I use TextIO? Can it be used in a streaming pipeline when the filename is determined during execution (as opposed to using TextIO as a source, where the filename(s) are known at construction)? If not, I'm thinking of doing something like the following:
Get the topic from pub/sub
ParDo to read each file and get the lines
Process the lines of the file...
Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
You can use the TextIO.readAll() transform, which has been recently added to Beam in #3443. For example:
PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());
This will read all lines in each file arriving over pubsub.
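If you then need the parsed events rather than raw lines (as in the original question), you can follow readAll() with a simple mapping step. A minimal sketch, where Event and parseEvent() are placeholders for your own event type and parsing logic, not part of TextIO:
PCollection<Event> events = lines.apply(
    MapElements.into(TypeDescriptor.of(Event.class))
        .via((String line) -> parseEvent(line)));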
Related
Use Case: at Dataflow job start-up we provide an initial file name to read data from; after that, the job should watch that directory for new files and treat all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
With this approach the job treats the old files as new files and reads everything already in that folder.
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
This reads only that particular file and cannot pick up new files as they arrive.
Can anyone suggest an approach that achieves this use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
    .apply(FileIO.match().filepattern("gs://folder-Name/*")
        .continuously(Duration.standardSeconds(30),
            Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))))
    .setCoder(MetadataCoderV2.of())
    .apply(Filter.by((MatchResult.Metadata metadata) ->
        metadata.lastModifiedMillis() > PIPELINE_START))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection<String> allLines =
    PCollectionList.of(lines)
        .and(p.apply(TextIO.read().from("gs://folder-Name/file-name")))
        .apply(Flatten.pCollections());
Maybe you need to clarify your use case: do you provide a file name to read, or a file pattern? How many files do you expect? Do you really need a Dataflow streaming pipeline, or would a Cloud Function answer your need? And what exactly is the issue: that files get read again when you restart the pipeline?
You can, as suggested by danielm, use FileIO to fetch and filter on file metadata in order to know which files were added after the pipeline began.
If you provide a file pattern, every matching file will be read once by the pipeline. There is no way to keep state between pipelines unless you code it yourself, so when you restart the pipeline it will read all files matching the pattern again.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider consuming GCS notifications on file creation with PubsubIO, and use these events to know which files to process in your pipeline.
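A hedged sketch of that notification-driven approach (the topic name is a placeholder, and the attribute names come from the standard GCS Pub/Sub notification format rather than from the question):
// GCS Pub/Sub notifications carry the bucket and object name as attributes,
// so only files created after the subscription exists are ever seen.
PCollection<String> newFiles = p
    .apply(PubsubIO.readMessagesWithAttributes()
        .fromTopic("projects/my-project/topics/gcs-object-notifications"))
    .apply(Filter.by((PubsubMessage msg) ->
        "OBJECT_FINALIZE".equals(msg.getAttribute("eventType"))))
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((PubsubMessage msg) ->
            "gs://" + msg.getAttribute("bucketId") + "/" + msg.getAttribute("objectId")));
The resulting filenames can then be fed to FileIO.matchAll() + FileIO.readMatches() + TextIO.readFiles(), or to TextIO.readAll().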
A good practice, though, is to have multiple folders that reflect the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You put the files to process in the input folder, and inside your pipeline you move each file to the folder that matches its state.
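As a rough illustration of that layout (bucket and folder names are made up, and this is a sketch rather than a drop-in solution), a DoFn can move a file between state folders with Beam's FileSystems utility:
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

// Moves a file from the "processing" folder to the "succeed" folder once the
// pipeline is done with it. The paths are placeholders.
class MoveToSucceedFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    String fileName = c.element();  // e.g. "events-2020-01-01.xml"
    ResourceId src = FileSystems.matchNewResource(
        "gs://my-bucket/processing/" + fileName, false /* isDirectory */);
    ResourceId dest = FileSystems.matchNewResource(
        "gs://my-bucket/succeed/" + fileName, false /* isDirectory */);
    FileSystems.rename(
        Collections.singletonList(src), Collections.singletonList(dest));
    c.output(fileName);
  }
}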
I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to serialize these objects is pickle / dill.
The writing works well, writing a number of files, each with a pickled object. When I download the file manually, I can unpickle the file. Code for writing: beam.io.WriteToText('gs://{}', coder=coders.DillCoder())
But the reading breaks every time, with one of the errors below. Code for reading: beam.io.ReadFromText('gs://{}*', coder=coders.DillCoder())
Either...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\x90'
...or...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named measur
(the class of the object sits in a path containing measure, though I'm not sure why the last character is dropped there)
I've tried using the default coder, and a BytesCoder, and pickling & unpickling as a custom task in the pipeline.
My working hypothesis is that the reader is splitting the file by line, and so treating a single pickle (which has newlines within it) as multiple objects. If so, is there a way of avoiding that?
I could attempt to build a reader myself, but I'm hesitant since this seems like a well-solved problem (e.g. Beam already has a format to move objects from one pipeline stage to another).
Tangentially related: How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
Thank you!
ReadFromText is designed to read newline-separated records in text files, hence it is not suitable for your use case. Implementing FileBasedSource is not a good solution either, since it is designed for reading large files with multiple records (and usually splits those files into shards for parallel processing). So, in your case, the current best solution for the Python SDK is to implement a source yourself. This can be as simple as a ParDo that reads files and produces a PCollection of records. If your ParDo produces a large number of records, consider adding an apache_beam.transforms.util.Reshuffle step after it, which will allow runners to parallelize the following steps better. For the Java SDK we have FileIO, which already provides transforms that make this a bit easier.
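For reference, a hedged Java sketch of that FileIO route (the bucket and pattern are placeholders): FileIO matches the files, and each ReadableFile can be read in full as a single element, which is the whole-file behaviour the Python side currently has to implement by hand.
PCollection<KV<String, byte[]>> fileBytes = p
    .apply(FileIO.match().filepattern("gs://my-bucket/pickles/*"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, byte[]>>() {
      @ProcessElement
      public void process(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        // one element per file: the filename plus the file's full contents
        c.output(KV.of(file.getMetadata().resourceId().toString(),
                       file.readFullyAsBytes()));
      }
    }));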
Encoding as string_escape escapes the newlines, so the only newlines that Beam sees are those between pickles:
class DillMultiCoder(DillCoder):
    """
    Coder that allows multi-line pickles to be read.

    After an object is pickled, the bytes are encoded with `string_escape`
    (`unicode_escape` on Py3), so newline characters (`\n`) aren't left in
    the string.

    Previously, those newline characters confused the Dataflow reader, as it
    couldn't discriminate between a new object and a newline within a pickle
    string.
    """

    def _create_impl(self):
        return coder_impl.CallbackCoderImpl(
            maybe_dill_multi_dumps, maybe_dill_multi_loads)


def maybe_dill_multi_dumps(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_dumps(o).encode('string_escape')


def maybe_dill_multi_loads(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_loads(o.decode('string_escape'))
For large pickles, I also needed to set the buffer size much higher, to 8MB. With the previous buffer size (8kB), a 120MB file spun for 2 days of CPU time:
class ReadFromTextPickle(ReadFromText):
    """
    Same as ReadFromText, but with a really big buffer. With the standard 8KB
    buffer, large files can be read on a loop and never finish.

    Also added DillMultiCoder.
    """

    def __init__(
            self,
            file_pattern=None,
            min_bundle_size=0,
            compression_type=CompressionTypes.AUTO,
            strip_trailing_newlines=True,
            coder=DillMultiCoder(),
            validate=True,
            skip_header_lines=0,
            **kwargs):
        # needs commenting out, not sure why
        # super(ReadFromTextPickle, self).__init__(**kwargs)
        self._source = _TextSource(
            file_pattern,
            min_bundle_size,
            compression_type,
            strip_trailing_newlines=strip_trailing_newlines,
            coder=coder,
            validate=validate,
            skip_header_lines=skip_header_lines,
            buffer_size=8000000)
Another approach would be to implement a PickleFileSource inherited from FileBasedSource and call pickle.load on the file; each call would yield a new object. But there's a bunch of complication around offset_range_tracker that looked like more lift than strictly necessary.
Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this post pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
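For example, the post-pipeline copy could look something like this with the google-cloud-storage client library (bucket and object names are placeholders; this is just one way to do the move):
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

// after waitUntilFinish() returns, move the processed input file
Storage storage = StorageOptions.getDefaultInstance().getService();
BlobId source = BlobId.of("input-bucket", "data/input.txt");
BlobId target = BlobId.of("processed-bucket", "data/input.txt");
storage.get(source).copyTo(target).getResult();  // copy to the other bucket
storage.delete(source);                          // then remove the original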
A little trick I got from reading the source code of apache beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (Java) or beam.CombineGlobally() (Python). For more info check out section 4.2.4.3 here.
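A rough sketch of the trick (the variable names and the moveProcessedFile() helper are illustrative, not taken from PassThroughThenCleanup):
PCollection<String> lines = p.apply(TextIO.read().from("gs://input-bucket/file.txt"));

// The side input only becomes available once the read has produced every
// element, so the DoFn below effectively runs after the reader is done.
final PCollectionView<Iterable<String>> doneSignal = lines.apply(View.asIterable());

p.apply(Create.of("cleanup"))
    .apply(ParDo.of(new DoFn<String, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        c.sideInput(doneSignal);  // forces waiting on the completed read
        // hypothetical helper that moves the input file to the other bucket
        moveProcessedFile("gs://input-bucket/file.txt", "gs://done-bucket/file.txt");
      }
    }).withSideInputs(doneSignal));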
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket).
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here, and the SDK for GCS in JS here. In this option you basically set up a trigger for when something drops into a certain bucket, and move it to another bucket with your self-written Cloud Function.
I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with the examples in the docs, TextIO looks pretty close, but it appears to be designed to process each file line by line and does not let me read an entire file at once. Also, I don't see any way to get the filename that's being processed.
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.
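A sketch of the first approach against the current Beam API, using FileSystems where the answer mentions IOChannelUtils; the file list, the GZIP handling and the output shape are assumptions for illustration:
List<String> filenames = Arrays.asList("gs://bucket/a/one.gz", "gs://bucket/a/two.gz");

// each output element is (filename, whole decompressed file contents)
PCollection<KV<String, String>> records = p
    .apply(Create.of(filenames))
    .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
      @ProcessElement
      public void process(ProcessContext c) throws IOException {
        String filename = c.element();
        ResourceId resource = FileSystems.matchSingleFileSpec(filename).resourceId();
        try (InputStream in = new GZIPInputStream(
                Channels.newInputStream(FileSystems.open(resource)));
            ByteArrayOutputStream out = new ByteArrayOutputStream()) {
          byte[] buffer = new byte[8192];
          int n;
          while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
          }
          c.output(KV.of(filename, out.toString("UTF-8")));
        }
      }
    }));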
I am new to Flume. My Flume agent has an HTTP server as its source, from which it gets zip files (compressed XML files) at regular intervals. These zip files are very small (less than 10 MB), and I want to extract them and put the contents into the HDFS sink. Please share some ideas on how to do this. Do I have to go for a custom interceptor?
Flume will try to read your files line by line, unless you configure a specific deserializer. A deserializer lets you control how the file is parsed and split into events. You could of course follow the example of the blob deserializer, which is designed for PDFs and such, but I understand that you actually want to unpack the files and then read them line by line. In that case you would need to write a custom deserializer that reads the zip and emits one event per line, as sketched below.
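Not a complete deserializer, just a hedged sketch of its core: walk the ZIP entries and turn every line of every entry into a Flume event (EventBuilder is Flume's standard helper; the mark/reset handling and the Builder that wires this into the source are left out):
// 'in' is the input stream handed to the deserializer; 'events' collects
// one event per line of each zipped XML file
List<Event> events = new ArrayList<>();
try (ZipInputStream zip = new ZipInputStream(in)) {
  ZipEntry entry;
  while ((entry = zip.getNextEntry()) != null) {
    // ZipInputStream reports end-of-entry as end-of-stream, so a fresh
    // reader per entry reads exactly that entry's lines
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
    String line;
    while ((line = reader.readLine()) != null) {
      events.add(EventBuilder.withBody(line, StandardCharsets.UTF_8));
    }
  }
}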
Here's the reference in the documentation:
https://flume.apache.org/FlumeUserGuide.html#event-deserializers