Read a pickle from another pipeline in Beam? - google-cloud-dataflow

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest wa objects is pickle / dill.
The writing works well, writing a number of files, each with a pickled object. When I download the file manually, I can unpickle the file. Code for writing: beam.io.WriteToText('gs://{}', coder=coders.DillCoder())
But the reading breaks every time, with one of the errors below. Code for reading: beam.io.ReadFromText('gs://{}*', coder=coders.DillCoder())
Either...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\x90'
...or...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named measur
(the class of the object sits in a path with measure, though not sure why it misses the last character there)
I've tried using the default coder, and a BytesCoder, and pickling & unpickling as a custom task in the pipeline.
My working hypothesis is the reader splitting the file by line, and so treating a single pickle (which has new lines within it) as multiple objects. If so, is there a way of avoiding that?
I could attempt to build a reader myself, but I'm hesitant since this seems like a well-solved problem (e.g. Beam already has a format to move objects from one pipeline stage to another).
Tangentially related: How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
Thank you!

ReadFromText is designed to read new line separated records in text files hence is not suitable for your use-case. Implementing FileBasedSource is not a good solution either since it's designed for reading large files with multiple records (and usually splits these files into shards for parallel processing). So, in your case, the current best solution for Python SDK is to implement a source yourself. This can be as simple as a ParDo that reads files and produces a PCollection of records. If your ParDo produce a large number of records consider adding a apache_beam.transforms.util.Reshuffle step following that which will allow runners to parallelize following steps better. For Java SDK we have FileIO which already provides transforms to make this bit easier.

Encoding as string_escape escapes the newlines, so the only newlines that Beam sees are those between pickles:
class DillMultiCoder(DillCoder):
"""
Coder that allows multi-line pickles to be read
After an object is pickled, the bytes are encoded as `unicode_escape`,
meaning newline characters (`\n`) aren't in the string.
Previously, the presence of newline characters these confues the Dataflow
reader, as it can't discriminate between a new object and a new line
within a pickle string
"""
def _create_impl(self):
return coder_impl.CallbackCoderImpl(
maybe_dill_multi_dumps, maybe_dill_multi_loads)
def maybe_dill_multi_dumps(o):
# in Py3 this needs to be `unicode_escape`
return maybe_dill_dumps(o).encode('string_escape')
def maybe_dill_multi_loads(o):
# in Py3 this needs to be `unicode_escape`
return maybe_dill_loads(o.decode('string_escape'))
For large pickles, I also needed to set the buffersize much higher to 8MB - on the previous buffer size (8kB), a 120MB file spun for 2 days of CPU time:
class ReadFromTextPickle(ReadFromText):
"""
Same as ReadFromText, but with a really big buffer. With the standard 8KB
buffer, large files can be read on a loop and never finish
Also added DillMultiCoder
"""
def __init__(
self,
file_pattern=None,
min_bundle_size=0,
compression_type=CompressionTypes.AUTO,
strip_trailing_newlines=True,
coder=DillMultiCoder(),
validate=True,
skip_header_lines=0,
**kwargs):
# needs commenting out, not sure why
# super(ReadFromTextPickle, self).__init__(**kwargs)
self._source = _TextSource(
file_pattern,
min_bundle_size,
compression_type,
strip_trailing_newlines=strip_trailing_newlines,
coder=coder,
validate=validate,
skip_header_lines=skip_header_lines,
buffer_size=8000000)
Another approach would be to implement a PickleFileSource inherited from FileBasedSource and call pickle.load on the file - each call would yield a new object. But there's a bunch of complication around offset_range_tracker that looked like more lift than strictly necessary

Related

is there any way I can avoid reading old files from old folder with Apache Beam's TextIo watchForNewFiles(Duration, condition)?

Use Case: During dataflow job start up we should provide initial file name to read data and later on it should watch for new files in that directory and it should consider all remaining old files as already read.
Issues:
Approach 1:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/*").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
If we are using like this its considering old files as new files for this dataflow job and reading all those files in that folder
Approach 2:
PCollection<String> readfile = pipeline.apply(TextIO.read().from("gs://folder-Name/file-name").
watchForNewFiles(Duration.standardSeconds(10),
Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(30))));
Its reading only this particular file and not able to read upcoming new files
can anyone please suggest the approach to achieve my use case?
The watchForNewFiles() function will always read all files matching the filepattern, both existing and new. In your second approach, the file pattern is only one file, so you just get that.
However, you can use the lower-level building block transforms in FileIO to accomplish what you need. The following code will just read files written after the pipeline starts:
PCollection<String> lines = p
.apply(FileIO.match().filepattern("gs://folder-Name/*")
.continuously(Duration.standardSeconds(30), afterTimeSinceNewOutput(Duration.standardHours(1)))
.setCoder(MetadataCoderV2.of())
.apply(Filter.by(metadata -> metadata.lastModifiedMillis() > PIPELINE_START))
.apply(FileIO.readMatches())
.apply(apply(TextIO.readFiles()))
You can change the details of the Filter transform to whatever precise condition you need. To also include specific older files, you can read those with a standard TextIO.read().from(...) and then use Flatten to combine that PCollection with the continuous set. Like this:
PCollection allLines =
PCollectionList.of(lines).and(p.apply(TextIO.read().from("gs://folder-Name/file-name)
.apply(Flatten.pCollections())
Maybe you need to clarify your Use Case, do you provide a file name to read ? or a file pattern ? What is the number of files expected ? Should you really use a Dataflow streaming pipeline ? Doesn't a Cloud Function answer your need ? What is your issue ? Files get read again when you restart your pipeline ?
You can, as suggested by danielm use FileIO to fetch and filter on file metadata in order to know which file was added after the pipeline began.
If you provide a file pattern, then all file will be read once by the pipeline. There's no way to keep a State between pipelines if you not code it yourself, so when you restart the pipeline you will read again all the file matching the pattern.
If you want to avoid that, you can manually move old files to another path between stopping the old pipeline and starting a new one.
You could also consider is consuming GCS notification on file creation with PubsubIO and use this event to know which file to treat in your pipeline.
A good practice though is to have multiple folders that reflects the status of the files:
input
processing
failed
succeed
This way you know the state of each file. You can put files to treat in the input folder, and inside your pipeline move the file to its corresponding state folder.

Multiple file generation while writing to XML through Apache Beam

I'm trying to write an XML file where the source is a text file stored in GCS. The code is running fine but instead of a single XML file, it is generating multiple XML files. (No. of XML files seem to follow total no. of records present in source text file). I have observed this scenario while using 'DataflowRunner'.
When I run the same code in local then two files get generated. First one contains all the records with proper elements and the second one contains only opening and closing root element.
Any idea about the occurrence of this unexpected behaviour? please find below the code snippet I'm using :
PCollection<String>input_records=p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));
PCollection<XMLFormatter> input_object= input_records.apply(ParDo.of(new DoFn<String,XMLFormatter>(){
#ProcessElement
public void processElement(ProcessContext c)
{
String elements[]=c.element().toString().split(",");
c.output(new XMLFormatter(elements[0],elements[1],elements[2],elements[3],elements[4]));
System.out.println("Values to be written have been provided to constructor ");
}
})).setCoder(AvroCoder.of(XMLFormatter.class));
input_object.apply(XmlIO.<XMLFormatter>write()
.withRecordClass(XMLFormatter.class)
.withRootElement("library")
.to("gs://balajee_test/book_output"));
Please let me know the way to generate a single XML file(book_output.xml) at output.
XmlIO.write().to() is documented as follows:
/**
* Writes to files with the given path prefix.
*
* <p>Output files will have the name {#literal {filenamePrefix}-0000i-of-0000n.xml} where n is
* the number of output bundles.
*/
I.e. it is expected that it may produce multiple files: e.g. if the runner chooses to process your data parallelizing it into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty in some cases, but the total data written will always add up to the expected data.
Asking the IO to produce exactly one file is a reasonable request if your data is not particularly big. It is supported in TextIO and AvroIO via .withoutSharding(), but not yet supported in XmlIO. Please feel free to file a JIRA with the feature request.

Read files from a PCollection of GCS filenames in Pipeline?

I have a streaming pipeline hooked up to pub/sub that publishes filenames of GCS files. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).
Can I use TextIO? Can you use it in a streaming pipeline when the filename is defined during execution (as opposed to using TextIO as a source and the fileName(s) are known at construction). If not I'm thinking of doing something like the following:
Get the topic from pub/sub
ParDo to read each file and get the lines
Process the lines of the file...
Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
You can use the TextIO.readAll() transform, which has been recently added to Beam in #3443. For example:
PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());
This will read all lines in each file arriving over pubsub.

HDFS Flume sink - Roll by File

Is it possible for HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after certain bytes (hdfs.rollSize), time (hdfs.rollInterval), or events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Reagarding your first question, it is not possible due to the sinks logic is disconnected from the sources logic. I mean, a sink only sees events being put into the channel which must be processed by him; the sink does not know if an event is the first or the last regarding a file.
Of course, you could try to create your own source (or extend an existing one) in order to add a header to the event with a value meaning "this is the last event". Then, another custom sink could behave depending on such a header: for instance, if the header is not set, then the events are not persisted but stored in memory until the header is seen; then all the information is persisted in the final backend as a bach. Other possibility is that custom sink persists the data in a file until the header is seen; then the file is closed and another one is opened.
Regarding your second question, it depends on the sink. The spooldir source behaves based on the deserializer parameter; by default its value is LINE, what means:
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserialized for the whole file.
You can set rollsize to a small number combined with BlobDeserializer to load file by file instead of combining into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is part of the configuration that is relevant:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
Yo only have to configure the following property to be 1:
hdfs.rollCount = 1
I'm looking for a solution for your first question, because sometimes the file is too big and it's needed to split the file in several chunks.
You can use any event headers in hdfs.path. ( https://flume.apache.org/FlumeUserGuide.html#hdfs-sink )
If you are using Spooling Directory Source, you can enable putting the file name in the events using fileHeaderKey or basenameHeaderKey ( https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source ).
Can Flume be configured so that a single file is a single event?
It could be, however it is not recommended. The underlying implementation (protobuf) limits file sizes to 64m. Flume events are to be small in size due to its architecture and design. (Fault-tolerance, etc.)

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with examples in the docs, TextIO looks pretty close, but it looks like its designed to process each file line-by-line and does not allow me to read the entire file at once. Also, I don't see any way to get the filename that's being processed?
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.

Resources