Multiple file generation while writing to XML through Apache Beam - google-cloud-dataflow

I'm trying to write an XML file from a source text file stored in GCS. The code runs fine, but instead of a single XML file it generates multiple XML files (the number of XML files seems to follow the total number of records in the source text file). I have observed this when using 'DataflowRunner'.
When I run the same code locally, two files are generated: the first contains all the records with the proper elements, and the second contains only the opening and closing root element.
Any idea what causes this unexpected behaviour? Here is the code snippet I'm using:
PCollection<String> input_records = p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));
PCollection<XMLFormatter> input_object = input_records.apply(ParDo.of(new DoFn<String, XMLFormatter>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Split one comma-separated input line into the five XMLFormatter fields.
        String[] elements = c.element().split(",");
        c.output(new XMLFormatter(elements[0], elements[1], elements[2], elements[3], elements[4]));
        System.out.println("Values to be written have been provided to constructor ");
    }
})).setCoder(AvroCoder.of(XMLFormatter.class));
input_object.apply(XmlIO.<XMLFormatter>write()
    .withRecordClass(XMLFormatter.class)
    .withRootElement("library")
    .to("gs://balajee_test/book_output"));
Please let me know how to generate a single XML file (book_output.xml) as output.

XmlIO.write().to() is documented as follows:
/**
 * Writes to files with the given path prefix.
 *
 * <p>Output files will have the name {@literal {filenamePrefix}-0000i-of-0000n.xml} where n is
 * the number of output bundles.
 */
In other words, producing multiple files is expected: for example, if the runner parallelizes the processing of your data into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty, but the data written across all parts will always add up to the expected data.
Asking the IO to produce exactly one file is a reasonable request if your data is not particularly big. It is supported in TextIO and AvroIO via .withoutSharding(), but not yet supported in XmlIO. Please feel free to file a JIRA with the feature request.
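Until XmlIO supports unsharded output, one possible workaround is to merge the shards after the pipeline has finished. The following is a minimal sketch in plain Python (standard library only), not part of Beam. It assumes that every shard fits in memory and uses the "library" root element from the question; merge_xml_shards is a hypothetical helper name, and downloading the shard contents from GCS is left out.

```python
import xml.etree.ElementTree as ET


def merge_xml_shards(shard_texts, root_tag="library"):
    """Combine the records of several XML shards under a single root element."""
    merged = ET.Element(root_tag)
    for text in shard_texts:
        shard_root = ET.fromstring(text)
        if shard_root.tag != root_tag:
            raise ValueError("unexpected root element: %s" % shard_root.tag)
        # Move every record of this shard under the merged root.
        # Empty shards (root element only) contribute nothing.
        merged.extend(list(shard_root))
    return ET.tostring(merged, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical shard contents, including one empty shard.
    shards = [
        "<library><book><title>A</title></book></library>",
        "<library></library>",
        "<library><book><title>B</title></book></library>",
    ]
    print(merge_xml_shards(shards))
```

Because empty shards only contain the root element, they drop out of the merged result without special handling.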

Related

Setting multiple files as source input generated by a single task generator

A single task generator generates a number of source and header files. The number of generated files is not known at that time. How can I set these generated files as source input?
I used the code shown in the documentation, but this only describes the case a.a → a.b + a.c, but my case is a.a → a lot of files in directory a. Therefore I am not able to use:
b_node = node.change_ext('.b')
c_node = node.change_ext('.c')
self.create_task('idl', node, [b_node, c_node])
self.source.append(b_node)
The example is shown in the documentation here: https://waf.io/book/#_mixing_extensions_and_c_c_features
How can this unknown number of files be used as input for self.source.append(**what goes here?**)?
You should look at §11.4.2, "A compiler producing source files with names unknown in advance". The trick is to manage the dependencies by overriding the runnable_status() and run() methods.

Can I avoid hardcoding file locations in SPSS syntax?

I'm using SPSS 25 syntax to open and process a set of datafiles. I would like these syntax files to be as portable as possible. For that reason, I want the user to be able to select the file locations at runtime without having to recode the syntax itself.
I'm running Windows 10, although hopefully that doesn't matter. I do have the Python plugin for SPSS, although ideally this would be a base SPSS syntax solution.
In SPSS right now, I'm doing this:
GET
FILE='C:\Users\xkcd\studies\project\rawdata'+
'\reallyraw\veryraw.sav'
PASSWORD='CorrectHorseBatteryStaple'.
DATASET NAME Demo WINDOW=FRONT.
In R, I would do this:
message("Where is the veryraw.sav file?")
demo<-fread(file.choose())
Ideally, the user would, at runtime, select the individual files one at a time.
Less ideally, the user would select a folder that contains all of the files, which have known names.
I could use FILE HANDLE so that the user would only have to hardcode a few folder locations, but that's less than ideal - I really would rather that the user isn't editing the syntax at all.
Thanks in advance!
Following up on the idea of a fully automated process - the following code will work assuming there is a specific file name you need to run your code on, and only one copy exists in the folder you are searching. This is possible to run on drive C: directly, but will take much less time to run if you can narrow down the path:
* this will create a text file that has the path of the required file.
HOST COMMAND=['dir /s /b "C:\Users\somename\*required file name.sav" > C:\Users\somename\tempname.sps'].
* now to read the name and put it in a handle.
DATA LIST file = "C:\Users\somename\tempname.sps" fixed / pth 1-500 (a).
exe.
string cmd(a500).
compute cmd=concat("file handle myfile / name='", rtrim(pth), "'.").
write out="C:\Users\somename\tempname.sps" /cmd.
exe.
* inserting the new syntax will activate the handle.
insert file = "C:\Users\somename\tempname.sps".
Now you can use the handle myfile in the syntax, e.g:
get file=myfile.
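Since the Python plugin is available, the search and the handle definition could also be done in Python. The sketch below is only an illustration and has not been run inside SPSS: find_file and file_handle_syntax are hypothetical helper names, and the generated command would still have to be executed, for example by passing it to the plugin's spss.Submit function inside a BEGIN PROGRAM block.

```python
import os
import tempfile


def find_file(root, filename):
    """Return the full paths of every file called `filename` below `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


def file_handle_syntax(handle, path):
    """Build the same FILE HANDLE command that the syntax above assembles with CONCAT."""
    if "'" in path:
        raise ValueError("path contains a single quote and cannot be quoted safely")
    return "FILE HANDLE {} /NAME='{}'.".format(handle, path)


if __name__ == "__main__":
    # Demo on a throwaway directory instead of a real user folder.
    with tempfile.TemporaryDirectory() as root:
        nested = os.path.join(root, "rawdata", "reallyraw")
        os.makedirs(nested)
        open(os.path.join(nested, "veryraw.sav"), "w").close()
        found = find_file(root, "veryraw.sav")
        print(file_handle_syntax("myfile", found[0]))
```

As in the syntax-only version, this assumes that exactly one copy of the file exists below the search root; find_file returns every match so that the caller can check.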

Read a pickle from another pipeline in Beam?

I'm running batch pipelines in Google Cloud Dataflow. I need to read objects in one pipeline that another pipeline has previously written. The easiest way to serialize the objects is pickle / dill.
The writing works well, writing a number of files, each with a pickled object. When I download the file manually, I can unpickle the file. Code for writing: beam.io.WriteToText('gs://{}', coder=coders.DillCoder())
But the reading breaks every time, with one of the errors below. Code for reading: beam.io.ReadFromText('gs://{}*', coder=coders.DillCoder())
Either...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\x90'
...or...
File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
return StockUnpickler.find_class(self, module, name)
File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named measur
(the class of the object sits in a path with measure, though not sure why it misses the last character there)
I've tried using the default coder, and a BytesCoder, and pickling & unpickling as a custom task in the pipeline.
My working hypothesis is that the reader splits the file by line, and so treats a single pickle (which has newlines within it) as multiple objects. If so, is there a way of avoiding that?
I could attempt to build a reader myself, but I'm hesitant since this seems like a well-solved problem (e.g. Beam already has a format to move objects from one pipeline stage to another).
Tangentially related: How to read blob (pickle) files from GCS in a Google Cloud DataFlow job?
Thank you!
ReadFromText is designed to read newline-separated records in text files and hence is not suitable for your use case. Implementing FileBasedSource is not a good solution either, since it's designed for reading large files with multiple records (and usually splits these files into shards for parallel processing). So, in your case, the current best solution for the Python SDK is to implement a source yourself. This can be as simple as a ParDo that reads files and produces a PCollection of records. If your ParDo produces a large number of records, consider adding an apache_beam.transforms.util.Reshuffle step after it, which will allow runners to better parallelize the following steps. For the Java SDK we have FileIO, which already provides transforms that make this easier.
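The function such a ParDo would run can be sketched as follows. This is a minimal sketch using only the standard library, and it assumes that each file holds exactly one object written with pickle.dump; read_pickled_file and open_local are hypothetical names. In a pipeline, the opener would presumably wrap apache_beam.io.filesystems.FileSystems.open so that gs:// paths resolve, and the function would be applied to a PCollection of filenames with beam.Map; neither is exercised here.

```python
import os
import pickle
import tempfile


def open_local(path):
    """Open a local file for binary reading (stand-in for a GCS-aware opener)."""
    return open(path, "rb")


def read_pickled_file(path, opener=open_local):
    """Load the single pickled object stored in `path`, ignoring line boundaries."""
    with opener(path) as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Write two objects that contain newlines, one object per file, then read them back.
    objects = [{"text": "line 1\nline 2"}, ["a\nb", 3]]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for index, obj in enumerate(objects):
            path = os.path.join(tmp, "part-%d.pkl" % index)
            with open(path, "wb") as handle:
                pickle.dump(obj, handle)
            paths.append(path)
        print([read_pickled_file(p) for p in paths])
```

Reading the file as a whole sidesteps the line splitting entirely, at the cost of one file per object.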
Encoding as string_escape escapes the newlines, so the only newlines that Beam sees are those between pickles:
# Imports assumed from the Beam Python SDK version in use at the time.
from apache_beam.coders import coder_impl
from apache_beam.coders.coders import DillCoder, maybe_dill_dumps, maybe_dill_loads


class DillMultiCoder(DillCoder):
    """
    Coder that allows multi-line pickles to be read

    After an object is pickled, the bytes are escape-encoded (`string_escape`
    here, `unicode_escape` on Python 3), meaning newline characters (`\n`)
    aren't in the string.

    Previously, the presence of newline characters confused the Dataflow
    reader, as it couldn't discriminate between a new object and a new line
    within a pickle string
    """
    def _create_impl(self):
        return coder_impl.CallbackCoderImpl(
            maybe_dill_multi_dumps, maybe_dill_multi_loads)


def maybe_dill_multi_dumps(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_dumps(o).encode('string_escape')


def maybe_dill_multi_loads(o):
    # in Py3 this needs to be `unicode_escape`
    return maybe_dill_loads(o.decode('string_escape'))
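The effect of the escaping can be checked outside Beam. The sketch below uses only the standard library on Python 3, where string_escape no longer exists; the latin-1 / unicode_escape round trip is one way to get the same effect on bytes, and escape_pickle and unescape_pickle are hypothetical names, not Beam API.

```python
import pickle


def escape_pickle(obj):
    """Pickle `obj` and escape the bytes so that no raw newline remains."""
    raw = pickle.dumps(obj)
    # latin-1 maps every byte to exactly one code point; unicode_escape then
    # rewrites newlines, backslashes and non-ASCII code points as ASCII escapes.
    return raw.decode("latin-1").encode("unicode_escape")


def unescape_pickle(line):
    """Invert escape_pickle and return the original object."""
    raw = line.decode("unicode_escape").encode("latin-1")
    return pickle.loads(raw)


if __name__ == "__main__":
    obj = {"text": "first line\nsecond line", "values": [1, 2, 3]}
    line = escape_pickle(obj)
    # The escaped form can safely be written as one line of a text file.
    print(b"\n" in pickle.dumps(obj), b"\n" in line)
    print(unescape_pickle(line) == obj)
```

Since the escaped record never contains a raw newline, a line-oriented reader sees exactly one line per pickled object.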
For large pickles, I also needed to set the buffer size much higher, to 8MB. With the previous buffer size (8kB), a 120MB file spun for 2 days of CPU time:
# Imports assumed from the Beam Python SDK version in use at the time.
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.textio import ReadFromText, _TextSource


class ReadFromTextPickle(ReadFromText):
    """
    Same as ReadFromText, but with a really big buffer. With the standard 8KB
    buffer, large files can be read on a loop and never finish

    Also added DillMultiCoder
    """
    def __init__(
            self,
            file_pattern=None,
            min_bundle_size=0,
            compression_type=CompressionTypes.AUTO,
            strip_trailing_newlines=True,
            coder=DillMultiCoder(),
            validate=True,
            skip_header_lines=0,
            **kwargs):
        # needs commenting out, not sure why
        # super(ReadFromTextPickle, self).__init__(**kwargs)
        self._source = _TextSource(
            file_pattern,
            min_bundle_size,
            compression_type,
            strip_trailing_newlines=strip_trailing_newlines,
            coder=coder,
            validate=validate,
            skip_header_lines=skip_header_lines,
            buffer_size=8000000)
Another approach would be to implement a PickleFileSource that inherits from FileBasedSource and calls pickle.load on the file, where each call would yield a new object. But there are a number of complications around offset_range_tracker that looked like more work than strictly necessary.

HDFS Flume sink - Roll by File

Is it possible for HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after certain bytes (hdfs.rollSize), time (hdfs.rollInterval), or events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Regarding your first question, it is not possible because the sink's logic is disconnected from the source's logic. A sink only sees the events put into the channel that it must process; it does not know whether an event is the first or the last one of a file.
Of course, you could create your own source (or extend an existing one) that adds a header to the event with a value meaning "this is the last event". A custom sink could then behave according to that header: for instance, while the header is not set, events are not persisted but kept in memory; once the header is seen, all the information is persisted to the final backend as a batch. Another possibility is that the custom sink writes the data to a file until the header is seen; then the file is closed and another one is opened.
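The second variant (write until the marker header is seen, then roll) can be illustrated with a small simulation. This is plain Python, not Flume code: events are modelled as (headers, body) pairs, and the "last" header name and the RollingSink class are hypothetical.

```python
class RollingSink:
    """Collects event bodies into one output 'file' per input file."""

    def __init__(self):
        self.closed_files = []  # finished outputs, one list of bodies each
        self._current = []      # bodies of the output currently being written

    def process(self, headers, body):
        self._current.append(body)
        # Roll only when the source marked this event as the last of its file.
        if headers.get("last") == "true":
            self.closed_files.append(self._current)
            self._current = []


if __name__ == "__main__":
    # Two input files: the first yields two events, the second yields one.
    events = [
        ({}, "a-line-1"),
        ({"last": "true"}, "a-line-2"),
        ({"last": "true"}, "b-line-1"),
    ]
    sink = RollingSink()
    for headers, body in events:
        sink.process(headers, body)
    print(sink.closed_files)
```

The roll decision depends only on the header, so byte, time, and count thresholds play no part in it.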
Regarding your second question, it depends on the sink. The spooldir source behaves according to its deserializer parameter; by default its value is LINE, which means:
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserializer for the whole file.
You can set rollSize to a small number, combined with BlobDeserializer, to load the data file by file instead of combining files into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is the relevant part of the configuration:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
You only have to set the following property to 1:
hdfs.rollCount = 1
I'm still looking for a solution to your first question, because sometimes the file is too big and needs to be split into several chunks.
You can use any event headers in hdfs.path. ( https://flume.apache.org/FlumeUserGuide.html#hdfs-sink )
If you are using Spooling Directory Source, you can enable putting the file name in the events using fileHeaderKey or basenameHeaderKey ( https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source ).
Can Flume be configured so that a single file is a single event?
It could be; however, it is not recommended. The underlying implementation (protobuf) limits file sizes to 64m. Flume events are meant to be small because of Flume's architecture and design (fault tolerance, etc.).

How to process a GCS filepattern, full file at a time?

I need to process a (GCS) bucket of files, where each file is compressed and contains a single multi-line JSON record. Also, the name of the file being processed is significant and I need to know it within my transform.
Starting with the examples in the docs, TextIO looks pretty close, but it looks like it's designed to process each file line by line and does not allow me to read the entire file at once. Also, I don't see any way to get the name of the file that's being processed.
PCollectionTuple results = p.apply(TextIO.Read
.from("gs://bucket/a/*.gz")
.withCompressionType(TextIO.CompressionType.GZIP)
.withCoder(MyJsonCoder.of()))
Looks like I need to write a custom IO reader, or some such? Any tips for best place to start?
You are correct that right now none of the existing classes do exactly what you want. There are 2 reasonable approaches:
1. Match the filepattern yourself (using IOChannelUtils and IOChannelFactory) and wrap the resulting files into a PCollection<String> where the String will be a filename, using Create.of(filenames). Then apply a ParDo with a function which reads the given filename.
2. Write your own subclass of Source (there's also FileBasedSource, but it's not quite right for your use case). It would be configured by the filepattern, and splitIntoBundles would match the filepattern and expand into individual sources each corresponding to one file.
I would recommend the first approach because it seems like less code and your use case does not require the full power of Source.
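The shape of the first approach can be sketched locally in Python, with the standard library standing in for the Dataflow classes. This is an illustration under assumptions, not SDK code: glob.glob plays the role of the filepattern match, read_whole_json_gz is a hypothetical function that would become the body of the ParDo, and real GCS access is not covered.

```python
import glob
import gzip
import json
import os
import tempfile


def read_whole_json_gz(path):
    """Read one gzip-compressed file completely and parse it as a single JSON record."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        record = json.load(handle)
    # Return the filename alongside the record, since the name is significant.
    return path, record


def process_pattern(pattern):
    """Match the pattern first, then read each matched file as a whole."""
    filenames = sorted(glob.glob(pattern))
    return [read_whole_json_gz(name) for name in filenames]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # One multi-line JSON record per compressed file.
        for name, value in (("a.gz", 1), ("b.gz", 2)):
            with gzip.open(os.path.join(tmp, name), "wt", encoding="utf-8") as handle:
                handle.write(json.dumps({"value": value}, indent=2))
        for filename, record in process_pattern(os.path.join(tmp, "*.gz")):
            print(os.path.basename(filename), record)
```

Because the filename travels through the pipeline as the element, the transform always knows which file it is processing.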

Resources