Read multiple files at runtime (dataflow template)


I am trying to build a Dataflow template.
The goal is to read a ValueProvider that tells me which files to read.
Then, for each file, I need to read it and enrich the data with the object.
I have tried this:
p.apply(Create.of(options.getScheduleBatch()))
    .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
    .apply(ParDo.of(new DoFn<FileReceived, PCollection<EventRow>>() {
        @ProcessElement
        public void process(ProcessContext c) {
            FileReceived fileReceived = c.element();
            Broker broker = configuration.getBroker(fileReceived.getBrokerId());
            PCollection<EventRow> eventRows = p
                .apply(TextIO.read().from(fileReceived.getUri()))
                .apply(ParDo.of(StringToEventRowFn.of(broker, fileReceived, options.getJobName())));
            c.output(eventRows);
        }
    }));
But I have the following error:
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.beam.sdk.values.PCollection.
I would love to find a better way than reading the files myself using the GCS client.
Do you have any tips?
Best regards

The problem:
You're trying to emit a PCollection as an output of your ParDo. This doesn't work.
Details:
PCollection is an abstraction that represents a potentially unbounded collection of elements. Applying a transformation to a PCollection gives you another PCollection. One of the transformations you can apply is a ParDo. ParDos make element-wise transforms. When applying a ParDo you're expressing - "take this PCollection and make another one by converting all elements within it by applying that ParDo".
One of the things that makes the processing efficient is the ability to execute everything in parallel, e.g. converting many elements at once on multiple execution nodes (VMs/machines) by running the same ParDo on each node against different elements. You can't explicitly control whether a specific transform will happen on the same execution node or on another one; how to optimize this is part of the underlying system design. To enable this, it must be possible to pass elements between execution nodes and to persist them for aggregation. Beam supports this by requiring Coders for elements. A Coder is a serialization mechanism that lets Beam convert an element (represented by a Java object) to a byte array, which can then be passed to the next transformation (possibly running on another machine) or to storage. For example, Beam needs to be able to encode the elements that you output from a ParDo. Beam knows how to serialize some types, but it cannot infer everything automatically, so you have to provide coders for types that cannot be inferred.
Your example looks like this: take some PCollection and convert it into another PCollection by applying a ParDo to each element, where that ParDo transforms each input element into a PCollection. This means that as soon as an element has been processed by the ParDo, its output has to be encoded and passed to the next transformation. And the question here is - how do you encode and pass a (potentially unbounded) PCollection to the next transform or persist it for aggregation?
Beam doesn't support this at the moment, so you will need to choose another approach.
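To make the encoding problem concrete, here is a minimal plain-Python sketch. It does not use Beam's Coder API: JsonCoder and endless_rows are hypothetical names invented for illustration. An ordinary element survives a round trip through bytes, while a lazily produced, potentially endless collection has no byte representation to hand to the next step.

```python
import json


class JsonCoder:
    """Hypothetical stand-in for a coder: element -> bytes -> element."""

    def encode(self, element):
        return json.dumps(element, sort_keys=True).encode("utf-8")

    def decode(self, data):
        return json.loads(data.decode("utf-8"))


def endless_rows():
    """Stand-in for a potentially unbounded collection of elements."""
    n = 0
    while True:
        yield {"n": n}
        n += 1


coder = JsonCoder()

# A plain element can be shipped to another worker as bytes and rebuilt there.
row = {"broker": "b1", "amount": 42}
round_tripped = coder.decode(coder.encode(row))

# A lazy, unbounded collection cannot be encoded as a single element.
try:
    coder.encode(endless_rows())
    collection_was_encoded = True
except TypeError:
    collection_was_encoded = False
```

The same reasoning applies to a PCollection used as a ParDo output: there is no finite value to serialize.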
In your specific case, I am not sure whether Beam can, out of the box, take a stream of filenames and then convert them into sub-pipelines for processing the lines in the files.
Workarounds:
A few approaches I can think of to work around this limitation:
If your file names have a known pattern, you can specify the pattern in TextIO and it can read the new files as they arrive.
If they don't have a known pattern, you can potentially write another pipeline that renames the files so that they share a common name pattern, and then use that pattern in TextIO in a second pipeline.
If feasible (e.g. the files fit in memory), you could read the file contents with the plain Java File API, split them into rows and emit those rows from a single ParDo. Then you can apply the same StringToEventRowFn in the following ParDo.
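The shape of the last workaround can be sketched in plain Python (not Beam code, and using Python rather than the Java File API). read_rows plays the role of the single ParDo that turns one file name into many rows, and to_event_row is a hypothetical stand-in for StringToEventRowFn; the temporary files exist only to make the sketch runnable.

```python
import os
import tempfile


def read_rows(path):
    """Element-wise step: one file path in, many text rows out."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def to_event_row(row, source_path):
    """Hypothetical stand-in for StringToEventRowFn: enrich each row."""
    return {"source": os.path.basename(source_path), "payload": row}


with tempfile.TemporaryDirectory() as tmp:
    # Assumed sample inputs; in the real job the paths come from the options.
    paths = []
    for name, body in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        paths.append(path)

    # First step emits rows per file, the following step enriches them.
    event_rows = [to_event_row(row, p) for p in paths for row in read_rows(p)]
```

The key point is that the step outputs plain rows, which can be encoded, instead of a nested collection.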
Hope this helps

Related

Can I modify elements within an apache beam transform?

The Apache Beam programming guide contains the following rule:
3.2.2. Immutability
A PCollection is immutable. Once created, you cannot add, remove, or
change individual elements. A Beam Transform might process each
element of a PCollection and generate new pipeline data (as a new
PCollection), but it does not consume or modify the original input
collection.
Does this mean I cannot, must not, or should not modify individual elements in a custom transform?
Specifically, I am using the python SDK and considering the case of a transform that takes a dict {key: "data"} as input, does some processing and adds further fields {other_key: "some more data"}.
My interpretation of rule 3.2.2 above is that I should do something like
def process(self, element):
    import copy
    output = copy.deepcopy(element)
    output[other_key] = some_data
    yield output
but I am wondering if this may be a bit overkill.
Using a TestPipeline, I found that the elements of the input collection are also modified if I act on them in the process() method (unless the elements are basic types such as int, float, bool...).
Is mutating elements considered an absolute no-go, or just a practice one has to be careful with ?

Mutating elements is an absolute no-go, and it can and will lead to violations of the Beam model semantics, i.e. to incorrect and unpredictable results. The Beam Java direct runner intentionally detects mutations and fails pipelines that do that - this is not yet implemented in the Python runner, but it should be.
The reason for that is, primarily, fusion. E.g. imagine that two DoFns are applied to the same PCollection "C" (f(C) and g(C) - not f(g(C))), and a runner schedules them to run in the same shard. Imagine the first DoFn modifies the element; then, by the time the second DoFn runs, the element has been changed - i.e. the second DoFn is not really being applied to "C". There are a number of other scenarios where mutations will lead to incorrect results.
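The fusion scenario can be imitated in plain Python without a runner. In this sketch (all function names are made up for illustration), f and g are both applied to the same element object, the way two fused DoFns on one shard might be. When f mutates in place, g no longer sees the original element; when f copies first, it does.

```python
import copy


def f_in_place(element):
    """f: adds a field by mutating its input (the problematic version)."""
    element["other_key"] = "some more data"
    return element


def f_with_copy(element):
    """f: adds a field on a deep copy, leaving the input untouched."""
    output = copy.deepcopy(element)
    output["other_key"] = "some more data"
    return output


def g(element):
    """g: a second, independent consumer of the same input element."""
    return sorted(element)


# f(C) and g(C) run back to back on the very same object.
shared = {"key": "data"}
f_in_place(shared)
keys_seen_after_mutation = g(shared)

shared = {"key": "data"}
f_with_copy(shared)
keys_seen_after_copy = g(shared)
```

Here keys_seen_after_mutation contains other_key even though g was meant to run on the original element, while keys_seen_after_copy contains only key.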

Indexing a PCollection in DataFlow

I've built a PCollection in Cloud Dataflow that I will write to disk as it is. I would like to build another collection that references items in the first collection by their index. e.g.
PC1:
strings go here
some other string here
more strings
PC2:
0,1
1,1
0,2
I'm unsure how to get the indices in PC1 without writing the whole pipeline and starting another, and even then I'm not sure how to keep record of the line/record number being read. Is it safe to simply use a static variable? I would assume not based on the generally parallel nature of the platform.

PCollections are inherently unordered, so there's no such thing as "index of an item in a collection" - however, you can include the line number in the element itself: have PC1 be a PCollection<KV<Integer, String>> where the Integer is the line number - basically read lines from a text file paired with their line number.
We currently don't provide a built-in source that does this - your best bet would be to write a simple DoFn<String, KV<Integer, String>> that takes the filename as input and uses IOChannelFactory to open the file and read it line by line and emit the contents with line numbers to produce PC1.
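As a rough plain-Python sketch of that idea (not Dataflow code; number_lines is a hypothetical helper and an in-memory StringIO stands in for the opened file), the whole file is read sequentially inside one call, so the counter is local and no static variable is needed:

```python
import io


def number_lines(handle):
    """Read a file-like object line by line, pairing each line with its number."""
    for line_number, line in enumerate(handle):
        yield line_number, line.rstrip("\n")


# Stand-in for the opened input file that produces PC1.
pc1_source = io.StringIO("strings go here\nsome other string here\nmore strings\n")
pc1 = list(number_lines(pc1_source))

# Later steps can refer to items by the number carried in the element itself.
index_of = {text: number for number, text in pc1}
```

Because the number travels with each element, it stays valid however the elements are later distributed or reordered.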

How to walk the whole parse tree and print its content with slight changes in ANTLR4?

So as stated in the title, my task is to traverse the Parse Tree generated for code written in Java (grammar is a standard Java grammar), print most of it unchanged and modify only some words, for example type declarations.
My current approach was to create ParseTreeListener and implement the logic in the enterEveryRule method, but unfortunately it doesn't appear to work even for basic printing. The output is very messy and there are a lot of repetitions, as if every node was visited multiple times.
My other attempt was to implement the appropriate methods in BaseListener to make the changes to the type declarations I need, but from there I see no way to print the rest of the code unchanged.
Looking forward to your help!

You could use ANTLR's string templates to produce code from the ASTs.
In general, you start with a set of "standard" string templates that can regenerate source code corresponding to the underlying tree.
To get the effect you want, you judiciously choose the standard string templates on AST nodes where you don't want changes, and variant templates where you do want changes.
IMHO, it is better to modify the AST, and then simply apply the standard templates.
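The "modify the tree, then regenerate with the standard printer" approach can be illustrated by analogy with Python's own ast module rather than ANTLR and string templates, so treat it only as a sketch of the idea. RetypeInts is a hypothetical transformer, and ast.unparse (Python 3.9+) plays the role of the standard templates.

```python
import ast


class RetypeInts(ast.NodeTransformer):
    """Change `int` annotations to `float`; leave every other node untouched."""

    def visit_AnnAssign(self, node):
        self.generic_visit(node)
        annotation = node.annotation
        if isinstance(annotation, ast.Name) and annotation.id == "int":
            node.annotation = ast.Name(id="float", ctx=ast.Load())
        return node


source = "x: int = 1\ny = x + 2\n"

# Step 1: modify only the nodes of interest.
tree = RetypeInts().visit(ast.parse(source))
ast.fix_missing_locations(tree)

# Step 2: regenerate all the code with the unmodified, standard printer.
regenerated = ast.unparse(tree)
```

Only the type declaration changes; the second statement is regenerated as it was because its nodes were never touched.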

Source code logic evaluation

I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can force the user to supply a legitimate function by analyzing the function signature, making sure that the argument and the return value are arrays of integers. But I have no idea how to determine that the algorithm's logic is done the right way. The input code could sort values correctly, but not with the aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.

In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has a number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with a single instance variable A ("the array") that holds the to-be-sorted data,
a key type K ("array index") used to access the container that has a partial order, such that container[K] is of type S,
a comparison of two members of the container, using key A and key B such that A < B according to the key partial order, that determines if container[B] > container[A],
a swap operation on container[A], container[B] and some variable T of type S, that is conditionally dependent on the comparison,
a loop wrapped around the container that enumerates keys according to the partial order on K.
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
parse the source code and build an abstract syntax tree (AST)
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you can check that the various recognized bits occur in the appropriate order
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc procedural code to climb over the AST, ST, CFG and DFG to "recognize" each of the individual parts. This is likely to be pretty messy, as each recognizer will be checking these structures for evidence of its bit. But you can do it.
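To give a feel for such ad hoc recognizers, here is a deliberately small sketch that uses Python's ast module on Python source instead of Java. It is purely syntactic: there are no symbol tables and no control or data flow graphs, so it only gathers evidence for two of the parts (a nested loop, and a comparison-guarded three-statement swap of the compared items). All function names are made up for illustration, and ast.unparse requires Python 3.9+.

```python
import ast

LOOPS = (ast.For, ast.While)


def text(node):
    """Source text of a node, used as a cheap structural fingerprint."""
    return ast.unparse(node)


def is_swap(stmts):
    """True for three assignments shaped like  t = x; x = y; y = t."""
    if len(stmts) != 3:
        return False
    if not all(isinstance(s, ast.Assign) and len(s.targets) == 1 for s in stmts):
        return False
    (t1, v1), (t2, v2), (t3, v3) = [(text(s.targets[0]), text(s.value)) for s in stmts]
    return v1 == t2 and v2 == t3 and v3 == t1


def guarded_swap(if_node):
    """True if `if x > y` (or <) guards a swap of exactly x and y."""
    test = if_node.test
    if not (isinstance(test, ast.Compare) and len(test.ops) == 1
            and isinstance(test.ops[0], (ast.Gt, ast.Lt))):
        return False
    compared = {text(test.left), text(test.comparators[0])}
    body = if_node.body
    for start in range(len(body) - 2):
        window = body[start:start + 3]
        if is_swap(window):
            swapped = {text(window[1].targets[0]), text(window[2].targets[0])}
            if swapped == compared:
                return True
    return False


def bubble_sort_evidence(source):
    """Collect syntactic evidence; finding it all is a hint, not a proof."""
    evidence = {"nested_loop": False, "guarded_swap_in_nested_loop": False}
    for outer in ast.walk(ast.parse(source)):
        if not isinstance(outer, LOOPS):
            continue
        for inner in ast.walk(outer):
            if inner is outer or not isinstance(inner, LOOPS):
                continue
            evidence["nested_loop"] = True
            if any(isinstance(n, ast.If) and guarded_swap(n) for n in ast.walk(inner)):
                evidence["guarded_swap_in_nested_loop"] = True
    return evidence


CANDIDATE = """
def bubble_sort(a):
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                t = a[j]
                a[j] = a[j + 1]
                a[j + 1] = t
    return a
"""

LIBRARY_SORT = """
def bubble_sort(a):
    return sorted(a)
"""

candidate_evidence = bubble_sort_evidence(CANDIDATE)
library_evidence = bubble_sort_evidence(LIBRARY_SORT)
```

Even this toy version shows why the real thing gets messy: it compares source text instead of tracking values, so a renamed temporary or a swap written another way would already escape it.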
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a dataflow pattern matching language, inspired by Rich and Waters' 1980s "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
  " \T = \v1;
    \v1 = \v2;
    \v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
  " if (\v1 > \v2)
      \swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
  = " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
  = " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
  " \k1 = \smallestK\(\);
    while (\k1<\size\(container\)) {
      \k2 = \next\(k1);
      while (\k2 <= \size\(container\) {
        \conditionalswap\(\container_access\(\container\,\k1\),
                          \container_access\(\container\,\k2\) \)
      }
    }
  ";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions and a request to "match bubble_sort", DMS will visit the DFG (tied to the CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of the patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the pattern takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical on their peculiar Dowtran language (which meant building parsers, etc. as above for Dowtran). We have a version of this prototyped for C; the data flow analysis is harder.

Why build an AST walker instead of having the nodes responsible for their own output?

Given an AST, what would be the reason behind making a Walker class that walks over the tree and does the output, as opposed to giving each Node class a compile() method and having it responsible for its own output?
Here are some examples:
Doctrine 2 (an ORM) uses a SQLWalker to walk over an AST and generate SQL from nodes.
Twig (a templating language) has the nodes output their own code (this is an if statement node).

Using a separate Walker for code generation avoids combinatorial explosion in the number of AST node classes as the number of target representations increases. When a Walker is responsible for code generation, you can retarget it to a different representation just by altering the Walker class. But when the AST nodes themselves are responsible for compilation, you need a different version of each node for each separate target representation.
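A minimal sketch of that retargeting argument, in Python rather than PHP and with made-up class names (it is not Doctrine's or Twig's actual design): the node classes carry no output logic, and each target representation is a separate walker over the same tree.

```python
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Add:
    left: object
    right: object


class Walker:
    """Dispatches on the node's class name; nodes know nothing about output."""

    def walk(self, node):
        return getattr(self, "visit_" + type(node).__name__)(node)


class InfixWalker(Walker):
    def visit_Num(self, node):
        return str(node.value)

    def visit_Add(self, node):
        return "(" + self.walk(node.left) + " + " + self.walk(node.right) + ")"


class PostfixWalker(Walker):
    def visit_Num(self, node):
        return str(node.value)

    def visit_Add(self, node):
        return self.walk(node.left) + " " + self.walk(node.right) + " +"


tree = Add(Num(1), Add(Num(2), Num(3)))
infix = InfixWalker().walk(tree)      # "(1 + (2 + 3))"
postfix = PostfixWalker().walk(tree)  # "1 2 3 + +"
```

Adding a third target means adding one more walker class; Num and Add stay as they are. With a compile() method on each node, every node class would need one method (or one subclass) per target.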

Mostly because of old literature and available tools. Experimenting with both methods, you can easily find that AST traversal produces very slow and convoluted code. Moreover, code separated from the immediate syntax doesn't resemble it anymore. It's very much like supporting two synchronized code bases, which is always a bad idea. Debugging and maintenance become difficult.
Of course, it can also be difficult to process semantics on the nodes unless you have a well-designed state machine. In fact, you are never worse off than having to traverse the AST after the fact, because that is just one particular case of processing semantics on nodes.
You can often hear that AST traversal allows for the implementation of multiple semantics for the same syntax. In reality you would never want that, not only because it's rarely needed, but also for performance reasons. And frankly, there is no difficulty in writing a separate syntax for a different semantics. The results are always better when both are designed together.
And finally, in every non-trivial task, getting the syntax parsed is the easiest part; getting the semantics correct and processing actions fast is the challenge. Focusing on the AST is approaching the task backwards.

To support a feature that the "internal AST walker" doesn't have.
For example, there are several ways to traverse a "hierarchical" or "tree" structure, like "walk through the leaves first" or "walk through the branches first".
Or the sibling nodes may have a sort index, and you may want to "walk" / "visit" them in decreasing index order instead of increasing order.
If the AST class or structure you have only supports one method, you may want to use another method via your custom "walker" / "visitor".
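A small Python sketch of that point, with a hypothetical Node class and walker functions: the tree structure stays the same, and each external walker chooses its own visiting order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def branches_first(node):
    """Pre-order: visit a node before its children."""
    yield node.label
    for child in node.children:
        yield from branches_first(child)


def leaves_first(node):
    """Post-order: visit the children before the node itself."""
    for child in node.children:
        yield from leaves_first(child)
    yield node.label


def branches_first_reversed(node):
    """Pre-order, but siblings are visited from the highest index down."""
    yield node.label
    for child in reversed(node.children):
        yield from branches_first_reversed(child)


tree = Node("root", [Node("a", [Node("a1"), Node("a2")]), Node("b")])

pre_order = list(branches_first(tree))
post_order = list(leaves_first(tree))
reversed_order = list(branches_first_reversed(tree))
```

None of the three orders required any change to Node, which is the flexibility a separate walker buys.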
