Can I modify elements within an Apache Beam transform?

The Apache Beam programming guide contains the following rule:
3.2.2. Immutability
A PCollection is immutable. Once created, you cannot add, remove, or
change individual elements. A Beam Transform might process each
element of a PCollection and generate new pipeline data (as a new
PCollection), but it does not consume or modify the original input
collection.
Does this mean I cannot, must not, or should not modify individual elements in a custom transform?
Specifically, I am using the python SDK and considering the case of a transform that takes a dict {key: "data"} as input, does some processing and adds further fields {other_key: "some more data"}.
My interpretation of rule 3.2.2 above is that I should do something like
def process(self, element):
    import copy
    output = copy.deepcopy(element)
    output[other_key] = some_data
    yield output
but I am wondering if this may be a bit overkill.
Using a TestPipeline, I found that the elements of the input collection are also modified if I act on them in the process() method (unless the elements are basic types such as int, float, bool...).
Is mutating elements considered an absolute no-go, or just a practice one has to be careful with?

Mutating elements is an absolute no-go: it can and will lead to violations of the Beam model semantics, i.e. to incorrect and unpredictable results. The Beam Java direct runner intentionally detects mutations and fails pipelines that perform them; this is not yet implemented in the Python runner, but it should be.
The primary reason is fusion. E.g. imagine that two DoFns are applied to the same PCollection "C" (f(C) and g(C), not f(g(C))), and a runner schedules them to run in the same shard. If the first DoFn modifies the element, then by the time the second DoFn runs, the element has been changed, i.e. the second DoFn is not really being applied to "C". There are a number of other scenarios where mutations will lead to incorrect results.
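To make the f(C) / g(C) scenario concrete, here is a minimal Java sketch (the MyRecord class and its methods are made up for illustration; the same reasoning applies to a Python dict):

PCollection<MyRecord> c = ...;

// f(C): mutates the input element. If the runner fuses f and g into the same shard,
// g may observe the mutated object instead of the original element of C.
PCollection<MyRecord> f = c.apply("f", ParDo.of(new DoFn<MyRecord, MyRecord>() {
    @ProcessElement
    public void process(ProcessContext ctx) {
        MyRecord r = ctx.element();
        r.setCount(r.getCount() + 1);            // mutation of the input element: not allowed
        ctx.output(r);
    }
}));

// g(C): works on a copy, so the original element of C is left untouched.
PCollection<MyRecord> g = c.apply("g", ParDo.of(new DoFn<MyRecord, MyRecord>() {
    @ProcessElement
    public void process(ProcessContext ctx) {
        MyRecord copy = ctx.element().copyOf();  // hypothetical deep-copy method
        copy.setCount(copy.getCount() + 1);
        ctx.output(copy);
    }
}));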

Related

Read multiple files at runtime (dataflow template)

I am trying to build a Dataflow template.
The goal is to read a ValueProvider that will tell me which files to read.
Then, for each file, I need to read it and enrich the data with the object.
I have tried this:
p.apply(Create.of(options.getScheduleBatch()))
    .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
    .apply(ParDo.of(new DoFn<FileReceived, PCollection<EventRow>>() {
        @ProcessElement
        public void process(ProcessContext c) {
            FileReceived fileReceived = c.element();
            Broker broker = configuration.getBroker(fileReceived.getBrokerId());
            PCollection<EventRow> eventRows = p
                .apply(TextIO.read().from(fileReceived.getUri()))
                .apply(ParDo.of(StringToEventRowFn.of(broker, fileReceived, options.getJobName())));
            c.output(eventRows);
        }
    }));
But I have the following error:
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.beam.sdk.values.PCollection.
I would love to find a better way than reading the file myself using the GCS client.
Do you have any tips?
Best regards
The problem:
You're trying to emit a PCollection as an output of your ParDo. This doesn't work.
Details:
PCollection is an abstraction that represents a potentially unbounded collection of elements. Applying a transformation to a PCollection gives you another PCollection. One of the transformations you can apply is a ParDo. ParDos perform element-wise transforms. When applying a ParDo you're expressing: "take this PCollection and make another one by converting all elements within it with that ParDo".
One of the things that makes the processing effective is the ability to execute everything in parallel, e.g. converting a lot of elements at once on multiple execution nodes (e.g. VMs/machines) by running the same ParDo on each of them against different elements. You can't explicitly control whether any specific transform will happen on the same execution node or another one; it's part of the underlying system design how to optimize this. But to enable this, the system must be able to potentially pass elements around between execution nodes and persist them for aggregation.
Beam supports this by requiring you to implement Coders for elements. Coders are a serialization mechanism that allows Beam to convert an element (represented by a Java object) into a byte array which can then be passed to the next transformation (that can potentially happen on another machine) or to storage. For example, Beam needs to be able to encode the elements that you output from a ParDo. Beam knows how to serialize some types, but it cannot infer everything automatically; you have to provide coders for anything that cannot be inferred.
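For example, if Beam could not infer a coder for the EventRow elements, you could supply one explicitly. A minimal sketch, assuming EventRow implements Serializable (the variable names here are illustrative; a @DefaultCoder annotation or a custom Coder would also work):

PCollection<EventRow> eventRows = lines
    .apply(ParDo.of(StringToEventRowFn.of(broker, fileReceived, options.getJobName())))
    .setCoder(SerializableCoder.of(EventRow.class));  // tells Beam how to encode/decode EventRow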
Your example looks like this: take some PCollection, and convert it into another PCollection by applying a ParDo to each element, where that ParDo transforms each input element into a PCollection. This means that as soon as an element gets processed by the ParDo, you have to encode it and pass it to the next transformation. And the question here is: how do you encode and pass a (potentially unbounded) PCollection to the next transform, or persist it for aggregation?
Beam doesn't support this at the moment, so you will need to choose another approach.
In your specific case I am not sure whether, out of the box, Beam lets you simply take a stream of file names and turn them into sub-pipelines for processing the lines in the files.
Workarounds:
A few approaches I can think of to bypass this limitation:
If your file names have a known pattern, you can specify the pattern in TextIO and it can read the new files as they arrive.
If they don't have a known pattern, you can potentially write another pipeline to rename the files so that they have a common name pattern, and then use that pattern in TextIO in another pipeline.
If feasible (e.g. the files fit in memory), you could probably read the file contents with the plain Java file API, split them into rows and emit those rows in a single ParDo, as in the sketch after this list. Then you can apply the same StringToEventRowFn in the following ParDo.
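A rough sketch of that last workaround, using Beam's FileSystems utility rather than java.io so that gs:// paths are also handled (FileReceived and StringToEventRowFn are the classes from the question; the whole-file-in-memory caveat still applies):

.apply(ParDo.of(new DoFn<FileReceived, String>() {
    @ProcessElement
    public void process(ProcessContext c) throws IOException {
        FileReceived fileReceived = c.element();
        // Open the file through Beam's FileSystems so local and GCS paths both work.
        ReadableByteChannel channel =
            FileSystems.open(FileSystems.matchNewResource(fileReceived.getUri(), false /* isDirectory */));
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Channels.newInputStream(channel), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                c.output(line);  // emit each row; StringToEventRowFn runs in the next ParDo
            }
        }
    }
}))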
Hope this helps

What are the steps in doing incrementation in Erlang?

increment([]) -> [];
increment([H|T]) -> [H+1|increment(T)].
decrement([]) -> [];
decrement([H|T]) -> [H-1|decrement(T)].
So I have this code, but I don't know how these functions actually work, the way I would in Java.
Java and Erlang are different beasts. I don't recommend trying to make comparisons to Java when learning Erlang, especially if Java is the only language you know so far. The code you've posted is a good example of the paradigm known as "functional programming". I'd suggest doing some reading on that subject to help you understand what's going on. To try to break this down as far as Erlang goes, you need to understand that an Erlang function is completely different from a Java method.
In Java, your method signature is composed of the method name and the types of its arguments. The return type can also be significant. A Java increment method like the function you wrote might be written like List<Integer> increment(List<Integer> input). The body of the Java method would probably iterate through the list an element at a time and set each element to itself plus one:
List<Integer> increment(List<Integer> input) {
    for (int i = 0; i < input.size(); i++) {
        input.set(i, input.get(i) + 1);
    }
    return input;
}
Erlang has almost nothing in common with this. To begin with, an Erlang function's "signature" is the name and arity of the function. Arity means how many arguments the function accepts. So your increment function is known as increment/1, and that's its unique signature. The way you write the argument list inside the parentheses after the function name has less to do with argument types than with the pattern of the data passed to it. A function like increment([]) -> ... can only successfully be called by passing it [], the empty list. Likewise, the function increment([Item]) -> ... can only be successfully called by passing it a list with one item in it, and increment([Item1, Item2]) -> ... must be passed a list with two items in it. This concept of matching data to patterns is quite aptly known as "pattern matching", and you'll find it in many functional languages. In Erlang functions, it's used to select which head of the function to execute. This bears a rough similarity to Java's method overloading, where you can have many methods with the same name but different argument types; however, a pattern in an Erlang function head can bind variables to different pieces of the arguments that match the pattern.
In your code example, the function increment/1 has two heads. The first head is executed only if you pass an empty list to the function. The second head is executed only if you pass a non-empty list to the function. When that happens, two variables, H and T, are bound. H is bound to the first item of the list, and T is bound to the rest of the list, meaning all but the first item. That's because the pattern [H|T] matches a non-empty list, including a list with one element, in which case T would be bound to the empty list. The variables thus bound can be used in the body of the function.
The bodies of your functions are a very typical form of iterating a list in Erlang to produce a new list. It's typical because of another important difference from Java, which is that Erlang data is immutable. That means there's no such concept as "setting an element of a list" like I did in the Java code above. If you want to change a list, you have to build a new one, which is what your code does. It effectively says:
The result of incrementing the empty list is the empty list.
The result of incrementing a non-empty list is:
Take the first element of the list: H.
Increment the rest of the list: increment(T).
Prepend H+1 to the result of incrementing the rest of the list.
Note that you want to be careful about how you build lists in Erlang, or you can end up wasting a lot of resources. The List Handling User's Guide is a good place to learn about that. Also note that this code uses a concept known as "recursion", meaning that the function calls itself. In many popular languages, including Java, recursion is of limited usefulness because each new function call adds a stack frame, and your available memory space for stack frames is relatively limited. Erlang and many functional languages support a thing known as "tail call elimination", which is a feature that allows properly written code to recurse indefinitely without exhausting any resources.
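For comparison, a direct Java translation of the recursive Erlang style might look like the sketch below (hypothetical code, not from the answer). Like the Erlang version it builds a new list rather than mutating the input, but in Java each recursive call consumes a frame on a fixed-size stack, so it would overflow on very long lists:

List<Integer> increment(List<Integer> input) {
    if (input.isEmpty()) {
        return new ArrayList<>();                                    // increment([]) -> [];
    }
    List<Integer> rest = increment(input.subList(1, input.size()));  // increment(T)
    List<Integer> result = new ArrayList<>();
    result.add(input.get(0) + 1);                                    // H + 1, prepended to the rest
    result.addAll(rest);
    return result;
}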
Hopefully this helps explain things. If you can ask a more specific question, you might get a better answer.

Indexing a PCollection in DataFlow

I've built a PCollection in Cloud Dataflow that I will write to disk as it is. I would like to build another collection that references items in the first collection by their index. e.g.
PC1:
strings go here
some other string here
more strings
PC2:
0,1
1,1
0,2
I'm unsure how to get the indices in PC1 without writing the whole pipeline and starting another, and even then I'm not sure how to keep record of the line/record number being read. Is it safe to simply use a static variable? I would assume not based on the generally parallel nature of the platform.
PCollections are inherently unordered, so there's no such thing as "the index of an item in a collection". However, you can include the line number in the element itself: have PC1 be a PCollection<KV<Integer, String>> where the Integer is the line number, i.e. read lines from a text file paired with their line number.
We currently don't provide a built-in source that does this. Your best bet would be to write a simple DoFn<String, KV<Integer, String>> that takes the filename as input, uses IOChannelFactory to open the file, reads it line by line, and emits the contents with line numbers to produce PC1.
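A minimal sketch of such a DoFn (for brevity this uses a plain FileReader, which only handles local paths; for GCS you would open the file through the SDK's IOChannelFactory as described above, and the class and field names here are chosen only for illustration):

class ReadWithLineNumbersFn extends DoFn<String, KV<Integer, String>> {
    @ProcessElement
    public void process(ProcessContext c) throws IOException {
        String filename = c.element();
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            int lineNumber = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                c.output(KV.of(lineNumber++, line));  // pair each line with its index to build PC1
            }
        }
    }
}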

Source code logic evaluation

I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can enforce that a user gives a legitimate function by analyzing the function signature: making sure the argument and return value is an array of integers. But I have no idea how to determine that the algorithm logic is being done the right way. The input code could sort values correctly, but not with the aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has a number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with a single instance variable A ("the array") that holds the to-be-sorted data,
a key type K ("array index") with a partial order, used to access the container, such that container[K] is of type S,
a comparison of two members of the container, using key A and key B such that A < B according to the key partial order, that determines whether container[B] > container[A],
a swap operation on container[A], container[B] and some variable T of type S, that is conditionally dependent on the comparison,
a loop wrapped around the container that enumerates keys according to the partial order on K.
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
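For reference, here is a textbook Java bubble sort with those parts marked (this code is illustrative, not taken from the answer; S is int with the usual ordering, the container C is an int[], and the key type K is an int index):

void bubbleSort(int[] container) {
    for (int i = 0; i < container.length - 1; i++) {        // loop enumerating keys in the order of K
        for (int j = 0; j < container.length - 1 - i; j++) {
            if (container[j] > container[j + 1]) {           // comparison of container[A] and container[B]
                int t = container[j];                        // conditional swap through temporary T
                container[j] = container[j + 1];
                container[j + 1] = t;
            }
        }
    }
}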
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you check that various recognized bits occur in appropriate ordering
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc procedural code to climb over the AST, ST, CFG and DFG to "recognize" each of the individual parts. This is likely to be pretty messy, as each recognizer will be checking these structures for evidence of its bit. But you can do it.
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a dataflow pattern matching language, inspired by Rich and Waters' 1980s "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(container\) {
\conditionalswap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\) \)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and asked to "match bubble_sort", DMS will visit the DFG (tied to the CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of the patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the pattern takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots], and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement fairly complete factory control model extraction tools from real industrial plant controllers for Dow Chemical, on their peculiar Dowtran language (which meant building parsers, etc., as above, for Dowtran). We have a version of this prototyped for C; the data flow analysis is harder.

Why build an AST walker instead of having the nodes responsible for their own output?

Given an AST, what would be the reason behind making a Walker class that walks over the tree and does the output, as opposed to giving each Node class a compile() method and having it responsible for its own output?
Here are some examples:
Doctrine 2 (an ORM) uses a SQLWalker to walk over an AST and generate SQL from nodes.
Twig (a templating language) has the nodes output their own code (this is an if statement node).
Using a separate Walker for code generation avoids combinatorial explosion in the number of AST node classes as the number of target representations increases. When a Walker is responsible for code generation, you can retarget it to a different representation just by altering the Walker class. But when the AST nodes themselves are responsible for compilation, you need a different version of each node for each separate target representation.
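A minimal Java sketch of the two approaches (the node and walker names here are invented for illustration, not taken from Doctrine or Twig):

// Approach 1: each node emits its own output; every new target format touches every node class.
interface SelfCompilingNode {
    String compile();
}

// Approach 2: nodes only expose structure and accept a visitor ("walker");
// each target representation is one new walker class, and the node classes never change.
interface Visitor<R> {
    R visitLiteral(Literal node);
    R visitAdd(Add node);
}

interface Expr {
    <R> R accept(Visitor<R> visitor);
}

class Literal implements Expr {
    final int value;
    Literal(int value) { this.value = value; }
    public <R> R accept(Visitor<R> v) { return v.visitLiteral(this); }
}

class Add implements Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
}

class SqlWalker implements Visitor<String> {
    public String visitLiteral(Literal n) { return String.valueOf(n.value); }
    public String visitAdd(Add n) { return n.left.accept(this) + " + " + n.right.accept(this); }
}

// Usage: new Add(new Literal(1), new Literal(2)).accept(new SqlWalker()) yields "1 + 2".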
Mostly because of old literature and available tools. Experimenting with both methods, you can easily find that AST traversal produces very slow and convoluted code. Moreover, code separated from the immediate syntax doesn't resemble it anymore. It's very much like supporting two synchronized code bases, which is always a bad idea. Debugging and maintenance become difficult.
Of course, it can also be difficult to process semantics on the nodes unless you have a well designed state machine. But you are never worse off than when traversing the AST after the fact, because that is just one particular case of processing semantics on nodes.
You can often hear that AST traversal allows multiple semantics to be implemented for the same syntax. In reality you would never want that, not only because it's rarely needed, but also for performance reasons. And frankly, there is no difficulty in writing a separate syntax for a different semantics; the results were always better when both were designed together.
And finally, in every non-trivial task, getting the syntax parsed is the easiest part; getting the semantics correct and processing actions fast is the challenge. Focusing on the AST is approaching the task backwards.
To have support for a feature that the "internal AST walker" doesn't have.
For example, there are several ways to traverse a "hierarchical" or "tree" structure, such as "walk through the leaves first" or "walk through the branches first".
Or the sibling nodes may have a sort index, and you want to "walk" / "visit" them in decreasing rather than increasing index order.
If the AST class or structure you have only supports one method, you may want to use another one via your own custom "walker" / "visitor".
