Indexing a PCollection in DataFlow - google-cloud-dataflow

I've built a PCollection in Cloud Dataflow that I will write to disk as it is. I would like to build another collection that references items in the first collection by their index. e.g.
PC1:
strings go here
some other string here
more strings
PC2:
0,1
1,1
0,2
I'm unsure how to get the indices in PC1 without writing the whole pipeline and starting another, and even then I'm not sure how to keep record of the line/record number being read. Is it safe to simply use a static variable? I would assume not based on the generally parallel nature of the platform.

PCollections are inherently unordered, so there's no such thing as "the index of an item in a collection". However, you can include the line number in the element itself: have PC1 be a PCollection<KV<Integer, String>> where the Integer is the line number - basically, read lines from a text file paired with their line numbers.
We currently don't provide a built-in source that does this. Your best bet would be to write a simple DoFn<String, KV<Integer, String>> that takes the filename as input, uses IOChannelFactory to open the file, reads it line by line, and emits the contents with line numbers to produce PC1, along the lines of the sketch below.
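A rough, untested sketch of such a DoFn, written in the Dataflow 1.x SDK style (IOChannelUtils/IOChannelFactory and an overridden processElement); adjust names and imports to whatever SDK version you're on:

// Imports assumed: com.google.cloud.dataflow.sdk.transforms.DoFn, ...sdk.values.KV,
// ...sdk.util.IOChannelUtils / IOChannelFactory, java.io.BufferedReader, java.nio.channels.Channels
class ReadLinesWithNumbersFn extends DoFn<String, KV<Integer, String>> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    String filename = c.element();
    // IOChannelUtils picks the right IOChannelFactory for gs:// or local paths.
    IOChannelFactory factory = IOChannelUtils.getFactory(filename);
    try (BufferedReader reader = new BufferedReader(
        Channels.newReader(factory.open(filename), "UTF-8"))) {
      int lineNumber = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        // Emit each line keyed by its position in the file.
        c.output(KV.of(lineNumber++, line));
      }
    }
  }
}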

Related

Extracting PDF Tables into Excel in Automation Anywhere

I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the community version of AA 11.3). I watched videos on the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging, and the main reason for that is the values that contain multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for such a thing and you may need to resort to scripting (VBScripts) or Metabots.
Solution 1
This one relies purely on text extraction and regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to see what the exported data looks like. You can export as either Plain or Structured text.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table, nothing else. You can probably use the Before-After String command for that.
Then you need to check whether you can reliably identify the character width of every column. You can try this for yourself by copying the text into Excel, using Text to Columns with the Fixed Width option, and playing around with the sliders.
Then you need to find a way to reliably identify each data row and prepare it for the Split command in AA. For that you need a delimiter, but since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace command with the Regular Expression option and replaced a specific pattern with a delimiter (a pipe), along the lines of the sketch below.
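As a plain-Java illustration of that Replace step (structuredText stands for the exported text, and the regex is only an assumption based on the anchor described above - a number, two or more spaces, a dollar amount - so adapt it to your actual data):

// Append a pipe after every line that ends with the "number, spaces, $amount" anchor,
// so each logical data row becomes a pipe-terminated chunk.
String withDelimiters = structuredText.replaceAll(
    "(\\d+ {2,}\\$[\\d.,]+) *\\r?\\n", "$1|");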
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Then loop through each text line of a single data row, use the Substring function to extract the data based on column width, and concatenate the pieces into a single value that you store somewhere else (see the sketch below).
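A hedged Java sketch of that per-row reassembly: split one data row (dataRow, a pipe-delimited chunk from the previous step) into its text lines, cut each line at fixed column boundaries, and concatenate the slices per column. The widths are purely illustrative - use the ones you measured:

int[] widths = {20, 20, 15, 10, 12};            // illustrative column widths
String[] textLines = dataRow.split("\r?\n");    // one data row may span several text lines
String[] columns = new String[widths.length];
java.util.Arrays.fill(columns, "");
for (String line : textLines) {
    int offset = 0;
    for (int i = 0; i < widths.length && offset < line.length(); i++) {
        int end = Math.min(offset + widths[i], line.length());
        // Append this line's slice of column i to the accumulated column value.
        columns[i] = (columns[i] + " " + line.substring(offset, end).trim()).trim();
        offset = end;
    }
}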
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try - open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier and you will be able to use Macros/VBA or even simple copy and paste. I tried it on a random PDF of my own and it works quite well.

Read multiple files at runtime (dataflow template)

I am trying to build a Dataflow template.
The goal is to read a ValueProvider that will tell me what files to read.
Then, for each file, I need to read and enrich the data with the object.
I have tried this:
p.apply(Create.of(options.getScheduleBatch()))
    .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
    .apply(ParDo.of(new DoFn<FileReceived, PCollection<EventRow>>() {
        @ProcessElement
        public void process(ProcessContext c) {
            FileReceived fileReceived = c.element();
            Broker broker = configuration.getBroker(fileReceived.getBrokerId());
            PCollection<EventRow> eventRows = p
                .apply(TextIO.read().from(fileReceived.getUri()))
                .apply(ParDo.of(StringToEventRowFn.of(broker, fileReceived, options.getJobName())));
            c.output(eventRows);
        }
    }));
But I have the following error:
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.beam.sdk.values.PCollection.
I would love to find a better way than reading the file by myself using the GCS client.
Do you have any tips?
Best regards
The problem:
You're trying to emit a PCollection as an output of your ParDo. This doesn't work.
Details:
PCollection is an abstraction that represents a potentially unbounded collection of elements. Applying a transformation to a PCollection gives you another PCollection. One of the transformations you can apply is a ParDo. ParDos perform element-wise transforms. When applying a ParDo you're expressing: "take this PCollection and make another one by converting all elements within it by applying that ParDo".
One of the things that makes the processing effective is the ability to execute everything in parallel, e.g. converting a lot of elements at once on multiple execution nodes (e.g. VMs/machines) by running the same ParDo on each of them against different elements. You can't explicitly control whether any specific transform will happen on the same execution node or another one; how to optimize this is part of the underlying system design. But to enable this, it must be possible to pass elements between execution nodes and to persist them for aggregation.
Beam supports this by requiring you to implement Coders for elements. Coders are a serialization mechanism that allows Beam to convert an element (represented by a Java object) into a byte array, which can then be passed to the next transformation (which can potentially happen on another machine) or to storage. For example, Beam needs to be able to encode the elements that you output from a ParDo. Beam knows how to serialize some types, but it cannot infer everything automatically; you have to provide coders for anything that cannot be inferred.
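As an aside, providing a coder explicitly typically looks something like this - a hedged sketch that reuses the FileReceived type from your snippet and assumes it implements Serializable (SerializableCoder is just one option; a custom Coder or AvroCoder would also work):

PCollection<FileReceived> files =
    p.apply(Create.of(options.getScheduleBatch()))
     .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
     // Tell Beam explicitly how to encode FileReceived elements when it cannot infer a coder.
     .setCoder(SerializableCoder.of(FileReceived.class));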
Your example looks like this: take some PCollection and convert it into another PCollection by applying a ParDo to each element, where that ParDo transforms each input element into a PCollection. This means that as soon as an element gets processed by the ParDo, you have to encode it and pass it to the next transformation. And the question here is: how do you encode and pass a (potentially unbounded) PCollection to the next transform or persist it for aggregation?
Beam doesn't support this at the moment, so you will need to choose another approach.
In your specific case, I am not sure whether Beam out of the box lets you simply take a stream of filenames and then convert them into sub-pipelines for processing the lines in the files.
Workarounds:
A few approaches I can think of to bypass this limitation:
If your file names have a known pattern, you can specify the pattern in TextIO and it can read the new files as they arrive.
If they don't have a known pattern, you can potentially write another pipeline to rename the files so that they share a common name pattern, and then use that pattern in TextIO in another pipeline.
If feasible (e.g. the files fit in memory), you could probably read the file contents with the pure Java File API, split them into rows, and emit those rows in a single ParDo; then you can apply the same StringToEventRowFn in the following ParDo (see the sketch after this list).
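A rough, untested sketch of that last approach, reusing the FileReceived type and getUri() from your snippet. The Files.readAllLines call assumes local paths and small files - for gs:// URIs, swap in the GCS client or Beam's FileSystems API:

p.apply(Create.of(options.getScheduleBatch()))
    .apply(ParDo.of(StringScheduleBatchToFileReceivedFn.of()))
    .apply("ReadFileLines", ParDo.of(new DoFn<FileReceived, String>() {
        @ProcessElement
        public void process(ProcessContext c) throws IOException {
            FileReceived fileReceived = c.element();
            // Emit each line of the file as its own element.
            for (String line : Files.readAllLines(Paths.get(fileReceived.getUri()))) {
                c.output(line);
            }
        }
    }));
// A following ParDo with your StringToEventRowFn would then consume these lines;
// note the broker/file context now has to come from somewhere else (e.g. a side input).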
Hope this helps

Can I modify elements within an apache beam transform?

The Apache Beam programming guide contains the following rule:
3.2.2. Immutability
A PCollection is immutable. Once created, you cannot add, remove, or
change individual elements. A Beam Transform might process each
element of a PCollection and generate new pipeline data (as a new
PCollection), but it does not consume or modify the original input
collection.
Does this mean I cannot, must not, or should not modify individual elements in a custom transform?
Specifically, I am using the python SDK and considering the case of a transform that takes a dict {key: "data"} as input, does some processing and adds further fields {other_key: "some more data"}.
My interpretation of rule 3.2.2 above is that I should do something like
def process(self, element):
    import copy
    output = copy.deepcopy(element)
    output[other_key] = some_data
    yield output
but I am wondering if this may be a bit overkill.
Using a TestPipeline, I found that the elements of the input collection are also modified if I act on them in the process() method (unless the elements are basic types such as int, float, bool...).
Is mutating elements considered an absolute no-go, or just a practice one has to be careful with?
Mutating elements is an absolute no-go, and it can and will lead to violations of the Beam model semantics, i.e. to incorrect and unpredictable results. The Beam Java direct runner intentionally detects mutations and fails pipelines that do so - this is not yet implemented in the Python runner, but it should be.
The reason for that is, primarily, fusion. E.g. imagine that two DoFns are applied to the same PCollection "C" (f(C) and g(C), not f(g(C))), and a runner schedules them to run in the same shard. If the first DoFn modifies the element, then by the time the second DoFn runs, the element has already been changed - i.e. the second DoFn is not really being applied to "C" (see the sketch below). There are a number of other scenarios where mutations will lead to incorrect results.
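To illustrate, a minimal hedged sketch in the Java SDK (since the mutation detection mentioned above lives in the Java direct runner); imports are omitted and the two anonymous DoFns are made up for illustration:

PCollection<StringBuilder> c = p.apply(
    Create.of(new StringBuilder("original"))
          .withCoder(SerializableCoder.of(StringBuilder.class)));

PCollection<String> f = c.apply("f", ParDo.of(new DoFn<StringBuilder, String>() {
    @ProcessElement
    public void process(ProcessContext ctx) {
        ctx.element().append("-MUTATED");      // illegal in-place mutation of the input element
        ctx.output(ctx.element().toString());
    }
}));

PCollection<String> g = c.apply("g", ParDo.of(new DoFn<StringBuilder, String>() {
    @ProcessElement
    public void process(ProcessContext ctx) {
        // If the runner fuses "f" and "g" onto the same worker and "f" runs first,
        // this DoFn may observe "original-MUTATED" instead of the original element of "C".
        ctx.output(ctx.element().toString());
    }
}));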

How to create a string outside of Erlang that represents a DICT Term?

I want to construct a string in Java that represents a DICT term and that will be passed to an Erlang process to be reflected back as an Erlang term (string-to-term).
I can achieve this easily for ORDDICTs, since they are structured as simple sorted key/value pairs in a list of tuples such as: [{field1, "value1"}, {field2, "value2"}]
But DICTs are compiled into a specific internal term that I would like to reverse-engineer. I am aware this structure can change across releases, but the benefits in performance and ease of integration with Java would outweigh this. Unfortunately Erlang's JInterface is based on simple data structures. An efficient DICT type would be of great use.
A simple dict gets defined as follows:
D1 = dict:store("field1","AAA",dict:new()).
{dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],[],[],[]}}}
As can be seen above, there are some coordinates whose meaning I do not understand (the numbers 1, 16, 16, 8, 80, 48) and a set of empty lists, which likely represent something as well.
Adding two other rows (key-value pairs) causes the data to look like:
D3 = dict:store("field3","CCC",D2).
{dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],
[["field3",67,67,67]],
[],[],[],[],[],
[["field1",65,65,65]],
[],[],[],[],
[["field2",66,66,66]],
[],[]}}}
From the above I can notice that:
the first number (3) represents the number of items in the DICT.
the second number (16) shows the number of list slots in the first tuple of lists
the third number (16) shows the number of list slots in the second tuple of lists, into which the values end up being placed (in the middle).
the fourth number (8) appears to be the number of slots in the second row of tuples where the values are placed (a sort of index pointer)
the remaining numbers (80 and 48)... no idea...
adding a key "field0" places it not at the end but just after "field1"'s data. This indicates the indexing approach.
So the question is: is there a way (an algorithm) to reliably create a DICT string directly from outside of Erlang?
The comprehensive specification of how dict is implemented can be found simply in the dict.erl source code.
But I'm not sure replicating dict.erl's implementation in Java is worthwhile. This would only make sense if you want a fast dict-like data structure that you need to pass often between Java and Erlang code. It might make more sense to use a key-value store from both Erlang and Java without passing it around directly. Depending on your application this could be e.g. riak, or you could even connect your different language worlds with RabbitMQ. Both examples are implemented in Erlang and are easily accessible from both worlds. If you do need to pass the data directly, a simpler option is sketched below.
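A hedged alternative to reverse-engineering the dict internals is to send a plain key/value list via JInterface and rebuild the dict on the Erlang side with dict:from_list/1. A minimal sketch (the field names and values are just examples):

import com.ericsson.otp.erlang.*;

// Build [{field1, "value1"}, {field2, "value2"}] as an Erlang term in Java.
OtpErlangObject[] pair1 = {new OtpErlangAtom("field1"), new OtpErlangString("value1")};
OtpErlangObject[] pair2 = {new OtpErlangAtom("field2"), new OtpErlangString("value2")};
OtpErlangObject[] pairs = {new OtpErlangTuple(pair1), new OtpErlangTuple(pair2)};
OtpErlangList proplist = new OtpErlangList(pairs);
// Send `proplist` to the Erlang node as usual; on the Erlang side,
// dict:from_list(List) turns it back into a dict without Java ever
// needing to know dict's internal representation.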

When to Define "unit" in the TypeSpecifierList for Erlang Bins

I've started learning Erlang and recently wrapped up the section on bit syntax. I feel I have a firm understanding of how binaries can be constructed and matched, but I failed to come up with an example of when I would want to change the default value of "unit" inside the TypeSpecifierList.
Can anyone share a situation when this would prove useful?
Thanks for your time.
Sometimes, just for convenience: you've got a parameter from somewhere (e.g., from a file header) specifying a count of units of a given size, such as N words of 24-bit audio data, and instead of doing some multiplication, you just say:
<<Audio:N/binary-unit:24, Rest/binary>> = Data
to extract that data (as a chunk) from the rest of the file contents. After parsing the rest of the file, you could pass that chunk to some other function that splits it up into samples.
