Reading multiple wildcard paths into a DataFlow PCollection - google-cloud-dataflow

In my situation, I have a bunch of events stored as small files in Cloud Storage under a date folder. My data might look like this:
2022-01-01/file1.json
2022-01-01/file2.json
2022-01-01/file3.json
2022-01-01/file4.json
2022-01-01/file5.json
2022-01-02/file6.json
2022-01-02/file7.json
2022-01-02/file8.json
2022-01-03/file9.json
2022-01-03/file10.json
The DataFlow job will take start and end date as input, and needs to read all files within that date range.
I am working off of this guide: https://pavankumarkattamuri.medium.com/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831
I see there is a way to load a list of files into a PCollection:
def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options
    file_list = ['gs://bucket_1/folder_1/file.csv', 'gs://bucket_2/data.csv']
    p = beam.Pipeline(options=pipeline_options)
    p1 = (p | "create PCol from list" >> beam.Create(file_list)
            | "read files" >> ReadAllFromText()
            | "transform" >> beam.Map(lambda x: x)
            | "write to GCS" >> WriteToText('gs://bucket_3/output'))
    result = p.run()
    result.wait_until_finish()
I also see there is a way to specify wildcards, but I haven't seen them used together.
I'm wondering whether beam.Create() supports wildcards in the file list. This is my proposed solution:
def run(argv=None):
    # argument parser
    # pipeline options, google_cloud_options
    file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://2022-01-03/*.json']
    p = beam.Pipeline(options=pipeline_options)
    p1 = (p | "create PCol from list" >> beam.Create(file_list)
            | "read files" >> ReadAllFromText()
            | "transform" >> beam.Map(lambda x: x)
            | "write to GCS" >> WriteToText('gs://bucket_3/output'))
    result = p.run()
    result.wait_until_finish()
I have not tried this yet, as I'm not sure it's the best approach, and I don't see any similar examples online. Am I going in the right direction?
EDIT: This is my revised code after reading the answers:
with beam.Pipeline() as p:
    file_list = ['gs://ext-pub-testjeff.appspot.com/2022-01-02/*.json', 'gs://ext-pub-testjeff.appspot.com/2022-01-03/*.json']
    for i, file in enumerate(file_list):
        p = (p | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0))
    p = (p | "write to GCS" >> WriteToText('gs://ext-pub-testjeff.appspot.com/output'))
EDIT2: Using the original code:
with beam.Pipeline() as p:
    file_list = ['gs://ext-pub-testjeff.appspot.com/2022-01-02/*.json', 'gs://ext-pub-testjeff.appspot.com/2022-01-03/*.json']
    for i, file in enumerate(file_list):
        p = (p
             | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
             | f"write to GCS {i}" >> WriteToText('gs://ext-pub-testjeff.appspot.com/output'))

You can use the following approach if it's not possible to use a single URL with a wildcard:
def run():
    # argument parser
    # pipeline options, google_cloud_options
    with beam.Pipeline(options=pipeline_options) as p:
        file_list = ['gs://bucket_1/2022-01-02/*.json', 'gs://2022-01-03/*.json']
        for i, file in enumerate(file_list):
            (p
             | f"Read Text {i}" >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
             | f"transform {i}" >> beam.Map(lambda x: x)
             | f"write to GCS {i}" >> WriteToText('gs://bucket_3/output'))
We loop over the file paths with wildcards and, for each element, apply a read transform followed by a write.
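Since the question's job takes a start and end date as input, one way to build that wildcard list is to generate one pattern per day in the range. A minimal sketch (`date_patterns` is a hypothetical helper, not part of Beam; it only builds the strings that would be fed to the loop above or to beam.Create):

```python
from datetime import date, timedelta

def date_patterns(bucket, start, end):
    """Build one 'gs://bucket/YYYY-MM-DD/*.json' pattern per day in [start, end]."""
    patterns = []
    d = start
    while d <= end:
        patterns.append(f"gs://{bucket}/{d.isoformat()}/*.json")
        d += timedelta(days=1)
    return patterns

file_list = date_patterns("bucket_1", date(2022, 1, 1), date(2022, 1, 3))
# file_list == ['gs://bucket_1/2022-01-01/*.json',
#               'gs://bucket_1/2022-01-02/*.json',
#               'gs://bucket_1/2022-01-03/*.json']
```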

Related

How to pass input to beam.Flatten()?

I started using Apache Beam with Python and I am stuck every 30 minutes. I am trying to use Flatten in this transformation:
lines = messages | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
output = (lines
          | 'process' >> beam.Map(process_xmls)  # returns list
          | 'jsons' >> beam.Map(lambda x: [beam.Create(jsons.dump(model)) for model in x])
          | 'flatten' >> beam.Flatten()
          | beam.WindowInto(window.FixedWindows(1, 0)))
So after running this code I get this error:
ValueError: Input to Flatten must be an iterable. Got a value of type <class 'apache_beam.pvalue.PCollection'> instead.
What should I do?
The beam.Flatten() operation takes an iterable of PCollections and returns a new PCollection that contains the union of all elements in the input PCollections. It is not possible to have a PCollection of PCollections.
I think what you're looking for here is the beam.FlatMap operation. This differs from beam.Map in that it emits multiple elements per input. For example, if you have a pcollection lines that contained the elements {'two', 'words'} then
lines | beam.Map(list)
would be the PCollection consisting of two lists
{['t', 'w', 'o'], ['w', 'o', 'r', 'd', 's']}
whereas
lines | beam.FlatMap(list)
would result in the PCollection consisting of several letters
{'t', 'w', 'o', 'w', 'o', 'r', 'd', 's'}.
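The same difference can be checked in plain Python (an analogy only; Beam is not involved here, list comprehensions just stand in for the two transforms):

```python
lines = ["two", "words"]

# Analogue of beam.Map(list): one output element per input element
mapped = [list(word) for word in lines]
# -> [['t', 'w', 'o'], ['w', 'o', 'r', 'd', 's']]

# Analogue of beam.FlatMap(list): each returned iterable is flattened into the output
flattened = [ch for word in lines for ch in word]
# -> ['t', 'w', 'o', 'w', 'o', 'r', 'd', 's']
```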
Thus your final program would look something like
lines = messages | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
output = (lines
          | 'process' >> beam.FlatMap(process_xmls)  # concatenates all lists returned by process_xmls into a single PCollection
          | 'jsons' >> beam.Map(jsons.dumps)  # apply jsons.dumps to each element
          | beam.WindowInto(window.FixedWindows(1, 0)))
(Note also that json.dumps, which returns a string, is probably what you want instead of json.dump, which takes the file/stream to write to as a second argument.)

The namespace or module is not defined

I am trying to work through the getting started docs for F#
Visual Studio Code shows an error
If I hover my mouse over the red squiggle I see the error message
"The namespace or module 'ClassLibraryDemo' is not defined"
Here is the code for ClassLibraryDemo.fs:
namespace ClassLibraryDemo

module PigLatin =
    let toPigLatin (word: string) =
        let isVowel (c: char) =
            match c with
            | 'a' | 'e' | 'i' | 'o' | 'u'
            | 'A' | 'E' | 'I' | 'O' | 'U' -> true
            | _ -> false
        if isVowel word.[0] then
            word + "yay"
        else
            word.[1..] + string(word.[0]) + "ay"
Please check the feedback in FSI when you execute #load "ClassLibraryDemo.fs". You should see something like this:
FSI: [Loading c:\Users\*****\Documents\Source\SO2017\SO180207\TestModule.fs] namespace FSI_0002.TestModule val testFunc : unit -> unit
Most probably FSI can't find your file, either because the file name is misspelt or the file is in another directory. There could be other possible causes of not being able to see a namespace, for example not restoring a project, or corrupted cache (this I haven't seen in a while).

Writing Custom Expression parser or using ANTLR library?

I have expressions like follows:
eg 1: (f1 AND f2)
eg 2: ((f1 OR f2) AND f3)
eg 3: ((f1 OR f2) AND (f3 OR (f4 AND f5)))
Each of f(n) is used to generate a fragment of SQL and each of these fragments will be joined using OR / AND described in the expression.
Now I want to :
1) Parse this expression
2) Validate it
3) Generate "Expression Tree" for the expression and use this tree to generate the final SQL.
I found this series of articles on writing tokenizers, parsers..etc ex :
http://cogitolearning.co.uk/2013/05/writing-a-parser-in-java-the-expression-tree/
I also came across with the library ANTLR , which wondering whether I can use for my case.
Any tips?
I'm guessing you might only be interested in Java (it would be good to say so in future), but if you have a choice of languages, I would recommend using Python and parsy for a task like this. It is much more lightweight than tools like ANTLR.
Here is some example code I knocked together that parses your samples into appropriate data structures:
import attr
from parsy import string, regex, generate

@attr.s
class Variable():
    name = attr.ib()

@attr.s
class Compound():
    left_value = attr.ib()
    right_value = attr.ib()
    operator = attr.ib()

@attr.s
class Expression():
    value = attr.ib()
    # You could put an `evaluate` method here,
    # or `generate_sql` etc.

whitespace = regex(r'\s*')
lexeme = lambda p: whitespace >> p << whitespace

AND = lexeme(string('AND'))
OR = lexeme(string('OR'))
OPERATOR = AND | OR
LPAREN = lexeme(string('('))
RPAREN = lexeme(string(')'))

variable = lexeme((AND | OR | LPAREN | RPAREN).should_fail("not AND OR ( )") >> regex(r"\w+")).map(Variable)

@generate
def compound():
    yield LPAREN
    left = yield variable | compound
    op = yield OPERATOR
    right = yield variable | compound
    yield RPAREN
    return Compound(left_value=left,
                    right_value=right,
                    operator=op)

expression = (variable | compound).map(Expression)
I'm also using attrs for the simple data structures.
The result of parsing is a hierarchy of expressions:
>>> expression.parse("((f1 OR f2) AND (f3 OR (f4 AND f5)))")
Expression(value=Compound(left_value=Compound(left_value=Variable(name='f1'), right_value=Variable(name='f2'), operator='OR'), right_value=Compound(left_value=Variable(name='f3'), right_value=Compound(left_value=Variable(name='f4'), right_value=Variable(name='f5'), operator='AND'), operator='OR'), operator='AND'))
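If pulling in parsy (or ANTLR) feels heavy, the grammar in the question is also small enough for a hand-written recursive-descent parser, which is the "write your own" option the question asks about. A minimal sketch (all names hypothetical; it produces nested tuples rather than the attrs classes above):

```python
import re

def tokenize(text):
    # Tokens are parens, the two keywords, and identifiers like f1
    return re.findall(r"\(|\)|AND|OR|\w+", text)

def parse(tokens, pos=0):
    """expr := '(' expr ('AND'|'OR') expr ')' | variable
    Returns (tree, next_position)."""
    if tokens[pos] == "(":
        left, pos = parse(tokens, pos + 1)
        op = tokens[pos]                      # AND or OR
        right, pos = parse(tokens, pos + 1)
        assert tokens[pos] == ")", "expected closing paren"
        return (op, left, right), pos + 1
    return tokens[pos], pos + 1               # a variable such as f1

tree, _ = parse(tokenize("((f1 OR f2) AND f3)"))
# tree == ('AND', ('OR', 'f1', 'f2'), 'f3')
```

Walking such a tuple tree recursively to emit SQL fragments joined by AND/OR is then straightforward.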

Why is $ allowed but $$, or <$> disallowed as an operator (FS0035) and what makes $ special?

$ is allowed in a custom operator, but if you try to use $$, <$> or for instance ~$% as operator name you will receive the following error:
error FS0035: This construct is deprecated: '$' is not permitted as a character in operator names and is reserved for future use
$ clearly also has '$' in its name, yet it works. Why?
I.e.:
let inline ( $ ) f y = f y

// using it works just fine:
let test =
    let add x = x + 1
    add $ 12
I see $ a lot in online examples, apparently as a particular kind of operator. What is this special treatment or role of $ (e.g. in Haskell or OCaml), and what should <$> do if it were allowed (edit)?
Trying to fool the system by creating a function like op_DollarDollar doesn't fly; the syntax check is done at the call site as well. Though, as an example, this trick does work with other (legal) operators:
// works:
let inline op_BarQmark f y = f y
let test =
    let add x = x + 1
    add |? 12

// also works:
let inline op_Dollar f y = f y
let test =
    let add x = seq { yield x + 1 }
    add $ 12
There's some inconsistency in the F# specification around this point. Section 3.7 of the F# spec defines symbolic operators as

regexp first-op-char = !%&*+-./<=>@^|~
regexp op-char = first-op-char | ?

token quote-op-left =
    | <@ <@@
token quote-op-right =
    | @> @@>
token symbolic-op =
    | ?
    | ?<-
    | first-op-char op-char*
    | quote-op-left
    | quote-op-right
(and $ also doesn't appear as a symbolic keyword in section 3.6), which would indicate that it's wrong for the compiler to accept ( $ ) as an operator.
However, section 4.4 (which covers operator precedence) includes these definitions:
infix-or-prefix-op :=
    +, -, +., -., %, &, &&

prefix-op :=
    infix-or-prefix-op
    ~ ~~ ~~~ (and any repetitions of ~)
    !OP (except !=)

infix-op :=
    infix-or-prefix-op
    -OP +OP || <OP >OP = |OP &OP ^OP *OP /OP %OP !=
    (or any of these preceded by one or more ‘.’)
    :=
    ::
    $
    or
    ?
and the following table of precedence and associativity does contain $ (but no indication that $ can appear as one character in any longer symbolic operator). Consider filing a bug so that the spec can be made consistent one way or the other.

loading elements from file into a tree in haskell

I am trying to make a tree from the info in a text document. For example, in example.txt we have the arithmetic expression (3 + x) * (5 - 2). I want to make a tree which looks like this:
Node * (Node + (Leaf 3) (Leaf x)) (Node - (Leaf 5) (Leaf 2))
So far after a lot of unsuccessful attempts I have done this:
data Tree a = Empty
            | Leaf a
            | Node a (Tree a) (Tree a)
            deriving (Show)
This is the tree I use, and:
take name = do
    elements <- readFile name
    return elements
So how can I put the elements in the tree?
You'll need to make a data type to put in the tree that can store both operations and values. One way to do this would be to create an ADT representing everything you want to store in the tree:
data Eval a
    = Val a
    | Var Char
    | Op (a -> a -> a)

type EvalTree a = Tree (Eval a)
But this isn't really ideal because someone could have Leaf (Op (+)), which doesn't make much sense here. Rather, I would suggest structuring it as
data Eval a
    = Val a
    | Var Char
    | Op (a -> a -> a) (Eval a) (Eval a)
Which is essentially the tree structure you have, just restricted to be syntactically correct. Then you can write a simple evaluator as
-- assumes: import qualified Data.Map
eval :: Data.Map.Map Char a -> Eval a -> Maybe a
eval vars (Val a) = Just a
eval vars (Var x) = Data.Map.lookup x vars
eval vars (Op op l r) = do
    left <- eval vars l
    right <- eval vars r
    return $ left `op` right
This will just walk down both branches, evaluating as it goes, then finally returning the computed value. You just have to supply it with a map of variables to values to use
So for example, (3 + x) * (5 - 2) would be represented as Op (*) (Op (+) (Val 3) (Var 'x')) (Op (-) (Val 5) (Val 2)). All that's left is to parse the file, which is another problem entirely.
