Beam python silently drops type checking for unsupported types? - google-cloud-dataflow

I had the following pipeline:
from typing import Sequence, List
import apache_beam as beam

def add_type(x) -> int:
    return x

# no type error with Sequence, type error with List.
def print_with_type(x: Sequence[int]):
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    lines = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type)
        # removing this line should trigger type error
        # | beam.combiners.ToList()
        | beam.Map(print_with_type))
I expected a type checking error when building the pipeline, but did not get it. Only after much debugging did I realize that I should use List instead of Sequence.
Is this expected, as Sequence is one of the supported types (doc)?
Is it possible to have a warning in such cases?

You should use List instead of Sequence. In the docs you referenced there is no Sequence listed, only List. Executing your code with Sequence in the Apache Beam Playground, I get an info message saying the hint is ignored, which explains why it is not throwing any type hint error. If you switch to List, everything works as expected.
By the way, I would recommend using with_input_types and with_output_types. If your pipeline gets more complex, this approach is more readable in my opinion, since you do not have to look up all of your custom classes and methods to understand the types, e.g.:
from typing import Sequence, List
import apache_beam as beam

def add_type(x) -> int:
    return x

# no type error with Sequence, type error with List.
def print_with_type(x: Sequence[int]):  # <- is ignored
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    lines = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type).with_output_types(int)  # <- is checked
        # removing this line should trigger type error
        # | beam.combiners.ToList()
        | beam.Map(print_with_type).with_input_types(List[int])  # <- is checked
    )

In your case, the type of the current element in the PCollection is not a List but an int:
from typing import Sequence, List
import apache_beam as beam

def add_type(x) -> int:
    return x

# The expected type is int
def print_with_type(x: int):
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    lines = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type)
        # | beam.combiners.ToList()
        | beam.Map(print_with_type))
When I test with the str type instead of int, I get the expected type hint error:
from typing import Sequence, List
import apache_beam as beam

def add_type(x) -> int:
    return x

# Test with the bad type
def print_with_type(x: str):
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    lines = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type)
        # | beam.combiners.ToList()
        | beam.Map(print_with_type))
The error is:
    raise TypeCheckError(
        'Type hint violation for \'{label}\': requires {hint} but got '
        '{actual_type} for {arg}\nFull type hint:\n{debug_str}'.format(
            label=self.label,
            hint=hint,
            actual_type=bindings[arg],
            arg=arg,
            debug_str=type_hints.debug_str()))
E   apache_beam.typehints.decorators.TypeCheckError: Type hint violation for 'Map(print_with_type)': requires <class 'str'> but got <class 'int'> for x
E   Full type hint:
E   IOTypeHints[inputs=((<class 'str'>,), {}), outputs=((Any,), {})]
When I test with a List of int, I again get the expected error:
from typing import Sequence, List
import apache_beam as beam

def add_type(x) -> int:
    return x

# Test with the bad type
def print_with_type(x: List[int]):
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    lines = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type)
        # | beam.combiners.ToList()
        | beam.Map(print_with_type))
The error is:
E apache_beam.typehints.decorators.TypeCheckError: Type hint violation for 'Map(print_with_type)': requires List[int] but got <class 'int'> for x
But when I test with Sequence, I don't get the expected error.
According to the documentation, the following types are supported for type checking:
Tuple[T, U]
Tuple[T, ...]
List[T]
KV[T, U]
Dict[T, U]
Set[T]
FrozenSet[T]
Iterable[T]
Iterator[T]
Generator[T]
PCollection[T]
The Sequence type is not part of this list, which is why it was ignored.
But it makes no sense to pass a List or a Sequence hint in your example, because the expected type is an int, not a List.
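If you prefer to keep the hint on the function itself rather than on the transform, Beam also ships decorator-style hints in apache_beam.typehints. A minimal sketch, assuming the same int elements as the example above:

import apache_beam as beam

def add_type(x) -> int:
    return x

# Declare the expected element type with Beam's own decorator so the
# pipeline-construction type check can see it (here the hint is satisfied).
@beam.typehints.with_input_types(int)
def print_with_type(x):
    print(x)

with beam.Pipeline(argv=["--type_check_additional", "all"]) as pipeline:
    _ = (
        pipeline
        | beam.Create([1, 2])
        | beam.Map(add_type)
        | beam.Map(print_with_type))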

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads a Pub/Sub message (currently just a single string key), creates a very short window, generates 3 integers keyed by this key via a beam.ParDo, and uses a simple Map to create a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On DataFlow the first two steps print, but the final Map with the side input is never called, presumably because it somehow is an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription') |
    'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))

def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i

def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'

items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)

def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])

_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?
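For reference, a sketch of what explicitly re-windowing the side-input collection into the same FixedWindows as the main input might look like, building on the snippet above (this only illustrates the idea; it is not a verified fix for the Dataflow behaviour):

# Hypothetical sketch: apply the same window/trigger used on the main input
# to the side-input collection before wrapping it in AsDict.
windowed_info = (
    info
    | 'InfoWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))

_ = items | 'MapWithSideInput' >> beam.Map(
    _print_item, info_dict=beam.pvalue.AsDict(windowed_info))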

Apache Beam Python - SQL Transform with named PCollection Issue

I am trying to execute the below code, in which I am using a NamedTuple for the PCollection and a SQL transform to do a simple select.
As per the video link (4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA.
Instead of using PCOLLECTION in the SqlTransform query, named PCollections can also be provided, as below.
Code Block
class EmployeeType(typing.NamedTuple):
    name: str
    age: int

beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)

pcol = p | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)]).with_output_types(EmployeeType)

(
    {'a': pcol} | SqlTransform(
        """ SELECT age FROM a """)
    | "Map" >> beam.Map(lambda row: row.age)
    | "Print" >> beam.Map(print)
)

p.run()
However, the above code block errors out with the error:
Caused by: org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'a' not found
The Apache Beam SDK used is 2.35.0. Are there any known limitations in using named PCollections?
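For contrast, the unnamed form mentioned above, where the single input is addressed as PCOLLECTION inside the query, would look roughly like this (a sketch reusing pcol, SqlTransform, and the pipeline from the snippet above):

# Single-input form: the sole input PCollection is referenced as PCOLLECTION.
(
    pcol
    | SqlTransform(""" SELECT age FROM PCOLLECTION """)
    | "Map" >> beam.Map(lambda row: row.age)
    | "Print" >> beam.Map(print)
)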

Apache Beam : How to return multiple outputs

In the below function, I want to return the important_col variable as well.
import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions

class FormatInput(beam.DoFn):
    def process(self, element):
        """Format the input to the desired shape."""
        df = pd.DataFrame([element], columns=element.keys())
        if 'reqd' in df.columns:
            important_col = 'reqd'
        elif 'customer' in df.columns:
            important_col = 'customer'
        elif 'phone' in df.columns:
            important_col = 'phone'
        else:
            raise ValueError('Important columns not specified')
        output = df.to_dict('records')
        return output

with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    clean_csv = (p
        | 'Read input file' >> beam.dataframe.io.read_csv('raw_data.csv'))
    to_process = clean_csv | 'pre-processing' >> beam.ParDo(FormatInput())
In the above pipeline, I want to return the important_col variable from FormatInput.
Once I have that variable, I want to pass it as an argument to the next step in the pipeline.
I also want to dump to_process to a CSV file.
I tried the following, but none of it worked:
I converted to_process with to_dataframe and tried to_csv, but I got an error.
I also tried to dump the PCollection to CSV, but I could not figure out how to do that. I referred to the official Apache Beam documentation, but I don't find anything similar to my use case.
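As a side note on the multi-output part of the question, here is a minimal, hypothetical sketch of Beam's tagged multi-output mechanism, the usual way to emit a second value (here the detected column name) from a DoFn; the column-detection logic is simplified from the pandas version above:

import apache_beam as beam

class FormatInput(beam.DoFn):
    def process(self, element):
        # Simplified column detection standing in for the pandas logic above.
        for candidate in ('reqd', 'customer', 'phone'):
            if candidate in element:
                important_col = candidate
                break
        else:
            raise ValueError('Important columns not specified')
        yield element  # main output: the formatted record
        yield beam.pvalue.TaggedOutput('important_col', important_col)

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create([{'reqd': 1}, {'customer': 2}])
        | beam.ParDo(FormatInput()).with_outputs('important_col', main='records'))
    records = results.records                # PCollection of formatted records
    important_cols = results.important_col   # PCollection of column names
    _ = important_cols | beam.Map(print)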

Google Dataflow Python Apache Beam Windowing delay issue

I have a simple pipeline that receives data from PubSub, prints it, and then every 10 seconds fires a window into a GroupByKey and prints that message again.
However, this window sometimes seems to be delayed. Is this a Google limitation, or is there something wrong with my code:
with beam.Pipeline(options=pipeline_options) as pipe:
    messages = (
        pipe
        | beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'Ex' >> beam.ParDo(ExtractorAndPrinter())
        | beam.WindowInto(window.FixedWindows(10), allowed_lateness=0, accumulation_mode=AccumulationMode.DISCARDING, trigger=AfterProcessingTime(10))
        | 'group' >> beam.GroupByKey()
        | 'PRINTER' >> beam.ParDo(PrinterWorker()))
Edit with the most recent code. I removed the triggers; however, the problem persists:
class ExtractorAndCounter(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [("Message", json.loads(element)["Message"])]

class PrinterWorker(beam.DoFn):
    def __init__(self):
        beam.DoFn.__init__(self)

    def process(self, element, *args, **kwargs):
        import logging
        logging.info(element)
        return [str(element)]

class DefineTimestamp(beam.DoFn):
    def process(self, element, *args, **kwargs):
        from datetime import datetime
        return [(str(datetime.now()), element)]

def run(argv=None, save_main_session=True):
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output_topic',
        required=True,
        help=(
            'Output PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument(
        '--input_topic',
        help=(
            'Input PubSub topic of the form '
            '"projects/<PROJECT>/topics/<TOPIC>".'))
    group.add_argument(
        '--input_subscription',
        help=(
            'Input PubSub subscription of the form '
            '"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    pipeline_options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=pipeline_options) as pipe:
        messages = (
            pipe
            | beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
            | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
            | 'Ex' >> beam.ParDo(ExtractorAndCounter())
            | beam.WindowInto(window.FixedWindows(10))
            | 'group' >> beam.GroupByKey()
            | 'PRINTER' >> beam.ParDo(PrinterWorker())
            | 'encode' >> beam.Map(lambda x: x.encode('utf-8'))
            | beam.io.WriteToPubSub(known_args.output_topic))

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
So what this basically asks the pipeline to do is to group elements into 10-second windows and fire each window 10 seconds after the first element was received for that window (and discard the rest of the data for that window). Was that your intention?
Assuming this was the case, note that triggering depends on the time elements were received by the system as well as the time the first element is received for each window. This is probably why you are seeing some variation in your results.
I think if you need more consistent grouping of your elements, you should use event-time triggers instead of processing-time triggers.
All triggers are best-effort, meaning they will fire sometime after the specified duration, 10 seconds in this case. Generally this happens close to the specified time, but a delay of a few seconds is possible.
Also, triggers are set per key and window. The window is derived from the event time.
It is possible that the first GBK print at 10:30:04 is due to the first element, which arrived at 10:29:52.
The second GBK print at 10:30:07 is due to the first element at 10:29:56.
So it would be good to print the window and event timestamp for each element and then correlate the data.
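A small sketch of that suggestion, assuming a helper DoFn (the name is illustrative) placed before the GroupByKey to log each element's event timestamp and window:

import logging

import apache_beam as beam

class LogWindowAndTimestamp(beam.DoFn):
    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam):
        # Log the element together with its event time and window so firings
        # of the downstream GroupByKey can be correlated with the data.
        logging.info('element=%s event_time=%s window=%s',
                     element, timestamp.to_utc_datetime(), window)
        yield element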

How to unparse a Julia expression

I've been trying to understand Julia from a metaprogramming viewpoint, and I often find myself in the position where I wish to generate the user-facing Julia syntax from an Expr.
Searching through the source code on GitHub, I came across the "deparse" function defined in femtolisp, but it doesn't seem to be exposed at all.
What are the ways I can generate a proper Julia expression using just the internal representation?
P.S. There ought to be some sort of prettifying tool for the generated Julia code; do you know of any such (un/registered) package?
~#~#~#~#~
UPDATE
I've stored all the Meta.show_sexpr output of a Julia source file into a different file.
# This function is identical to create_adder implementation above.
function create_adder(x)
    y -> x + y
end

# You can also name the internal function, if you want
function create_adder(x)
    function adder(y)
        x + y
    end
    adder
end

add_10 = create_adder(10)
add_10(3) # => 13
is converted to
(:line, 473, :none),
(:function, (:call, :create_adder, :x), (:block,
    (:line, 474, :none),
    (:function, (:call, :adder, :y), (:block,
        (:line, 475, :none),
        (:call, :+, :x, :y)
    )),
    (:line, 477, :none),
    :adder
)),
(:line, 480, :none),
(:(=), :add_10, (:call, :create_adder, 10)),
(:line, 481, :none),
(:call, :add_10, 3))
Now, I wish to evaluate these in Julia.
Here's an example of a function that takes an "s_expression" in tuple form, and generates the corresponding Expr object:
"""rxpe_esrap: parse expr in reverse :p """
function rpxe_esrap(S_expr::Tuple)
return Expr( Tuple( isa(i, Tuple) ? rpxe_esrap(i) : i for i in S_expr )... );
end
Demo
Let's generate a nice s_expression tuple to test our function.
(Unfortunately Meta.show_sexpr doesn't generate a string, it just prints to an IOStream, so to get its output as a string that we can parse / eval, we either need to get it from a file, or print straight into something like an IOBuffer)
B = IOBuffer(); # will use to 'capture' the s_expr in
Expr1 = :(1 + 2 * 3); # the expr we want to generate an s_expr for
Meta.show_sexpr(B, Expr1); # push s_expr into buffer B
seek(B, 0); # 'rewind' buffer
SExprStr = read(B, String); # get buffer contents as string
close(B); # please to be closink after finished, da?
SExpr = parse(SExprStr) |> eval; # final s_expr in tuple form
resulting in the following s_expression:
julia> SExpr
(:call, :+, 1, (:call, :*, 2, 3))
Now let's test our function:
julia> rpxe_esrap(SExpr)
:(1 + 2 * 3) # Success!
Notes:
1. This is just a bare-bones function to demonstrate the concept, obviously this would need appropriate sanity checks if to be used on serious projects.
2. This implementation just takes a single "s_expr tuple" argument; your example shows a string that corresponds to a sequence of tuples, but presumably you could tokenise such a string first to obtain the individual tuple arguments, and run the function on each one separately.
3. The usual warnings regarding parse / eval and scope apply. Also, if you wanted to pass the s_expr string itself as the function argument, rather than an "s_expr tuple", then you could modify this function to move the parse / eval step inside the function. This may be a better choice, since you can check what the string contains before evaluating potentially dangerous code, etc etc.
4. I'm not saying there isn't an official function that does this. Though if there is one, I'm not aware of it. This was fun to write though.
