Apache Beam Dataflow on GCS: ReadFromText from multiple paths - google-cloud-dataflow

I was wondering if it is possible to use the ReadFromText PTransform, passing it multiple paths.
My PTransform's expand method is:
def expand(self, pcoll):
    dataset = (
        pcoll
        | "Read Dataset from text file" >> beam.io.ReadFromText(self._source)
    )
The source right now is a string containing a path with a glob pattern:
self._source = "gs://bucket1/folder/*"
From the documentation it says:
Args:
  file_pattern (str): The file path to read from as a local file path or a
    GCS ``gs://`` path. The path can contain glob characters
    (``*``, ``?``, and ``[...]`` sets).
While this works fine with gs://folder/*.gz (I have multiple files under a single path), I can't seem to make it work with several different paths (in my case, paths in different buckets).
I tried the brace syntax that works with the ls command:
gsutil ls gs://{bucket1/folder,bucket2/folder}/*
But if I try the same pattern in the Beam pipeline it doesn't work and gives me:
ERROR: (gcloud.dataflow.flex-template.run) unrecognized arguments:
Is there a way to make it work ?

As you explained in your comment, you can solve it with a for loop over the paths in the Beam pipeline, for example:
bucket_paths = [
    "gs://bucket/folder/file*.txt",
    "gs://bucket2/folder/file*.txt"
]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, bucket_path in enumerate(bucket_paths):
        (p
         | f"Read Dataset from text file {i}" >> beam.io.ReadFromText(bucket_path)
         ....
        )

Related

Using io.tmpfile() with shell command, ran via io.popen, in Lua?

I'm using Lua in Scite on Windows, but hopefully this is a general Lua question.
Let's say I want to write some temporary string content to a temporary file in Lua, which should eventually be read by another program. I tried using io.tmpfile():
mytmpfile = assert( io.tmpfile() )
mytmpfile:write( MYTMPTEXT )
mytmpfile:seek("set", 0) -- back to start
print("mytmpfile" .. mytmpfile .. "<<<")
mytmpfile:close()
I like io.tmpfile() because of what is noted at https://www.lua.org/pil/21.3.html:
The tmpfile function returns a handle for a temporary file, open in read/write mode. That file is automatically removed (deleted) when your program ends.
However, when I try to print mytmpfile, I get:
C:\Users\ME/sciteLuaFunctions.lua:956: attempt to concatenate a FILE* value (global 'mytmpfile')
>Lua: error occurred while processing command
I found the explanation for that here, in Re: path for io.tmpfile()?:
how do I get the path used to generate the temp file created by io.tmpfile()
You can't. The whole point of tmpfile is to give you a file handle without
giving you the file name to avoid race conditions.
And indeed, on some OSes, the file has no name.
So it will not be possible for me to use the filename of the tmpfile in a command line that should be run by the OS, as in:
f = io.popen("python myprog.py " .. mytmpfile)
So my questions are:
Would it be somehow possible to specify this tmpfile file handle as the input argument for the externally run program/script, say in io.popen - instead of using the (non-existing) tmpfile filename?
If above is not possible, what is the next best option (in terms of not having to maintain it, i.e. not having to remember to delete the file) for opening a temporary file in Lua?
You can get a temp filename with os.tmpname.
local n = os.tmpname()
local f = io.open(n, 'w+b')
f:write(....)
f:close()
os.remove(n)
If your purpose is sending some data to a Python script, you can also use 'w' mode in io.popen:
--lua
local f = io.popen(prog, 'w')
f:write(....)
#python
import sys
data = sys.stdin.readline()
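For completeness, a minimal sketch of the Python side of that approach (this would be the body of the script launched via io.popen, e.g. the myprog.py from the question), assuming the data written by the Lua f:write call may span multiple lines:
import sys

# Read everything the Lua side wrote to our stdin via io.popen(prog, 'w')
data = sys.stdin.read()

# Placeholder processing step: just report how much data arrived
print("received %d bytes" % len(data), file=sys.stderr)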

Bazel - depend on generated outputs

I have a yaml file in a Bazel monorepo that has constants I'd like to use in several languages. This is kind of like how protobuffs are created and used.
How can I parse this yaml file at build time and depend on the outputs?
For instance:
item1: "hello"
item2: "world"
nested:
  nested1: "I'm nested"
  nested2: "I'm also nested"
I then need to parse this yaml file so it can be used in many different languages (e.g., Rust, TypeScript, Python, etc.). For instance, here's the desired output for TypeScript:
export default {
  item1: "hello",
  item2: "world",
  nested: {
    nested1: "I'm nested",
    nested2: "I'm also nested",
  }
}
Notice, I don't want TypeScript code that reads the yaml file and converts it into an object. That conversion should be done in the build process.
For the actual conversion, I'm thinking of writing that in Python, but it doesn't need to be. This would then mean the Python also needs to run at build time.
P.S. I care mostly about the functionality, so I'm flexible about exactly how it's done. I'm even fine using another file format aside from yaml.
Thanks to help from @oakad, I was able to figure it out. Essentially, you can use genrule to create outputs.
So, assuming you have some target set up to generate the output, e.g. a Python binary named parse_config, you can just do this:
genrule(
    name = "generated_output",
    srcs = [],
    outs = ["output.txt"],
    cmd = "./$(execpath :parse_config) > $@",
    exec_tools = [":parse_config"],
    visibility = ["//visibility:public"],
)
The generated file is output.txt. And you can access it via //lib/config:generated_output.
Note that the cmd essentially pipes stdout into the output file; in Python, that means anything printed will appear in the generated file.
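For reference, a minimal sketch of what such a parse_config tool might look like, assuming PyYAML is available to the py_binary and the YAML file is named constants.yaml (both names are hypothetical); it prints a TypeScript module to stdout so the genrule can capture it:
import json

import yaml  # assumes PyYAML is available to the py_binary


def main():
    # Hypothetical input file; in practice it would be wired in via srcs or args.
    with open("constants.yaml") as f:
        constants = yaml.safe_load(f)
    # JSON object syntax is also valid TypeScript object-literal syntax,
    # so dumping the parsed YAML as JSON yields the desired module body.
    print("export default " + json.dumps(constants, indent=2) + ";")


if __name__ == "__main__":
    main()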

Beam/Dataflow read Parquet files and add file name/path to each record

I'm using the Apache Beam Python SDK and I'm trying to read data from Parquet files using apache_beam.io.parquetio, but I also want to add the filename (or path) to the data since it contains information as well. I went over the suggested pattern here and read that parquetio is similar to fileio, but it doesn't seem like it implements the functionality that allows going over files and adding the file name to the records.
Anyone figured a good way to implement this?
Thanks!
If the number of files is not tremendous, you can get all the files before you read them through the IO.
import glob
import apache_beam as beam

filelist = glob.glob('/tmp/*.parquet')

p = beam.Pipeline()

class PairWithFile(beam.DoFn):
    def __init__(self, filename):
        self._filename = filename

    def process(self, e):
        yield (self._filename, e)

file_with_records = [
    (p
     | 'Read %s' % (file) >> beam.io.ReadFromParquet(file)
     | 'Pair %s' % (file) >> beam.ParDo(PairWithFile(file)))
    for file in filelist
] | beam.Flatten()
Then each element of your PCollection is a (filename, record) pair.
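As a hedged usage sketch of what you can do with those pairs downstream (counting records per file is just a placeholder step):
# Count how many records came from each file.
counts = (file_with_records
          | 'Keep filename only' >> beam.Map(lambda pair: pair[0])
          | 'Count per file' >> beam.combiners.Count.PerElement())
counts | 'Print counts' >> beam.Map(print)
p.run()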

How to write Bazel rules that work with external repositories?

The Bazel Starlark API does strange things with files in external repositories. I have the following Starlark snippet:
print(ctx.genfiles_dir)
print(ctx.genfiles_dir.path)
print(output_filename)
ret = ctx.new_file(ctx.genfiles_dir, output_filename)
print(ret.path)
It is creating the following output:
DEBUG: build_defs.bzl:292:5: <derived root>
DEBUG: build_defs.bzl:293:5: bazel-out/k8-fastbuild/genfiles
DEBUG: build_defs.bzl:294:5: google/protobuf/descriptor.upb.c
DEBUG: build_defs.bzl:296:5: bazel-out/k8-fastbuild/genfiles/external/com_google_protobuf/google/protobuf/descriptor.upb.c
That extra external/com_google_protobuf comes seemingly out of nowhere, and it makes my rule fail:
I tell protoc to generate into ctx.genfiles_dir.path (which is bazel-out/k8-fastbuild/genfiles).
So protoc generates bazel-out/k8-fastbuild/genfiles/google/protobuf/descriptor.upb.c
Bazel fails because I didn't generate bazel-out/k8-fastbuild/genfiles/external/com_google_protobuf/google/protobuf/descriptor.upb.c
Likewise, when I try to call file.short_path on a source file from an external repository, I get a result like ../com_google_protobuf/google/protobuf/descriptor.proto. This seems quite unhelpful, so I just wrote some manual code to strip off the leading ../com_google_protobuf/.
Am I missing something? How can I write this rule in a way that doesn't feel like I'm fighting Bazel the whole time?
Am I missing something?
The basic problem, as you already realized, is that you have two path "namespaces": the one that protoc sees (i.e. import paths) and the one that Bazel sees (i.e. the path you pass to declare_file()).
Two things to note:
1) All paths declared with declare_file() get the path <bin dir>/<package path incl. workspace>/<path you passed to declare_file()>.
2) All actions are executed from <bin dir> (unless output_to_genfiles=True, in which case this switches to <gen dir>, as in your example).
Trying to solve the exact same problem you encountered, I resorted to stripping the known path from the output_file's path to determine which directory to pass to protoc's --cpp_out:
# This code is run from the context of the external protobuf dependency
proto_path = "google/a/b.proto"
output_file = ctx.actions.declare_file(proto_path)
# output_file.path would be `<gen_dir>/external/protobuf/google/a/b.proto`

# Strip the known proto_path from output_file.path
protoc_prefix = output_file.path[:-len(proto_path)]
print(protoc_prefix)  # Prints: <gen_dir>/external/protobuf

command = "{protoc} {proto_paths} {cpp_out} {plugin} {plugin_options} {proto_file}".format(
    ...
    cpp_out = "--cpp_out=" + protoc_prefix,
    ...
)
Alternatives
You may also be able to construct the same path with ctx.bin_dir, ctx.label.workspace_name, ctx.label.package, and ctx.label.name.
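A hedged sketch of that alternative, assuming the rule still runs in the context of the external protobuf repository from the example above (workspace_name is empty for the main repository, hence the conditional):
# Rebuild the prefix from the rule context instead of string-stripping.
workspace_part = "external/" + ctx.label.workspace_name if ctx.label.workspace_name else ""
parts = [ctx.bin_dir.path, workspace_part, ctx.label.package]
protoc_prefix = "/".join([part for part in parts if part])
# For the example above this yields: <bin dir>/external/protobuf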
Misc.
proto_library recently gained an attribute strip_import_prefix. When used, the above is not correct, as all dependent files are symlinked into a new directory from which they have the relative paths declared with strip_import_prefix.
The path format is:
<bin dir>/<repo>/<package>/_virtual_base/<label name>/<path `import`ed in .proto files>
i.e.
<bin dir>/external/protobuf/_virtual_base/b_proto/google/a/b.proto
This assumes you are building an external repo called protobuf, which contains a BUILD file at its root with a target named b_proto, which in turn relies on a proto_library wrapping google/a/b.proto AND uses the strip_import_prefix attribute.

Accessing information (Metadata) in the file name & type in a Beam pipeline

My filename contains information that I need in my pipeline; for example, the identifier for my data points is part of the filename and not a field in the data. E.g., every wind turbine generates a file turbine-loc-001-007.csv, and I need the loc data within the pipeline.
Java (sdk 2.9.0):
Beam's TextIO readers do not give access to the filename itself, so for these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile; the ReadableFile class contains the file name as metadata, which can be used along with the contents of the file.
FileIO's ReadableFile does have a convenience method readFullyAsUTF8String() which will read the entire file into a String object, but this reads the whole file into memory first. If memory is a concern, you can work directly with the file using utility classes like FileSystems.
From the FileIO documentation:
PCollection<KV<String, String>> filesAndContents = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    // withCompression can be omitted - by default compression is detected from the filename.
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(MapElements
        // uses imports from TypeDescriptors
        .into(kvs(strings(), strings()))
        .via((ReadableFile f) -> KV.of(
            f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
Python (sdk 2.9.0):
For 2.9.0, in Python, you will need to collect the list of URIs from outside of the Dataflow pipeline and feed them in as a parameter to the pipeline, for example by making use of FileSystems to read in the list of files via a glob pattern and then passing that list to a PCollection for processing.
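A minimal sketch of that approach, using FileSystems to expand the glob before the pipeline is constructed (the gs:// pattern is a placeholder):
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

# Expand the glob outside the pipeline; each FileMetadata carries the full path.
match_results = FileSystems.match(['gs://bucket/turbines/*.csv'])  # placeholder pattern
paths = [metadata.path for metadata in match_results[0].metadata_list]

with beam.Pipeline() as p:
    _ = (p
         | 'File paths' >> beam.Create(paths)
         | 'Print' >> beam.Map(print))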
Once fileio is available (see PR https://github.com/apache/beam/pull/7791/), the following code would also be an option for Python:
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    readable_files = (p
                      | fileio.MatchFiles('hdfs://path/to/*.txt')
                      | fileio.ReadMatches()
                      | beam.Reshuffle())
    files_and_contents = (readable_files
                          | beam.Map(lambda x: (x.metadata.path,
                                                x.read_utf8())))
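Building on that, a hedged sketch of pulling the location out of the matched path, assuming the turbine-loc-001-007.csv naming from the question (which part of the name counts as the loc is an assumption); the last line belongs inside the with block above:
import os
import re

def with_location(path_and_contents):
    # Extract the 'loc' portion of a name like turbine-loc-001-007.csv (assumed convention).
    path, contents = path_and_contents
    match = re.search(r'loc-(\d+-\d+)', os.path.basename(path))
    loc = match.group(1) if match else None
    return (loc, contents)

located = files_and_contents | 'Attach location' >> beam.Map(with_location)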
