how to save logs from c++ binary in beam python? - google-cloud-dataflow

I have a c++ binary that uses glog. I run that binary within beam python on cloud dataflow. I want to save c++ binary's stdout, stderr and any log file for later inspection. What's the best way to do that?
This guide gives an example for beam java. I tried to do something like that.
def sample(target, output_dir):
import os
import subprocess
import tensorflow as tf
log_path = target + ".log"
with tf.io.gfile.GFile(log_path, mode="w") as log_file:
subprocess.run(["/app/.../sample.runfiles/.../sample",
"--target", target,
"--logtostderr"],
stdout=log_file,
stderr=subprocess.STDOUT)
I got the following error.
...
File "apache_beam/runners/common.py", line 624, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/pip_parsed_deps_apache_beam/site-packages/apache_beam/transforms/core.py", line 1877, in <lambda>
wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
File "/home/swang/.cache/bazel/_bazel_swang/09eb83215bfa3a8425e4385b45dbf00d/execroot/__main__/bazel-out/k8-opt/bin/garage/sample_launch.runfiles/__main__/garage/sample_launch.py", line 17, in sample
File "/usr/local/lib/python3.8/subprocess.py", line 493, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/local/lib/python3.8/subprocess.py", line 808, in __init__
errread, errwrite) = self._get_handles(stdin, stdout, stderr)
File "/usr/local/lib/python3.8/subprocess.py", line 1489, in _get_handles
c2pwrite = stdout.fileno()
AttributeError: 'GFile' object has no attribute 'fileno' [while running 'Map(functools.partial(<function sample at 0x7f45e8aa5a60>, output_dir='gs://swang/sample/20220815_test'))-ptransform-28']
google.cloud.storage API also does not seem to expose fileno().
import google.cloud.storage
google.cloud.storage.blob.Blob("test", google.cloud.storage.bucket.Bucket(google.cloud.storage.client.Client(), "swang"))
<Blob: swang, test, None>
blob = google.cloud.storage.blob.Blob("test", google.cloud.storage.bucket.Bucket(google.cloud.storage.client.Client(), "swang"))
reader = google.cloud.storage.fileio.BlobReader(blob)
reader.fileno()
Traceback (most recent call last):
File "/usr/lib/python3.8/code.py", line 90, in runcode
exec(code, self.locals)
I also considered writing the logs in c++ binary rather than passing them to python. As glog is implemented on top of c++ FILE rather than iostream, I have to reset stdout etc to gcs at FILE level like this rather than reset cout to gcs in iostream level like this. But gcs c++ API is only implemented on top of iostream, so this approach does not work. Using dup2 like this is another approach but seem too complicated and expensive.

You can use the Filesystems module of Beam to open a writable channel (file handle where you have write permissions) in any of the filesystems supported by Beam. If you are running in Dataflow, this will automatically use the credentials of the Dataflow job to access Google Cloud Storage: https://beam.apache.org/releases/pydoc/current/apache_beam.io.filesystems.html?apache_beam.io.filesystems.FileSystems.create
If you are writing to GCS, you need to make sure that you don't overwrite an object, that would produce an error.

Related

Gensim: error loading pretrained vectors No such file or directory: 'word2vec.kv.vectors.npy'

I am trying to load a Pretrained word2vec embeddings that is in gensim keyedvector 'word2vec.kv'
pretrained = KeyedVectors.load(args.pretrained mmap = 'r')
where arg.pretrained is "/ptembs/word2vec.kv"
and iam getting this error
File "main.py", line 60, in main
pretrained = KeyedVectors.load(args.pretrained, mmap = 'r')
File "C:\Users\ASUS\anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 1553, in load
model = super(WordEmbeddingsKeyedVectors, cls).load(fname_or_handle, **kwargs)
File "C:\Users\ASUS\anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 228, in load
return super(BaseKeyedVectors, cls).load(fname_or_handle, **kwargs)
File "C:\Users\ASUS\anaconda3\lib\site-packages\gensim\utils.py", line 436, in load obj._load_specials(fname, mmap, compress, subname)
File "C:\Users\ASUS\anaconda3\lib\site-packages\gensim\utils.py", line 478, in _load_specials
val = np.load(subname(fname, attrib), mmap_mode=mmap)
File "C:\Users\ASUS\anaconda3\lib\site-packages\numpy\lib\npyio.py", line 417, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'ptembs/word2vec.kv.vectors.npy'
i dont understand why it need word2vec.kv.vectors.npy file ? and i dont have it.
Any idea how to solve this problem?
gensim version 3.8.3
tried it on 4.1.2 also same error.
Where did you get the file 'word2vec.kv'?
If loading that file triggers an error mentioning a 2nd file by name, then that 2nd file should've been created alongside 'word2vec.kv' when it was 1st saved using a .save() operation.
That other file needs to be kept alongside 'word2vec.kv' in order for 'word2vec.kv' to be .load()ed again in the future.

client.upload_file() for nested modules

I have a project structured as follows;
- topmodule/
- childmodule1/
- my_func1.py
- childmodule2/
- my_func2.py
- common.py
- __init__.py
From my Jupyter notebook on an edge-node of a Dask cluster, I am doing the following
from topmodule.childmodule1.my_func1 import MyFuncClass1
from topmodule.childmodule2.my_func2 import MyFuncClass2
Then I am creating a distributed client & sending work as follows;
client = Client(YarnCluster())
client.submit(MyFuncClass1.execute)
This errors out, because the workers do not have the files of topmodule.
"/mnt1/yarn/usercache/hadoop/appcache/application_1572459480364_0007/container_1572459480364_0007_01_000003/environment/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads return pickle.loads(x) ModuleNotFoundError: No module named 'topmodule'
So what I tried to do is - I tried uploading every single file under "topmodule". The files directly under the "topmodule" seems to get uploaded, but the nested ones do not. Below is what I am talking about;
Code:
from pathlib import Path
for filename in Path('topmodule').rglob('*.py'):
print(filename)
client.upload_file(filename)
Console output:
topmodule/common.py # processes fine
topmodule/__init__.py # processes fine
topmodule/childmodule1/my_func1.py # throws error
Traceback:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-13-dbf487d43120> in <module>
3 for filename in Path('nodes').rglob('*.py'):
4 print(filename)
----> 5 client.upload_file(filename)
~/miniconda/lib/python3.7/site-packages/distributed/client.py in upload_file(self, filename, **kwargs)
2929 )
2930 if isinstance(result, Exception):
-> 2931 raise result
2932 else:
2933 return result
ModuleNotFoundError: No module named 'topmodule'
My question is - how can I upload an entire module and its files to workers? Our module is big so I want to avoid restructuring it just for this issue, unless the way we're structuring the module is fundamentally flawed.
Or - is there a better way to have all dask workers understand the modules perhaps from a git repository?
When you call upload_file on every file individually you lose the directory structure of your module.
If you want to upload a more comprehensive module you can package up your module into a zip or egg file and upload that.
https://docs.dask.org/en/latest/futures.html#distributed.Client.upload_file

Drake Visualizer : Unknown file extension in readPolyData when using .dae file

I am trying to add a custom mesh (a torus) .dae file for collision and visual to my .sdf model.
When I run my program, drake visualizer gives the following error
File "/opt/drake/lib/python2.7/site-packages/director/lcmUtils.py", line 119, in handleMessage
callback(msg)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 352, in onViewerLoadRobot
self.addLinksFromLCM(msg)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 376, in addLinksFromLCM
self.addLink(Link(link), link.robot_num, link.name)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 299, in __init__
self.geometry.extend(Geometry.createGeometry(link.name + ' geometry data', g))
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 272, in createGeometry
polyDataList, visInfo = Geometry.createPolyDataFromFiles(geom)
File "/opt/drake/lib/python2.7/site-packages/director/drakevisualizer.py", line 231, in createPolyDataFromFiles
polyDataList = [ioUtils.readPolyData(filename)]
File "/opt/drake/lib/python2.7/site-packages/director/ioUtils.py", line 25, in readPolyData
raise Exception('Unknown file extension in readPolyData: %s' % filename)
Exception: Unknown file extension in readPolyData: /my_path/model.dae
Since prius.sdf also uses prius.dae, I assume this is possible. What am I doing wrong?
tl;dr drake_visualizer doesn't load dae files. If you put a similarly named .obj file in the same folder it will load that (and you can leave your sdf file still referencing the dae file).
Long answer:
drake_visualizer has a very specific, arbitrary protocol for loading files. Given an arbitrary file name (e.g., my_geometry.dae) it will
Strip off the extension.
Try the following files (in order), loading the first one it finds:
my_geometry.vtm
my_geometry.vtp
my_geometry.obj
original extension.
It can load: vtm, vtp, ply, obj, and stl files.
The worst thing is if you have both a vtp and an obj file in the same folder with the same name and you specify the obj, it'll still favor the vtp file.

\[Errno -101\] NetCDF: HDF error when opening netcdf file

I have this error when opening my netcdf file.
The code was working before.
How do I fix this ?
Traceback (most recent call last):
File "", line 1, in
...
File "file.py", line 71, in gather_vgt
return xr.open_dataset(filename)
File "/.../lib/python3.6/site-packages/xarray/backends/api.py", line
286, in open_dataset
autoclose=autoclose)
File "/.../lib/python3.6/site-packages/xarray/backends/netCDF4_.py",
line 275, in open
ds = opener()
File "/.../lib/python3.6/site-packages/xarray/backends/netCDF4_.py",
line 199, in _open_netcdf4_group
ds = nc4.Dataset(filename, mode=mode, **kwargs)
File "netCDF4/_netCDF4.pyx", line 2015, in
netCDF4._netCDF4.Dataset.init
File "netCDF4/_netCDF4.pyx", line 1636, in
netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'file.nc'
When I try to open the same netcdf file with h5py I get this error :
OSError: Unable to open file (file locking disabled on this file
system (use HDF5_USE_FILE_LOCKING environment variable to override),
errno = 38, error message = '...')
You must be in this situation :
your HDF5 library has been updated (1.10.1) (netcdf uses HDF5 under the hood)
your file system does not support the file locking that the HDF5 library uses.
In order to read your hdf5 or netcdf files, you need set this environment variable :
HDF5_USE_FILE_LOCKING=FALSE
For references, this was introduced in HDF5 version 1.10.1,
Added a mechanism for disabling the SWMR file locking scheme.
The file locking calls used in HDF5 1.10.0 (including patch1)
will fail when the underlying file system does not support file
locking or where locks have been disabled. To disable all file
locking operations, an environment variable named
HDF5_USE_FILE_LOCKING can be set to the five-character string
'FALSE'. This does not fundamentally change HDF5 library
operation (aside from initial file open/create, SWMR is lock-free),
but users will have to be more careful about opening files
to avoid problematic access patterns (i.e.: multiple writers) >that the file locking was designed to prevent.
Additionally, the error message that is emitted when file lock
operations set errno to ENOSYS (typical when file locking has been
disabled) has been updated to describe the problem and potential
resolution better.
(DER, 2016/10/26, HDFFV-9918)
In my case, the solution as suggested by #Florian did not work. I found another solution, which suggested that the order in which h5py and netCDF4 are imported matters (see here).
And, indeed, the following works for me:
from netCDF4 import Dataset
import h5py
Switching the order results in the error as described by OP.

Which compression types support chunking in dask?

When processing a large single file, it can be broken up as so:
import dask.bag as db
my_file = db.read_text('filename', blocksize=int(1e7))
This works great, but the files I'm working with have a high level of redundancy and so we keep them compressed. Passing in compressed gzip files gives an error that seeking in gzip isn't supported and so it can't be read in blocks.
The documentation here http://dask.pydata.org/en/latest/bytes.html#compression suggests that some formats support random access.
The relevant internal code I think is here:
https://github.com/dask/dask/blob/master/dask/bytes/compression.py#L47
It looks like lzma might support it, but it's been commented out.
Adding lzma into the seekable_files dict like in the commented out code:
from dask.bytes.compression import seekable_files
import lzmaffi
seekable_files['xz'] = lzmaffi.LZMAFile
data = db.read_text('myfile.jsonl.lzma', blocksize=int(1e7), compression='xz')
Throws the following error:
Traceback (most recent call last):
File "example.py", line 8, in <module>
data = bag.read_text('myfile.jsonl.lzma', blocksize=int(1e7), compression='xz')
File "condadir/lib/python3.5/site-packages/dask/bag/text.py", line 80, in read_text
**(storage_options or {}))
File "condadir/lib/python3.5/site-packages/dask/bytes/core.py", line 162, in read_bytes
size = fs.logical_size(path, compression)
File "condadir/lib/python3.5/site-packages/dask/bytes/core.py", line 500, in logical_size
g.seek(0, 2)
io.UnsupportedOperation: seek
I assume that the functions at the bottom of the file (get_xz_blocks) for example can be used for this, but don't seem to be in use anywhere in the dask project.
Are there compression libraries that do support this seeking and chunking? If so, how can they be added?
Yes, you are right that the xz format can be useful to you. The confusion is, that the file may be block-formatted, but the standard implementation lzmaffi.LZMAFile (or lzma) does not make use of this blocking. Note that block-formatting is only optional for zx files, e.g., by using --block-size=size with xz-utils.
The function compression.get_xz_blocks will give you the set of blocks in a file by reading the header only, rather than the whole file, and you could use this in combination with delayed, essentially repeating some of the logic in read_text. We have not put in the time to make this seamless; the same pattern could be used to write blocked xz files too.

Resources