I'm trying to set up a simple KubeFlow pipeline, and I'm having trouble packaging up dependencies in a way that works for KubeFlow.
The code simply downloads a config file and parses it, then passes back the parsed configuration.
However, in order to parse the config file, it needs to have access to another internal python package.
I have a .tar.gz archive of the package hosted in a bucket in the same project, and I added the URL of the package as a dependency, but I get an error message saying tarfile.ReadError: not a gzip file.
I know the file is good, so it must be some intermediate issue with hosting it in a bucket or with the way kubeflow installs dependencies.
Here is a minimal example:
import os

from kfp import compiler
from kfp import dsl
from kfp.components import func_to_container_op
from google.protobuf import text_format
from google.cloud import storage
import training_reader
def get_training_config(working_bucket: str,
                        working_directory: str,
                        config_file: str) -> training_reader.TrainEvalPipelineConfig:
    # download_file is a small helper (omitted here) that fetches the blob from GCS
    download_file(working_bucket, os.path.join(working_directory, config_file), "ssd.config")
    pipeline_config = training_reader.TrainEvalPipelineConfig()
    with open("ssd.config", 'r') as f:
        text_format.Merge(f.read(), pipeline_config)
    return pipeline_config
config_op_packages = ["https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz",
                      "google-cloud-storage",
                      "protobuf"]

training_config_op = func_to_container_op(get_training_config,
                                          base_image="tensorflow/tensorflow:1.15.2-py3",
                                          packages_to_install=config_op_packages)
def output_config(config: training_reader.TrainEvalPipelineConfig) -> None:
    print(config)
output_config_op = func_to_container_op(output_config)
@dsl.pipeline(
    name='Post Training Processing',
    description='Building the post-processing pipeline'
)
def ssd_postprocessing_pipeline(
        working_bucket: str,
        working_directory: str,
        config_file: str):
    config = training_config_op(working_bucket, working_directory, config_file)
    output_config_op(config.output)
pipeline_name = ssd_postprocessing_pipeline.__name__ + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
The https://storage.cloud.google.com/my_bucket/packages/training-reader-0.1.tar.gz URL requires authentication. Try to download it in Incognito mode and you'll see the login page instead of the file.
Changing the URL to https://storage.googleapis.com/my_bucket/packages/training-reader-0.1.tar.gz works for public objects, but your object is not public.
The only thing you can do (if you cannot make the package public) is to use the google.cloud.storage library or the gsutil program to download the file from the bucket and then install it manually using subprocess.run([sys.executable, '-m', 'pip', 'install', ...]).
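A rough sketch of that workaround (the helper below is hypothetical; the bucket and object names are taken from the question, and google-cloud-storage would still need to be in packages_to_install):

import subprocess
import sys

from google.cloud import storage


def install_private_package(bucket_name, blob_name, local_path="/tmp/training-reader.tar.gz"):
    # Download the sdist from the private bucket using the component's default credentials
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    # Install it into the current interpreter's environment
    subprocess.run([sys.executable, "-m", "pip", "install", local_path], check=True)


install_private_package("my_bucket", "packages/training-reader-0.1.tar.gz")

Inside a lightweight component this would need to run before training_reader is imported.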
Where are you downloading the data from?
What's the purpose of
pipeline_config = training_reader.TrainEvalPipelineConfig()
with open("ssd.config", 'r') as f:
    text_format.Merge(f.read(), pipeline_config)
return pipeline_config
Why not just do the following:
def get_training_config(
    working_bucket: str,
    working_directory: str,
    config_file: str,
    output_config_path: OutputFile('TrainEvalPipelineConfig'),
):
    download_file(working_bucket, os.path.join(working_directory, config_file), output_config_path)
Regarding "the way kubeflow installs dependencies":
Export your component to a loadable component.yaml and you'll see how KFP Lightweight components install dependencies:
training_config_op = func_to_container_op(
    get_training_config,
    base_image="tensorflow/tensorflow:1.15.2-py3",
    packages_to_install=config_op_packages,
    output_component_file='component.yaml',
)
P.S. Some small pieces of info:
@dsl.pipeline(
This decorator is not required unless you want to use the dsl-compile command-line program.
pipeline_name = ssd_postprocessing_pipeline.__name__ + '.zip'
compiler.Compiler().compile(ssd_postprocessing_pipeline, pipeline_name)
Did you know that you can just call kfp.Client(host=...).create_run_from_pipeline_func(ssd_postprocessing_pipeline, arguments={}) to run the pipeline right away?
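For example, a minimal sketch (the host URL and argument values are placeholders):

import kfp

client = kfp.Client(host='https://<your-kfp-endpoint>')
client.create_run_from_pipeline_func(
    ssd_postprocessing_pipeline,
    arguments={
        'working_bucket': 'my_bucket',
        'working_directory': 'configs',
        'config_file': 'ssd.config',
    },
)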
Related
I have created a Python-based pipeline that contains a ParDo that leverages the Python base64 package. When I run the pipeline locally with DirectRunner, all is well. When I run the same pipeline with Dataflow on Google Cloud, it fails with an error of:
NameError: name 'base64' is not defined [while running 'ParDo(WriteToSeparateFiles)-ptransform-47']
It seems to be missing the base64 package, but I believe that to be part of the Python standard library and always present.
Here is my complete pipeline code:
import base64 as base64
import argparse

import apache_beam as beam
import apache_beam.io.fileio as fileio
import apache_beam.io.filesystems as filesystems
from apache_beam.options.pipeline_options import PipelineOptions

class WriteToSeparateFiles(beam.DoFn):
    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        writer = filesystems.FileSystems.create(self.outdir + str(element) + '.txt')
        message = "This is the content of my file"
        message_bytes = message.encode('ascii')
        base64_bytes = base64.b64encode(message_bytes)  ### Error here
        writer.write(base64_bytes)
        writer.close()

argv = None
parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)

with beam.Pipeline(options=pipeline_options) as pipeline:
    outputs = (
        pipeline
        | beam.Create(range(10))  # Change range here to be millions if needed
        | beam.ParDo(WriteToSeparateFiles('gs://kolban-edi/'))
    )
    outputs | beam.Map(print)
    #print(outputs)
Solution
The solution has been found and is documented here:
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
After much study, I found a section in the Google Dataflow Frequently Asked Questions titled "How do I handle NameErrors?".
Reading that documentation, the suggestion was to add an extra runner parameter called --save_main_session. As soon as I added that to the execution, the problem was resolved.
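For reference, a minimal sketch of enabling it programmatically (it can also simply be passed as --save_main_session on the command line):

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Pickle the main session so module-level imports (like base64) are available on the workers
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = True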
When I tried to run a batch transform job in AWS SageMaker, I got the error below:
ImportError: No module named cv2
Please note that I am able to run import cv2 in the notebook instance; the Jupyter notebook runs it without problems. But it fails inside the endpoint/batch transform at inference time. I have tried the method below using env, as described in AWS Sagemaker - Install External Library and Make it Persist, but it still does not work.
Does anyone have a good way to solve this? Thanks!
My code is:
from sagemaker.mxnet import MXNetModel

env = {
    'SAGEMAKER_REQUIREMENTS': 'requirements.txt',  # path relative to `source_dir` below.
}

image_embed_model = MXNetModel(model_data=model_data,
                               entry_point='sagemaker_entrypoint.py',
                               role=role,
                               source_dir='src',
                               env=env,
                               py_version='py3',
                               framework_version='1.6.0')

transformer = image_embed_model.transformer(instance_count=1,  # Please pay attention here!!!
                                            instance_type='ml.m4.xlarge',
                                            output_path=output_path,
                                            assemble_with='Line',
                                            accept='text/csv')

transformer.transform(batch_input,
                      content_type='text/csv',
                      split_type='Line',
                      input_filter='$[0:]',
                      join_source='Input',
                      wait=False)
You can follow https://github.com/aws/sagemaker-python-sdk/blob/master/doc/using_mxnet.rst#use-third-party-libraries to import third-party libraries into your batch transform instances. Make sure the requirements.txt file is saved under the right directory before packaging the model data.
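Concretely, with the source_dir='src' used above, the layout before packaging would look something like this (file names are the ones from the question; the requirements.txt line is an example, it should list opencv-python to provide cv2):

src/
    sagemaker_entrypoint.py
    requirements.txt    # e.g. contains the line: opencv-python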
I am trying to get my Faster R-CNN model into a Container Instance on ACI. For that I need my Docker image to have Python version 3.5.*. I specify that in my conda yaml file, but every time I spin up an instance and docker run -it *** /bin/bash into it, I see that it only has Python 3.6.7.
Screenshot: https://user-images.githubusercontent.com/21140767/50680590-82b20b80-1008-11e9-9bfe-4a0e71084ce0.png
How can I get my Docker image to have Python version 3.5.*? I already tried conda-installing Python version 3.5.2, but that didn't work: the image still ended up with 3.6.7 instead of 3.5.2. (dfimage lets you see the Dockerfile from which an image was created: https://hub.docker.com/r/chenzj/dfimage/.)
Screenshot: https://user-images.githubusercontent.com/21140767/50680673-d6245980-1008-11e9-9d48-71a7c150d925.png
My yaml:
name: project_environment
dependencies:
  - python=3.5.2
  - pip:
    - matplotlib
    - opencv-python==3.4.3.18
    - azureml-core==1.0.6
    - numpy
    - cntk
    - cython
channels:
  - anaconda
Notebook cell:
from azureml.core.conda_dependencies import CondaDependencies

svmandss = CondaDependencies.create(python_version="3.5.2", pip_packages=[
    "matplotlib",
    "opencv-python==3.4.3.18",
    "azureml-core",
    "numpy",
    "cntk",
    "cython"])
svmandss.add_channel('anaconda')

with open("fasterrcnn.yml", "w") as f:
    f.write(svmandss.serialize_to_string())
Another notebook cell with ContainerImage specifications.
image_config = ContainerImage.image_configuration(execution_script="score_fasterrcnn.py",
                                                  runtime="python",
                                                  conda_file="./fasterrcnn.yml",
                                                  dependencies=listdir("utils"),
                                                  docker_file="./Dockerfile")

service = Webservice.deploy_from_model(workspace=ws,
                                       name='faster-rcnn',
                                       deployment_config=aciconfig,
                                       models=[Model(workspace=ws, name='Faster-RCNN')],
                                       image_config=image_config)

service.wait_for_deployment(show_output=True)
Note
For better readability see my GitHub issue: (https://github.com/Azure/MachineLearningNotebooks/issues/163).
Currently, the version of Python is fixed to what's in Azure ML's base image, when deploying the web service. We're investigating removing this limitation in future.
Since this is one of the top Google results when searching for "azureml python version", I'm posting the answer here. The documentation is not very clear when it comes to this, but the following will work:
from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
ws = Workspace.from_config()
# This is the important part
conda_dep = CondaDependencies(conda_dependencies_file_path="pipeline/environment.yml")
aml_run_config = RunConfiguration(conda_dependencies=conda_dep)
# Define compute target - must be preconfigured in the workspace
compute_target = ws.compute_targets['my-azureml-target']
aml_run_config.target = compute_target
from azureml.pipeline.steps import PythonScriptStep
script_source_dir = "./pipeline"
step_1_script = "test.py"
step_1 = PythonScriptStep(
    script_name=step_1_script,
    source_directory=script_source_dir,
    compute_target=compute_target,
    runconfig=aml_run_config,
    allow_reuse=True
)
from azureml.pipeline.core import Pipeline
# Build the pipeline
pipeline1 = Pipeline(workspace=ws, steps=[step_1])
from azureml.core import Experiment
# Submit the pipeline to be run
pipeline_run1 = Experiment(ws, 'Test-pipeline').submit(pipeline1)
pipeline_run1.wait_for_completion(show_output=True)
This assumes the following directory structure:
root/
    create_pipeline.py
    pipeline/
        test.py
        environment.yml
where create_pipeline.py is the file above, test.py is the script you would like to run and environment.yml is the conda environment file - including the python version.
I was able to change the Python version by registering the environment in Azure ML Workspace:
from azureml.core.environment import Environment, Workspace
environment = Environment.from_conda_specification(name='myenv', file_path='environment.yml')
environment.python.user_managed_dependencies = False
workspace = Workspace.from_config()
environment = environment.register(workspace=workspace)
env_build = environment.build(workspace=workspace)
Then, configure the endpoint for publishing as follows:
from azureml.core.model import InferenceConfig
environment = Environment.get(workspace=workspace, name='myenv')
inference_config = InferenceConfig(
    entry_script='inference.py',
    source_directory='.',
    environment=environment
)
This is using Azure ML SDK 1.29.0. Perhaps this has already been fixed and the original method works as well, but I didn't test that.
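For completeness, a sketch of how the inference config might then be used to deploy to ACI (this wasn't part of the original setup; the service name and resource sizes are placeholders):

from azureml.core.model import Model
from azureml.core.webservice import AciWebservice

# Deploy the registered model with the environment-backed inference config
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)
service = Model.deploy(
    workspace=workspace,
    name='my-endpoint',
    models=[Model(workspace, name='Faster-RCNN')],
    inference_config=inference_config,
    deployment_config=deployment_config,
)
service.wait_for_deployment(show_output=True)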
EDIT:
This is no longer an issue for me: I found another way to get my code to work with Python version 3.6.7.
This is, however, still an issue if you ask me. If in the future I do need Python version 3.5, there is no solution as of now.
You can still post an answer if you would like.
I am writing a Dataflow pipeline that processes videos from a Google Cloud bucket. My pipeline downloads each work item to the local system and then re-uploads the results back to the GCP bucket. This is a follow-up to a previous question.
The pipeline works with the local DirectRunner; I'm having trouble debugging on the DataflowRunner.
The error reads
File "run_clouddataflow.py", line 41, in process
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 464, in download_to_file self._do_download(transport, file_obj, download_url, headers)
File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 418, in _do_download download.consume(transport) File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 101, in consume self._write_to_stream(result)
File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/download.py", line 62, in _write_to_stream with response: AttributeError: __exit__ [while running 'Run DeepMeerkat']
When trying to execute blob.download_to_file(file_obj) within:
storage_client = storage.Client()
bucket = storage_client.get_bucket(parsed.hostname)
blob = storage.Blob(parsed.path[1:], bucket)

#store local path
local_path = "/tmp/" + parsed.path.split("/")[-1]
print('local path: ' + local_path)

with open(local_path, 'wb') as file_obj:
    blob.download_to_file(file_obj)
print("Downloaded" + local_path)
I'm guessing that the workers don't have permission to write locally? Or perhaps there is no /tmp folder in the Dataflow container. Where should I write objects? It's hard to debug without access to the environment. Is it possible to access stdout from the workers for debugging purposes (a serial console?)
EDIT #1
I've tried explicitly passing credentials:
try:
    credentials, project = google.auth.default()
except:
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
    credentials, project = google.auth.default()
as well as writing to cwd(), instead of /tmp/
local_path=parsed.path.split("/")[-1]
print('local path: ' + local_path)
with open(local_path, 'wb') as file_obj:
    blob.download_to_file(file_obj)
Still getting the cryptic error on blob downloads from gcp.
The full pipeline script is below; setup.py is here.
import logging
import argparse
import json
import logging
import os
import csv
import apache_beam as beam
from urlparse import urlparse
from google.cloud import storage
##The namespaces inside of clouddataflow workers are not inherited.
##Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors; better to write ugly import statements than to miss a namespace
class PredictDoFn(beam.DoFn):
    def process(self, element):
        import csv
        from google.cloud import storage
        from DeepMeerkat import DeepMeerkat
        from urlparse import urlparse
        import os
        import google.auth

        DM = DeepMeerkat.DeepMeerkat()
        print(os.getcwd())
        print(element)

        #try adding credentials?
        #set credentials, inherent from worker
        credentials, project = google.auth.default()

        #download element locally
        parsed = urlparse(element[0])

        #parse gcp path
        storage_client = storage.Client(credentials=credentials)
        bucket = storage_client.get_bucket(parsed.hostname)
        blob = storage.Blob(parsed.path[1:], bucket)

        #store local path
        local_path = parsed.path.split("/")[-1]
        print('local path: ' + local_path)
        with open(local_path, 'wb') as file_obj:
            blob.download_to_file(file_obj)
        print("Downloaded" + local_path)

        #Assign input from DataFlow/manifest
        DM.process_args(video=local_path)
        DM.args.output = "Frames"

        #Run DeepMeerkat
        DM.run()

        #upload back to GCS
        found_frames = []
        for (root, dirs, files) in os.walk("Frames/"):
            for files in files:
                fileupper = files.upper()
                if fileupper.endswith((".JPG")):
                    found_frames.append(os.path.join(root, files))
        for frame in found_frames:
            #create GCS path
            path = "DeepMeerkat/" + parsed.path.split("/")[-1] + "/" + frame.split("/")[-1]
            blob = storage.Blob(path, bucket)
            blob.upload_from_filename(frame)
def run():
    import argparse
    import os
    import apache_beam as beam
    import csv
    import logging
    import google.auth

    parser = argparse.ArgumentParser()
    parser.add_argument('--input', dest='input',
                        default="gs://api-project-773889352370-testing/DataFlow/manifest.csv",
                        help='Input file to process.')
    parser.add_argument('--authtoken',
                        default="/Users/Ben/Dropbox/Google/MeerkatReader-9fbf10d1e30c.json",
                        help='Input file to process.')
    known_args, pipeline_args = parser.parse_known_args()

    #set credentials, inherent from worker
    try:
        credentials, project = google.auth.default()
    except:
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = known_args.authtoken
        credentials, project = google.auth.default()

    p = beam.Pipeline(argv=pipeline_args)
    vids = (p | 'Read input' >> beam.io.ReadFromText(known_args.input)
              | 'Parse input' >> beam.Map(lambda line: csv.reader([line]).next())
              | 'Run DeepMeerkat' >> beam.ParDo(PredictDoFn()))

    logging.getLogger().setLevel(logging.INFO)
    p.run()

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
I spoke to the google-cloud-storage package maintainer; this was a known issue. Pinning specific versions in my setup.py to
REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2","google-auth","requests>=2.18.0"]
fixed the issue.
https://github.com/GoogleCloudPlatform/google-cloud-python/issues/3836
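For anyone hitting the same thing, a minimal setup.py sketch showing where REQUIRED_PACKAGES goes (the package name and version below are placeholders; REQUIRED_PACKAGES is taken from the answer):

import setuptools

REQUIRED_PACKAGES = ["google-cloud-storage==1.3.2", "google-auth", "requests>=2.18.0"]

setuptools.setup(
    name='deepmeerkat-pipeline',
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,  # installed on each Dataflow worker
    packages=setuptools.find_packages(),
)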
I am writing a package that loads additional data from the lib directory and would like to provide an easy way to load this data with something like this:
const dataPath = 'mypackage/data/data.json';
initializeMyLibrary(dataPath).then((_) {
  // library is ready
});
I've made two separate libraries browser.dart and standalone.dart, similar to how it is done in the Intl package.
It is quite easy to load this data from the "browser" environment, but when it comes to the "standalone" environment, it is not so easy, because of the pub run command.
When the script is run with a simple $ dart myscript.dart, I can find the package path using the dart:io Platform.script and Platform.packageRoot properties.
But when the script is running with $ pub run tool/mytool, the correct way to load data should be to:
- detect that the script is running from the pub run command
- find the pub server host
- load data from this server, because there could be pub transformers and we can't load data directly from the file system.
And even if I want to load data directly from the file system, when the script is running with pub run, Platform.script returns the /mytool path.
So, the question is: is there any way to detect that the script is being run via pub run, and how do I find the host of the pub server?
I am not sure that this is the right way, but when I am running a script with pub run, Platform.script actually returns http://localhost:<port>/myscript.dart. So, when the scheme is http, I can download using an HTTP client, and when it is file, I can load from the file system.
Something like this:
import 'dart:async';
import 'dart:io';

import 'package:path/path.dart' as ospath;

Future<List<int>> loadAsBytes(String path) {
  final script = Platform.script;
  final scheme = Platform.script.scheme;

  if (scheme.startsWith('http')) {
    return new HttpClient().getUrl(
        new Uri(
            scheme: script.scheme,
            host: script.host,
            port: script.port,
            path: 'packages/' + path)).then((req) {
      return req.close();
    }).then((response) {
      return response.fold(
          new BytesBuilder(),
          (b, d) => b..add(d)).then((builder) {
        return builder.takeBytes();
      });
    });
  } else if (scheme == 'file') {
    return new File(
        ospath.join(ospath.dirname(script.path), 'packages', path)).readAsBytes();
  }

  throw new Exception('...');
}