I have set up a Dask cluster and I'm happily sending basic Prefect flows to it.
Now I want to do something more interesting: take a custom Docker image containing my Python library and execute flows/tasks on the Dask cluster.
My assumption was that I could leave the Dask cluster (scheduler and workers) as they are, with their own Python environment (after checking that the various message-passing libraries have matching versions everywhere). That is to say, I do not expect to need to add my library to those machines if the flow is executed within my custom storage.
However, either I have not set up storage correctly or it is not safe to assume the above. In other words, perhaps when pickling objects from my custom library, the Dask cluster does need to know about my Python library. Suppose I have some generic Python library called data...
import prefect
from prefect.engine.executors import DaskExecutor
# see https://docs.prefect.io/api/latest/environments/storage.html#docker
from prefect.environments.storage import Docker

# option 1
storage = Docker(registry_url="gcr.io/my-project/",
                 python_dependencies=["some-extra-public-package"],
                 dockerfile="/path/to/Dockerfile")
# this is the docker build and register workflow!
# storage.build()

# or option 2, specify image directly
storage = Docker(
    registry_url="gcr.io/my-project/", image_name="my-image", image_tag="latest"
)
# storage.build()

def get_tasks():
    return [
        "gs://path/to/task.yaml"
    ]

@prefect.task
def run_task(uri):
    # fails because this data needs to be pickled ??
    from data.tasks import TaskBase
    task = TaskBase.from_task_uri(uri)
    # task.run()
    return "done"

with prefect.Flow("dask-example",
                  storage=storage) as flow:
    # chain stuff...
    result = run_task.map(uri=get_tasks())

executor = DaskExecutor(address="tcp://127.0.0.1:8080")
flow.run(executor=executor)
Can anyone explain how/if this type of docker-based workflow should work?
Your Dask workers will need access to the same Python libraries that your tasks rely on to run. The simplest way to achieve this is to run your Dask workers using the same image as your Flow. You could do this manually, or use something like the DaskCloudProviderEnvironment, which creates short-lived Dask clusters per flow run using the same image automatically.
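As a quick sanity check before running the flow, a minimal sketch using plain dask.distributed (scheduler address taken from the question) can ask every worker whether the custom library is importable:

import importlib.util
from dask.distributed import Client

def can_import_data():
    # returns True on a worker only if the "data" package is importable there
    import importlib.util
    return importlib.util.find_spec("data") is not None

client = Client("tcp://127.0.0.1:8080")   # same scheduler address as the DaskExecutor
print(client.run(can_import_data))        # {worker_address: True/False, ...}

If any worker reports False, that worker's environment/image is missing the package and unpickling the task will fail there.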
Related
I have a service within which I have several modules, and in the main file I import most of my modules like below.
from base_client import BaseClient
import request_dispatcher as rd
import utils as util
In one of the functions in main I am calling the Dask client's submit. When I try to get the result back from the future object, it gives me a ModuleNotFoundError as below:
ModuleNotFoundError: No module named 'base_client'
This is how I define my client and call the function:
from dask.distributed import Client

def mytask(url, dest):
    ...

client = Client(<scheduler ip>)
f_obj = client.submit(mytask, data_url, destination)
How exactly can I make these modules available to scheduler and workers?
When you call submit, Dask wraps up your task and sends it to the worker(s). As part of this package, any variables required by the task are serialised and sent too. For functions defined inline, this includes the whole function, but for functions defined in a module, it is only the module and function names. This is done to save CPU and bandwidth (imagine trying to send all the source of all of the modules you happen to have imported).
On the worker side, the task is unwrapped, and for the function this means importing the module, import base_client. This follows the normal Python logic of looking in the locations defined by sys.path. If the file defining the module isn't there, you get the error above.
To solve this, copy the file to a place that the worker can see it. You can do this with upload_file on a temporary basis (it uses a temporary directory), but you would be better off installing the module using the usual pip or conda methods. Importing from the "current directory" is likely to fail even with a local cluster.
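For completeness, a rough sketch of the temporary upload_file route, reusing the names from the question (the scheduler address, data_url and destination are placeholders from the question itself):

from dask.distributed import Client

client = Client("<scheduler ip>")  # placeholder address from the question

# Ship the local module files to every worker for this session. This does not
# survive worker restarts, so a proper pip/conda install is still preferable.
client.upload_file("base_client.py")
client.upload_file("request_dispatcher.py")
client.upload_file("utils.py")

f_obj = client.submit(mytask, data_url, destination)  # names as in the question
print(f_obj.result())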
To get more information, you would need to post a complete example, which shows the problem. In practice, functions using imports from modules are used all the time with submit without problems.
How do you automatically parse Ansible warnings and errors in your Jenkins pipeline jobs?
I greatly enjoy the power of leveraging Ansible in Jenkins when it works. When it fails, though, the hunt to locate the actual error can be challenging.
I use WarningsNG, which supports custom parsers (and allows their programmatic generation).
Do you know of any plugins or add-ons that already transform these logs into the kind of charts WarningsNG produces?
I figured I'd ask as I go off into deep regex land and make my own.
One good way to achieve this seems to be the following:
select an existing structured-output Ansible callback plugin (json, junit and yaml are all viable). I selected junit, as I can play with the format to get a really nice view into the playbook, with errors reported in a very obvious way.
fork that GPL file (yes, so be careful with that license) and augment it with the following:
store output as a file (a sketch follows the snippet below)
implement the missing callback methods (the three mentioned above do not implement the v2...item callbacks)
forward events to the default or debug callback to ensure operators see something when they execute the plan
add a secrets cleaner - if you use the Jenkins credentials-binding-plugin, it will hide secrets from the console, but it will not hide secrets within stored files. You'll need to handle that in your playbook or via some Groovy code (if Groovy, try { ... } finally { clean } seems a good pattern)
Snippet - forwarding to the default callback
from ansible.plugins.callback import CallbackBase
from ansible.plugins.callback.default import CallbackModule as CallbackModule_default
...

class CallbackModule(CallbackBase):

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'stdout'
    CALLBACK_NAME = 'json'

    def __init__(self, display=None):
        super(CallbackModule, self).__init__(display)
        self.default_callback = CallbackModule_default()

    ...

    def v2_on_file_diff(self, result):
        self.default_callback.v2_on_file_diff(result)
        # ... do whatever you'd want to ensure the content appears in the json file
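For the "store output as a file" bullet above, a rough self-contained sketch (it assumes, as the stock json callback does, that results are accumulated in a self.results list by the other v2_* hooks) could flush everything in v2_playbook_on_stats, which fires once at the end of the run:

import json
from ansible.plugins.callback import CallbackBase
from ansible.plugins.callback.default import CallbackModule as CallbackModule_default

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'stdout'
    CALLBACK_NAME = 'json'

    def __init__(self, display=None):
        super(CallbackModule, self).__init__(display)
        self.default_callback = CallbackModule_default()
        self.results = []  # assumed accumulator, filled in by the other v2_* hooks

    def v2_playbook_on_stats(self, stats):
        # keep the normal play recap visible to operators
        self.default_callback.v2_playbook_on_stats(stats)
        # write out whatever has been collected so the Jenkins parser can read it
        with open("ansible_results.json", "w") as fh:
            json.dump(self.results, fh, indent=2, default=str)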
I'm currently working with Julia (1.0) to run some parallel code on clusters of an HPC. The HPC is managed with PBS. I'm trying to find a way to broadcast environment variables over all processes, i.e. a way to broadcast a specific list of environment variables automatically so that they are accessible in every Julia worker.
#!/bin/bash
#PBS ...
export TOTO=toto
julia --machine-file=$PBS_NODEFILE my_script.jl
In this example, I will not be able to access the variable TOTO in each Julia worker (via ENV["TOTO"]).
The only way I found to do what I want is to set the variables in my .bashrc, but I want this to be script-specific. Another way is to put this in my startup.jl file:
@everywhere ENV["TOTO"] = $(ENV["TOTO"])
But that is not script-specific either, because I have to know in advance which variables I want to send. If I loop over the ENV keys, then I'll broadcast all the variables and override variables I don't want to override.
I tried to use DotEnv.jl but it doesn't work.
Thanks for your time.
The obvious way is to set the variables first thing in my_script.jl. You can also put the initialization in a separate file, e.g. environment.jl, and load that on all processes with the -L flag:
julia --machine-file=$PBS_NODEFILE -L environment.jl my_script.jl
where environment.jl would, in this case, contain
ENV["TOTO"] = "toto"
etc.
Is there a way to get the Google account name running the DAGs from the DAG definition?
This would be very helpful to track which user was running the DAGs.
I can only see:
unixname --> always airflow
owner --> fixed in the dag definition
Regards
Eduardo
Possible, as DAGs in Composer are essentially GCS objects, and the GCS object GET API tells you who uploaded that object. So here's one possible way of getting owner info:
Define a function user_lookup() in your DAG definition.
The implementation of user_lookup() consists of the following steps: a) get the current file path (e.g., os.path.basename(__file__)); b) based on how Composer GCS objects are mapped locally, determine the corresponding GCS object path (e.g., gs://{your-bucket}/object); c) read the GCS object details and return object.owner.
In your DAG definition, set owner=user_lookup().
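A hypothetical sketch of user_lookup() along those lines, using the google-cloud-storage client. The bucket name is a placeholder, and the dags/ prefix reflects how Composer maps the local DAGs folder to gs://<bucket>/dags; note that owner information may only be populated when the object's ACL is readable, so treat this as a starting point:

import os
from google.cloud import storage

def user_lookup():
    # Placeholder bucket name; in Composer this is the environment's GCS bucket.
    bucket_name = "your-composer-bucket"
    # Local /home/airflow/gcs/dags/<file> maps to gs://<bucket>/dags/<file>.
    object_path = "dags/" + os.path.basename(__file__)

    blob = storage.Client().bucket(bucket_name).get_blob(object_path)
    if blob is None or blob.owner is None:
        return "airflow"  # fallback when the object or its ACL is unavailable
    return blob.owner.get("entity")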
Let us know whether the above works for you.
I want to have a single script that either collects TensorBoard data or not, depending on how I run it. I am aware that I can pass flags to tell my script how I want it to be run. I could even hard-code it in the script and just manually change the script.
Either solution has a bigger problem: I find myself having to write an if statement everywhere in my script whenever I want the summary-writer operations to be run or not. For example, I find that I would have to do something like:
if tb_sys_arg == 'tensorboard':
    merged = tf.merge_all_summaries()
and then depending on the value of tb_sys_arg run the summaries or not, as in:
if tb_sys_arg == 'tensorboard':
    merged = tf.merge_all_summaries()
else:
    train_writer = tf.train.SummaryWriter(tensorboard_data_dump_train, sess.graph)
This seems really silly to me; I'd rather not have to do that. Is this the right way to do it? I just don't want to collect statistics each time I run my main script, but I don't want to have two separate scripts either.
As an anecdote, a few months ago I started using TensorBoard, and it seems I have been running my main file as follows:
python main.py --logdir=/tmp/mdl_logs
so that it collects TensorBoard data. But I realized that I don't think I need that last flag to collect TensorBoard data. It's been so long that now I forget whether I actually need it. I've been reading the documentation and tutorials, but it seems I don't need that last flag (it's only needed to run the web app, as in tensorboard --logdir=path/to/log-directory, right?). Have I been doing this wrong all this time?
You can launch the Supervisor without the "summary" service, so it won't run the summary nodes; see the "Launching fewer services" section of the Supervisor docs -- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/functions_and_classes/shard6/tf.train.Supervisor.md#launching-fewer-services
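A minimal sketch of that route, using the same old tf.train API era as the question (the train op here is a trivial stand-in for a real training step). Passing summary_op=None keeps the Supervisor from starting its summary service, so no summary nodes run and nothing is written for TensorBoard:

import tensorflow as tf

x = tf.Variable(0.0)
train_op = tf.assign_add(x, 1.0)  # stand-in for a real training op

# summary_op=None disables the summary service entirely
sv = tf.train.Supervisor(logdir="/tmp/mdl_logs", summary_op=None)
with sv.managed_session() as sess:
    for _ in range(10):
        sess.run(train_op)  # runs training without collecting any summaries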