Kubeflow parallel for loop is ignoring `parallelism` parameter

There is no good documentation for dsl.ParallelFor, but I assume the parallelism parameter controls how many pods can run in parallel.
However, when I look at the DAG visualization, it seems to launch every task in the for-loop list at once, which exhausts the resource quota.
The step is stuck in Pending state with this message: pods "pipeline-pdtbc-2302481418" is forbidden: exceeded quota: kf-resource-quota, requested: cpu=1500m, used: cpu=35100m, limited: cpu=36
Since my parallelism is set to 1, it should have run the iterations one by one instead of requesting that much CPU at once.
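For reference, a minimal sketch of the kind of pipeline I mean, assuming a KFP v1 SDK version where dsl.ParallelFor accepts a parallelism argument; the train component and loop items here are just placeholders for illustration:

from kfp import dsl
from kfp.components import create_component_from_func

# trivial placeholder component standing in for the real per-item work
def train(x: int):
    print(x)

train_op = create_component_from_func(train)

@dsl.pipeline(name='parallel-demo')
def my_pipeline():
    # parallelism=1 should mean only one loop iteration runs at a time
    with dsl.ParallelFor([1, 2, 3, 4], parallelism=1) as item:
        train_op(x=item)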

Apparently, this is a bug in Kubeflow: see https://github.com/kubeflow/pipelines/issues/6588.
Here is a workaround that patches the compiled pipeline YAML:
import json
import yaml
from yaml import SafeLoader

def fix_parallelism(source_pipeline_path, parallelism=10, target_pipeline_path=None):
    """
    Limits the number of parallel tasks by patching the compiled pipeline YAML.
    Args:
        source_pipeline_path (str): path to the source pipeline YAML to be edited.
        parallelism (int): parallelism to use.
        target_pipeline_path (str): target path; defaults to the source path.
    Returns:
        None - edits the file in place (or writes to target_pipeline_path).
    """
    # see https://github.com/kubeflow/pipelines/issues/6588
    with open(source_pipeline_path, 'rt') as f:
        data = yaml.load(f, Loader=SafeLoader)
    pipeline_name = json.loads(
        data['metadata']['annotations']['pipelines.kubeflow.org/pipeline_spec'])['name']
    pipeline_index = [i for i, t in enumerate(data['spec']['templates'])
                      if t['name'] == pipeline_name][0]
    # set parallelism on the pipeline's root template
    data['spec']['templates'][pipeline_index]['parallelism'] = parallelism
    target_pipeline_path = target_pipeline_path or source_pipeline_path
    with open(target_pipeline_path, 'wt') as f:
        yaml.dump(data, f)
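For example, assuming a pipeline function like the sketch above (the function and file names are placeholders), you can apply the workaround right after compiling with the KFP v1 SDK:

import kfp

# compile the pipeline to YAML as usual (file name is a placeholder)
kfp.compiler.Compiler().compile(my_pipeline, 'pipeline.yaml')

# then patch the compiled YAML so at most one loop iteration runs at a time
fix_parallelism('pipeline.yaml', parallelism=1)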

Related

How to not rebuild artifacts on every invocation

I want to download and build ruby within a workspace. I've been trying to implement this by mimicking rules_go, and I have that part working. The issue I'm having is that it rebuilds the openssl and ruby artifacts each time ruby_download_sdk is invoked. In the code below, the downloaded artifacts are cached, but the builds of openssl and ruby are always executed.
def ruby_download_sdk(name, version = None):
    # TODO detect os and arch
    os, arch = "osx", "x86_64"
    _ruby_download_sdk(
        name = name,
        version = version,
    )
    _register_toolchains(name, os, arch)

def _ruby_download_sdk_impl(repository_ctx):
    # TODO detect platform
    platform = ("osx", "x86_64")
    _sdk_build_file(repository_ctx, platform)
    _remote_sdk(repository_ctx)

_ruby_download_sdk = repository_rule(
    _ruby_download_sdk_impl,
    attrs = {
        "version": attr.string(),
    },
)

def _remote_sdk(repository_ctx):
    _download_openssl(repository_ctx, version = "1.1.1c")
    _download_ruby(repository_ctx, version = "2.6.3")
    openssl_path, ruby_path = "openssl/build", ""
    _build(repository_ctx, "openssl", openssl_path, ruby_path)
    _build(repository_ctx, "ruby", openssl_path, ruby_path)

def _build(repository_ctx, name, openssl_path, ruby_path):
    script_name = "build-{}.sh".format(name)
    template_name = "build-{}.template".format(name)
    repository_ctx.template(
        script_name,
        Label("@rules_ruby//ruby/private:{}".format(template_name)),
        substitutions = {
            "{ssl_build}": openssl_path,
            "{ruby_build}": ruby_path,
        },
    )
    repository_ctx.report_progress("Building {}".format(name))
    res = repository_ctx.execute(["./" + script_name], timeout = 20 * 60)
    if res.return_code != 0:
        print("res %s" % res.return_code)
        print(" -stdout: %s" % res.stdout)
        print(" -stderr: %s" % res.stderr)
Any advice on how I can make Bazel aware of these artifacts so that it doesn't rebuild them every time?
The problem is that Bazel isn't really building your ruby and openssl. When it prepares your build tree and runs the repository rule, it just executes a shell script as instructed; that script happens to build things, but this fact is essentially opaque to Bazel (and it also happens before Bazel itself would even start building).
There might be others, but off the top of my head I see the following options:
Pre-build your ruby environment and pull in its results as an external dependency. The obvious downside (which may or may not be a lot of pain) is that you need to do so for every platform you need to support (including correct platform detection and the corresponding download). The upside is that you really only build once per platform and also have control over the tooling used across all hosts. This would likely be my primary choice.
Build openssl and ruby like any other C sources, making them just another Bazel target. This, however, means you'd need to bazelify their builds (describe and maintain a Bazel build of otherwise Bazel-unaware projects).
Continue further along the path you've started and (sort of) leave Bazel out of it. That is, extend the magic in the build scripts: use a deterministic output location and perhaps manifest files describing what is already there (which also makes corruption less likely), so that the scripts can determine that the build has already taken place and simply collect its previous results.

F# How to implement parameterized CI – execution loop using FAKE

The question is mostly about step #3.
I need to implement the following loop in F#:
1. Load some input parameters from the database. If the input parameters specify that the loop should be terminated, then quit. This is straightforward, thanks to type providers.
2. Run a model generator, which generates a single F# source file, call it ModelData.fs. The generation is based on the input parameters and some random values. This may take up to several hours. I already have that.
3. Compile the project. I can hard-code a call to MSBuild, but that does not feel right.
4. Copy the executables into some temporary folder, let's say <SomeTempData>\<SomeModelName>.
5. Asynchronously run the executable from the output location of the previous step. The execution time is usually between several hours and several days. Each process runs in a single thread because this is the most efficient way to do it in this case. I tried a parallel version, but it did not beat the single-threaded one.
6. Catch when the execution finishes. This seems straightforward: https://learn.microsoft.com/en-us/dotnet/api/system.diagnostics.process.exited?view=netframework-4.7.2 . The running process is responsible for storing useful results, if any. This event can be handled by an F# MailboxProcessor.
7. Repeat from the beginning. Because execution is async, monitor the number of running tasks to ensure that it does not exceed some allowed number. Again, a MailboxProcessor will do that with ease.
The whole thing will run on Windows, so there is no need to support multiple platforms; .NET Framework (let's say 4.7.2 as of this date) will do fine.
This seems like a very straightforward CI-like exercise, and FAKE seemed like the proper solution. Unfortunately, none of the scarce examples provided worked (even with reasonable tweaks), and the errors were cryptic. The worst part is that the compilation step does not work at all: the provided example (http://fake.build/fake-gettingstarted.html#Compiling-the-application) cannot be run as-is, and even after accounting for something like https://github.com/fsharp/FAKE/issues/1579 it still silently chooses not to compile the project. I'd appreciate any advice.
Here is the code that I was trying to run. It is based on the references above:
#r #"C:\GitHub\ClmFSharp\Clm\packages\FAKE.5.8.4\tools\FakeLib.dll"
#r #"C:\GitHub\ClmFSharp\Clm\packages\FAKE.5.8.4\tools\System.Reactive.dll"
open System.IO
open Fake.DotNet
open Fake.Core
open Fake.IO
open Fake.IO.Globbing.Operators
let execContext = Fake.Core.Context.FakeExecutionContext.Create false "build.fsx" []
Fake.Core.Context.setExecutionContext (Fake.Core.Context.RuntimeContext.Fake execContext)
// Properties
let buildDir = #"C:\Temp\WTF\"
// Targets
Target.create "Clean" (fun _ ->
Shell.cleanDir buildDir
)
Target.create "BuildApp" (fun _ ->
!! #"..\SolverRunner\SolverRunner.fsproj"
|> MSBuild.runRelease id buildDir "Build"
|> Trace.logItems "AppBuild-Output: "
)
Target.create "Default" (fun _ ->
Trace.trace "Hello World from FAKE"
)
open Fake.Core.TargetOperators
"Clean"
==> "BuildApp"
==> "Default"
Target.runOrDefault "Default"
The problem is that it does not build the project at all, yet no error messages are produced! This is the output when running it in FSI:
run Default
Building project with version: LocalBuild
Shortened DependencyGraph for Target Default:
<== Default
<== BuildApp
<== Clean
The running order is:
Group - 1
- Clean
Group - 2
- BuildApp
Group - 3
- Default
Starting target 'Clean'
Finished (Success) 'Clean' in 00:00:00.0098793
Starting target 'BuildApp'
Finished (Success) 'BuildApp' in 00:00:00.0259223
Starting target 'Default'
Hello World from FAKE
Finished (Success) 'Default' in 00:00:00.0004329
---------------------------------------------------------------------
Build Time Report
---------------------------------------------------------------------
Target Duration
------ --------
Clean 00:00:00.0025260
BuildApp 00:00:00.0258713
Default 00:00:00.0003934
Total: 00:00:00.2985910
Status: Ok
---------------------------------------------------------------------

Error creating vocabulary from big text file on disk

I am trying to run the example from https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html, but with my own file "text": 3.7 GB of plain text built from a Wikipedia XML dump with the Perl script from http://mattmahoney.net/dc/textdata.html.
setwd("c:/rtest")
library(text2vec)
library(doParallel)
N_WORKERS = 2
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = "text")
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)
This causes the following error:
Error in unserialize(socklist[[n]]) : error reading from connection
I have 8 GB of RAM; a word2vec model is created from this file without any errors.
First of all, it doesn't make sense to use parallel iterators on a single file: each file is processed in a separate R worker process, so here this will be worse than plain itoken. It also involves sending the result from each worker back to the master process, and here we see that the result is too big to be sent through a socket.
Long story short: just use itoken, or split your file into several smaller files.
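If you go the splitting route, the split itself can be done outside R with any tool; for instance, here is a rough Python sketch (the chunk size and output naming are arbitrary assumptions) that writes fixed-size pieces while keeping words intact:

# naive splitter for a huge plain-text corpus such as the Wikipedia dump;
# reads fixed-size character chunks and extends each chunk to the next
# whitespace so no word is cut in half
CHUNK_CHARS = 50_000_000  # roughly 50 million characters per output file

def split_corpus(path="text", chunk_chars=CHUNK_CHARS):
    with open(path, "r", encoding="utf-8") as src:
        part = 0
        while True:
            block = src.read(chunk_chars)
            if not block:
                break
            # keep reading single characters until the next whitespace
            ch = src.read(1)
            while ch and not ch.isspace():
                block += ch
                ch = src.read(1)
            with open(f"{path}_{part:03d}.txt", "w", encoding="utf-8") as out:
                out.write(block)
            part += 1

if __name__ == "__main__":
    split_corpus()

Afterwards, point ifiles_parallel at the resulting list of files instead of the single big one.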

Override dask scheduler to concurrently load data on multiple workers

I want to run graphs/futures on my distributed cluster which all have a 'load data' root task and then a bunch of training tasks that run on that data. A simplified version would look like this:
from dask.distributed import Client
client = Client(scheduler_ip)
load_data_future = client.submit(load_data_func, 'path/to/data/')
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Running this as above, the scheduler gets one worker to read the file, and then that worker spills the data to disk in order to share it with the other workers. However, loading the data is usually reading from a large HDF5 file, which can be done concurrently, so I was wondering if there was a way to force all workers to read this file concurrently (they all compute the root task) instead of having them wait for one worker to finish and then slowly transferring the data from that worker.
I know there is the client.run() method which I can use to get all workers to read the file concurrently, but how would you then get the data you've read to feed into the downstream tasks?
I cannot use the dask data primitives to concurrently read HDF5 files because I need things like multi-indexes and grouping on multiple columns.
I revisited this question and found a relatively simple solution, though it uses internal API methods and involves a blocking call to client.run(). Using the same variables as in the question:
from distributed import get_worker
client_id = client.id
def load_dataset():
    worker = get_worker()
    data = {'load_dataset-0': load_data_func('path/to/data')}
    info = worker.update_data(data=data, report=False)
    worker.scheduler.update_data(
        who_has={key: [worker.address] for key in data},
        nbytes=info['nbytes'], client=client_id)
client.run(load_dataset)
Now if you run client.has_what() you should see that each worker holds the key load_dataset-0. To use this in downstream computations you can simply create a future for the key:
from distributed import Future
load_data_future = Future('load_dataset-0', client=client)
and this can be used with client.compute() or dask.delayed as usual. Indeed the final line from the example in the question would work fine:
train_task_futures = [client.submit(train_func, load_data_future, params)
                      for params in train_param_set]
Bear in mind that this uses the internal API methods Worker.update_data and Scheduler.update_data; it works fine as of distributed.__version__ == 1.21.6, but could be subject to change in future releases.
As of today (distributed.__version__ == 1.20.2) what you ask for is not possible. The closest thing would be to compute once and then replicate the data explicitly:
from dask.distributed import wait  # wait is needed alongside the client

future = client.submit(load, path)
wait(future)
client.replicate(future)
You may want to raise this as a feature request at https://github.com/dask/distributed/issues/new

What causes flume with GCS sink to throw an OutOfMemoryException

I am using flume to write to Google Cloud Storage. Flume listens on HTTP:9000. It took me some time to make it work (add the GCS libraries, use a credentials file...), but now it seems to communicate over the network.
I am sending a very small HTTP request for my tests, and I have plenty of RAM available:
curl -X POST -d '[{ "headers" : { timestamp=1417444588182, env=dev, tenant=myTenant, type=myType }, "body" : "some body ONE" }]' localhost:9000
I encounter this memory exception on the first request (and then, of course, it stops working):
2014-11-28 16:59:47,748 (hdfs-hdfs_sink-call-runner-0) [INFO - com.google.cloud.hadoop.util.LogUtil.info(LogUtil.java:142)] GHFS version: 1.3.0-hadoop2
2014-11-28 16:59:50,014 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:467)] process failed
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:820)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
(see complete stack trace as a gist for full details)
The strange part is that folders and files are created the way I want, but the files are empty.
gs://my_bucket/dev/myTenant/myType/2014-12-01/14-36-28.1417445234193.json.tmp
Is something wrong with the way I configured flume + GCS, or is it a bug in the GCS jar?
Where should I look to gather more data?
PS: I am running flume-ng inside Docker.
My flume.conf file:
# Name the components on this agent
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my_bucket/%{env}/%{tenant}/%{type}/%Y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000
a1.channels.mem.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Related question in my flume/GCS journey: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?
When uploading files, the GCS Hadoop FileSystem implementation sets aside a fairly large (64MB) write buffer per FSDataOutputStream (file open for write). This can be changed by setting "fs.gs.io.buffersize.write" to a smaller value, in bytes, in core-site.xml. I imagine 1MB would suffice for low-volume log collection.
In addition, check what the maximum heap size is set to when launching the JVM for flume. The flume-ng script sets a default JAVA_OPTS value of -Xmx20m to limit the heap to 20MB. This can be set to a larger value in flume-env.sh (see conf/flume-env.sh.template in the flume tarball distribution for details).

Resources