OpenWhisk: scaling issues on a distributed setup

I set up a distributed OpenWhisk installation on a few virtual machines as described here: https://github.com/apache/incubator-openwhisk/blob/master/ansible/README_DISTRIBUTED.md (I also had to install some dependencies on the VMs manually, because they were expected but not installed by default).
My host file looks like this:
; the first parameter in a host is the inventory_hostname
; used for local actions only
ansible ansible_connection=local
[registry]
xxx.xx.xx.173 ansible_host=xxx.xx.xx.173
[edge]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[apigateway:children]
edge
[redis:children]
edge
[controllers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[kafkas]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[zookeepers:children]
kafkas
[invokers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[db]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
Everything seems to be running fine in general: I can create actions, invoke them, etc.
On the two VMs which host the invokers and controllers, I ran htop to check the CPU usage and tried running a Python script that invokes the same action (a prime-number calculation that takes time for a large enough input) multiple times in parallel.
The result seems to be that the first invoker runs at 100% CPU while the calculation is happening, while the second one keeps idling at 5-7% CPU. I also tried different ways of distributing components across the VMs, e.g. putting invokers on two machines and one controller separately on another machine, but the result is the same.
What could be the reason for this? And what would be the proper use case to get OpenWhisk to involve the second invoker?

In a small deployment, a fraction of the invoker pool is allocated strictly for Docker (blackbox) actions. This is called the blackbox fraction, which is 10% by default with a minimum of 1 invoker. With only two invokers, that minimum means one whole invoker is set aside for blackbox actions, so regular managed-runtime actions are all scheduled on the remaining one, which is why you see one loaded invoker and one idle.
This recent pull request allows all invokers to be used when the number of invokers is small (up to the reciprocal of the blackbox fraction): https://github.com/apache/incubator-openwhisk/pull/3751
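For completeness, a parallel-invocation load test like the one described in the question can be driven against the OpenWhisk REST API roughly as follows. This is only a hypothetical sketch; the API host, credentials, and action name are placeholders, not values from the deployment above:

# Hypothetical sketch: fire many blocking invocations of one action in parallel.
import concurrent.futures
import requests

APIHOST = "https://xxx.xx.xx.176"          # edge host (placeholder)
AUTH = ("user", "password")                # the two halves of the wsk auth key (placeholder)
URL = APIHOST + "/api/v1/namespaces/_/actions/prime?blocking=true"

def invoke(n):
    # POST a blocking activation and return the HTTP status code.
    r = requests.post(URL, json={"n": n}, auth=AUTH, verify=False)
    return r.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
    print(list(pool.map(invoke, [1000000] * 50)))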

Related

How to run a Python script in the background on Azure

I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive Bayes, etc. and perform a grid search on them to find the optimal sets of hyper-parameters. Running all of these would take an exceedingly long time (48-168 hours total, but run in batches), and considering my computer becomes more or less unusable while I run them, I was trying to find a solution that allowed me to run my code externally. The scripts I have to run are in Python, and my plan was to run them on Azure to make use of its "Azure for students" $100 credit.
My original plan was to use Azure's ML notebook section and then run the Python scripts in the terminal they provide. My problem with this route is that, as far as I can tell, when the browser closes, the computation stops, which is a problem. I looked into it and found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown' to disconnect the process from the shell, but I thought there should definitely be a better way to do it. (I also wasn't sure how this would work in my case, where there were 8 processes running at once using GridSearchCV's n_jobs=-1 feature.)
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the Excel file pipe into the Python script as a data frame, but this implementation (and all the others I've tried) isn't working.
My first question is: how do I get the Excel data to pipe into the Python script properly?
My second question is: is there a better way to go about doing this? Would running it in the shell be an easier way to do it? If so, how do I ensure it keeps running while my browser is closed? Are there other services that would be better? My main metrics for this are price (cheap) and time limit (the ability to run for a long time), but any suggestions would be greatly appreciated.
I also tried using Google Colab; this worked, but it felt slower than running on my computer.
To run a grid search with AzureML, you would use the Sweep job. The simplest way to kick off a sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create up to max_total_trials trials for different parameter combinations, drawn from the search_space as governed by the sampling_algorithm, which can be random, grid, or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is the command that is executed, code is a folder on the local machine that contains the script/program you want to run, and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest is one that is predefined in AzureML, but you can also create your own.
If you prefer Python, here is the same thing done in Python.
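For illustration, the same sweep could be expressed with the v2 Python SDK roughly like this (a hedged sketch; subscription, resource group, workspace, and compute names are placeholders):

# Rough equivalent of the YAML above using the Azure ML v2 Python SDK.
from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     subscription_id="<subscription-id>",
                     resource_group_name="<resource-group>",
                     workspace_name="<workspace>")

# The trial: the script, a fixed input A, and placeholder values for B and C.
job = command(
    code="./src",
    command="python hello-sweep.py --A ${{inputs.A}} --B ${{inputs.B}} --C ${{inputs.C}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    inputs={"A": 0.5, "B": "hello", "C": 0.1},
    compute="cpu-cluster",
)

# Override B and C with search-space expressions and turn the command into a sweep.
job_for_sweep = job(
    B=Choice(values=["hello", "world", "hello_world"]),
    C=Uniform(min_value=0.1, max_value=1.0),
)
sweep_job = job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="random_metric",
    goal="Minimize",
)
sweep_job.set_limits(max_total_trials=4, max_concurrent_trials=2, timeout=3600)

ml_client.jobs.create_or_update(sweep_job)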
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Behavior of docker compose v3's deploy resources limits 'cpus' parameter setting (is it an absolute number or a percentage of available cores)

Folks,
With regard to Docker Compose v3's 'cpus' parameter setting (under 'deploy' > 'resources' > 'limits'), which limits the CPUs available to a service: is it an absolute number that specifies a count of CPUs, or is it a more useful percentage-of-available-CPUs setting?
From what I read it appears to be an absolute number, meaning that if, say, a host has 4 CPUs and two services in the compose file are each set to 0.5, then the two services combined can only use a maximum of 1 CPU (0.5 each), leaving the 3 remaining CPUs idle.
But thinking out loud, it appears to me that it would be nicer if this were a percentage-of-available-cores setting, in which case, for the same example, each of the two services could use up to 2 CPUs and the two combined could use all 4 when needed. This way, when I increase or decrease the available cores, the relative setting would help me avoid modifying this value again.
EDIT(09/10/21):
On reading this, it appears that the above can be achieved with the 'cpu-shares' setting instead of setting 'cpus'. Is my understanding correct?
The doc for 'cpu-shares' however mentions the below cautionary note,
"It does not guarantee or reserve any specific CPU access."
If the above is achieved with this setting, then what does it mean (what do I lose) to not have a guarantee or reservation?
EDIT(09/13/21):
Just to summarize,
The 'cpus' parameter setting is an absolute number that refers to the number of CPUs a service has reserved for its use at all times. Correct?
The 'cpu-shares' parameter setting is a relative weight whose value is used to compute/determine the percentage of the total available CPU that a service can use, and it only applies when there is contention. Correct?
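To make the comparison concrete, here is how I picture the two settings mapped onto the equivalent knobs of the Docker SDK for Python (a hypothetical sketch; the image and the busy-loop command are just for demonstration):

# 'cpus'-style hard cap vs. 'cpu-shares'-style relative weight,
# expressed through docker-py's nano_cpus and cpu_shares parameters.
import docker

client = docker.from_env()
busy_loop = ["sh", "-c", "while true; do :; done"]

# Hard cap: never more than half a core, even if the other cores sit idle.
capped = client.containers.run("busybox", busy_loop,
                               nano_cpus=500_000_000,   # 0.5 CPU (1e9 == 1 CPU)
                               detach=True)

# Relative weight: may use all cores when uncontended; under contention it gets
# roughly half the CPU time of a container with the default weight of 1024.
weighted = client.containers.run("busybox", busy_loop,
                                 cpu_shares=512,
                                 detach=True)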

GNU parallel saturates one server instead of distributing jobs equally

I am using GNU parallel 20160222. I have four servers configured in my ~/.parallel/sshloginfile:
48/big1
48/big2
8/small1
8/small2
When I run, say, 32 jobs, I'd expect parallel to start eight on each server, or even better, two or three each on small1 and small2 and twelve or so each on big1 and big2. But what it actually does is start 8 jobs on small2 and run the remaining jobs locally.
Here is my invocation (I actually use a --profile but I removed it for simplicity):
parallel --verbose --workdir . --sshdelay 0.2 --controlmaster --sshloginfile .. \
"my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)
Here is the main question:
Is there an option missing that would do a more equal allocation of jobs?
Here is another related question:
Is there a way to specify --memfree, --load, etc. per server? Especially --memfree.
I recall GNU Parallel used to fill job slots "from one end". This did not matter if you had way more jobs than job slots: All job slots (both local and remote) would fill up.
It did, however, matter if you had fewer jobs. So it was changed: GNU Parallel now hands jobs to sshlogins in a round-robin fashion, thus spreading them more evenly.
Unfortunately I do not recall in which version this change was made, but you can tell whether your version does it by running:
parallel -vv -t
and look at which sshlogin is being used.
Re: --memfree
You can build your own using --limit.
I am curious why you want different limits for different servers. The idea behind --memfree is that it is set to the amount of RAM that a single job takes. So if there is enough RAM for a single job, a new job should be started - no matter the server.
You clearly have another situation, so please explain more about it.
Re: upgrading
Look into parallel --embed.

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario, but the approach needs to scale to 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, do you want to limit it to 10 (a value which is easily exceeded by the _internal database, but good for our example).
Given the following excerpt from a tick script
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements',100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb, with the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows:
maxNumMeasurements: 10
A critical alert will be triggered if value (and hence numMeasurements) exceeds 10 AND the hostname tag equals influxdb, OR if value exceeds 100.
There is an example in the documentation handling scheduled downtimes using sideload.
Furthermore, I have created an example available on GitHub using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually and directly in Chronograf/Kapacitor is not feasible for a big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer object; the number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates) and provide an interface to the user in which we only expose the relevant variables. A service (written in Python) then combines the values for those variables with a tickscript and, using the Kapacitor API, deploys (updates, or deletes) the task on the Kapacitor server. This is automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
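To make that concrete, the deployment step of such a service could look roughly like this (a hypothetical sketch assuming Kapacitor's HTTP task API; the URL and the database/retention-policy values are placeholders):

# Deploy a rendered TICKscript as a Kapacitor task via the HTTP API.
import requests

KAPACITOR_URL = "http://kapacitor:9092"

def deploy_task(task_id, script, dbrps):
    # Create the task if it does not exist yet, otherwise update it in place.
    body = {
        "id": task_id,
        "type": "stream",
        "dbrps": dbrps,          # e.g. [{"db": "telegraf", "rp": "autogen"}]
        "script": script,        # TICKscript already rendered from the template
        "status": "enabled",
    }
    task_url = KAPACITOR_URL + "/kapacitor/v1/tasks/" + task_id
    if requests.get(task_url).status_code == 200:
        r = requests.patch(task_url, json=body)
    else:
        r = requests.post(KAPACITOR_URL + "/kapacitor/v1/tasks", json=body)
    r.raise_for_status()

def delete_task(task_id):
    requests.delete(KAPACITOR_URL + "/kapacitor/v1/tasks/" + task_id).raise_for_status()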

ML-Engine with GPUs workers errors

Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m as parameter server.
The model being trained is a CNN. However, there seems to be trouble with the workers.
This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because session checkpoints are being saved; however, this is nowhere near the speed it should be.
With complex_m workers, the model works. It just shows "waiting for the model to be ready" in the beginning (I assume this lasts until the master initializes the global variables; correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.
I didn't use tf.device() anywhere; I thought that in the cloud the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')

# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)

tf_config_json = json.loads(tf_config)

cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# If cluster information is empty run local
if job_name is None or task_index is None:
    return run('', True, *args, **kwargs)

cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
                         job_name=job_name,
                         task_index=task_index)

# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
    server.join()
    return
elif job_name in ['master', 'worker']:
    return run(server.target, job_name == 'master', *args, **kwargs)
Then I used tf.train.replica_device_setter before defining the main graph.
As the session I am using tf.train.MonitoredTrainingSession, which should handle the initialization of variables and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
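The setup follows the usual pattern, roughly like this (a simplified sketch, not the exact training code; build_graph, max_steps, and output_dir stand in for the real model code and parameters):

# Place variables on the ps tasks, ops on the local worker, then train
# through a MonitoredTrainingSession that handles init and checkpoints.
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    train_op, global_step = build_graph()        # model definition omitted

hooks = [tf.train.StopAtStepHook(last_step=max_steps)]
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(job_name == 'master'),
                                       checkpoint_dir=output_dir,
                                       hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)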
Variables to be initialized are all variables: https://i.stack.imgur.com/hAHPL.png
Optimizer: AdaDelta
I appreciate the help!
In the comments, you seem to have answered your own question (using cluster_spec in replica_device_setter). Allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.
GPUs are fairly powerful. You'll typically get higher throughput by getting a single machine with many GPUs rather than having many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine is much better than communicating with a parameter server on a remote machine).
The reason the GPUs are slower than the CPUs may be the extra overhead of copying data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, then this copy is negligible; your model may be doing too little on the GPU, so the overhead swamps the actual computation.
For more information about building high performance models, see this guide.
In the meantime, I recommend using a single machine with more GPUs to see if that helps:
{
  "scaleTier": "CUSTOM",
  "masterType": "complex_model_l_gpu",
  ...
}
Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
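In case it helps, the towers pattern usually looks something like this (a rough sketch, not your code; num_gpus, build_loss, next_batch, and optimizer are placeholders, and it assumes the towers share variables via variable scopes):

# One model replica ("tower") per GPU; gradients are averaged on the CPU.
tower_grads = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        loss = build_loss(next_batch())                      # each tower gets its own batch
        tower_grads.append(optimizer.compute_gradients(loss))

with tf.device('/cpu:0'):
    averaged = []
    for grads_and_vars in zip(*tower_grads):                 # group gradients per variable
        grads = [g for g, _ in grads_and_vars]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0),
                         grads_and_vars[0][1]))
    train_op = optimizer.apply_gradients(averaged)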
