I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive Bayes, etc., and perform a grid search on them to find the optimal sets of hyper-parameters. Running all of these would take an exceedingly long time (48-168 hours total, but run in batches), and considering my computer becomes more or less unusable while I run them, I was trying to find a solution that allowed me to run my code externally. The scripts I have to run are in Python, and my plan was to run them on Azure to make use of its "Azure for students" $100 credit.
My original plan was to use Azure's ML notebook section and then run the Python scripts in the terminal they provide. My problem with this route is that, as far as I can tell, the computation stops when the browser closes. I looked into it and found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown' to detach the process from the shell, but I thought there should definitely be a better way to do it. (I also wasn't sure how this would work in my case, where there are 8 processes running at once using GridSearchCV's n_jobs=-1 feature.)
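For context, each script boils down to something like the following (a simplified sketch with placeholder file and column names; the real scripts cover several models and much larger grids):

# Simplified sketch of one script -- "training_data.xlsx" and "label" are placeholders
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

df = pd.read_excel("training_data.xlsx")
X, y = df.drop(columns=["label"]), df["label"]

param_grid = {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)  # n_jobs=-1 uses every core
search.fit(X, y)

pd.DataFrame(search.cv_results_).to_excel("svm_results.xlsx", index=False)
print(search.best_params_, search.best_score_)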
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the Excel file pipe into the Python script as a data frame, but this implementation (and all the others I've tried) isn't working.
My first question is: how do I get the Excel data to pipe into the Python script properly?
My second question is: is there a better way to go about doing this? Would running it in the shell be an easier way to do it? If so, how do I ensure it keeps running while my browser is closed? Are there other services that would be better? My main metrics are price (cheap) and time limit (the ability to run for a long time), but any suggestions would be greatly appreciated.
I also tried using Google Colab; this worked, but it felt slower than running on my computer.
To run a grid search with AzureML, you would use a Sweep job. The simplest way to kick off a sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create max_total_trials jobs for different parameter combinations, as defined in the search_space and governed by the sampling_algorithm, which can be random, grid or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is the command that is executed, code is a folder on the local machine that contains the script/program you want to run, and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest is one that is predefined in AzureML, but you can also create your own.
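For illustration, the trial script itself only needs to accept those command-line arguments and log the metric named under objective.primary_metric. A minimal, hypothetical src/hello-sweep.py could look like this (logging via mlflow is an assumption; any mechanism that gets random_metric into the run will do):

# Hypothetical src/hello-sweep.py: parse the swept parameters and log the
# metric named in the sweep's objective (here via mlflow)
import argparse
import random

import mlflow

parser = argparse.ArgumentParser()
parser.add_argument("--A", type=float, required=True)
parser.add_argument("--B", type=str, required=True)
parser.add_argument("--C", type=float, required=True)
args = parser.parse_args()

# ... the real training/evaluation with args.A, args.B, args.C goes here ...
score = random.random()  # stand-in for a real evaluation result

mlflow.log_metric("random_metric", score)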
If you prefer Python, here is the same thing done in Python.
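Roughly, with the v2 Python SDK (azure-ai-ml) the same sweep can be sketched as follows; the workspace details are placeholders, and note that in the SDK form the swept parameters are referenced as inputs rather than search_space:

# Rough sketch of the same sweep with the v2 Python SDK (azure-ai-ml);
# subscription/resource group/workspace names are placeholders
from azure.ai.ml import MLClient, command
from azure.ai.ml.sweep import Choice, Uniform
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# the trial: same command, code folder and environment as in the YAML,
# but the swept parameters are plain inputs here
trial = command(
    code="src",
    command="python hello-sweep.py --A ${{inputs.A}} --B ${{inputs.B}} --C ${{inputs.C}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    inputs={"A": 0.5, "B": "hello", "C": 0.5},
    compute="cpu-cluster",
)

# bind the swept inputs to distributions, then turn the command into a sweep job
trial_for_sweep = trial(
    B=Choice(values=["hello", "world", "hello_world"]),
    C=Uniform(min_value=0.1, max_value=1.0),
)
sweep_job = trial_for_sweep.sweep(
    sampling_algorithm="random",
    primary_metric="random_metric",
    goal="minimize",
)
sweep_job.set_limits(max_total_trials=4, max_concurrent_trials=2, timeout=3600)

returned_job = ml_client.jobs.create_or_update(sweep_job)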
See here for a blog post on How to do hyperparameter tuning using Azure ML.
I am confronted with a problem when submitting many jobs to a cluster, where each job calculates some data and saves it (with many variables, as a .jld file) to some drive, for example like this:
function f(savedir, pid, params)
    ...
    save(savedir*"$(pid).jld", result)
end
After the calculation I need to process the data and load each .jld file to access the variables individually. Even though the final reduction is rather small, this takes a lot of time. I thought about saving it all to one .jld file, but in that case I run into the problem that the file is potentially accessed at the same time, since the jobs run in parallel. Further, I thought about collecting the data in an out-of-core fashion using JuliaDB, but in the end I do not see why this should be any better. I know that this could be solved with some database server, but that seems to be overkill for my problem. How do you deal with this kind of problem?
Best,
v.
If the data is small simply use the IOBuffer mechanism and send it from workers to the master:
using Distributed, Serialization
addprocs(4)
@everywhere using Distributed, Serialization
rrs = @distributed (hcat) for i in 1:12
    b = IOBuffer()
    myres = (rand(), randn(), myid()) # emulates some big computations
                                      # that you are running
    serialize(b, myres)
    b.data
end
And here is a sample code deserializing back the results:
julia> for i in 1:size(rrs,2)
           res = deserialize(IOBuffer(@view rrs[:, i]))
           println(res)
       end
(0.8656737453513623, 1.0594978554855077, 2)
(0.6637467726391784, 0.35682413048990763, 2)
(0.32579653913039386, 0.2512902466296038, 2)
(0.3033490905926888, 1.7662416364260713, 3)
...
If your data is too big and your cluster is distributed, then you need to use some other orchestration mechanism. One possible lightweight solution that I sometimes use is https://github.com/pszufe/KissCluster. This tool is a set of bash scripts built around the following bash command, which is very useful for any file-based scenario:
nohup seq $start $end | xargs --max-args=1 --max-procs=$nproc julia run.jl &>> somelogfile.txt &
Nevertheless, when possible, consider using Julia's Distributed package.
I was trying to move from Gatling to Locust (Python is a nicer language) for load tests. In Gatling I can get data for charts like 'Requests per second over time', 'Response time percentiles over time', etc. (https://gatling.io/docs/2.3/general/reports/), and the really useful 'Responses per second over time'.
In Locust I can see the two reports (requests, distribution), where (if I understand it correctly) 'Distribution' is the one that does 'over time'? But I can't see where things started failing, or the early history of that test.
Is Locust able to provide 'over time' data in a CSV format (or something else easily graph-able)? If so, how?
Looked through logs, can output the individual commands, but it would be a pain to assemble them (it would push the balance toward 'just use Gatling')
Looked over https://buildmedia.readthedocs.org/media/pdf/locust/latest/locust.pdf but not spotting it
I can create (and have created) a loop that triggers the locust call at incremental intervals:
increment_user_count = [1, 10, 100, 1000]
# for total_users in range(user_min, user_max, increment_count):
for users in increment_user_count:
    [...]
    system(assembled_command)
And that works... but it loses the whole advantage of setting a spawn rate, and would be painful for gradually incrementing up to a large number (then having to assemble all the files back together)
Currently executing with something like
locust -f locust_base_testing.py --no-web -c 1000 -r 2 --run-time 8m30s --only-summary --csv=output_stats_20190405-130352_1000
(need to use this in automation so Web UI is not a viable use-case)
I would expect a flag, in the call or in some form of setup, that outputs the summary at regular ticks. Basically, I'd expect (with --no-web) to get the data I could use to replicate the graph the web version seems to know about:
Actual: just one final summary of the overall test (and logs per individual call)
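(For reference, this is the kind of per-request logging I'd otherwise have to bolt on myself to rebuild the over-time graphs - a rough sketch that assumes the older Locust API where listeners are attached via events.request_success, with a made-up output file name:)

# Rough sketch: log every request with a timestamp so per-second stats can be
# rebuilt afterwards. Assumes the older events.request_success listener API;
# newer Locust releases use @events.request.add_listener instead.
import csv
import time

from locust import events

_out = open("requests_over_time.csv", "w", newline="")
_writer = csv.writer(_out)
_writer.writerow(["timestamp", "request_type", "name", "response_time_ms", "response_length"])

def _log_success(request_type, name, response_time, response_length, **kwargs):
    _writer.writerow([time.time(), request_type, name, response_time, response_length])

events.request_success += _log_success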
I'm trying to break fusion with a GroupByKey. This creates one huge window, and since my job is big I'd rather start emitting output sooner.
With the direct runner, using something like what I found here, it seems to work. However, when run on Cloud Dataflow it seems to batch the GBK together and not emit output until the source nodes have "succeeded".
I'm doing a bounded/batch job. I'm extracting the contents of archive files and then writing them to gcs.
Everything works, except it takes longer than I expected and CPU utilization is low. I suspect that this is due to fusion -- my hypothesis is that the extraction is fused to the write operation, so there's a pattern of extraction (higher CPU) followed by lower CPU while we're doing network calls, and back again.
The code looks like:
.apply("Window",
Window.<MyType>into(new GlobalWindows())
.triggering(
Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardSeconds(5))))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes()
)
.apply("Add key", MapElements...)
.apply(GroupByKey.create())
Locally I verify using debug logs, so I can see that work is being done after the GBK. The gap between the first extraction finishing and the first post-GBK op usually reflects the 5s delay (or the other values I change it to: 1, 5, 10, 20, 30).
On GCP I verify by looking at the pipeline structure and I can see that everything after the GBK is "not started" and the output collection of the GBK is empty ("-") while the input collection has millions of elements.
Edit:
This is on Beam v2.10.0.
The extraction is being done by a SplittableDoFn (not sure if this is relevant).
It looks like the answer you referred to was for a streaming pipeline (unbounded input). For a batch pipeline processing a bounded input, GroupByKey will not emit until all data for a given key has been processed. Please see here for more details.
I set up a distributed OpenWhisk installation on a few virtual machines as described here: https://github.com/apache/incubator-openwhisk/blob/master/ansible/README_DISTRIBUTED.md (I also had to install some dependencies on the VMs manually, because they were expected but not installed by default).
My host file looks like this:
; the first parameter in a host is the inventory_hostname
; used for local actions only
ansible ansible_connection=local
[registry]
xxx.xx.xx.173 ansible_host=xxx.xx.xx.173
[edge]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[apigateway:children]
edge
[redis:children]
edge
[controllers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[kafkas]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
[zookeepers:children]
kafkas
[invokers]
xxx.xx.xx.174 ansible_host=xxx.xx.xx.174
xxx.xx.xx.175 ansible_host=xxx.xx.xx.175
[db]
xxx.xx.xx.176 ansible_host=xxx.xx.xx.176
Everything seems to be running fine in general; I can create actions, invoke them, etc.
On the two VMs that host the invokers and controllers, I ran htop to check the CPU usage and tried running a Python script that invokes the same action (a prime number calculation, which takes a while for a large enough input) multiple times in parallel.
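The invocation script is roughly along these lines (a simplified sketch; the action name "primes" and its parameter are placeholders for the real action):

# Simplified sketch of the parallel invocation script; "primes" and the "n"
# parameter are placeholders for the real action
import subprocess
from concurrent.futures import ThreadPoolExecutor

def invoke(i):
    # --blocking waits for the activation result; -p passes a parameter
    return subprocess.run(
        ["wsk", "action", "invoke", "primes", "--blocking", "-p", "n", "2000000"],
        capture_output=True, text=True,
    )

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(invoke, range(20)))

print(sum(r.returncode == 0 for r in results), "invocations succeeded")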
The result seems to be that the first invoker works at 100% CPU while the calculation is happening, while the second one idles at 5-7% CPU. I also tried different ways of distributing components across multiple VMs, e.g. putting the invokers on two machines and one controller separately on another machine, but the result is the same.
What could be the reason for this? And what would be the proper use case to make OpenWhisk get the second invoker involved?
In a small deployment, a fraction of the invoker pool is allocated strictly for docker (blackbox) actions. This is called the blackbox fraction, and it is 10% by default with a minimum of 1 invoker - with only two invokers, that reserves one of them, which is why you see one loaded invoker and one idle.
This recent pull request allows all the invokers to be used for small numbers of invokers (up to the reciprocal of the blackbox fraction): https://github.com/apache/incubator-openwhisk/pull/3751
I have a cluster app that uses a distributed Redis back-end, with dynamically generated Lua scripts dispatched to the redis instances. The Lua component scripts can get fairly complex and have a significant runtime, and I'd like to be able to profile them to find the hot spots.
SLOWLOG is useful for telling me that my scripts are slow, and exactly how slow they are, but that's not my problem. I know how slow they are, I'd like to figure out which parts of them are slow.
The redis EVAL docs are clear that redis does not export any timekeeping functions to lua, which makes it seem like this might be a lost cause.
So, short of a custom fork of Redis, is there any way to tell which parts of my Lua script are slower than others?
EDIT
I took Doug's suggestion and used debug.sethook - here's the hook routine I inserted at the top of my script:
redis.call('del', 'line_sample_count')
local function profile()
    local line = debug.getinfo(2)['currentline']
    redis.call('zincrby', 'line_sample_count', 1, line)
end
debug.sethook(profile, '', 100)
Then, to see the hottest 10 lines of my script:
ZREVRANGE line_sample_count 0 9 WITHSCORES
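(The same top-10 query from Python, if you are post-processing the profile there - a sketch that assumes redis-py and a local Redis instance:)

# Same top-10 query via redis-py (an assumption; any client exposing ZREVRANGE works)
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
for line, samples in r.zrevrange("line_sample_count", 0, 9, withscores=True):
    print(f"line {line}: {int(samples)} samples")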
If your scripts are processing bound (not I/O bound), then you may be able to use the debug.sethook function with a count hook:
The count hook: is called after the interpreter executes every count instructions. (This event only happens while Lua is executing a Lua function.)
You'll have to build a profiler based on the counts you receive in your callback.
The PepperfishProfiler would be a good place to start. It uses os.clock which you don't have, but you could just use hook counts for a very crude approximation.
This is also covered in PiL 23.3 – Profiles
In standard Lua you can't - the built-in time function only returns seconds. So there are two options available: you either write your own Lua extension DLL to return the time in msec, or:
You can do a basic benchmark using a millisecond-resolution time. You can access the current millisecond time with LuaSocket. Though this adds a dependency to your project, it's an effective way to do trivial benchmarking.
require "socket"
t = socket.gettime();