Saving data in parallel in julia - save

I am confronted with a problem when submitting many jobs to a cluster, where each job is calculating some data and saving it (with many variables in terms of a .jld file) to some drive, for example like this
function f(savedir, pid, params)
...
save(savedir*"$(pid).jld",result)
end
After the calculation I need to process the data and load each .jld file to access the variables individually. Even though the final reduction is rather small, this takes a lot of time. I though about saving it all to one .jld file, but in that case I run into the problem that the file is potentially accessed at the same time, since the jobs run in parallel. Further I though about collecting the data in a out-of-core fashion using juliaDB, but in the end I do not see why this should be any better. I know that this could be solved with some database server, but this seems to be an overkill for my problem. How do you deal with this kind of problems?
Best,
v.

If the data is small simply use the IOBuffer mechanism and send it from workers to the master:
using Distributed, Serialization
addprocs(4)
#everywhere using Distributed, Serialization
rrs = #distributed (hcat) for i in 1:12
b=IOBuffer()
myres = (rand(), randn(), myid()) # emulates some big computations
# that you are running
serialize(b,myres)
b.data
end
And here is a sample code deserializing back the results:
julia> for i in 1:size(rrs,2)
res = deserialize(IOBuffer(#view rrs[:, i]))
println(res)
end
(0.8656737453513623, 1.0594978554855077, 2)
(0.6637467726391784, 0.35682413048990763, 2)
(0.32579653913039386, 0.2512902466296038, 2)
(0.3033490905926888, 1.7662416364260713, 3)
...
If your data is too big and your cluster is distributed than you need to use some other orchestration mechanism. One possible lightweight solution that I use sometimes is the following bunch of bash codes: https://github.com/pszufe/KissCluster This tool is a set of bash script built around the following bash command very useful for any file-based scenario:
nohup seq $start $end | xargs --max-args=1 --max-procs=$nproc julia run.jl &>> somelogfile.txt &
Nevertheless when possible consider using Julia's Distributed package.

Related

How to run a python script in the background on azure

I have a uni project in which I have to run a number of machine learning algorithms like SVM, ME, Naive bayes, etc... and perform a grid search on them, to find the optimal sets of hyper-parameters. Running all these would take an exceedingly long amount of time (48-168 hours total but run- in batches) and considering my computer becomes more or less unusable while I run them, I was attempting to find a solution which allowed me to run my code externally. The scripts I have to run are in python and my plan was to run them on azure to make use of its "Azure for students" $100 credit.
My original plan was to use azure's ml notebook section and then run the python scripts in the terminal they provide. My problem with this route is as far as I can tell, when the browser closes, the computation stops which is a problem. I looked into it, and I found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown', to disconnect the process from the shell but I thought there should definitely be a better way to do it. (I also wasn't sure how this worked in my case where there were 8 processes running at once using gridsearchcv's n_jobs=-1 feature).
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the excel file pipe into the python script as a data frame but this implantation (and all the others I've tried) isn't working.
My question first question is, how do I get the excel data to pipe into the python script properly?
My second question is, is there a better way to go about doing this? Would running it on the shell be an easier way to do it? If so, how do ensure it runs while my browser is closed? Are there other services that would be better? My main metrics for this are price (Cheap) and time limit (ability to run for long time) but any suggestions would be greatly appreciated.
I also tried using google colab, this worked but it felt slower than running on my computer.
To run a grid search with AzureML, you would use the Sweep job. The simplest way to kick of a Sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
command: >-
python hello-sweep.py
--A ${{inputs.A}}
--B ${{search_space.B}}
--C ${{search_space.C}}
code: src
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu#latest
inputs:
A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
B:
type: choice
values: ["hello", "world", "hello_world"]
C:
type: uniform
min_value: 0.1
max_value: 1.0
objective:
goal: minimize
primary_metric: random_metric
limits:
max_total_trials: 4
max_concurrent_trials: 2
timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create max_total_trials number of jobs for different parameter combinations as defined in the search_space governed by the sampling_algorithm, which can be random, grid or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is that command that is executed, code is a folder on the local machine that contains the script/program you want to run and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu#latest is one that is predefined in AzureML, but you can also create your own.
If you prefer Python, here is the same thing done in Python.
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Can Dask computational graphs keep intermediate data so re-compute is not necessary?

I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes requires re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same, only the processing changes. The system should instead use the existing raw data that has already been loaded. I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed. It looks like the caching system in Dask is close to what I need, but was designed with a bit of a different use-case in mind. I see there is a persist method, but I'm not sure if that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The builtin functools.lru_cache will cache every unique input. The check on memory is with the maxsize parameter, which controls how many input/output pairs are stored.
Using persist in the right places will compute that object as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. It will not require re-running computation to get the object in later computation; functionally, it's the same as lru_cache.
For example, this code will read from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommented the line will mean this code only reads from disk once because the task graph for the CSV is computed and the value is stored in memory. For your application is sounds like I would use persist intelligently: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
What if two smoothing parameters want to be visualized? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = client.compute(df.x.min(), df.x.max())
This will share the task graph for min and max so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask Distributed has a smart caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.

Dask distributed perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2GB) from Gcloud storage into Dask Dataframe then process it with apply() and after saving the result to parquet file to Gcloud.
During loading files from Gcloud storage, a cluster memory usage is increasing to about 3-4GB. Then workers (each worker has 2GB of RAM) are terminated/restarted and some tasks getting lost,
so cluster starts computing the same things in a circle.
I removed apply() operation and leave only read_parquet() to test
if my custom code causes a trouble, but the problem was the same, even with just single read_parquet() operation. This is a code:
client = Client('<ip>:8786')
client.restart()
def command():
client = get_client()
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute()
x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all necessary commands to avoid problems with gcsfs authentication inside a cluster
After some investigation, I understood that problem could be in .compute() which returns all data to a process, but this process (my command function) is running on a worker. Because of that, a worker doesn't have enough RAM, crashes and lose all computed task which triggers tasks re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and without even returning data from a cluster write it back to Gcloud storage in parquet format.
So, simply I want to keep data on a cluster and not return it back. Just compute and save data somewhere else.
After reading Dask distributed docs, I have found client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you, please, help me with client.persist() and client.compute() methods for my example
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines

Talend- Memory issues. Working with big files

Before admins start to eating me alive, I would like to say to my defense that I cannot comment in the original publications, because I do not have the power, therefore, I have to ask about this again.
I have issues running a job in talend (Open Studio for BIG DATA!). I have an archive of 3 gb. I do not consider that this is too much since I have a computer that has 32 GB in RAM.
While trying to run my job, first I got an error related to heap memory issue, then it changed for a garbage collector error, and now It doesn't even give me an error. (just do nothing and then stops)
I found this SOLUTIONS and:
a) Talend performance
#Kailash commented that parallel is only on the condition that I have to be subscribed to one of the Talend Platform solutions. My comment/question: So there is no other similar option to parallelize a job with a 3Gb archive size?
b) Talend 10 GB input and lookup out of memory error
#54l3d mentioned that its an option to split the lookup file into manageable chunks (may be 500M), then perform the join in many stages for each chunk. My comment/cry for help/question: how can I do that, I do not know how to split the look up, can someone explain this to me a little bit more graphical
c) How to push a big file data in talend?
just to mention that I also went through the "c" but I don't have any comment about it.
The job I am performing (thanks to #iMezouar) looks like this:
1) I have an inputFile MySQLInput coming from a DB in MySQL (3GB)
2) I used the tFirstRows to make it easier for the process (not working)
3) I used the tSplitRow to transform the data form many simmilar columns to only one column.
4) MySQLOutput
enter image description here
Thanks again for reading me and double thanks for answering.
From what I understand, your query returns a lot of data (3GB), and that is causing an error in your job. I suggest the following :
1. Filter data on the database side : replace tSampleRow by a WHERE clause in your tMysqlInput component in order to retrieve fewer rows in Talend.
2. MySQL jdbc driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.

Can dask work with an endless streaming input

I understand that dask work well in batch mode like this
def load(filename):
...
def clean(data):
...
def analyze(sequence_of_data):
...
def store(result):
with open(..., 'w') as f:
f.write(result)
dsk = {'load-1': (load, 'myfile.a.data'),
'load-2': (load, 'myfile.b.data'),
'load-3': (load, 'myfile.c.data'),
'clean-1': (clean, 'load-1'),
'clean-2': (clean, 'load-2'),
'clean-3': (clean, 'load-3'),
'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
'store': (store, 'analyze')}
from dask.multiprocessing import get
get(dsk, 'store') # executes in parallel
Can we use dask to process streaming channel , where the number of chunks is unknown or even endless?
Can it perform the computation in an incremental way. for example can the 'analyze' step above could process ongoing chunks?
must we call the "get" operation only after all the data chunks are known , could we add new chunks after the "get" was called
Edit: see newer answer below
No
The current task scheduler within dask expects a single computational graph. It does not support dynamically adding to or removing from this graph. The scheduler is designed to evaluate large graphs in a small amount of memory; knowing the entire graph ahead of time is critical for this.
However, this doesn't stop one from creating other schedulers with different properties. One simple solution here is just to use a module like conncurrent.futures on a single machine or distributed on multiple machines.
Actually Yes
The distributed scheduler now operates fully asynchronously and you can submit tasks, wait on a few of them, submit more, cancel tasks, add/remove workers etc. all during computation. There are several ways to do this, but the simplest is probably the new concurrent.futures interface described briefly here:
http://dask.pydata.org/en/latest/futures.html

Resources