spaCy PhraseMatcher running out of memory / utilizing 100% CPU

I am trying to create a PhraseMatcher with 20 million patterns. For example:
import random, string
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

terms = [''.join(random.choices(string.ascii_uppercase, k=4)) for i in range(20000000)]
nlp = English()
matcher_large = PhraseMatcher(nlp.vocab, attr='LOWER')
terms_large = list(nlp.tokenizer.pipe(terms))
matcher_large.add('Terms', None, *terms_large)
This is causing the kernel to die in Jupyter, or the process to get killed in the terminal. It was also running at 100% CPU. Is there a less memory-intensive way to create this matcher? I thought about creating matchers in chunks, but I don't want to end up with hundreds of matchers.

It's true that the PhraseMatcher may not be the best choice for this many patterns, but you can add patterns incrementally rather than creating a huge list up front and passing a likewise huge number of arguments at once to the add method:
for doc in nlp.tokenizer.pipe(terms):
    matcher.add("Terms", [doc])  # newer API
Jupyter notebooks often have a relatively low default memory limit, which is probably what you're running into.
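Putting the pieces together, here is a minimal sketch along those lines (the batch_size value is an arbitrary choice, not a tuned number); it streams the patterns through the tokenizer instead of materialising 20 million Doc objects in a list:
import random
import string
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# A generator, so the raw strings are never held in one giant list either.
terms = ("".join(random.choices(string.ascii_uppercase, k=4)) for _ in range(20000000))
for doc in nlp.tokenizer.pipe(terms, batch_size=10000):
    matcher.add("Terms", [doc])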

Related

Saving data in parallel in Julia

I am confronted with a problem when submitting many jobs to a cluster, where each job calculates some data and saves it (many variables, in a .jld file) to some drive, for example like this:
function f(savedir, pid, params)
    ...
    save(savedir * "$(pid).jld", result)
end
After the calculation I need to process the data and load each .jld file to access the variables individually. Even though the final reduction is rather small, this takes a lot of time. I thought about saving everything to one .jld file, but then I run into the problem that the file may be accessed by several jobs at the same time, since they run in parallel. I also thought about collecting the data in an out-of-core fashion using JuliaDB, but in the end I do not see why this should be any better. I know that this could be solved with a database server, but that seems like overkill for my problem. How do you deal with this kind of problem?
If the data is small, simply use the IOBuffer mechanism and send it from the workers to the master:
using Distributed, Serialization
addprocs(4)
@everywhere using Distributed, Serialization

rrs = @distributed (hcat) for i in 1:12
    b = IOBuffer()
    myres = (rand(), randn(), myid())  # emulates some big computations
                                       # that you are running
    serialize(b, myres)
    b.data
end
And here is some sample code deserializing the results:
julia> for i in 1:size(rrs, 2)
           res = deserialize(IOBuffer(@view rrs[:, i]))
           println(res)
       end
(0.8656737453513623, 1.0594978554855077, 2)
(0.6637467726391784, 0.35682413048990763, 2)
(0.32579653913039386, 0.2512902466296038, 2)
(0.3033490905926888, 1.7662416364260713, 3)
...
If your data is too big and your cluster is distributed, then you need to use some other orchestration mechanism. One possible lightweight solution that I sometimes use is this set of bash scripts: https://github.com/pszufe/KissCluster This tool is built around the following bash command, which is very useful for any file-based scenario:
nohup seq $start $end | xargs --max-args=1 --max-procs=$nproc julia run.jl &>> somelogfile.txt &
Nevertheless, when possible, consider using Julia's Distributed package.

How to ensure number of `partitions` is equally distributed across workers with dask and dask-cudf?

I am trying to do a basic ETL workflow on large files with dask-cudf across a large number of workers.
Problem:
Initially the scheduler assigns an equal number of partitions to be read by each worker, but during the pre-processing it tends to redistribute/shuffle them across workers.
The minimum number of partitions that a worker gets is 4 and the maximum is 19 (total partitions = approx. 300, num_workers = 22). This behavior causes problems downstream, as I want an equal distribution of partitions across workers.
Is there a way to prevent this behavior ?
I thought the settings below would help with that, but they do not.
# limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.set({'distributed.scheduler.bandwidth': 1})
Workflow being done:
read
fill-na
down-casting/other logic
df = dask_cudf.read_csv(path=`big_files`,
                        names=names,
                        delimiter='\t',
                        dtype=read_dtype_ls,
                        chunksize=chunksize)
df = df.map_partitions(lambda df: df.fillna(-1))

def transform_col_int64_to_int32(df, columns):
    """
    This function casts int64 columns to int32;
    we are using this to transform int64s to int32s, and the overflows seem to be consistent.
    """
    for col in columns:
        df[col] = df[col].astype(np.int32)
    return df

df = df.map_partitions(transform_col_int64_to_int32, cat_col_names)
df = df.persist()
Dask schedules where tasks run based on a number of factors, including data dependencies, runtime, memory use, and so on. Typically the answer to these questions is "just let it do its thing". The most common cause of poor scheduling is having too few chunks.
However, if you explicitly need a more rebalanced distribution then you can try the Client.rebalance method.
from dask.distributed import wait

wait(df)
client.rebalance(df)
However, beware that rebalance is not as robust as other Dask operations. It's best to do it at a time when there isn't a ton of other work going on (hence the call to dask.distributed.wait above).
Also, I would turn on work stealing. Work stealing is another name for load balancing.
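If it was switched off as in the question, re-enabling it is just the same config key flipped back (work stealing is on by default in the distributed scheduler):
import dask

# Re-enable work stealing so the scheduler can load-balance tasks across workers.
dask.config.set({'distributed.scheduler.work-stealing': True})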

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe. I am trying to run this function in parallel using dask.
I append the delayed dataframe objects to a list; however, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small factor for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.
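For example, a minimal sketch (the worker counts are assumptions, not tuned values) of running the same delayed graph on the distributed scheduler with processes instead of threads:
from dask.distributed import Client

# Processes side-step the GIL for pure-Python merge work; the trade-off is
# that data must be serialized between workers, the cost mentioned above.
client = Client(processes=True, n_workers=4, threads_per_worker=1)

result = df.compute()  # `df` is the delayed reduce(...) object built above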

Dask performances: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (each one with its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves
df = list(df.groupby("key"))
and then applying my_fun with multiprocessing. The performance, despite the huge RAM usage, is pretty good on my machine but terrible on Google Cloud Compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
df.groupby("key").apply(my_fun).to_frame().compute(get=get)
As I didn't set the index, df.known_divisions is False.
The resulting graph is [task graph image] and I don't understand whether what I see is a bottleneck or not.
Questions:
Is it better to have df.npartitions as a multiple of ncpu, or does it not matter?
From this it seems that it is better to set the index to key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions, if you are grouping over a column that is not nicely split between partitions and applying anything other than simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as an explicit step, or you get the implicit shuffle apparent in your task graph. This is a many-to-many operation: each aggregation task needs input from every original partition, hence the bottleneck. There is no getting around that.
As for the number of partitions, yes, you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle); but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.
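To make the explicit set_index option concrete, here is a minimal sketch along the lines of the question's own guess (my_fun and the column names come from the question; this is not a benchmarked recommendation):
df["key2"] = df["key"]       # keep "key" available as a plain column
df = df.set_index("key2")    # one explicit shuffle; divisions become known

result = df.groupby("key").apply(my_fun).compute()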

ML-Engine with GPUs workers errors

Hi, I am using ML Engine with a custom tier made up of a complex_m master, four workers each with a GPU, and one complex_m machine as the parameter server.
The model is training a CNN. However, there seems to be trouble with the workers.
This is an image of the logs https://i.stack.imgur.com/VJqE0.png.
The master still seems to be working because session checkpoints are being saved; however, this is nowhere near the speed it should be.
With complex_m workers, the model works. It just reports waiting for the model to be ready in the beginning (I assume until the master initializes the global variables, correct me if I am wrong) and then works normally. With GPUs, however, there seems to be a problem with the task.
I didn't use tf.device() anywhere; in the cloud I thought the device is set automatically if a GPU is available.
I followed the Census example and loaded the TF_CONFIG environment variable.
tf.logging.info('Setting up the server')
tf_config = os.environ.get('TF_CONFIG')

# If TF_CONFIG is not available run local
if not tf_config:
    return run('', True, *args, **kwargs)

tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
job_name = tf_config_json.get('task', {}).get('type')
task_index = tf_config_json.get('task', {}).get('index')

# If cluster information is empty run local
if job_name is None or task_index is None:
    return run('', True, *args, **kwargs)

cluster_spec = tf.train.ClusterSpec(cluster)
server = tf.train.Server(cluster_spec,
                         job_name=job_name,
                         task_index=task_index)

# Wait for incoming connections forever
# Worker ships the graph to the ps server
# The ps server manages the parameters of the model.
if job_name == 'ps':
    server.join()
    return
elif job_name in ['master', 'worker']:
    return run(server.target, job_name == 'master', *args, **kwargs)
Then I used tf.train.replica_device_setter before defining the main graph.
As the session I am using tf.train.MonitoredTrainingSession, which should handle variable initialization and checkpoint saving. I do not know why the workers are saying that the variables are not initialized.
Variables to be initialized are all variables: https://i.stack.imgur.com/hAHPL.png
Optimizer: AdaDelta
I appreciate the help!
In the comments, you seem to have answered your own question (using cluster_spec in replica_setter). Allow me to address the issue of throughput of a cluster of CPUs vs. a cluster of GPUs.
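For completeness, that fix looks roughly like this (a sketch; cluster_spec is the ClusterSpec built in the code above, and the graph body is omitted):
import tensorflow as tf

# Pin variables to the ps job while ops stay on the local worker/master.
with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    # ... build the main graph here ...
    pass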
GPUs are fairly powerful. You'll typically get higher throughput by getting a single machine with many GPUs rather than having many machines each with a single GPU. That's because the communication overhead becomes a bottleneck (the bandwidth and latency to main memory on the same machine is much better than communicating with a parameter server on a remote machine).
The reason the GPUs are slower than the CPUs may be the extra overhead of copying data from main memory to the GPU and back. If you're doing a lot of parallelizable computation, then this copy is negligible. Your model may be doing too little work on the GPU, so the overhead swamps the actual computation.
For more information about building high performance models, see this guide.
In the meantime, I recommend using a single machine with more GPUs to see if that helps:
{
"scaleTier": "CUSTOM",
"masterType": "complex_model_l_gpu",
...
}
Just beware that you'll have to modify your code to assign ops to the right GPUs, probably using towers.
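A rough sketch of tower-style placement (TF 1.x; num_gpus, build_tower_loss, next_batch, and optimizer are placeholders, not part of the original code):
import tensorflow as tf

tower_grads = []
for i in range(num_gpus):                                  # one tower per GPU
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('model', reuse=(i > 0)):    # share weights across towers
            loss = build_tower_loss(next_batch())          # placeholder helpers
            tower_grads.append(optimizer.compute_gradients(loss))
# ...average tower_grads and apply the update on the parameter device...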
