I'm trying to use dask bag to word-count 30GB of JSON files, following the tutorial from the official site: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html
It still doesn't work. My single machine has 32GB of memory and an 8-core CPU.
My code is below. Even processing a 10GB file doesn't work: it runs for a couple of hours without any output and then Jupyter crashes. I tried on both Ubuntu and Windows and get the same problem. So I wonder: can dask bag process data that doesn't fit in memory, or is my code incorrect?
The test data is from http://files.pushshift.io/reddit/comments/
import dask.bag as db
import json
b = db.read_text(r'D:\RC_2015-01\RC_2012-04')
records = b.map(json.loads)
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f
Try setting a blocksize in the 10MB range when reading from the single file to break it up a bit.
In [1]: import dask.bag as db
In [2]: b = db.read_text('RC_2012-04', blocksize=10000000)
In [3]: %time b.count().compute()
CPU times: user 1.22 s, sys: 56 ms, total: 1.27 s
Wall time: 20.4 s
Out[3]: 19044534
Also, as a warning, you create a bag records but then don't do anything with it. You might want to remove that line.
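For reference, a minimal sketch of the corrected word-count pipeline, with the unused records bag dropped and a blocksize set (the path and blocksize are simply the ones from the question and answer above):

import dask.bag as db

# ~10 MB blocks so each partition stays comfortably within memory
b = db.read_text(r'D:\RC_2015-01\RC_2012-04', blocksize=10_000_000)

# word frequencies over the raw JSON lines, as in the original code
result = b.str.split().flatten().frequencies().topk(10, key=lambda x: x[1])
top10 = result.compute()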
I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
def standardize_smiles(smiles):
    if smiles is None:
        return None
    try:
        mol = Chem.MolFromSmiles(smiles)
        # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # if many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # try to neutralize molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
        # note that no attempt is made at reionization at this step
        # nor at ionization at some pH (rdkit has no pKa calculator)
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogues, etc.
        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
        return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    except Exception:
        return False
standardize_smiles('CCC')
'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd
N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC']*N})
smiles_test
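The timed call itself isn't shown here; presumably it was a plain pandas apply along these lines (an assumption based on the timing output below):

%time smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)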
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this works with the dummy data, it does not with the real data, where the SMILES strings are a bit more complicated than the ones in the dummy data.
It seems swifter first needs some time to prepare the parallel execution and only uses one core, but then uses more cores. However, for the real data, it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as dask, ray, modin, swifter.
Is there something I'm missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single computer (with multiple cores)? Or is there an issue with the RDKit library that I am using that makes it difficult to parallelize the above function?
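For what it's worth, here is a minimal sketch of forcing process-based parallelism with dask.dataframe; the partition count and the use of map_partitions with the processes scheduler are assumptions, not something from the original post:

import dask.dataframe as dd

# split the pandas frame into one partition per core
ddf = dd.from_pandas(smiles_test, npartitions=8)

# apply per partition; scheduler="processes" runs each partition in a
# separate worker process, so a GIL-holding function cannot serialize them
standardized = ddf['smiles'].map_partitions(
    lambda s: s.apply(standardize_smiles), meta=('smiles', 'object')
).compute(scheduler='processes')

smiles_test['standardized_smiles'] = standardized.values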
I have 3 machines, each with 16 cores and 32GB of RAM.
One of the machines is both the scheduler and a worker.
When I ran this code:
from distributed import Client
client = Client('ip')
import dask.array as da
x = da.random.random((400000,400000), chunks=(10000, 10000))
y = da.exp(x).sum()
y.compute()
It ran fast and looked efficient.
But when I tried to read parquet files, it ran very slowly and the warning message was:
distributed.core - INFO - Event loop was unresponsive in Worker for 3.27s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
I tried many chunk sizes ('8MB', '16MB', '32MB', '64MB', '128MB', '256MB'), but it behaves the same, so I guess it's the GIL lock.
import dask.dataframe as dd
"files = 152 parquet files of 150MB in average"
ddf = dd.read_parquet(files,parse_dates=['event_time'],chunksize=chunk)
df = ddf.groupby('uid').agg({'article_id':'count'})
df = df.compute()
distributed version - '2.20.0'
pandas version - '1.2.4'
OS - Centos7
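A minimal sketch of one variation worth trying for the groupby above, assuming only uid and article_id are needed for the aggregation (column names taken from the snippet; parquet column pruning reduces how much data each worker has to deserialize):

import dask.dataframe as dd

# read only the columns the aggregation needs
ddf = dd.read_parquet(files, columns=['uid', 'article_id'])
df = ddf.groupby('uid').agg({'article_id': 'count'}).compute()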
I have 7 csv files of 8 GB each and need to convert them to parquet.
Memory usage went up to 100 GB and I had to kill the process.
I tried with distributed Dask as well. The memory was limited to 12 GB, but no output was produced for a long time.
FYI, with traditional pandas using chunking plus a producer/consumer pattern I was able to convert them in 30 minutes.
What am I missing with the Dask processing?
def ProcessChunk(df, output_parquet_file, chunksize, fPQ_Schema, fCSVCols, fColTypes):
    df.to_parquet(output_parquet_file)

for factfile in fArrFileList:
    df = dd.read_csv(factfile, blocksize="100MB",
                     dtype=fColTypes, header=None, sep='|', names=fCSVCols)
    result = ProcessChunk(df, output_parquet_file, chunksize, fPQ_Schema, fCSVCols, fColTypes)
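For reference, the chunked-pandas baseline mentioned above would look roughly like this (a sketch only: the file name and chunk size are made up, and fCSVCols / fColTypes are reused from the snippet):

import pandas as pd

# stream the CSV in fixed-size chunks and write one parquet file per chunk
reader = pd.read_csv('factfile.csv', sep='|', header=None,
                     names=fCSVCols, dtype=fColTypes, chunksize=1_000_000)
for i, chunk in enumerate(reader):
    chunk.to_parquet(f'factfile_part_{i:05d}.parquet', index=False)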
Thanks all for the suggestions; map_partitions worked.
df = dd.read_csv(filename, blocksize="500MB",
dtype=fColTypes, header=None, sep='|',names=fCSVCols)
df.map_partitions(DoWork,output_parquet_file, chunksize, Schema, CSVCols, fColTypes).compute(num_workers=2)
But the same approach didn't work well on a Dask distributed local cluster; it only worked in local cluster mode when the csv size was < 100 MB.
I had a similar problem and found that using dask to split into smaller parquet files is very slow and will eventually fail. If you have access to a Linux terminal you can use parallel or split; for an example of their usage, check the answers here.
My workflow supposes your files are called file1.csv, ..., file7.csv and are stored in data/raw. I'm assuming you are running the terminal commands from your notebook, which is why I'm adding the %%bash magic.
create folders data/raw_parts/part1/, ..., data/raw_parts/part7/
%%bash
for year in {1..7}
do
    mkdir -p data/raw_parts/part${year}
done
for each file run (in case you want to use parallel)
%%bash
cat data/raw/file1.csv | parallel --header : --pipe -N1000000 'cat >data/raw_parts/part1/file_{#}.csv'
convert files to parquet
first create output folders
%%bash
for year in {1..7}
do
    mkdir -p data/processed/part${year}
done
define function to convert csv to parquet
import pandas as pd
import os
from dask import delayed, compute
# this can run in parallel
@delayed
def convert2parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".csv", ".parquet")
    df = pd.read_csv(fn)
    df.to_parquet(fn_out, index=False)
get all files you want to convert
jobs = []
fldr_in = "data/raw_parts/"
for (dirpath, dirnames, filenames) in os.walk(fldr_in):
    if len(filenames) > 0:
        jobs += [os.path.join(dirpath, fn)
                 for fn in filenames]
process all in parallel
%%time
fldr_out = "data/processed/"  # output root, mirroring the folders created above
to_process = [convert2parquet(job, fldr_in, fldr_out) for job in jobs]
out = compute(to_process)
I'm trying to use Xarray and Dask to open a multi-file dataset. However, I'm running into memory errors.
I have files that are typically this shape:
xr.open_dataset("/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom//LIG_coupled_fesom_thetao_19680101.nc")
<xarray.Dataset>
Dimensions: (depth: 46, nodes_2d: 126859, time: 366)
Coordinates:
* time (time) datetime64[ns] 1968-01-02 1968-01-03 ... 1969-01-01
* depth (depth) float64 -0.0 10.0 20.0 30.0 ... 5.4e+03 5.65e+03 5.9e+03
Dimensions without coordinates: nodes_2d
Data variables:
thetao (time, depth, nodes_3d) float32 ...
Attributes:
output_schedule: unit: d first: 1 rate: 1
30 files --> 41.5 GB
I also can set up a dask.distributed Client object:
Client()
<Client: 'tcp://127.0.0.1:43229' processes=8 threads=48, memory=68.72 GB>
So, I suppose there is enough memory for the data to be loaded. However, when I then run xr.open_mfdataset, I very often get these sorts of warnings:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 8.25 GB -- Worker memory limit: 8.59 GB
I guess there is something I can do with the chunks argument?
Any help would be very appreciated; unfortunately I'm not sure where to begin trying. I could, in principle, open just the first file (they will always have the same shape) to figure out how to ideally rechunk the files.
Thanks!
Paul
Examples of the chunks and parallel keywords to the opening functions, which correspond to how you utilise dask, can be found in this doc section.
That should be all you need!
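For example, a minimal sketch (the file pattern and the chunk size along time are assumptions):

import xarray as xr
from glob import glob

files = sorted(glob("LIG_coupled_fesom_thetao_*.nc"))

# parallel=True opens the files with dask.delayed; chunks= keeps the
# variables as lazy dask arrays instead of loading them eagerly
ds = xr.open_mfdataset(
    files,
    combine="by_coords",
    parallel=True,
    chunks={"time": 30},
)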
I am new to Dask and having some trouble with it.
I am using a machine (4GB RAM, 2 cores) to analyse two csv files (key.csv: ~2 million rows, about 300MB; sig.csv: ~12 million rows, about 600MB). With this data, pandas can't fit everything in memory, so I switched to Dask.dataframe. What I expect is that Dask will process things in small chunks that fit in memory (it can be slower, I don't mind at all as long as it works); however, somehow Dask still uses up all of the memory.
My code as below:
key=dd.read_csv("key.csv")
sig=dd.read_csv("sig.csv")
merge=dd.merge(key, sig, left_on=["tag","name"],
right_on=["key_tag","query_name"], how="inner")
merge.to_csv("test2903_*.csv")
# store results into a hard disk since it can't be fit in memory
Did I make any mistakes? Any help is appreciated.
Big CSV files generally aren't the best for distributed compute engines like Dask. In this example, the CSVs are 600MB and 300MB, which aren't huge. As specified in the comments, you can set the blocksize when reading the CSVs to make sure the CSVs are read into Dask DataFrames with the right number of partitions.
Distributed compute joins are always going to run faster when you can broadcast the small DataFrame before running the join. Your machine has 4GB of RAM and the small DataFrame is 300MB, so it's small enough to be broadcasted. Dask automagically broadcasts Pandas DataFrames. You can convert a Dask DataFrame to a Pandas DataFrame with compute().
key is the small DataFrame in your example. Column pruning the small DataFrame and making it even smaller before broadcasting is even better.
key=dd.read_csv("key.csv")
sig=dd.read_csv("sig.csv", blocksize="100 MiB")
key_pdf = key.compute()
merge=dd.merge(key_pdf, sig, left_on=["tag","name"],
right_on=["key_tag","query_name"], how="inner")
merge.to_csv("test2903_*.csv")
Here's an MVCE:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(
{
"id": [1, 2, 3, 4],
"cities": ["Medellín", "Rio", "Bogotá", "Buenos Aires"],
}
)
large_ddf = dd.from_pandas(df, npartitions=2)
small_df = pd.DataFrame(
{
"id": [1, 2, 3, 4],
"population": [2.6, 6.7, 7.2, 15.2],
}
)
merged_ddf = dd.merge(
large_ddf,
small_df,
left_on=["id"],
right_on=["id"],
how="inner",
)
print(merged_ddf.compute())
   id        cities  population
0   1      Medellín         2.6
1   2           Rio         6.7
0   3        Bogotá         7.2
1   4  Buenos Aires        15.2