I'm trying to use Xarray and Dask to open a multi-file dataset. However, I'm running into memory errors.
I have files that are typically this shape:
xr.open_dataset("/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom//LIG_coupled_fesom_thetao_19680101.nc")
<xarray.Dataset>
Dimensions: (depth: 46, nodes_2d: 126859, time: 366)
Coordinates:
* time (time) datetime64[ns] 1968-01-02 1968-01-03 ... 1969-01-01
* depth (depth) float64 -0.0 10.0 20.0 30.0 ... 5.4e+03 5.65e+03 5.9e+03
Dimensions without coordinates: nodes_2d
Data variables:
thetao (time, depth, nodes_2d) float32 ...
Attributes:
output_schedule: unit: d first: 1 rate: 1
30 files --> 41.5 GB
I also can set up a dask.distributed Client object:
Client()
<Client: 'tcp://127.0.0.1:43229' processes=8 threads=48, memory=68.72 GB>
So, I suppose there should be enough memory for the data to be loaded. However, when I then run xr.open_mfdataset, I very often get warnings like this:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 8.25 GB -- Worker memory limit: 8.59 GB
I guess there is something I can do with the chunks argument?
Any help would be much appreciated; unfortunately I'm not sure where to begin. I could, in principle, open just the first file (they will always have the same shape) to figure out how to ideally rechunk the files.
Thanks!
Paul
Examples of the chunks and parallel keywords to the opening functions, which control how Dask is used, can be found in this doc section.
That should be all you need!
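For example, a minimal sketch of what this could look like for the files above (the glob pattern and chunk sizes are assumptions to adapt, not values from the question):
import xarray as xr

# Chunk sizes here are placeholders: with (time, depth, nodes_2d) = (366, 46, 126859)
# float32 data, a {"time": 10, "depth": 23} chunk is roughly 10*23*126859*4 B ≈ 120 MB,
# comfortably below the ~8.6 GB per-worker limit.
ds = xr.open_mfdataset(
    "LIG_coupled_fesom_thetao_*.nc",   # hypothetical glob over the 30 files
    combine="by_coords",
    parallel=True,                     # open the files in parallel with dask.delayed
    chunks={"time": 10, "depth": 23},
)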
Related
I have this function that I would like to apply to a large dataframe in parallel:
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
def standardize_smiles(smiles):
    if smiles is None:
        return None
    try:
        mol = Chem.MolFromSmiles(smiles)
        # remove Hs, disconnect metal atoms, normalize and reionize the molecule
        clean_mol = rdMolStandardize.Cleanup(mol)
        # if there are many fragments, get the "parent" (the actual mol we are interested in)
        parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
        # try to neutralize the molecule
        uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
        uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
        # note that no attempt is made at reionization at this step,
        # nor at ionization at some pH (RDKit has no pKa calculator);
        # the main aim is to represent all molecules from different sources
        # in a (single) standard way, for use in ML, catalogues, etc.
        te = rdMolStandardize.TautomerEnumerator()  # idem
        taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
        return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
    except Exception:
        return False
standardize_smiles('CCC')
'CCC'
However, neither Dask, nor Swifter, nor Ray can do the job. All frameworks use a single CPU for some reason.
Native Pandas
import pandas as pd
N = 1000
smiles_test = pd.DataFrame({'smiles': ['CCC']*N})
%time smiles_test['standardized_smiles'] = smiles_test.smiles.apply(standardize_smiles)
CPU times: user 3.58 s, sys: 0 ns, total: 3.58 s
Wall time: 3.58 s
Swifter 1.3.4
%time smiles_test['standardized_smiles'] = smiles_test.smiles.swifter.allow_dask_on_strings(True).apply(standardize_smiles)
CPU times: user 892 ms, sys: 31.4 ms, total: 923 ms
Wall time: 5.14 s
While this WORKS with the dummy data, it does not with the real data; the SMILES strings there are a bit more complicated than the ones in the dummy data.
It seems Swifter first needs some time to prepare the parallel execution and only uses one core, but then uses more cores. However, for the real data it only uses 3 out of 8 cores.
I have the same issue with other frameworks such as dask, ray, modin, swifter.
Is there something I am missing here? Is there a problem when the dataframe contains strings? Why does the parallel execution take so much time even on a single computer (with multiple cores)? Or is there an issue with the RDKit library that makes it difficult to parallelize the above function?
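This doesn't explain the single-core behaviour, but for reference, this is roughly how the same apply could be handed to Dask explicitly via map_partitions (the npartitions value and the process scheduler are assumptions, not part of the original setup):
import dask.dataframe as dd

# Split the pandas DataFrame into one partition per core and apply the
# standardizer partition-wise; separate processes sidestep the GIL for the Python-level work.
ddf = dd.from_pandas(smiles_test, npartitions=8)
result = ddf.smiles.map_partitions(
    lambda part: part.apply(standardize_smiles),
    meta=("smiles", "object"),
)
smiles_test["standardized_smiles"] = result.compute(scheduler="processes")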
I am capturing video from a Ricoh Theta V camera. It delivers the video as Motion JPEG (MJPEG). To get the video you have to do an HTTP POST, which alas means I cannot use the cv2.VideoCapture(url) feature.
So the way to do this per numerous posts on the web and SO is something like this:
import cv2
import numpy as np

bytes = bytes()
while True:
    bytes += stream.read(1024)          # stream is the response body of the HTTP POST
    a = bytes.find(b'\xff\xd8')         # JPEG SOI marker
    b = bytes.find(b'\xff\xd9')         # JPEG EOI marker
    if a != -1 and b != -1:
        jpg = bytes[a:b+2]
        bytes = bytes[b+2:]
        i = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('i', i)
        if cv2.waitKey(1) == 27:
            exit(0)
That actually works, except it is slow. I'm processing a 1920x1080 JPEG stream on a MacBook Pro running OSX 10.12.6. The call to imdecode takes approximately 425,000 microseconds (425 ms) per image.
Any idea how to do this without imdecode, or how to make imdecode faster? I'd like it to work at 60 FPS with HD video (at least).
I'm using Python3.7 and OpenCV4.
Updated Again
I looked into JPEG decoding from the memory buffer using PyTurboJPEG, the code goes like this to compare with OpenCV's imdecode():
#!/usr/bin/env python3
import cv2
import numpy as np
from turbojpeg import TurboJPEG
# Load image into memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# Decode JPEG from memory into Numpy array using OpenCV
i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
# Use default library installation
jpeg = TurboJPEG()
# Decode JPEG from memory using turbojpeg
i1 = jpeg.decode(r)
cv2.imshow('Decoded with TurboJPEG', i1)
cv2.waitKey(0)
And the answer is that TurboJPEG is 7x faster! That is 4.6ms versus 32.2ms.
In [18]: %timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
32.2 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit i1 = jpeg.decode(r)
4.63 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Kudos to @Nuzhny for spotting it first!
Updated Answer
I have been doing some further benchmarks on this and was unable to verify your claim that it is faster to save an image to disk and read it with imread() than it is to use imdecode() from memory. Here is how I tested in IPython:
import cv2
import numpy as np
# First use 'imread()'
%timeit i1 = cv2.imread('image.jpg', cv2.IMREAD_COLOR)
116 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now prepare the exact same image in memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# And try again with 'imdecode()'
%timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
113 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, I find imdecode() around 3% faster than imread() on my machine. Even if I include the np.asarray() into the timing, it is still quicker from memory than disk - and I have seriously fast 3GB/s NVME disks on my machine...
Original Answer
I haven't tested this but it seems to me that you are doing this in a loop:
read 1k bytes
append it to a buffer
look for JPEG SOI marker (0xffd8)
look for JPEG EOI marker (0xffd9)
if you have found both the start and the end of a JPEG frame, decode it
1) Now, most JPEG images with any interesting content I have seen are between 30kB and 300kB, so you are going to do 30-300 append operations on a buffer. I don't know much about Python but I guess that may cause a re-allocation of memory, which may be slow.
2) Next you are going to look for the SOI marker in the first 1kB, then again in the first 2kB, then again in the first 3kB, then again in the first 4kB - even if you have already found it!
3) Likewise, you are going to look for the EOI marker in the first 1kB, the first 2kB...
So, I would suggest you try the following (a rough sketch follows after this list):
1) allocating a bigger buffer at the start and acquiring directly into it at the appropriate offset
2) not searching for the SOI marker if you have already found it - e.g. set it to -1 at the start of each frame and only try and find it if it is still -1
3) only look for the EOI marker in the new data on each iteration, not in all the data you have already searched on previous iterations
4) furthermore, actually, don't bother looking for the EOI marker unless you have already found the SOI marker, because the end of a frame without the corresponding start is no use to you anyway - it is incomplete.
I may be wrong in my assumptions, (I have been before!) but at least if they are public someone cleverer than me can check them!!!
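Here is a rough, untested sketch of those suggestions. It uses a growable bytearray for point 1 rather than a pre-allocated buffer, assumes stream is the same HTTP response object as in the question, and the read size is arbitrary:
import cv2
import numpy as np

buf = bytearray()   # one growable buffer instead of re-building an immutable bytes object
soi = -1            # index of the SOI marker, -1 until found (point 2)
scanned = 0         # how far we have already searched for markers (point 3)

while True:
    chunk = stream.read(65536)
    if not chunk:
        break
    buf += chunk

    # 2) only look for the SOI marker while we haven't found it yet
    if soi == -1:
        soi = buf.find(b'\xff\xd8', max(scanned - 1, 0))

    # 3)/4) only search the new data, and only once a frame has started
    if soi != -1:
        eoi = buf.find(b'\xff\xd9', max(scanned - 1, soi))
        if eoi != -1:
            jpg = bytes(buf[soi:eoi + 2])
            del buf[:eoi + 2]            # drop the consumed frame (and any leading junk)
            soi, scanned = -1, 0
            img = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
            cv2.imshow('frame', img)
            if cv2.waitKey(1) == 27:
                break
            continue
    scanned = len(buf)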
I recommend using turbo-jpeg. It has a Python API: PyTurboJPEG.
I am new to Dask and having some troubles with it.
I am using a machine (4GB RAM, 2 cores) to analyse two CSV files (key.csv: ~2 million rows, about 300MB; sig.csv: ~12 million rows, about 600MB). With these data, pandas can't fit everything in memory, so I switched to dask.dataframe. What I expected is that Dask would process things in small chunks that fit in memory (it can be slower, I don't mind at all as long as it works); however, somehow, Dask still uses up all of the memory.
My code is below:
import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv")
merge = dd.merge(key, sig, left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"], how="inner")
# store the result on disk since it can't fit in memory
merge.to_csv("test2903_*.csv")
Did I make any mistakes? Any help is appreciated.
Big CSV files generally aren't the best for distributed compute engines like Dask. In this example, the CSVs are 600MB and 300MB, which aren't huge. As specified in the comments, you can set the blocksize when reading the CSVs to make sure the CSVs are read into Dask DataFrames with the right number of partitions.
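For example, a small sketch of how blocksize controls the partitioning (the 100 MiB value is just an illustration to tune to your RAM):
import dask.dataframe as dd

# With a ~600MB file and 100 MiB blocks you get roughly 6 partitions,
# so each task only has to hold about 100MB in memory at a time.
sig = dd.read_csv("sig.csv", blocksize="100 MiB")
print(sig.npartitions)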
Distributed compute joins are always going to run faster when you can broadcast the small DataFrame before running the join. Your machine has 4GB of RAM and the small DataFrame is 300MB, so it's small enough to be broadcasted. Dask automagically broadcasts Pandas DataFrames. You can convert a Dask DataFrame to a Pandas DataFrame with compute().
key is the small DataFrame in your example. Column pruning the small DataFrame and making it even smaller before broadcasting is even better.
import dask.dataframe as dd

key = dd.read_csv("key.csv")
sig = dd.read_csv("sig.csv", blocksize="100 MiB")
key_pdf = key.compute()  # now a pandas DataFrame, small enough to broadcast
merge = dd.merge(key_pdf, sig, left_on=["tag", "name"],
                 right_on=["key_tag", "query_name"], how="inner")
merge.to_csv("test2903_*.csv")
Here's an MVCE:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "cities": ["Medellín", "Rio", "Bogotá", "Buenos Aires"],
    }
)
large_ddf = dd.from_pandas(df, npartitions=2)
small_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "population": [2.6, 6.7, 7.2, 15.2],
    }
)
merged_ddf = dd.merge(
    large_ddf,
    small_df,
    left_on=["id"],
    right_on=["id"],
    how="inner",
)
print(merged_ddf.compute())
   id        cities  population
0   1      Medellín         2.6
1   2           Rio         6.7
0   3        Bogotá         7.2
1   4  Buenos Aires        15.2
I'm trying to use dask bag to word-count 30GB of JSON files, following strictly the tutorial on the official site: http://dask.pydata.org/en/latest/examples/bag-word-count-hdfs.html
But it still doesn't work; my single machine has 32GB of memory and an 8-core CPU.
My code is below. Even processing a 10GB file does not work: it runs for a couple of hours without any notification and then Jupyter collapses. I tried on Ubuntu and Windows, and both systems have the same problem. So I wonder: can dask bag process data that doesn't fit in memory, or is my code incorrect?
The test data is from http://files.pushshift.io/reddit/comments/
import dask.bag as db
import json
b = db.read_text(r'D:\RC_2015-01\RC_2012-04')
records = b.map(json.loads)
result = b.str.split().concat().frequencies().topk(10, lambda x: x[1])
%time f = result.compute()
f
Try setting a blocksize in the 10MB range when reading from the single file to break it up a bit.
In [1]: import dask.bag as db
In [2]: b = db.read_text('RC_2012-04', blocksize=10000000)
In [3]: %time b.count().compute()
CPU times: user 1.22 s, sys: 56 ms, total: 1.27 s
Wall time: 20.4 s
Out[3]: 19044534
Also, as a warning, you create a bag records but then don't do anything with it. You might want to remove that line.
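Putting the two points together, here is a rough sketch of what the word count could look like. It assumes you want to count words in the comments' 'body' field, which is an assumption about the Reddit JSON format rather than something taken from your code:
import json
import dask.bag as db

# Read the single large file in ~10MB blocks so no one partition is huge.
b = db.read_text('RC_2012-04', blocksize=10000000)

# Parse each line as JSON and count words in the 'body' field (assumed field name).
records = b.map(json.loads)
result = (records.map(lambda r: r.get('body', ''))
                 .str.split()
                 .flatten()
                 .frequencies()
                 .topk(10, key=lambda x: x[1]))
print(result.compute())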
I get a System.OutOfMemoryException for this simple code (a 10 000 * 10 000 matrix multiplied by itself):
#time
#r "Microsoft.Office.Interop.Excel"
#r "FSharp.PowerPack.dll"
open System
open System.IO
open Microsoft.FSharp.Math
open System.Collections.Generic
let mutable Matrix1 = Matrix.create 10000 10000 0.
let matrix4 = Matrix1 * Matrix1
I have the following error:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
Microsoft.FSharp.Collections.Array2DModule.ZeroCreate[T](Int32 length1, Int32 length2)
Microsoft.FSharp.Math.DoubleImpl.mulDenseMatrixDS(DenseMatrix`1 a, DenseMatrix`1 b)
Microsoft.FSharp.Math.SpecializedGenericImpl.mulM[a](Matrix`1 a, Matrix`1 b)
<StartupCode$FSI_0004>.$FSI_0004.main#() in C:\Users\XXXXXXX\documents\visual studio 2010\Projects\Library1\Library1\Module1.fs:line 92
Stopped due to error
I therefore have 2 questions:
I have 8 GB of memory on my computer and, according to my calculation, a 10 000 * 10 000 matrix should take 381 MB [computed this way: 10 000 * 10 000 = 100 000 000 integers in the matrix => 100 000 000 * 4 bytes (32-bit integers) = 400 000 000 bytes => 400 000 000 / (1024*1024) = 381 MB], so I cannot understand why there is an OutOfMemoryException.
More generally (it's not the case here, I think), I have the impression that F# Interactive keeps all the data it has evaluated and therefore overloads the memory. Do you know of a way to free all the data held by F# Interactive without exiting it?
In summary, fsi.exe is a 32-bit process; at most it can hold 2GB of data. Run your test as a 64-bit Windows process; you can increase the size of the matrix, but a single .NET object is still limited to 2GB.
Let me correct your calculation a little bit: Matrix1 is a float matrix, so each element occupies 8 bytes in memory. The total size of Matrix1 and matrix4 in memory is at least:
2 * 10000 * 10000 * 8 = 1 600 000 000 bytes ~ 1.6 GB
(ignoring the matrix bookkeeping overhead)
So it's no surprise that 32-bit fsi runs out of memory in this case.
If you execute the test as a 64-bit Windows process, you can create float matrices of around 15000 x 15000, but not more than that. Check out this informative article for concrete numbers with different types of matrix elements.
The amount of physical memory on your computer is not the relevant bottleneck - see Eric Lippert's great blog post for more information.