What's the best way to handle large timeseries in dask / xarray? - dask

I've got 17,000 CSV files, each ordered by timestamp (some with missing data). The total CSV files are around 85GB, which is much larger than my 32GB RAM.
I'm trying to figure out the best way to get these into a time-aligned, out-of-memory data structure, such that I can compute things like PCA.
What's the right approach?
(I've tried to set up an xarray.DataSet, with dim=(filename, time), and then I'm trying to xr.merge() on each CSV file into the DataSet, but it gets slower with every insert, and I expect it will crash when RAM runs out.)

Have you tried dd.read_csv(...)?
Dask reads CSVs lazily and can perform certain operations in a streaming manner, so you can run an analysis on a larger-than-memory dataset.
Make sure that Dask is able to properly set divisions when you read in your data. Once the data is read, check the DataFrame's divisions attribute and make sure it holds actual values rather than None.
You can also use a Dask cluster to access more memory of course.
Those files are really small (85GB across 17,000 files is about 5MB each), and Dask typically works best with partitions that are around 100MB. You might want to compact your data a bit.

Related

Dask looping overhead from libraries

When calling another library from dask, such as the scikit-image contrast stretch, I realise that dask is creating a result for each block, storing it either in memory or spilling it to disk separately. Then it attempts to merge all the results. That's fine if you're on a cluster, or on a single computer where the dataset for the array is small and everything is fairly controlled. The problems start when you work with datasets that are much larger than your RAM or disk. Is there a way to mitigate this, or to use the zarr file format to save updated values as you go along? Maybe that's too fanciful. Any other ideas, bar buying more RAM, would be helpful.
edit
I was looking at the dask documentation, and the suggested chunk size is something like 100MB. I ended up reducing significantly from this amount, to 30-70MB depending on file size. I then ran a contrast stretch (not from a library but with a numpy ufunc) and I didn't have any issue! In fact, I played with the way the computation is done. Since I start with a uint8 3-dimensional array, multiplying by the ratio for the contrast stretch inevitably turns each chunk into a float64 array, which takes up significant memory and computation. So what I have been doing is converting the da.array to float64 only just before the multiplication by a float, then returning to uint8 to finish the computation. The stretch time has reduced to just under 5 minutes for a 20GB file. So I think that's a positive step. It just means doing image processing without libraries. I will have a look at rechunker though.
The image processing pipeline I am building will inevitably be used for a merged dataset of about 250-300GB (definitely outside the limits of my laptop). I also don't have time to get to grips with cloud or parallel processing in the cloud. That's for a few months down the line. Right now it's about getting through this analysis.
Yes, you can do the kind of thing you are talking about. I encourage you to check out the rechunker project, which specializes in changing the layout of data in zarr storage, but shows the idea of how to save temporary intermediates for the purpose of mitigating memory and communication issues.
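The dtype handling described in the question's edit could look roughly like the per-chunk function below. This is a sketch under assumptions: the function name and the lo/hi parameters are hypothetical, and it uses float32 rather than the float64 mentioned in the question, which halves the size of the temporary array.

```python
import numpy as np


def stretch_block(block, lo, hi):
    """Linearly map the range [lo, hi] onto [0, 255] for one uint8 chunk."""
    # Promote to float only for the arithmetic on this chunk.
    scaled = (block.astype(np.float32) - lo) * (255.0 / (hi - lo))
    # Clip out-of-range values and drop back to uint8 immediately,
    # so the float copy never outlives the chunk.
    return np.clip(scaled, 0, 255).astype(np.uint8)
```

With a dask array arr, this could be applied chunk by chunk with arr.map_blocks(stretch_block, lo, hi, dtype=np.uint8), so only one chunk's float copy exists per running task.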

Dask is slow with many disk-read and disk-write blocks showing up in the status page

My Dask computation is slow. When I look at the status page of the diagnostics dashboard I see that most of the time is spent in disk-read-* and disk-write-* tasks.
What does this mean?
How do I diagnose this issue?
When Dask workers start to run out of memory they write extra data to disk. This is recorded in the status page as a disk-write- task. When that data is needed again it is read from disk and a disk-read- task is shown on the status page. You might confirm this by looking at the upper left plot that shows memory use per worker, or by looking at the solid portion of the progress bars that show the number of tasks of each particular type that are still in memory.
Ways you can address this:
Figure out why Dask needs to keep data in memory. Common causes:
when you persist a lot of data
when Dask has to keep a lot of intermediate results, such as in the case of a full shuffle, or computations that have a high cardinality of results
Get more memory
Get faster disk. Modern disk bandwidth has improved in the last few years. It's possible to get drives on consumer-grade personal laptops with 1-2GB/s bandwidth.

Neo4j inserting large files - huge difference in time between

I am inserting a set of files (pdfs, of each 2 MB) in my database.
Inserting 100 files at once takes +- 15 seconds, while inserting 250 files at once takes 80 seconds.
I am not quite sure why this big difference is happening, but I assume free memory runs out somewhere between these two batch sizes. Could this be the problem?
If there is any more detail I can provide, please let me know.
Not exactly sure of what is happening on your side but it really looks like what is described here in the neo4j performance guide.
It could be:
Memory issues
If you are experiencing poor write performance after writing some data
(initially fast, then massive slowdown) it may be the operating system
that is writing out dirty pages from the memory mapped regions of the
store files. These regions do not need to be written out to maintain
consistency so to achieve highest possible write speed that type of
behavior should be avoided.
Transaction size
Are you using multiple transactions to upload your files?
Many small transactions result in a lot of I/O writes to disc and
should be avoided. Too big transactions can result in OutOfMemory
errors, since the uncommitted transaction data is held on the Java
Heap in memory.
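One way to act on the transaction-size advice is to group the files into moderately sized batches and commit once per batch. The sketch below shows only the batching logic; write_batch is a hypothetical callback that would open one transaction, write the files, and commit, and the default batch size of 50 is an arbitrary starting point to tune, not a recommendation from the guide.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(paths, write_batch, batch_size=50):
    """Call write_batch once per batch and return the number of batches.

    write_batch is a hypothetical callback: one transaction per call.
    """
    count = 0
    for batch in batched(paths, batch_size):
        write_batch(batch)
        count += 1
    return count
```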
If you are on linux, they also suggest some tuning to improve performance. See here.
You can look up the details on the page.
Also, if you are on linux, you can check memory usage by yourself during import by using this command:
$ free -m
I hope this helps!

iOS SQLite or 1000 loose files?

Suppose I have 1000 records of variable size, ranging from around 256 bytes to a few K. I wonder is there any advantage of putting them into a sqlite database versus just reading/writing 1000 loose files on iOS? I don't need to do any operations other than access by a single key, which I can use as the filename. Seems like the file system would be the winner unless the number of records grows very large.
If your system were read-only, I would say that the file system is the clear winner: a simple binary file and perhaps a small index to know where each record starts would be all that you need. You could read the entire index into memory, and then grab your records from the file system as needed, for a performance that would be extremely tough to match for any RDBMS.
However, since you are planning on writing data back, I would suggest going with SQLite because of potential data integrity issues.
Performance concerns should not be underestimated, too: since your records are of variable size, writing the data back may prove to be difficult in cases when records need to expand. Moreover, since you are on a mobile platform, you would need to build something in to avoid data corruption when the program is killed unexpectedly in the middle of a write. SQLite takes care of this; your code would have to build something comparable to it, or risk data corruption problems.
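For illustration, the single-key access pattern maps onto a one-table SQLite schema. The sketch below uses Python's built-in sqlite3 module rather than the iOS C API, and the table and function names are hypothetical. Each write runs inside a transaction, which is what protects against a half-written record if the process is killed mid-write.

```python
import sqlite3


def open_store(path):
    """Open (or create) a key/value store backed by one SQLite table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "key TEXT PRIMARY KEY, value BLOB NOT NULL)"
    )
    return conn


def put(conn, key, value):
    """Insert or overwrite a record; the new value may be any size."""
    # The connection context manager commits on success and rolls back
    # on an exception, so a record is never left partially written.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO records (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(conn, key):
    """Return the stored bytes for key, or None if it is absent."""
    row = conn.execute(
        "SELECT value FROM records WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```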

OutOfMemoryException Processing Large File

We are loading a large flat file into BizTalk Server 2006 (Original release, not R2) - about 125 MB. We run a map against it and then take each row and make a call out to a stored procedure.
We receive the OutOfMemoryException during orchestration processing, the Windows Service restarts, uses full 2 GB memory, and crashes again.
The server is 32-bit and set to use the /3GB switch.
Also I've separated the flow into 3 hosts - one for receive, the other for orchestration, and the third for sends.
Does anyone have any suggestions for getting this file to process without error?
Thanks,
Krip
If this is a flat file being sent through a map you are converting it to XML right? The increase in size could be huge. XML can easily add a factor of 5-10 times over a flat file. Especially if you use descriptive or long xml tag names (which normally you would).
Something simple you could try is to rename the xml nodes to shorter names, depending on the number of records (sounds like a lot) it might actually have a pretty significant impact on your memory footprint.
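To get a feel for the size inflation and the effect of shorter tag names, the comparison below serializes one made-up row three ways. The field names and values are purely hypothetical and the exact ratios will depend on your real schema; it is only meant to show how to measure the overhead on a sample row.

```python
from xml.sax.saxutils import escape


def row_to_xml(fields, values, row_tag):
    """Serialize one record as a flat XML element with one child per field."""
    cells = "".join(
        f"<{name}>{escape(value)}</{name}>"
        for name, value in zip(fields, values)
    )
    return f"<{row_tag}>{cells}</{row_tag}>"


# Hypothetical sample row, not taken from the question.
row = ["10482", "2009-03-14", "149.95"]

flat = ",".join(row) + "\n"
long_xml = row_to_xml(
    ["CustomerAccountNumber", "TransactionPostingDate", "TransactionAmount"],
    row,
    "TransactionRecord",
)
short_xml = row_to_xml(["a", "d", "t"], row, "r")

# Compare the character counts of the three encodings of the same row.
print(len(flat), len(long_xml), len(short_xml))
```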
Perhaps a more enterprise approach would be to subdivide this in a custom pipeline into separate message packets that can be fed through the system in more manageable chunks (similar to what Chris suggests). Then the system throttling and memory metrics could take over. Without knowing more about your data it would be hard to say how best to do this, but with a 125 MB file I am guessing that you probably have a ton of repeating rows that do not need to be processed sequentially.
Where does it crash? Does it make it past the Transform shape? Another suggestion to try is to run the transform in the Receive Port. For more efficient processing, you could even debatch the message and have multiple simultaneous orchestration instances calling the stored procs. This would definitely reduce the memory profile and increase performance.

Resources