Out of memory exception for a matrix - f#

I get a System.OutOfMemoryException for this simple code, which multiplies a 10 000 * 10 000 matrix by itself:
#time
#r "Microsoft.Office.Interop.Excel"
#r "FSharp.PowerPack.dll"
open System
open System.IO
open Microsoft.FSharp.Math
open System.Collections.Generic
let mutable Matrix1 = Matrix.create 10000 10000 0.
let matrix4 = Matrix1 * Matrix1
I have the following error:
System.OutOfMemoryException: An exception 'System.OutOfMemoryException' has been raised
Microsoft.FSharp.Collections.Array2DModule.ZeroCreate[T](Int32 length1, Int32 length2)
Microsoft.FSharp.Math.DoubleImpl.mulDenseMatrixDS(DenseMatrix`1 a, DenseMatrix`1 b)
Microsoft.FSharp.Math.SpecializedGenericImpl.mulM[a](Matrix`1 a, Matrix`1 b)
<StartupCode$FSI_0004>.$FSI_0004.main#() dans C:\Users\XXXXXXX\documents\visual studio 2010\Projects\Library1\Library1\Module1.fs:line 92
Stop due to an error
I have therefore 2 questions:
I have 8 GB of memory on my computer and, according to my calculation, a 10 000 * 10 000 matrix should take 381 MB [computed this way: 10 000 * 10 000 = 100 000 000 integers in the matrix => 100 000 000 * 4 bytes (integers of 32 bits) = 400 000 000 => 400 000 000 / (1024*1024) = 381 MB], so I cannot understand why there is an OutOfMemoryException.
More generally (it's not the case here I think), I have the impression that F# interactive registers all the data and therefore overloads the memory, do you know of a way to free all the data registered by F# interactive without exiting F#?

In summary, fsi is a 32-bit process, so it can hold at most 2 GB of data. If you run your test as a 64-bit Windows application you can increase the size of the matrix, but each individual .NET object is still limited to 2 GB.
Let me correct your calculation slightly. Matrix1 is a float matrix, so each element occupies 8 bytes in memory. The total size of Matrix1 and matrix4 in memory is at least:
2 * 10000 * 10000 * 8 = 1 600 000 000 bytes ~ 1.6 GB
(ignoring some bookkeeping parts of matrix)
So it's no surprise that 32-bit fsi runs out of memory in this case.
If you execute the test as a 64-bit Windows process, you can create float matrices of size around 15000 x 15000, but not larger than that. Check out this informative article for concrete numbers with different types of matrix elements.
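The arithmetic above can be checked with a few lines of Python. This is only a sketch of the size estimate: it ignores the matrix bookkeeping overhead, and the 2 GB per-object limit is encoded as the assumed constant OBJECT_LIMIT (the helper names are made up for illustration).

```python
import math

def matrix_bytes(rows, cols, elem_size):
    """Raw payload size of a dense rows x cols matrix, ignoring overhead."""
    return rows * cols * elem_size

# The question assumed 4-byte integers; the matrix actually holds 8-byte floats.
as_int32 = matrix_bytes(10_000, 10_000, 4)         # 400_000_000 bytes
two_floats = 2 * matrix_bytes(10_000, 10_000, 8)   # 1_600_000_000 bytes (Matrix1 + matrix4)

# Assumed per-object limit of 2 GB (2**31 bytes).
OBJECT_LIMIT = 2**31

# Upper bound on the side of a square float matrix whose payload reaches the limit.
max_side = math.isqrt(OBJECT_LIMIT // 8)           # 16384

print(as_int32, two_floats, max_side)
```

A 16384 x 16384 float matrix would already occupy the full 2**31 bytes, so the practical ceiling sits somewhat below that, which is in line with the "around 15000" figure.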

The amount of physical memory on your computer is not the relevant bottleneck - see Eric Lippert's great blog post for more information.

Related

When increasing size of data created, why does Jupyter Kernel die before MemoryError appears?

I ran jupyter through docker run -p 8888:8888 jupyter/scipy-notebook.
from sklearn.datasets import make_classification
X, y = make_classification(10000,2000)
causes kernel to die, while X, y = make_classification(100000,2000) gives
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
Cell In [1], line 3
1 from sklearn.datasets import make_classification
----> 3 X, y = make_classification(100000,2000)
File /opt/conda/lib/python3.10/site-packages/sklearn/datasets/_samples_generator.py:220, in make_classification(n_samples, n_features, n_informative, n_redundant, n_repeated, n_classes, n_clusters_per_class, weights, flip_y, class_sep, hypercube, shift, scale, shuffle, random_state)
217 n_samples_per_cluster[i % n_clusters] += 1
219 # Initialize X and y
--> 220 X = np.zeros((n_samples, n_features))
221 y = np.zeros(n_samples, dtype=int)
223 # Build the polytope whose vertices become cluster centroids
MemoryError: Unable to allocate 1.49 GiB for an array with shape (100000, 2000) and data type float64
It looks like for the larger (100000, 2000) data size, allocation failed and errored out early.
Did the smaller (10000, 2000) data size allocate successfully, since there's no error raised?
So what is killing the kernel, and why is the initial memory allocation insufficient as a check? It looks like there is extra memory allocated as the code runs?
When running the 10000, 2000 code, docker stats shows
MEM USAGE / LIMIT beginning at 152.2MiB / 966.2MiB then shooting to 369.2MiB / 966.2MiB, waiting for a few seconds, then kernel died pop-up.
369 looks very far from 966 available, did the MEM USAGE suddenly jump from 369 to above 966 in 1 step?
If the kernel indeed died because MEM USAGE was too high, how do I know when (eg. what % of available memory usage) to start being careful and deleting unused variables?
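For reference, the figure in the traceback can be reproduced with plain arithmetic. This is a rough sketch that only counts the X array allocated by the np.zeros call shown in the traceback; whatever make_classification allocates after that line is not included, and the helper name array_gib is made up for illustration.

```python
def array_gib(shape, itemsize=8):
    """Size in GiB of a dense array with the given shape (float64 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize / 2**30

large = array_gib((100_000, 2_000))   # about 1.49 GiB, as in the MemoryError message
small = array_gib((10_000, 2_000))    # about 0.149 GiB
small_mib = small * 1024              # about 152.6 MiB

print(f"{large:.2f} GiB, {small_mib:.1f} MiB")
```

So the initial X array alone for the smaller case is roughly 153 MiB against the 966.2 MiB limit reported by docker stats.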

Handling Xarray/Dask Memory

I'm trying to use Xarray and Dask to open a multi-file dataset. However, I'm running into memory errors.
I have files that are typically this shape:
xr.open_dataset("/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom//LIG_coupled_fesom_thetao_19680101.nc")
<xarray.Dataset>
Dimensions: (depth: 46, nodes_2d: 126859, time: 366)
Coordinates:
* time (time) datetime64[ns] 1968-01-02 1968-01-03 ... 1969-01-01
* depth (depth) float64 -0.0 10.0 20.0 30.0 ... 5.4e+03 5.65e+03 5.9e+03
Dimensions without coordinates: nodes_2d
Data variables:
thetao (time, depth, nodes_3d) float32 ...
Attributes:
output_schedule: unit: d first: 1 rate: 1
30 files --> 41.5 GB
I also can set up a dask.distributed Client object:
Client()
<Client: 'tcp://127.0.0.1:43229' processes=8 threads=48, memory=68.72 GB>
So, I suppose there is enough memory for the data to be loaded. However, when I then run xr.open_mfdataset, I very often get these sorts of warnings:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 8.25 GB -- Worker memory limit: 8.59 GB
I guess there is something I can do with the chunks argument?
Any help would be very appreciated; unfortunately I'm not sure where to begin trying. I could, in principle, open just the first file (they will always have the same shape) to figure out how to ideally rechunk the files.
Thanks!
Paul
Examples of the chunks and parallel keywords to the opening functions, which correspond to how you utilise dask, can be found in this doc section.
That should be all you need!
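When choosing a value for chunks, it can help to estimate how many bytes one chunk occupies once decoded. The sketch below is plain arithmetic, not an xarray call. It assumes that thetao has the lengths printed on the Dimensions line (366, 46, 126859), although the printout labels the last dimension nodes_3d, so that length is an assumption. The chunk length of 10 along time is an arbitrary illustration.

```python
def nbytes(shape, itemsize):
    """Uncompressed size in bytes of an array with the given shape."""
    total = itemsize
    for n in shape:
        total *= n
    return total

FLOAT32 = 4  # thetao is float32

# One file's variable, fully loaded (assumed shape, see lead-in).
full_variable = nbytes((366, 46, 126859), FLOAT32)   # 8_543_192_496 bytes

# The same variable split into chunks of 10 time steps (arbitrary choice).
one_chunk = nbytes((10, 46, 126859), FLOAT32)        # 233_420_560 bytes

print(full_variable / 1e9, one_chunk / 1e6)
```

If the assumed shape holds, one whole file is about 8.5 GB in memory, which is close to the 8.59 GB per-worker limit in the warning, whereas a chunk of 10 time steps is about 233 MB. A chunk specification along the lines of chunks={"time": 10} would correspond to the second case, though the best value depends on the access pattern.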

python opencv create image from bytearray

I am capturing video from a Ricoh Theta V camera. It delivers the video as Motion JPEG (MJPEG). To get the video you have to do an HTTP POST alas which means I cannot use the cv2.VideoCapture(url) feature.
So the way to do this per numerous posts on the web and SO is something like this:
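buf = bytes()  # renamed from 'bytes' to avoid shadowing the built-in type
while True:
    buf += stream.read(1024)
    a = buf.find(b'\xff\xd8')  # JPEG start-of-image marker
    b = buf.find(b'\xff\xd9')  # JPEG end-of-image marker
    if a != -1 and b != -1:
        jpg = buf[a:b+2]
        buf = buf[b+2:]
        # np.frombuffer replaces the deprecated np.fromstring
        i = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('i', i)
        if cv2.waitKey(1) == 27:
            exit(0)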
That actually works, except it is slow. I'm processing a 1920x1080 JPEG stream on a MacBook Pro running OS X 10.12.6. The call to imdecode takes approx 425000 microseconds to process each image.
Any idea how to do this without imdecode or make imdecode faster? I'd like it to work at 60FPS with HD video (at least).
I'm using Python3.7 and OpenCV4.
Updated Again
I looked into JPEG decoding from the memory buffer using PyTurboJPEG, the code goes like this to compare with OpenCV's imdecode():
#!/usr/bin/env python3
import cv2
import numpy as np
from turbojpeg import TurboJPEG, TJPF_GRAY, TJSAMP_GRAY
# Load image into memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# Decode JPEG from memory into Numpy array using OpenCV
i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
# Use default library installation
jpeg = TurboJPEG()
# Decode JPEG from memory using turbojpeg
i1 = jpeg.decode(r)
cv2.imshow('Decoded with TurboJPEG', i1)
cv2.waitKey(0)
And the answer is that TurboJPEG is 7x faster! That is 4.6ms versus 32.2ms.
In [18]: %timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
32.2 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit i1 = jpeg.decode(r)
4.63 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Kudos to @Nuzhny for spotting it first!
Updated Answer
I have been doing some further benchmarks on this and was unable to verify your claim that it is faster to save an image to disk and read it with imread() than it is to use imdecode() from memory. Here is how I tested in IPython:
import cv2
import numpy as np
# First use 'imread()'
%timeit i1 = cv2.imread('image.jpg', cv2.IMREAD_COLOR)
116 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now prepare the exact same image in memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# And try again with 'imdecode()'
%timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
113 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, I find imdecode() around 3% faster than imread() on my machine. Even if I include the np.asarray() into the timing, it is still quicker from memory than disk - and I have seriously fast 3GB/s NVME disks on my machine...
Original Answer
I haven't tested this but it seems to me that you are doing this in a loop:
read 1k bytes
append it to a buffer
look for JPEG SOI marker (0xffd8)
look for JPEG EOI marker (0xffd9)
if you have found both the start and the end of a JPEG frame, decode it
1) Now, most JPEG images with any interesting content I have seen are between 30kB and 300kB so you are going to do 30-300 append operations on a buffer. I don't know much about Python but I guess that may cause a re-allocation of memory, which I guess may be slow.
2) Next you are going to look for the SOI marker in the first 1kB, then again in the first 2kB, then again in the first 3kB, then again in the first 4kB - even if you have already found it!
3) Likewise, you are going to look for the EOI marker in the first 1kB, the first 2kB...
So, I would suggest you try:
1) allocating a bigger buffer at the start and acquiring directly into it at the appropriate offset
2) not searching for the SOI marker if you have already found it - e.g. set it to -1 at the start of each frame and only try and find it if it is still -1
3) only look for the EOI marker in the new data on each iteration, not in all the data you have already searched on previous iterations
4) furthermore, actually, don't bother looking for the EOI marker unless you have already found the SOI marker, because the end of a frame without the corresponding start is no use to you anyway - it is incomplete.
I may be wrong in my assumptions, (I have been before!) but at least if they are public someone cleverer than me can check them!!!
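The suggestions above could be sketched roughly as follows. This is an untimed illustration, not a measured improvement: iter_jpeg_frames and read_chunk are hypothetical names, the chunk size is arbitrary, and it assumes the SOI/EOI markers do not occur inside a frame (for example in an embedded thumbnail).

```python
import io

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker

def iter_jpeg_frames(read_chunk, chunk_size=65536):
    """Yield complete JPEG frames from a byte stream.

    read_chunk is a callable such as stream.read that returns up to
    chunk_size bytes, or b"" at end of stream.
    """
    buf = bytearray()
    start = -1    # offset of SOI in buf, or -1 while not yet found
    scanned = 0   # offset from which the next search resumes
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            return
        buf += chunk
        while True:
            if start == -1:
                # Only look for SOI while we do not have one.
                start = buf.find(SOI, scanned)
                if start == -1:
                    # Keep the last byte in case a marker straddles two chunks.
                    del buf[:max(len(buf) - 1, 0)]
                    scanned = 0
                    break
                scanned = start + 2
            # Only look for EOI after SOI, and only in data not yet searched.
            end = buf.find(EOI, scanned)
            if end == -1:
                scanned = max(len(buf) - 1, start + 2)
                break
            yield bytes(buf[start:end + 2])
            del buf[:end + 2]
            start = -1
            scanned = 0

# Tiny demonstration with fake "frames" read 3 bytes at a time.
demo = b"junk" + SOI + b"abc" + EOI + b"xx" + SOI + b"defg" + EOI
frames = list(iter_jpeg_frames(io.BytesIO(demo).read, chunk_size=3))
```

Each yielded frame can then be passed to the decoder of choice. The bytearray grows in place rather than being rebuilt on every read, and each byte is searched roughly once per marker.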
I recommend using turbo-jpeg. It has a Python API: PyTurboJPEG.

Is there an SIMD instruction to achieve batch array memory index mapping?

In my RGB to grey case:
Y = (77*R + 150*G + 29*B) >> 8;
I know SIMD (NEON, SSE2) can do like:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = 77*{R0,R1,R2,R3,R4,R5,R6,R7}
{B0,B1,B2,B3,B4,B5,B6,B7} = 150*{G0,G1,G2,G3,G4,G5,G6,G7}
{C0,C1,C2,C3,C4,C5,C6,C7} = 29*{B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
However, the multiply instruction takes at least 2 clock cycles, and since R, G, B are in [0-255],
we can use three lookup tables (arrays of length 256) to store the partial results of
77*R (call it X), 150*G (call it Y), 29*B (call it Z).
So I'm looking for instructions that can do the following:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = {X[R0],X[R1],X[R2],X[R3],X[R4],X[R5],X[R6],X[R7]}
{B0,B1,B2,B3,B4,B5,B6,B7} = {Y[G0],Y[G1],Y[G2],Y[G3],Y[G4],Y[G5],Y[G6],Y[G7]}
{C0,C1,C2,C3,C4,C5,C6,C7} = {Z[B0],Z[B1],Z[B2],Z[B3],Z[B4],Z[B5],Z[B6],Z[B7]}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
Any good suggestions?
There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! e.g. one per 5 cycle throughput for vpgatherdd ymm,[reg + scale*ymm], ymm, according to Agner Fog's instruction table for Skylake.
You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words. That's 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those are multiple uops on current CPUs. (e.g. AVX512BW vpermi2w). They are still fantastically powerful compared to vpshufb, though.
So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.
Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.
If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply. e.g. B * 29 = B * 32 - B - B * 2. Or (B<<5) - (B<<1) - B. That many instructions probably have more throughput cost than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)
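Two of the arithmetic claims here can be checked exhaustively with scalar Python, as a sanity check rather than SIMD code. The sketch verifies that the weighted sum never exceeds an unsigned 16-bit lane (so the low 16 bits produced by a 16-bit multiply are enough) and that the shift/add form matches the multiply by 29 for every byte value. The names WR, WG, WB and gray are made up for illustration.

```python
# Fixed-point weights from the question; they sum to 256.
WR, WG, WB = 77, 150, 29

def gray(r, g, b):
    """Scalar reference for Y = (77*R + 150*G + 29*B) >> 8."""
    return (WR * r + WG * g + WB * b) >> 8

# Largest possible weighted sum, reached at R = G = B = 255.
max_sum = (WR + WG + WB) * 255        # 65280
fits_in_u16 = max_sum < 2**16         # True: no overflow in an unsigned 16-bit lane

# Shift/add replacement for the multiply by 29, checked for all byte values.
shift_add_ok = all(((b << 5) - (b << 1) - b) == WB * b for b in range(256))

print(max_sum, fits_in_u16, shift_add_ok)
```

Note the parentheses around each shift: in C-like languages the subtraction binds tighter than the shift, so the unparenthesized form computes something else.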

Select an integer number of periods

Suppose we have sinusoidal with frequency 100Hz and sampling frequency of 1000Hz. It means that our signal has 100 periods in a second and we are taking 1000 samples in a second. Therefore, in order to select a complete period I'll have to take fs/f=10 samples. Right?
What if the sampling frequency is not a multiple of the frequency of the signal (like 550Hz)? Do I have to find the least common multiple M of f and fs, and then take M samples?
My goal is to select an integer number of periods in order to be able to replicate them without changes.
You have f periods a second, and fs samples a second.
If you take M samples, it would cover M/fs part of a second, or P = f * (M/fs) periods. You want this number to be integer.
So you need to take M = fs / gcd(f, fs) samples, which is the smallest M that works and covers P = f / gcd(f, fs) periods.
For your example M = 1000 / gcd(100, 1000) = 1000 / 100 = 10.
If you have 60 Hz frequency and 80 Hz sampling frequency, it gives M = 80 / gcd(60, 80) = 80 / 20 = 4 -- 4 samples will cover 4 * 1/80 = 1/20 part of a second, and that will be 3 periods.
If you have 113 Hz frequency and 512 Hz sampling frequency, you are out of luck, since gcd(113, 512) = 1 and you'll need 512 samples, covering the whole second and 113 periods.
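The rule above fits in a few lines of Python. This is a minimal sketch assuming both frequencies are integers in Hz; the function name whole_period_block is made up for illustration.

```python
from math import gcd

def whole_period_block(f, fs):
    """Return (M, P): the smallest number of samples M that spans a whole
    number of periods P, for signal frequency f and sampling frequency fs."""
    g = gcd(f, fs)
    return fs // g, f // g

# The three cases worked through above.
examples = {
    (f, fs): whole_period_block(f, fs)
    for f, fs in [(100, 1000), (60, 80), (113, 512)]
}
print(examples)
```

For the 550 Hz signal from the question sampled at 1000 Hz, gcd(550, 1000) = 50, so the block is 20 samples long and contains 11 periods.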
In general, an arbitrary frequency will not have an integer number of periods in a short buffer. If the ratio of the frequency to the sampling rate is irrational, the sampled waveform never repeats at all. So some means other than concatenation of buffers one period in length will be needed to synthesize exactly periodic waveforms of arbitrary frequencies. Approximation by interpolation for fractional phase offsets is one possibility.
