cv::cuda::GpuMat::create allocates much more than requested - opencv

I'm using the latest OpenCV 4.x with CUDA support and CUDA 11.6.
I want to allocate a GpuMat image in device memory like this:
cv::cuda::GpuMat test1;
test1.create(100, 1000000, CV_8UC1);
and I measure the memory consumed before and after the create call (using the nvidia-smi tool):
| 0 N/A N/A 372354 C ...aur/example_build/example 199MiB |
| 0 N/A N/A 389636 C ...aur/example_build/example 295MiB |
So + ~100 MB - makes sense.
But when I allocate the image this way (changed W and H):
cv::cuda::GpuMat test1;
test1.create(1000000, 100, CV_8UC1);
I see this:
| 0 N/A N/A 379124 C ...aur/example_build/example 199MiB |
| 0 N/A N/A 379124 C ...aur/example_build/example 689MiB |
I expected the same increment as in test1 though.
In various cases, consumption is x5 more than expected, when the image is "high and narrow". What do I understand wrong?

OpenCV's GpuMat uses a pitched allocation. If the minimum pitch is, for example, 512 bytes, then allocating a "narrow" image is going to be extra expensive.
On my Tesla V100, the minimum pitch (roughly the minimum "width" in bytes of each row) for a pitched allocation is 512. With 100-byte rows, 512/100 ≈ 5x.
No, I don't have any suggestions for workarounds. Allocate a wider image, or accept the extra cost.
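The arithmetic behind the 5x figure can be checked with a short sketch. It assumes a 512-byte pitch alignment (the value reported for the V100 in this answer; other GPUs may differ), and the helper name pitched_bytes is made up for the illustration:

```python
def pitched_bytes(rows, row_bytes, pitch_alignment=512):
    """Bytes consumed when every row is padded up to a multiple of pitch_alignment.

    pitch_alignment=512 is an assumption taken from the answer, not a universal constant.
    """
    pitch = -(-row_bytes // pitch_alignment) * pitch_alignment  # round up to a multiple
    return rows * pitch


MIB = 1024 * 1024
wide = pitched_bytes(100, 1_000_000)   # create(100, 1000000, CV_8UC1)
tall = pitched_bytes(1_000_000, 100)   # create(1000000, 100, CV_8UC1)
print(f"wide: {wide / MIB:.1f} MiB, tall: {tall / MIB:.1f} MiB")
```

Under that assumption the wide image needs about 95 MiB and the tall one about 488 MiB, which is close to the increments nvidia-smi shows in the question.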
I think most CUDA GPUs will have a minimum pitch of 512 bytes, because the minimum texture alignment is 512 bytes. You can use the following code to find yours:
$ cat
#include <iostream>
int main(){
char *d;
size_t p;
// Ask for rows that are only 1 byte wide; the returned pitch is the minimum row stride
cudaMallocPitch(&d, &p, 1, 100);
std::cout << p << std::endl;
}
$ nvcc -o t2060
$ compute-sanitizer ./t2060
========= ERROR SUMMARY: 0 errors
(As an aside, I don't know how you decided that your first example shows +100MB. I see 199MiB and 201MiB. The difference between those two appears to be 2MB. But this doesn't seem to be the crux of your question, and the 500MB allocation for a 100MB image of width 100 bytes is explained above.)


MedianBlur() calculate max kernel size

I want to use MedianBlur function with very high Ksize, like 301 or more. But if I pass ksize too high, sometimes the function will crash. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I tried reducing the ksize in a try/catch loop, but that is not very effective, since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break; // stop at the first coarse bin that pushes sum past t
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations of sum and the H.coarse array are quite complicated.
After further research, I found a scientific paper about the algorithm:
I am trying to read it but, honestly, I don't understand much of it.
How do I calculate the maximum ksize with a given image?
The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.
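If you take the 255 limit stated above at face value, one way to avoid the try/catch loop is to clamp the requested size before calling medianBlur. The helper below is only a sketch: the name clamp_median_ksize is hypothetical, the default limit of 255 comes from this answer and has not been verified against every OpenCV build, and the only hard rule it relies on is that medianBlur needs an odd ksize greater than 1.

```python
def clamp_median_ksize(requested, limit=255):
    """Return the largest odd kernel size that is >= 3 and <= min(requested, limit).

    limit=255 is the value claimed in the answer for CV_8U images (an assumption here).
    """
    k = max(3, min(int(requested), limit))
    return k if k % 2 == 1 else k - 1


for wanted in (301, 100, 2):
    print(wanted, "->", clamp_median_ksize(wanted))
```

The clamped value could then be passed as the ksize argument of medianBlur.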

python opencv create image from bytearray

I am capturing video from a Ricoh Theta V camera. It delivers the video as Motion JPEG (MJPEG). To get the video you have to do an HTTP POST alas which means I cannot use the cv2.VideoCapture(url) feature.
So the way to do this per numerous posts on the web and SO is something like this:
bytes = bytes()
while True:
    bytes += stream.read(1024)  # `stream` is the open HTTP response (its setup is not shown)
    a = bytes.find(b'\xff\xd8')
    b = bytes.find(b'\xff\xd9')
    if a != -1 and b != -1:
        jpg = bytes[a:b+2]
        bytes = bytes[b+2:]
        i = cv2.imdecode(np.frombuffer(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('i', i)
        if cv2.waitKey(1) == 27:
            break
That actually works, except it is slow. I'm processing a 1920x1080 JPEG stream on a MacBook Pro running OSX 10.12.6. The call to imdecode takes approx. 425000 microseconds to process each image.
Any idea how to do this without imdecode or make imdecode faster? I'd like it to work at 60FPS with HD video (at least).
I'm using Python3.7 and OpenCV4.
Updated Again
I looked into JPEG decoding from the memory buffer using PyTurboJPEG, the code goes like this to compare with OpenCV's imdecode():
#!/usr/bin/env python3
import cv2
import numpy as np
from turbojpeg import TurboJPEG, TJPF_GRAY, TJSAMP_GRAY
# Load image into memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# Decode JPEG from memory into Numpy array using OpenCV
i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
# Use default library installation
jpeg = TurboJPEG()
# Decode JPEG from memory using turbojpeg
i1 = jpeg.decode(r)
cv2.imshow('Decoded with TurboJPEG', i1)
cv2.waitKey(0)  # keep the window open until a key is pressed
And the answer is that TurboJPEG is 7x faster! That is 4.6ms versus 32.2ms.
In [18]: %timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
32.2 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit i1 = jpeg.decode(r)
4.63 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Kudos to @Nuzhny for spotting it first!
Updated Answer
I have been doing some further benchmarks on this and was unable to verify your claim that it is faster to save an image to disk and read it with imread() than it is to use imdecode() from memory. Here is how I tested in IPython:
import cv2
import numpy as np
# First use 'imread()'
%timeit i1 = cv2.imread('image.jpg', cv2.IMREAD_COLOR)
116 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now prepare the exact same image in memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# And try again with 'imdecode()'
%timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
113 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, I find imdecode() around 3% faster than imread() on my machine. Even if I include the np.asarray() into the timing, it is still quicker from memory than disk - and I have seriously fast 3GB/s NVME disks on my machine...
Original Answer
I haven't tested this but it seems to me that you are doing this in a loop:
read 1k bytes
append it to a buffer
look for JPEG SOI marker (0xffd8)
look for JPEG EOI marker (0xffd9)
if you have found both the start and the end of a JPEG frame, decode it
1) Now, most JPEG images with any interesting content I have seen are between 30kB and 300kB, so you are going to do 30-300 append operations on a buffer. I don't know much about Python, but I guess that may cause a re-allocation of memory, which I guess may be slow.
2) Next you are going to look for the SOI marker in the first 1kB, then again in the first 2kB, then again in the first 3kB, then again in the first 4kB - even if you have already found it!
3) Likewise, you are going to look for the EOI marker in the first 1kB, the first 2kB...
So, I would suggest you try:
1) allocating a bigger buffer at the start and acquiring directly into it at the appropriate offset
2) not searching for the SOI marker if you have already found it - e.g. set it to -1 at the start of each frame and only try and find it if it is still -1
3) only look for the EOI marker in the new data on each iteration, not in all the data you have already searched on previous iterations
4) furthermore, actually, don't bother looking for the EOI marker unless you have already found the SOI marker, because the end of a frame without the corresponding start is no use to you anyway - it is incomplete.
I may be wrong in my assumptions, (I have been before!) but at least if they are public someone cleverer than me can check them!!!
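The suggestions above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the asker's code: the class name MjpegFrameSplitter and its feed method are made up for the example, and it relies on the same simple marker search as the question, so it would be fooled by an embedded thumbnail carrying its own SOI/EOI markers. It grows a bytearray in place rather than preallocating a fixed buffer, so suggestion 1 is only approximated.

```python
class MjpegFrameSplitter:
    """Incrementally cut complete JPEG frames out of a byte stream (hypothetical helper)."""

    SOI = b"\xff\xd8"  # JPEG start-of-image marker
    EOI = b"\xff\xd9"  # JPEG end-of-image marker

    def __init__(self):
        self._buf = bytearray()
        self._start = -1    # offset of the current SOI, -1 while not found
        self._scanned = 0   # offset below which the buffer was already searched

    def feed(self, chunk):
        """Append new data and return a list of complete frames found so far."""
        self._buf += chunk
        frames = []
        while True:
            if self._start == -1:
                # Only look for SOI while no frame start is known.
                pos = self._buf.find(self.SOI, self._scanned)
                if pos == -1:
                    # Drop searched data, keeping one byte in case a marker is split.
                    del self._buf[:max(len(self._buf) - 1, 0)]
                    self._scanned = 0
                    break
                self._start = pos
                self._scanned = pos + 2
            # Only search the part of the buffer not searched before.
            end = self._buf.find(self.EOI, self._scanned)
            if end == -1:
                self._scanned = max(len(self._buf) - 1, self._start + 2)
                break
            frames.append(bytes(self._buf[self._start:end + 2]))
            del self._buf[:end + 2]
            self._start = -1
            self._scanned = 0
        return frames


# Tiny demonstration with fake "frames" fed three bytes at a time.
frame1 = b"\xff\xd8abc\xff\xd9"
frame2 = b"\xff\xd8defgh\xff\xd9"
data = b"junk" + frame1 + frame2
splitter = MjpegFrameSplitter()
found = []
for i in range(0, len(data), 3):
    found.extend(splitter.feed(data[i:i + 3]))
print(len(found), "frames extracted")
```

Each returned frame could then be handed to cv2.imdecode or to a TurboJPEG decoder.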
I recommend using turbo-jpeg. It has a Python API: PyTurboJPEG.

GPU vs CPU end to end latency for dynamic image resizing

I have used OpenCV and ImageMagick for some throughput benchmarking, and I am not finding the GPU to be much faster than CPUs. Our use case is to dynamically resize a master copy to the size requested in a service call, and I am trying to evaluate whether having a GPU makes sense for resizing per service call.
Here is the code I wrote for OpenCV. I run the following function serially for all the images stored in a folder, and ultimately I run N such processes to achieve X image resizes. I want to understand whether my evaluation approach is incorrect or whether the use case doesn't fit typical GPU use cases, and what exactly might be limiting GPU performance. I am not even getting the utilization anywhere close to 100%.
cv::Mat::setDefaultAllocator(cv::cuda::HostMem::getAllocator (cv::cuda::HostMem::AllocType::PAGE_LOCKED));
auto t_start = std::chrono::high_resolution_clock::now();
Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
auto t_end_read = std::chrono::high_resolution_clock::now();
if (src.empty())
std::cout<<"Image Not Found: "<< input_file << std::endl;
cuda::GpuMat d_src;
d_src.upload(src, stream); // host-to-device copy, timed by t_end_h2d
auto t_end_h2d = std::chrono::high_resolution_clock::now();
cuda::GpuMat d_dst;
cuda::resize(d_src, d_dst, Size(400, 400),0,0, CV_INTER_AREA,stream);
auto t_end_resize = std::chrono::high_resolution_clock::now();
Mat dst;
d_dst.download(dst, stream); // device-to-host copy, timed by t_end_d2h
auto t_end_d2h = std::chrono::high_resolution_clock::now();
std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",host2device,"<<std::chrono::duration<double, std::milli>(t_end_h2d-t_end_read).count()
<<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_h2d).count()
<<",device2host,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_end_resize).count()
<<",total,"<<std::chrono::duration<double, std::milli>(t_end_d2h-t_start).count()<<endl;
auto t_start = std::chrono::high_resolution_clock::now();
Mat src = imread(input_file,CV_LOAD_IMAGE_COLOR);
auto t_end_read = std::chrono::high_resolution_clock::now();
if (src.empty())
std::cout<<"Image Not Found: "<< input_file << std::endl;
Mat dst;
resize(src, dst, Size(400, 400),0,0, CV_INTER_AREA);
auto t_end_resize = std::chrono::high_resolution_clock::now();
std::cout<<"read,"<<std::chrono::duration<double, std::milli>(t_end_read-t_start).count()<<",resize,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_end_read).count()
<<",total,"<<std::chrono::duration<double, std::milli>(t_end_resize-t_start).count()<<endl;
Compiling : g++ -std=c++11 resizeCPU.cpp -o resizeCPU `pkg-config --cflags --libs opencv`
I am running each program N number of times controlled by following code :
echo $1
for (( c=$START; c<=$END; c++ ))
do
./resizeGPU "$c" &#>/dev/null #&disown;
done
echo All done
Run : ./
Those timers lead to the following aggregate data:
No_processes resizeCPU resizeGPU memcpyGPU totalresizeGPU
1 1.51 0.55 2.13 2.68
10 5.67 0.37 2.43 2.80
15 6.35 2.30 12.45 14.75
20 6.30 2.05 10.56 12.61
30 8.09 4.57 23.97 28.55
No of images run per process : 267
Average size of the image: 624Kb
According to the data above, as we increase the number of processes (and therefore the number of simultaneous resizes), the resize time (which includes the actual resize plus the host-to-device and device-to-host copies) increases far more on the GPU than on the CPU.
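One way to read the table is to ask how much of totalresizeGPU is spent in memcpyGPU. The sketch below only re-does arithmetic on the numbers already listed above; the function name copy_share is made up for the illustration, and the values are in whatever time unit the table uses.

```python
# (processes, resizeGPU, memcpyGPU, totalresizeGPU) copied from the table above
rows = [
    (1, 0.55, 2.13, 2.68),
    (10, 0.37, 2.43, 2.80),
    (15, 2.30, 12.45, 14.75),
    (20, 2.05, 10.56, 12.61),
    (30, 4.57, 23.97, 28.55),
]


def copy_share(memcpy, total):
    """Fraction of the total GPU time spent copying between host and device."""
    return memcpy / total


shares = [copy_share(memcpy, total) for _, _, memcpy, total in rows]
for (n, _, _, _), share in zip(rows, shares):
    print(f"{n:>2} processes: {100 * share:.0f}% of the GPU time is memory copy")
```

In every row the copies account for roughly four fifths of the GPU time, so the transfers rather than the resize kernel dominate these measurements.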
Similar results after using ImageMagick which uses OpenCL beneath
Code :
setenv("MAGICK_OCL_DEVICE","OFF",1); //Turn it ON to use GPU acceleration
Image image;
auto t_start_read = std::chrono::high_resolution_clock::now();
image.read( full_path ); // read call reconstructed from the leftover "full_path );"
auto t_end_read = std::chrono::high_resolution_clock::now();
image.resize( Geometry(400,400) );
auto t_end_resize = std::chrono::high_resolution_clock::now();
Results :
No_procs resizeCPU resizeGPU
1 63.23 8.54
10 76.16 31.04
15 76.56 50.79
20 76.58 71.68
30 86.29 140.17
Test Machine configuration:
4 GPU (Tesla P100) - but test only utilizes 1 GPU
64 CPU cores (over Intel Xeon 2680 v4 CPU )
OpenCV version : 3.4.0
ImageMagick version : 6.9.9-26 Q16 x86_64 2018-01-17
Cuda Toolkit : 9.0
It is highly probable that this is too late to help you. However, for people looking at this answer, this is my suggestion to improve performance. The way you are setting pinned memory does not give you the boost you are looking for.
This is: Using
//method 1
In the comments of this discussion, somebody suggested doing what you did. The person answering said that it was slower. I was timing an implementation of Sobel derivatives close to the one in Sobel. The main steps are:
1) reading the color image;
2) Gaussian blurring of the color image using a radius of 3 and a delta of 1;
3) grayscale conversion;
4) computing the x and y gradients;
5) merging them into the final output image.
Instead, I implemented a version swapping the order of steps 2 and 3: converting to grayscale first and then denoising the result with a Gaussian filter.
I was running openCV 3.4 in windows 10. Cuda 9.0. My CPU is an i7-6820HQ. GPU is a Quadro M1200.
I tried your method and this one:
//Method 2
//allocate pinned memory
cv::cuda::HostMem memory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);
//Read input image from the disk
Mat input = imread(input_file, CV_LOAD_IMAGE_COLOR);
if (input.empty())
std::cout << "Image Not Found: " << input_file << std::endl;
// copy the input image from CPU to GPU memory
cuda::GpuMat gpuInput;
cv::cuda::Stream stream;
gpuInput.upload(memory, stream);
//Do your processing...
//allocate pinned memory for output
cv::cuda::HostMem outMemory(siz.height, siz.width, CV_8U, cv::cuda::HostMem::PAGE_LOCKED);, stream);
cv::Mat output = outMemory.createMatHeader();
I calculated the gain as (t1-t2)/t1*100, where t1 is the time running the code normally and t2 is the time running it with pinned memory. Negative values mean the method is slower than running with non-pinned memory.
image size Gain % Method 1 Gain % Method 2
800x600 2.9 8.2
1280x1024 2.5 15.3
1600x1200 0.2 7.0
2048x1536 -2.3 14.6
4096x3072 -1.0 17.2
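For reference, the gain formula used for the table, (t1-t2)/t1*100, looks like this in code. The timings in the example calls are invented purely to show the sign convention; they are not measurements from this answer, and the function name gain_percent is hypothetical.

```python
def gain_percent(t1, t2):
    """Relative gain in percent: t1 is the baseline time, t2 the pinned-memory time."""
    if t1 <= 0:
        raise ValueError("baseline time t1 must be positive")
    return (t1 - t2) / t1 * 100.0


# Made-up timings: a faster pinned run gives a positive gain, a slower one a negative gain.
faster = gain_percent(10.0, 8.5)
slower = gain_percent(10.0, 10.2)
print(f"faster run: {faster:+.1f}%, slower run: {slower:+.1f}%")
```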

OpenCL can not detect my AMD GPU using OpenCV

I am using AMD Radeon R9 M375. I tried following this answer but it didn't work for me.
I followed this:
Here is my output from clinfo.exe
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Vendor ID: 1002h
Board name: AMD Radeon (TM) R9 M375
Device Topology: PCI[ B#4, D#0, F#0 ]
Max compute units: 10
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1015Mhz
Address bits: 32
Max memory allocation: 3019898880
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 3221225472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 00007FFF209D0188
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3
Version: OpenCL 1.2 AMD-APP (2348.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing
cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing
cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event cl_amd_liquid_flash
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 2200Mhz
Address bits: 64
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 8499593216
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2147483648
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 465
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 00007FFF209D0188
Name: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3 (sse2,avx)
Version: OpenCL 1.2 AMD-APP (2348.3)
What works:
std::vector<cv::ocl::PlatformInfo> platforms;
// Fill the platform list (the function name really is spelled this way in OpenCV)
cv::ocl::getPlatfomsInfo(platforms);
//OpenCL Platforms
for (size_t i = 0; i < platforms.size(); i++)
{
    //Access to Platform
    const cv::ocl::PlatformInfo* platform = &platforms[i];
    //Platform Name
    std::cout << "Platform Name: " << platform->name().c_str() << "\n";
    //Access Device within Platform
    cv::ocl::Device current_device;
    for (int j = 0; j < platform->deviceNumber(); j++)
    {
        //Access Device
        platform->getDevice(current_device, j);
        //Device Type
        int deviceType = current_device.type();
        cout << "Device Number: " << platform->deviceNumber() << endl;
        cout << "Device Type: " << deviceType << endl;
    }
}
The above code displays
Platform Name: Intel(R) OpenCL
Device Number: 2
Device Type: 2
Device Number: 2
Device Type: 4
Platform Name: AMD Accelerated Parallel Processing
Device Number: 2
Device Type: 4
Device Number: 2
Device Type: 2
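The numeric device types printed above are bit flags; in OpenCL (and in cv::ocl::Device) 2 means CPU and 4 means GPU. A small sketch for decoding them, with the helper name describe_device_type made up for the illustration:

```python
# Standard OpenCL device type bits: DEFAULT = 1 << 0, CPU = 1 << 1, GPU = 1 << 2, ACCELERATOR = 1 << 3
CL_DEVICE_TYPES = {1: "DEFAULT", 2: "CPU", 4: "GPU", 8: "ACCELERATOR"}


def describe_device_type(value):
    """Turn a device type bit field into a readable string such as 'GPU'."""
    names = [name for bit, name in CL_DEVICE_TYPES.items() if value & bit]
    return "|".join(names) if names else f"UNKNOWN({value})"


for device_type in (2, 4):
    print(device_type, "->", describe_device_type(device_type))
```

So each platform in the output exposes one CPU device and one GPU device, in opposite order.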
How do I go about making a Context from here using AMD as my GPU? The linked post says to use the method initializeContextFromHandler, but the OpenCV documentation on it is not sufficient.
Issue is resolved. I don't know what I did but AMD is working now.
Current settings (On Windows):
Environment Variable:
Value: AMD:GPU:Capeverde
Using setUseOpenCL(bool foo) present in ocl.hpp to select whether to use GPU or CPU.
Most likely problem: In my actual code, I wasn't doing any computation but when I wrote a simple code for subtraction of two matrices, AMD started working.
#include <opencv2/core/ocl.hpp>
#include <opencv2/opencv.hpp>
int main() {
cv::UMat mat1 = cv::UMat::ones(10, 10, CV_32F);
cv::UMat mat2 = cv::UMat::zeros(10, 10, CV_32F);
cv::UMat output = cv::UMat(10, 10, CV_32F);
cv::subtract(mat1, mat2, output);
std::cout << output << "\n";
return 0;
}

Out of memory exception for a matrix

I get a System.OutOfMemoryException for this simple code (a 10 000 * 10 000 matrix multiplied by itself):
#r "Microsoft.Office.Interop.Excel"
#r "FSharp.PowerPack.dll"
open System
open System.IO
open Microsoft.FSharp.Math
open System.Collections.Generic
let mutable Matrix1 = Matrix.create 10000 10000 0.
let matrix4 = Matrix1 * Matrix1
I have the following error:
System.OutOfMemoryException: An exception 'System.OutOfMemoryException' has been raised
Microsoft.FSharp.Collections.Array2DModule.ZeroCreate[T](Int32 length1, Int32 length2)
Microsoft.FSharp.Math.DoubleImpl.mulDenseMatrixDS(DenseMatrix`1 a, DenseMatrix`1 b)
Microsoft.FSharp.Math.SpecializedGenericImpl.mulM[a](Matrix`1 a, Matrix`1 b)
<StartupCode$FSI_0004>.$FSI_0004.main#() dans C:\Users\XXXXXXX\documents\visual studio 2010\Projects\Library1\Library1\Module1.fs:line 92
Stop due to an error
I have therefore 2 questions:
I have a 8 GB memory on my computer and according to my calculation a 10 000 * 10 000 matrix should take 381 MB [computed this way : 10 000 * 10 000 = 100 000 000 integers in the matrix => 100 000 000 * 4 bytes (integers of 32 bits) = 400 000 000 => 400 000 000 / (1024*1024) = 381 MB] so I cannot understand why there is an OutOfMemoryException
More generally (it's not the case here I think), I have the impression that F# interactive registers all the data and therefore overloads the memory, do you know of a way to free all the data registered by F# interactive without exiting F#?
In summary, fsi is a 32-bit process; at most it can hold 2GB of data. If you run your test as a 64-bit Windows application, you can increase the size of the matrix, but it is still subject to the 2GB limit on .NET objects.
Let me correct your calculation a little. Matrix1 is a float matrix, so each element occupies 8 bytes in memory. The total size of Matrix1 and matrix4 in memory is at least:
2 * 10000 * 10000 * 8 = 1 600 000 000 bytes ~ 1.6 GB
(ignoring some bookkeeping parts of matrix)
So it's no surprise that the 32-bit fsi runs out of memory in this case.
If you execute the test as a 64-bit Windows process, you can create float matrices of size around 15000 but not more than that. Check out this informative article for concrete numbers with different types of matrix elements.
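The two size estimates (4-byte integers in the question, 8-byte floats in this answer) can be compared with a few lines of arithmetic. The helper name dense_matrix_bytes is made up for the illustration, and like the calculation above it ignores bookkeeping overhead.

```python
def dense_matrix_bytes(rows, cols, element_bytes):
    """Raw storage for a dense rows x cols matrix, ignoring object overhead."""
    return rows * cols * element_bytes


MIB = 1024 * 1024
as_int32 = dense_matrix_bytes(10_000, 10_000, 4)    # the question's assumption
as_float64 = dense_matrix_bytes(10_000, 10_000, 8)  # what a float matrix actually stores
both = 2 * as_float64                               # Matrix1 plus the product matrix4

print(f"one int32 matrix:     {as_int32 / MIB:.1f} MiB")
print(f"one float64 matrix:   {as_float64 / MIB:.1f} MiB")
print(f"two float64 matrices: {both / MIB:.1f} MiB")
```

Two 8-byte matrices already come to about 1.5 GiB, which leaves very little of a 2GB address space for anything else.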
The amount of physical memory on your computer is not the relevant bottleneck - see Eric Lippert's great blog post for more information.
