Google Colab memory

I am working in Python locally for my research, and I am planning to buy a Colab plan because I am hitting the following memory error:
"MemoryError: Unable to allocate 3.38 TiB for an array with shape (681528, 681528) and data type float64"
My question is: how many compute units do you think I need in total?
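For scale, the figure in the error message can be reproduced directly; a dense array of that shape needs n * n * 8 bytes, which is far beyond the RAM of any Colab runtime, so the question of compute units may be secondary to restructuring the array (sparse, chunked, or float32). A quick check:

```python
# Reproducing the allocation size in the MemoryError: a dense
# 681528 x 681528 float64 array needs n * n * 8 bytes.
n = 681528
bytes_needed = n * n * 8        # 8 bytes per float64 element
tib = bytes_needed / 2**40      # 1 TiB = 2**40 bytes
print(f"{tib:.2f} TiB")         # -> 3.38 TiB, matching the error message
```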

Related

H2O model training error on large data: Error calling GET /3/Jobs

I am trying to build a model on a large dataset (2 million transactions) and getting the error below. There is no progress in the progress bar during model building, and after some time the job stops with the following error. We are running this on a single node; H2O is not distributed.
Please suggest whether this is related to a memory issue. For example, if we have 20 GB of training data, how much memory (heap size) should be given to H2O?
Is the complete training frame stored in heap memory?
Error fetching job '$03010a010d6832d4ffffffff$_9bf0e32df1dba1c2d24eb8a513f47a4'
Error calling GET /3/Jobs/%2403010a010d6832d4ffffffff%24_9bf0e32df1dba1c2d24eb8a513f47a4
HTTP connection failure: status=error, code=503, error=Service Temporarily Unavailable
Thanks
Deepti
It is possible that the H2O cluster went down due to an out-of-memory condition and your client lost communication with it. You would need to review the H2O logs to determine the cause.
A general rule of thumb is to have about 4x the memory of your dataset; see the docs. In your case, you would need about 80 GB to handle data manipulation and modeling.
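The rule of thumb above can be sketched as a one-liner; the 4x factor is the H2O guideline and the 20 GB figure is the dataset size from the question:

```python
# H2O heap sizing rule of thumb: roughly 4x the dataset size,
# to leave room for data manipulation and model building.
def h2o_heap_gb(dataset_gb, factor=4):
    return dataset_gb * factor

print(h2o_heap_gb(20))  # -> 80 (GB), as in the answer above
```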

SuperLU dgstrf returns a memory allocation failure when I increase the size of my sparse matrix

I am creating an FEA program and dealing with matrices on the order of 290-96 thousand squared, or 27-9 billion elements. These matrices are largely sparse, so I am using SuperLU to solve them. With smaller matrices I have been able to use SuperLU successfully, and the results matched validation data reasonably well. However, as I increase the size of my matrices, SuperLU's dgstrf function outputs an info value of roughly 900 million (914459273 one time, 893813121 another).
The documentation says that this info value is the "number of bytes allocated when memory allocation failure occurred, plus A->ncol." However, this does not explain how to work through the error. What is limiting the memory in this case? Does the library limit it? Is the limit hard-coded into the library, or determined during compilation? Is the memory limited by the compilation of my Fortran code?
I am writing my code in Fortran and am using the prebuilt c_fortran_dgssv.c file to link with SuperLU. This file does allow the system to "allocate space internally by system malloc" (lwork = 0). Is this something I could change in order to get more space?
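Taking the quoted documentation at face value, the info value can be decoded directly. A minimal sketch, with an assumed matrix order of 290,000 (substitute your actual A->ncol):

```python
# Decode dgstrf's info on a memory-allocation failure, per the
# SuperLU docs quoted above:
# info = (bytes allocated when the failure occurred) + A->ncol.
info = 914459273        # one of the values reported by dgstrf
ncol = 290_000          # assumed matrix order, for illustration only
bytes_at_failure = info - ncol
print(f"{bytes_at_failure / 2**20:.0f} MiB allocated before malloc failed")
```

So on this assumption the factorization had allocated on the order of 900 MiB when malloc failed; whether that reflects exhausted RAM or a 32-bit build limit, the failing allocation comes from fill-in generated during factorization rather than a limit hard-coded in SuperLU.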
I am calling the code with calls similar to the Fortran example:
! First, factorize the matrix.
nrhs = 1
ldb  = Dim3DFull
iopt = 1
call c_fortran_dgssv(iopt, Dim3DFull, TotalNonZeroElements_BCs, nrhs, &
                     Global_Matrix_T_Value_BC_CSC, Global_Matrix_T_Row_BC_CSC, &
                     Global_Matrix_T_Col_BC_CSC, Global_Temp, ldb, factors, info)
if (info .eq. 0) then
   write (*,*) 'Factorization succeeded'
else
   write (*,*) 'INFO from factorization = ', info
endif

! Second, solve the system using the existing factors.
iopt = 2
call c_fortran_dgssv(iopt, Dim3DFull, TotalNonZeroElements_BCs, nrhs, &
                     Global_Matrix_T_Value_BC_CSC, Global_Matrix_T_Row_BC_CSC, &
                     Global_Matrix_T_Col_BC_CSC, Global_Temp, ldb, factors, info)
if (info .eq. 0) then
   write (*,*) 'Solve succeeded'
else
   write (*,*) 'INFO from triangular solve = ', info
endif

! Last, free the storage allocated inside SuperLU.
iopt = 3
call c_fortran_dgssv(iopt, Dim3DFull, TotalNonZeroElements_BCs, nrhs, &
                     Global_Matrix_T_Value_BC_CSC, Global_Matrix_T_Row_BC_CSC, &
                     Global_Matrix_T_Col_BC_CSC, Global_Temp, ldb, factors, info)
Your matrix is far too large for a direct (i.e., factorization-based) solver. Direct solvers create a huge number of new non-zero elements (fill-in), which causes the program to run out of RAM. An iterative solver is the only practical option. There is too little space here for a full discussion, but you can find more detailed information on the following blog (and ask questions there as well): http://comecau.blogspot.com/2018_09_05_archive.html
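To make the suggestion concrete, here is a minimal pure-Python sketch of an iterative Krylov solver (conjugate gradient) on a small symmetric positive-definite tridiagonal system. Unlike a factorization, it only ever needs matrix-vector products, so no fill-in is created. The test matrix (a 1-D Laplacian) and sizes are invented for illustration; for a real FEA system you would use a library implementation, likely with a preconditioner:

```python
# Conjugate gradient on a symmetric tridiagonal SPD system, stored as
# two lists (main diagonal and off-diagonal). Memory use stays O(n):
# no factorization, hence no fill-in.

def matvec(diag, off, x):
    """y = A @ x for a symmetric tridiagonal matrix."""
    n = len(diag)
    y = [0.0] * n
    for i in range(n):
        y[i] = diag[i] * x[i]
        if i > 0:
            y[i] += off[i - 1] * x[i - 1]
        if i < n - 1:
            y[i] += off[i] * x[i + 1]
    return y

def cg(diag, off, b, tol=1e-12, maxiter=1000):
    """Plain conjugate gradient; only matrix-vector products needed."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual for x0 = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(maxiter):
        Ap = matvec(diag, off, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# 1-D Laplacian (SPD): 2 on the diagonal, -1 off-diagonal.
n = 50
diag = [2.0] * n
off = [-1.0] * (n - 1)
b = [1.0] * n
x = cg(diag, off, b)
residual = max(abs(bi - axi) for bi, axi in zip(b, matvec(diag, off, x)))
print(f"max residual component: {residual:.2e}")
```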

Passing arguments through __local memory in OpenCL

I am confused about __local memory in OpenCL.
I have read that the data flow has to be from host to __global, and then to __local.
But I also see kernel functions like this:
__kernel void foo(__local float * a)
I was wondering: how is data transferred directly into __local memory in this case?
Thanks.
It is not possible to fill a local buffer from the host side, so you have to follow the flow host -> __global -> __local.
A local buffer can either be declared as a kernel parameter, with its size set on the host, or declared inside the kernel on the GPU side.
Setting the size on the host gives you the advantage of deciding the size before the kernel is run, which matters if the local buffer size needs to be different each time the kernel runs.
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
As @doqtor said, the size of a __local kernel parameter is specified on the host via a clSetKernelArg call, passing the size in bytes and a NULL argument value.
Alternatively, a __local array with a fixed, compile-time size can be declared inside the kernel, in which case no kernel parameter is needed at all.

OpenCL: async_work_group_copy and maximum work group size

I'm trying to copy from global to local memory in OpenCL, using async_work_group_copy to do the transfer:
__local float gau2_sh[1024];
__local float gau4_sh[256];
event_t tevent = (event_t)0;
tevent = async_work_group_copy(gau2_sh, GAU2, 1024, tevent);
tevent = async_work_group_copy(gau4_sh, GAU4, 256, tevent);
wait_group_events(1, &tevent);
The global memory size of gau2 is 1024 * 4 bytes. With 128 or fewer threads it works fine, but with more than 128 threads the kernel fails with CL_INVALID_WORK_GROUP_SIZE.
My GPU is an Adreno 420, whose maximum work group size is 1024.
Do I need to consider anything else for the local memory copy?
It is caused by register usage and local memory.
Similarly to CUDA's -cl-nv-maxrregcount=<N>, the Qualcomm Adreno series has a compile option for reducing register usage.
The official document covering this is proprietary, so if you are concerned about it, please read the documentation included in the Qualcomm Adreno SDK.
For details, refer to the following links:
Using a barrier causes a CL_INVALID_WORK_GROUP_SIZE error
Questions about global and local work size
Qualcomm Forums - Strange Behavior With OpenCL on Adreno 320
Mobile Gaming & Graphics (Adreno) Tools and Resources
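For what it's worth, local memory capacity is unlikely to be the limit here; a quick footprint check of the two buffers in the question makes that plausible. The 32 KiB device limit below is an assumption for illustration (query CL_DEVICE_LOCAL_MEM_SIZE for the real figure):

```python
# Per-work-group local memory used by the buffers in the question.
FLOAT_BYTES = 4
gau2_sh = 1024 * FLOAT_BYTES   # __local float gau2_sh[1024]
gau4_sh = 256 * FLOAT_BYTES    # __local float gau4_sh[256]
total = gau2_sh + gau4_sh

assumed_local_mem = 32 * 1024  # CL_DEVICE_LOCAL_MEM_SIZE (assumed)
print(f"local usage: {total} of {assumed_local_mem} bytes")
```

5120 bytes fits comfortably under a limit of that order, which is why register pressure, not local memory capacity, is the likelier cause of CL_INVALID_WORK_GROUP_SIZE.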

OpenCL: maximum sizes of the memory-model types

How can I find the maximum size of each memory type in the OpenCL memory model (GPU & CPU): global memory, constant memory, local memory, and private memory?
And which of these types is the fastest?
"All of the memory" isn't really a relevant phrase for OpenCL. You use devices separately, and you poll them for their specs separately: use clGetDeviceInfo and pass in the constant for the value you want to query from the driver. Read more about this here: clGetDeviceInfo
If you want to know about total memory, you need to sum the values for the memory types you ask about.
As for the speed of the memory types, I answered a related question about this not too long ago: Stack Overflow.
