I have a Go program that is using too much memory and therefore getting killed, so I want to try to keep memory usage down. Here's a simplified (silly) version of what I'm doing that reveals the problem:
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "os/exec"
    "runtime"
    "runtime/debug"
    "strconv"
    "time"
)

func main() {
    source := "/tmp/1G.source"
    repeats, _ := strconv.Atoi(os.Args[1])
    m := &runtime.MemStats{}

    err := exec.Command("dd", "if=/dev/zero", "of="+source, "bs=1073741824", "count=1").Run()
    if err != nil {
        log.Fatalf("failed to create 1GB file: %s\n", err)
    }
    fmt.Printf("created 1GB source file, %s\n", memory_usage(m))

    // read it multiple times
    switch os.Args[2] {
    case "1":
        fmt.Println("re-using a byte slice and emptying it each time")
        // var data []byte
        for i := 1; i <= repeats; i++ {
            data, _ := ioutil.ReadFile(source)
            if len(data) > 0 { // just so we use data
                data = nil
            }
            fmt.Printf("did read %d, %s\n", i, memory_usage(m))
        }
    case "2":
        fmt.Println("ignoring the return value entirely")
        for i := 1; i <= repeats; i++ {
            ioutil.ReadFile(source)
            fmt.Printf("did read %d, %s\n", i, memory_usage(m))
        }
    case "3":
        fmt.Println("ignoring the return value entirely, forcing memory freeing")
        for i := 1; i <= repeats; i++ {
            ioutil.ReadFile(source)
            debug.FreeOSMemory()
            fmt.Printf("did read %d, %s\n", i, memory_usage(m))
        }
    }

    // wait in case garbage collection needs time to do something
    <-time.After(5 * time.Second)
    fmt.Printf("all done, %s\n", memory_usage(m))
    os.Exit(0)
}

// memory_usage reports a few runtime.MemStats fields in MB.
func memory_usage(m *runtime.MemStats) string {
    runtime.ReadMemStats(m)
    return fmt.Sprintf("system memory: %dMB; heap alloc: %dMB; heap idle-released: %dMB",
        int((m.Sys/1024)/1024), int((m.HeapAlloc/1024)/1024), int(((m.HeapIdle-m.HeapReleased)/1024)/1024))
}
If I call this with main 7 2 I get:
created 1GB source file, system memory: 2MB; heap alloc: 0MB; heap idle-released: 1MB
ignoring the return value entirely
did read 1, system memory: 4233MB; heap alloc: 3072MB; heap idle-released: 1024MB
did read 2, system memory: 4233MB; heap alloc: 3072MB; heap idle-released: 1024MB
did read 3, system memory: 4233MB; heap alloc: 3072MB; heap idle-released: 1024MB
did read 4, system memory: 4233MB; heap alloc: 3072MB; heap idle-released: 1023MB
did read 5, system memory: 6347MB; heap alloc: 3584MB; heap idle-released: 2559MB
did read 6, system memory: 6347MB; heap alloc: 3072MB; heap idle-released: 3071MB
did read 7, system memory: 6347MB; heap alloc: 3072MB; heap idle-released: 3071MB
all done, system memory: 6347MB; heap alloc: 3072MB; heap idle-released: 3071MB
Perhaps off-topic, but is it expected that reading in a 1GB file results in 4GB of system memory usage?
Anyway, ideally I want an unlimited number of identical loops to use a roughly constant amount of memory, instead of increasing from 4GB to 6GB.
So I thought forcing freeing of memory would help, but main 7 3 gives:
created 1GB source file, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
ignoring the return value entirely, forcing memory freeing
did read 1, system memory: 4237MB; heap alloc: 0MB; heap idle-released: 0MB
did read 2, system memory: 4237MB; heap alloc: 0MB; heap idle-released: 0MB
did read 3, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
did read 4, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
did read 5, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
did read 6, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
did read 7, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
all done, system memory: 6351MB; heap alloc: 0MB; heap idle-released: 0MB
How can I keep the memory usage down for all loops?
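One way to avoid the repeated large allocations in the first place (when you control the reading code) is to allocate the destination buffer once, outside the loop, and refill it on every iteration. A minimal sketch of that idea, reusing source, repeats and memory_usage from the program above and assuming the file size is known up front (it needs the io package imported):
buf := make([]byte, 1<<30) // one allocation, sized to the known 1GB file
for i := 1; i <= repeats; i++ {
    f, err := os.Open(source)
    if err != nil {
        log.Fatal(err)
    }
    // io.ReadFull reads into the existing slice, so there is no per-iteration allocation
    n, err := io.ReadFull(f, buf)
    f.Close()
    if err != nil && err != io.ErrUnexpectedEOF {
        log.Fatal(err)
    }
    _ = buf[:n] // work with the data here
    fmt.Printf("did read %d, %s\n", i, memory_usage(m))
}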
Following suggestions in the comments, I tried a new case:
case "4":
fmt.Println("doing a streaming read")
b := make([]byte, 10000, 10000)
for i := 1; i <= repeats; i++ {
f, _ := os.Open(source)
r := bufio.NewReader(f)
for {
_, err := r.Read(b)
if err != nil {
break
}
}
fmt.Printf("did read %d, %s\n", i, memory_usage(m))
}
}
But I still see memory usage increase with the number of loops:
created 1GB source file, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
doing a streaming read
did read 1, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
did read 2, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
did read 3, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
did read 4, system memory: 1MB; heap alloc: 0MB; heap idle-released: 0MB
did read 5, system memory: 2MB; heap alloc: 0MB; heap idle-released: 0MB
did read 6, system memory: 2MB; heap alloc: 0MB; heap idle-released: 0MB
did read 7, system memory: 2MB; heap alloc: 0MB; heap idle-released: 0MB
all done, system memory: 2MB; heap alloc: 0MB; heap idle-released: 0MB
To generalise the question: when you're using third-party functions (i.e. where you have no control over how they use memory internally) in a loop, and you're doing the exact same thing on every iteration, is there any way to force Go to re-use the memory it has already allocated instead of requesting more from the OS?
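When the allocations happen inside code you don't control, the remaining knobs are the garbage collector's. They won't shrink the peak a single call needs, but they do push the runtime to reuse memory it already has instead of growing further. A hedged sketch (values are placeholders; SetMemoryLimit needs Go 1.19+, and runtime/debug is already imported above):
// at the top of main
debug.SetGCPercent(20)        // default is 100; a lower value makes the GC run sooner and reuse spans earlier
debug.SetMemoryLimit(4 << 30) // Go 1.19+: soft cap (~4 GiB here) on memory the runtime takes from the OS
The GOGC and GOMEMLIMIT environment variables do the same thing without code changes.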
I'm using the latest OpenCV 4.x with CUDA support + CUDA 11.6.
I want to allocate a GpuMat image in device memory like this:
cv::cuda::GpuMat test1;
test1.create(100, 1000000, CV_8UC1);
and I measure the consumed memory before and after the create call (using the nvidia-smi tool).
Before:
| 0 N/A N/A 372354 C ...aur/example_build/example 199MiB |
After:
| 0 N/A N/A 389636 C ...aur/example_build/example 295MiB |
So roughly +100 MB, which makes sense.
But when I allocate the image this way (changed W and H):
cv::cuda::GpuMat test1;
test1.create(1000000, 100, CV_8UC1);
I see this:
Before:
| 0 N/A N/A 379124 C ...aur/example_build/example 199MiB |
After:
| 0 N/A N/A 379124 C ...aur/example_build/example 689MiB |
I expected the same increment as in the first case, though.
In various cases, consumption is x5 more than expected, when the image is "high and narrow". What do I understand wrong?
OpenCV GpuMat uses a pitched allocation. If the minimum pitch is for example 512 bytes, then allocating a "narrow" image is going to be extra-expensive.
On my Tesla V100, the minimum pitch (roughly, the minimum "width" in bytes of each line) for a pitched allocation is 512 bytes. 512/100 = 5x.
No, I don't have any suggestions for workarounds beyond allocating a wider image or accepting the extra cost.
I think most CUDA GPUs will have a minimum pitch of 512 bytes, because the minimum texture alignment is 512 bytes. You can use the following code to find yours:
$ cat t2060.cu
#include <iostream>

int main(){
    char *d;
    size_t p;
    cudaMallocPitch((void **)&d, &p, 1, 100);
    std::cout << p << std::endl;
}
$ nvcc -o t2060 t2060.cu
$ compute-sanitizer ./t2060
========= COMPUTE-SANITIZER
512
========= ERROR SUMMARY: 0 errors
$
(As an aside, I don't know how you decided that your first example shows +100MB. I see 199MiB and 201MiB. The difference between those two appears to be 2MB. But this doesn't seem to be the crux of your question, and the 500MB allocation for a 100MB image of width 100 bytes is explained above.)
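To make the arithmetic concrete, here is a small standalone sketch (plain CUDA runtime API, no OpenCV) that allocates both shapes with cudaMallocPitch and prints the resulting footprint; the narrow image costs pitch * height, which matches the ~490 MiB jump seen in nvidia-smi above:
#include <cstdio>
#include <cuda_runtime.h>

static void report(size_t widthBytes, size_t height) {
    char *d = nullptr;
    size_t pitch = 0;
    // each row is rounded up to the device's minimum pitch (512 bytes on many GPUs)
    cudaMallocPitch((void **)&d, &pitch, widthBytes, height);
    std::printf("%zu x %zu bytes -> pitch %zu, footprint ~%zu MiB\n",
                widthBytes, height, pitch, (pitch * height) >> 20);
    cudaFree(d);
}

int main() {
    report(1000000, 100); // "wide" image: pitch ~= row width, ~95 MiB
    report(100, 1000000); // "narrow" image: 100-byte rows padded to 512, ~488 MiB
}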
I've been testing the dask_ml.xgboost regressor on a synthetic 10GB dataset. When training, the memory usage of the workers exceeds the amount available on my local laptop. I am aware that I can try running on an online dask cluster with larger memory, or that I can sample the data (and ignore the rest) before training. But is there a different solution? I tried limiting the number and the depth of the trees generated, subsampling the rows and columns, and changing the tree construction algorithm but the workers still run out of memory.
Given a fixed memory allocation, is there a way to reduce the memory consumption of each worker when training dask_ml.xgboost?
Here is a code snippet:
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.xgboost import XGBRegressor
client = Client(memory_limit='7GB')
ddf = dd.read_csv('10GB_float.csv')
X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()
reg = XGBRegressor(
    objective='reg:squarederror', n_estimators=10, max_depth=2, tree_method='hist',
    subsample=0.001, colsample_bytree=0.5, colsample_bylevel=0.5,
    colsample_bynode=0.5, n_jobs=-1)
reg.fit(X, y)
The synthetic dataset 10GB_float.csv has 50 columns and 26758707 rows containing random floats (float64) ranging from 0 to 1. Below are the cluster details:
Cluster
Workers: 4
Cores: 12
Memory: 28.00 GB
And some information about my local laptop:
Memory: 31.1 GiB
Processor: Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
Additionally, here are the parameters of XGBRegressor (using .get_params()):
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 0.5,
'colsample_bynode': 0.5,
'colsample_bytree': 0.5,
'gamma': 0,
'importance_type': 'gain',
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 2,
'min_child_weight': 1,
'missing': None,
'n_estimators': 10,
'n_jobs': -1,
'nthread': None,
'objective': 'reg:squarederror',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.001,
'verbosity': 1,
'tree_method': 'hist'}
Thank you very much for your time!
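One generic lever that is independent of the XGBoost parameters: the synthetic data is float64, so downcasting to float32 before persisting roughly halves what each worker has to hold. A sketch of that (same file and column names as above; it may not be enough on its own, but it buys headroom):
import dask.dataframe as dd
from dask.distributed import Client

client = Client(memory_limit='7GB')

ddf = dd.read_csv('10GB_float.csv').astype('float32')  # halve the in-memory footprint

X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()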
I'm trying to use Xarray and Dask to open a multi-file dataset. However, I'm running into memory errors.
I have files that are typically this shape:
xr.open_dataset("/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom//LIG_coupled_fesom_thetao_19680101.nc")
<xarray.Dataset>
Dimensions: (depth: 46, nodes_2d: 126859, time: 366)
Coordinates:
* time (time) datetime64[ns] 1968-01-02 1968-01-03 ... 1969-01-01
* depth (depth) float64 -0.0 10.0 20.0 30.0 ... 5.4e+03 5.65e+03 5.9e+03
Dimensions without coordinates: nodes_2d
Data variables:
thetao (time, depth, nodes_3d) float32 ...
Attributes:
output_schedule: unit: d first: 1 rate: 1
30 files --> 41.5 GB
I also can set up a dask.distributed Client object:
Client()
<Client: 'tcp://127.0.0.1:43229' processes=8 threads=48, memory=68.72 GB>
So I assume there should be enough memory for the data to be loaded. However, when I then run xr.open_mfdataset, I very often get these sorts of warnings:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 8.25 GB -- Worker memory limit: 8.59 GB
I guess there is something I can do with the chunks argument?
Any help would be much appreciated; unfortunately I'm not sure where to begin. I could, in principle, open just the first file (they will always have the same shape) to figure out how best to re-chunk the files.
Thanks!
Paul
Examples of the chunks and parallel keywords to the opening functions, which control how dask is used, can be found in this doc section.
That should be all you need!
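For files shaped like the one above, that would look something like the sketch below; the glob pattern and chunk sizes are illustrative and should be tuned so each chunk stays well under the per-worker memory limit:
import xarray as xr
from dask.distributed import Client

client = Client()  # as above

ds = xr.open_mfdataset(
    "/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom/LIG_coupled_fesom_thetao_*.nc",
    combine="by_coords",
    parallel=True,                     # open the files in parallel with dask.delayed
    chunks={"time": 30, "depth": 23},  # per-file chunking; placeholder sizes
)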
I am using AMD Radeon R9 M375. I tried following this answer https://stackoverflow.com/a/34250412/8731839 but it didn't work for me.
I followed this: http://answers.opencv.org/question/108646/opencl-can-not-detect-my-nvidia-gpu-via-opencv/?answer=108784#post-id-108784
Here is my output from clinfo.exe
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon (TM) R9 M375
Device Topology: PCI[ B#4, D#0, F#0 ]
Max compute units: 10
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1015Mhz
Address bits: 32
Max memory allocation: 3019898880
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 3221225472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 00007FFF209D0188
Name: Capeverde
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2348.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing
cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing
cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event cl_amd_liquid_flash
Device Type: CL_DEVICE_TYPE_CPU
Vendor ID: 1002h
Board name:
Max compute units: 4
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 8
Preferred vector width double: 4
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 8
Native vector width double: 4
Max clock frequency: 2200Mhz
Address bits: 64
Max memory allocation: 2147483648
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 32768
Global memory size: 8499593216
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2147483648
Max global variable size: 1879048192
Max global variable preferred total size: 1879048192
Max read/write image args: 64
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 1
Error correction support: 0
Unified memory for Host and Device: 1
Profiling timer resolution: 465
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 00007FFF209D0188
Name: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
Vendor: GenuineIntel
Device OpenCL C version: OpenCL C 1.2
Driver version: 2348.3 (sse2,avx)
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2348.3)
What works:
std::vector<cv::ocl::PlatformInfo> platforms;
cv::ocl::getPlatfomsInfo(platforms);

//OpenCL Platforms
for (size_t i = 0; i < platforms.size(); i++)
{
    //Access to Platform
    const cv::ocl::PlatformInfo* platform = &platforms[i];

    //Platform Name
    std::cout << "Platform Name: " << platform->name().c_str() << "\n";

    //Access Device within Platform
    cv::ocl::Device current_device;
    for (int j = 0; j < platform->deviceNumber(); j++)
    {
        //Access Device
        platform->getDevice(current_device, j);

        //Device Type
        int deviceType = current_device.type();
        std::cout << "Device Number: " << platform->deviceNumber() << std::endl;
        std::cout << "Device Type: " << deviceType << std::endl;
    }
}
The above code displays
Platform Name: Intel(R) OpenCL
Device Number: 2
Device Type: 2
Device Number: 2
Device Type: 4
Platform Name: AMD Accelerated Parallel Processing
Device Number: 2
Device Type: 4
Device Number: 2
Device Type: 2
How do I go about making a Context from here, using AMD as my GPU? The linked post says to use the method initializeContextFromHandler, but the OpenCV documentation on it is not sufficient. Documentation Link
The issue is resolved. I don't know exactly what I did, but AMD is working now.
Current settings (On Windows):
Environment Variable:
Name: OPENCV_OPENCL_DEVICE
Value: AMD:GPU:Capeverde
Using setUseOpenCL(bool foo), present in ocl.hpp, to select whether to use the GPU or the CPU.
Most likely problem: in my actual code I wasn't doing any computation, but when I wrote a simple program that subtracts two matrices, AMD started working.
Code:
#include <opencv2/core/ocl.hpp>
#include <opencv2/opencv.hpp>
#include <cstdio>
#include <iostream>

int main() {
    cv::UMat mat1 = cv::UMat::ones(10, 10, CV_32F);
    cv::UMat mat2 = cv::UMat::zeros(10, 10, CV_32F);
    cv::UMat output = cv::UMat(10, 10, CV_32F);

    cv::subtract(mat1, mat2, output);
    std::cout << output << "\n";
    std::getchar();
}
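For anyone hitting the same thing, a quick sanity check after setting OPENCV_OPENCL_DEVICE is to print which device OpenCV actually bound to. A sketch using only the cv::ocl query functions:
#include <opencv2/core/ocl.hpp>
#include <iostream>

int main() {
    std::cout << "haveOpenCL: " << cv::ocl::haveOpenCL() << "\n"; // was an OpenCL runtime found at all?

    cv::ocl::setUseOpenCL(true);
    std::cout << "useOpenCL:  " << cv::ocl::useOpenCL() << "\n";  // will the transparent API actually use it?

    // the device UMat operations will run on
    cv::ocl::Device dev = cv::ocl::Device::getDefault();
    std::cout << "device:     " << dev.name() << " (" << dev.vendorName() << ")\n";
}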
I am doing a convolution in Theano:
theano.tensor.nnet.conv.conv2d(x,h, border_mode='full')
and it runs out of memory, I get the following message:
RuntimeError: GpuCorrMM failed to allocate working memory of 3591 x 319086
Apply node that caused the error: GpuCorrMM_gradInputs{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, (True, False, True, False)), CudaNdarrayType(float32, (False, True, False, False))]
Inputs shapes: [(1, 513, 1, 7), (1, 1, 513, 622)]
Inputs strides: [(0, 7, 0, 1), (0, 0, 622, 1)]
Inputs values: ['not shown', 'not shown']
I have tried setting the Theano flag 'optimizer_excluding=conv_dnn', but it still didn't work. Is there any way around this?
You are trying to allocate a matrix which needs something like 9TB of memory. An individual neuron needs 2.5GB of memory. The only optimization I know of for such issues is to either decrease the number of units or buy more RAM. Loads of RAM :)
For me, I disabled g++ at runtime by simply removing the (MinGW) bin directory from the PATH variable. Processing is slow, but it completes.
My program execution environment: OS Windows Vista 32-bit, CPU Intel 2.16 GHz, RAM 4.00 GB, and no GPU.
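For what it's worth, a less invasive way to get the same effect as removing MinGW from PATH is Theano's cxx configuration flag (assuming a Theano version that honours it): setting it to an empty string disables the C++ compiler, so Theano falls back to its pure-Python/NumPy implementations. For example (my_script.py is a placeholder):
$ THEANO_FLAGS="cxx=" python my_script.py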