I am running TensorFlow on an NVIDIA Jetson TX1 and I run out of memory when I train a large network such as GoogLeNet.
The CPU and GPU in the TX1 do not have separate memories; they share a single physical memory. However, TensorFlow seems to allocate separate memory spaces and copy data from the CPU side to the GPU side, so it requests twice as much memory as it really needs.
In my opinion, this situation could be handled by something like DMA access between the CPU and GPU. As far as I know, TensorFlow uses DMA between GPUs (I am not sure whether TensorFlow or the GPU driver handles this). Can I also use DMA between the CPU and GPU in TensorFlow? Or do you have any other suggestions?
EDIT: I just found that CUDA has a Zero Copy feature, which is exactly what I wanted. However, can I use this feature from TensorFlow?
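For context, here is a small PyCUDA sketch of what I mean by zero copy (mapped pinned memory): the host buffer is page-locked with the DEVICEMAP flag and the kernel reads and writes it through a device pointer, with no separate device allocation or memcpy. This is only an illustration of the CUDA feature (and assumes the device supports host-mapped memory, as the TX1 does); whether TensorFlow's allocator can be made to work this way is exactly my question.

```python
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

cuda.init()
dev = cuda.Device(0)
# MAP_HOST lets kernels dereference pinned host memory directly.
ctx = dev.make_context(cuda.ctx_flags.SCHED_AUTO | cuda.ctx_flags.MAP_HOST)

try:
    # Page-locked host buffer that is also mapped into the device address space.
    host_buf = cuda.pagelocked_empty(
        1024, dtype=np.float32, mem_flags=cuda.host_alloc_flags.DEVICEMAP)
    host_buf[:] = 1.0
    dev_ptr = np.intp(host_buf.base.get_device_pointer())

    mod = SourceModule("""
    __global__ void scale(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;   // touches the host buffer directly, no cudaMemcpy
    }
    """)
    mod.get_function("scale")(dev_ptr, block=(256, 1, 1), grid=(4, 1))
    ctx.synchronize()
    print(host_buf[:4])    # [2. 2. 2. 2.] without any explicit host<->device copy
finally:
    ctx.pop()
```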
I am new to the RAPIDS AI world, and I decided to try cuML and cuDF for the first time.
I am running Ubuntu 18.04 on WSL 2; my main OS is Windows 11. I have 64 GB of RAM and a laptop RTX 3060 GPU with 6 GB of memory.
As I write this post, I am running a TSNE fit over a cuDF dataframe of approximately 26 thousand values stored in 7 columns (all the values are numerical or binary, since the categorical ones have been one-hot encoded).
While classifiers like LogisticRegression or SVM were really fast, TSNE seems to take a while to output results (it has been more than an hour now, and it is still running even though the dataframe is not that big). Task Manager tells me that 100% of the GPU is being used for the calculation, yet running "nvidia-smi" in Windows PowerShell reports that only 1.94 GB out of a total of 6 GB are currently in use. This seems odd to me, since I have read that RAPIDS AI's TSNE algorithm is 20x faster than the standard scikit-learn one.
I wonder whether there is a way to increase the share of dedicated GPU memory to perform faster computations, or whether this is just a WSL 2 issue (perhaps it caps GPU memory usage at about 2 GB).
Any suggestions or thoughts?
Many thanks.
Task Manager tells me that 100% of the GPU is being used for the calculation
I'm not sure the Windows Task Manager can tell you the GPU throughput that is actually being achieved for the computations.
running "nvidia-smi" in Windows PowerShell reports that only 1.94 GB out of a total of 6 GB are currently in use
Memory utilisation is a different measurement from GPU throughput. Any GPU application will only use as much memory as it requests, and there is no correlation between higher memory usage and higher throughput, unless the application specifically provides a way to achieve higher throughput by using more memory (for example, a different algorithm for the same computation may use more memory).
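If you want to see both numbers side by side from Python rather than from Task Manager, the NVML bindings expose them directly. A small sketch, assuming the pynvml package is installed; utilization (how busy the SMs are) and memory usage are reported independently:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % of time kernels were executing
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes currently allocated on the device

print(f"GPU utilization: {util.gpu}%")
print(f"Memory used: {mem.used / 2**30:.2f} GiB of {mem.total / 2**30:.2f} GiB")

pynvml.nvmlShutdown()
```

A GPU that is 100% busy while only ~2 GB of its memory is in use is therefore not contradictory.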
TSNE seems to take a while to output results (it has been more than an hour now, and it is still running even though the dataframe is not that big)
This definitely seems odd, and not the expected behavior for a small dataset. What version of cuML are you using, and what is your method argument for the fit task? Could you also open an issue at www.github.com/rapidsai/cuml/issues with a way to access your dataset so the issue can be reproduced?
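In the meantime, it is worth double-checking the two things mentioned above. A minimal sketch, assuming the usual cuml.manifold.TSNE API; treat the exact parameter values as assumptions and verify them against your installed version:

```python
import cuml
from cuml.manifold import TSNE

print(cuml.__version__)  # include this in the GitHub issue

tsne = TSNE(
    n_components=2,
    method="barnes_hut",  # the approximate method; "exact" is far slower on larger inputs
    verbose=True,         # per-iteration logging makes stalls visible
)
# df is the ~26k-row cuDF dataframe described in the question
# embedding = tsne.fit_transform(df)
```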
On the Google Cloud Platform (GCP), I have the following specs:
Machine type: n1-standard-8 (8 vCPUs, 30 GB memory)
CPU platform: Intel Haswell
I am using a Jupyter notebook to fit an SVM to a large amount of NLP data. This process is very slow, and according to GCP I am only utilizing around 0.12% of the CPUs.
How do I increase CPU utilization?
As DazWilkin mentioned, you are actually using 12% (0.12 read as a fraction, i.e. 12/100), which corresponds to one vCPU out of your eight. This is because -- IIRC -- Jupyter is a Python app, and Python code is effectively single-threaded, so you are stuck on one core. You could reduce the number of vCPUs (the OS itself will happily use multiple cores, of course) to save yourself some money, but you will need to evaluate alternatives if you want to use more cores.
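If you stay with scikit-learn, one common workaround is to wrap the SVM in an estimator that exposes n_jobs, so joblib fans the work out over multiple processes. A hedged sketch (not your exact code): training several smaller SVCs on subsets of the data, one per vCPU:

```python
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

# Eight SVCs, each trained on 1/8 of the data in its own process,
# so all 8 vCPUs of the n1-standard-8 stay busy.
clf = BaggingClassifier(
    estimator=SVC(kernel="linear"),  # `base_estimator` on older scikit-learn versions
    n_estimators=8,
    max_samples=1.0 / 8,
    n_jobs=8,
)
# clf.fit(X_train, y_train)   # X_train / y_train are your NLP features and labels
```

For large, sparse NLP feature matrices, LinearSVC is also usually far faster than the kernelized SVC, independently of how many cores you use.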
I know I can use ImageMagick with OpenMP to use all the cores of my CPU, but the question is: can I use all the cores of my 10-computer cluster, or will ImageMagick only use the cores of my local CPU?
Thanks a lot.
OpenMP is based on the fork-join paradigm, which is bound to shared-memory systems by the very nature of the fork operation. Thus, you won't be able to use all the cores of your cluster, since the machines don't share their memory. OpenMP-enabled programs are limited to cores that share memory (the cores of the local CPU).
There is a way around this, though it may not be worth it. You can simulate a virtual NUMA architecture over your 10-computer cluster and execute ImageMagick on that virtual machine; ImageMagick will then believe it runs on a single system containing all your cores. That is what ScaleMP offers with its vSMP software. However, the performance gains depend strongly on the program's memory access patterns, since in such a VM an access may go over the network, which is orders of magnitude slower than cache or RAM access. You may take a significant performance hit depending on how ImageMagick accesses memory.
To do directly what you ask, instead of stretching OpenMP beyond its use cases, you could use a Message Passing Interface (MPI) framework such as Open MPI to parallelize the program over multiple networked computers. That would require rewriting the parallel portion of ImageMagick to use MPI, which may be a pretty daunting task depending on ImageMagick's code base.
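A lighter-weight variant of the MPI route, if rewriting ImageMagick is off the table, is to leave ImageMagick untouched and distribute whole images across the nodes instead. A rough sketch using mpi4py; the script name, file paths, and convert arguments are made up, and it assumes the images sit on a filesystem every node can see:

```python
# Run with e.g.: mpirun -np 10 --hostfile hosts python resize_all.py
from mpi4py import MPI
import glob
import subprocess

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    files = sorted(glob.glob("input/*.png"))
    chunks = [files[i::size] for i in range(size)]  # round-robin split across ranks
else:
    chunks = None

my_files = comm.scatter(chunks, root=0)  # each rank receives its share of the images

for path in my_files:
    # Each node runs its own (OpenMP-threaded) ImageMagick on its local cores.
    out_path = "output/" + path.split("/")[-1]
    subprocess.run(["convert", path, "-resize", "50%", out_path], check=True)
```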
My system has 32 GB of RAM, but the device information for the Intel OpenCL implementation says "CL_DEVICE_GLOBAL_MEM_SIZE: 2147352576" (~2 GB).
I was under the impression that on a CPU platform the global memory is just the "normal" RAM, so something like 30+ GB should be available to the OpenCL CPU implementation (and of course I am using the 64-bit version of the SDK).
Is there some secret setting to tell the Intel OpenCL driver to increase the global memory size and use all of the system memory?
SOLVED: I got it working by recompiling everything as 64-bit. As silly as it seems, I had assumed OpenCL worked like OpenGL, where you can easily allocate, say, 8 GB of texture memory from a 32-bit process and the driver handles the details for you (of course you can't allocate 8 GB in one sweep, but you can transfer multiple textures that add up to more than 4 GB).
I still think it is irritating that the OpenCL memory abstraction is limited to the address space of the process (at least for the Intel/AMD drivers), but perhaps there are subtle details or performance trade-offs behind that implementation choice.
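For anyone who wants to verify what the driver reports after such a change, here is a quick query of CL_DEVICE_GLOBAL_MEM_SIZE, shown via pyopencl (clGetDeviceInfo in the C API returns the same number):

```python
import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        if dev.type & cl.device_type.CPU:
            size_gib = dev.global_mem_size / 2**30  # CL_DEVICE_GLOBAL_MEM_SIZE in bytes
            print(f"{platform.name} / {dev.name}: {size_gib:.1f} GiB global memory")
```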
How does a kernel's working set, such as the number of registers it uses, affect the GPU's ability to hide memory latencies?
By spreading the lookup latency across a group of parallel threads (a warp): while one warp waits on a memory access, the scheduler issues instructions from another resident warp. The larger each thread's working set (for example, the number of registers), the fewer warps can be resident on a multiprocessor at once, and the less latency can be hidden. Refer to the CUDA Programming Guide in the CUDA SDK for details.
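To make the idea concrete, here is a small illustrative kernel written with Numba's CUDA support (my choice of tool, not the original answer's): each warp issues a global load and stalls, and the SM hides that stall by running whichever other warps are resident. How many warps can be resident is limited by per-thread register use, among other things.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(out, inp, factor):
    i = cuda.grid(1)                  # global thread index
    if i < inp.size:
        out[i] = inp[i] * factor      # the global load stalls this warp; the SM
                                      # executes other resident warps in the meantime

n = 1 << 20
d_in = cuda.to_device(np.arange(n, dtype=np.float32))
d_out = cuda.device_array_like(d_in)

threads_per_block = 256               # 8 warps per block keeps the scheduler supplied
blocks = (n + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](d_out, d_in, 2.0)
```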