A disastrous slowdown of cudaMalloc in NVIDIA drivers from version 285

In recent years we have used CUDA for time-critical tasks in many of our 64-bit projects. A few days ago I updated the NVIDIA drivers on my development system and found a disastrous slowdown of the algorithms that use CUDA. After some digging, it became clear that many sequential calls to cudaMalloc lead to increasing latency (each call is slower than the one before):
void *p[65000];
for (int n = 0; n < 65000; n++)
    cudaMalloc(&p[n], 256);   // each successive call takes longer than the one before
This code runs for about 4 seconds on NVIDIA drivers prior to version 285, but starting with driver version 285 it takes more than 8 minutes (about 120 times slower). Tested on a GeForce GTX 560 Ti, a GeForce GTX 460, and a Quadro FX 4600 on different x64 systems.
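For reference, here is a minimal self-contained sketch (not from the original post) that times each cudaMalloc individually, so the growth in per-call latency is visible rather than just the total run time:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    void *p[65000];
    for (int n = 0; n < 65000; n++) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMalloc(&p[n], 256);            // cudaMalloc blocks the host, so wall-clock timing is meaningful
        auto t1 = std::chrono::steady_clock::now();
        if (n % 5000 == 0)                 // sample every 5000th call
            std::printf("call %5d: %lld us\n", n, (long long)
                std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }
    return 0;
}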
So the question is: is this a bug in the new drivers? Or is it perhaps some kind of attempt to deal with fragmentation and improve memory management in CUDA (through more complex allocation)? Or something else?
UPDATE:
I reported this issue to NVIDIA and was told that they were able to reproduce it and have assigned it for investigation.

I tracked this down based on the OP's bug report. It turns out it was a known, already-reported issue, and it is fixed in CUDA 5.0. If you download the CUDA 5.0 Release Candidate or later (available to registered CUDA developers), you should see an improvement.
Edit: the fix will be in the CUDA 5 RC, not in the preview. So as of this edit (May 31, 2012), the fix is not yet available.
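Not part of the answer above, but until a fixed driver/toolkit is in hand, the per-call cost can be sidestepped by doing one large cudaMalloc up front and handing out fixed-size slices from it. A minimal sketch, with a hypothetical pool type (FixedPool) invented for illustration:

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical fixed-size sub-allocator: one cudaMalloc, many 256-byte slices.
struct FixedPool {
    char  *base = nullptr;
    size_t slice = 0, count = 0, next = 0;

    cudaError_t init(size_t slice_bytes, size_t n) {
        slice = slice_bytes; count = n; next = 0;
        return cudaMalloc(reinterpret_cast<void **>(&base), slice * count);
    }
    void *alloc() {                       // O(1): no driver call per allocation
        return next < count ? base + slice * next++ : nullptr;
    }
    void destroy() { cudaFree(base); base = nullptr; }
};

int main() {
    FixedPool pool;
    if (pool.init(256, 65000) != cudaSuccess) return 1;

    void *p[65000];
    for (int n = 0; n < 65000; n++)
        p[n] = pool.alloc();              // replaces 65000 separate cudaMalloc calls

    pool.destroy();                       // one cudaFree releases everything
    return 0;
}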

Related

Delphi TParallel not using all available CPU cores

We are migrating our multi-threaded application to Delphi XE7 and are testing the new TParallel.For function. We have found that it parallelizes well on laptops (Core i5/Windows 7 with 4 cores), achieving consistently close to 100% CPU usage.
When we run exactly the same code on an Intel Xeon/Windows 2008 R2 with 2x12 cores, it only achieves about 3% usage and appears to be only using 2 of the cores.
The same problem is evident using the Conway's Life demo sample application.
We have tried using the OTL which parallelizes close to 100% on the Xeon, but unfortunately we have run into the "Not enough quota" issue and can't seem to resolve that, either.
Has anyone else run into this? We have tried using the Stride parameter and the SetMinWorkerThreads() and SetMaxWorkerThreads() methods, but to no avail.

The maximum supported number of monitors in Direct3D (D3D9)

I'm developing a video wall system that uses the NVIDIA CUDA decoder library for decoding and Direct3D 9 (D3D9) for rendering, so we assume that dozens of monitors can be installed in the system.
(System: Intel Core i7 processor, 4x NVIDIA GTX 780, Windows 8)
However, the IDirect3D9::GetAdapterCount API returns at most 12, even if more than 12 monitors are installed in the system. If there are 11 monitors in the system, the API returns 11; with 12 monitors it returns 12; but with 13 monitors installed it still returns 12, not 13.
So in that case we cannot identify the adapter IDs of the extra monitors for rendering.
As far as I know, Windows supports up to 64 monitors, so I don't think it is a limitation of the OS.
I wonder if it is a limitation of D3D9. If you have some knowledge about it, please reply.
Thank you.
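For reference, a minimal sketch (not from the original post) of how the adapter count and the adapters' identities are typically queried through the D3D9 API; whether the count can ever exceed 12 on a given driver is exactly the open question here:

#include <d3d9.h>
#include <cstdio>

#pragma comment(lib, "d3d9.lib")

int main() {
    // Create the D3D9 object and ask how many display adapters it exposes.
    IDirect3D9 *d3d = Direct3DCreate9(D3D_SDK_VERSION);
    if (!d3d) return 1;

    UINT count = d3d->GetAdapterCount();
    std::printf("GetAdapterCount() = %u\n", count);

    for (UINT i = 0; i < count; ++i) {
        D3DADAPTER_IDENTIFIER9 id = {};
        if (SUCCEEDED(d3d->GetAdapterIdentifier(i, 0, &id)))
            std::printf("adapter %u: %s (%s)\n", i, id.Description, id.DeviceName);
    }

    d3d->Release();
    return 0;
}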

Is Dart Editor (28355) now much slower on win7?

I upgraded to the latest Dart Editor (28355) both on Win7-32 and Win8-64 today. On Win7 it appears to run extremely slowly; on Win8 it is fine. Running both the Dart Editor and the browser concurrently on Win7 was unworkable. With just the Dart Editor running, it was still very slow to respond to input, but much better (4 vs. 25 seconds). I did reboot a couple of times in desperation. The Win7 machine is an older dual-core Pentium with 4 GB RAM running at about 2 GHz (I'm on Win8 now). Previously I've had no major problems with the Dart Editor on Win7 over a period of six months or more.
Is this a known problem with 28355?
Do you have a lot of folders in the 'Files' tab?
I find reducing the number of folders or right-clicking and selecting 'Don't Analyse' on some folders helps performance.
I have not experienced anything as dramatically slow as you have reported though.
I've found that 28355 uses much more RAM than before (about 1 GB total now, for me). If you have a browser running and a couple of other programs, 4 GB will be used up pretty fast and cause slowdowns. I'm not sure about the difference with Windows 8; perhaps it manages memory more efficiently.

Installing an odd number of DDR memory modules

I have three DDR PC3200 memory modules, two of which are 512 MB and one of which is 1 GB. My motherboard manual describes 1-, 2-, and 4-module configurations; is it okay to install three?
If you're not using a dual-channel memory configuration, it should work. The only way to know is to try it and see.
This is not a programming question, BTW, and it will be closed very shortly.

Ubuntu 32 bit maximum address space

Jeff covered this a while back on his blog, in terms of 32-bit Vista.
Does the same 4 GB memory cap that applies to 32-bit Vista apply to 32-bit Ubuntu? Are there any 32-bit operating systems that have creatively solved this problem?
Ubuntu Server has PAE enabled in the kernel; the desktop version does not have this feature enabled by default.
This explains, by the way, why Ubuntu Server does not work in some hardware emulators whereas the desktop edition does.
Well, with Windows, there's something called PAE, which means you can access up to 64 GB of memory on a Windows machine. The downside is that most apps don't actually support using more than 4 GB of RAM. Only a small number of apps, like SQL Server, are programmed to actually take advantage of all the extra memory.
Yes, 32-bit Ubuntu has the same memory limitations.
There are exceptions to the 4 GB limitation, but they are application specific... As in, Microsoft SQL Server can use 16 gigabytes with "Physical Address Extension" (PAE) configured and supported and... ugh
http://forums.microsoft.com/TechNet/ShowPost.aspx?PostID=3703755&SiteID=17
Also, drivers on both Ubuntu and Windows reduce the amount of memory available within the 4 GB address space by mapping device memory into it. Graphics cards are particularly bad at this: your 256 MB graphics card is using up at least 256 MB of your address space...
If you can (your drivers support it and your CPU is new enough), install a 64-bit OS. Your 32-bit applications and games will run fine.
There seems to be some confusion around PAE. PAE stands for "Physical Address Extension" and is by no means a Windows feature. It is a hack Intel put in their Pentium II (and newer) chips to allow machines to address up to 64 GB of memory. On Windows, applications need to support PAE explicitly, but in the open-source world, packages can be compiled and optimized to your liking. The packages that could use more than 4 GB of memory on Ubuntu (and other Linux distros) are compiled with PAE support. This includes all server-specific software.
In theory, all 32-bit OSes have that problem: you only have 32 bits with which to address memory.
2^32 bytes / 2^10 (bytes per KB) / 2^10 (KB per MB) / 2^10 (MB per GB) = 2^2 GB = 4 GB.
There are some ways around it, though. (Look up the jump from 16-bit computing to 32-bit computing; they hit the same problem.)
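A quick way to see the same arithmetic from inside a program (just a sketch, not from any of the answers above): the pointer width bounds the virtual address space a process can use.

#include <cmath>
#include <cstdio>

int main() {
    // With N-byte pointers there are 2^(8*N) distinct byte addresses.
    // A 32-bit build prints 4 GiB; a 64-bit build prints a far larger number.
    double bytes = std::ldexp(1.0, 8 * (int)sizeof(void *));
    std::printf("sizeof(void*) = %zu bytes -> %.0f GiB of address space\n",
                sizeof(void *), bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}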
Linux supports a technology called PAE that lets you use more than 4 GB of memory; however, I don't know whether Ubuntu has it on by default. You may need to compile a new kernel.
Edit: Some threads on the Ubuntu forums suggest that the server kernel has PAE on by default; you could try installing that.
