cuMemGetInfo() and size_t limitation - memory

I would like to know my free and total memory on my GPU device thanks to the function cuMemGetInfo()
// ----- Before any variable initialization -----
size_t free;
size_t total;
CUresult result=cuMemGetInfo(&free,&total);
I'm getting the result :
Free memory : 4095 MB
Total memory : 4095 MB
I'm working on a Tesla C2070 with 6GB of memory on a 64bit Windows 7. However, my application is running in 32 bit. My code should give me something like :
Free memory : 5376 MB
Total memory : 5376 MB // values given by the deviceQuery.exe example of CUDA
I 4095*1024*1024 = 4293918720 is about 2^32 (after rounding). indeed, size_t s a pointer to a unsigned int (on 4 Bytes).
So here's my question. Is it possible to change the definition of size_t in order to point to a unsigned long for example?
thanks

size_t is usually 32 bits when compiling for 32 bits targets. If you want to get a 64 bits size_t, simply set your compiler to target 64 bits platforms.
If you're using the Visual Studio build chain, here is the documentation on the topic. If you're using the Qt build chain, I don't know how it's done, but it's certainly feasible somehow.
You say in the comments:
I can't run a 64bit application because of the compatibility between Qt, CUDA and Visual Studio. Is there any way to get around that?
To the best of my knowledge, it's not possible. If you build for 32 bits, then you will link against the 32 bits version of cuMemGetInfo, which works with 32 bits size_t. There's no getting around that.
However, if memory serves, I've worked in the past with 64 bits applications built with VS2012, CUDA 6 and Qt 4.8, so it should be possible to set your tool chain to work with 64 bits.

Related

Does memory address always refer to one byte, not one bit?

Can you confirm that memory address in a PC is alway pointing to one
byte (8 bits)?
If a float number needs 32 bits in memory, does the
computer allocate 4-sequential bytes (32 bits total) to represent
that number?
Yes, a memory address always contains a byte address. I can't think of a single CPU architecture that supports bit-level addressing.
A CPU native float will always be stored in sequential memory locations. This is true for all native CPU types.
let's make it more generic and say "computing unit" instead of "PC" so the answer will be YES. Some Designers do that for a purpose of performance.
EX: In ARM cortex m4 processor there are some memory space that is reserved for bit access (i.e. each address in that space contains only 1 bit). Read about bit banding for more details.
Yes absolutely and that is applicable for all data types (int, string, ..)

How will 64 bit variable be referenced in a 32 bit process?

I have a 64 bit kernel and i run 32 bit processes in userland.In the user process code ,if i declare a 64 bit variable ,how will it be referred.Will it incur 2 memory reads.?
basically the scenario is:
I need to use a 64 bit mask in my user process.
Approach 1 :
-> Use a u64bits variable.
Approach
-> Use a array of 2 32 bit variables.
First off: the kernel has no bearing on the answer to this question.
Second, I assume this is x86 you're talking about. Where possible, the compiler will place 64-bit values across 2 32-bit registers. For example, if you return a uint64_t from a function, the low 32 bits will be stored in the eax register, and the high bits will be in edx.
The compiler will generally do the right thing for performance and correctness: using an array will likely just confuse it and lead to worse results.
By the way, x86-64 CPUs will normally perform reads of 2 adjacent 32-bit words at the same speed as a single 64-bit read. The advantages of 64-bit mode are that arithmetic can be done directly on 64-bit values (1 64x64 multiplication instruction vs 3-4 32x32 instructions), there is much more space available in registers (16 registers instead of 8, registers are twice as wide), and of course the larger possible virtual address space.

16 bit Int vs 32 bit Int vs 64 bit Int

I've been wondering this for a long time since I've never had "formal" education on computer science (I'm in highschool), so please excuse my ignorance on the subject.
On a platform that supports the three types of integers listed in the title, which one's better and why? (I know that every kind of int has a different length in memory, but I'm not sure what that means or how it affects performance or, from a developer's view point, which one has more advantages over the other).
Thank you in advance for your help.
"Better" is a subjective term, but some integers are more performant on certain platforms.
For example, in a 32-bit computer (referenced by terms like 32-bit platform and Win32) the CPU is optimized to handle a 32-bit value at a time, and the 32 refers to the number of bits that the CPU can consume or produce in a single cycle. (This is a really simplistic explanation, but it gets the general idea across).
In a 64-bit computer (most recent AMD and Intel processors fall into this category), the CPU is optimized to handle 64-bit values at a time.
So, on a 32-bit platform, a 16-bit integer loaded into a 32-bit address would need to have 16 bits zeroed out so that the CPU could operate on it; a 32-bit integer would be immediately usable without any alteration, and a 64-bit integer would need to be operated on in two or more CPU cycles (once for the low 32-bits, and then again for the high 32-bits).
Conversely, on a 64-bit platform, 16-bit integers would need to have 48 bits zeroed, 32-bit integers would need to have 32 bits zeroed, and 64-bit integers could be operated on immediately.
Each platform and CPU has a 'native' bit-ness (like 32 or 64), and this usually limits some of the other resources that can be accessed by that CPU (for example, the 3GB/4GB memory limitation of 32-bit processors). The 80386 processor family (and later x86) processors made 32-bit the norm, but now companies like AMD and then Intel are currently making 64-bit the norm.
To answer your first question, the usage of a 16 bit vs a 32 bit vs a 64 bit integer depends on the context that it is used. Therefore, you really can't say one is better over the other, per say. However, depending on a situation, using one over another is preferable. Consider this example. Let's say you have a database with 10 million users and you want to store the year they were born. If you create a field in your database with a 64 bit integer then you have exhausted 80 megabytes of your storage; whereas, if you were to use a 16 bit field, only 20 megabytes of your storage will get used. You can use a 16 bit field here because the year people are born is smaller than the largest 16 bit number. In other words 1980, 1990, 1991 < 65535, assuming your field is unsigned. All in all, it depends on the context. I hope this helps.
A simple answer is to use the smallest one you KNOW will be safe for the range of possible values it will contain.
If you know the possible values are constrained to be smaller than a maximum-length 16-bit integer (e.g. the value corresponding to what day of the year it is - always <= 366) then use that. If you aren't sure (e.g. the record ID of a table in a database that can have any number of rows) then use Int32 or Int64 depending on your judgment.
Other can probably give you a better sense of of the performance advantages depending on what programming language you are using, but the smaller types use less memory and hence are 'better' to use if you don't need larger.
Just for reference, a 16-bit integer means there are 2^16 possible values - generally represented as between 0 and 65,535. 32-bit values range from 0 to 2^32 - 1, or just over 4.29 billion values.
This question On 32-bit CPUs, is an 'integer' type more efficient than a 'short' type? may add some more good information.
It depends on whether speed or storage should be optimized. If you are interested in speed and you are running SQL Server in 64 bit mode then 64 bit keys are what you need. A 64 bit processor running in 64 bit mode, is optimized to use 64 bit numbers and addresses. Likewise, a 64 bit processor running in 32 bit mode is optimized to use 32 bit numbers and addresses. For example, in 64 bit mode, all pushes and pops onto the stack are 8 bytes etc. Also fetch from cache and memory are again optimized for 64 bit numbers and addresses. The processor, running in 64 bit mode, may need more machine cycles to handle a 32 bit number just like a processor, running in 32 bit mode needs more machine cycles to handle a 16 bit number. The increases in processing time come for many reasons, but just think about the example of memory alignment: The 32 bit number may not be aligned on a 64 bit integral boundary which means loading the number requires shifting and masking the number after loading it into a register. At the very least, every 32 bit number must be masked before each operation. We are talking at least halving the processor's effective speed while handling 32 or 16 bit integers in 64 bit mode.
To provide a simple explanation to novice programmers. A bit is either a 0 or a 1.
a 16 bit Int is an integer represented by a string of 16 bits (16 0's and 1's)
a 32 bit Int is an integer represented by a string of 32 bits (32 0's and 1's)
a 64 bit Int is an integer represented by a string of 64 bits (64 0's and 1's)
Examples to drive those concepts home:
an example of a 16-bit integer would be 0000000000000110 which equals the int 6
an example of a 32-bit integer would be 00000000000000000100001000100110 which equals the int 16934.
an example of a 64-bit integer would be 0000100010000000010000100010011000000000000000000100001000100110 which equals the int 612562280298594854.
You can represent a larger number of integers with 64 bits than you can 32 bits than you can 16 bits. So the benefit of using fewer bits is you save space on the machine. The benefit of using more bits is you can represent more integers.

CUDA: memory transaction size for compute capability 1.2 or later

all,
From "NVIDIA CUDA Programming Guide 2.0" Section 5.1.2.1:
"Coalescing on Devices with Compute Capability 1.2 and Higher"
"Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data."
I have a doubt here: since each half-warp has 16 threads, if all threads access 8-bit data, then the total size for per half-warp should be 16 * 8-bit=128 bits= 16 bytes. While "Guide" says "32 bytes for 8-bit data". It seems half bandwidth is wasted. Am I understanding correctly?
Thanks
Deryk
Yes. Memory access is always in chunks of 32, 64 or 128 bytes, regardless of how much you actually need from that memory line.
Update:
Question: How does that explain 64 bytes for 16 bit data ?
The value: 32bytes for 1byte-words, 64bytes for 2byte-words and 128bytes for higher-byte words is the maximum size of the accessed segment. If, for example, each thread is fetching 2-byte word and your access is perfectly aligned, the memory access will be reduced to use only 32-byte line fetch.
Check out the section G.3.2.2 "Devices of Compute Capability 1.2 and 1.3" of "CUDA programming guide (v3.2)".
I see you used CUDA PG v. 2.0 (and probably CUDA 2.0 compiler). There were lots of improvements (in particular: bug fixes) since then.

Why does a 32-bit OS support 4 GB of RAM?

Just reading some notes in a purdue lecture about OSs, and it says:
A program sees memory as an array of
bytes that goes from address 0 to 2^32-1 (0 to
4GB-1)
Why 4 GB?
Because 32 bits are able to represent numbers up to 232 − 1 = 4294967295 = 4 GiB − 1 and therefore address up to 232 individual bytes which would be 4 GiB then.
There are ways to circumvent that, though. For example using PAE even a 32-bit operating system can support more memory. Historically this has most commonly been used on servers, though. Also, the non-server Windows SKUs don't support it. By now all that is moot, though, given that 64-bit CPUs, OSes and driver support are commonplace.
Because each byte of memory has to have an address. In a 32-bit operating system, an address is 32 bits long; thus, there are 2^32 possible addresses, which means there are 2^32 bytes = 4 GB.
If you have a 4-bit system, this means the address for each byte is 4 binary digits, the probability of all your address will range from 0000 through 1111 which is 2^4 = 16 (2 because there is either 0 or 1), with four bits it's possible to create 16 different values of zeros and ones, If you have 16 different addr. each represent a byte then you can have a max of 16 bytes
4-bit system will look like this:
For a 32-bit system, your max is 2^32 = 4294967292 bytes
Everybody is saying 2^32 = 4GiB, which is right. Just in case, here is how we got there:
A 32-bit machine uses 32 bits to address memory. Each bit has a value of 0 or 1. If you have 1 bit, you have two possible addresses: 0 or 1.
A two-bit system ( pun aside ) has four possible address: 00 =0, 01=1, 10=2, 11=3. 2^2=4.
Three bits have 8 possble addresses: 000=0, 001=1, 010=2, 011=3, 100=4, 101=5, 110=6, and 111=7.
Each bit doubles the potential address space, which is why 2^n tells you how many addresses you use for a given number of bits. 2^1 = 2, 2^2 = 2*2 = 4, 2^3 = 2*2*2 = 8, etc.
By the time you get to 32 bits, you are at 4GiB.
4 GB = 2^32 bytes.
2 ^ 32 = 4 * 1024 * 1024 * 1024
That, in bytes, is the definition of 4 GB. In other words a 32-bit register as a memory pointer can address 4 GB of memory and no more.
Actually, it's not as simple as 2^32 = 4294967296 bytes. You see in x86 protected mode, with paging enabled (that is, what you get when you use any modern OS), you don't address memory locations directly, even though the paging translation mechanism is transparent for client applications.
Of a logical 32 bit memory address, when using 4K pages:
bits 22-31 refer to a page directory
bits 12-21 refer to a page table
bits 11-0 refer to an offset in the 4096 byte page
As you can see, you have 2^10 (1024) page directories, in each page directory, you have 2^10 page tables and each page is 2^12 (4096) bytes long, hence 2^32 = 4294967296 bytes. The width of the memory bus is conveniently the same as the word length of the CPU but it's not necessary to be like this at all. In fact, more modern x86 CPUs support PAE which enables addressing more than 4GB (or GiB) even in 32-bit mode.
32bits can represent numbers 0..2^32 = 0..4,294,967,296
32bits can address up to 2^32Bytes (assuming Byte-size blocks)
2^32Bytes is the max size
2^32B = 4,194,304KiB = 4,194MiB = 4GiB
Because is the amount of different memory addresses (in Bytes) that can be stored in a Word.
But, in fact, that's not always true (in most of cases it isn't), the OS can handle more physical memory than that (with PAE) and the applications can use less than 4GB of virtual memory (because part of that virtual memory is mapped to the OS, 1GB in Linux and 2GB in Windows, for example).
Another scenario where that doesn't apply is if the memory was addressed by Words instead of Bytes, then the total memory addressable would be 16GB, for example.
A CPU with 32 bit registers will need the operating system to calculate everything in chunks of 32 bits. It's a hardware requirement to which the OS must conform. Similarly, CPUs with 64 bit registers will need an operating system that reads and writes data from the RAM in chunks of 64 bits. (Every time you read data from memory, you need to read it into one of those registers - be it 32 bit, or 64 bit, or 16 bit, etc.)
A 32 bit register can store 2^32 different RAM addresses.
Each RAM address corresponds to a byte (8 bits) in modern RAMs. (The 4 GB argument is true only for those RAMs that have addresses for every byte.)
=> 2^32 = 4,294,967,296‬ addresses, → that corresponds to 4,294,967,296‬ bytes.
Now, 1 KB = 2^10 bytes or 1024 bytes (in the binary system)
Therefore, 4,294,967,296‬ bytes / 1024 = 4,194,304‬ KB
4,194,304‬ KB / 1024 = 4,096‬ MB
4,096‬ MB / 1024 = 4 GB
Mainly due to 32bit OS chosing to support only 2^32-1 addresses.
If the CPU has more than 32 address lines on the FSB, then the 32bit OS can choose to use a paging mechanism to access more than 4GiB. (For example Windows 2000 Advanced Server/Data Center editions on PAE supported Intel/AMD chips)
4 GB = 2^32 bytes.
But remember its max 4gb allocated by a 32 bit OS. In reality, the OS will see less e.g. after VRAM allocation.
As previously stated by other users, 32-bit Windows OSes use 32-bit words to store memory addresses.
Actually, most 32-bit chips these days use 36-bit addressing, using Intel's Physical Address Extension (PAE) model. Some operating systems support this directly (Linux, for example).
As Raymond Chen points out, in Windows a 32-bit application can allocate more than 4GB of memory, and you don't need 64-bit Windows to do it. Or even PAE.
For that matter, 64-bit chips don't support the entire 64-bit memory space. I believe they are currently limited to 42-bit space... the 36-bit space that PAE uses, plus the top 8-bit addresses,

Resources