How will a 64-bit variable be referenced in a 32-bit process?

I have a 64-bit kernel and I run 32-bit processes in userland. If I declare a 64-bit variable in the user process code, how will it be referenced? Will it incur 2 memory reads?
Basically the scenario is:
I need to use a 64-bit mask in my user process.
Approach 1:
-> Use a u64bits variable.
Approach 2:
-> Use an array of 2 32-bit variables.

First off: the kernel has no bearing on the answer to this question.
Second, I assume this is x86 you're talking about. Where possible, the compiler will place 64-bit values across 2 32-bit registers. For example, if you return a uint64_t from a function, the low 32 bits will be stored in the eax register, and the high bits will be in edx.
The compiler will generally do the right thing for performance and correctness: using an array will likely just confuse it and lead to worse results.
By the way, x86-64 CPUs will normally perform reads of 2 adjacent 32-bit words at the same speed as a single 64-bit read. The advantages of 64-bit mode are that arithmetic can be done directly on 64-bit values (1 64x64 multiplication instruction vs 3-4 32x32 instructions), there is much more space available in registers (16 registers instead of 8, registers are twice as wide), and of course the larger possible virtual address space.
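As a rough illustration of approach 1, here is a minimal sketch of a 64-bit mask held in a single uint64_t. The helper names (mask_set, mask_clear, mask_test, mask_low, mask_high) are made up for this example; they are not from the question. The last two only show what the two 32-bit halves would look like if you wanted to inspect them, which is roughly what the compiler handles for you in a 32-bit build.

```c
#include <stdint.h>
#include <stdbool.h>

/* Set bit n (0..63). UINT64_C(1) keeps the shift 64 bits wide even in a
   32-bit process, where int and long are only 32 bits. */
static inline uint64_t mask_set(uint64_t mask, unsigned n)
{
    return mask | (UINT64_C(1) << n);
}

/* Clear bit n (0..63). */
static inline uint64_t mask_clear(uint64_t mask, unsigned n)
{
    return mask & ~(UINT64_C(1) << n);
}

/* Test bit n (0..63). */
static inline bool mask_test(uint64_t mask, unsigned n)
{
    return (mask >> n) & 1u;
}

/* The two 32-bit halves, shown only for illustration. */
static inline uint32_t mask_low(uint64_t mask)  { return (uint32_t)mask; }
static inline uint32_t mask_high(uint64_t mask) { return (uint32_t)(mask >> 32); }
```

Writing the mask this way leaves the choice of registers and loads to the compiler, which is the point made above about not hand-splitting it into an array.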

Related

Reading a bit from memory

I'm looking into reading single bits from memory (RAM, hard disk). My understanding was that one cannot read less than a byte.
However, I read someone saying it can be done with assembly.
I want the bandwidth usage to be as low as possible, and the data to be retrieved is not sequential, so I cannot read a byte and convert it to 8 bits.
I don't think the CPU will read less than the size of a cache line from RAM (64 bytes on recent Intel chips). From disk, the minimum is typically 4 kiB.
Reading a single bit at a time is neither possible nor necessary, since the data bus is much wider than that.
You cannot read less than a byte from any PC or hard disk that I know of. Even if you could, it would be extremely inefficient.
Some machines do memory mapped port io that can read/write less than a byte to the port, but it still shows up when you get it as at least a byte.
Use the bitwise operators to pick off specific bits as in:
char someByte = 0x3D; // In binary, 111101
bool flag = someByte & 1; // Get the first bit, 1
flag = someByte & 2; // Get the second bit, 0
// And so on. The number after the & operator is a power of 2 if you want to isolate one bit.
// You can also pick off several bits like so:
int value = someByte & 3; // Assume the lower 2 bits are interesting for some reason
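The same idea extends to picking bit number i out of a larger buffer: read the byte that contains it, then shift and mask. This is only a sketch; get_bit is a hypothetical helper name, and it assumes bit 0 is the least significant bit of byte 0.

```c
#include <stddef.h>
#include <stdbool.h>

/* Return bit number bit_index from buf, counting bit 0 as the least
   significant bit of buf[0]. The whole containing byte is still read. */
static bool get_bit(const unsigned char *buf, size_t bit_index)
{
    unsigned char byte = buf[bit_index / 8];   /* byte holding the bit */
    return (byte >> (bit_index % 8)) & 1u;     /* move it down and isolate it */
}
```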
It used to be, say in the 386/486 days, that a memory chip was one bit wide, 1 meg by 1 bit, but you would have 8 chips or some multiple of that, one for each bit lane on the bus, and you could only read in widths of the bus. Today the memories are a byte wide and you can only read in units of 32 or 64 bits or multiples of those. Even when you read a byte, most designs read the whole bus width. Isolating a byte all the way out to the memory adds unnecessary complication and cost, so a byte read looks like a 32- or 64-bit read to most of the system; only as it approaches the edge of the processor (sometimes the physical pins, sometimes the edge of the core inside the chip) is the individual byte lane separated out and the other bits discarded. Having the cache on changes the smallest read size from the memory: you will see a burst or block of reads.
It is possible to design a memory system that is 8 bits wide and reads 8 bits at a time, but why would you? Unless it is an 8-bit processor, which probably couldn't take advantage of an 8-bit by 2 gig memory. DRAM is pretty slow anyway, something like 133 MHz (even your 1600 MHz memory only runs at that rate in short bursts as you read from slow parts; memory has not gotten faster in over 10 years).
Hard disks are similar but different. I think sectors are the smallest unit, and you have to read or write in those units. So when reading you have a memory cycle on the processor, no different than going to a memory, and depending on the controller, either before you do the read or as a result of it, a sector is read off the disk into a buffer, not unlike a cache line read. Then your memory cycle to the buffer in the disk controller either causes a bus-width read and the processor divides it up, or, if the bus adds the complexity to isolate byte lanes, you isolate a byte. But nobody isolates bit lanes. (I say the word nobody and someone will come back with an exception...)
Most of this is well documented and not hard to find. For ARM platforms look for the AMBA and/or AXI specifications, which can be freely downloaded. Documents for the various bridges, PCIe controllers, and disk controllers are all available for PCs and other platforms. It still boils down to an address bus and a data bus (or one outgoing and one incoming data bus) and some control signals that indicate the access type. Some buses have byte lane enables, which are generally for writes, not reads. If I want to write only a byte to DRAM in a modern 64-bit system, I DO have to tell everyone almost all the way out to the DRAM what I want to write. To write a byte on a memory module that must be accessed 64 bits at a time, at a minimum a 64-bit read happens into a temporary place, either the cache or the memory controller; then the byte to be written modifies the specific byte within the 64-bit word; then that 64-bit quantity is, eventually, written back to the memory module itself. You can do this using a combination of the address bits and a few control signals, or you can just use 8 byte lane enables and ignore the lower address bits. Hard disk, same deal: you have to read a sector, modify one byte, then eventually write the whole sector at a time. With flash and EEPROM, you can only write zeros (from the programmer's perspective); you erase to ones (from the programmer's perspective; it is actually a zero in the logic, as there is an inversion), and a write has to be a sector at a time. Sectors are typically 64, 128, or 256 bytes.
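The read-modify-write step described above can be modelled in software. This is a sketch of the idea only, not of how any particular memory controller is built; write_byte_lane is a hypothetical name, and lane 0 is assumed to be the least significant byte.

```c
#include <stdint.h>

/* Model of a byte write into memory that is only accessible 64 bits at a
   time: take the 64-bit word that was read, replace one byte lane (0..7),
   and return the word that would be written back. */
static uint64_t write_byte_lane(uint64_t word, unsigned lane, uint8_t value)
{
    unsigned shift = lane * 8;
    uint64_t lane_mask = UINT64_C(0xFF) << shift;   /* selects the target byte */
    return (word & ~lane_mask) | ((uint64_t)value << shift);
}
```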

16 bit Int vs 32 bit Int vs 64 bit Int

I've been wondering this for a long time since I've never had "formal" education on computer science (I'm in highschool), so please excuse my ignorance on the subject.
On a platform that supports the three types of integers listed in the title, which one's better and why? (I know that every kind of int has a different length in memory, but I'm not sure what that means or how it affects performance or, from a developer's view point, which one has more advantages over the other).
Thank you in advance for your help.
"Better" is a subjective term, but some integers are more performant on certain platforms.
For example, in a 32-bit computer (referenced by terms like 32-bit platform and Win32) the CPU is optimized to handle a 32-bit value at a time, and the 32 refers to the number of bits that the CPU can consume or produce in a single cycle. (This is a really simplistic explanation, but it gets the general idea across).
In a 64-bit computer (most recent AMD and Intel processors fall into this category), the CPU is optimized to handle 64-bit values at a time.
So, on a 32-bit platform, a 16-bit integer loaded into a 32-bit register would need to have the upper 16 bits zeroed out so that the CPU could operate on it; a 32-bit integer would be immediately usable without any alteration, and a 64-bit integer would need to be operated on in two or more CPU cycles (once for the low 32 bits, and then again for the high 32 bits).
Conversely, on a 64-bit platform, 16-bit integers would need to have 48 bits zeroed, 32-bit integers would need to have 32 bits zeroed, and 64-bit integers could be operated on immediately.
Each platform and CPU has a 'native' bit-ness (like 32 or 64), and this usually limits some of the other resources that can be accessed by that CPU (for example, the 3GB/4GB memory limitation of 32-bit processors). The 80386 processor family (and later x86) processors made 32-bit the norm, but now companies like AMD and then Intel are currently making 64-bit the norm.
To answer your first question, the choice between a 16-bit, a 32-bit, and a 64-bit integer depends on the context in which it is used. Therefore, you really can't say one is better than the other, per se. However, depending on the situation, using one over another is preferable. Consider this example. Let's say you have a database with 10 million users and you want to store the year they were born. If you create a field in your database with a 64-bit integer, then you have used 80 megabytes of your storage; whereas, if you were to use a 16-bit field, only 20 megabytes of your storage will be used. You can use a 16-bit field here because the year people are born is smaller than the largest 16-bit number. In other words, 1980, 1990, 1991 < 65535, assuming your field is unsigned. All in all, it depends on the context. I hope this helps.
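The storage arithmetic in that example can be checked with a few lines of C. This is just a sketch: column_bytes is a hypothetical helper, and it counts only the raw field bytes, ignoring whatever per-row overhead a real database adds.

```c
#include <stdint.h>
#include <stddef.h>

/* Raw bytes needed to store one fixed-size field for every row. */
static uint64_t column_bytes(uint64_t rows, size_t field_size)
{
    return rows * (uint64_t)field_size;
}
```

With 10 million rows, sizeof(uint64_t) gives 80,000,000 bytes and sizeof(uint16_t) gives 20,000,000 bytes, matching the 80 MB and 20 MB figures above.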
A simple answer is to use the smallest one you KNOW will be safe for the range of possible values it will contain.
If you know the possible values are constrained to be smaller than a maximum-length 16-bit integer (e.g. the value corresponding to what day of the year it is - always <= 366) then use that. If you aren't sure (e.g. the record ID of a table in a database that can have any number of rows) then use Int32 or Int64 depending on your judgment.
Others can probably give you a better sense of the performance advantages depending on what programming language you are using, but the smaller types use less memory and hence are 'better' to use if you don't need larger ones.
Just for reference, a 16-bit integer means there are 2^16 possible values - generally represented as between 0 and 65,535. 32-bit values range from 0 to 2^32 - 1, or just over 4.29 billion values.
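To make those ranges concrete, here is a small sketch that computes the largest unsigned value for a given width. max_unsigned is a hypothetical helper written for this example; in real code the constants UINT16_MAX, UINT32_MAX, and UINT64_MAX from <stdint.h> already provide these values.

```c
#include <stdint.h>

/* Largest value representable in an unsigned integer of the given width
   (1..64 bits). The 64-bit case is handled separately because shifting a
   64-bit value by 64 is undefined in C. */
static uint64_t max_unsigned(unsigned bits)
{
    if (bits >= 64)
        return UINT64_MAX;
    return (UINT64_C(1) << bits) - 1;
}
```

Signed types of the same width cover the same number of values, split between negative and non-negative: for example, a signed 16-bit integer runs from -32,768 to 32,767.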
The question "On 32-bit CPUs, is an 'integer' type more efficient than a 'short' type?" may add some more good information.
It depends on whether speed or storage should be optimized. If you are interested in speed and you are running SQL Server in 64 bit mode then 64 bit keys are what you need. A 64 bit processor running in 64 bit mode, is optimized to use 64 bit numbers and addresses. Likewise, a 64 bit processor running in 32 bit mode is optimized to use 32 bit numbers and addresses. For example, in 64 bit mode, all pushes and pops onto the stack are 8 bytes etc. Also fetch from cache and memory are again optimized for 64 bit numbers and addresses. The processor, running in 64 bit mode, may need more machine cycles to handle a 32 bit number just like a processor, running in 32 bit mode needs more machine cycles to handle a 16 bit number. The increases in processing time come for many reasons, but just think about the example of memory alignment: The 32 bit number may not be aligned on a 64 bit integral boundary which means loading the number requires shifting and masking the number after loading it into a register. At the very least, every 32 bit number must be masked before each operation. We are talking at least halving the processor's effective speed while handling 32 or 16 bit integers in 64 bit mode.
To provide a simple explanation for novice programmers: a bit is either a 0 or a 1.
A 16-bit int is an integer represented by a string of 16 bits (16 0's and 1's).
A 32-bit int is an integer represented by a string of 32 bits (32 0's and 1's).
A 64-bit int is an integer represented by a string of 64 bits (64 0's and 1's).
Examples to drive those concepts home:
an example of a 16-bit integer would be 0000000000000110 which equals the int 6
an example of a 32-bit integer would be 00000000000000000100001000100110 which equals the int 16934.
an example of a 64-bit integer would be 0000100010000000010000100010011000000000000000000100001000100110 which equals the int 612562280298594854.
You can represent a larger number of integers with 64 bits than you can 32 bits than you can 16 bits. So the benefit of using fewer bits is you save space on the machine. The benefit of using more bits is you can represent more integers.

One memory location in a computer stores how much data?

Assume 32 Bit OS.
One memory location in a computer stores how much data?
What's the basic unit of memory storage in a computer?
For example, to store an integer, what memory addresses will be required?
If the basic unit is a BYTE, the integer requires 4 bytes.
So if I need to store an integer and start by putting the first byte in memory location
0001, will my integer end at memory location 0003?
Please correct me if I am wrong.
Most commonly, modern systems are what you call "byte-accessible".
This means:
One memory location stores 1 byte (8 bits).
The basic storage unit for memory is 1 byte.
If you need to store 4 bytes, and place the first byte at 0001, the last byte will be at 0004. That's one byte at each of 0001, 0002, 0003, and 0004.
Keep in mind while systems have different CPU word sizes (a 32-bit system has a 32-bit or 4-byte word), memory is usually addressed by byte. The CPU's registers used in arithmetic are 4 bytes, but the "memory" programmers use for data storage is addressed in bytes.
On x86 systems, many memory-accessing instructions require values in memory to be "aligned" to addresses evenly divisible by the word size. e.g. 0x???0, 0x???4, 0x???8, 0x???C. So, storing an int at 0001 won't happen on most systems. Non-numeric data types can usually be found at any address.
See Wikipedia: Alignment, Word (Computing), Memory Address.
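The address arithmetic in this answer is easy to check in code. The following is a sketch with made-up helper names (last_byte_address, is_aligned); addresses are treated as plain numbers, so nothing here touches real memory.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Address of the last byte of an object that starts at first and
   occupies size bytes (size must be at least 1). */
static uintptr_t last_byte_address(uintptr_t first, size_t size)
{
    return first + size - 1;
}

/* True if addr is evenly divisible by alignment (alignment must be nonzero). */
static bool is_aligned(uintptr_t addr, size_t alignment)
{
    return addr % alignment == 0;
}
```

A 4-byte value starting at 0x0001 ends at 0x0004, and 0x0001 is not 4-byte aligned, while 0x0004, 0x0008, and 0x000C are.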
One memory location in a computer stores how much data?
It depends on the computer. A memory location means a part of memory that the CPU can address directly.
Whats the basic unit of memory storage in a computer?
It is the Bit, and then the Byte, but different CPUs are more comfortable addressing memory in words of particular sizes.
For Example to a store a integer what will be the memory addresses required? If basic unit is BYTE the integer requires 4 bytes.
In mathematics, the integer numbers are infinite, so infinite memory should be required to represent all/any of them. The choice made by a computer architecture about how much memory should be used to represent an integer is arbitrary. In the end, the logic about how integers are represented and manipulated is in software, even if it is embedded in the firmware. The programming language Python has an unbounded representation for integers (but please don't try a googol on it).
In the end, all computer architectures somehow allow addressing down to the Byte or Bit level, but they work best with addresses at their word size, which generally matches the bit-size of the CPU registers.
It is not about the amount of data, or the size of integers, but about the number of memory addresses the computer can use.
There are 4GiB addresses (for bytes) in 32 bits. To manage a cluster of machines with more than 4GiB of RAM, each system must manage larger addresses.
Again, it is all about the addressable memory space, and not about the size of integers. There were 64 bit integers even when CPUs preferred 8bit word addressing.
Depends on the architecture. 32-bits for 32-bits. 64-bits for 64-bits.
Usually it's called a "word"
Most values need to be aligned, so the addresses end with 0 4 8 or C

Why is the smallest value that can be stored a Byte (8 bits) and not a Bit (1 bit)?

Why is the smallest value that can be stored in memory a Byte (8 bits) and not a Bit (1 bit)?
Even booleans are stored as Bytes. Will we ever bump the smallest unit to 32 or 64 bits, like the registers on the CPU?
EDIT: To clarify, as many answers seemed confused about the nature of the question: this question is about why a byte isn't 7 bits, 1 bit, 32 bits, etc. (not why smaller primitives must fit within the hardware's byte at minimum). Is the 8-bit byte simply historical, given that some hardware has 10-bit bytes, for example? Or is there a mathematical reason 8 bits is ideal versus, say, 10 bits for general processing?
The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays data is most often aligned to a dword (32-bit) boundary anyway. Moreover, some hardware (ARM, for example) can't access misaligned multibyte variables, i.e. a 16-bit word can't "cross" a dword boundary; an exception will be raised.
Because computers address memory at the byte level, so anything smaller than a byte is not addressable.
The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
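For reference, a C bitfield looks like the sketch below. The struct and field names are invented for illustration. How the fields are packed, and the resulting sizeof, are implementation-defined, so treat the layout as an assumption to verify on your own compiler.

```c
/* Three small fields sharing one storage unit: 1 + 3 + 4 = 8 bits of
   payload. The compiler still reads and writes whole bytes (or larger
   units) and does the shifting and masking for you. */
struct flags {
    unsigned ready   : 1;   /* holds 0..1  */
    unsigned mode    : 3;   /* holds 0..7  */
    unsigned retries : 4;   /* holds 0..15 */
};
```

Note that you cannot take the address of a bitfield member, which is another consequence of memory being addressable only down to the byte.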
Some older exotic architectures actually did have a different "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.
Assume that we have a native language that consists of 2 characters, such as a and b.
To distinguish two characters we need at least 1 bit, for example 0 to represent the character a and 1 to represent the character b.
So if we count the number of letters, special characters, and symbols, there are 128 characters, and to distinguish one character from another you need log2(128) = 7 bits, plus an 8th bit for transmission.
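The counting argument above can be written as a small function. This is only a sketch, and bits_needed is a hypothetical name: it returns the smallest number of bits b such that 2^b is at least the number of symbols.

```c
/* Smallest number of bits able to give each of symbols distinct
   characters its own code. The bits < 64 check keeps the shift defined. */
static unsigned bits_needed(unsigned long long symbols)
{
    unsigned bits = 0;
    while (bits < 64 && (1ULL << bits) < symbols)
        bits++;
    return bits;
}
```

For 2 symbols this gives 1 bit and for 128 symbols it gives 7 bits, as in the answer; 129 symbols would already need 8.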

Purpose of memory alignment

Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
};
On a 32-bit processor it would most likely be laid out with c at 0x0000 followed by three padding bytes, i at 0x0004, and s at 0x0008 followed by two padding bytes.
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might have c at 0x0000, i at 0x0001, and s at 0x0005, with no padding.
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
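You can ask the compiler where it actually places the members using offsetof. The sketch below assumes a typical platform with a 4-byte int and a 2-byte short; the packed variant uses __attribute__((packed)), which is a GCC/Clang extension rather than standard C, and the name mystruct_packed is made up for this example.

```c
#include <stddef.h>

/* Naturally aligned layout: the compiler inserts padding as needed. */
struct mystruct {
    char  c;   /* one byte   */
    int   i;   /* four bytes */
    short s;   /* two bytes  */
};

/* Packed layout (GCC/Clang extension): no padding between members. */
struct __attribute__((packed)) mystruct_packed {
    char  c;
    int   i;
    short s;
};
```

Under those assumptions, offsetof(struct mystruct, i) is 4 and offsetof(struct mystruct, s) is 8, with a total size of 12, while the packed version puts i at offset 1 and s at offset 5, with a total size of 7. Those are the addresses used in the walkthrough above.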
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
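One way to picture the "2 LSBs are always 0" idea is to split a byte address into a word index and a byte lane. The helper names below are invented for this sketch, and a 4-byte word is assumed.

```c
#include <stdint.h>

/* Which 4-byte word a byte address falls in (drops the 2 low bits). */
static uint32_t word_index(uint32_t byte_addr) { return byte_addr >> 2; }

/* Which byte within that word (the 2 low bits). */
static uint32_t byte_lane(uint32_t byte_addr)  { return byte_addr & 3u; }

/* Byte address of the start of a word; the 2 low bits are always 00. */
static uint32_t word_to_byte_addr(uint32_t index) { return index << 2; }
```

If the hardware only ever sends the word index, a 30-bit index is enough to cover a 32-bit byte address space, which is where the two saved address lines come from.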
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
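Whether an access touches one or two cache lines is simple address arithmetic. The function below is a sketch with a hypothetical name, cache_lines_touched; the 64-byte line size used in the note afterwards is the example figure mentioned above, not something true of every CPU.

```c
#include <stdint.h>
#include <stddef.h>

/* Number of cache lines covered by an access of size bytes starting at
   addr, for a given line size in bytes (line_size must be nonzero). */
static size_t cache_lines_touched(uintptr_t addr, size_t size, size_t line_size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / line_size;              /* line holding the first byte */
    uintptr_t last  = (addr + size - 1) / line_size; /* line holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With 64-byte lines, an 8-byte access at 0x1038 stays within one line, while the same access at 0x103C straddles two.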
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line. Because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
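The mask-and-shift work described here can be modelled in software. This sketch uses 32-bit chunks rather than 64-bit ones for brevity, and it assumes little-endian byte numbering within each word (byte 0 is the least significant byte). read_u32_unaligned is a hypothetical name, and the function models the idea only, not any specific CPU.

```c
#include <stdint.h>
#include <stddef.h>

/* Read 32 bits starting at byte_addr from memory that can only be fetched
   as aligned 32-bit words. An aligned request needs one fetch; a request
   that straddles two words needs two fetches plus shifting and merging.
   The caller must ensure words[index + 1] exists for unaligned requests. */
static uint32_t read_u32_unaligned(const uint32_t *words, size_t byte_addr)
{
    size_t   index = byte_addr / 4;                   /* first word touched    */
    unsigned shift = (unsigned)(byte_addr % 4) * 8;   /* bit offset within it  */

    uint32_t low = words[index];                      /* first memory access   */
    if (shift == 0)
        return low;                                   /* aligned: done         */

    uint32_t high = words[index + 1];                 /* second memory access  */
    return (low >> shift) | (high << (32 - shift));   /* merge the two pieces  */
}
```

The early return for the aligned case also avoids shifting a 32-bit value by 32, which is undefined in C.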
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects that were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition, here's a link to a GitHub gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Resources