Per-node memory overhead - linked-list

I was learning about the pros and cons of implementing stacks with linked lists when I found a con that says: "The memory cost for each node can be significantly more than the data being stored. E.g., for a 32-bit value such as an integer, the memory overhead can be 7 times larger than the integer itself."
What does this mean?

When you use a general-purpose memory allocator, you don't know how big a block it allocates on each request. Many allocators round the requested size up to some fixed quantity so that each block is aligned to an address divisible by, say, 8 or 16, or even 32. With a step of 32 you always use at least 32 bytes, even if you request only 1 byte. You then get 32 bytes of heap for a 4-byte piece of data, which is 8 times what you really need, so the overhead is 7 times the size of the data.
Often the allocator also adds a 'header' before the block it returns, and the header size equals the allocation size step. With a 16-byte header, your requested size gets rounded up to the nearest multiple of 16 and then incremented by 16 for the header. So for a requested size of 1 through 16 you use 32 bytes, for 17 through 32 you use 48, for 33 through 48 you use 64, and so on.
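To make the arithmetic above concrete, here is a minimal sketch in C. The 16-byte header and 16-byte step are just the numbers from this example; `block_footprint` and `overhead_ratio` are hypothetical helper names, not part of any real allocator, and real allocators differ in their details.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator parameters, taken from the example in the text. */
#define HEADER_SIZE 16u   /* bookkeeping bytes placed before each block */
#define SIZE_STEP   16u   /* payload sizes are rounded up to a multiple of this */

/* Total heap bytes consumed by one allocation of `requested` bytes. */
size_t block_footprint(size_t requested)
{
    assert(requested > 0);
    size_t payload = (requested + SIZE_STEP - 1) / SIZE_STEP * SIZE_STEP;
    return payload + HEADER_SIZE;
}

/* Overhead as a multiple of the useful data: (footprint - requested) / requested. */
double overhead_ratio(size_t requested)
{
    return (double)(block_footprint(requested) - requested) / (double)requested;
}
```

Under these assumptions a 4-byte integer occupies 32 bytes of heap, so `overhead_ratio(4)` comes out as 7, which matches the "7 times larger" figure in the question. A linked-list node holding a pointer as well as the integer would be bigger, but the same rounding applies.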


How to tell how many memory addresses a processor can generate

Let's say a computer can hold a word size of 26 bits, I'm curious to know how many memory addresses can the processor generate?
I'm thinking that the maximum number it can hold would be 2^26 - 1 and can have 2^26 unique memory addresses.
I'm also curious to know that if let's say that each cell in the memory has a size of 12 bits then how many bytes of memory can this processor address?
My understanding is that in most cases a processor can hold up to 32 bits which is 4 bytes and each byte is 8 bits. However, in this case, each byte would be 12 bits and the processor would be able to address 2^26/12 bytes of memory. Is that safe to say?
I agree.  We usually refer to this as the size of the address space.
As for the next question:
These days, the term byte is generally agreed to mean 8 bits, so 12 bits would be 1.5 bytes. It is a matter of terminology, though, which has varied in the past.
So, I would say 2^26 12-bit words can store 2^26 * 1.5 bytes, though those bytes are not individually addressable and would have to be packed and unpacked to access them separately.
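As a quick check of those numbers, here is a small sketch in C. The function names are made up for illustration, and "byte" here means an 8-bit octet.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct addresses an address of `address_bits` bits can name. */
uint64_t address_count(unsigned address_bits)
{
    assert(address_bits < 64);
    return UINT64_C(1) << address_bits;
}

/* Storage capacity in 8-bit bytes when every address names a cell of
   `cell_bits` bits. Assumes the product fits in 64 bits. */
uint64_t capacity_in_octets(unsigned address_bits, unsigned cell_bits)
{
    return address_count(address_bits) * cell_bits / 8;
}
```

With 26 address bits and 12-bit cells this gives 67,108,864 addresses and 100,663,296 bytes (96 MiB) of storage, that is 2^26 * 1.5.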
The DEC PDP-8 computer was a 12 bit computer and word addressable, so there were multiple schemes for storing characters: two 6 bit characters in a 12 bit word, and also 1 & 1/2 8-bit characters in a 12-bit word, so three 8-bit characters in two 12-bit words.
Similar issues occur when storing packed booleans in a memory, where each boolean takes only a single bit, yet the processor can access a minimum of 8 bits at a time, so must extract a single bit from a larger datum.
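The "three 8-bit characters in two 12-bit words" idea can be sketched as follows. The bit layout here is only one possible choice made up for illustration (it is not claimed to be the layout PDP-8 software actually used), and `pack3` / `unpack3` are hypothetical names. Each 12-bit word is held in the low bits of a `uint16_t`.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_MASK 0x0FFFu  /* a 12-bit word stored in the low bits of a uint16_t */

/* Assumed layout: word 0 = char 0 followed by the high nibble of char 1,
   word 1 = the low nibble of char 1 followed by char 2. */
void pack3(const uint8_t chars[3], uint16_t words[2])
{
    words[0] = (uint16_t)(((unsigned)chars[0] << 4) | (chars[1] >> 4));
    words[1] = (uint16_t)((((unsigned)chars[1] & 0x0Fu) << 8) | chars[2]);
}

void unpack3(const uint16_t words[2], uint8_t chars[3])
{
    /* Both inputs must really be 12-bit values. */
    assert((words[0] & ~WORD_MASK) == 0 && (words[1] & ~WORD_MASK) == 0);
    chars[0] = (uint8_t)(words[0] >> 4);
    chars[1] = (uint8_t)(((words[0] & 0x0Fu) << 4) | (words[1] >> 8));
    chars[2] = (uint8_t)(words[1] & 0xFFu);
}
```

The middle character straddles the two words, which is exactly why such bytes are not individually addressable and every access costs shifts and masks.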

Why CPU accesses aligned memory

For the past couple of days I've been reading about how the CPU accesses memory and how it can be slower than desired if the accessed object is spread over different chunks that the CPU accesses.
In very generalized and abstract terms: say I have an address space from 0x0 to 0xF with one-byte cells, and the CPU reads memory in chunks of 4 bytes (that is, it has a four-byte memory access granularity). If I need to read a 4-byte object residing in cells 0x0 - 0x3, the CPU does it in one operation, while if the same object occupies cells 0x1 - 0x4, the CPU needs to perform two read operations (read 0x0 - 0x3 first, then 0x4 - 0x7), shift the bytes, and combine the two parts (or fail, if it cannot do unaligned access). This happens, once again, because the CPU can read memory only in 4-byte chunks (in our abstract case). Let's also assume that the CPU makes these reads inside one cache line and there is no need to change the contents of the cache between reads.
So, in this case, each chunk the CPU can read begins at a memory cell whose address is a multiple of 4 (right?). OK, I don't have any questions about why the CPU reads in chunks, but why exactly is the beginning of each chunk aligned in such a way? Referring to the example in the previous paragraph, why exactly can't the CPU read a chunk of 4 bytes starting from 0x1?
As far as I understand, the CPU is well aware that 0x1 exists. So is all the fuss happening because the memory controller cannot access a chunk of memory starting from 0x1? Or is it because a couple of LSBs in a processor word are reserved on some architectures? Or is the fact that they are reserved a consequence of aligned access, and not its cause (this seems like a second question already, but I will leave it in, since as I write this I have a feeling that they are related)?
There are a bunch of answers here touching on this topic (like this and this) and articles online (like this and this), but all of these resources give good explanations of the phenomenon itself and its consequences, and no explanation of why exactly the CPU cannot read a chunk of memory starting "in between" byte boundaries (or maybe I just couldn't see it).
Consider a simple CPU. It has 32 RAM chips. Each chip supplies one bit of memory. The CPU produces one address, passes it to the 32 RAM chips, and 32 bits come back. The first RAM chip holds bit 0 of bytes 0, 4, 8, 12, 16 etc. The second RAM chip holds bit 1 of bytes 0, 4, 8, 12, 16 etc. The ninth RAM chip holds bit 0 of bytes 1, 5, 9, 13, 17 etc.
So you see that the 32 RAM chips between them can produce bits 0 to 7 of bytes 0 to 3, or bytes 4 to 7, or bytes 8 to 11 etc. They are incapable of producing bytes 1 to 4.
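Here is a toy model of that situation in C: the "hardware" can only hand back whole aligned 32-bit words, and an unaligned 4-byte read has to be assembled from two of them. This is a sketch under assumptions of my own (little-endian byte order inside a word, a 16-byte memory, hypothetical function names), not a description of any particular CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_COUNT 4u  /* toy memory: 16 bytes seen as four aligned 32-bit words */

/* The only operation the toy RAM offers: fetch one aligned word by index. */
static uint32_t read_aligned_word(const uint32_t mem[WORD_COUNT], size_t word_index)
{
    assert(word_index < WORD_COUNT);
    return mem[word_index];
}

/* Read 4 bytes starting at any byte address using only aligned fetches.
   Byte k of a word is assumed to live in bits 8k..8k+7 (little-endian).
   The caller must keep byte_addr + 4 within the 16-byte memory.
   *fetches reports how many aligned reads were needed. */
uint32_t read_u32_any(const uint32_t mem[WORD_COUNT], size_t byte_addr, unsigned *fetches)
{
    size_t index = byte_addr / 4;
    unsigned shift = (unsigned)(byte_addr % 4) * 8;
    uint32_t low = read_aligned_word(mem, index);
    *fetches = 1;
    if (shift == 0)
        return low;                       /* aligned: one fetch is enough */
    uint32_t high = read_aligned_word(mem, index + 1);
    *fetches = 2;                         /* unaligned: two fetches, then shift and combine */
    return (low >> shift) | (high << (32 - shift));
}
```

If every byte holds its own address (so word 0 is 0x03020100 and word 1 is 0x07060504), reading at address 0 takes one fetch, while reading at address 1 takes two fetches and yields 0x04030201.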

Structure of a malloc block

Reading here it says malloc can't allocate less than 32 bytes. I have also seen somewhere saying 16 bytes is the minimum.
This diagram shows generally what a malloc block looks like, but it is not detailed enough.
The first link suggests there is an 8-byte minimum required to store the size of the block. Piecing these things together, my guess is:
16 bytes for the size (this, but that would limit block size to 65,535 bytes)
16 bytes for the pointer to the next free block (but that would also limit the number of blocks to 65,535 ~ 4 GB, which I guess makes sense).
That would mean the block structure would be:
[size, pointer, userdata....]
[16b, 16b, 65,535b max]
This would mean malloc can't allocate less than 16 + 16 + 16 = 48 bytes.
Wondering if this is accurate or if there is more to it.

JVM 64-bit different memory usages?

I've done some reading but I'm not entirely sure about one thing, for example how much memory would this use in JVM 64 bit(sorry if stupid question, but I'm a bit confused and don't know much about this):
MyObject[] myArray; - I know an array takes up 24 bytes, but how much will each element in this array take? is every element an object reference, meaning 8 byte per element? If not, how do I know how many bytes each element in this array needs?
Normally, that is when using heap sizes of less than 32 GB, the 64-bit JVM uses compressed oops which store object pointers as a 32-bit integer (scaled by three bits when used, since all objects are aligned to 8 bytes; see the link for details), so each element would actually only use 4 bytes.
If you use more than 32 GB of heap or otherwise turn off compressed oops, however, then each element will indeed use 8 bytes.
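The "scaled by three bits" part is where the 32 GB limit comes from. A rough sketch of the idea in C (hypothetical helper names; HotSpot's real encoding has several modes and is more involved than this):

```c
#include <assert.h>
#include <stdint.h>

#define OOP_SHIFT 3u  /* objects are 8-byte aligned, so the low 3 address bits are always zero */

/* Turn a 64-bit object address into a 32-bit compressed reference. */
uint32_t compress_oop(uint64_t heap_base, uint64_t address)
{
    assert(address >= heap_base);
    uint64_t offset = address - heap_base;
    assert((offset & 7u) == 0);                   /* must be 8-byte aligned */
    assert((offset >> OOP_SHIFT) <= UINT32_MAX);  /* must fit in 32 bits */
    return (uint32_t)(offset >> OOP_SHIFT);
}

/* Recover the full address from a compressed reference. */
uint64_t decompress_oop(uint64_t heap_base, uint32_t oop)
{
    return heap_base + ((uint64_t)oop << OOP_SHIFT);
}

/* Largest heap such references can span: 2^32 slots of 8 bytes each. */
uint64_t max_addressable_bytes(void)
{
    return (UINT64_C(1) << 32) << OOP_SHIFT;
}
```

`max_addressable_bytes()` is 2^35 bytes, i.e. 32 GiB, which is why a larger heap needs full 8-byte references.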
Also, I suspect that your statement on the array header being 24 bytes is wrong. To begin with, when compressing oops, the class reference in the header is also compressed, and the identity-hash-code and array length fields are 32-bit to begin with, so I suspect it is more likely to use 12 bytes. Even when using full-length oops, it should still only take 16 bytes. I can't find any hard source verifying either, however. In general, however, it should be said that Hotspot does not even use a fixed-size object header but one that varies in size depending on various circumstances of the object. This article describes some of those circumstances.
That is on the Hotspot JVM, at least. Since the JLS doesn't specify how much memory a reference occupies, it could, theoretically, be anything on any given JVM, though 8 bytes is, of course, the most likely implementation choice when references are not compressed.
Here is good information on how to calculate the memory usage of a Java array
For example, let's consider a 10x10 int array. Firstly, the "outer" array has its 12-byte object header followed by space for the 10 elements. Those elements are object references to the 10 arrays making up the rows. That comes to 12+4*10=52 bytes, which must then be rounded up to the next multiple of 8, giving 56. Then, each of the 10 rows has its own 12-byte object header, 4*10=40 bytes for the actual row of ints, and again 4 bytes of padding to bring the total for that row to a multiple of 8. So in total, that gives 11*56=616 bytes. That's a bit bigger than if you'd just counted on 10*10*4=400 bytes for the hundred "raw" ints themselves.
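The same calculation, written out as a small C sketch so you can plug in other dimensions. The constants are just the assumptions used in the example above (12-byte header, 4-byte references and ints, 8-byte alignment); actual values depend on the JVM and its settings, and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions copied from the worked example, not guaranteed by any JVM. */
#define HEADER_BYTES 12u
#define REF_BYTES     4u
#define INT_BYTES     4u
#define ALIGNMENT     8u

static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}

/* Bytes used by one array object: header plus elements, padded to the alignment. */
size_t array_bytes(size_t length, size_t element_bytes)
{
    return round_up(HEADER_BYTES + length * element_bytes, ALIGNMENT);
}

/* Bytes used by an int[rows][cols]: one outer array of references plus `rows` int arrays. */
size_t int_matrix_bytes(size_t rows, size_t cols)
{
    return array_bytes(rows, REF_BYTES) + rows * array_bytes(cols, INT_BYTES);
}
```

`int_matrix_bytes(10, 10)` gives 56 + 10 * 56 = 616, the same figure as above.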

Reading a bit from memory

I'm looking into reading single bits from memory (RAM, harddisk). My understanding was, one can not read less than a byte.
However, I read someone saying it can be done in assembly.
I want the bandwidth usage to be as low as possible, and the data to be retrieved is not sequential, so I cannot just read a byte and split it into 8 bits.
I don't think the CPU will read less than the size of a cache line from RAM (64 bytes on recent Intel chips). From disk, the minimum is typically 4 kiB.
Reading a single bit at a time is neither possible nor necessary, since the data bus is much wider than that.
You cannot read less than a byte from any PC or hard disk that I know of. Even if you could, it would be extremely inefficient.
Some machines do memory mapped port io that can read/write less than a byte to the port, but it still shows up when you get it as at least a byte.
Use the bitwise operators to pick off specific bits as in:
unsigned char someByte = 0x3D;   // In binary: 0011 1101
bool flag = (someByte & 1) != 0; // Get the first (lowest) bit: 1
flag = (someByte & 2) != 0;      // Get the second bit: 0
// And so on. The number after the & operator is a power of 2 if you want to isolate one bit.
// You can also pick off several bits like so:
int value = someByte & 3;        // Keep the lower 2 bits (here 1), assuming they are interesting for some reason
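If the bits you need are scattered through a buffer, the same masking idea generalizes to an arbitrary bit index. A minimal sketch, assuming 8-bit bytes and that bit 0 is the lowest bit of the first byte (`get_bit` is a hypothetical helper name):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return bit number `bit_index` of `buf`, counting from the least
   significant bit of buf[0]. The whole containing byte is still read;
   the mask only throws the other 7 bits away. */
bool get_bit(const unsigned char *buf, size_t buf_len, size_t bit_index)
{
    assert(bit_index / 8 < buf_len);
    return ((buf[bit_index / 8] >> (bit_index % 8)) & 1u) != 0;
}
```

This does not reduce memory traffic: the hardware still fetches at least a byte (in practice a whole cache line), and the selection of a single bit happens in a register.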
It used to be, say in the 386/486 days, that a memory chip was one bit wide, 1 meg by 1 bit, but you would have 8 chips or some multiple of that, one for each bit lane on the bus, and you could only read in widths of the bus. Today the memories are a byte wide and you can only read in units of 32 or 64 bits or multiples of those. Even when you read a byte, most designs transfer the full bus width. It adds unnecessary complication and cost to isolate byte lanes on the bus all the way out to the memory, so a byte read looks to most of the system like a 32- or 64-bit read; only as it approaches the edge of the processor (sometimes physical pins, sometimes the edge of the core inside the chip) is the individual byte lane separated out and the other bits discarded. Having the cache on changes the smallest read size from the memory: you will see a burst or block of reads.
It is possible to design a memory system that is 8 bits wide and reads 8 bits at a time, but why would you? Unless it is an 8-bit processor, which probably couldn't take advantage of an 8-bit by 2 gig memory anyway. DRAM is pretty slow, something like 133 MHz (even your 1600 MHz memory only delivers short bursts as you read from slow parts; memory has not gotten faster in over 10 years).
Hard disks are similar but different. I think sectors are the smallest unit, and you have to read or write in those units. So when reading, you have a memory cycle on the processor, no different from going to a memory, and depending on the controller, either before you do the read or as a result of it, a sector is read off the disk into a buffer, not unlike a cache line read. Then your memory cycle to the buffer in the disk controller either causes a bus-width read that the processor divides up or, if the bus adds the complexity to isolate byte lanes, you isolate a byte. But nobody isolates bit lanes. (I say the word nobody and someone will come back with an exception...)
Most of this is well documented and not hard to find. For ARM platforms look for the AMBA and/or AXI specifications, which can be freely downloaded. Documents for any number of bridges, PCIe controllers, and disk controllers are available for PCs and other platforms. It still boils down to an address bus and a data bus (or one outgoing and one incoming data bus) plus some control signals that indicate the access type. Some buses have byte lane enables, which are generally for writes, not reads. If I want to write only a byte to DRAM in a modern 64-bit system, I do have to tell everyone almost all the way out to the DRAM what I want to write. To write a byte on a memory module that must be accessed 64 bits at a time, at a minimum a 64-bit read happens into a temporary place, either the cache or the memory controller; then the byte to be written modifies the specific byte within the 64-bit word; then that 64-bit quantity is, eventually, written back to the memory module itself. You can do this using a combination of the address bits and a few control signals, or you can just provide 8 byte lane enables and ignore the lower address bits. Hard disk, same deal: you have to read a sector, modify one byte, then eventually write the whole sector at once. With flash and EEPROM, you can only write zeros (from the programmer's perspective); you erase to ones (from the programmer's perspective; it is actually a zero in the logic, there is an inversion), and a write has to be a sector at a time. Sectors are typically 64, 128, or 256 bytes.
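The read-modify-write sequence for a single-byte write can be mimicked in software. This is only an illustration of the three steps, with a plain `uint64_t` standing in for one 64-bit location on the memory module; the function name and the lane numbering (lane 0 = lowest 8 bits) are my own assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Write one byte into a memory location that can only be accessed
   64 bits at a time. Returns the word that was written back. */
uint64_t write_byte_rmw(uint64_t *memory_word, unsigned lane, uint8_t value)
{
    assert(lane < 8);
    uint64_t word = *memory_word;                      /* 1. read the full 64 bits */
    uint64_t mask = UINT64_C(0xFF) << (lane * 8);
    word = (word & ~mask) | ((uint64_t)value << (lane * 8)); /* 2. replace just one byte lane */
    *memory_word = word;                               /* 3. write all 64 bits back */
    return word;
}
```

Starting from 0x1122334455667788, writing 0xAA to lane 0 leaves 0x11223344556677AA in memory; the other seven bytes are rewritten with their old values.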
