Memory Locations of Variables when Using IA-32 Assembly Language - memory

Quick question on memory locations in IA-32 assembly language that i cannot seem to find the answer for anywhere else.
On IA-32 each memory address is 4 bytes long (e.g. 0x0040120e). Each of these addresses points to a 1 byte value (or in the case of a larger value, the first byte of it). Now look at these two simple IA-32 assembly language statements:
var1 db 2
var2 db 3
This will place var1 and var2 in adjacent memory cells (let's say 0x0040120e and 0f). Now I realize that the define directive db allocates 1 byte to the value. But, in the case above I have two values (2 and 3) that in fact only requires two bits each, to be stored.
Questions:
When using the db directive, do these two values still consume a full byte, even though they are smaller than 1 byte?
Is using a full byte for values that could get away with less, still the common way to go (as we have so much memory that we don't care)?
Does integers 0 to 255 then generally take up 1 byte and integers 256 to (2^16 - 1) take up 2 bytes (a word), etc.?
Thank you,
Magnus
EDIT 1: Made questions more clear (apologies for the back and forth)
EDIT 2: Added a structured reply below, based on other posters' input

yes. the B in DB is for Byte.
You could use a nibble for each, like so:
combined db 0x23
but you'd have to
a) shift the result for 4 bits right if you need the "2".
b) mask the leftmost 4 bits if you need the "3".
Hardly worth the effort these days ;-)

Yes, since the architecture is byte-addressable and cannot address anything smaller than a byte.
This means that data requiring less than one byte will need to share its address with other data.
In practice this means that you're going to have to know which bits in the pointed-out byte are used for this particular value.
For hardware registers this sort of mapping is very common.
EDIT: Ah, you seem to mean "values of the same variable" when you said "2 and 3". I thought you meant 2-bit and 3-bit values. You need to decide how many bits are needed at most for a particular variable, for all the values you need that variable to be able to store. There are variable-length encodings for integers of course, but that's generally rarely used in assembly and not what you'd typically use for some general-purpose variable.
You generally should expect to reserve all bits required for all values that a variable need to hold, up front. Otherwise, if you're worried about "wasting memory", you would need to move all other variables as soon as you get some "vacant bits" somewhere. That would end up costing fantastically much. Also, knowing the size of a variable is constant makes it possible to generate (or write) the proper code to handle it, otherwise you would of course also need to explicitly store somewhere "the size of the value held in variable x is now y bits". That becomes extremely painful very very quickly.

My initial question was a bit unstructured, so for the benefit of other searchers stopping by here I will use the answers received from #unwind and #geert3 to create a structured response. Again, this was my fault due to the initial poor structuring and creds for the answers goes to #unwind and #geert3.
When using the db directive you allocate 1 byte to the variable, and even if the variable takes up less space than 1 byte, it will still consume that full 1-byte address spot. As one might guess, that wastes a few bits of memory, but that is okay as you have enough memory and not too bothered about wasting a couple of bits. The reason you want to use the full 1-byte memory location is that it is easier to reference the variable when it is alone in the address slot (see #geert3's note on how to access it if you use less than a byte), and additionally, in case you want to reuse the variable later, it is great to know you have space for any number up to 255.
Yes, see answer to 1
Yes, you would normally allocate multiples of a byte to a variable, in a byte-addressable system

Related

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua and i tried several things but i see myself confronted with the following problem:
The tables of Lua are to small for this scenario. If I just want to create a table with all numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big I admit)...
max = 600851475143
numbers = {}
for i=1, max do
table.insert(numbers, i)
end
If I execute this script on my Windows machine there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. With Lua 5.3 running on my Linux machine I tried that too, error was just killed. So it is pretty obvious that lua can´t handle the amount of entries.
I don´t really know whether it is just impossible to store that amount of entries in a lua table or there is a simple solution for this (tried it by using a long string aswell...)? And what exactly is the largest amount of entries in a Lua table?
Update: And would it be possible to manually allocate somehow more memory for the table?
Update 2 (Solution for second question): The second question is an easy one, I just tested it by running every number until the program breaks: 33.554.432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? Because 64 Bit per number * 2^25 = 2147483648 Bits which are exactly 2 GB. This seems to be the standard memory allocation size for the Lua for Windows 32 Bit compiler.
P.S. You may have noticed that this number is from the Euler Project Problem 3. Yes I am trying to accomplish that. Please don´t give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 73 gigabytes, which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
Lua uses double precision floats to represent numbers. That's 64bits per number.
600851475143 numbers result in almost 4.5 Terabytes of memory.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you would have read the linked Wikipedia article carefully you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
...

Does the order or syntax of allocate statement affect performance? (Fortran)

Because of having performance issues when passing a code from static to dynamic allocation, I started to wander about how memory allocation is managed in a Fortran code.
Specifically, in this question, I wander if the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wander if there is a way to control that allocation (for instance, allocating a vector for all space and using pointers to address the space allocated as multiple variables).
Finally, I notice now that I don't even know one thing: an allocate statement guarantees that at least a single vector occupies a contiguous space in memory (or the best it can?).
From the language standard point of view both ways how to write them are possible. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way how to control the allocation? Yes, you can overload the malloc by your own allocator which does some clever things. It may be used to have always memory aligned to 32-bytes or similar purposes (example). Whether you will improve performance of your code by allocating things somehow close to each other is questionable, but you can have a try. (Of course this is completely compiler-dependent thing, a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this will only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and memory locality given by arrays being next to each other. I would guess there is no guarantee either allocate statement ensures both arrays next to each other in memory. This is based on allocating arrays in a type, which in the fortran 2003 MAY 2004 WORKING DRAFT J3/04-007 standard
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
SEQUENCE
double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.

Tracking address when writing to flash

My system needs to store data in an EEPROM flash. Strings of bytes will be written to the EEPROM one at a time, not continuously at once. The length of strings may vary. I want the strings to be saved in order without wasting any space by continuing from the last write address. For example, if the first string of bytes was written at address 0x00~0x08, then I want the second string of bytes to be written starting at address 0x09.
How can it be achieved? I found that some EEPROM's write command does not require the address to be specified and just continues from lastly written point. But EEPROM I am using does not support that. (I am using Spansion's S25FL1-K). I thought about allocating part of memory to track the address and storing the address every time I write, but that might wear out flash faster. What is widely used method to handle such case?
Thanks.
EDIT:
What I am asking is how to track/save the address in a non-volatile way so that when next write happens, I know what address to start.
I never worked with this particular flash, but I've implemented something similar. Unfortunately, without knowing your constrains / priorities (memory or CPU efficient, how often write happens etc.) it is impossible to give a definite answer. Here are some techniques that you may want to consider. I don't know if they are widely used though.
Option 1: Write X bytes containing string length before the string. Then on initialization you could parse your flash: read the length n, jump n bytes forward; read the next byte. If it's empty (all ones for your flash according to the datasheet) then you got your first empty bit. Otherwise you've just read the length of the next string, so do the same over again.
This method allows you to quickly search for the last used sector, since the first byte of the used sector is guaranteed to have a value. The flip side here is overhead of extra n bytes (depending on the max string length) each time you write a string, and having to parse it to get the value (although this can only be done once on boot).
Option 2: Instead of prepending the size, append the unique "end-of-string" sequence, and then parse on boot for the last sequence before ones that represent empty flash.
Disadvantage here is longer parse, but you possibly could get away with just 1 byte-long overhead for each string.
Option 3 would be just what you already thought of: allocating a separate sector that would contain the value you need. To reduce flash wear you could also write these values back-to-back and search for the last one each time you boot. Also, you might consider the expected lifetime of the device that you program versus 100,000 erases that your flash can sustain (again according to the datasheet) - is wearing even a problem? That of course depends on how often data will be saved.
Hope that helps.

Cache and memory

First of all, this is not language tag spam, but this question not specific to one language in particulary and I think that this stackexchange site is the most appropriated for my question.
I'm working on cache and memory, trying to understand how it works.
What I don't understand is this sentence (in bold, not in the picture) :
In the MIPS architecture, since words are aligned to multiples of four
bytes, the least significant two bits are ignored when selecting a
word in the block.
So let's say I have this two adresses :
[1........0]10
[1........0]00
^
|
same 30 bits for boths [31-12] for the tag and [11-2] for the index (see figure below)
As I understand the first one will result in a MISS (I assume that the initial cache is empty). So one slot in the cache will be filled with the data located in this memory adress.
Now, we took the second one, since it has the same 30 bits, it will result in a HIT in the cache because we access the same slot (because of the same 10 bits) and the 20 bits of the adress are equals to the 20 bits stored in the Tag field.
So in result, we'll have the data located at the memory [1........0]10 and not [1........0]00 which is wrong !
So I assume this has to do with the sentence I quote above. Can anyone explain me why my reasoning is wrong ?
The cache in figure :
In the MIPS architecture, since words are aligned to multiples of four
bytes, the least significant two bits are ignored when selecting a
word in the block.
It just mean that in memory, my words are aligned like that :
So when selecting a word, I shouldn't care about the two last bits, because I'll load a word.
This two last bits will be useful for the processor when a load byte (lb) instruction will be performed, to correctly shift the data to get the one at the correct byte position.

Lookup table size reduction

I have an application in which I have to store a couple of millions of integers, I have to store them in a Look up table, obviously I cannot store such amount of data in memory and in my requirements I am very limited I have to store the data in an embebedded system so I am very limited in the space, so I would like to ask you about recommended methods that I can use for the reduction of the look up table. I cannot use function approximation such as neural networks, the values needs to be in a table. The range of the integers is not known at the moment. When I say integers I mean a 32 bit value.
Basically the idea is use some copmpression method to reduce the amount of memory but without losing many precision. This thing needs to run in hardware so the computation overhead cannot be very high.
In my algorithm I have to access to one value of the table do some operations with it and after update the value. In the end what I should have is a function which I pass an index to it and then I get a value, and after I have to use another function to write a value in the table.
I found one called tile coding , this one is based on several look up tables, does anyone know any other method?.
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to reduce (throw away) some of the data (detail), correct? I think you are looking for a hash, which can be an artform in itself. For example say you have 32 bit values, one hash would be to take the 4 bytes and xor them together, this would result in a single 8 bit value, reducing your storage by a factor of 4 but also reducing the real value of original data. Typically you could/would go further and perhaps and only use a few of those 8 bits , say the lower 4 and reduce the value further.
I think my real problem is either you need the data or you dont, if you need the data you need to compress it or find more memory to store it. If you dont, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4Mb of memory for a big table of all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
If you are merely looking for the presence of the number in question a bloom filter, might be what you are looking for. Honestly though your question is fairly vague and confusing. It would help to explain what Q values are, and what you do with them once you find them in the table.
If your set of integers is homongenous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer, n, because its set is homogenous can be the hash. Assume you have 0x10000 (16k) buckets. Each bucket index, iBucket = n&FFFF. Each item in a bucket need only store 16 bits, since the first 16 bits are the bucket index. The other thing you have to do to keep the data small is to put the count of items in the bucket, and use an array to hold the items in the bucket. Using a linked list will be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assuming a bucket is a pointer to the array and a count. On a 32 bit system, this is 64 bits max. If the number of ints was small enough we might be able to do some fancy things and use 32 bits for a bucket. 16k * 8 bytes = 524k, 2 million shorts = 4mb. So this gets you a method to lookup the ints and about 40% compression.

Resources