Would there be a practical application for a more memory efficient boolean? - memory

I've noticed that booleans occupy a whole byte, despite only needing 1 bit. I was wondering whether we could have something like
struct smartbool{char data;}
, which would store 8 booleans at once.
I am aware that it would take more time to retrieve data, although would the tradeoff be a practical application in some scenarios?
Am I missing something about the memory usage of booleans?

Normally variables are aligned on word boundaries, memory use is balanced against efficiency of access. For one-off boolean variables it may not make sense to store them in a denser form.
If you do need a bunch of booleans you can use things like this BitSet data structure: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/BitSet.html.
There is a type of database index that stores booleans efficiently:
https://en.wikipedia.org/wiki/Bitmap_index. The less space an index takes up the easier it is to keep in memory.
There are already widely used data types that support multiple booleans, they are called integers. you can store and retrieve multiple booleans in an integral type, using bitwise operations, screening out the bits you don't care about with a pattern of bits called a bitmask.

This sort of "packing" is certainly possible and sometimes useful, as a memory-saving optimization. Many languages and libraries provide a way to make it convenient, e.g. std::vector<bool> in C++ is meant to be implemented this way.
However, it should be done only when the programmer knows it will happen and specifically wants it. There is a tradeoff in speed: if bits are used, then setting / clearing / testing a specific bool requires first computing a mask with an appropriate shift, and setting or clearing it now requires a read-modify-write instead of just a write.
And there is a more serious issue in multithreaded programs. Languages like C++ promise that different threads can freely modify different objects, including different elements of the same array, without needing synchronization or causing a data race. For instance, if we have
bool a, b; // not atomic
void thread1() { /* reads and writes a */ }
void thread2() { /* reads and writes b */ }
then this is supposed to work fine. But if the compiler made a and b two different bits in the same byte, concurrent accesses to them would be a data race on that byte, and could cause incorrect behavior (e.g. if the read-modify-writes being done by the two threads were interleaved). The only way to make it safe would be to require that both threads use atomic operations for all their accesses, which are typically many times slower. And if the compiler could freely pack bools in this way, then every operation on a potentially shared bool would have to be made atomic, throughout the entire program. That would be prohibitively expensive.
So this is fine if the programmer wants to pack bools to save memory, is willing to take the hit to speed, and can guarantee that they won't be accessed concurrently. But they should be aware that it's happening, and have control over whether it does.
(Indeed, some people feel that having C++ provide this with vector<bool> was a mistake, since programmers have to know that it is a special exception to the otherwise general rule that vector<T> behaves like an array of T, and different elements of the vector can safely be accessed concurrently. Perhaps they should have left vector<bool> to work in the naive way, and given a different name to the packed version, similar to std::bitset.)

Related

Using something other than a Swift array for mutable fixed-size thread-safe data passed to OpenGL buffer

I am trying to squeeze every bit of efficiency out of my application I am working on.
I have a couple arrays that follow the following conditions:
They are NEVER appended to, I always calculate the index myself
The are allocated once and never change size
It would be nice if they were thread safe as long as it doesn't cost performance
Some hold primitives like floats, or unsigned ints. One of them does hold a class.
Most of these arrays at some point are passed into a glBuffer
Never cleared just overwritten
Some of the arrays individual elements are changed entirely by = others are changed by +=
I currently am using swift native arrays and am allocating them like var arr = [GLfloat](count: 999, repeatedValue: 0) however I have been reading a lot of documentation and it sounds like Swift arrays are much more abstract then a traditional C-style array. I am not even sure if they are allocated in a block or more like a linked list with bits and pieces thrown all over the place. I believe by doing the code above you cause it to allocate in a continuous block but i'm not sure.
I worry that the abstract nature of Swift arrays is something that is wasting a lot of precious processing time. As you can see by my above conditions I dont need any of the fancy appending, or safety features of Swift arrays. I just need it simple and fast.
My question is: In this scenario should I be using some other form of array? NSArray, somehow get a C-style array going, create my own data type?
Im looking into thread safety, would a different array type that was more thread safe such as NSArray be any slower?
Note that your requirements are contradictory, particularly #2 and #7. You can't operate on them with += and also say they will never change size. "I always calculate the index myself" also doesn't make sense. What else would calculate it? The requirements for things you will hand to glBuffer are radically different than the requirements for things that will hold objects.
If you construct the Array the way you say, you'll get contiguous memory. If you want to be absolutely certain that you have contiguous memory, use a ContiguousArray (but in the vast majority of cases this will give you little to no benefit while costing you complexity; there appear to be some corner cases in the current compiler that give a small advantage to ContinguousArray, but you must benchmark before assuming that's true). It's not clear what kind of "abstractness" you have in mind, but there's no secrets about how Array works. All of stdlib is open source. Go look and see if it does things you want to avoid.
For certain kinds of operations, it is possible for other types of data structures to be faster. For instance, there are cases where a dispatch_data is better and cases where a regular Data would be better and cases where you should use a ManagedBuffer to gain more control. But in general, unless you deeply know what you're doing, you can easily make things dramatically worse. There is no "is always faster" data structure that works correctly for all the kinds of uses you describe. If there were, that would just be the implementation of Array.
None of this makes sense to pursue until you've built some code and started profiling it in optimized builds to understand what's going on. It is very likely that different uses would be optimized by different kinds of data structures.
It's very strange that you ask whether you should use NSArray, since that would be wildly (orders of magnitude) slower than Array for dealing with very large collections of numbers. You definitely need to experiment with these types a bit to get a sense of their characteristics. NSArray is brilliant and extremely fast for certain problems, but not for that one.
But again, write a little code. Profile it. Look at the generated assembler. See what's happening. Watch particularly for any undesired copying or retain counting. If you see that in a specific case, then you have something to think about changing data structures over. But there's no "use this to go fast." All the trade-offs to achieve that in the general case are already in Array.

boost lockfree spsc_queue cache memory access

I need to be extremely concerned with speed/latency in my current multi-threaded project.
Cache access is something I'm trying to understand better. And I'm not clear on how lock-free queues (such as the boost::lockfree::spsc_queue) access/use memory on a cache level.
I've seen queues used where the pointer of a large object that needs to be operated on by the consumer core is pushed into the queue.
If the consumer core pops an element from the queue, I presume that means the element (a pointer in this case) is already loaded into the consumer core's L2 and L1 cache. But to access the element, does it not need to access the pointer itself by finding and loading the element either from either the L3 cache or across the interconnect (if the other thread is on a different cpu socket)? If so, would it maybe be better to simply send a copy of the object that could be disposed of by the consumer?
Thank you.
C++ principally a pay-for-what-you-need eco-system.
Any regular queue will let you choose the storage semantics (by value or by reference).
However, this time you ordered something special: you ordered a lock free queue.
In order to be lock free, it must be able to perform all the observable modifying operations as atomic operations. This naturally restricts the types that can be used in these operations directly.
You might doubt whether it's even possible to have a value-type that exceeds the system's native register size (say, int64_t).
Good question.
Enter Ringbuffers
Indeed, any node based container would just require pointer swaps for all modifying operations, which is trivially made atomic on all modern architectures.
But does anything that involves copying multiple distinct memory areas, in non-atomic sequence, really pose an unsolvable problem?
No. Imagine a flat array of POD data items. Now, if you treat the array as a circular buffer, one would just have to maintain the index of the buffer front and end positions atomically. The container could, at leisure update in internal 'dirty front index' while it copies ahead of the external front. (The copy can use relaxed memory ordering). Only as soon as the whole copy is known to have completed, the external front index is updated. This update needs to be in acq_rel/cst memory order[1].
As long as the container is able to guard the invariant that the front never fully wraps around and reaches back, this is a sweet deal. I think this idea was popularized in the Disruptor Library (of LMAX fame). You get mechanical resonance from
linear memory access patterns while reading/writing
even better if you can make the record size aligned with (a multiple) physical cache lines
all the data is local unless the POD contains raw references outside that record
How Does Boost's spsc_queue Actually Do This?
Yes, spqc_queue stores the raw element values in a contiguous aligned block of memory: (e.g. from compile_time_sized_ringbuffer which underlies spsc_queue with statically supplied maximum capacity:)
typedef typename boost::aligned_storage<max_size * sizeof(T),
boost::alignment_of<T>::value
>::type storage_type;
storage_type storage_;
T * data()
{
return static_cast<T*>(storage_.address());
}
(The element type T need not even be POD, but it needs to be both default-constructible and copyable).
Yes, the read and write pointers are atomic integral values. Note that the boost devs have taken care to apply enough padding to avoid False Sharing on the cache line for the reading/writing indices: (from ringbuffer_base):
static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t);
atomic<size_t> write_index_;
char padding1[padding_size]; /* force read_index and write_index to different cache lines */
atomic<size_t> read_index_;
In fact, as you can see, there are only the "internal" index on either read or write side. This is possible because there's only one writing thread and also only one reading thread, which means that there could only be more space at the end of write operation than anticipated.
Several other optimizations are present:
branch prediction hints for platforms that support it (unlikely())
it's possible to push/pop a range of elements at once. This should improve throughput in case you need to siphon from one buffer/ringbuffer into another, especially if the raw element size is not equal to (a whole multiple of) a cacheline
use of std::unitialized_copy where possible
The calling of trivial constructors/destructors will be optimized out at instantiation time
the unitialized_copy will be optimized into memcpy on all major standard library implementations (meaning that e.g. SSE instructions will be employed if your architecture supports it)
All in all, we see a best-in-class possible idea for a ringbuffer
What To Use
Boost has given you all the options. You can elect to make your element type a pointer to your message type. However, as you already raised in your question, this level of indirection reduces locality of reference and might not be optimal.
On the other hand, storing the complete message type in the element type could become expensive if copying is expensive. At the very least try to make the element type fit nicely into a cache line (typically 64 bytes on Intel).
So in practice you might consider storing frequently used data right there in the value, and referencing the less-of-used data using a pointer (the cost of the pointer will be low unless it's traversed).
If you need that "attachment" model, consider using a custom allocator for the referred-to data so you can achieve memory access patterns there too.
Let your profiler guide you.
[1] I suppose say for spsc acq_rel should work, but I'm a bit rusty on the details. As a rule, I make it a point not to write lock-free code myself. I recommend anyone else to follow my example :)

Linked List in OpenCL

I have 1000 float datas in an array. I want to separate into different classes, lets say 4 classes. Their sizes are unpredictable. I could easily hold them in a linked list in a CPU implementation, but in OpenCL kernel, is there an opportunity like that? In my mind there are 3 solution to this problem.
First, arrays with length 1000 constructed in number of classes, which is memory costly.
Second, I allocate an array with length 1000 and separate them into parts. However, I may transport the values from and index into different index, becuase I don't know the size of each classes and they may exceed the size which I provided for each.
Third, and better in my opinion, I get two different array with same length. One of them stores data, the other one stores pointers. For example, in i-th index of data array, the value is stored which belongs to 2nd class. Additionally in i-th index of pointer to the next data which belongs to 2nd class. But this is good for just atomic type (like int, float, char etc) linked lists.
I am new in OpenCL. I haven't known lots of features of it yet. If there is a better way, please don't share with me and others.
Using pointers on GPU is usually very bad idea. Major amount of data resides in global memory, and to fetch it quickly the access should be coalesced. Using pointers breaks the access pattern totally, making it essentially random. It's not very good on CPUs too since it cause a lot of cache misses, but CPUs have larger caches and "smarter" internal logic, so it's usually not so important, but sometimes cache-aware memory access pattern can increase CPU application's speed by nearly order of magnitude. On GPUs coalesced global memory access is one of most important optimizations, and pointers can't provide it.
If you are not extremely short on memory, I'd suggest to use first way and preallocate arrays large enough to hold all data. If you are really short on memory, you could use textures to store your data and pointer arrays, but it depends on the algorithm whether it would provide any benefits or not.

Why do we use data structures? (when no dynamic allocation is needed)

I'm pretty sure this is a silly newbie question but I didn't know it so I had to ask...
Why do we use data structures, like Linked List, Binary Search Tree, etc? (when no dynamic allocation is needed)
I mean: wouldn't it be faster if we kept a single variable for a single object? Wouldn't that speed up access time? Eg: BST possibly has to run through some pointers first before it gets to the actual data.
Except for when dynamic allocation is needed, is there a reason to use them?
Eg: using linked list/ BST / std::vector in a situation where a simple (non-dynamic) array could be used.
Each thing you are storing is being kept in it's own variable (or storage location). Data structures apply organization to your data. Imagine if you had 10,000 things you were trying to track. You could store them in 10,000 separate variables. If you did that, then you'd always be limited to 10,000 different things. If you wanted more, you'd have to modify your program and recompile it each time you wanted to increase the number. You might also have to modify the code to change the way in which the calculations are done if the order of the items changes because the new one is introduced in the middle.
Using data structures, from simple arrays to more complex trees, hash tables, or custom data structures, allows your code to both be more organized and extensible. Using an array, which can either be created to hold the required number of elements or extended to hold more after it's first created keeps you from having to rewrite your code each time the number of data items changes. Using an appropriate data structure allows you to design algorithms based on the relationships between the data elements rather than some fixed ordering, giving you more flexibility.
A simple analogy might help to understand. You could, for example, organize all of your important papers by putting each of them into separate filing cabinet. If you did that you'd have to memorize (i.e., hard-code) the cabinet in which each item can be found in order to use them effectively. Alternatively, you could store each in the same filing cabinet (like a generic array). This is better in that they're all in one place, but still not optimum, since you have to search through them all each time you want to find one. Better yet would be to organize them by subject, putting like subjects in the same file folder (separate arrays, different structures). That way you can look for the file folder for the correct subject, then find the item you're looking for in it. Depending on your needs you can use different filing methods (data structures/algorithms) to better organize your information for it's intended use.
I'll also note that there are times when it does make sense to use individual variables for each data item you are using. Frequently there is a mixture of individual variables and more complex structures, using the appropriate method depending on the use of the particular item. For example, you might store the sum of a collection of integers in a variable while the integers themselves are stored in an array. A program would need to be pretty simple though before the introduction of data structures wouldn't be appropriate.
Sorry, but you didn't just find a great new way of doing things ;) There are several huge problems with this approach.
How could this be done without requring programmers to massively (and nontrivially) rewrite tons of code as soon as the number of allowed items changes? Even when you have to fix your data structure sizes at compile time (e.g. arrays in C), you can use a constant. Then, changing a single constant and recompiling is sufficent for changes to that size (if the code was written with this in mind). With your approach, we'd have to type hundreds or even thousands of lines every time some size changes. Not to mention that all this code would be incredibly hard to read, write, maintain and verify. The old truism "more lines of code = more space for bugs" is taken up to eleven in such a setting.
Then there's the fact that the number is almost never set in stone. Even when it is a compile time constant, changes are still likely. Writing hundreds of lines of code for a minor (if it exists at all) performance gain is hardly ever worth it. This goes thrice if you'd have to do the same amount of work again every time you want to change something. Not to mention that it isn't possible at all once there is any remotely dynamic component in the size of the data structures. That is to say, it's very rarely possible.
Also consider the concept of implicit and succinct data structures. If you use a set of hard-coded variables instead of abstracting over the size, you still got a data structure. You merely made it implicit, unrolled the algorithms operating on it, and set its size in stone. Philosophically, you changed nothing.
But surely it has a performance benefit? Well, possible, although it will be tiny. But it isn't guaranteed to be there. You'd save some space on data, but code size would explode. And as everyone informed about inlining should know, small code sizes are very useful for performance to allow the code to be in the cache. Also, argument passing would result in excessive copying unless you'd figure out a trick to derive the location of most variables from a few pointers. Needless to say, this would be nonportable, very tricky to get right even on a single platform, and liable to being broken by any change to the code or the compiler invocation.
Finally, note that a weaker form is sometimes done. The Wikipedia page on implicit and succinct data structures has some examples. On a smaller scale, some data structures store much data in one place, such that it can be accessed with less pointer chasing and is more likely to be in the cache (e.g. cache-aware and cache-oblivious data structures). It's just not viable for 99% of all code and taking it to the extreme adds only a tiny, if any, benefit.
The main benefit to datastructures, in my opinion, is that you are relationally grouping them. For instance, instead of having 10 separate variables of class MyClass, you can have a datastructure that groups them all. This grouping allows for certain operations to be performed because they are structured together.
Not to mention, having datastructures can potentially enforce type security, which is powerful and necessary in many cases.
And last but not least, what would you rather do?
string string1 = "string1";
string string2 = "string2";
string string3 = "string3";
string string4 = "string4";
string string5 = "string5";
Console.WriteLine(string1);
Console.WriteLine(string2);
Console.WriteLine(string3);
Console.WriteLine(string4);
Console.WriteLine(string5);
Or...
List<string> myStringList = new List<string>() { "string1", "string2", "string3", "string4", "string5" };
foreach (string s in myStringList)
Console.WriteLine(s);

Atomically read/write int value w/o additional operation on the int value itself

GCC offers a nice set of built-in functions for atomic operations. And being on MacOS or iOS, even Apple offers a nice set of atomic functions. However, all these functions perform an operation, e.g. an addition/subtraction, a logical operation (AND/OR/XOR) or a compare-and-set/compare-and-swap. What I am looking for is a way to atomically assign/read an int value, like:
int a;
/* ... */
a = someVariable;
That's all. a will be read by another thread and it is only important that a either has its old value or its new value. Unfortunately the C standard does not guarantee that assigning or reading a value is an atomic operation. I remember that I once read somewhere, that writing or reading a value to a variable of type int is guaranteed to be atomic in GCC (regardless the size of int) but I searched everywhere on the GCC homepage and I cannot find this statement any longer (maybe it was removed).
I cannot use sig_atomic_t because sig_atomic_t has no guaranteed size and it might also have a different size than int.
Since only one thread will ever "write" a value to a, while both threads will "read" the current value of a, I don't need to perform the operations themselves in an atomic manner, e.g.:
/* thread 1 */
someVariable = atomicRead(a);
/* Do something with someVariable, non-atomic, when done */
atomicWrite(a, someVariable);
/* thread 2 */
someVariable = atomicRead(a);
/* Do something with someVariable, but never write to a */
If both threads were going to write to a, then all operations would have to be atomic, but that way, this may only waste CPU time; and we are extremely low on CPU resources in our project. So far we use a mutex around read/write operations of a and even though the mutex is held for such a tiny amount of time, this already causes problems (one of the threads is a realtime thread and blocking on a mutex causes it to fail its realtime constraints, which is pretty bad).
Of course I could use a __sync_fetch_and_add to read the variable (and simply add "0" to it, to not modify its value) and for writing use a __sync_val_compare_and_swap for writing it (as I know its old value, so passing that in will make sure the value is always exchanged), but won't this add unnecessary overhead?
A __sync_fetch_and_add with a 0 argument is indeed the best bet if you want your load to be atomic and act as a memory barrier. Similarly, you can use an and with 0 or an or with -1 to store 0 and -1 atomically with a memory barrier. For writing, you can use __sync_test_and_set (actually an xchg operation) if an "acquire" barrier is enough, or if using Clang you can use __sync_swap (which is an xchg operation with a full barrier).
However, in many cases that's overkill and you may prefer to add memory barriers manually. If you do not want the memory barrier, you can use a volatile load to atomically read/write a variable that is aligned and no wider than a word:
#define __sync_access(x) (*(volatile __typeof__(x) *) &(x))
(This macro is an lvalue, so you can also use it for a store like __sync_store(x) = 0). The function implements the same semantics as the C++11 memory_order_consume form, but only under two assumptions:
that your machine has coherent caches; if not, you need a memory barrier or global cache flush before the load (or before the first of a group of load).
that your machine is not a DEC Alpha. The Alpha had very relaxed semantics for reordering memory accesses, so on it you'd need a memory barrier after the load (and after each load in a group of loads). On the Alpha the above macro only provides memory_order_relaxed semantics. BTW, the first versions of the Alpha couldn't even store a byte atomically (only a word, which was 8 bytes).
In either case, the __sync_fetch_and_add would work. As far as I know, no other machine imitated the Alpha so neither assumption should pose problems on current computers.
Volatile, aligned, word sized reads/writes are atomic on most platforms. Checking your assembly would be the best way to find out if this is true on your platform. Atomic registers cannot produce nearly as many interesting wait free structures as the more complicated mechanisms like compare and swap, which is why they are included.
See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5659&rank=3 for the theory.
Regarding synch_fetch_and_add with a 0 argument - This seems like the safest bet. If you're worried about efficiency, profile the code and see if you're meeting your performance targets. You may be falling victim to premature optimization.

Resources