OpenMP Shared Array Load/Store - memory

I have two questions about WRITE/READ operations on shared arrays.
1) In my program I write a different element of a given shared array at each iteration of an OpenMP-parallelized DO loop. Since each iteration writes a different part of the array and no race condition occurs, I avoided using CRITICAL. In a different subroutine I also READ a different element at each iteration. The results I get appear to be right, but I'm wondering whether this is fine or whether I should enclose the READ/WRITE sections in a CRITICAL block.
2) Generally speaking, can we say that when there is no risk of a race condition, as in the two cases above, CRITICAL or ATOMIC blocks can be omitted?
Thanks in advance
Timmy

1) So long as you are right that different threads write to different array elements, there will be no data races. The straightforward approach is to write (pseudo-)code like this
!$omp parallel do
do i = 1, many
   array(i) = compute(i)   ! compute(i) stands for whatever per-element work you do
end do
!$omp end parallel do
in which each thread gets a different set of i values, and therefore a different set of array elements to write to.
Reading will not cause data races. Ill-disciplined reading may cause excessive cache traffic and hence poor performance, but not data races.
2) Yes, generally speaking, if there are no data races you can avoid using atomic or critical blocks.
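For what it's worth, the same pattern carries over to other languages' parallel loops. Here is a minimal Swift sketch of the same idea using GCD's concurrentPerform, where every iteration writes a distinct element and therefore needs no lock; compute(_:) is a hypothetical per-element function:

import Foundation

// Hypothetical per-element work; stands in for "compute(i)" above.
func compute(_ i: Int) -> Double {
    return Double(i) * 2.0
}

var results = [Double](repeating: 0.0, count: 1_000)
results.withUnsafeMutableBufferPointer { buf in
    // No CRITICAL/ATOMIC equivalent needed: each iteration
    // touches a distinct index, so there is no data race.
    DispatchQueue.concurrentPerform(iterations: buf.count) { i in
        buf[i] = compute(i)
    }
}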

Related

Array of references that share an Arc

This one's kind of an open-ended design question, I'm afraid.
Anyway: I have a big two-dimensional array of stuff. This array is mutable and is accessed by a bunch of threads. For now I've just been dealing with this as an Arc<Mutex<Vec<Vec<--owned stuff-->>>>, which has been fine.
The problem is that stuff is about to grow considerably in size, and I'll want to start holding references rather than complete structures. I could do this by inverting everything and going to Vec<Vec<Arc<Mutex<T>>>>, but I feel like that would be a ton of overhead, especially because each thread would need a complete copy of the grid rather than a single Arc/Mutex.
What I want to do is have this be an array of references, but somehow communicate that the items being referenced all live long enough according to a single top-level Arc or something similar. Is that possible?
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
EDIT: Giving some more specifics on my code (away from home so this is rough):
What I want:
Outer scope initializes a bunch of Ts and somehow collectively ensures they live long enough (that's the hard part)
Outer scope initializes a grid: Something<Vec<Vec<&T>>> that stores references to the Ts
Outer scope creates a bunch of threads and passes grid to them
Threads dive in and out of some sort of (probably RW) lock on grid, reading the Ts and changing the &Ts in the process.
What I have:
Outer thread creates a grid: Arc<RwLock<Vec<Vec<T>>>>
Arc::clone(&grid)s are passed to individual threads
Read-heavy threads mostly share the lock and sometimes kick each other out for the writes.
The only problem with this is that the grid is storing actual Ts, which might be problematically large. (Don't worry too much about the RwLock/thread-exclusivity stuff; I think it's orthogonal to the question unless something about it jumps out at you.)
What I don't want to do:
Top level creates a bunch of Arc<Mutex<T>> for individual T
Top level creates a grid: Vec<Vec<Arc<Mutex<T>>>> and passes it to threads
The problem with that is that I worry about the size of Arc/Mutex on every grid element (I've been going up to 2000x2000 so far and may go larger). Also while the threads would lock each other out less (only if they're actually looking at the same square), they'd have to pick up and drop locks way more as they explore the array, and I think that would be worse than my current RwLock implementation.
Let me start off with your "aside" question, as I feel it's the one that can be answered:
As an aside, is Vec even the correct data type for this? For the grid in particular I really want a large, fixed-size block of memory that will live for the entire length of the program once it's initialized, and has a lot of reference locality (along either dimension.) Is there something else/more specialized I should be using?
The documentation of std::vec::Vec specifies that the layout is essentially a pointer plus size information. That means any Vec<Vec<T>> is a pointer to a densely packed array of pointers to densely packed arrays of Ts. So if "block of memory" means a contiguous block to you, then no, Vec<Vec<T>> cannot give you that. If that is part of your requirements, you'd have to deal with a datatype (let's call it Grid) that is basically a (pointer, n_rows, n_columns) and decide for yourself whether the layout should be row-first or column-first.
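To make that layout concrete, here is a minimal sketch of such a Grid (the names are illustrative, and it is written in Swift purely for illustration; the same row-major index arithmetic carries straight over to a Rust struct holding a boxed slice):

// Illustrative sketch: one flat, contiguous allocation plus dimensions.
struct Grid<T> {
    private var storage: [T]
    let nRows: Int
    let nCols: Int

    init(nRows: Int, nCols: Int, fill: T) {
        self.nRows = nRows
        self.nCols = nCols
        self.storage = Array(repeating: fill, count: nRows * nCols)
    }

    // Row-first layout: element (r, c) lives at offset r * nCols + c.
    subscript(r: Int, c: Int) -> T {
        get { storage[r * nCols + c] }
        set { storage[r * nCols + c] = newValue }
    }
}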
The next part is that if you want different threads to mutate e.g. columns/rows of your grid at the same time, Arc<Mutex<Grid>> won't cut it, but you already figured that out. You should get clarity on whether you can split your problem so that each thread only operates on rows OR columns. Remember that if any thread holds a &mut Row, no other thread may hold a &mut Column: there will be an overlapping element, and it will be very easy to create a data race. If you can assign a static range of rows to each thread (e.g. thread 1 processes rows 1-3, thread 2 processes rows 4-6, etc.), that should make your life considerably easier. To get into "row-wise" processing if it doesn't arise naturally from the problem, you might consider breaking it into e.g. a row-wise step, where all threads operate on rows only, and then a column-wise step, possibly repeating those.
Speculative starting point
I would suggest that your main thread holds the Grid struct, which will almost inevitably be implemented with some unsafe methods, e.g. get_row(usize) and get_row_mut(usize) if you can split your problem into rows/columns, or get(usize, usize) and get_mut(usize, usize) if you can't. I cannot tell you exactly what these should return, but they might even be custom reference types into the Grid, which:
can only be obtained when the usual borrowing rules are fulfilled (e.g. by blocking the thread until any other GridRefMut is dropped)
implement Drop such that you don't create a deadlock
Every thread holds an Arc<Grid> and can draw cells/rows/columns out of the grid for reading/mutating as needed, while the grid itself keeps track of the references being created and dropped.
The downside of this approach is that you basically implement a runtime borrow-checker yourself. It's tedious and probably error-prone. You should browse crates.io before you do that, but your problem sounds specific enough that you might not find a fitting solution, let alone one that's sufficiently documented.

Would there be a practical application for a more memory efficient boolean?

I've noticed that booleans occupy a whole byte, despite only needing 1 bit. I was wondering whether we could have something like struct smartbool { char data; }, which would store 8 booleans at once.
I am aware that it would take more time to retrieve data, but would the tradeoff be practical in some scenarios?
Am I missing something about the memory usage of booleans?
Normally variables are aligned on word boundaries; memory use is balanced against efficiency of access. For one-off boolean variables it may not make sense to store them in a denser form.
If you do need a bunch of booleans you can use things like this BitSet data structure: https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/BitSet.html.
There is a type of database index that stores booleans efficiently:
https://en.wikipedia.org/wiki/Bitmap_index. The less space an index takes up the easier it is to keep in memory.
There are already widely used data types that support multiple booleans; they are called integers. You can store and retrieve multiple booleans in an integral type using bitwise operations, screening out the bits you don't care about with a pattern of bits called a bitmask.
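As a minimal sketch of that idea (written in Swift; the same shift-and-mask logic works in any language with bitwise operators, and the type name is made up):

struct PackedBools {
    private var bits: UInt8 = 0   // 8 booleans in one byte

    // Read bit `index` (0...7): shift it down and mask everything else off.
    func get(_ index: Int) -> Bool {
        return (bits >> index) & 1 == 1
    }

    // Write bit `index`: OR the mask in to set, AND its complement to clear.
    mutating func set(_ index: Int, to value: Bool) {
        let mask: UInt8 = 1 << index
        bits = value ? (bits | mask) : (bits & ~mask)
    }
}

var flags = PackedBools()
flags.set(3, to: true)
print(flags.get(3))  // true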
This sort of "packing" is certainly possible and sometimes useful, as a memory-saving optimization. Many languages and libraries provide a way to make it convenient, e.g. std::vector<bool> in C++ is meant to be implemented this way.
However, it should be done only when the programmer knows it will happen and specifically wants it. There is a tradeoff in speed: if bits are used, then setting / clearing / testing a specific bool requires first computing a mask with an appropriate shift, and setting or clearing it now requires a read-modify-write instead of just a write.
And there is a more serious issue in multithreaded programs. Languages like C++ promise that different threads can freely modify different objects, including different elements of the same array, without needing synchronization or causing a data race. For instance, if we have
bool a, b; // not atomic
void thread1() { /* reads and writes a */ }
void thread2() { /* reads and writes b */ }
then this is supposed to work fine. But if the compiler made a and b two different bits in the same byte, concurrent accesses to them would be a data race on that byte, and could cause incorrect behavior (e.g. if the read-modify-writes being done by the two threads were interleaved). The only way to make it safe would be to require that both threads use atomic operations for all their accesses, which are typically many times slower. And if the compiler could freely pack bools in this way, then every operation on a potentially shared bool would have to be made atomic, throughout the entire program. That would be prohibitively expensive.
So this is fine if the programmer wants to pack bools to save memory, is willing to take the hit to speed, and can guarantee that they won't be accessed concurrently. But they should be aware that it's happening, and have control over whether it does.
(Indeed, some people feel that having C++ provide this with vector<bool> was a mistake, since programmers have to know that it is a special exception to the otherwise general rule that vector<T> behaves like an array of T, and different elements of the vector can safely be accessed concurrently. Perhaps they should have left vector<bool> to work in the naive way, and given a different name to the packed version, similar to std::bitset.)

Using something other than a Swift array for mutable fixed-size thread-safe data passed to OpenGL buffer

I am trying to squeeze every bit of efficiency out of my application I am working on.
I have a couple arrays that follow the following conditions:
They are NEVER appended to; I always calculate the index myself
They are allocated once and never change size
It would be nice if they were thread safe as long as it doesn't cost performance
Some hold primitives like floats, or unsigned ints. One of them does hold a class.
Most of these arrays at some point are passed into a glBuffer
Never cleared just overwritten
Some of the arrays' individual elements are changed entirely by =; others are changed by +=
I currently am using Swift native arrays and am allocating them like var arr = [GLfloat](count: 999, repeatedValue: 0); however, I have been reading a lot of documentation and it sounds like Swift arrays are much more abstract than a traditional C-style array. I am not even sure if they are allocated in a block or more like a linked list with bits and pieces thrown all over the place. I believe the code above causes the allocation to happen in one contiguous block, but I'm not sure.
I worry that the abstract nature of Swift arrays is wasting a lot of precious processing time. As you can see from my conditions above, I don't need any of the fancy appending or safety features of Swift arrays. I just need it simple and fast.
My question is: In this scenario should I be using some other form of array? NSArray, somehow get a C-style array going, create my own data type?
I'm looking into thread safety: would a different array type that was more thread safe, such as NSArray, be any slower?
Note that your requirements are contradictory, particularly #2 and #7. You can't operate on them with += and also say they will never change size. "I always calculate the index myself" also doesn't make sense. What else would calculate it? The requirements for things you will hand to glBuffer are radically different than the requirements for things that will hold objects.
If you construct the Array the way you say, you'll get contiguous memory. If you want to be absolutely certain that you have contiguous memory, use a ContiguousArray (but in the vast majority of cases this will give you little to no benefit while costing you complexity; there appear to be some corner cases in the current compiler that give a small advantage to ContiguousArray, but you must benchmark before assuming that's true). It's not clear what kind of "abstractness" you have in mind, but there are no secrets about how Array works. All of stdlib is open source. Go look and see if it does things you want to avoid.
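For the glBuffer part specifically, you don't need to abandon Swift arrays to hand GL a raw pointer; something like the following sketch should work (assuming an iOS OpenGL ES project with a current GL context; the sizes and targets are placeholders):

import OpenGLES

var vertices = ContiguousArray<GLfloat>(repeating: 0, count: 999)
// ... fill vertices by index: vertices[i] = ... ...

var buffer: GLuint = 0
glGenBuffers(1, &buffer)
glBindBuffer(GLenum(GL_ARRAY_BUFFER), buffer)

// Hand the contiguous storage straight to GL without an intermediate copy.
vertices.withUnsafeBufferPointer { buf in
    glBufferData(GLenum(GL_ARRAY_BUFFER),
                 GLsizeiptr(buf.count * MemoryLayout<GLfloat>.stride),
                 buf.baseAddress,
                 GLenum(GL_STATIC_DRAW))
}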
For certain kinds of operations, it is possible for other types of data structures to be faster. For instance, there are cases where a dispatch_data is better and cases where a regular Data would be better and cases where you should use a ManagedBuffer to gain more control. But in general, unless you deeply know what you're doing, you can easily make things dramatically worse. There is no "is always faster" data structure that works correctly for all the kinds of uses you describe. If there were, that would just be the implementation of Array.
None of this makes sense to pursue until you've built some code and started profiling it in optimized builds to understand what's going on. It is very likely that different uses would be optimized by different kinds of data structures.
It's very strange that you ask whether you should use NSArray, since that would be wildly (orders of magnitude) slower than Array for dealing with very large collections of numbers. You definitely need to experiment with these types a bit to get a sense of their characteristics. NSArray is brilliant and extremely fast for certain problems, but not for that one.
But again, write a little code. Profile it. Look at the generated assembler. See what's happening. Watch particularly for any undesired copying or retain counting. If you see that in a specific case, then you have something to think about changing data structures over. But there's no "use this to go fast." All the trade-offs to achieve that in the general case are already in Array.

Why is an object's mutability relevant to its thread safety?

I am working with Core Image on iOS. You guys know CIContext and CIImage are immutable, so they can be shared across threads. The problem is that I am not sure why an object's mutability is so closely related to its thread safety.
I can guess the rough reason is to prevent multiple threads from doing something to a certain object at the same time. But can anybody provide some theoretical evidence or a specific answer?
You are correct, mutable objects are not thread safe because multiple threads can write to that data at the same time. This is in contrast to reading data, an operation multiple threads can do simultaneously without causing problems. Immutable types can only be read.
However, when multiple threads are writing to the same data, these operations may interfere. For example, an NSMutableArray which two threads are editing could easily be corrupted. One thread is editing at one location, suddenly changing the memory the other thread was updating. iOS will use what is called an atomic operation for simple data types. What this means is that iOS requires the edit operation to fully finish before anything else can happen. This has an efficiency advantage over locks. Just Google about this stuff if you want to know more.
Well, actually, you are right.
If you have an immutable object, all you can do is read data from it. And every thread will get the same data, since the object cannot be changed.
When you have a mutable object, the problem is that (in general) read and write operations are not atomic. That means they are not performed instantaneously; they take time, so they can overlap each other. Even on a single-core processor, the system can switch between threads at arbitrary moments in time. Possible problems this might cause:
1) Two threads write to the same object simultaneously. In this case the object might become corrupted (for instance, half of the data comes from the first thread and the other half from the second); the result is unpredictable.
2) One thread writes the data while another reads it. One problem is that the reading thread might get outdated data. Another is that it might read corrupted data (if the writing thread hasn't finished writing yet).
Take the example with an array of values. Let's say I'm doing some computation and I find the count of the array to be 4
var array = [1,2,3,4]
let count = array.count
Now maybe I want to loop through all the elements in the array, so I set up a loop that goes through index i < 4 or something along those lines. So far so good.
This array is mutable though, so on a separate thread, I could easily remove an element. So perhaps I'm on thread 1 and I get the count to be 4, I perhaps start looping through the array. Now we switch to thread 2 and I remove an element, so now this same array has only 3 values in it. If I end up going back to thread 1 and I still assume I have 4 values while looping through the array, my program is going to crash when it tries to access the 4th element.
This is why immutability is desirable, it can guarantee some level of consistency across threads.
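If you do need shared mutable state, the usual fix in Swift is to serialize access to it, for example through a private serial dispatch queue (a minimal sketch; the queue label is arbitrary):

import Foundation

var array = [1, 2, 3, 4]
let queue = DispatchQueue(label: "com.example.array-access")  // serial queue guarding `array`

// Reader: the count and the iteration happen atomically with respect to writers.
queue.sync {
    for value in array {
        print(value)
    }
}

// Writer (possibly called from another thread): cannot interleave with the loop above.
queue.sync {
    array.removeLast()
}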

How expensive is the NSArray count method?

I am in a situation where I need to get the items in an array and time is critical. I have the option of using a separate variable to hold the current count or just using NSMutableArray's count method.
ex: if (myArray.count == ... ) or if (myArrayCount == ...)
How expensive is it to get the count of items from the count method of an array?
The correct answer is, there is no difference in speed, so access the count of the array as you wish my child :)
Calling NSArray's count method is no more expensive than fetching a local variable in which you've stored this value. It's not calculated when it's called; it's calculated when the array is created and then stored.
For NSMutableArray, the only difference is that the value is recalculated any time you modify the contents of the array. The end result is still the same: when you call count, the number returned was precalculated, and count just returns that stored number.
Storing the count in a separate variable, particularly for an NSMutableArray, is actually the worse option, because the size of the array can change, and accessing the count through this variable is not faster at all. It only adds the risk of inaccuracy.
The best way to prove to yourself that this is a precomputed value, not something calculated when the count method is called, is to create two arrays: one with only a few elements, the other with tens of thousands. Now time how long count takes to return. You'll find the time is identical no matter the size of the array.
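A quick Swift sketch of that experiment (the array sizes and iteration counts are arbitrary):

import Foundation

let small = NSArray(array: Array(0..<10))
let large = NSArray(array: Array(0..<100_000))

func time(_ label: String, _ body: () -> Void) {
    let start = CFAbsoluteTimeGetCurrent()
    body()
    print(label, CFAbsoluteTimeGetCurrent() - start)
}

// If count were recomputed on every call, the second loop would be far slower.
time("small") { for _ in 0..<1_000_000 { _ = small.count } }
time("large") { for _ in 0..<1_000_000 { _ = large.count } }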
As a correction to everyone above: NSArray does not have a count property, it has a count method. The method itself either physically counts all of the elements within the array or is a getter for a private variable the array stores. Unless you plan on subclassing NSArray and creating a more efficient counting system for dynamic and/or static arrays, you're not going to get better performance than using the count method on an NSArray. As a matter of fact, you should count on Apple having already optimized this method to its max. My main concern is that if your focus is on optimizing the count of an NSArray, you are probably doing something seriously wrong elsewhere. If you are performing some performance-heavy operation on the main thread or the like, you should consider optimizing that instead. The cost of NSArray's count method should in no way affect your performance to any noticeable degree.
You should read up more on performance for NSArrays and NSMutableArrays if this is truly a concern for you. You can start here: link
If you need to get the items (plural), then getting the count is not the time-critical part. You'd also want to look at fast enumeration, or at enumeration with dispatch blocks, especially with parallel execution.
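For example, NSArray's block enumeration can be asked to run concurrently (a sketch; the processing body is a placeholder):

import Foundation

let items = NSArray(array: Array(0..<1_000))

// With .concurrent, the block may be invoked on multiple threads at once.
items.enumerateObjects(options: .concurrent) { element, index, _ in
    _ = element  // process each element here
}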
Edit:
Asa's answer is the most correct one; I misunderstood the question.
Asa is right because the compiler will automatically optimize this and use the fastest way on its own.
TheGamingArt is correct about NSArray being about as optimal as it could be. However, this applies only to Objective-C.
Don't forget you have access to C and C++, which means you can use vectors, which should be only 'slightly' faster considering they don't use Objective-C messaging. However, it wouldn't surprise me if the difference isn't noticeable. C++ vector benchmarks: http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html
This is a good example of Premature Optimization (http://c2.com/cgi/wiki?PrematureOptimization). I suggest you look into GCD or NSOperations (http://www.raywenderlich.com/19788/how-to-use-nsoperations-and-nsoperationqueues)
