For example, code running on thread 1 uses a global variable, but another thread may change its value:
thread 1
a = 1;
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    NSLog(@"a = %d", a);
});
thread 2
a = 2;
I have two questions:
If thread 1 executes first, can I assume the system will always print a = 1? Or can the system switch to thread 2 halfway through, then switch back to thread 1 and print a = 2?
If I don't put the NSLog inside dispatch_async(), will that give a different result?
You can't tell or guarantee exactly. You can assign the threads (or queues) priorities but you still don't know exactly what will happen.
Yes, logging changes the timing at runtime, and again you can't know whether that changes how the threads are scheduled.
So, if you need something to be protected from access by multiple threads then you need to protect it by adding some synchronisation. How you choose to do that depends on what it is and each case needs to be considered separately.
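As one possible form of the synchronisation mentioned above, here is a minimal sketch in plain C that guards the global with a POSIX mutex. The names shared_a, set_shared_a, get_shared_a and write_from_other_thread are hypothetical and do not come from the question. In a GCD-based app a serial queue or @synchronized could play the same role, so treat this only as an illustration of the idea.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state: the global from the question, plus a
   mutex that every reader and writer must hold while touching it. */
static int shared_a = 0;
static pthread_mutex_t shared_a_lock = PTHREAD_MUTEX_INITIALIZER;

/* Store a new value while holding the lock. */
void set_shared_a(int value) {
    int rc = pthread_mutex_lock(&shared_a_lock);
    assert(rc == 0);
    shared_a = value;
    rc = pthread_mutex_unlock(&shared_a_lock);
    assert(rc == 0);
    (void)rc;
}

/* Read the current value while holding the lock. */
int get_shared_a(void) {
    int rc = pthread_mutex_lock(&shared_a_lock);
    assert(rc == 0);
    int snapshot = shared_a;
    rc = pthread_mutex_unlock(&shared_a_lock);
    assert(rc == 0);
    (void)rc;
    return snapshot;
}

/* Thread body: plays the role of "thread 2" and writes the value. */
static void *writer_thread(void *arg) {
    set_shared_a(*(int *)arg);
    return NULL;
}

/* Starts a second thread that stores new_value, waits for it to
   finish, and returns what the calling thread sees afterwards. */
int write_from_other_thread(int new_value) {
    pthread_t t;
    int rc = pthread_create(&t, NULL, writer_thread, &new_value);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;
    return get_shared_a();
}
```

The lock does not decide which thread runs first. It only guarantees that each read or write of the variable happens as a whole and is visible to the next thread that takes the lock.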
As far as I know, volatile is usually used to prevent unexpected compiler optimizations around hardware operations. But I'm puzzled about when volatile should be declared in a property definition. Please give some representative examples.
Thx.
A compiler assumes that the only way a variable can change its value is through code that changes it.
int a = 24;
Now the compiler assumes that a is 24 until it sees any statement that changes the value of a. If you write code somewhere below the above statement that says
int b = a + 3;
the compiler will say "I know what a is, it's 24! So b is 27. I don't have to write code to perform that calculation, I know that it will always be 27". The compiler may just optimize the whole calculation away.
But the compiler would be wrong in case a has changed between the assignment and the calculation. However, why would a do that? Why would a suddenly have a different value? It won't.
If a is a stack variable, it cannot change value, unless you pass a reference to it, e.g.
doSomething(&a);
The function doSomething has a pointer to a, which means it can change the value of a and after that line of code, a may not be 24 any longer. So if you write
int a = 24;
doSomething(&a);
int b = a + 3;
the compiler will not optimize the calculation away. Who knows what value a will have after doSomething? The compiler for sure doesn't.
Things get more tricky with global variables or instance variables of objects. These variables are not on the stack: globals live in static storage and instance variables on the heap, which means that different threads can have access to them.
// Global Scope
int a = 0;
void function ( ) {
    a = 24;
    int b = a + 3;
}
Will b be 27? Most likely the answer is yes, but there is a tiny chance that some other thread has changed the value of a between these two lines of code and then it won't be 27. Does the compiler care? No. Why? Because C doesn't know anything about threads, or at least it didn't use to (C11 finally added native threads, but all thread functionality before that was only API provided by the operating system and not native to C). So a C compiler will still assume that b is 27 and optimize the calculation away, which may lead to incorrect results.
And that's what volatile is good for. If you tag a variable volatile like this
volatile int a = 0;
you are basically telling the compiler: "The value of a may change at any time. No seriously, it may change out of the blue. You don't see it coming and *bang*, it has a different value!". For the compiler that means it must not assume that a has a certain value just because it used to have that value 1 pico-second ago and there was no code that seemed to have changed it. Doesn't matter. When accessing a, always read its current value.
Overuse of volatile prevents a lot of compiler optimizations and may slow down calculation code dramatically, and people very often use volatile in situations where it isn't even necessary. For example, the compiler never makes value assumptions across memory barriers. What exactly is a memory barrier? Well, that's a bit beyond the scope of my reply. You just need to know that typical synchronization constructs are memory barriers, e.g. locks, mutexes or semaphores. Consider this code:
// Global Scope
int a = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
void function ( ) {
    a = 24;
    pthread_mutex_lock(&m);
    int b = a + 3;
    pthread_mutex_unlock(&m);
}
pthread_mutex_lock is a memory barrier (pthread_mutex_unlock as well, by the way), so it's not necessary to declare a as volatile: the compiler will never make an assumption about the value of a across a memory barrier.
Objective-C is pretty much like C in all these aspects; after all, it's just C with extensions and a runtime. One thing to note is that atomic properties in Obj-C are memory barriers, so you don't need to declare properties volatile. If you access the property from multiple threads, declare it atomic, which is even the default, by the way (if you don't mark it nonatomic, it will be atomic). If you never access it from multiple threads, tagging it nonatomic will make access to that property a lot faster, but that only pays off if you access the property really a lot (a lot doesn't mean ten times a minute; it's rather several thousand times a second).
So you want Obj-C code, that requires volatile?
@implementation SomeObject {
    volatile bool done;
}

- (void)someMethod {
    done = false;

    // Start some background task that performs an action
    // and when it is done with that action, it sets `done` to true.
    // ...

    // Wait till the background task is done
    while (!done) {
        // Run the runloop for 10 ms, then check again
        [[NSRunLoop currentRunLoop]
            runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.01]
        ];
    }
}
@end
Without volatile, the compiler may be dumb enough to assume that done will never change here and replace !done simply with true. And while (true) is an endless loop that will never terminate.
I haven't tested that with modern compilers. Maybe the current version of clang is more intelligent than that. It may also depend on how you start the background task. If you dispatch a block, the compiler can actually easily see whether it changes done or not. If you pass a reference to done somewhere, the compiler knows that the receiver may change the value of done and will not make any assumptions. But I tested exactly that code a long time ago when Apple was still using GCC 2.x, and there, not using volatile really caused an endless loop that never terminated (yet only in release builds with optimizations enabled, not in debug builds). So I would not rely on the compiler being clever enough to do it right.
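For comparison, here is a hedged sketch of the same "wait until done" pattern in plain C, using a C11 atomic flag from <stdatomic.h> instead of volatile. This is a different technique from the one in the Obj-C example above, not a drop-in replacement for it. The names background_task, run_and_wait and result are made up for illustration, and sched_yield() merely stands in for running the runloop.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state shared between the waiter and the background task. */
static atomic_bool done;
static atomic_int result;

/* Background task: performs its "action", then sets done to true. */
static void *background_task(void *arg) {
    (void)arg;
    atomic_store(&result, 42);
    atomic_store(&done, true);
    return NULL;
}

/* Starts the task and polls the flag until the task reports completion.
   Every atomic_load is a real read, so the loop cannot be folded into
   an endless while (true). */
int run_and_wait(void) {
    pthread_t t;
    atomic_store(&done, false);
    atomic_store(&result, 0);
    int rc = pthread_create(&t, NULL, background_task, NULL);
    assert(rc == 0);
    while (!atomic_load(&done)) {
        sched_yield(); /* let other threads run, then check again */
    }
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;
    return atomic_load(&result);
}
```

With the default (sequentially consistent) atomic operations used here, the flag also acts as a memory barrier, which volatile alone does not promise.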
Just some more fun facts about memory barriers:
If you ever had a look at the atomic operations that Apple offers in <libkern/OSAtomic.h>, then you might have wondered why every operation exists twice: Once as x and once as xBarrier (e.g. OSAtomicAdd32 and OSAtomicAdd32Barrier). Well, now you finally know it. The one with "Barrier" in its name is a memory barrier, the other one isn't.
Memory barriers are not just for compilers, they are also for CPUs (there exist CPU instructions that are considered memory barriers while normal instructions are not). The CPU needs to know these barriers because CPUs like to reorder instructions to perform operations out of order. E.g. if you do
a = x + 3 // (1)
b = y * 5 // (2)
c = a + b // (3)
and the pipeline for additions is busy, but the pipeline for multiplication is not, the CPU may perform instruction (2) before (1), after all the order won't matter in the end. This prevents a pipeline stall. Also the CPU is clever enough to know that it cannot perform (3) before either (1) or (2) because the result of (3) depends on the results of the other two calculations.
Yet, certain kinds of order changes will break the code, or the intention of the programmer. Consider this example:
x = y + z // (1)
a = 1 // (2)
The addition pipe might be busy, so why not just perform (2) before (1)? They don't depend on each other, the order shouldn't matter, right? Well, it depends. Suppose another thread monitors a for changes and, as soon as a becomes 1, reads the value of x, which should now be y+z if the instructions were performed in order. Yet if the CPU reordered them, then x will have whatever value it used to have before getting to this code, and this makes a difference, as the other thread will now work with a different value, not the value the programmer would have expected.
So in this case the order will matter and that's why barriers are needed also for CPUs: CPUs don't reorder instructions across such barriers, and thus instruction (2) would need to be a barrier instruction (or there needs to be such an instruction between (1) and (2); that depends on the CPU). However, reordering instructions is only performed by modern CPUs; a much older problem is delayed memory writes. If a CPU delays memory writes (very common for some CPUs, as memory access is horribly slow for a CPU), it will make sure that all delayed writes are performed and have completed before a memory barrier is crossed, so all memory is in a correct state in case another thread might now access it (and now you also know where the name "memory barrier" actually comes from).
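The x/a example can be sketched in C11 with an explicit fence placed between (1) and (2). This is only an illustration under the assumption that a C11 compiler with <stdatomic.h> is available. The function names publish, wait_and_read and demo_publish are hypothetical, and a matching acquire fence is added on the reading side, which the text above does not spell out.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static int x;        /* ordinary data, written before the flag */
static atomic_int a; /* the flag the other thread monitors */

/* Writer: performs (1), then a release fence, then (2). The fence keeps
   both the compiler and the CPU from moving (1) after (2). */
void publish(int y, int z) {
    x = y + z;                                          /* (1) */
    atomic_thread_fence(memory_order_release);          /* barrier */
    atomic_store_explicit(&a, 1, memory_order_relaxed); /* (2) */
}

/* Reader: waits until a becomes 1, then reads x. The acquire fence
   pairs with the release fence in publish(). */
int wait_and_read(void) {
    while (atomic_load_explicit(&a, memory_order_relaxed) != 1) {
        sched_yield();
    }
    atomic_thread_fence(memory_order_acquire);
    return x;
}

static void *reader_thread(void *out) {
    *(int *)out = wait_and_read();
    return NULL;
}

/* Runs one reader thread against one publish() call and returns the
   value of x that the reader observed. */
int demo_publish(int y, int z) {
    pthread_t t;
    int seen = -1;
    atomic_store(&a, 0);
    int rc = pthread_create(&t, NULL, reader_thread, &seen);
    assert(rc == 0);
    publish(y, z);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;
    return seen;
}
```

Without the two fences, a reader that sees a == 1 would have no guarantee about which value of x it reads.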
You are probably working a lot more with memory barriers than you are even aware of (GCD, Grand Central Dispatch, is full of these, and NSOperation/NSOperationQueue is based on GCD); that's why you really need to use volatile only in very rare, exceptional cases. You might get away with writing 100 apps and never have to use it even once. However, if you write a lot of low-level, multi-threaded code that aims to achieve the maximum performance possible, you will sooner or later run into a situation where only volatile can guarantee you correct behavior; not using it in such a situation will lead to strange bugs where loops don't seem to terminate or variables simply seem to have incorrect values and you find no explanation for that. If you run into bugs like these, especially if you only see them in release builds, you might be missing a volatile or a memory barrier somewhere in your code.
A good explanation is given here: Understanding “volatile” qualifier in C
The volatile keyword is intended to prevent the compiler from applying any optimizations on objects that can change in ways that cannot be determined by the compiler.
Objects declared as volatile are omitted from optimization because their values can be changed by code outside the scope of current code at any time. The system always reads the current value of a volatile object from the memory location rather than keeping its value in temporary register at the point it is requested, even if a previous instruction asked for a value from the same object. So the simple question is, how can value of a variable change in such a way that compiler cannot predict. Consider the following cases for answer to this question.
1) Global variables modified by an interrupt service routine outside the scope: For example, a global variable can represent a data port (usually global pointer referred as memory mapped IO) which will be updated dynamically. The code reading data port must be declared as volatile in order to fetch latest data available at the port. Failing to declare variable as volatile, the compiler will optimize the code in such a way that it will read the port only once and keeps using the same value in a temporary register to speed up the program (speed optimization). In general, an ISR used to update these data port when there is an interrupt due to availability of new data
2) Global variables within a multi-threaded application: There are multiple ways for threads communication, viz, message passing, shared memory, mail boxes, etc. A global variable is weak form of shared memory. When two threads sharing information via global variable, they need to be qualified with volatile. Since threads run asynchronously, any update of global variable due to one thread should be fetched freshly by another consumer thread. Compiler can read the global variable and can place them in temporary variable of current thread context. To nullify the effect of compiler optimizations, such global variables to be qualified as volatile
If we do not use volatile qualifier, the following problems may arise
1) Code may not work as expected when optimization is turned on.
2) Code may not work as expected when interrupts are enabled and used.
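Case 1 above can be approximated on a hosted system with a signal handler standing in for the interrupt service routine. The sketch below assumes a POSIX environment (for SIGUSR1). The names data_ready, on_signal, simulate_interrupt and polls_until_ready are hypothetical. The standard C type for a flag shared with a handler is volatile sig_atomic_t.

```c
#include <assert.h>
#include <signal.h>

/* Flag written by the handler and read by normal code. volatile forces
   every read in the polling loop to go back to memory. */
static volatile sig_atomic_t data_ready = 0;

/* Stand-in for the ISR: only sets the flag. */
static void on_signal(int signo) {
    (void)signo;
    data_ready = 1;
}

/* Installs the handler, "fires the interrupt" with raise(), restores
   the previous handler and returns the flag value seen afterwards. */
int simulate_interrupt(void) {
    data_ready = 0;
    void (*previous)(int) = signal(SIGUSR1, on_signal);
    assert(previous != SIG_ERR);
    int rc = raise(SIGUSR1); /* the handler has run once this returns */
    assert(rc == 0);
    (void)rc;
    signal(SIGUSR1, previous);
    return data_ready;
}

/* Polling loop of the kind the text describes: spins until the flag
   is set and reports how many times it had to spin. */
int polls_until_ready(void) {
    int polls = 0;
    while (!data_ready) {
        polls++;
    }
    return polls;
}
```

If data_ready were a plain int, an optimizing compiler would be allowed to read it once before the loop in polls_until_ready and never again.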
volatile comes from C. Type "C language volatile" into your favourite search engine (some of the results will probably come from SO), or read a book on C programming. There are plenty of examples out there.
I've just started playing around with POSIX pthreads (in C++).
I'm trying to use a condition variable to start many threads at once.
Does someone know a better way to do this, or can someone give an example of how one would?
If you have ruled out pthread_cond_broadcast and are trying to do this, you have probably already created the threads and might be looking for a way to release them all at once. If that is the case, you may want to use a barrier.
You can initialize a barrier with pthread_barrier_init, which takes a parameter for the number of threads you want to wait on. When the specified number of threads have hit a pthread_barrier_wait statement, all the waiting threads are released at once (i.e. marked ready to run), though of course they remain subject to the whims of the scheduler as to which may or may not immediately get processor time.
A very simple sketch
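pthread_barrier_t bar;
pthread_t tid[4];

void* tfunc(void *)
{
    pthread_barrier_wait(&bar);  // blocks until 4 threads have arrived
    //do stuff
    return NULL;
}

// in the thread that sets things up:
pthread_barrier_init(&bar, NULL, 4);
for (int i = 0; i < 4; ++i)
    pthread_create(&tid[i], NULL, tfunc, NULL);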
When the 4th thread hits the wait all the waiting threads will continue.
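A somewhat fuller version of the sketch, written as plain C with return codes checked and the threads joined, might look like this. It assumes a platform that provides POSIX barriers (they are an optional POSIX feature and are not available everywhere). The counters arrived and released_early and the function run_barrier_demo are hypothetical additions, used only to show that no thread gets past the barrier before all of them have reached it.

```c
#define _POSIX_C_SOURCE 200809L /* ask for pthread barriers in strict modes */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4

static pthread_barrier_t bar;
static atomic_int arrived;        /* threads that reached the barrier */
static atomic_int released_early; /* threads released before all arrived */

static void *tfunc(void *arg) {
    (void)arg;
    atomic_fetch_add(&arrived, 1);
    pthread_barrier_wait(&bar); /* blocks until NUM_THREADS threads wait */
    /* Past the barrier, every thread must already have arrived. */
    if (atomic_load(&arrived) != NUM_THREADS) {
        atomic_fetch_add(&released_early, 1);
    }
    /* do stuff */
    return NULL;
}

/* Creates the threads, lets the barrier release them together, joins
   them, and returns how many were released too early (expected: 0). */
int run_barrier_demo(void) {
    pthread_t tid[NUM_THREADS];
    atomic_store(&arrived, 0);
    atomic_store(&released_early, 0);
    int rc = pthread_barrier_init(&bar, NULL, NUM_THREADS);
    assert(rc == 0);
    for (int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_create(&tid[i], NULL, tfunc, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(tid[i], NULL);
        assert(rc == 0);
    }
    rc = pthread_barrier_destroy(&bar);
    assert(rc == 0);
    (void)rc;
    return atomic_load(&released_early);
}
```

If the creating thread should also take part in the release, one option is to initialize the barrier with NUM_THREADS + 1 and have that thread call pthread_barrier_wait as well.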
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_global_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and is nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now a subtle problem with this approach arises if the compiler reorders the writes so that the write to turn executes before the write to flag. If this happens, both threads will end up in the C.S. at the same time. This is fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
The variables turn and flag need to be volatile; otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.x will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
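For the Pthreads case the question mentions, a CPU-side sketch of Peterson's algorithm in C11 shows what the algorithm relies on: flag and turn are sequentially consistent atomics, so neither the compiler nor the CPU may reorder the two stores or cache the loads. This is not CUDA code and says nothing about how threads inside one warp would behave. The names peterson_lock, peterson_unlock, run_peterson and ITERATIONS are made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define ITERATIONS 20000

static atomic_int flag[2];
static atomic_int turn;
static long shared_counter; /* plain variable, protected by the lock */

/* Entry protocol: announce interest, give the turn away, then wait.
   atomic_store/atomic_load default to sequential consistency. */
static void peterson_lock(int me) {
    int other = 1 - me;
    atomic_store(&flag[me], 1);
    atomic_store(&turn, other);
    while (atomic_load(&flag[other]) == 1 && atomic_load(&turn) == other) {
        sched_yield();
    }
}

/* Exit protocol: withdraw interest. */
static void peterson_unlock(int me) {
    atomic_store(&flag[me], 0);
}

static void *worker(void *arg) {
    int me = *(int *)arg;
    for (int i = 0; i < ITERATIONS; ++i) {
        peterson_lock(me);
        shared_counter++; /* critical section */
        peterson_unlock(me);
    }
    return NULL;
}

/* Runs two workers and returns the final counter value. */
long run_peterson(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    shared_counter = 0;
    atomic_store(&flag[0], 0);
    atomic_store(&flag[1], 0);
    atomic_store(&turn, 0);
    for (int i = 0; i < 2; ++i) {
        int rc = pthread_create(&t[i], NULL, worker, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; ++i) {
        pthread_join(t[i], NULL);
    }
    return shared_counter;
}
```

If mutual exclusion holds, no increment is lost and the result is exactly 2 * ITERATIONS. Weakening the stores and loads to relaxed ordering would remove that guarantee.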
I am writing records into Mnesia which should be kept there only for an allowed time (24 hours). After 24 hours, before a user modifies any part of them, the system should remove them automatically. For example, a user is given free airtime (for voice calls) which they should use within a given time. If they do not use it, after 24 hours the system should remove these resource reservations from the user's record.
Now, this has brought in timers. An example of a record structure is:
-record(free_airtime,
{
reference_no,
timer_object, %% value returned by timer:apply_after/4
amount
}).
The timer object in the record is important because, if the user finally uses the reserved resources before they time out (or if they time out), the system can call timer:cancel/1 to relieve the timer server of this object.
Now the problem: I have two ways of handling timers on these records.
Option 1: timers handled within the transaction
reserve_resources(Reference_no,Amnt)->
    F = fun(Ref_no,Amount) ->
            case mnesia:read({free_airtime,Ref_no}) of
                [] ->
                    case mnesia:write(#free_airtime{reference_no = Ref_no,amount = Amount}) == ok of
                        true ->
                            case timer:apply_after(timer:hours(24),?MODULE,reference_no_timed_out,[Ref_no]) of
                                {ok,Timer_obj} ->
                                    [Obj] = mnesia:read({free_airtime,Ref_no}),
                                    mnesia:write(Obj#free_airtime{timer_object = Timer_obj});
                                _ -> mnesia:abort({error,failed_to_time_object})
                            end;
                        false -> mnesia:abort({error,write_failed})
                    end;
                [_] -> mnesia:abort({error,exists,Ref_no})
            end
        end,
    mnesia:activity(transaction,F,[Reference_no,Amnt],mnesia_frag).
About the above option: the Mnesia docs say that transactions may be repeated by the transaction manager (for some reason) until they are successful, so code such as io:format/2, or anything else that has nothing to do with reads or writes, may get executed several times. This statement made me pause and think of a way of handling timers outside the transaction itself, so I modified the code as follows:
Option 2: timers handled outside the transaction
reserve_resources(Reference_no,Amnt)->
    F = fun(Ref_no,Amount) ->
            case mnesia:read({free_airtime,Ref_no}) of
                [] ->
                    P = #free_airtime{reference_no = Ref_no,amount = Amount},
                    ok = mnesia:write(P),
                    P;
                [_] -> mnesia:abort({error,exists,Ref_no})
            end
        end,
    Result = try mnesia:activity(transaction,F,[Reference_no,Amnt],mnesia_frag) of
                 Any -> Any
             catch
                 exit:{aborted,{error,exists,XX}} -> {exists,XX};
                 E1:E2 -> {error,{E1,E2}}
             end,
    on_reservation(Result).

on_reservation(#free_airtime{reference_no = Some_Ref})->
    case timer:apply_after(timer:hours(24),?MODULE,reference_no_timed_out,[Some_Ref]) of
        {ok,Timer_obj} ->
            [Obj] = mnesia:activity(transaction,fun(XX) -> mnesia:read({free_airtime,XX}) end,[Some_Ref],mnesia_frag),
            ok = mnesia:activity(transaction,fun(XX) -> mnesia:write(XX) end,[Obj#free_airtime{timer_object = Timer_obj}],mnesia_frag);
        _ ->
            ok = mnesia:activity(transaction,fun(XX) -> mnesia:delete({free_airtime,XX}) end,[Some_Ref],mnesia_frag),
            {error,failed_to_time_object}
    end;
on_reservation(Any)-> Any.
The code to handle time out of the reservation:
reference_no_timed_out(Ref_no)->
    %% do some things here ...
    %% then later remove this reservation from the database:
    ok = mnesia:activity(transaction,fun(XX) -> mnesia:delete({free_airtime,XX}) end,[Ref_no],mnesia_frag).
Now I thought that option 2 is safer because it keeps the timer-processing code out of the transaction: even when mnesia_tm re-executes the transaction for its own reasons, this piece of code is not run twice (I avoid having several timer objects against the same record).
Question 1: Which of these two implementations is right, and which is wrong? Tell me also whether both of them are wrong.
Question 2: Is the timer module well suited for handling large numbers of timer jobs in production?
Question 3: Compared to Sean Hinde's timer_mn-1.1, which runs on top of Mnesia, is the timer module (possibly running on top of ETS tables) really less capable in production? (I am asking this because using Sean Hinde's timer_mn on a system which itself is using Mnesia appears to be a problem in terms of schema changes, node problems, etc.)
If anyone has another way of handling timer-related problems with Mnesia, let me know.
Thanks, guys.
Question 1:
Handle the timer outside the transaction. When transactions collide in Mnesia, they are simply repeated. That would give you more than one timer reference and two triggers of the timer. It is not a problem per se, but if you wait until the success of the transaction before installing the timer, you can avoid the problem.
The second solution is what I would do. If the TX is okay, you can install a timer on it. If the timer triggers and there is no reference to the object, it doesn't matter. You only need to worry if this situation happens a lot, since you would then have a large number of stray timers.
Question 2:
The timer module is neat, but the Efficiency Guide recommends you use the erlang:start_timer BIFs instead, see
http://www.erlang.org/doc/efficiency_guide/commoncaveats.html#id58959
I would introduce a separate process as a gen_server which handles the timing stuff. You send it a remove(timer:hours(24), RefNo) message and then it starts up a timer, gets a TRef and installs a mapping {TRef, RefNo, AuxData} in either Mnesia or ETS. When the timer triggers, the process can spawn a helper that removes the RefNo entry from the main table.
At this point, you must wonder about crashes. The removal gen_server may crash. Also, the whole node may crash. How you want to reinstall timers in the case this happens is up to you, but you ought to ponder on it happening so you can solve it. Suppose we come up again and the timer information is loaded in from disk. How do you plan on reinstalling the timers?
One way is to have AuxData contain information about the timeout point. Every hour or 15 minutes, you scan the whole table, removing entries that shouldn't be there. In fact, you could opt for this being the main way to remove timer structures. Yes, you will give people 15 minutes of extra time in the worst case, but it may be easier to handle code-wise. At least it better handles the case where the node (and thus the timers) dies.
Another option is to cheat and only store timings roughly in a data structure which makes it very cheap to find all RefNos that expired in the last 5 minutes, and then run that every 5 minutes. Doing stuff in bulk is probably going to be more efficient. This kind of bulk handling is used a lot by operating system kernels, for instance.
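The periodic sweep idea does not depend on Erlang, so here is a language-neutral illustration in C, with a plain array standing in for the Mnesia or ETS table. Everything in it is hypothetical: the reservation struct, its expires_at field (playing the role of the timeout point stored in AuxData) and the sweep_expired function. In the real system the scan would be a Mnesia or ETS select run from a periodic process.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical in-memory stand-in for one free_airtime row. */
typedef struct {
    int    reference_no;
    time_t expires_at; /* absolute time at which the reservation lapses */
    int    amount;
} reservation;

/* Removes every reservation whose expiry time is at or before `now`.
   Surviving rows are compacted to the front of the array in their
   original order. Returns the number of rows kept. */
size_t sweep_expired(reservation *table, size_t count, time_t now) {
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        if (table[i].expires_at > now) {
            table[kept++] = table[i];
        }
    }
    return kept;
}
```

Because the expiry time is stored in the row itself, a sweep after a node restart picks up exactly where the lost timers left off. The cost is that a row may outlive its deadline by up to one sweep interval.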
Question 3
I know nothing about timer_mn, sorry :)