destroying an orphaned process-shared condition variable - pthreads

Is the behavior of pthread_cond_destroy on an orphaned, process-shared condition variable specified, unspecified, implementation-defined, or undefined? Also, is the behavior I'm seeing on Linux (spelled out below) a bug?
By an "orphaned" cv here I mean one that was in a pthread_cond_wait call at the time of its waiter's death.
Adapting a scenario from this question, I find that if I do this on Linux:
Time Process A Process B Comments
---- --------- --------- --------
1 mmap MAP_ANONYMOUS // or shm_open()
2 init pshared cv
3 init pshared mutex
4 fork ------------------> lock(mutex) // can also re-shm_open()
5 wait... alarm(a_timeout)
6 wait... cond_wait(cv, mutex)
7 wait <------------------ <<ALRM>>
8 cond_signal(cv) // (without this, EBUSY for #9)
9 cond_destroy(cv) // blocks on linux
On Linux, the destroy() (#9) blocks forever. If I omit the signal (#8) to the orphaned cv, then the Linux destroy() returns EBUSY. On OS X, by contrast, the destroy() always returns EBUSY, regardless of signaling or not.
For what it's worth, I do not see this behavior on Linux with process-shared mutexes and cvs in a single multi-threaded process (with the waiting thread cancel()d).
Again, what's spec and what's bug?

According to the spec for pthread_cond_destroy
"It shall be safe to destroy an initialized condition variable upon
which no threads are currently blocked"
As this is exactly the case, i.e. there are no other threads whatsoever that reference or are blocked on the condition variable, the destroy shall by successful.
IMHO, we have bugs in both operating system in that the condition variable object is left in an inconsistent state.

Related

Potential Impact of Violating ZwOpenKey Should Only Be Called at IRQL = PASSIVE_LEVEL

A 3rd party data loss prevention driver when enabled driver verifier on it causes driver verifier bugcheck based on IrqlZwPassive Rule
The crash includes the following information:
ZwOpenKey should only be called at IRQL = PASSIVE_LEVEL.
What are some of the potential impacts to a Windows system if ZwOpenKey is used outside of IRQL=PASSIVE_LEVEL?
Is this always a serious problem that we should raise with a vendor, or only in certain scenarios.
all Zw api in kernel must be called only on PASSIVE_LEVEL. this is by design. if call it on APC_LEVEL this already will be UB some times this can work, some times produce hang or crash. say in case ZwOpenKey - registry manager can read key data from disk, if it still not in memory. so pass IRP to filesystem and wait for it complete. but Irp for completion can insert special APC (IopCompleteRequest) in calling thread. if thread on APC level - APC will not be executed, until IRQL of thread not lower to passive. but it never done - he wait on IRP complete..
another point - on exit from Zw service, system check - are UserApcPending in Thread and if yes, raise IRQL to APC_LEVEL, initiate user apc, and lower it back to PASSIVE_LEVEL (system assume that Zw called on PASSIVE_LEVEL) - so you can enter to Zw api at APC_LEVEL and exit on PASSIVE_LEVEL. can ask - why thread at some time have APC_LEVEL ? simply, because nothing to do IRQL raised ? or exist some requirements why at some point must be APC_LEVEL ? if yes, what is be if situation require stay on APC_LEVEL but thread ahead of time lower IRQL to PASSIVE_LEVEL ? really UB. in most case can be nothing. but in some case can be very nasty bug which very hard catch and research.

lost in debug ... can't stop execution

I am trying to understand why an installation file hangs up using Windbg, but I am at a point where I can't stop the execution.
As background, I had already been able to install this program on the same PC, but for some reason I had then uninstalled, and now I can't re-install it (I tried to clean up everything from the old installation, incl. registry). Now this setup.exe starts and stays idle among the running processes without doing anything.
But let's go to the actual question. I am trying to use Windbg for the first time (I only had some practice with the old 8086 debug at DOS-time :-), so please bear with me if I'm asking something straightforward).
I have tracked the code up to a point where I have a RET code. I am able to stop the debugger at the RET instruction, but as soon as I "step into" the RET, the execution starts and does not stop, while I was expecting it to just go to the instruction following the previous CALL. From how I see things, it seems that after the RET the execution goes somewhere else ... how is it possible? Also, just before the RET there is a SYSCALL that I don't fully understand ... can it have an impact?
This is the portion of the code I am examining at the moment:
ntdll!NtTerminateThread:
00007ff9`fc8b5b20 4c8bd1 mov r10,rcx
00007ff9`fc8b5b23 b853000000 mov eax,53h
00007ff9`fc8b5b28 f604250803fe7f01 test byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ff9`fc8b5b30 7503 jne ntdll!NtTerminateThread+0x15 (00007ff9`fc8b5b35)
00007ff9`fc8b5b32 0f05 syscall
00007ff9`fc8b5b34 c3 ret
00007ff9`fc8b5b35 cd2e int 2Eh
00007ff9`fc8b5b37 c3 ret
I am stuck at the first RET instruction, at address 5b34.
At this time, this is the stack call:
00000000`0203fc38 00007ff9`fc86c63e ntdll!NtTerminateThread+0x14
00000000`0203fc40 00007ff9`fc8d903a ntdll!RtlExitUserThread+0x4e
00000000`0203fc80 00007ff9`fc86c5c5 ntdll!DbgUiRemoteBreakin+0x5a
00000000`0203fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x45
so my understanding is that execution should continue at address 00007ff9`fc86c63e. However, even if I add a BP at this address, or if I just go for a trace, the execution continues and keeps running some idle loop until I hit the "pause" button in windbg, after which it resume at a completely different address.
In case the registers are relevant, here are some of them:
rax: 353000
rbx: 0
rcx: 0
rdx: 0
rsp: 203fc38
rdi: 7ff9c8d8fe0
rip: 7ff9fc8b5b34
So, eventually, where am I wrong? How can I see where the code goes after this RET?
Thanks in advance for any help,
Bob
when a thread exits the execution wont return to the return address
the system is free to schedule another thread that is ready in the process
the stack shows NtTerminateThread on the stack it is a function that does not return
__declspec noreturn foo (...) ;
btw when you say it goes elsewhere do you mean the app keeps running and is not terminated if so hit ctrl+break and check what other threads are doing
ie if usermode ~*kb should show all the threads callstack
answering the comment about where it goes
process is a collection of threads
each thread has a stack and each thread gets a bit of time to execute from the scheduler (thread quantum)
each thread that has a lower priority can be preempted by threads xxxx ,yyy ,zzz with higher priorities by interrupts by apcs , dpcs etc
when a thread has completed its quantum or is preempted by some vip cavalcade happening to travel on the road that this poor thread is walking
a trap is made _KTRAP and this poor threads position EIP is filled into it and put in a waiting threads barricade
when the vip cavalcade's dust has settled police open the barricade and let the poor thread walk from where it stopped
for such gory details you may need a kernel debugging setup and may need to control your process from a kernel debugger
when you hit the return os sees the thread is dead and has no return address
so it checks the !ready threads and selects the highest priority thread and provides it a quantum to enjoy
so before hitting the return address check what all other threads are doing in your app set an appropriate break on threads of interest and hit the return when the other thread executes its quantum your break will get hit
You're looking at the wrong thread!
From the partial output you supplied seems like you're attaching to a running process (rather than start it from the debugger). To break into a running process the debugger injects a thread into the target process that basically contains a hardcoded int 3 instruction and not much more.
It does it by calling ntdll!RtlpCreateUserThreadEx (the internal undocumented native parallel of CreateRemoteThread) supplying ntdll!DbgUiRemoteBreakin as the start address for the new thread.
The sole purpose of this synthetic thread is to generate the breakpoint exception. This exception causes the operating system to stop running the target process and passes control to the debugger. After it does this it's not needed anymore and it commits suicide.
What you're supposed to do at this point is probably switch to your thread of interest using ~s command, set breakpoints and then continue execution.
If try to step through this synthetic thread it will just end, and then the process will continue doing whatever it was doing before you broke into it, which is pretty much the opposite of what you want,
That's what this stack means:
00000000`0203fc38 00007ff9`fc86c63e ntdll!NtTerminateThread+0x14
00000000`0203fc40 00007ff9`fc8d903a ntdll!RtlExitUserThread+0x4e
00000000`0203fc80 00007ff9`fc86c5c5 ntdll!DbgUiRemoteBreakin+0x5a
00000000`0203fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x45
ntdll!RtlUserThreadStart is the real user-mode entry point of all user-mode threads and you can see that it just called ntdll!DbgUiRemoteBreakin after which you continued a bit until the thread finally ends itself.

Handing off a piece of work to a thread and waiting for it to accept

My application works as follows:
the worker-threads initialize and begin waiting in pthread_cond_wait()
the main thread connects to DB and starts handing over one row at a time to the proper worker
Because of the DB-driver internals, the next row can not be read until the current one is extracted, so the main thread has to wait for the worker to "accept" the row.
I achieve this by calling pthread_cond_wait() inside the main thread -- waiting for a pthread_signal() from the worker. This works cleanly -- on both Linux and FreeBSD -- but usually takes much longer on Linux. Whereas I consistently process the entire 1.6M rows in about 27 seconds on FreeBSD, on Linux it usually takes over 2 minutes. Except sometimes the Linux box shows the same time...
The code is compiled from the same source and the program talks to the same DB-server. If anything, the Linux box is located on the same LAN as the DB, whereas the FreeBSD machine connects via VPN (so it should be a bit slower). But it is the wide inconsistency of the Linux results that bothers me, and I suspect the thread-coordination...
Here is what I have now:
MAIN THREAD WORKER
--------------------------------------------------------------------------
get new row
figure out, which worker it belongs to lock my mutex
lock the worker's mutex go into pthread_cond_wait
signal the worker extract the row's data
unlock the worker's mutex signal the main thread
go into pthread_cond_wait unlock the mutex
go on back to getting the next row go on to process the row's data
Is there a better way? Thanks!
If reading the next row must be serial anyway, why are you delegating this to the worker? As the main thread has to wait anyway, have the main thread do the extraction and have the hand-off occur as soon as the row has been sufficiently extracted that the master can proceed to the next row.
Other than that, you will need to provide code, as your description is incomplete, as would be any question of this nature submitted without code.
It looks like your problem is that you are calling pthread_cond_wait() without the mutex locked in the main thread. This means that there's a race-condition: if the worker thread wakes up, extracts the data and signals the condition before the parent executes pthread_cond_wait(), the wakeup will be lost.
What you should have is some shared state paired with the condition variable, like this:
Main Thread:
get_new_row();
worker = decide_worker();
pthread_mutex_lock(&mutex);
/* Signal worker that data is available */
flag[worker] = 1;
pthread_cond_signal(&cond);
/* Wait for worker to extract it */
while (flag[worker] == 1)
pthread_cond_wait(&cond, &mutex):
pthread_mutex_unlock(&mutex);
Worker Thread:
pthread_mutex_lock(&mutex);
/* Wait for data to be available */
while (flag[worker] == 0)
pthread_cond_wait(&cond, &mutex):
extract_row_data();
/* Signal main thread that extraction is complete */
flag[worker] = 0;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mutex);

pthread_create and EAGAIN

I got an EAGAIN when trying to spawn a thread using pthread_create. However, from what I've checked, the threads seem to have been terminated properly.
What determines the OS to give EAGAIN when trying to create a thread using pthread_create? Would it be possible that unclosed sockets/file handles play a part in causing this EAGAIN (i.e they share the same resource space)?
And lastly, is there any tool to check resource usage, or any functions that can be used to see how many pthread objects are active at the time?
Okay, found the answer. Even if pthread_exit or pthread_cancel is called, the parent process still need to call pthread_join to release the pthread ID, which will then become recyclable.
Putting a pthread_join(tid, NULL) in the end did the trick.
edit (was not waitpid, but rather pthread_join)
As a practical matter EAGAIN is almost always related to running out of memory for the process. Often this has to do with the stack size allocated for the thread which you can adjust with pthread_attr_setstacksize(). But there are process limits to how many threads you can run. You can query the hard and soft limits with getrlimit() using RLIMIT_NPROC as the first parameter.
There are quite a few questions here dedicated to keeping track of threads, their number, whether they are dead or alive, etc. Simply put, the easiest way to keep track of them is to do it yourself through some mechanism you code, which can be as simple as incrementing and decrementing a global counter (protected by a mutex) or something more elaborate.
Open sockets or other file descriptors shouldn't cause pthread_create() to fail. If you reached the maximum for descriptors you would have already failed before creating the new thread and the new thread would have already have had to be successfully created to open more of them and thus could not have failed with EAGAIN.
As per my observation if one of the parent process calls pthread_join(), and chilled processes are trying to release the thread by calling pthread_exit() or pthread_cancel() then system is not able to release that thread properly. In that case, if pthread_detach() is call immediately after successful call of pthread_create() then this problem has been solved. A snapshot is here -
err = pthread_create(&(receiveThread), NULL, &receiver, temp);
if (err != 0)
{
MyPrintf("\nCan't create thread Reason : %s\n ",(err==EAGAIN)?"EAGAUIN":(err==EINVAL)?"EINVAL":(err==EPERM)?"EPERM":"UNKNOWN");
free(temp);
}
else
{
threadnumber++;
MyPrintf("Count: %d Thread ID: %u\n",threadnumber,receiveThread);
pthread_detach(receiveThread);
}
Another potential cause: I was getting this problem (EAGAIN on pthread_create) because I had forgotten to call pthread_attr_init on the pthread_attr_t I was trying to initialize my thread with.

trouble reading from __global memory after atom_inc in OpenCL

OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a work around with the following code:
void barrier(__global uint* scratch) {
uint nThreads = get_global_size(0);
atom_inc(scratch);
/* this loop never terminates */
while(scratch[0] < nThreads) {
continue;
}
}
The idea is that each thread loops until all of them increment that one piece of memory.
However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.
Is the global memory being locally cached? What's going on here?
Found the problem: the order in which work groups are executed is implementation defined. This means that some threads might start only after others have finished.
In the code I gave, the work groups that are started first will loop forever waiting on the the others to hit the 'barrier'. And the work groups that would be started later won't ever start because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.

Resources