iOS select versus kqueue/kevent versus mach_wait_until scheduling

I have a thread that listens on a single UDP socket, but also needs to wake up once in a while to perform other tasks. These tasks are triggered by the passage of time, or by activity on other threads. My current design is to use the select() timeout value as a scheduling timer, and to write a packet to the socket's loopback address when I need to wake it from another thread.
However, Apple documentation says select() timeouts should not be used to wake up more than a few times per second. And, in practice, I find they may be delayed by 100 msec or more, whereas I would like 10-20 msec resolution. Are they just trying to discourage CPU-intensive polling, or is there something wrong with using select() per se? Is there a better approach?
Would it help to replace select() with kqueue/kevent? Or to create a dedicated scheduling thread that handles the timer with mach_wait_until(), and then writes to the socket to wake the net thread? Or to do all the work in the dedicated thread, and have the net thread queue incoming data to it?
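For what it's worth, the mach_wait_until() variant I have in mind boils down to something like this untested sketch (the helper name and millisecond parameter are mine); a dedicated scheduling thread could loop over it and write a byte to the loopback socket on each tick:

#include <mach/mach_time.h>
#include <stdint.h>

// Sleep until `ms` milliseconds from now on the Mach absolute-time clock.
static void sleep_ms_mach(uint64_t ms)
{
    static mach_timebase_info_data_t tb;
    if (tb.denom == 0)
        mach_timebase_info(&tb);                   // ratio of absolute ticks to nanoseconds
    uint64_t ns    = ms * 1000000ull;
    uint64_t ticks = ns * tb.denom / tb.numer;     // convert nanoseconds to ticks
    mach_wait_until(mach_absolute_time() + ticks);
}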

Something bugs me about this approach. Why do you have anything at all happening on the select() thread?
If you need a thread dedicated to waiting for incoming packets, then make that thread do nothing but wait:
while (1) {
    int numsockets = select(…);
    if (numsockets > 0) {
        // Read data (only drain the socket)
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
            // Process data
        });
    }
}
Then you can have your periodic tasks run using timer dispatch sources.
dispatch_source_t source = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0));
dispatch_source_set_event_handler(source, ^{
    // Periodic process data
});
uint64_t nsec = 0.001 * NSEC_PER_SEC; // 1 ms interval
dispatch_source_set_timer(source, dispatch_time(DISPATCH_TIME_NOW, nsec), nsec, 0);
dispatch_resume(source);
I don't know, but I've been told that you can even use a dispatch source to replace the select().
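For what it's worth, here is a rough sketch of that idea (mine, not guaranteed): a DISPATCH_SOURCE_TYPE_READ source fires whenever the descriptor is readable, so the select() loop goes away entirely. It assumes udp_fd is an already-bound, non-blocking UDP socket.

#include <dispatch/dispatch.h>
#include <sys/socket.h>

dispatch_source_t make_read_source(int udp_fd, dispatch_queue_t q)
{
    dispatch_source_t src = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, udp_fd, 0, q);
    dispatch_source_set_event_handler(src, ^{
        char buf[2048];
        ssize_t n;
        // Drain everything currently readable, then return until the source fires again.
        while ((n = recv(udp_fd, buf, sizeof(buf), 0)) > 0) {
            // hand the datagram off for processing here
        }
    });
    dispatch_resume(src);
    return src;
}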

Related

NumberOfConcurrentThreads parameter in CreateIoCompletionPort

I am still confused about the NumberOfConcurrentThreads parameter within CreateIoCompletionPort(). I have read and re-read the MSDN docs, but the quote
This value limits the number of runnable threads associated with the
completion port.
still puzzles me.
Question
Let's assume that I specify this value as 4. In this case, does this mean that:
1) a thread can call GetQueuedCompletionStatus() (at which point I can allow a further 3 threads to make this call), then as soon as that call returns (i.e. we have a completion packet) I can then have 4 threads again call this function,
or
2) a thread can call GetQueuedCompletionStatus() (at which point I can allow a further 3 threads to make this call), then as soon as that call returns (i.e. we have a completion packet) I then go on to process that packet. Only when I have finished processing the packet do I then call GetQueuedCompletionStatus(), at which point I can then have 4 threads again call this function.
See my confusion? It's the use of the phrase 'runnable threads'.
I think it might be the latter, because the link above also quotes
If your transaction required a lengthy computation, a larger
concurrency value will allow more threads to run. Each completion
packet may take longer to finish, but more completion packets will be
processed at the same time.
This will ultimately affect how we design servers. Consider a server that receives data from clients, then echoes that data to logging servers. Here is what our thread routine could look like:
DWORD WINAPI ServerWorkerThread(HANDLE hCompletionPort)
{
    DWORD BytesTransferred;
    CPerHandleData* PerHandleData = nullptr;
    CPerOperationData* PerIoData = nullptr;
    while (TRUE)
    {
        if (GetQueuedCompletionStatus(hCompletionPort, &BytesTransferred,
            (PULONG_PTR)&PerHandleData, (LPOVERLAPPED*)&PerIoData, INFINITE))
        {
            // OK, we have 'BytesTransferred' of data in 'PerIoData', process it:
            // send the data onto our logging servers, then loop back around
            send(...);
        }
    }
    return 0;
}
Now assume I have a four core machine; if I leave NumberOfConcurrentThreads as zero within my call to CreateIoCompletionPort() I will have four threads running ServerWorkerThread(). Fine.
My concern is that the send() call may take a long time due to network traffic. Hence, I could be receiving a load of data from clients that cannot be dequeued because all four threads are taking a long time sending the data on?!
Have I missed the point here?
Update 07.03.2018 (This has now been resolved: see this comment.)
I have 8 threads running on my machine, each one runs the ServerWorkerThread():
DWORD WINAPI ServerWorkerThread(HANDLE hCompletionPort)
{
    DWORD BytesTransferred;
    CPerHandleData* PerHandleData = nullptr;
    CPerOperationData* PerIoData = nullptr;
    while (TRUE)
    {
        if (GetQueuedCompletionStatus(hCompletionPort, &BytesTransferred,
            (PULONG_PTR)&PerHandleData, (LPOVERLAPPED*)&PerIoData, INFINITE))
        {
            switch (PerIoData->Operation)
            {
                case CPerOperationData::ACCEPT_COMPLETED:
                {
                    // This case is fired when a new connection is made
                    while (1) {}
                }
            }
        }
    }
    return 0;
}
I only have one outstanding AcceptEx() call; when that gets filled by a new connection I post another one. I don't wait for data to be received in AcceptEx().
I create my completion port as follows:
CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)
Now, because I only allow 4 threads into the completion port, and because I keep those threads busy (i.e. they never enter a wait state), I thought that when I try to make a fifth connection, the completion packet would not be dequeued and the connection would hang. However, this is not the case: I can make 5 or even 6 connections to my server! This appears to show that I can still dequeue packets even though my maximum allowed number of threads (4) is already running. This is why I am confused!
The completion port is really a KQUEUE object, and NumberOfConcurrentThreads corresponds to its MaximumCount:
Maximum number of concurrent threads the queue can satisfy waits for.
from I/O Completion Ports
When the total number of runnable threads associated with the
completion port reaches the concurrency value, the system blocks the
execution of any subsequent threads associated with that completion
port until the number of runnable threads drops below the concurrency
value.
That wording is imprecise. When a thread calls KeRemoveQueue (which GetQueuedCompletionStatus calls internally), the system hands it a packet only if Queue->CurrentCount < Queue->MaximumCount, even if packets are sitting in the queue. The system does not block any threads, of course. From the other side, look at KiInsertQueue: even if threads are waiting for packets, a waiter is only woken when Queue->CurrentCount < Queue->MaximumCount.
Also look at how and when Queue->CurrentCount changes: see KiActivateWaiterQueue (called when the current thread is about to enter a wait state) and KiUnlinkThread. In general, when a thread associated with the queue begins waiting on any object (or on another queue), the system calls KiActivateWaiterQueue, which decrements CurrentCount and may hand a packet to a waiting thread (if packets exist in the queue, CurrentCount has dropped below MaximumCount, and threads are waiting for packets). Conversely, when such a thread stops waiting, KiUnlinkThread is called and CurrentCount is incremented.
Both of your variants are wrong. Any number of threads can call GetQueuedCompletionStatus(), and the system certainly does not block the execution of subsequent threads. For example, take a queue with MaximumCount = 4: you can post 10 packets to it and call GetQueuedCompletionStatus() from 7 threads concurrently, but only 4 of them get packets; the others keep waiting even though 6 packets remain in the queue. If one of the threads that removed a packet begins to wait, the system simply wakes another thread waiting on the queue and hands it a packet. And if a thread that previously removed a packet from this queue (Thread->Queue == Queue, i.e. an "active" thread) calls KeRemoveQueue again, Queue->CurrentCount is decremented by 1.
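To see this behaviour directly, here is a small console sketch (mine, presented without warranty) of exactly the scenario described above: MaximumCount is 4, ten packets are posted, seven threads call GetQueuedCompletionStatus(), and only four dequeue a packet for as long as those four stay runnable.

#include <windows.h>
#include <stdio.h>

static HANDLE g_port;
static volatile LONG g_dequeued = 0;

static DWORD WINAPI Worker(LPVOID arg)
{
    DWORD bytes; ULONG_PTR key; LPOVERLAPPED ov;
    (void)arg;
    if (GetQueuedCompletionStatus(g_port, &bytes, &key, &ov, INFINITE)) {
        InterlockedIncrement(&g_dequeued);
        for (;;) YieldProcessor();   // stay runnable forever, so CurrentCount never drops
    }
    return 0;
}

int main(void)
{
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4);  // MaximumCount = 4

    for (ULONG_PTR i = 0; i < 10; i++)                  // ten packets sit in the queue
        PostQueuedCompletionStatus(g_port, 0, i, NULL);

    for (int t = 0; t < 7; t++)                         // seven concurrent callers
        CreateThread(NULL, 0, Worker, NULL, 0, NULL);

    Sleep(2000);
    printf("packets dequeued: %ld\n", g_dequeued);      // expected: 4 -- three callers still wait
    return 0;
}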

Suspending serial queue

Today I tried the following code:
- (void)suspendTest {
    dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_CONCURRENT, QOS_CLASS_BACKGROUND, 0);
    dispatch_queue_t suspendableQueue = dispatch_queue_create("test", attr);
    for (int i = 0; i <= 10000; i++) {
        dispatch_async(suspendableQueue, ^{
            NSLog(@"%d", i);
        });
        if (i == 5000) {
            dispatch_suspend(suspendableQueue);
        }
    }
    dispatch_after(dispatch_time(DISPATCH_TIME_NOW, (int64_t)(6 * NSEC_PER_SEC)), dispatch_get_main_queue(), ^{
        NSLog(@"Show must go on!");
        dispatch_resume(suspendableQueue);
    });
}
The code starts 10001 tasks, but it should suspend the queue from running new tasks halfway through, resuming after 6 seconds. With the concurrent queue this works as expected: 5000 tasks execute, then the queue stops, and after 6 seconds it resumes.
But if I use a serial queue instead of a concurrent queue, the behaviour is not clear to me.
dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, QOS_CLASS_BACKGROUND, 0);
In this case a random number of tasks manage to execute before the suspension, but often this number is close to zero (the suspension happens before any task runs).
The question is: why does suspending work differently for serial and concurrent queues, and how do I suspend a serial queue properly?
As per its name, the serial queue performs the tasks in series, i.e., only starting the next one after the previous one has completed. The priority class is background, so it may not even have started the first task by the time the loop reaches the 5000th iteration and suspends the queue.
From the documentation of dispatch_suspend:
The suspension occurs after completion of any blocks running at the time of the call.
i.e., nowhere does it promise that asynchronously dispatched tasks on the queue would finish, only that any currently running task (block) will not be suspended part-way through. On a serial queue at most one task can be "currently running", whereas on a concurrent queue there is no specified upper limit. edit: And according to your test with a million tasks, it seems the concurrent queue maintains the conceptual abstraction that it is "completely concurrent", and thus considers all of them "currently running" even if they actually aren't.
To suspend it after the 5000th task, you could trigger this from the 5000th task itself. (Then you probably also want to start the resume-timer from the time it is suspended, otherwise it is theoretically possible it will never resume if the resume happened before it was suspended.)
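A sketch of what I mean, adapted from the snippet in the question (untested; I've used printf rather than NSLog to keep it plain C):

dispatch_queue_attr_t attr = dispatch_queue_attr_make_with_qos_class(DISPATCH_QUEUE_SERIAL, QOS_CLASS_BACKGROUND, 0);
dispatch_queue_t suspendableQueue = dispatch_queue_create("test", attr);
for (int i = 0; i <= 10000; i++) {
    dispatch_async(suspendableQueue, ^{
        printf("%d\n", i);
        if (i == 5000) {
            // Suspend from inside the 5000th task; it takes effect when this block returns.
            dispatch_suspend(suspendableQueue);
            // Start the resume timer only now, so the resume can never precede the suspend.
            dispatch_after(dispatch_time(DISPATCH_TIME_NOW, (int64_t)(6 * NSEC_PER_SEC)),
                           dispatch_get_main_queue(), ^{
                printf("Show must go on!\n");
                dispatch_resume(suspendableQueue);
            });
        }
    });
}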
I think the problem is that you are confusing suspend with barrier. suspend stops the queue dead now. barrier stops when everything in the queue before the barrier has executed. So if you put a barrier after the 5000th task, 5000 tasks will execute before we pause at the barrier on the serial queue.
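One concrete way to act on this with the custom queue from the question (my sketch, untested) is to put the dispatch_suspend inside a barrier block, so it only runs once every task submitted before it has finished:

if (i == 5000) {
    dispatch_barrier_async(suspendableQueue, ^{
        // Runs only after tasks 0...5000 have completed; the suspension
        // takes effect as soon as this barrier block returns.
        dispatch_suspend(suspendableQueue);
    });
}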

Is it safe to have a timer dispatch source scheduled every second?

I have this timer created using GCD. It will call a method every 1 second. Is it safe to have this timer alive the whole time, even in the background?
self.theTimer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, dispatch_get_main_queue());
dispatch_source_set_timer(self.theTimer, DISPATCH_TIME_NOW, (1.0) * NSEC_PER_SEC, 0.25 * NSEC_PER_SEC);
dispatch_source_set_event_handler(self.theTimer, ^{
[self awakeAndProcess];
});
// Start the timer
dispatch_resume(self.theTimer);
The method "awakeAndProcess" has a "consumer" behavior where it checks a data queue and tries to send an HTTP request. So it is constantly checking if there are messages to be sent
You are better off pausing the timer when the app goes to the background to conserve battery, because awakeAndProcess appears to make a network call. But if you are in the background, all your tasks are suspended anyway, so it shouldn't be a problem. When in the foreground, it's better to wait for the previous awakeAndProcess call to finish before you trigger the next one; otherwise you might end up with a lot of awakeAndProcess calls being batched together, and if awakeAndProcess is not reentrant that can cause havoc in your code.
You are better off suspending the timer when awakeAndProcess starts and calling resume only after awakeAndProcess is fully complete.
If you do this, then it's safe to use your approach.
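Roughly like this (a sketch only; awake_and_process below is a hypothetical completion-based stand-in for your method, and the caller is responsible for keeping the returned source alive and cancelling it later):

#include <dispatch/dispatch.h>

// Hypothetical worker: drains the outgoing queue and calls `done` exactly once
// when the HTTP request (if any) has completed.
extern void awake_and_process(void (^done)(void));

dispatch_source_t start_polling_timer(void)
{
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
                                                     dispatch_get_main_queue());
    dispatch_source_set_timer(timer, DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC,
                              (uint64_t)(0.25 * NSEC_PER_SEC));
    dispatch_source_set_event_handler(timer, ^{
        dispatch_suspend(timer);        // no further ticks while the work is in flight
        awake_and_process(^{
            dispatch_resume(timer);     // balances the suspend above
        });
    });
    dispatch_resume(timer);             // initial resume: sources are created suspended
    return timer;
}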

Workaround on the threads limit in Grand Central Dispatch?

With Grand Central Dispatch, one can easily perform a time-consuming task on a non-main thread, avoid blocking the main thread, and keep the UI responsive, simply by using dispatch_async to perform the task on a global concurrent queue.
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    // code
});
However, something that sounds too good to be true like this usually has a downside. After using this a lot in our iOS app project, we recently found that there is a 64-thread limit. Once we hit the limit, the app freezes/hangs. Pausing the app in Xcode shows that the main thread is held in semaphore_wait_trap.
Googling confirms that others are encountering this problem too, but so far no solution has been found:
Dispatch Thread Hard Limit Reached: 64 (too many dispatch threads
blocked in synchronous operations)
Another Stack Overflow question confirms that this problem occurs when using dispatch_sync and dispatch_barrier_async too.
Question:
As the Grand Central Dispatch have a 64 threads limit, is there any workaround for this?
Thanks in advance!
Well, if you're bound and determined, you can free yourself of the shackles of GCD, and go forth and slam right up against the OS per-process thread limit using pthreads, but the bottom line is this: if you're hitting the queue-width limit in GCD, you might want to consider reevaluating your concurrency approach.
At the extremes, there are two ways you can hit the limit:
You can have 64 threads blocked on some OS primitive via a blocking syscall. (I/O bound)
You can legitimately have 64 runnable tasks all ready to rock at the same time. (CPU bound)
If you're in situation #1, then the recommended approach is to use non-blocking I/O. In fact, GCD has a whole bunch of calls, introduced in 10.7/Lion IIRC, that facilitate asynchronous scheduling of I/O and improve thread re-use. If you use the GCD I/O mechanism, then those threads won't be tied up waiting on I/O, GCD will just queue up your blocks (or functions) when data becomes available on your file descriptor (or mach port). See the documentation for dispatch_io_create and friends.
In case it helps, here's a little example (presented without warranty) of a TCP echo server implemented using the GCD I/O mechanism:
in_port_t port = 10000;

void DieWithError(char *errorMessage);

// Returns a block you can call later to shut down the server -- caller owns block.
dispatch_block_t CreateCleanupBlockForLaunchedServer()
{
    // Create the socket
    int servSock = -1;
    if ((servSock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
        DieWithError("socket() failed");
    }

    // Bind the socket - if the port we want is in use, increment until we find one that isn't
    struct sockaddr_in echoServAddr;
    memset(&echoServAddr, 0, sizeof(echoServAddr));
    echoServAddr.sin_family = AF_INET;
    echoServAddr.sin_addr.s_addr = htonl(INADDR_ANY);
    do {
        printf("server attempting to bind to port %d\n", (int)port);
        echoServAddr.sin_port = htons(port);
    } while (bind(servSock, (struct sockaddr *) &echoServAddr, sizeof(echoServAddr)) < 0 && ++port);

    // Make the socket non-blocking
    if (fcntl(servSock, F_SETFL, O_NONBLOCK) < 0) {
        shutdown(servSock, SHUT_RDWR);
        close(servSock);
        DieWithError("fcntl() failed");
    }

    // Set up the dispatch source that will alert us to new incoming connections
    dispatch_queue_t q = dispatch_queue_create("server_queue", DISPATCH_QUEUE_CONCURRENT);
    dispatch_source_t acceptSource = dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, servSock, 0, q);
    dispatch_source_set_event_handler(acceptSource, ^{
        const unsigned long numPendingConnections = dispatch_source_get_data(acceptSource);
        for (unsigned long i = 0; i < numPendingConnections; i++) {
            int clntSock = -1;
            struct sockaddr_in echoClntAddr;
            unsigned int clntLen = sizeof(echoClntAddr);

            // Wait for a client to connect
            if ((clntSock = accept(servSock, (struct sockaddr *) &echoClntAddr, &clntLen)) >= 0)
            {
                printf("server sock: %d accepted\n", clntSock);
                dispatch_io_t channel = dispatch_io_create(DISPATCH_IO_STREAM, clntSock, q, ^(int error) {
                    if (error) {
                        fprintf(stderr, "Error: %s", strerror(error));
                    }
                    printf("server sock: %d closing\n", clntSock);
                    close(clntSock);
                });

                // Configure the channel...
                dispatch_io_set_low_water(channel, 1);
                dispatch_io_set_high_water(channel, SIZE_MAX);

                // Setup read handler
                dispatch_io_read(channel, 0, SIZE_MAX, q, ^(bool done, dispatch_data_t data, int error) {
                    BOOL close = NO;
                    if (error) {
                        fprintf(stderr, "Error: %s", strerror(error));
                        close = YES;
                    }
                    const size_t rxd = data ? dispatch_data_get_size(data) : 0;
                    if (rxd) {
                        // echo...
                        printf("server sock: %d received: %ld bytes\n", clntSock, (long)rxd);
                        // write it back out; echo!
                        dispatch_io_write(channel, 0, data, q, ^(bool done, dispatch_data_t data, int error) {});
                    }
                    else {
                        close = YES;
                    }

                    if (close) {
                        dispatch_io_close(channel, DISPATCH_IO_STOP);
                        dispatch_release(channel);
                    }
                });
            }
            else {
                printf("accept() failed;\n");
            }
        }
    });

    // Resume the source so we're ready to accept once we listen()
    dispatch_resume(acceptSource);

    // Listen() on the socket
    if (listen(servSock, SOMAXCONN) < 0) {
        shutdown(servSock, SHUT_RDWR);
        close(servSock);
        DieWithError("listen() failed");
    }

    // Make cleanup block for the server queue
    dispatch_block_t cleanupBlock = ^{
        dispatch_async(q, ^{
            shutdown(servSock, SHUT_RDWR);
            close(servSock);
            dispatch_release(acceptSource);
            dispatch_release(q);
        });
    };

    return Block_copy(cleanupBlock);
}
Anyway... back to the topic at hand:
If you're in situation #2, you should ask yourself, "Am I really gaining anything through this approach?" Let's say you have the most studly MacPro out there -- 12 cores, 24 hyperthreaded/virtual cores. With 64 threads, you've got an approx. 3:1 thread to virtual core ratio. Context switches and cache misses aren't free. Remember, we presumed that you weren't I/O bound for this scenario, so all you're doing by having more tasks than cores is wasting CPU time with context switches and cache thrash.
In reality, if your application is hanging because you've hit the queue-width limit, then the most likely scenario is that you've starved your queue. You've likely created a dependency that reduces to a deadlock. The case I've seen most often is when multiple, interlocked threads try to dispatch_sync on the same queue when there are no threads left. That always fails.
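If you want to see this failure mode for yourself, here's a deliberately broken little sketch (illustration only, presented without warranty) that wedges the pool the way the question describes: every worker thread ends up blocked inside dispatch_sync waiting on a parked serial queue, and the one block that could unpark it can never get a thread.

#include <dispatch/dispatch.h>

int main(void)
{
    dispatch_queue_t global = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_queue_t serial = dispatch_queue_create("bottleneck", DISPATCH_QUEUE_SERIAL);
    dispatch_semaphore_t gate = dispatch_semaphore_create(0);

    // Park the serial queue on a semaphore that only a later block will signal.
    dispatch_async(serial, ^{ dispatch_semaphore_wait(gate, DISPATCH_TIME_FOREVER); });

    // Each of these ties up a worker thread blocked in dispatch_sync. Around the
    // 64th one, GCD stops creating new workers for this priority band.
    for (int i = 0; i < 100; i++) {
        dispatch_async(global, ^{
            dispatch_sync(serial, ^{ /* unreachable until the gate opens */ });
        });
    }

    // This block would open the gate, but by now there is no thread left to run it.
    dispatch_async(global, ^{ dispatch_semaphore_signal(gate); });

    dispatch_main();   // the process just sits here, wedged
}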
Here's why: Queue width is an implementation detail. The 64 thread width limit of GCD is undocumented because a well-designed concurrency architecture shouldn't depend on the queue width. You should always design your concurrency architecture such that a 2 thread wide queue would eventually finish the job to the same result (if slower) as a 1000 thread wide queue. If you don't, there will always be a chance that your queue will get starved. Dividing your workload into parallelizable units should be opening yourself to the possibility of optimization, not a requirement for basic functioning. One way to enforce this discipline during development is to try working with a serial queue in places where you use concurrent queues, but expect non-interlocked behavior. Performing checks like this will help you catch some (but not all) of these bugs earlier.
Also, to the precise point of your original question: IIUC, the 64 thread limit is 64 threads per top-level concurrent queue, so if you really feel the need, you can use all three top level concurrent queues (Default, High and Low priority) to achieve more than 64 threads total. Please don't do this though. Fix your design such that it doesn't starve itself instead. You'll be happier. And anyway, as I hinted above, if you're starving out a 64 thread wide queue, you'll probably eventually just fill all three top level queues and/or run into the per-process thread limit and starve yourself that way too.

pthread conditional variable

I'm implementing a thread with a queue of tasks. As soon as the first task is added to the queue, the thread starts running it.
Should I use pthread condition variable to wake up the thread or there is more appropriate mechanism?
If I call pthread_cond_signal() when the other thread is not blocked by pthread_cond_wait() but rather doing something, what happens? Will the signal be lost?
Semaphores are good if and only if your queue is already thread-safe. Also, some semaphore implementations may be limited by a maximum counter value, even though it is unlikely you would ever overrun it.
The simplest correct way to do this is the following:
pthread_mutex_t queue_lock;
pthread_cond_t not_empty;
queue_t queue;

push()
{
    pthread_mutex_lock(&queue_lock);
    queue.insert(new_job);
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&queue_lock);
}

pop()
{
    pthread_mutex_lock(&queue_lock);
    if (queue.empty())
        pthread_cond_wait(&not_empty, &queue_lock);
    job = queue.pop();
    pthread_mutex_unlock(&queue_lock);
}
From the pthread_cond_signal Manual:
The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond.
I suggest you use Semaphores. Basically, each time a task is inserted in the queue, you "up" the semaphore. The worker thread blocks on the semaphore by "down"'ing it. Since it will be "up"'ed one time for each task, the worker thread will go on as long as there are tasks in the queue. When the queue is empty the semaphore is at 0, and the worker thread blocks until a new task arrives. Semaphores also easily handle the case when more than 1 task arrived while the worker was busy. Notice that you still have to lock access to the queue to keep inserts/removes atomic.
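A minimal sketch of that (mine; queue_t, job_t, queue_insert and queue_remove are hypothetical placeholders for whatever container you use). Note that unnamed POSIX semaphores are not available on macOS, where a dispatch_semaphore_t or a named semaphore fills the same role:

#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t items;        // counts queued tasks; call sem_init(&items, 0, 0) at startup
static queue_t queue;      // hypothetical thread-unsafe container

void push(job_t *new_job)
{
    pthread_mutex_lock(&queue_lock);
    queue_insert(&queue, new_job);     // hypothetical
    pthread_mutex_unlock(&queue_lock);
    sem_post(&items);                  // one "up" per task
}

job_t *pop(void)
{
    sem_wait(&items);                  // blocks while there is nothing to do
    pthread_mutex_lock(&queue_lock);
    job_t *job = queue_remove(&queue); // hypothetical
    pthread_mutex_unlock(&queue_lock);
    return job;
}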
The signal will be lost, but you want the signal to be lost in that case. If there is no thread to wakeup, the signal serves no purpose. (If nobody is waiting for something, nobody needs to be notified when it happens, right?)
With condition variables, lost signals cannot cause a thread to "sleep through a fire". Unless you actually code a thread to go to sleep when there's already a fire, there is no need to "save a signal". When the fire starts, your broadcast will wake up any sleeping threads. And you would have to be pretty daft to code a thread to go to sleep when there's already a fire.
As already suggested, semaphores should be the best choice. If you need a fixed-size queue just use 2 semaphores (as in classical producer-consumer).
In artyom's code, it would be better to replace "if" with "while" in the pop() function, to handle spurious wakeups.
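Something like this, using the same hypothetical queue as in that snippet:

pop()
{
    pthread_mutex_lock(&queue_lock);
    while (queue.empty())                          // re-check the predicate after every wakeup
        pthread_cond_wait(&not_empty, &queue_lock);
    job = queue.pop();
    pthread_mutex_unlock(&queue_lock);
}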
No effect.
If you check how pthread_cond_signal is implemented, the cond uses several counters to check whether there are any waiting threads to wake up, e.g. in glibc's NPTL:
/* Are there any waiters to be woken? */
if (cond->__data.__total_seq > cond->__data.__wakeup_seq) {
    ...
}
