Do asynchronous controllers make sense in ASP.NET 4? - asp.net-mvc

ASP.NET 2 had 12 worker threads by default.
ASP.NET 4 now has 5,000. Do we still need async controllers?

Do we still need async controllers?
Yes. Async controllers are useful in situations where you have lengthy operations, such as network calls, and you don't want to monopolize worker threads for them. Just because there are 5,000 worker threads by default doesn't mean you should waste them - being a millionaire doesn't mean you give your money away.
Obviously, if you don't use async controllers correctly, they will do more harm than good.

MVC 4 / Dev11 makes async controllers more appealing than previous versions, and Web API adds to that by making it easy to create web services.
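For illustration, here is a minimal sketch of the kind of action this is about, assuming the ASP.NET MVC 4 / .NET 4.5 async support mentioned above; the controller name and URL are placeholders, not anything from the question:

    using System.Net;
    using System.Net.Http;
    using System.Threading.Tasks;
    using System.Web.Mvc;

    public class ReportsController : Controller
    {
        private static readonly HttpClient Client = new HttpClient();

        // Synchronous version: the worker thread sits blocked for the whole network call.
        public ActionResult IndexSync()
        {
            using (var wc = new WebClient())
            {
                string data = wc.DownloadString("http://example.com/slow-service");
                return Content(data);
            }
        }

        // Asynchronous version: the worker thread returns to the pool while the call is in flight.
        public async Task<ActionResult> Index()
        {
            string data = await Client.GetStringAsync("http://example.com/slow-service");
            return Content(data);
        }
    }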
Start of Levi's comments, copied here so they won't be missed (posted under @Darin Dimitrov's excellent answer):
Expanding on Darin's answer a bit - asynchronous I/O operations (which
is what AsyncController is intended for) operate using IOCP, not
ThreadPool threads. This is important, as each ThreadPool thread has
an associated 1 MB stack (plus other overhead), so if you're using
5000 ThreadPool threads, you're automatically losing 5 GB of memory
just due to overhead! IOCP continuations have nowhere near as much
overhead, so it's possible to juggle greater numbers of them at any
given time. ThreadPool threads are pooled and removed when no longer
needed - so you're only taking the hit for threads which are currently
active. But if you're doing a lot of concurrent CPU-bound work with
ThreadPool, you're very rapidly going to start hitting memory issues.
This is precisely one of the reasons the C# / VB teams released the
Async CTP a few months ago - to try to solve this issue.
Async for web services often makes sense - see Should my database calls be Asynchronous? Part II
For database applications, using async operations to reduce the number of blocked threads on the web server is almost always a complete waste of time. A small web server can easily handle far more simultaneous blocking requests than your database back end can process concurrently. Instead, make sure your service calls are cheap at the database, and limit the number of concurrently executing requests to a number that you have tested to work correctly and to maximize overall transaction throughput (a sketch of one way to cap concurrency follows below).
See Should my database calls be Asynchronous?
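As an illustration of the "limit the number of concurrently executing requests" advice above, here is a rough C# sketch using SemaphoreSlim; the limit of 10 and the query are placeholder values you would tune and replace for your own system:

    using System.Data.SqlClient;
    using System.Threading;
    using System.Threading.Tasks;

    public static class OrderService
    {
        // At most 10 requests may hit the database at once; the rest wait here cheaply.
        private static readonly SemaphoreSlim DbGate = new SemaphoreSlim(10);

        public static async Task<int> CountOrdersAsync(string connectionString)
        {
            await DbGate.WaitAsync();
            try
            {
                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand("SELECT COUNT(*) FROM Orders", conn))
                {
                    await conn.OpenAsync();
                    return (int)await cmd.ExecuteScalarAsync();
                }
            }
            finally
            {
                DbGate.Release();
            }
        }
    }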

4 or 5,000, it doesn't matter; it is just a setting. You can set it to 1 billion if you want, and that won't make your application more scalable. In the end your machine only has 4 cores (or 8, or 2, but not 5,000). Always keep in mind that you can only ever have as many threads running at the same time as you have cores. Every thread in excess of your core count is just overhead: it creates more context switches, consumes CPU, and occupies more memory.
IO (database access, web services, file access...) does not take up any CPU. If you do it synchronously, it blocks a thread for the length of the operation. If you have a lengthy operation (5 seconds) and a load of 1,000 requests per second, you will be permanently blocking 5,000 threads. So you are already starving the thread pool (with a setting of 5,000), and worse, you will be thrashing your machine with context switches. If you do it asynchronously, no thread is blocked, no resource is taken, and there is no limit on the number of concurrent IO operations you can have in flight.
Adding more threads to the thread pool is a quick and dirty hack for when you can't afford to rewrite your application using asynchronous IO. It is not a clean solution.
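To make the "no thread is blocked" point concrete, here is a tiny console sketch (modern C# syntax for brevity; Task.Delay stands in for any lengthy IO). The thread IDs printed before and after the await show that the calling thread is released during the wait and the continuation may resume on a different pool thread:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static async Task Main()
        {
            Console.WriteLine($"Before IO, thread {Thread.CurrentThread.ManagedThreadId}");
            await Task.Delay(5000);   // simulated 5-second IO; no thread is held while waiting
            Console.WriteLine($"After IO,  thread {Thread.CurrentThread.ManagedThreadId}");
        }
    }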

Related

Grand Central Dispatch: What happens when queues get overloaded?

Pretty simple question that I haven't found answered anywhere in the documentation or tutorials on GCD: what happens if I'm submitting work to queues faster than it's being processed and removed? I'm aware that GCD queues have no size limit; would work just pile up until the program runs out of memory? Is there any way to properly handle this situation?
What happens if I'm submitting work to queues faster than it's being processed and removed?
It depends.
If dispatching tasks to a single/shared serial queue, they will just be added to the queue and it will process them in a FIFO manner. No problem. Memory is your only constraint.
If dispatching tasks to a concurrent queue, though, you end up with “thread explosion”, and you will quickly exhaust the limited number of worker threads available for that quality-of-service (QoS). This can result in unpredictable behaviors should the OS need to avail itself of a queue of the same QoS. Thus, you must be very careful to avoid this thread explosion.
See the discussion of thread explosion in WWDC 2015's Building Responsive and Efficient Apps with GCD, and again in WWDC 2016's Concurrent Programming With GCD in Swift 3.
Is there any way to properly handle this situation?
It is hard to answer that in the abstract. Different situations call for different solutions.
In the case of thread explosion, the solution is to constrain the degree of concurrency, either with concurrentPerform (which limits the concurrency to the number of cores on your device) or with operation queues and their maxConcurrentOperationCount set to something reasonable. There are other patterns, too, but the idea is to constrain concurrency to something suitable for the device in question.
But if you're just dispatching a large number of tasks to a serial queue, there's not much you can do (other than looking for parallelism opportunities to make efficient use of all of the CPU's cores). But that's OK, as that is the whole purpose of a queue: to perform tasks in the order they were submitted, even if it can't keep up. It wouldn't be a "queue" if it didn't follow this FIFO pattern.
Now, if you're dealing with real-time data that cannot be processed quickly enough, you have a different problem. In that case, you might want to decouple the capture of the input from the processing and decide how you want to handle the backlog. If you can't keep up with real-time processing of video, for example, you have a choice: either start dropping frames or process the data asynchronously/later. You just have to decide what is right for your use case; we cannot answer that in the abstract.

Is there a way to limit the number of threads spawned by GCD in my application?

I know from the response to this question that the maximum number of threads spawned cannot exceed 66. But is there a way to limit the thread count to a value which a user has defined?
From my experience and work with GCD under various circumstances, I believe this is not possible.
That said, it is very important to understand that with GCD you create queues, not threads. Whenever your code asks for a queue, the GCD subsystem checks OS conditions and looks for available resources; new threads are then created under the hood based on those conditions, in an order and with resources that you do not control. This is clearly explained in the official documentation:
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Source: Dispatch Queues
There is no way to control resource consumption with GCD, such as by setting some kind of threshold. GCD is a high-level abstraction over low-level things such as threads, and it manages them for you.
The only way you can influence how many resources a particular task within your application should take is by setting its QoS (Quality of Service) class (formerly known simply as priority, now extended into a more complex concept). In brief, you can classify tasks within your application based on their importance, which helps GCD and your application be more resource- and battery-efficient. Its use is highly encouraged in complex applications with heavy concurrency.
Even so, this kind of regulation from the developer's end has its limits and ultimately does not address the goal of controlling thread creation:
Apps and operations compete to use finite resources—CPU, memory,
network interfaces, and so on. In order to remain responsive and
efficient, the system needs to prioritize tasks and make intelligent
decisions about when to execute them.
Work that directly impacts the user, such as UI updates, is extremely
important and takes precedence over other work that may be occurring
in the background. This higher priority work often uses more energy,
as it may require substantial and immediate access to system
resources.
As a developer, you can help the system prioritize more effectively by
categorizing your app’s work, based on importance. Even if you’ve
implemented other efficiency measures, such as deferring work until an
optimal time, the system still needs to perform some level of
prioritization. Therefore, it is still important to categorize the
work your app performs.
Source: Prioritize Work with Quality of Service Classes
To conclude, if you are deliberate in your intent to control threads, don't use GCD. Use low-level programming techniques and manage them yourself. If you use GCD, then you agree to leave this kind of responsibility to GCD.

Why does Firemonkey application use no more than 20% of CPU?

I have a large binary file (approximately 700 MB) which I load into a TMemoryStream. After that I perform the reading with TMemoryStream.Read() and make some simple calculations, but the application never uses more than 20% of the CPU. My PC has an i7 processor.
Is there any way to increase the CPU usage and speed up the reading process without using threads?
As far as I know, the only way to utilise the power of multiple CPU cores with Delphi is to use threads.
If you do choose to use threads in your application, there are a couple libraries that may ease development. How Do I Choose Between the Various Ways to do Threading in Delphi?
Adding on to Shannon's answer: on an i7 processor with multiple cores, one thread will only be utilizing one core; a single thread cannot run on more than one processor core at a time. Therefore, if you wish to utilize multiple cores, you need to create multiple threads to handle the various tasks. Creating a thread isn't necessarily as simple as saying "do this in that thread"; there's a lot to know about multi-threading. For example, your application has one main GUI thread, then one thread might be dedicated to performing a long calculation, another thread might be updating a caption with real-time data, and so on.
Windows automatically decides which core to assign a thread to, and usually divides them up fairly. So, if you have 8 processor cores and 16 threads, each core would (presumably) get 2 threads, and since the cores run independently of each other, more than one thread can literally be running at the same time (as opposed to a single core, which divides its time slices between threads).
So to answer your question: if you had enough busy threads to keep every core occupied, you would see close to 100% processor usage.

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
1. Plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather, the total memory allocated by all threads) before starting a new item. Only start when more than 2 GB of memory are left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar at the process level, after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (e.g. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
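The question doesn't say which platform this is, but to make option 2 concrete in the form the asker hints at (tracking the workers' own reservations rather than querying the OS for free memory, which as noted above is unreliable), here is a rough C# sketch; the budget figure and the EstimateBytes/Process helpers are hypothetical:

    using System;
    using System.Threading;

    public class MemoryBudget
    {
        private readonly long _budgetBytes;
        private long _reservedBytes;
        private readonly object _gate = new object();

        public MemoryBudget(long budgetBytes) { _budgetBytes = budgetBytes; }

        // Block until the estimated cost of the next work item fits under the budget.
        public void Reserve(long estimatedBytes)
        {
            lock (_gate)
            {
                while (_reservedBytes + estimatedBytes > _budgetBytes)
                    Monitor.Wait(_gate);
                _reservedBytes += estimatedBytes;
            }
        }

        public void Release(long estimatedBytes)
        {
            lock (_gate)
            {
                _reservedBytes -= estimatedBytes;
                Monitor.PulseAll(_gate);   // wake waiting workers so they re-check the budget
            }
        }
    }

    // Usage inside each worker thread, assuming a 6 GB budget and hypothetical helpers:
    //   budget.Reserve(item.EstimateBytes());
    //   try { Process(item); } finally { budget.Release(item.EstimateBytes()); }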
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory (a sketch of this chunked approach follows below).
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
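To illustrate the "smaller sections" suggestion above, here is a rough C# sketch that streams a work item from disk in fixed-size chunks instead of holding it all in memory; the 4 MB chunk size and the file-based source are assumptions, not something from the question:

    using System;
    using System.IO;

    public static class ChunkedProcessor
    {
        public static void Process(string path, Action<byte[], int> handleChunk)
        {
            var buffer = new byte[4 * 1024 * 1024];   // 4 MB working buffer, reused per chunk
            using (var stream = File.OpenRead(path))
            {
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    handleChunk(buffer, read);        // only one chunk is resident at a time
            }
        }
    }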

What's the idiomatic way to do async socket programming in Delphi?

What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?
Here's my prior research into this question:
The Indy components seem entirely synchronous. On the other hand, while the ScktComp unit does use WSAAsyncSelect, it basically only asynchronizes a BSD-style multiplexed socket app: you get dumped into a single event callback, as if you had just returned from select() in a loop, and have to do all the state machine navigation yourself.
The .NET situation is considerably nicer, with Socket.BeginRead / Socket.EndRead, where the continuation is passed directly to Socket.BeginRead, and that's where you pick back up. A continuation coded as a closure obviously has all the context you need, and more.
I have found that Indy, while a simpler concept in the beginning, is awkward to manage due to the need to kill sockets to free threads at application termination. In addition, I had the Indy library stop working after an OS patch upgrade. ScktComp works well for my application.
@Roddy - Synchronous sockets are not what I'm after. Burning a whole thread for the sake of a possibly long-lived connection means you limit the number of concurrent connections to the number of threads that your process can contain. Since threads use a lot of resources - reserved stack address space, committed stack memory, and kernel transitions for context switches - they do not scale when you need to support hundreds of connections, much less thousands or more.
What is the normal way people writing
network code in Delphi use
Windows-style overlapped asynchronous
socket I/O?
Well, Indy has been the 'standard' library for socket I/O for a long while now - and it's based on blocking sockets. This means if you want asynchronous behaviour, you use additional thread(s) to connect/read/write data. To my mind this is actually a major advantage, as there's no need to manage any kind of state machine navigation, or worry about callback procs or similar stuff. I find the logic of my 'reading' thread is less cluttered and much more portable than non-blocking sockets would allow.
Indy 9 has been mostly bombproof, fast and reliable for us. However the move to Indy 10 for Tiburon is causing me a little concern.
@Mike: "...the need to kill sockets to free threads...".
This made me go "huh?" until I remembered our threading library uses an exception-based technique to kill 'waiting' threads safely. We call QueueUserAPC to queue a function which raises a C++ exception (NOT derived from class Exception) which should only be caught by our thread wrapper procedure. All destructors get called, so the threads all terminate cleanly and tidy up on the way out.
"Synchronous sockets are not what I'm after."
Understood - but I think in that case the answer to your original question is that there just isn't a Delphi idiom for async socket IO because it's actually a highly specialized and uncommon requirement.
As a side issue, you might find these links interesting. They're both a little old, and more *nix-y than Windows. The second one implies that - in the right environment - threads might not be as bad as you think.
The C10K problem
Why Events Are A Bad Idea (for High-concurrency Servers)
@Chris Miller - What you've stated in your answer is factually inaccurate.
Windows message-style async, as available through WSAAsyncSelect, is indeed largely a workaround for lack of a proper threading model in Win 3.x days.
.NET Begin/End, however, is not using extra threads. Instead, it is using overlapped I/O, using the extra argument on WSASend / WSARecv, specifically the overlapped completion routine, to specify the continuation.
This means that the .NET style harnesses the Windows OS's async I/O support to avoid burning a thread by blocking on a socket.
Since threads are generally speaking expensive (unless you specify a very small stack size to CreateThread), having threads blocking on sockets will stop you from scaling to 10,000s of concurrent connections.
This is why it's important that async I/O be used if you want to scale, and also why .NET is not, I repeat, is not, simply "using threads, [...] just managed by the Framework".
@Roddy - I've already read the links you point to; they are both referenced from Paul Tyma's presentation "Thousands of Threads and Blocking I/O - The old way to write Java Servers is New again".
Some of the things that don't necessarily jump out from Paul's presentation, however, are that he specified -Xss:48k to the JVM on startup, and that he's assuming that the JVM's NIO implementation is efficient in order for it to be a valid comparison.
Indy does not specify a similarly shrunken and tightly constrained stack size. There are no calls to BeginThread (the Delphi RTL thread creation routine, which you should use for such situations) or CreateThread (the raw WinAPI call) in the Indy codebase.
The default stack size is stored in the PE header, and for the Delphi compiler it defaults to 1 MB of reserved address space (space is committed page by page by the OS in 4K chunks; in fact, the compiler needs to generate code to touch pages if there are more than 4K of locals in a function, because the extension is controlled by page faults, but only for the lowest (guard) page in the stack). That means you're going to run out of address space after at most about 2,000 concurrent threads handling connections (2 GB of user-mode address space in a 32-bit process divided by 1 MB of reserved stack per thread).
Now, you can change the default stack size in the PE using the {$M minStackSize [,maxStackSize]} directive, but that will affect all threads, including the main thread. I hope you don't do much recursion, because 48K (or similar) isn't a lot of space.
Now, whether Paul is right about non-performance of async I/O for Windows in particular, I'm not 100% sure - I'd have to measure it to be certain. What I do know, however, is that arguments about threaded programming being easier than async event-based programming, are presenting a false dichotomy.
Async code doesn't need to be event-based; it can be continuation-based, like it is in .NET, and if you specify a closure as your continuation, you get state maintained for you for free. Moreover, conversion from linear thread-style code to continuation-passing-style async code can be made mechanical by a compiler (CPS transform is mechanical), so there need be no cost in code clarity either.
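To make the continuation-passing point concrete, here is a rough C# sketch using Socket.BeginReceive / EndReceive, where the continuation is a closure that carries its own state (socket, buffer, accumulated text) with no explicit state machine; it is illustrative only, not code from this discussion:

    using System;
    using System.Net.Sockets;
    using System.Text;

    public static class AsyncReader
    {
        public static void ReadAll(Socket socket, Action<string> onComplete)
        {
            var buffer = new byte[4096];
            var received = new StringBuilder();

            AsyncCallback onReceive = null;
            onReceive = ar =>
            {
                int bytes = socket.EndReceive(ar);
                if (bytes == 0)                       // peer closed the connection: we're done
                {
                    onComplete(received.ToString());
                    return;
                }
                received.Append(Encoding.UTF8.GetString(buffer, 0, bytes));
                // Re-arm the read; no thread sits blocked between completions.
                socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None, onReceive, null);
            };

            socket.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None, onReceive, null);
        }
    }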
There is a free IOCP (I/O completion port) socket component, with source code included: http://www.torry.net/authorsmore.php?id=7131
"By Naberegnyh Sergey N. High performance socket server based on Windows Completion Port and using Windows Socket Extensions. IPv6 supported."
I found it while looking for better components/libraries to re-architect my little instant messaging server. I haven't tried it yet, but the code looks well written at first impression.
For async stuff try ICS
http://www.overbyte.be/frame_index.html?redirTo=/products/ics.html
Indy uses synchronous sockets because it's a simpler way of programming. Asynchronous (message-based) sockets were something added to the WinSock stack back in the Windows 3.x days: Windows 3.x did not support threads, so without async notifications you couldn't do socket I/O without blocking the whole application. For some additional information about why Indy uses the blocking model, please see this article.
The .NET Socket.BeginRead/EndRead calls are using threads; it's just managed by the Framework instead of by you.
@Roddy, Indy 10 has been bundled with Delphi since Delphi 2006. I found migrating from Indy 9 to Indy 10 to be a straightforward task.
With the ScktComp classes, you need to use a ThreadBlocking server rather than a NonBlocking server type. Use the OnGetThread event to hand off the ClientSocket param to a new thread of your devising. Once you've instantiated an inherited instance of TServerClientThread, you'll create an instance of TWinSocketStream (inside the thread) which you can use to read and write to the socket. This method gets you away from trying to process data in the event handler. These threads could exist for just the short period needed to read or write, or hang around for the duration to be reused.
The subject of writing a socket server is fairly vast. There are many techniques and practices you could choose to implement. The method of reading and writing to the same socket within the TServerClientThread is straightforward and fine for simple applications. If you need a model for high availability and high concurrency then you need to look into patterns like the Proactor pattern.
Good luck!
