Shadow registers in interrupts - MIPS64

I have a MIPS64 CPU and I am implementing the context-saving code as part of an OS port.
I have a question about MIPS shadow registers and software interrupts.
I know that in a hardware interrupt routine I can avoid saving/restoring the registers in the prolog and epilog by programming the appropriate shadow register set - that is exactly what shadow registers are for.
But in the case of a software interrupt routine, is the mechanism the same?
Does anybody know whether shadow registers also exist for software interrupts?
Thanks!

Related

Should a CUDA stream be waited on for completion even if the output data is sent to OpenGL instead of the CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that effectively grabs an image from a webcam (yielding a host-memory cv::Mat), uploads it to GPU device memory with CUDA (i.e. a cv::GpuMat), does some processing using CUDA to produce a result finalCudaMat, and finally sends the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) so the CUDA calls are asynchronous and the CPU does not wait for every single operation to finish. Now, if the result were instead eventually copied to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in the customary situation, a wait on the stream (cudaStream.waitForCompletion()) would typically be required to ensure the result is ready before using the CPU-side matrix.
In my case, the result never gets back to the CPU, as it just keeps being rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One option is to wait for the CUDA work to finish before starting to copy the GpuMat to the OpenGL buffer. If I add the wait on the stream, everything works fine and the CUDA operations take ~2.5 ms.
The other option is to not wait for the stream to complete (all the results are consumed by the GPU anyway; the CPU is never involved again). So I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to work fine. The whole CUDA processing operation (basically any GPU task minus the OpenGL-related work) then takes ~1.8 ms for me.
In the past I have had bad experiences with not properly synchronizing GPU work when two different APIs were involved (e.g. do something in Direct3D 9, don't wait for it to finish, then copy the resulting texture to a Direct3D 10 texture; on some frames the image clearly comes out empty or torn).
At this point the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing the correct thing by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
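For concreteness, here is a minimal sketch of the two variants described above. It is only an illustration under stated assumptions: OpenCV built with CUDA and OpenGL support, the cv::ogl::Buffer already created with matching size/type, and a simple cv::cuda::flip standing in for the real processing; renderOneFrame is a made-up name.

#include <opencv2/core/opengl.hpp>
#include <opencv2/cudaarithm.hpp>

void renderOneFrame(const cv::Mat& frame, cv::ogl::Buffer& oglBuffer, cv::cuda::Stream& cudaStream)
{
    cv::cuda::GpuMat gpuFrame, finalCudaMat;
    gpuFrame.upload(frame, cudaStream);                      // async host-to-device copy on the stream
    cv::cuda::flip(gpuFrame, finalCudaMat, 1, cudaStream);   // stand-in for the real CUDA processing

    // Variant 1: CPU-side wait before handing the result to OpenGL:
    // cudaStream.waitForCompletion();

    // Variant 2: no CPU wait; everything stays on the stream:
    cv::cuda::GpuMat mappedOglBuffer = oglBuffer.mapDevice(cudaStream);
    finalCudaMat.copyTo(mappedOglBuffer, cudaStream);
    oglBuffer.unmapDevice(cudaStream);                       // hand the buffer back to OpenGL
    // ... subsequent OpenGL rendering uses oglBuffer ...
}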
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular, it says:
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
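As a rough sketch of that ordering at the raw CUDA runtime level (assuming the OpenGL buffer object was registered once with cudaGraphicsGLRegisterBuffer; the function name is illustrative and error handling is omitted):

#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// 'resource' is the cudaGraphicsResource_t obtained from a one-time
// cudaGraphicsGLRegisterBuffer() call on the OpenGL buffer object.
void copyResultToGlBuffer(cudaGraphicsResource_t resource, cudaStream_t stream,
                          const void* d_result, size_t resultBytes)
{
    cudaGraphicsMapResources(1, &resource, stream);          // GL -> CUDA handoff

    void*  devPtr = nullptr;
    size_t mappedBytes = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &mappedBytes, resource);

    // Device-to-device copy issued on the same stream as the preceding CUDA work.
    cudaMemcpyAsync(devPtr, d_result, resultBytes, cudaMemcpyDeviceToDevice, stream);

    // CUDA -> GL handoff: all CUDA work issued on 'stream' before this call is
    // guaranteed to complete before subsequently issued OpenGL work touches the
    // buffer, so no CPU-side cudaStreamSynchronize() is needed here.
    cudaGraphicsUnmapResources(1, &resource, stream);
}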
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). This used to be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However, with the advent of Vulkan and with it the support for external resources (and the OpenGL extensions for that), you can in fact synchronize between CUDA and OpenGL command streams by having both sides import platform-native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU-side driver, but since that part has to manage the command streams anyway, it's not really an efficiency issue.
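As a rough sketch of that approach (assuming a binary semaphore created by Vulkan and exported as an opaque fd; the Vulkan creation/export step, GL extension loading and all error handling are elided, and the function names are illustrative):

#include <cuda_runtime.h>
#include <GL/glew.h>   // assumed GL loader exposing the GL_EXT_semaphore(_fd) entry points

// The fds are two dup()'ed opaque file descriptors exported from one
// Vulkan-created semaphore (each import consumes its fd).
cudaExternalSemaphore_t importSemaphoreIntoCuda(int fdForCuda)
{
    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    desc.handle.fd = fdForCuda;
    cudaExternalSemaphore_t extSem = nullptr;
    cudaImportExternalSemaphore(&extSem, &desc);
    return extSem;
}

GLuint importSemaphoreIntoGl(int fdForGl)
{
    GLuint sem = 0;
    glGenSemaphoresEXT(1, &sem);
    glImportSemaphoreFdEXT(sem, GL_HANDLE_TYPE_OPAQUE_FD_EXT, fdForGl);
    return sem;
}

// Per frame: CUDA signals the semaphore after its work, OpenGL waits on it
// before touching the shared buffer - both operations stay on the GPU timeline.
void handOffToGl(cudaExternalSemaphore_t extSem, GLuint glSem, GLuint glBuffer,
                 cudaStream_t stream)
{
    cudaExternalSemaphoreSignalParams params = {};
    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);

    glWaitSemaphoreEXT(glSem, 1, &glBuffer, 0, nullptr, nullptr);
    // ... subsequent GL draws using glBuffer are ordered after the CUDA work ...
}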

Enforce use of independent flip mode with DXGI FLIP SwapChain

I currently face a problem with DXGI Swapchains (DirectX 11). My C++ application shows (live) video and my goal is to minimize latency. I have no user input to process.
In order to decrease latency I switched to a DXGI_SWAP_EFFECT_FLIP_DISCARD swapchain (I used BitBlt before - see "For best performance, use DXGI flip model" for further details). I use the following flags:
//Swapchain Init:
sc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
sc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT | DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
//Present:
m_pDXGISwapChain->Present(0, DXGI_PRESENT_ALLOW_TEARING);
On one computer the swapchain (windowed) goes into the "Hardware: Independent Flip" mode and I have perfectly low latency, as long as I have no other windows in front. On all other computers I tried, I am stuck in the "Composed: Flip" mode with higher latency (~ 25-30 ms more). The software (binary) is exactly the same. I am checking this with PresentMon.
What I find interesting is that on the computer where independent flip works, it is also active without the ALLOW_TEARING flags - from what I understood, they should be required for it. Incidentally, I also see tearing in this case, but that is a different problem.
I already tried comparing Windows 10 versions, graphics drivers and driver settings. The GPU is a Quadro RTX 4000 on all systems. I couldn't spot any difference between the systems.
I would really appreciate any hints on additional preconditions for the independent flip mode I might have missed in the docs. Thanks for your help!
Update1: I switched the "working" system from Nvidia driver 511.09 to 473.47 (latest stable). After that I got the same behavior as on the other systems (no independent flip). After going back to 511.09 it worked again. So the driver seems to have an influence. The other systems also had 511.09 for my original tests, though.
Update2: After dealing with all DirectX debug outputs, it still does not work as desired. I manage to get into independent flip mode only in real full screen mode or in windowed mode where the window has no decorations and actually covers the whole screen. Unfortunately, using the Graphics Tools for VS I never enter the independent flip and cannot do further analysis here. But it is interesting that when using the Graphics Tools debug, PresentMon shows Composed Flip, but the Graphics Analyzer from the Graphics Tools shows only DISCARD as SwapEffect for the SwapChain. I would have expected FLIP_DISCARD as I explicitly used DXGI_SWAP_EFFECT_FLIP_DISCARD.
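(Not an answer to the independent-flip question itself, but since the stated goal is minimal latency, here is a minimal sketch of how the FRAME_LATENCY_WAITABLE_OBJECT flag is usually paired with the swapchain's waitable handle. swapChain, the render step and the error handling are placeholders.)

#include <windows.h>
#include <dxgi1_5.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// swapChain was created with DXGI_SWAP_EFFECT_FLIP_DISCARD and the
// FRAME_LATENCY_WAITABLE_OBJECT | ALLOW_TEARING flags, as in the question.
void RenderLoop(ComPtr<IDXGISwapChain1> swapChain)
{
    ComPtr<IDXGISwapChain2> swapChain2;
    swapChain.As(&swapChain2);
    swapChain2->SetMaximumFrameLatency(1);      // keep at most one frame queued
    HANDLE frameLatencyWaitable = swapChain2->GetFrameLatencyWaitableObject();

    for (;;)
    {
        // Block until the swapchain is ready for a new frame; this is what keeps
        // the present queue short and the latency low.
        WaitForSingleObjectEx(frameLatencyWaitable, 1000, TRUE);

        // ... render the current video frame here ...

        swapChain2->Present(0, DXGI_PRESENT_ALLOW_TEARING);
    }
    // CloseHandle(frameLatencyWaitable) when shutting down.
}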

Calculating the stack usage in an RTOS application

I am currently working on a project to develop an application on an STM32 microcontroller using an RTOS (Micrium).
Are there any tools to calculate the stack usage of a particular thread in an RTOS application?
No tools I know of. However, two simple methods to estimate stack usage have always worked for me.
Fill all RAM with a value like 0x55 or 0xAA. Let the program run long enough while exercising all of the device's options to get the most code-execution coverage. Then stop (under a debugger) and examine RAM to see where the above values have been overwritten. That should give you a good approximation. This works with or without an OS.
Modify the OS just a bit so that on each task switch you record, in some global array, the lowest stack pointer seen for each task (by comparing against the previous value for the same task). After running the app long enough, as in [1], examine the recorded values. Although there is no guarantee that a task switch happens exactly at the moment of maximum stack use for that task, statistically, after a long enough run and assuming preemptive switching, you will have recorded an accurate enough value.
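A minimal sketch of method 1 (stack painting / high-water marking) for a single task stack; the fill pattern, the names and the descending-stack assumption are illustrative, not Micrium API:

#include <stdint.h>
#include <stddef.h>

#define STACK_FILL 0xA5A5A5A5u

/* Call once, before the task is created, to paint its stack area. */
void stack_paint(uint32_t *stack_base, size_t stack_words)
{
    for (size_t i = 0; i < stack_words; ++i)
        stack_base[i] = STACK_FILL;
}

/* Call later (e.g. from a debugger or a monitor task) to see how much of the
 * paint was overwritten. Assumes a descending stack, so usage is measured from
 * the low end (stack_base) upwards. */
size_t stack_high_water_words(const uint32_t *stack_base, size_t stack_words)
{
    size_t untouched = 0;
    while (untouched < stack_words && stack_base[untouched] == STACK_FILL)
        ++untouched;
    return stack_words - untouched;   /* words that were actually used */
}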
If you are using GCC or Clang, the -fstack-usage compiler switch generates the stack frame size for each function. You need to combine that information with call-graph information generated by the linker to find the deepest stack usage starting from a specific function. Starting at main(), at each task entry point and at each ISR will then give you the worst-case usage for that thread.
Helpfully the work to create such a tool has been done for you as discussed here, using a Perl script from here.
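For illustration, this is roughly what -fstack-usage produces (the byte counts are target- and optimization-dependent, and the file and function names here are made up):

/* foo.c - compile with e.g.:  arm-none-eabi-gcc -c -fstack-usage foo.c
 * GCC then writes foo.su next to the object file, one line per function,
 * something like:
 *
 *   foo.c:6:5:leaf      16   static
 *   foo.c:8:5:worker    48   static
 *
 * ("dynamic" appears instead of "static" when the frame size is not a
 * compile-time constant, e.g. with alloca() or variable-length arrays) */
int leaf(int x) { return x * x; }

int worker(int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += leaf(i);
    return acc;
}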
ARM's armcc compiler v5 and earlier (v6 is clang/llvm) has this functionality built-in and can include detailed stack analysis in the link map, including the worst-case call path and warnings of non-deterministic stack usage (due to recursion or call-backs through function pointers for example). You may be using armcc if you are using Keil ARM MDK for example. Again for multi-threaded systems (tasks/ISRs) you need to look at the stack usage for the thread entry point.
Note also that on ARM Cortex-M the "system stack" is shared by the main() thread and all ISRs, and if you use ISR preemption priorities, multiple interrupts may be active simultaneously. So in theory the worst-case stack usage is the sum of the stack usage of main() and of all ISRs that may be active concurrently. While it is good practice to keep ISRs short and simple, beware of third-party code: ST's USB library, for example, runs the entire USB device stack in ISR context!

Using pthread_yield to return control of the CPU back to the kernel

Consider the following scenario:
In a POSIX system, some thread from a user program is running and the timer interrupt has been disabled.
To my understanding, unless it terminates, the currently running thread won't willingly give up control of the CPU.
My question is: will calling pthread_yield() from within the thread give the kernel control of the CPU?
Any help with this question would be greatly appreciated.
Turning off the operating system's timer interrupt would change it into a cooperative multitasking system. This is how Windows 1, 2 and 3 and Mac OS 9 worked. The running task only changed when the program made a system call.
Since pthread_yield results in a system call, yes, the kernel would get control back from the program.
If you are writing programs on a cooperative multitasking system, it is very important to not hog the CPU. If your program does hog the CPU the entire system comes to a halt.
This is why Windows MFC has the idle message in its message loop. Programs doing long-running tasks would do the work in that message handler, operating on one or two items at a time and then returning to the operating system to check whether the user had clicked on something.
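A minimal sketch of that "work in small chunks, then give the CPU back" pattern, using sched_yield() (the standardized call; glibc's pthread_yield() is a deprecated wrapper around it):

#include <sched.h>

/* On a cooperative (or timer-less) system, a long-running worker has to give
 * the kernel a chance to schedule something else between work items. */
void process_all_items(int n_items)
{
    for (int i = 0; i < n_items; ++i) {
        /* ... do one bounded chunk of work ... */

        /* System call: enters the kernel, which may switch to another ready
         * thread before returning here. */
        sched_yield();
    }
}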
A thread can also easily relinquish control by issuing system calls that perform blocking inter-thread communication or request I/O. The timer interrupt is not absolutely required for a multithreaded OS, though it is very useful for providing timeouts for such system calls and for helping out if the system is overloaded with ready threads.

What's the idiomatic way to do async socket programming in Delphi?

What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?
Here's my prior research into this question:
The Indy components seem entirely synchronous. On the other hand, while the ScktComp unit does use WSAAsyncSelect, it basically only asynchronizes a BSD-style multiplexed socket app. You get dumped into a single event callback, as if you had just returned from select() in a loop, and have to do all the state-machine navigation yourself.
The .NET situation is considerably nicer, with Socket.BeginRead / Socket.EndRead, where the continuation is passed directly to Socket.BeginRead, and that's where you pick back up. A continuation coded as a closure obviously has all the context you need, and more.
I have found that Indy, while a simpler concept in the beginning, is awkward to manage due to the need to kill sockets to free threads at application termination. In addition, I had the Indy library stop working after an OS patch upgrade. ScktComp works well for my application.
#Roddy - Synchronous sockets are not what I'm after. Burning a whole thread for the sake of a possibly long-lived connection means you limit the amount of concurrent connections to the number of threads that your process can contain. Since threads use a lot of resources - reserved stack address space, committed stack memory, and kernel transitions for context switches - they do not scale when you need to support hundreds of connections, much less thousands or more.
"What is the normal way people writing network code in Delphi use Windows-style overlapped asynchronous socket I/O?"
Well, Indy has been the 'standard' library for socket I/O for a long while now - and it's based on blocking sockets. This means if you want asynchronous behaviour, you use additional thread(s) to connect/read/write data. To my mind this is actually a major advantage, as there's no need to manage any kind of state machine navigation, or worry about callback procs or similar stuff. I find the logic of my 'reading' thread is less cluttered and much more portable than non-blocking sockets would allow.
Indy 9 has been mostly bombproof, fast and reliable for us. However the move to Indy 10 for Tiburon is causing me a little concern.
#Mike: "...the need to kill sockets to free threads...".
This made me go "huh?" until I remembered our threading library uses an exception-based technique to kill 'waiting' threads safely. We call QueueUserAPC to queue a function which raises a C++ exception (NOT derived from class Exception) which should only be caught by our thread wrapper procedure. All destructors get called, so the threads all terminate cleanly and tidy up on the way out.
"Synchronous sockets are not what I'm after."
Understood - but I think in that case the answer to your original question is that there just isn't a Delphi idiom for async socket IO because it's actually a highly specialized and uncommon requirement.
As a side issue, you might find these links interesting. They're both a little old, and more *nixy than Windows. The second one implies that - in the right environment - threads might not be as bad as you think.
The C10K problem
Why Events Are A Bad Idea (for High-concurrency Servers)
#Chris Miller - What you've stated in your answer is factually inaccurate.
Windows message-style async, as available through WSAAsyncSelect, is indeed largely a workaround for lack of a proper threading model in Win 3.x days.
.NET Begin/End, however, is not using extra threads. Instead, it uses overlapped I/O, via the extra argument to WSASend / WSARecv - specifically the overlapped completion routine - to specify the continuation.
This means that the .NET style harnesses the Windows OS's async I/O support to avoid burning a thread by blocking on a socket.
Since threads are generally speaking expensive (unless you specify a very small stack size to CreateThread), having threads blocking on sockets will stop you from scaling to 10,000s of concurrent connections.
This is why it's important that async I/O be used if you want to scale, and also why .NET is not, I repeat, is not, simply "using threads, [...] just managed by the Framework".
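For reference, a bare-bones sketch of the overlapped-I/O-with-completion-routine style described above (Win32; WSAStartup, buffer management and error handling are stripped down, and the completion routine only runs while the issuing thread is in an alertable wait such as SleepEx(..., TRUE)):

#include <winsock2.h>
#include <windows.h>

// Per-connection context, extended from WSAOVERLAPPED so the completion
// routine can recover it from the lpOverlapped pointer.
struct Connection {
    WSAOVERLAPPED ov;      // must be the first member for the cast below
    SOCKET        s;
    char          data[4096];
    WSABUF        buf;
};

void CALLBACK OnRecvComplete(DWORD error, DWORD bytes,
                             LPWSAOVERLAPPED lpOverlapped, DWORD /*flags*/)
{
    Connection* c = reinterpret_cast<Connection*>(lpOverlapped);
    if (error != 0 || bytes == 0)
        return;                       // connection closed or failed

    // ... consume c->data[0..bytes) here, then post the next read ...
    ZeroMemory(&c->ov, sizeof(c->ov));
    DWORD flags = 0;
    WSARecv(c->s, &c->buf, 1, nullptr, &flags, &c->ov, OnRecvComplete);
}

void StartReading(Connection* c)
{
    ZeroMemory(&c->ov, sizeof(c->ov));
    c->buf.buf = c->data;
    c->buf.len = sizeof(c->data);
    DWORD flags = 0;
    // No thread blocks on this socket; the continuation is OnRecvComplete.
    WSARecv(c->s, &c->buf, 1, nullptr, &flags, &c->ov, OnRecvComplete);

    // Somewhere, the issuing thread must enter an alertable wait so its pending
    // completion routines can run, e.g.:
    // while (running) SleepEx(INFINITE, TRUE);
}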
#Roddy - I've already read the links you point to, they are both referenced from Paul Tyma's presentation "Thousands of Threads and Blocking I/O - The old way to write Java Servers is New again".
Some of the things that don't necessarily jump out from Paul's presentation, however, are that he specified -Xss:48k to the JVM on startup, and that he's assuming that the JVM's NIO implementation is efficient in order for it to be a valid comparison.
Indy does not specify a similarly shrunken and tightly constrained stack size. There are no calls to BeginThread (the Delphi RTL thread creation routine, which you should use for such situations) or CreateThread (the raw WinAPI call) in the Indy codebase.
The default stack size is stored in the PE, and for the Delphi compiler it defaults to 1MB of reserved address space (space is committed page by page by the OS in 4K chunks; in fact, the compiler needs to generate code to touch pages if there are >4K of locals in a function, because the extension is controlled by page faults, but only for the lowest (guard) page in the stack). That means you're going to run out of address space after at most about 2,000 concurrent threads handling connections (roughly 2 GB of user address space divided by 1 MB of reserved stack per thread).
Now, you can change the default stack size in the PE using the {$M minStackSize [,maxStackSize]} directive, but that will affect all threads, including the main thread. I hope you don't do much recursion, because 48K (or similar) isn't a lot of space.
Now, whether Paul is right about non-performance of async I/O for Windows in particular, I'm not 100% sure - I'd have to measure it to be certain. What I do know, however, is that arguments about threaded programming being easier than async event-based programming, are presenting a false dichotomy.
Async code doesn't need to be event-based; it can be continuation-based, like it is in .NET, and if you specify a closure as your continuation, you get state maintained for you for free. Moreover, conversion from linear thread-style code to continuation-passing-style async code can be made mechanical by a compiler (CPS transform is mechanical), so there need be no cost in code clarity either.
There are free IOCP (I/O completion port) socket components: http://www.torry.net/authorsmore.php?id=7131 (source code included)
"By Naberegnyh Sergey N.. High
performance socket server based on
Windows Completion Port and with using
Windows Socket Extensions. IPv6
supported. "
I found it while looking for better components/libraries to re-architect my little instant-messaging server. I haven't tried it yet, but as a first impression it looks well coded.
For async stuff try ICS
http://www.overbyte.be/frame_index.html?redirTo=/products/ics.html
Indy uses synchronous sockets because it's a simpler way of programming. Asynchronous socket support was something added to the Winsock stack back in the Windows 3.x days; Windows 3.x did not support threads, and without threads you couldn't do background socket I/O any other way. For some additional information about why Indy uses the blocking model, please see this article.
The .NET Socket.BeginRead/EndRead calls are using threads, it's just managed by the Framework instead of by you.
#Roddy, Indy 10 has been bundled with Delphi since Delphi 2006. I found migrating from Indy 9 to Indy 10 to be a straightforward task.
With the ScktComp classes, you need to use a ThreadBlocking server rather than a NonBlocking server type. Use the OnGetThread event to hand off the ClientSocket param to a new thread of your devising. Once you've instantiated an inherited instance of TServerClientThread you'll create an instance of TWinSocketStream (inside the thread) which you can use to read and write to the socket. This method gets you away from trying to process data in the event handler. These threads could exist for just the short period needed to read or write, or hang around for the duration so they can be reused.
The subject of writing a socket server is fairly vast. There are many techniques and practices you could choose to implement. The method of reading and writing to the same socket within the TServerClientThread is straightforward and fine for simple applications. If you need a model for high availability and high concurrency then you need to look into patterns like the Proactor pattern.
Good luck!
