How does pthreads handle cross-threading and scheduling?

I was wondering: how does pthreads-win32 (the Windows implementation of pthreads) implement cross-threading? Is it written exclusively with the Windows API? I checked some of the sources and most of it does indeed seem to be written with the Windows API, though I was wondering whether it relies on the Windows scheduler to switch between threads (and cores) as well, or implements its own. Specifically, many processors these days implement their own thread scheduling in hardware (I've read about the Itanium architecture, for example, where hardwired logic supports two threads per core and even switches between them automatically, so evidently OS support is not strictly necessary for that). So if I have an obsolete OS, like some 32-bit Windows that doesn't support multi-core processors, would a program written with pthreads-win32 still run on more than one processor core, or would only one core be used?
How about other pthreads implementations (plain POSIX threads)? Do they support multi-core processors even if the OS they are running on doesn't?
I am guessing the answer is no for both the Windows and POSIX versions: only one core is in use if the OS doesn't support multiple cores. This is just an educated guess, though, and I would like to confirm it, so please leave a comment.
As a side request, can you please recommend a library that DOES support multi-core thread execution even if the OS the program is running on DOESN'T, if any such library exists?
Also, is there a way to ensure that two threads written with pthreads are executed on different cores, or does the OS (or the processor, or the pthreads library) do the assignment automatically? Does pthreads guarantee execution on different cores if they are available?
Cheers, Val
EDIT:
I know most of these questions are implementation specific, so I was referring to this implementation of pthreads for Windows: http://sourceware.org/pthreads-win32/. I didn't mention it specifically before because, as far as I know, it is the most popular and widely used implementation of pthreads for Windows.

So from what I'm gathering, the most important thing to note in all of this is that threading has very little to do with parallelism (such as UMA with multi-core processors). While threading might be a technique for implementing concurrency, it is not a way of ensuring ACTUAL parallel execution, which is what I was looking for in the first place, since I am studying parallel and distributed systems and algorithms.
So, to answer one question at a time: yes, pthreads, and probably most (if not all) other threading APIs out there, are built on the underlying OS API, which of course gives them the same limits the OS has. Meaning: yes, if the OS (concretely in this case, some version of Windows running, for example, pthreads-win32) doesn't support multiple cores, only one core is in use at all times. As pointed out on the wiki page nos provided, to cite: "Hyper-threading requires not only that the operating system support multiple processors, but also that it be specifically optimised for HTT, and Intel recommends disabling HTT when using operating systems that have not been so optimized." http://en.wikipedia.org/wiki/Hyper-threading So in most cases the processor's basic hardwired scheduler alone is not enough to take advantage of multiple cores; it has to be supported and used by software (the OS).
While this might not be definitive proof, I believe enough evidence points in the same direction to confirm this to be the case.
I did not sift through the sources of pthreads (for POSIX-compliant OSes), but I am guessing the same goes for that API, since it is more than likely to use the underlying OS API as well. You will have to confirm this on your own. :)
As for any potential libraries out there that might support execution on multiple cores even when the OS they're running on doesn't: you will have to find them on your own (if they exist); please leave a comment if you do.
To ensure parallelism (execution on different cores) manually, Linux does provide a way to pin a thread to a specific virtual processor (under certain conditions). To pin an entire process to a specific (virtual) processor/core, sched_setaffinity() (from sched.h) can be used. As nos pointed out, pthreads provides pthread_setaffinity_np() to pin a particular thread to a specific core, and Windows supports similar functionality with SetThreadAffinityMask(). So clearly, manually assigning threads to run in parallel on different cores is possible (if the OS supports multiple cores).
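As a minimal sketch of the pthreads route (assuming Linux with glibc, where pthread_setaffinity_np() is available as a non-portable GNU extension):

    /* Minimal sketch: pin the calling thread to core 0 via the
     * non-portable GNU extension pthread_setaffinity_np().
     * Assumes Linux/glibc; compile with: gcc -pthread pin.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);      /* start with an empty CPU set */
        CPU_SET(0, &set);    /* allow only core 0           */

        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0) {
            fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
            return 1;
        }

        printf("main thread is now pinned to core 0\n");
        return 0;
    }

On Windows, the rough equivalent would be a call like SetThreadAffinityMask(GetCurrentThread(), 1).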
From my experience coding with pthreads, if you write code that uses multiple threads (more than two), they SHOULD be executed on more than one physical core if available (which is presumably an OS feature that pthreads relies on).
My questions were quite general to begin with, and since most of these things are implementation specific, it's hard to give one answer. I hope this answer is detailed enough to help clarify a few things.
Cheers, Val

Generally, every modern OS supports threads itself and schedules them onto the different (virtual) cores of a system. The OS also provides general synchronization primitives (such as mutexes, semaphores, or barriers), which are used to implement the pthreads API.
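For a concrete feel of that API (a minimal, self-contained sketch): two threads incrementing a shared counter under a pthreads mutex, which the library maps onto the OS's native synchronization primitives.

    /* Two threads incrementing a shared counter under a pthread mutex.
     * Compile with: gcc -pthread counter.c */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock); /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* always 200000 under the lock */
        return 0;
    }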
With two threads per core (I think you mean hyper-threading) on some Intel processors (such as Itanium), the OS sees two "virtual" cores; the processor indeed schedules the two threads onto one physical core (see Wikipedia).
However, there are examples where runtime platforms implement their own thread concepts and do the scheduling themselves: think of (at least older) implementations of Java, whose "green threads" had their own scheduling routines.

Related

Why do we need AML - ACPI Machine Language?

As I understand it, ACPI defines a generic hardware programming model where the operating system relies on AML (ACPI Machine Language) code, provided by the OEM firmware, to manipulate the hardware.
In order to execute the AML code, the operating system has to incorporate an AML interpreter.
So, it looks to me that firmware developers use AML to provide a control interface between platform hardware and operating system.
But do we really need AML?
I think ultimately the hardware can only be configured through the native instructions of the platform, so the AML interpreter must translate the AML into native instructions or it cannot be executed on the platform at all.
But what's the point of using an intermediate language like AML? I mean, AML is said to be platform-independent, which means I can use it to describe my platform in a non-native way.
But in practice the AML is part of the platform firmware, and the rest of the firmware has already been built into the target platform's native instructions. So what good does it do to make such a small part of the firmware platform-independent? Why not just use native instructions? There must be some way to let the OS use those as well, and then the operating system wouldn't need an AML interpreter at all. A lot of complexity could be avoided.
One of the big goals of ACPI over its predecessor, APM, was to give the OS more visibility into and control over power state transitions.
APM was a black box. The OS knew nothing about the power management implementation. It would just call a BIOS function and the BIOS handled all of the magic. Did it work? Did the system sleep properly? Did the system freeze? Was a user application able to handle the BIOS implementation? The sad truth was that many systems had power management that was downright broken, and Microsoft wanted to provide a better power management experience for the growing laptop industry.
Now, the BIOS hands the ASL/AML code over to the OS, and the OS executes it, not the BIOS. If the BIOS code does something dumb (like messing with registers it shouldn't), Windows can detect that by parsing the code, and block it. Unlike compiled C, AML is 100% decompilable.
Remember that ACPI is not x86 specific. At the time it was developed, Itanium and XScale were around, and Intel and Microsoft needed a language that would work on all platforms, both 32-bit and 64-bit.
Lastly, ASL is more than just a list of executable functions; it is also a number of static configuration tables. The ASL code has tables that define the non-PnP hardware built onto your motherboard, and tables of the supported power states. A traditional programming language like C isn't really set up for that.
If ACPI was invented today, they would probably use something like XML to provide the info to the OS.
Originally, hardware for the "80x86 PC" was cloned from IBM's PC, and this created an effective de-facto standard for hardware to follow. However, it didn't take long before manufacturers wanted to add features that didn't previously exist, for which there was no (official or de-facto) standard to follow.
This led to a major problem for operating system software (how do you support "non-standard chaos"?). Some standards were created for some things (APM, etc.), but they didn't really cover everything needed and became out-of-date. ACPI was created to fix this.
Ideally, what was (and still is) needed are standards that allow an operating system to detect and use the supported features of the motherboard. For example: a "standardised case temperature and fan control" device (with support for detecting how many fans, temperature sensors, etc. there are), a "standardised CPU speed/power consumption" interface, a "PCI slot IRQ routing for IO APICs" standard, a "hot-plug PCI controller device" standard, and so on.
However, ACPI didn't provide useful standards that hardware manufacturers and operating systems can use. Instead, ACPI provided an over-engineered mess (AML) to allow an OS to cope with ACPI's failure to standardise the hardware.
Essentially, we "need" AML now because it's the only viable way for an OS to work around the "non-standard chaos" problem that ACPI failed to fix.
The problem with providing native code instead of AML is that different operating systems use CPUs in different ways (e.g. native 64-bit 80x86 code in firmware would be useless for an older "32-bit only" OS). AML provides portability between different types of CPUs and between the same CPU/s in different modes.
Also, native code is considered a major security problem (rootkits, etc.), and people tend to think an interpreted language mitigates that problem. Of course, in practice AML needs far too much access to the underlying hardware, accesses it in a way the OS can't check, and there isn't even a way for an OS to determine whether the AML was maliciously modified before the OS booted. For these reasons, AML remains a major security problem despite being an interpreted language.

Qualitative comparison between Petalinux and FreeRTOS

I'm going to start the development of an application on a Zynq board. My task is basically to port an existing application running on a MicroBlaze to the dual-core ARM.
What I'm wondering about is which O.S. to use on the new system, because I have no experience at all in this field.
It seems to me that there are four main approaches:
1) Petalinux (use both cores)
2) Petalinux+FreeRTOS (use both cores)
3) FreeRTOS (use only a core)
4) Baremetal (use only a core)
What my application has to do is move a large amount of data between Ethernet and multiple custom links, so it has to serve a lot of interrupts and manage a lot of DMA operations.
How much overhead does Petalinux introduce in interrupt servicing compared to bare metal or FreeRTOS? Do you think that, for this kind of work, a single-core application running without any OS would be faster than, for example, a Petalinux application that carries the overhead of the OS (and of synchronization mechanisms like semaphores and mutexes)?
I know the question is not precise and quite vague, but having no experience in the field I strongly need some initial hints.
Thank you.
As you say, this is too vague to give a considered answer, because it really depends on your application (when does it not?). If you need all the 'stuff' that is available for Linux and boot time is not an issue, then go with that. If you need actual real-time behaviour, fast boot times, and simplicity, and don't need anything Linux specific, then FreeRTOS might be your best choice. There is a FreeRTOS TCP project for the Zynq that uses the BSD-style sockets interface (like Linux) here: http://www.freertos.org/FreeRTOS-Plus/FreeRTOS_Plus_TCP/TCPIP_FAT_Examples_Xilinx_Zynq.html
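To give a feel for the FreeRTOS side, here is a minimal sketch of the classic "deferred interrupt" pattern often used for exactly this kind of interrupt-plus-DMA workload. It is a fragment to drop into an existing FreeRTOS project (the handler name LinkRx_IRQHandler and task names are invented for illustration):

    /* Deferred interrupt handling sketch: the ISR stays tiny and just
     * wakes a task through a binary semaphore; the task does the real
     * work (e.g. kicking off the next DMA transfer). */
    #include "FreeRTOS.h"
    #include "task.h"
    #include "semphr.h"

    static SemaphoreHandle_t xDataReady;

    /* Interrupt handler: as short as possible, just signal the task. */
    void LinkRx_IRQHandler(void)
    {
        BaseType_t xWoken = pdFALSE;
        xSemaphoreGiveFromISR(xDataReady, &xWoken);
        portYIELD_FROM_ISR(xWoken); /* switch to the task right away if it has higher priority */
    }

    /* Task: blocks until the ISR signals, then processes the data. */
    static void vRxTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            if (xSemaphoreTake(xDataReady, portMAX_DELAY) == pdTRUE) {
                /* e.g. set up the next DMA transfer here */
            }
        }
    }

    void vStartRx(void)
    {
        xDataReady = xSemaphoreCreateBinary();
        xTaskCreate(vRxTask, "rx", configMINIMAL_STACK_SIZE, NULL,
                    tskIDLE_PRIORITY + 2, NULL);
    }

The point is that the ISR itself stays tiny and deterministic, and the heavy lifting happens in a task whose priority you control, which is where an RTOS tends to beat Linux on worst-case latency.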
Usually the performance should not differ a lot.
If you compile your Linux with a well-optimizing compiler, there is a good chance it will be faster than a bare-metal version.
But if you need hard real-time guarantees, Linux is not suitable for you.
There is a good whitepaper from Altera on this; it should apply to Xilinx too:
whitepaper on real time jitter

Are there any implementations\prototypes of Erlang alike VM that could run not only on CPU but also on gpu?

I have been creating distributed systems in OOP languages using message-passing libraries like MPI, ZeroMQ, RabbitMQ, and so on. Now I found myself watching some Erlang promotional material and realized that a lot of what we emulate in OOP languages like C++ and C# using libraries (1,000,000 socket connections per process, distributed messaging, distributed process monitoring and visualization) has been in Erlang for many years now. It seemed reasonable to get to know the language better, and I found myself asking one last question: are there any implementations/prototypes of an Erlang-like VM that could run/spawn processes not only on the CPU but also on the GPU?
That would definitely make Erlang (and its dialects that read more easily to my OOP background, like Elixir) the language of choice for most future projects.
A GPU is fast only with sequential memory access, and I can hardly imagine garbage collection in GPU RAM. A GPU is NOT just a cool, parallel CPU; it takes more effort to write for. So most probably there is no Erlang compiler targeting the GPU.
I doubt there's any implementation that can run Erlang processes on the GPU, but you can use two techniques to run computations on the GPU from Erlang:
use a C library through NIFs (Native Implemented Functions) - see http://www.erlang.org/doc/man/erl_nif.html and an example of such an implementation: msantos/procket on GitHub (I'm sorry, I can't post the link due to low reputation :)
use a native OS process and communicate with it through an Erlang "port" - see http://www.erlang.org/doc/reference_manual/ports.html
The first one is faster and the latter is safer (NIFs can crash the whole VM).
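For illustration, a minimal sketch of what the NIF route looks like on the C side (the module name my_math and the add function are invented; a real GPU binding would launch kernel work here instead of adding two ints):

    /* Minimal NIF sketch: exposes add/2 to Erlang. Compile as a shared
     * library against erl_nif.h and load it from an Erlang module named
     * my_math via erlang:load_nif/2. */
    #include "erl_nif.h"

    static ERL_NIF_TERM add_nif(ErlNifEnv *env, int argc,
                                const ERL_NIF_TERM argv[])
    {
        int a, b;
        if (argc != 2 ||
            !enif_get_int(env, argv[0], &a) ||
            !enif_get_int(env, argv[1], &b)) {
            return enif_make_badarg(env);
        }
        return enif_make_int(env, a + b);
    }

    static ErlNifFunc nif_funcs[] = {
        {"add", 2, add_nif}   /* name, arity, C implementation */
    };

    ERL_NIF_INIT(my_math, nif_funcs, NULL, NULL, NULL, NULL)

Note that a long-running NIF blocks an Erlang scheduler thread while it executes, which is another reason the port approach is considered safer.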
This is not specific to GPU computations: Erlang is not well suited for high-performance number crunching, so it's better to do that part in C and manipulate the results in Erlang anyway. The communication between the C side and Erlang would then be implemented in one of the two ways described above.

Why do the CLR and JVM use a stack-based architecture?

I have always been curious as to why the JVM and CLR have a stack-based architecture.
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?
I used to ponder the differences between register and stack machines, compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machines while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set, in combination with data and register stacks, but eventually concluded that it was an artificial limitation, so we changed to an infinite register set and an allocator. At some point, the Parrot fast core (GCC computed goto) outperformed the Mono and JVM interpreter cores (non-JIT), but the difference came down to the JIT: Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about. If all VMs played by the same rules (i.e. they were constrained to run in interpreted mode with no JIT), then my evidence shows that a register machine has a performance edge over an equivalent stack machine: larger instructions, but fewer of them == higher throughput (IPC) and better cache locality of reference. The Dalvik VM actually supports my findings; Dalvik had no JIT for a couple of years and competed on its interpreter core alone.
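To make the "fewer but larger instructions" point concrete, here is a toy sketch (invented opcodes, not Parrot's or anyone's real bytecode) that executes a = b + c on both kinds of machine and counts instruction dispatches:

    /* Toy illustration of the trade-off for computing a = b + c:
     * stack machine: PUSH b; PUSH c; ADD; STORE a   (4 small instructions)
     * register machine: ADD a, b, c                 (1 wider instruction) */
    #include <stdio.h>

    enum { PUSH, ADD, STORE, HALT };  /* stack-machine opcodes    */
    enum { RADD, RHALT };             /* register-machine opcodes */

    int main(void)
    {
        int vars[3] = {0, 2, 3}; /* a = vars[0], b = vars[1], c = vars[2] */

        /* --- stack machine --- */
        int scode[] = {PUSH, 1, PUSH, 2, ADD, STORE, 0, HALT};
        int stack[8], sp = 0, ip = 0, steps = 0;
        for (;;) {
            steps++;
            switch (scode[ip++]) {
            case PUSH:  stack[sp++] = vars[scode[ip++]]; break;
            case ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case STORE: vars[scode[ip++]] = stack[--sp]; break;
            case HALT:  goto done_stack;
            }
        }
    done_stack:
        printf("stack machine:    a = %d after %d dispatches (incl. HALT)\n",
               vars[0], steps);

        /* --- register machine --- */
        vars[0] = 0;
        int rcode[] = {RADD, 0, 1, 2, RHALT}; /* ADD dst, src1, src2 */
        ip = 0; steps = 0;
        for (;;) {
            steps++;
            switch (rcode[ip++]) {
            case RADD: {
                int d = rcode[ip++], s1 = rcode[ip++], s2 = rcode[ip++];
                vars[d] = vars[s1] + vars[s2];
                break;
            }
            case RHALT: goto done_reg;
            }
        }
    done_reg:
        printf("register machine: a = %d after %d dispatches (incl. HALT)\n",
               vars[0], steps);
        return 0;
    }

Each dispatch is a trip through the interpreter loop, which is exactly the overhead the register encoding amortizes over fewer, larger instructions.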
Very few mainstream VMs run exclusively in interpretation mode (AFAIK); they JIT compile, and that's what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in the absence of the JIT. This isn't a rule, of course; there are billions of devices running ARM-accelerated JVMs without a JIT, but wherever memory and CPU constraints are absent, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to JIT eventually, it doesn't matter much whether you implement a stack or register machine to start; do what you like, but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially wasted effort, because it isn't real native optimization: the soft core doesn't do branch prediction, register renaming, instruction reordering, parallel execution, or prefetching like a real processor does. My feeling is that once we have a high-quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack-based machine for:
Simplicity - much less code to maintain = fewer bugs
Time to implement
But visually and emotionally I favor a register machine for:
Visual-conceptual models that more closely match the machine (and my brain)
Flexibility - compilers can evaluate their expression trees in different orders using SSA
Note I didn't say compilers could more "easily" generate code; that seems to be what people who have worked mostly with stack machines like to argue. I don't believe that and didn't find it to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR; I would admit the ones on the CLR are of higher quality, but that is mainly a matter of ecosystem and the quality of available tools. I wrote compilers on both platforms myself and found there are tradeoffs, but not enough to lose sleep over.
This is an educated guess, because my real-world experience does not include writing a full JITter, so I don't have first-hand experience comparing the pros and cons of JITting the various forms of opcodes. But my opinion is: if you plan to include a JIT, then building an extremely sophisticated virtual machine opcode core amounts to premature optimization. Your time is better spent elsewhere.
It is usually not appropriate to just link out to an article, but this time I'll make an exception: this article by Eric Lippert answers exactly this question.

Should I run my regression testing programs on both AMD and Intel chips?

Right now I plan to test on 32-bit, 64-bit, Windows XP Home, Windows XP Pro, Windows Vista Home Basic, Windows Vista Ultimate, Windows 7 Home Basic, and Windows 7 Ultimate ... all with the latest service pack.
However, now I'm wondering if it's worthwhile to test on both AMD and Intel for all the listed scenarios above or would it be a waste of time?
Note: this is a security application for everyday average users.
My feeling is that this would only be worthwhile if you had lots of on-the-edge hand-coded assembly language, or some kind of incredibly tight timing requirements (which you're not going to meet with that selection of OSes anyway).
If you're using off-the-shelf commercial compilers, then you can be reasonably sure they're going to generate code which runs on all the normal processors.
Of course, nobody could ever prove they didn't need to test on a particular platform, but I would think there are bigger causes of platform difference to worry about than CPU brand (all the various multi-core/hyperthreading permutations, for example, which might expose your multithreaded code bugs in different ways).
Only if you're programming in assembly and using extended, vendor-specific instruction sets. But since AMD and Intel have cross-licensing agreements in place, this is more of a historic issue than a current one.
In every other case (e.g. using a high-level language) it's the job of the compiler writers to ensure the code is x86 compliant and runs on every CPU.
Oh, and apart from the FDIV bug, processor vendors usually don't make mistakes.
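If you do want a sanity check, about the only cheap vendor-specific thing an ordinary application can do is read the CPUID vendor string. A small sketch (assumes GCC/Clang on x86, which provide <cpuid.h>; MSVC has __cpuid in <intrin.h> instead):

    /* Print the CPUID vendor string ("GenuineIntel" / "AuthenticAMD"). */
    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13];

        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 1;

        /* The 12-byte vendor string lives in EBX, EDX, ECX (in that order). */
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        vendor[12] = '\0';

        printf("%s\n", vendor);
        return 0;
    }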
I think you're looking in the wrong direction for testing scenarios.
Yes, it's possible that your code will work on Intel but not on AMD, or on Windows Vista Home but not on Windows Vista Professional. But unless you're doing something very closely tied to low-level programming in the first case, or to details of the OS implementation in the second, the odds are small. You could say that it never hurts to test every conceivable scenario, but in real life there must be some limit on the resources available for testing. Testing on different processors or different OSes is, in most cases, not testing YOUR program; it's testing the compiler, the OS, or the processor. How much time do you have to spare to test other people's work?
I think your time would be better spent testing more scenarios within your own code. You don't give much detail on just what your app does, but to take one of my own examples: it would be much more productive to spend a day testing sales of products our own company makes versus products we resell from other manufacturers, or testing sales tax rules for different states, or whatever.
In practice, I rarely even test deploying on Windows versus deploying on Linux, never mind different versions of Windows, and I rarely get burned on that.
If I was writing low-level device drivers or some such, that would be a different story. But normal apps? Don't waste your time.
Certainly sounds like it would be a waste of time to me - which language(s) are your programs written in?
I'd say no. Unless you are writing your application in assembler, you should be far enough removed from the processor not to need to worry about differences. The processors support the Windows OS, whose APIs are what you are interfacing with (depending on the language). If you are using .NET, the ONLY foreseeable issue is if you are using a version of the framework that those platforms don't support; given that they are all XP or later, you should be fine. If you want to worry about something, make sure your application plays nicely with the Vista-and-later security model.
The question is really "what are you testing?". It is unlikely that any of your tests exercise something that would differ between AMD and Intel hardware platforms. Differences could be expected at the driver level, but you do not seem to plan on testing your software against every existing bit of PC hardware. Most probably there would be far more differences between different Windows service pack levels than between AMD and Intel processors.
I suppose it's possible that there is some functionality in your code that (whether you know it or not) takes advantage of some processing or optimization on one vendor or the other, and that this could have a serious effect on the outcome. Keyword: possible.
I would say that in general you're unlikely to have to worry about it. If you're going to test on multiple machines anyway, mix the CPU brands up among them, but I wouldn't stress out about it.
I would never run all of my regression tests on both AMD and Intel unless I had specifically fixed an issue unique to either one. That is what regression testing is for.
Unit testing, on the other hand... I wouldn't anticipate any difference there either. So again, I wouldn't bother running unit tests on both until I had actually seen an issue specific to either AMD or Intel.
If you rely on accurate / consistent floating-point results, then yes, definitely: the same source can yield slightly different results on different CPUs (for example through x87 excess precision or differing SSE/FMA code paths), so that is one case where testing both vendors pays off.
