I have the following weird behaviour on a machine with 40 cores: calling System.Environment.ProcessorCount in fsi (12.0.30815.0) and fsianycpu (12.0.30815.0) leads to different results.
In fsi I get System.Environment.ProcessorCount = 32, while in fsianycpu I get System.Environment.ProcessorCount = 40. This also seems to affect the Task Parallel Library, which uses only 80% of the available cores when a simple test is run from fsi (the one reporting the wrong processor count).
What could be the reason?
FSI is probably running as a 32-bit process by default. You should be able to check this in Task Manager, assuming you are running Windows of course. I suspect this is a limitation of apps running under WoW64 (the subsystem that runs 32-bit apps on 64-bit Windows).
See: https://msdn.microsoft.com/en-us/library/windows/desktop/aa384228%28v=vs.85%29.aspx. It doesn't confirm this exactly (the WoW64 behavior here isn't documented), but it mentions that 32-bit Windows supports only 32 processors.
EDIT: See this other Stack Overflow post as well: Detecting the number of processors
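The 32 vs. 40 numbers and the 80% figure fit a simple model: a 32-bit process works with pointer-sized (here 32-bit) affinity masks, so it can name at most 32 processors. The C sketch below only does that arithmetic; visible_processors and usable_percent are hypothetical helpers, and the cap is an assumption consistent with the observation, not documented WoW64 behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Width of a pointer-sized affinity mask (DWORD_PTR on Windows) in the
   current build: 32 in a 32-bit process, 64 in a 64-bit process. */
static unsigned affinity_mask_bits(void) {
    return (unsigned)(sizeof(uintptr_t) * 8);
}

/* Assumed model: a process whose affinity mask has `mask_bits` bits
   can name at most that many logical processors. */
static unsigned visible_processors(unsigned installed, unsigned mask_bits) {
    assert(mask_bits > 0);
    return installed < mask_bits ? installed : mask_bits;
}

/* Share of the installed processors the process can use, in percent. */
static unsigned usable_percent(unsigned installed, unsigned mask_bits) {
    assert(installed > 0);
    return visible_processors(installed, mask_bits) * 100u / installed;
}
```

With 40 installed processors, visible_processors(40, 32) gives 32 and usable_percent(40, 32) gives 80, matching what fsi reports and the TPL utilisation described in the question.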
I have a workstation running 64-bit Windows Server 2012 R2. I am using Delphi XE7 Update 1. The workstation has 72 cores including hyperthreading.
I want all my applications to run on all the available cores each time they are run. I wish to do this programmatically rather than using Set Affinity in Task Manager (which only applies to one group at a time, and I have two groups of 36 CPUs which I want to engage simultaneously) or the boot options under Advanced options in msconfig.
I realise that the question is similar to, or encompasses, the following questions that have already been asked on Stack Overflow:
Delphi TParallel not using all available cpu
Strange behaviour of TParallel.For default ThreadPool
SetProcessAffinityMask - Select more than one processor?
and I have also looked at the suggestion at edn.embarcadero.com/article/27267.
But my SetProcessAffinityMask question relates to more than 64 cores on a 64-bit operating system and is not confined to using TParallel.
The solution that I have tried is an adaptation of the one proffered by Marco van de Voort
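var
  cpuset : set of 0..71;   { one element per logical processor, 72 in total }
  i: integer;
begin
  cpuset:=[];
  for i:=0 to 71 do
    cpuset:=cpuset+[i];    { add processors 0..71 to the set }
  { try to apply the whole set as the process affinity mask }
  SetProcessAffinityMask(ProcInfo.hProcess, dword(cpuset));
end;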
but it did not work.
I would be grateful for any suggestions.
As discussed in the comments, an affinity mask is 32 bits in 32 bit code, and 64 bits in 64 bit code. You are trying to set a 72 bit mask. Clearly that isn't going to work.
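To make the size problem concrete: one 64-bit mask covers at most 64 processors, so 72 processors cannot be named by a single mask, and each processor group gets its own mask instead. A minimal C sketch of that arithmetic; group_mask and masks_needed are illustrative helpers, not Windows APIs. Windows decides the actual group layout (the question reports two groups of 36), so masks_needed only gives a lower bound.

```c
#include <assert.h>
#include <stdint.h>

/* In 64-bit Windows code an affinity mask (KAFFINITY / DWORD_PTR) is
   64 bits wide, so one mask can name at most 64 processors. */
#define MAX_PROCS_PER_MASK 64u

/* Mask with the lowest `count` bits set: "every processor in a group
   of `count` processors". */
static uint64_t group_mask(unsigned count) {
    assert(count >= 1 && count <= MAX_PROCS_PER_MASK);
    return count == 64 ? UINT64_MAX : ((UINT64_C(1) << count) - 1);
}

/* Minimum number of 64-bit masks needed to cover `total` processors. */
static unsigned masks_needed(unsigned total) {
    return (total + MAX_PROCS_PER_MASK - 1) / MAX_PROCS_PER_MASK;
}
```

For the machine in the question, masks_needed(72) is 2, and each group of 36 would use group_mask(36), i.e. 0xFFFFFFFFF.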
You are going to need to understand processor groups, covered in detail on MSDN: Processor Groups, including this link to Supporting Systems That Have More Than 64 Processors.
Since you've got 72 processors you'll have multiple processor groups. You need to either use multiple processes to reach all groups, or use a multi-group process. From the doc:
After the thread is created, its affinity can be changed by calling SetThreadAffinityMask or SetThreadGroupAffinity. If a thread is assigned to a different group than the process, the process's affinity is updated to include the thread's affinity and the process becomes a multi-group process. Further affinity changes must be made for individual threads; a multi-group process's affinity cannot be modified using SetProcessAffinityMask.
This is pretty gnarly stuff. If you are in control of your threads then you should be able to do what you need with SetThreadGroupAffinity. If you are using the Delphi threading library then you don't control the threads. That probably makes a single process solution untenable.
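If you do control the threads, one possible approach is to spread them round-robin across the groups and give each thread a mask covering every processor in its group. The sketch below only computes those values. struct group_affinity is a hypothetical stand-in for the Windows GROUP_AFFINITY structure (which has Mask and Group fields) so that it compiles on any platform; on Windows the result would be passed to SetThreadGroupAffinity, which is not called here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GROUP_AFFINITY (Mask + Group). */
struct group_affinity {
    uint64_t mask;  /* processors within the group the thread may use */
    uint16_t group; /* processor group number */
};

/* Thread `index` goes to group index % ngroups and may run on any
   processor of that group. group_sizes[g] is the processor count of
   group g and must be between 1 and 64. */
static struct group_affinity affinity_for_thread(size_t index,
                                                 const unsigned *group_sizes,
                                                 uint16_t ngroups) {
    assert(ngroups > 0);
    struct group_affinity a;
    a.group = (uint16_t)(index % ngroups);
    unsigned n = group_sizes[a.group];
    assert(n >= 1 && n <= 64);
    a.mask = (n == 64) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
    return a;
}
```

With two groups of 36, threads 0, 2, 4, ... land in group 0 and threads 1, 3, 5, ... in group 1, each with the mask 0xFFFFFFFFF. Leaving the whole group in the mask lets the scheduler pick a processor within the group; pinning each thread to a single bit is the stricter alternative.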
Another problem to consider is memory locality. If the machine uses NUMA memory, then I know of no Delphi memory manager that can perform well with NUMA memory. In a NUMA environment, if performance is important to you, you probably need each thread to allocate memory on the thread's NUMA node.
The bottom line here is that there is no simple quick fix to produce code that will use all of this machine's resources effectively. Start with the documentation I linked to above and perform some trials to make sure you understand all the implications of processor groups and NUMA.
When executing a 32-bit program on a 64-bit CPU, can we assume that the underlying hardware will use 32-bit values in memory and in the CPU registers to store an integer?
Given these setups:
Intel i5 + Windows 32 bit + 32 bit application
Intel i5 + Windows 64 bit + 32 bit application
Please explain the drawbacks of each setup. Would the second one be less efficient because the CPU has extra work to ignore half of each register's value?
When an x86_64 processor executes 32-bit code it effectively behaves like an i686 (under a 64-bit OS it does so in "compatibility mode", a sub-mode of long mode). This is all achieved in hardware, so there is no performance penalty.
A user-space 32-bit program will be executed in "32 bit mode" whether the operating system is 32 bits or 64 bits so the behaviour should be identical.
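From the program's side, the widths of integers and pointers are fixed when it is compiled, not by the OS it later runs on. A minimal C sketch that reports the data model of the current build (data_model and int_bits are illustrative helpers): int is 32 bits in all three common models, and the visible difference between a 32-bit and a 64-bit build is the pointer width (plus long on LP64 systems such as Linux and macOS).

```c
#include <assert.h>
#include <limits.h>

/* Width of int in bits for this build. */
static unsigned int_bits(void) {
    return (unsigned)(sizeof(int) * CHAR_BIT);
}

/* Name of the C data model this build uses. */
static const char *data_model(void) {
    if (sizeof(void *) == 4 && sizeof(long) == 4) return "ILP32"; /* 32-bit Windows and Linux */
    if (sizeof(void *) == 8 && sizeof(long) == 8) return "LP64";  /* 64-bit Linux and macOS */
    if (sizeof(void *) == 8 && sizeof(long) == 4) return "LLP64"; /* 64-bit Windows */
    return "other";
}
```

A 32-bit Windows build reports "ILP32" whether it runs on 32-bit or 64-bit Windows, which is the point of the answer above.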
The only potential drawback is that the program might need to dynamically link to some 32-bit libraries which are not installed by default on the 64-bit OS. That scenario is more common on Linux, where programs much more often link dynamically to external libraries. The only Windows example I can think of is a program that needs the Visual C++ runtime library: you would have to install the 32-bit version of it (possibly alongside the 64-bit version, if another program needs that). So in short, you might have to install more stuff with a 64-bit setup.
Also, the OS itself might consume more memory in a 64-bit setup. On the other hand, one drawback of a 32-bit system is that your entire system memory is limited to 4 GB (in practice about 3.5 GB, because some of the address space is mapped to hardware).
64-bit Knowledge Base: How much memory can an application access in Win32 and Win64?
I would like to know the difference between mach_vm_allocate and vm_allocate. I know mach_vm_allocate is only available on OS X and not iOS, but I'm not sure why. The file that contains all the function prototypes for the mach_vm_... functions (mach/mach_vm.h) only has #error mach_vm.h unsupported. on iOS.
The new Mach VM API that is introduced in Mac OS X 10.4. The new API is essentially the same as the old API from the programmer's standpoint, with the following key differences.
- Routine names have the mach_ prefix; for example, vm_allocate() becomes mach_vm_allocate().
- Data types used in routines have been updated to support both 64-bit and 32-bit tasks. Consequently, the new API can be used with any task.
The new and old APIs are exported by different MIG subsystems: mach_vm and vm_map, respectively. The corresponding header files are <mach/mach_vm.h> and <mach/vm_map.h>, respectively.
The new Mach VM API that is introduced in Mac OS X 10.4. The new API is essentially the same as the old API from the programmer's standpoint, with the following key differences:
Routine names have the mach_ prefix; for example, vm_allocate() becomes mach_vm_allocate();
Data types used in routines have been updated to support both 64-bit and 32-bit tasks. Consequently, the new API can be used with any task.
The new and old APIs are exported by different MIG subsystems: mach_vm and vm_map respectively. The corresponding header files are <mach/mach_vm.h> and <mach/vm_map.h> respectively.
The information is derived from the book Mac OS X Internals: A Systems Approach.
Those would be /usr/include/mach/mach_vm.h and /usr/include/mach/vm_map.h. You can grep in /usr/include/mach to get those.
In the kernel, the APIs are basically using the same implementation. On iOS, you can use the "old" APIs perfectly well.
I'm not happy with the other answers. Two seem to quote from the same source (one doesn't correctly attribute it), and that source is rather misleading.
The data types in the old API are variable size. If you are compiling for i386 CPU architecture, they are 32 bit, if you compile for x86_64 CPU architecture, they are 64 bit.
The data types in the new API are fixed size and always 64 bit.
As a 32-bit process only requires 32-bit data types, and a 64-bit process already had 64-bit types even with the old API, it isn't obvious why the new API was required at all. But it becomes obvious if you look at Apple's kernel design:
The real kernel of macOS is a Mach 3 microkernel. Being a microkernel, it only takes care of process and thread management, IPC, virtual memory management and process/thread scheduling. That's it. With a microkernel, everything else would usually be done in userspace, but that is slow, and thus Apple took a different approach.
Apple took the FreeBSD monolithic kernel and wrapped it around the micro kernel which executes the FreeBSD kernel as a single process (a task) with special privileges (despite being a task, the code runs within kernel space not userspace as all the other tasks in the system). The combined kernel has the name XNU.
And in the past, Apple supported any combination of 32 and 64 bits within a system: your system could run a 32-bit kernel with 32- or 64-bit processes on it, or a 64-bit kernel with 32- or 64-bit processes on it. You can probably already see where this is going.
If process and kernel are of different bit widths, the old API is of no use. E.g., a 32-bit kernel cannot use it to interact with the memory of a 64-bit process, because all data types of the API calls are only 32 bit in the 32-bit kernel task, but a 64-bit process has a 64-bit memory space even if the kernel itself does not.
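The truncation problem can be shown with plain integer types. In Apple's headers mach_vm_address_t is always 64 bits, while vm_address_t follows the pointer width of the code using it; the sketch below models a 32-bit kernel's view with uint32_t. The type names and fits_in_32bit_api are hypothetical, and 0x100000000 is used as the example because 64-bit macOS executables are typically loaded at or above that address (the default __PAGEZERO covers the low 4 GiB).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two address types:
   old_api_addr32_t ~ vm_address_t as compiled into a 32-bit kernel,
   new_api_addr_t   ~ mach_vm_address_t, 64 bits for every task. */
typedef uint32_t old_api_addr32_t;
typedef uint64_t new_api_addr_t;

/* Nonzero if an address in a 64-bit task survives being passed through
   the 32-bit-wide old API type unchanged. */
static int fits_in_32bit_api(new_api_addr_t addr) {
    old_api_addr32_t narrowed = (old_api_addr32_t)addr; /* keeps only bits 0..31 */
    return (new_api_addr_t)narrowed == addr;
}
```

An address such as 0x100001000 silently becomes 0x1000 after narrowing, so a 32-bit kernel using the old API would operate on the wrong memory; a 64-bit type on both sides avoids this.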
Actually there is even a 3rd version of that API. To quote a comment from Apple's kernel source:
* There are three implementations of the "XXX_allocate" functionality in
* the kernel: mach_vm_allocate (for any task on the platform), vm_allocate
* (for a task with the same address space size, especially the current task),
* and vm32_vm_allocate (for the specific case of a 32-bit task). vm_allocate
* in the kernel should only be used on the kernel_task. vm32_vm_allocate only
* makes sense on platforms where a user task can either be 32 or 64, or the kernel
* task can be 32 or 64. mach_vm_allocate makes sense everywhere, and is preferred
* for new code.
As long as you only use that API in userspace and only for memory management of your own process, using the old API is still fine, even for a 64-bit process, as all data types will be 64 bit in that case.
The new API is only required when working across process boundaries, and only if it's not certain that both processes are 32-bit or both are 64-bit. However, Apple dropped 32-bit kernel support long ago, and has since also dropped 32-bit userspace support: starting with 10.15 (Catalina), all system libraries ship only as 64-bit libraries.
On iOS the new API was never required as you could never write kernel code for iOS and the security concept of iOS forbids direct interaction with other process spaces. On iOS you can only use that API to interact with your own process space and for that it will always have correct data types.
I have some 32-bit DLLs that don't have matched 64-bit DLLs. How can I invoke these DLLs from a 64-bit application written in Delphi XE2?
No, you cannot directly do this. A 64 bit process can only execute 64 bit code, and a 32 bit process can only execute 32 bit code.
The trick is to use multiple processes. (Note this can be done for non-visual code, and even for GUI elements, though there can be some small but problematic behaviors for visual elements.)
The most common solution is to wrap the 32 bit dll in an out of process COM server, which you can call across the 64/32 bit barrier. (This goes both ways, you can create a 64 bit out of process COM server and call it from a 32 bit application also.)
Yes, there are other ways to conceive of this, but the most common is to use COM:
Create a new 32 bit out of process COM server that hosts your 32 bit DLL and exposes the needed functionality from the 32 bit dll.
Call this COM server from your 64 bit code
I should add that it is also possible to create the new 32 bit COM server as an in-process COM server, and then configure COM+ to run it. COM+ will run it out of process, and magically run your 32 bit in process COM server out of process, where you can call it from 32 and 64 bit code transparently, as if it was in process. (Note, if the COM server is a GUI control, going out of process may or may not work. The team I work with has done it successfully, but there are complexities -- some of which cannot be surmounted -- related to hooking parent windows and controls that cannot be done across the process boundary.)
You can use the same exact technique used to call 64 bit dlls from 32 bit code.
See http://cc.embarcadero.com/Item/27667
"Just" do the opposite: run a background 32-bit process, then communicate with it from your 64-bit process using a memory-mapped buffer.
But this is definitely not an easy task. You'll have to rewrite some asm code. I wrote an article about how it works.
The out-of-process COM option is perhaps the easiest to implement. Or use simpler IPC, like the WM_COPYDATA message or any other means. But you'll definitely need another 32-bit process to link to the 32-bit libraries.
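Whichever IPC route is chosen (a memory-mapped buffer, WM_COPYDATA, or a pipe), the bytes exchanged must mean the same thing in the 64-bit caller and the 32-bit helper, so pointer-sized types (pointers, size_t, DWORD_PTR) must not appear in the message. A minimal C sketch of a fixed-width request layout, assuming a hypothetical bridge_request message; creating the shared buffer, synchronizing access, and calling the 32-bit DLL inside the helper are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request sent from the 64-bit caller to the 32-bit helper.
   Only fixed-width fields: pointers would be meaningless (and of a
   different size) in the other process. */
struct bridge_request {
    uint32_t function_id;  /* which DLL export the helper should call */
    int32_t  arg;          /* a simple integer argument */
    uint64_t payload_len;  /* size of any extra data, same width on both sides */
};

#define BRIDGE_REQUEST_SIZE 16u

/* Little-endian 32-bit store/load, independent of struct padding. */
static void put_u32(unsigned char *p, uint32_t v) {
    for (int i = 0; i < 4; i++) p[i] = (unsigned char)(v >> (8 * i));
}
static uint32_t get_u32(const unsigned char *p) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) v |= (uint32_t)p[i] << (8 * i);
    return v;
}

static void encode_request(const struct bridge_request *r,
                           unsigned char out[BRIDGE_REQUEST_SIZE]) {
    assert(r != 0);
    put_u32(out, r->function_id);
    put_u32(out + 4, (uint32_t)r->arg);
    put_u32(out + 8, (uint32_t)(r->payload_len & 0xFFFFFFFFu));
    put_u32(out + 12, (uint32_t)(r->payload_len >> 32));
}

static struct bridge_request decode_request(const unsigned char in[BRIDGE_REQUEST_SIZE]) {
    struct bridge_request r;
    r.function_id = get_u32(in);
    r.arg = (int32_t)get_u32(in + 4);
    r.payload_len = (uint64_t)get_u32(in + 8) | ((uint64_t)get_u32(in + 12) << 32);
    return r;
}
```

Encoding field by field, instead of copying the struct with memcpy, means the layout does not depend on how either compiler pads the struct.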
I had the same issue some time back and found this link: 32-bit DLLs in 64-bit environment
The 32-bit DLL was written in Delphi, ages ago, and we now had a need to call it from a 64-bit platform- but we don't have a 64-bit Delphi.
I've made it work- though it seems a bit of a kludge, it was better than getting the DLL rewritten in 64-bit (we'd have had to purchase a 64-bit version of Delphi, or start from scratch in something else).
NB while this needs some hacking, no programming is required- it uses components that came with Windows. Works in (at least) Windows 7, Windows 2008.
I need some help to describe, in technical terms, why a 64-bit application gives a "Not a valid Win32 application" error on 32-bit Windows on a 32-bit machine. Any MSDN reference is greatly appreciated (I couldn't google a reliable source). I know it shouldn't run, but I have no good explanation for this.
A 32 bit OS runs in 32-bit protected mode. A 64 bit OS runs in long mode (i.e. 64-bit protected mode). 64 bit instructions (used by 64 bit programs) are only available when the CPU is in long mode; so you can't execute them in 32-bit protected mode, in which a 32 bit OS runs.
(The above statement applies to x86 architecture)
By the way, the reason for "Not a valid Win32 application" error message is that 64 bit executables are stored in PE32+ format while 32 bit executables are stored in PE32 format. PE32+ files are not valid executables for 32 bit Windows. It cannot understand that format.
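The PE32 / PE32+ distinction is visible in the file header. A minimal C sketch that classifies an image already read into memory (classify_pe and enum pe_kind are illustrative names): it follows the e_lfanew offset stored at 0x3C to the "PE\0\0" signature, skips the 20-byte COFF file header and reads the optional header's Magic field, which is 0x10B for PE32 and 0x20B for PE32+. The COFF Machine field (0x14C for x86, 0x8664 for x64) differs as well; this sketch only checks Magic and is not a full validator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pe_kind { PE_INVALID, PE_32, PE_32_PLUS };

/* Little-endian reads from a byte buffer. */
static uint16_t rd16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}
static uint32_t rd32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static enum pe_kind classify_pe(const unsigned char *buf, size_t len) {
    assert(buf != NULL);
    /* DOS header: "MZ" at offset 0, e_lfanew (offset of "PE\0\0") at 0x3C. */
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z') return PE_INVALID;
    uint32_t pe = rd32(buf + 0x3C);
    /* Need the signature (4), COFF file header (20) and Magic (2). */
    if (pe > len || len - pe < 4 + 20 + 2) return PE_INVALID;
    if (buf[pe] != 'P' || buf[pe + 1] != 'E' || buf[pe + 2] != 0 || buf[pe + 3] != 0)
        return PE_INVALID;
    uint16_t magic = rd16(buf + pe + 4 + 20);
    if (magic == 0x10B) return PE_32;       /* 32-bit image */
    if (magic == 0x20B) return PE_32_PLUS;  /* 64-bit image */
    return PE_INVALID;
}
```

A 64-bit executable comes back as PE_32_PLUS, the format a 32-bit Windows loader does not accept, hence the error message.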
To expand on what others have said in a more low-level and detailed way:
When a program is compiled, the instructions are written for a specific processor instruction set. What we see as "x = y + z" usually amounts to something along the lines of copying one value into a register, passing the add command with the memory location of the other value, etc.
Specific to this question, a 64-bit application expects 64 bits of address space to work with. When you pass a command to the processor in a 32-bit system, it works on 32 bits of data at once.
The point of it all? You can't address more than 4 gigabytes (2^32 bytes) of memory on a 32-bit system without creativity. And some tasks that would take multiple operations on a 32-bit system (say, simple math on unsigned numbers above 4 billion) can be done in a single operation on a 64-bit one. Bigger, faster, but it requires breaking compatibility with older systems.
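The "multiple operations" point can be made concrete. A minimal C sketch (add64_with_32bit_ops is an illustrative name) that adds two 64-bit values using only 32-bit additions plus a carry, which is roughly what a compiler emits for a 64-bit add on a 32-bit x86 target (ADD followed by ADC); x86-64 code does the same with a single 64-bit ADD.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit addition built from two 32-bit additions, as a 32-bit CPU
   has to do it. Wraps modulo 2^64 like native uint64_t addition. */
static uint64_t add64_with_32bit_ops(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint32_t lo = a_lo + b_lo;          /* first 32-bit add (wraps) */
    uint32_t carry = lo < a_lo;         /* 1 if the low half overflowed */
    uint32_t hi = a_hi + b_hi + carry;  /* second add, including the carry */
    return ((uint64_t)hi << 32) | lo;
}
```

For example, 3,000,000,000 + 3,000,000,000 overflows the low 32 bits, and the carry into the high half produces the correct 6,000,000,000.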
A 64-bit application requires a 64-bit CPU because it makes use of 64-bit instructions.
And a 32-bit OS will put your CPU into 32-bit mode, which disables said instructions.
Why would you expect it to work? You're compiling your program to take advantage of features that don't exist on a 32-bit system.
A 64-bit architecture is built to invoke hardware CPU instructions that aren't supported by a 32-bit CPU, which your CPU is emulating in order to run a 32-bit OS.
Although it would theoretically be possible to run in some kind of emulation mode for the instruction set, you run into trouble as soon as your application needs more memory than 32 bits can address.
The primary reason why 64-bit apps don't run on 32-bit OSes has to do with the registers used in the underlying assembly language (presuming IA-32 here). In 32-bit operating systems and programs, the CPU registers are capable of handling a maximum of 32 bits of data (DWORDs). In 64-bit OSes, the registers must handle double the bits (64), hence the QWORD data size in assembly. Compiling for x86 (32 bit) means the compiler limits the registers to handling at most 32 bits of data. Compiling for x64 means the compiler allows the registers to handle up to 64 bits of data. Stepping down from QWORDs (64) to DWORDs (32) means you end up losing half the information stored in the registers. You're most likely getting the error because the underlying compiled machine code is trying to move more than 32 bits of data into a register, which the 32-bit operating system can't handle since the maximum size of its registers is 32 bits.
Because they're fundamentally different. You wouldn't expect a Frenchman to understand Mandarin Chinese, so why would you expect a 32-bit CPU to understand 64-bit code?