It doesn't seem like it would be difficult to associate ranges with segments of memory. Then have an assembly instruction that treats two integers as "location" and "offset" (plus a third for "data" when writing), and returns the data and an error code. This would mean no longer having to choose between speed and security/safety when working with arrays.
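A minimal C sketch of the semantics I have in mind (all names here are purely illustrative, not any real instruction or API):

#include <stddef.h>
#include <stdint.h>

/* What the proposed "checked access" primitive would return:
 * the data read, plus an error code instead of undefined behaviour. */
struct CheckedResult {
    uint8_t data;
    int     error;  /* 0 = OK, 1 = offset out of range */
};

/* Treat two integers as "location" (base + length) and "offset",
 * and fail gracefully instead of touching memory out of range. */
static struct CheckedResult checked_read(const uint8_t *base, size_t len,
                                         size_t offset)
{
    struct CheckedResult r = { 0, 0 };
    if (offset >= len) {
        r.error = 1;
        return r;
    }
    r.data = base[offset];
    return r;
}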
Another example might be a function which verifies that instructions originating in a particular memory range cannot physically access memory outside that range. If all hardware connected to the motherboard had this capability (and were made to be compatible with each other), it would be trivial to make perfect virtual machines that run at nearly the same speed as the physical machine.
Yes.
Decades ago, Lisp machines performed validation checks (e.g. type checks and bounds checks) simultaneously with the program's execution, on the optimistic assumption that the program and its state were valid, and jumped "back in time" if a check failed. Unfortunately, this ability to get "free" runtime validation was lost when conventional (i.e. x86) machines became dominant.
https://en.wikipedia.org/wiki/Lisp_machine
Lisp machines ran the tests in parallel with the more conventional single-instruction additions. If a simultaneous test failed, the result was discarded and recomputed; in many cases this meant a severalfold speed increase. The same simultaneous-checking approach was also used for testing array bounds on reference, and for other memory-management necessities (not merely garbage collection or arrays).
Fortunately we're finally learning from the past and slowly, piecemeal, reintroducing those innovations: Intel's MPX (Memory Protection eXtensions) for x86, introduced in Skylake-generation processors, provides hardware bounds-checking - though it isn't perfect.
(x86 is a regression in other ways too: IBM's mainframes had true hardware-accelerated system virtualization in the 1980s - we didn't get it on x86 until 2005 with Intel's "VT-x" and AMD's "AMD-V" extensions).
x86 BOUND
Technically, x86 does have hardware bounds-checking: the BOUND instruction was introduced in 1982 with the Intel 80186 and 80188 (and is present on the Intel 286 and above, but not the Intel 8086 or 8088 processors).
While the BOUND instruction does provide hardware bounds-checking, I understand it indirectly caused performance issues: it breaks the hardware branch predictor (according to a Reddit thread, though I'm unsure why), and it requires the bounds to be specified as a tuple in memory, which is terrible for performance. I understand that at runtime it's no faster than writing out the equivalent instructions by hand: "if index not in range [x,y], then signal the BR exception to the program or OS". So you might imagine the BOUND instruction was added for the convenience of people who coded assembly by hand, which was quite common in the 1980s.
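For illustration, here is a minimal C sketch of that manual check - the function and names are mine, not from any real API, and abort() merely stands in for raising #BR:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* "if index not in range [x,y] then signal the BR exception": */
static int read_checked(const int *arr, size_t lower, size_t upper, size_t i)
{
    if (i < lower || i > upper) {
        fprintf(stderr, "bounds violation: index %zu not in [%zu,%zu]\n",
                i, lower, upper);
        abort(); /* stands in for signalling #BR to the program or OS */
    }
    return arr[i];
}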
The BOUND instruction is still present in today's processors, but it was not included in AMD64 (x64) - likely for the performance reasons I explained above, and also because very few people were likely using it (and compilers could trivially replace it with a manual bounds check, which might have better performance anyway, as it could use registers).
Another disadvantage of storing the array bounds in memory is that code elsewhere (that wasn't subject to BOUND checking) could overwrite the previously written bounds for another pointer and circumvent the check that way. This is mostly a problem for code that intentionally tries to disable safety features (i.e. malware), but if the bounds were stored on the stack, then - given how easy it is to corrupt the stack - the instruction has even less utility.
Intel MPX
Intel MPX was introduced in Skylake architecture in 2015 and should be present in all Skylake and subsequent processor models in the mainstream Intel Core family (including Xeon, and non-SoC versions of Celeron and Pentium). Intel also implemented MPX in the Goldmont architecture (Atom, and SoC versions of Celeron and Pentium) from 2016 onwards.
MPX is superior to BOUND in that it provides dedicated registers to store the bounds range, so the bounds check should be almost zero-cost compared to BOUND, which required a memory access. On the Intel 486 the BOUND instruction takes 7 cycles (compare CMP, which takes only 2 cycles even when the operand is a memory address). In Skylake the MPX equivalents (BNDMK, BNDCL and BNDCU) are all 1-cycle instructions, and BNDMK can be amortized as it only needs to be executed once for each new pointer.
I cannot find any information on whether or not AMD has implemented its own version of MPX yet (as of June 2017).
Critical thoughts on MPX
Unfortunately the current state of MPX is not all that rosy - a recent paper by Oleksenko, Kuvaiskii, et al. from February 2017, "Intel MPX Explained" (PDF link; caution: not yet peer-reviewed), is a tad critical:
Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX’s performance overheads are still high (~50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions on memory layout require substantial code changes for some programs.
Also note that, compared to the Lisp machines of yore, Intel MPX checks are still executed inline, whereas on Lisp machines (if my understanding is correct) the bounds checks happened concurrently in hardware, with a retroactive jump backwards if a check failed. Thus, so long as a running program's pointers did not point to out-of-bounds locations, there was absolutely zero runtime performance cost. So if you have this C code:
char arr[10];
arr[9] = 'a';
arr[8] = 'b';
then under MPX this would be executed:
Time  Instruction        Notes
1     BNDMK arr, arr+9   Set bounds 0 to 9.
2     BNDCL arr          Check `arr` meets the lower bound.
3     BNDCU arr          Check `arr` meets the upper bound.
4     MOV 'a' arr+9      Assign 'a' to arr+9.
5     MOV 'b' arr+8      Assign 'b' to arr+8.
But on a Lisp machine (if it were magically possible to compile C to Lisp...), the program-reader hardware in the computer could execute additional "side" instructions concurrently with the "actual" instructions, allowing the "side" instructions to tell the computer to disregard the results of the "actual" instructions in the event of an error:
Time  Actual instruction   Side instruction
1     MOV 'a' arr+9        ENSURE arr+9 BETWEEN arr, arr+9
2     MOV 'b' arr+8        ENSURE arr+8 BETWEEN arr, arr+9
I understand the instructions-per-cycle for the "side" instructions are not the same as for the "actual" instructions - so the side check for the instruction at Time=1 might only complete after the "actual" instructions have already progressed to Time=3 - but if the check failed, it would pass the instruction pointer of the failed instruction to the exception handler, which would direct the program to disregard the results of the instructions executed after Time=1. I don't know how they could achieve that without massive amounts of memory or some mandatory execution pauses, possibly memory fencing too; that's outside the scope of my answer, but it is at least theoretically possible.
(Note in this contrived example I'm using constexpr index values that a compiler can prove will never be out-of-bounds so would omit the MPX checks entirely - so pretend they're user-supplied variables instead :) ).
I'm not an expert in x86 (nor do I have any experience in microprocessor design, save for a CS500-level course I took at UW and didn't do the homework for...), but I don't believe concurrent execution of bounds checks or "time travel" is possible with x86's current design, despite the existing implementation of out-of-order execution - I might be wrong, however. I speculate that if all pointer types were promoted to 3-tuples ( struct BoundedPointer<T> { T* ptr; T* min; T* max; } - which technically already happens with MPX and other software-based bounds checks, as every guarded pointer has its bounds defined when BNDMK is called), then the protection could be provided for free by the MMU - but now every pointer would consume 24 bytes of memory instead of the current 8 bytes (or compare the measly 4 bytes under 32-bit x86). RAM is plentiful, but still a finite resource that shouldn't be wasted.
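A rough C sketch of that "fat pointer" idea (purely illustrative - this is not how MPX actually stores its bounds, which live in the BND registers and bound tables):

#include <assert.h>
#include <stddef.h>

/* Every pointer carries its own bounds: 24 bytes instead of 8. */
struct BoundedPointer {
    char *ptr;  /* the pointer itself           */
    char *min;  /* lowest valid address         */
    char *max;  /* one past the last valid byte */
};

static char bounded_read(struct BoundedPointer p, ptrdiff_t offset)
{
    /* The check the MMU would hypothetically do for free: */
    assert(p.ptr + offset >= p.min && p.ptr + offset < p.max);
    return p.ptr[offset];
}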
MPX in GCC
GCC supported MPX from version 5.0 to 9.1 ( https://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler ), when it was removed due to its maintenance burden.
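If I remember the GCC wiki correctly, enabling the instrumentation looked like this (a historical example: it needs a GCC in the 5.0-9.1 range plus the libmpx runtime):

gcc -mmpx -fcheck-pointer-bounds hello_world.c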
MPX in Visual Studio / Visual C++
Visual Studio 2015 Update 1 (2015.1) added "experimental" support for MPX with the /d2MPX switch ( https://blogs.msdn.microsoft.com/vcblog/2016/01/20/visual-studio-2015-update-1-new-experimental-feature-mpx/ ). Support is still present in Visual Studio 2017 but Microsoft has not announced if it's considered a mainstream (i.e. non-experimental) feature yet.
MPX in Clang / LLVM
Clang has partially supported manual use of MPX in the past, but that support was fully removed in version 10.0.
As of July 2021, LLVM still seems capable of outputting MPX instructions, but I can't see any evidence of an MPX "pass".
MPX in Intel C/C++ Compiler
The Intel C/C++ Compiler has supported MPX since version 15.0.
The XL compilers for IBM POWER processors, available on little-endian Linux, big-endian Linux, and AIX, have a different implementation of array bounds checking.
Using the -qcheck or its synonym -C option turns on various kinds of checking. -qcheck=bounds checks array bounds. When this is used, the compilers check that every array reference has a valid subscript.
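For example, a hypothetical invocation (the driver name and file names are illustrative; check your XL documentation):

xlc -qcheck=bounds -o app app.c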
The hardware instruction used is a conditional trap, comparing the subscript to the upper limit and trapping if the subscript is too large or too small. In C and C++ the lower limit is 0. In Fortran it defaults to 1 but can be any integer. When it is not zero, the lower limit is subtracted from the subscript being checked, and the check compares that to the upper limit minus the lower limit.
When the limit is known at compile time and small enough, a conditional trap immediate instruction is enough. When the limit is calculated at execution time or is greater than 65535, a conditional trap instruction comparing two registers is needed.
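Here is the normalized check just described as a C sketch (my own illustration; the real thing is a single trap instruction, which abort() stands in for):

#include <stdlib.h>

/* Subtracting the lower limit and comparing unsigned means one
 * comparison catches subscripts that are too small *and* too large. */
static long check_subscript(long sub, long lower, long upper)
{
    if ((unsigned long)(sub - lower) > (unsigned long)(upper - lower))
        abort(); /* stands in for the POWER conditional trap */
    return sub;
}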
The performance impact is small for several reasons:
1. The conditional trap instructions are fast.
2. They are executed in a standard integer pipeline. Since most POWER CPUs have 2 or 4 integer pipelines, there is usually an otherwise empty slot to put the trap in, so it is often essentially zero cost.
3. When it can the compiler optimizer moves the conditional trap out of loops so it is executed only once, checking all loop iterations at once.
4. When it can prove the actual subscript cannot exceed the limit, the optimizer discards the instruction.
5. When it can prove the subscript will always be invalid, the optimizer uses an unconditional trap.
6. If necessary -qcheck can be used during testing and skipped for production builds, but the overhead is small enough that's not usually necessary.
If my memory is correct, one long-ago paper reported a 2% slowdown in one case and 0% in another. Since that CPU had only one integer pipeline, the slowdown should be significantly smaller with modern CPUs.
Other checking using the same mechanism is available to detect dereferencing NULL pointers, dividing an integer by zero, using an uninitialized auto variable, specially written asserts, etc.
This doesn't include all kinds of invalid memory usage, but it does handle the most common kind, does it very efficiently, and is very easy to use.
GCC supports -fbounds-check for similar purposes, but at this time it is only available for the Fortran front end (gfortran).
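For instance (prog.f90 being a stand-in for your Fortran source):

gfortran -fbounds-check prog.f90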
I'm playing around with a binary, and when I load it into my debugger, or even run readelf, I notice the entry point is 0x530 instead of the usual 0x80****** that I'd learned ELFs were loaded at.
Why is this? Is there anything else going on? The binary is linked and not stripped.
instead of the usual 0x80****** that I'd learned ELFs were loaded at.
You learned wrong.
While 0x8048000 is the usual address that 32-bit x86 Linux binaries are linked at, that address is by no means universal or special.
64-bit x86_64 and aarch64 binaries are linked at a default address of 0x400000, and powerpc64le binaries at a default address of 0x10000000.
There is no reason a binary could not be linked at any other (page-aligned) address (so long as it is not 0, and allows for sufficient stack at the high end of the address space).
Why is this?
The binary was likely linked with a custom linker script. Nothing wrong with that.
As mentioned by Employed, the entry address is not fixed.
Just to verify, I've tried on x86_64:
gcc -Wl,-Ttext-segment=0x800000 hello_world.c
which sets the entry point to 0x800000 plus the size of the ELF headers (which get loaded at 0x800000 in memory), instead of the default 0x400000.
Then both:
readelf -h a.out
and gdb -ex 'b _start' tell me the entry is at 0x800440 as expected (the header is 0x440 bytes).
This is because the entry point is the value that tells the Linux kernel where to set the program counter when it starts executing a new process.
The default 0x400000 comes from the default linker script used. You can also modify the linker script as mentioned in https://stackoverflow.com/a/31380105/895245 , change 0x400000 there, and use the new script with -T script
If I put it at anything below 0x200000 (2MiB) exactly, or at certain other low addresses, the program gets killed. I think this is because ld always loads segments at multiples of 2MiB, which is the largest page size supported (huge pages), so anything lower ends up starting at 0, which is bad: Why is the ELF execution entry point virtual address of the form 0x80xxxxx and not zero 0x0?
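For instance, a quick way to reproduce the failure (reusing the same hello_world.c as above; exact behaviour may vary by toolchain):

gcc -Wl,-Ttext-segment=0x100000 hello_world.c
./a.out    # gets killed, for the reason speculated above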
I am developing a basic hypervisor on ARM (using the board Arndale Exynos 5250).
I want to load Linux (Ubuntu or something else) or Android as the guest. Currently I'm using a Linaro distribution.
I'm almost there, most of the big problems have already been dealt with, except for the last one: reserving memory for my hypervisor such that the kernel does not try to OVERWRITE it BEFORE parsing the FDT or the kernel command line.
The problem is that my Linaro distribution's U-Boot passes a FDT in R2 to the linux kernel, BUT the kernel tries to overwrite my hypervisor's memory before seeing that I reserved that memory region in the FDT (by decompiling the DTB, modifying the DTS and recompiling it). I've tried to change the kernel command-line parameters, but they are also parsed AFTER the kernel tries to overwrite my reserved portion of memory.
Thus, what I need is a safe memory location in physical RAM where I can put my hypervisor's code such that the Linux kernel won't try to access (r/w) it BEFORE parsing the FDT or its kernel command line.
Context details:
The system RAM layout on the Exynos 5250 is: physical RAM starts at 0x4000_0000 (=1GB) and has a length of 0x8000_0000 (=2GB).
The Linux kernel is loaded (by U-Boot) at 0x4000_7000; its size (uncompressed uImage) is less than 5MB and its entry point is set to be at 0x4000_8000.
uInitrd is loaded at 0x4200_0000 and is less than 2MB in size.
The FDT (board.dtb) is loaded at 0x41f0_0000 (passed in R2) and is less than 35KB in size.
I currently load my hypervisor at 0x40C0_0000 and I want to reserve 200MB (0x0C80_0000) starting from that address, but the kernel tries to write there (a stage 2 HYP trap tells me that) before looking in the FDT or in the command line to see that the region is actually reserved. If instead I load my hypervisor at 0x5000_0000 (without even modifying the original DTB or the command line), it does not try to overwrite me!
The FDT is passed directly, not through ATAGs
Since when loading my hypervisor at 0x5000_0000 the kernel does not try to overwrite it whatsoever, I assume there are memory regions that Linux does not touch before parsing the FDT/command-line. I need to know whether this is true or not, and if true, some details regarding these memory regions.
Thanks!
RELATED QUESTION:
Does anyone happen to know what the priority is between the following: ATAGs / kernel command line / FDT? For instance, if I reserve memory through the kernel command line but not in the FDT (.dtb), should it work, or is the command line overridden by the FDT? Is there some kind of concatenation between these three?
As per https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm/Booting, safe locations start 128MB from the start of RAM (assuming the kernel is loaded in that region, which it should be). If a zImage was loaded lower in memory than what is likely to be the end address of the decompressed image, it might relocate itself higher up before it starts decompressing. But in addition to this, the kernel has a .bss region beyond the end of the decompressed image in memory.
(Do also note that your FDT and initrd locations already violate this specification, and that the memory block you are wanting to reserve covers the locations of both of these.)
Effectively, your reserved area should go after the FDT and initrd in memory - which 0x50000000 is. But anything more than 0x08000000 from the start of RAM should work, portably, so long as it doesn't overwrite the FDT, initrd or U-Boot in memory.
The priority of the kernel/FDT/bootloader command line depends on the kernel configuration - do a menuconfig and check under "Boot options". You can combine ATAGs with the built-in command lines, but not FDT - after all, the FDT chosen node is supposed to be generated by the bootloader. U-Boot's FDT support is OK, so you should let it do this rather than baking a command line into the .dts if you want an FDT command line.
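The relevant symbols look roughly like this (a sketch from memory - the exact names and values vary by kernel version, so check your own tree under "Boot options"):

CONFIG_CMDLINE="console=ttySAC2,115200"
CONFIG_CMDLINE_EXTEND=y          # append the bootloader's arguments to the built-in ones
CONFIG_ARM_ATAG_DTB_COMPAT=y     # fold ATAGs from the bootloader into the DTB chosen node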
The kernel is pretty conservative before it's got its memory map since it has to blindly trust the bootloader has laid things out as specified. U-boot on the other hand is copying bits of itself all over the place and is certainly the culprit for the top end of RAM - if you #define DEBUG in (I think) common/board_f.c you'll get a dump of what it hits during relocation (not including the Exynos iRAM SPL/boot code stuff, but that won't make a difference here anyway).
I am using OpenCV with the following processor:
Intel(R) Atom(TM) CPU N455 @ 1.66GHz
Intel indicates that this processor has 1 core but 2 threads.
So would it be relevant to use OpenCV with the TBB option and expect faster execution?
Thank you
this might help.
The TBB example works better using 2 threads even on my Megafon Mint phone, which contains an Intel(R) Atom(TM) CPU Z2460 @ 1.6GHz processor that likewise has 1 core but 2 threads.
--Vladimir
I'm using SNMP to get server information.
I want to get the CPU/processor type, for example: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz
I've tried very hard but I can't find a suitable OID.
Can anyone help me?
P.S.: sorry about my poor English.
The sysDescr table usually (but not always) contains information related to the machine's hardware, CPU included. However, you should definitely take a look at the hrDevice table, where the hrDeviceType is hrDeviceProcessor; here is what I get querying my laptop:
snmpwalk -v 2c -c ******* 192.168.*.* 1.3.6.1.2.1.25.3.2.1
HOST-RESOURCES-MIB::hrDeviceDescr.768 = STRING: GenuineIntel: Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz
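To pick out just the processors, you can walk the hrDeviceType column (1.3.6.1.2.1.25.3.2.1.2) and look for hrDeviceProcessor, then fetch the matching hrDeviceDescr rows. A sketch reusing the same host and community string as above (the .768 index is from my laptop; yours will differ):

snmpwalk -v 2c -c ******* 192.168.*.* 1.3.6.1.2.1.25.3.2.1.2
HOST-RESOURCES-MIB::hrDeviceType.768 = OID: HOST-RESOURCES-MIB::hrDeviceProcessor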