Memory Detection in ARM - memory

I am new to ARM and finding out ways to detect the memory map of platform based on ARM.Earlier I worked little in x86 and can find out memory map using some BIOS calls.
Same way can we do in ARM though BIOS is not there in ARM.
Is there any instruction do exist in ARM to find the Memory map ??

How do I find the memory map for an ARM CPU guide:
Read the documentation from arm.com for your coresponding core
Read the documentation of your CPU
Read the documentation of your platform, to see if it has external memory connected to SOC(CPU)
Or as a shortcut:
If your platform vendor provides a toolchain to compile code for it, make a dummy project and look for the memory layout in you linker file...
Gather this information:
Memory map for the corresponding core
Memory map of your CPU
If it has external accessible memory you have to perform some steps to initialize the controller.
Use gathered data and build the linker file for you project
Do whatever you want with it

There is no interface as ubiquitous as BIOS or EFI for ARM systems, though Microsoft does specify UEFI for systems that run Windows.
The Linux boot interface is the most common interface, see Documentation/arm/Booting in the kernel source and the header files.

If you want to write a program that to be portable across different Arm devices, you have to detect the memory by yourself. I am not very good especially in ARM, but there are common principles - you simply can scan the whole address space and probe the memory by writing a number and then reading it back. Usually, two such operations are provided with different numbers in order to exclude the occasional mistakes:
1. write 0aah
2. read and check for 0aah
3. write 055h
4. read and check for 055h
Note 1: for better speed not every byte have to be checked - some natural granularity have to be used and to check only at the start of the pages (whatever size are on this platform).
At the end you will have a map for the RAM memory. The ROM memory is not so easy to be detected, though and there is no common solution.
Note 2: Depending on the architecture (well, I said I am not ARM expert) your program must have access to the whole memory, according to the memory protection mechanisms of the CPU (if any).
Note 3: The only possible problem with this approach is the memory mapped IO. Touching it can affect the IO devices in unpredictable way. That is why, you must know what area of the addressing space is used for memory mapped IO and not to test it at all.

Related

Can I use the "Instruction" TCM in an Atmel SAM E70 processor for data?

I'm developing an application for a Atmel SAME70Q21 Microprocessor. This MCU has a ARM Coretex-M7 core.
Atmel have implemented the ARM TCM (Tightly Coupled Memory) in this particular MCU variant. Atmel seems to classify the TCM into two sections "ITCM" (instruction TCM), and "DTCM" (data TCM)
I'm currently using the DTCM for fast storage, usually from interrupts. However, the ITCM is currently actually turned off, though the configuration system for the TCM still allocates it 32K of data.
I was thinking, since I'm not executing out of the ITCM, and the ram is already allocated, can I use the ITCM for data storage? The Cortex-M7 is a Von Neumann architecture CPU, and the architecture diagrams I've seen show the two TCM memory segments as having separate interfaces from the CPU.
Both the DTCM and ITCM memory spaces are rw in the linker script (to use the ITCM, you actually have to relocate your code into it at runtime, actually). What are the performance affects of (ab)using ARM cores this way?
From the ARM Cortex-M7 Processor Technical Reference Manual Section 5.8 TCM Interfaces:
The Prefetch Unit (PFU) can fetch instructions from any of the TCM interfaces. The Load Store Unit (LSU) and the AHBS interface can each read and write data using any of the TCM interfaces. Best performance is achieved if code is placed in ITCM and data in DTCM. However, there is no functional restriction in which TCM, code and data is placed.
If you are using neither for code, then there is probably no performance hit, but if you are running code in TCM, then separating them benefits from the Harvard architecture, allowing concurrent instruction fetch and data read. The ITCM's 64 bit bus presumably allows single cycle instruction and operand fetch - but I doubt that will be of any benefit for data read/write.

Does simulating memory-mapped I/O using VMX require instruction decoding?

I am wondering how a hypervisor using Intel's VMX / VT technology would simulate memory-mapped I/O (so that the guest could think it was performing memory mapped I/O againsta device).
I think the basic principle would be to set up the EPT page tables in such a way that the memory addresses in question would cause an EPT violation (i.e. VM exit) by setting them such that they cannot be read or written? However, the next question is how to process the VM exit. Such a VM-exit would fill out all the exit qualification reasons etc. including the guest-linear and guest-physical address etc. But what I am missing in these exit qualification fields is some field indicating - in case of a write instruction - the value that was attempted to be written and the size of the write. Likewise, for a read instruction it would be nice with some bit fields indicating the destination of the read, say a register or a memory location (in case of memory-to-memory string operations). This would make it very easy for the hypervisor to figure out what the guest was trying to do and then simulate the device behavior towards the guest.
But the trouble is, I can't find such fields among the exit qualifications. I can see an instruction pointer to where the faulting instruction is, so I could walk the page tables to read in the instruction and then decode it to understand the instruction, then simulate the I/O behavior. However, this requires the hypervisor to have a fairly complete picture of all x86 instructions, and be able to decode them. That seems to be quite a heavy burden on the hypervisor, and will also require it to stay current with later instruction additions. And the CPU should already have this information.
There's a chance that that I am missing these relevant fields because the documentation is quite extensive, but I have tried to search carefully but have not been able to find it. Maybe someone can point me in the right direction OR confirm that the hypervisor will need to contain an instruction decoder.
I believe most VMs decode the instruction. It's not actually that hard, and most VMs have software emulators to fallback on when the CPU VM extensions aren't available or up to the task. You don't need to handle every instruction, just those that can take memory operands, and you can probably ignore everything that isn't a 1, 2, or 4 byte memory operand since you're not likely to emulating device registers other than those sizes. (For memory mapped device buffers, like video memory, you don't want to be trapping every memory accesses because that's too slow, and so you'll have to take different approach.)
However, there is one way you can let the CPU do the work for you, but it's much slower then decoding the instruction itself and it's not entirely perfect. You can single step the instruction while temporarily mapping in a valid page of RAM. The VM exit will tell you the guest physical address access and whether it was a read or write. Unfortunately it doesn't reliably tell you whether it was read-modify-write instruction, those may just set the write flag, and with some device registers that can make a difference. It might be easier to copy the instruction (it can only be a most 15 bytes, but watch out for page boundaries) and execute it in the host, but that requires that you can map the page to same virtual address in the host as in the guest.
You could combine these techniques, decode the common instructions that are actually used to access memory mapped device registers, while using single stepping for the instructions you don't recognize.
Note that by choosing to write your own hypervisor you've put a heavy burden on yourself. Having to decode instructions in software is a pretty minor burden compared to the task of emulating an entire IBM PC compatible computer. The Intel virtualisation extensions aren't designed to make this easier, they're just designed to make it more efficient. It would be easier to write a pure software emulator that interpreted the instructions. Handling memory mapped I/O would be just a matter of dispatching the reads and writes to the correct function.
I don't know in details how VT-X works, but I think I see a flaw in your wishlist way it could work:
Remember that x86 is not a load/store machine. The load part of add [rdi], 2 doesn't have an architecturally-visible destination, so your proposed solution of telling the hypervisor where to find or put the data doesn't really work, unless there's some temporary location that isn't part of the guest's architectural state, used only for communication between the hypervisor and the VMX hardware.
To handle a read-modify-write instruction with a memory destination efficiently, the VM should do the whole thing with one VM exit. So you can't just provide separate load and store interfaces.
More importantly, handling atomic read-modify-writes is a special case. lock add [rdi], 2 can't just be done as a separate load and store.

ARM memory mapping: INT15 equivalent? Standard way to query memory map?

On PC-architectures (where the presence of the BIOS and the usage of it is pretty much standardized), you can discover the size of the RAM memory, as well as its reserved/free for use regions by using the INT15 BIOS interrupt, function 0xE820.
Since I'm passionate about low-level programming and after programming Intel architectures for approximately 6 months, I decided I should try and learn how other architectures work. So I've started digging into ARM development. I've got 2 boards I'm currently working on: Olimex A20 OlinuXino-MICRO and Samsung Arndale's Exynos 5250. What I'm trying to do is to port a hypervisor I've developed for Intel architectures to these two boards. I am now in the stage of trying to programmatically detect the memory map of the system in a reliable and acceptably standardized way (I would prefer not to write entirely different code for different ARM boards). But so far, I find the relevant documentation to be a little bit confusing.
On the Olimex A20 I've got a Cortex-A7 ARM CPU.
The PDF found here: http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf , which applies to Cortex-A7 and other CPUs, states at page 14 that the memory addressing space from 1GB-to-2GB is reserved for Memory-Mapped I/O devices, whereas the Olimex-A20 documentation found at this link https://github.com/OLIMEX/OLINUXINO/blob/master/HARDWARE/A20-PDFs/A20%20User%20Manual%202013-03-22.pdf?raw=true states at page 21 that the memory addressing space from 1GB-to-3GB is DDR-II/DDR-III memory.
Am I simply confused or is there an inconsistency between these two documents?
The memory maps on ARM chips are highly chip-specific. There is also usually nothing like a BIOS, so your bootloader or hypervisor will have to figure out the memory layout on its own.
Generally you'd need to work with the SDRAM controller to query and initialize the installed SDRAM chips. This is a non-trivial and, once again, very chip-specific process. You should check the code of bootloaders (e.g. U-Boot) you have available for your chips and look for the memory init code.
However, in many of cases, the memory "map" (start of RAM and its size) is simply hardcoded for each board the bootloader is ported to, since it's unlikely to change at every boot.
Historically ARM boot-loaders pass information to the Linux kernel using ATAG structures as desribed in Booting ARM Linux. At a minimum the boot-loader is expected initialize the RAM in the system and pass ATAG_MEM structures to describe where the RAM lives in the address space. Interpreting these structures would give you some of the information you need but it doesn't tell you anything about any peripheral devices. In this booting method the machine type is used to trigger platform specific code to initialize the rest of the hardware.
The new way of doing this is through the Flattened Device Tree. The device tree originated with OpenFirmware and besides describing the RAM mapping can also describe the rest of the hardware and peripherals.

How does compiler lay out code in memory

Ok I have a bit of a noob student question.
So I'm familiar with the fact that stacks contain subroutine calls, and heaps contain variable length data structures, and global static variables are assigned to permanant memory locations.
But how does it all work on a less theoretical level?
Does the compiler just assume it's got an entire memory region to itself from address 0 to address infinity? And then just start assigning stuff?
And where does it layout the instructions, stack, and heap? At the top of the memory region, end of memory region?
And how does this then work with virtual memory? The virtual memory is transparent to the program?
Sorry for a bajilion questions but I'm taking programming language structures and it keeps referring to these regions and I want to understand them on a more practical level.
THANKS much in advance!
A comprehensive explanation is probably beyond the scope of this forum. Entire texts are devoted to the subject. However, at a simplistic level you can look at it this way.
The compiler does not lay out the code in memory. It does assume it has the entire memory region to itself. The compiler generates object files where the symbols in the object files typically begin at offset 0.
The linker is responsible for pulling the object files together, linking symbols to their new offset location within the linked object and generating the executable file format.
The linker doesn't lay out code in memory either. It packages code and data into sections typically labeled .text for the executable code instructions and .data for things like global variables and string constants. (and there are other sections as well for different purposes) The linker may provide a hint to the operating system loader where to relocate symbols but the loader doesn't have to oblige.
It is the operating system loader that parses the executable file and decides where code and data are layed out in memory. The location of which depends entirely on the operating system. Typically the stack is located in a higher memory region than the program instructions and data and grows downward.
Each program is compiled/linked with the assumption it has the entire address space to itself. This is where virtual memory comes in. It is completely transparent to the program and managed entirely by the operating system.
Virtual memory typically ranges from address 0 and up to the max address supported by the platform (not infinity). This virtual address space is partitioned off by the operating system into kernel addressable space and user addressable space. Say on a hypothetical 32-bit OS, the addresses above 0x80000000 are reserved for the operating system and the addresses below are for use by the program. If the program tries to access memory above this partition it will be aborted.
The operating system may decide the stack starts at the highest addressable user memory and grows down with the program code located at a much lower address.
The location of the heap is typically managed by the run-time library against which you've built your program. It could live beginning with the next available address after your program code and data.
This is a wide open question with lots of topics.
Assuming the typical compiler -> assembler -> linker toolchain. The compiler doesnt know a whole lot, it simply encodes stack relative stuff, doesnt care how much or where the stack is, that is the purpose/beauty of a stack, dont care. The compiler generates assembler the assembler is assembled into an object, then the linker takes info linker script of some flavor or command line arguments that tell it the details of the memory space, when you
gcc hello.c -o hello
your installation of binutils has a default linker script which is tailored to your target (windows, mac, linux, whatever you are running on). And that script contains the info about where the program space starts, and then from there it knows where to start the heap (after the text, data and bss). The stack pointer is likely set either by that linker script and/or the os manages it some other way. And that defines your stack.
For an operating system with an mmu, which is what your windows and linux and mac and bsd laptop or desktop computers have, then yes each program is compiled assuming it has its own address space starting at 0x0000 that doesnt mean that the program is linked to start running at 0x0000, it depends on the operating system as to what that operating systems rules are, some start at 0x8000 for example.
For a desktop like application where it is somewhat a single linear address space from your programs perspective you will likely have .text first then either .data or .bss and then after all of that the heap will be aligned at some point after that. The stack however it is set is typically up high and works down but that can be processor and operating system specific. that stack is typically within the programs view of the world the top of its memory.
virtual memory is invisible to all of this the application normally doesnt know or care about virtual memory. if and when the application fetches an instruction or does a data tranfer it goes through hardware which is configured by the operating system and that converts between virtual and physical. If the mmu indicates a fault, meaning that space has not been mapped to a physical address, that can sometimes be intentional and then another use of the term "Virtual memory" applies. This second definition the operating system can then for example take some other chunk of memory, yours or someone elses, move that to hard disk for example, mark that other chunk as not being there, and then mark your chunk as having some ram then let you execute not knowing you were interrupted with some ram that you didnt know you had to take from someone else. Your application by design doesnt want to know any of this, it just wants to run, the operating system takes care of managing physical memory and the mmu that gives you a virtual (zero based) address space...
If you were to do a little bit of bare metal programming, without mmu stuff at first then later with, microcontroller, qemu, raspberry pi, beaglebone, etc you can get your hands dirty both with the compiler, linker script and configuring an mmu. I would use an arm or mips for this not x86, just to make your life easier, the overall big picture all translates directly across targets.
It depends.
If you're compiling a bootloader, which has to start from scratch, you can assume you've got the entire memory for yourself.
On the other hand, if you're compiling an application, you can assume you've got the entire memory for yourself.
The minor difference is that in the first case, you have all physical memory for yourself. As a bootloader, there's nothing else in RAM yet. In the second case, there's an OS in memory, but it will (normally) set up virtual memory for you so that it appears you have the entire address space for yourself. Usuaully you still have to ask the OS for actual memory, though.
The latter does mean that the OS imposes some rules. E.g. the OS very much would like to know where the first instruction of your program is. A simple rule might be that your program always starts at address 0, so the C compiler could put int main() there. The OS typically would like to know where the stack is, but this is already a more flexible rule. As far as "the heap" is concerned, the OS really couldn't care.

Windows to embedded port: data and code memory size

I am in the process of porting a windows 7 library to an embedded platform. In order to do so my employer asks me the amount of memory (and CPU but let us concentrate on the memory for now) that my system will need once ported - so he can size the board to my needs.
I had a look on the internet and there seem not be exist much information about this question, hence my questions:
in order to get a rough idea of the memory footprint of the code in flash memory (code only without memory for data), I read on the Internet that I should sum the size of all the dlls I use. It seems that all compilers and platforms give a different size for the code footprint but overall the size of the code (without data) is often very close. Do you confirm?
in order to deal with the memory required by the data only (heap + stack but no code), I had a look at the task manager (and process explorer). It seems the overall amount of data which I use is specified in the 'peak working set'. I have a few questions about it though:
2.a. Does the 'working set' include the heap + stack memory or does it correspond to the heap only?
2.b. Does the 'working set' include the size for the code as well? (as I am on windows 7, the code is also stored in RAM and not in flash as on embedded systems), or does it only correspond to the data?
2.c. it seems the 'peak working set' reflects the maximum amount of physical memory that was actually in RAM from the time the program was started, but it does not reflect the size the program could take afterwards (if I happen to allocate memory at runtime - which would be bad ;) - the peak value would go on increasing). Do you confirm?
2.d. Hence, do you also confirm that if I do not allocate memory at runtime, the 'peak working set' should roughly be the maximum size of RAM my embedded system will need? Up to a bit of size difference due to the difference in systems technology...
Thanks,
Antoine.
Unless you are intending to run your application on Windows Embedded, then looking at the code and data usage in Windows is not going to be much of an indicator of anything useful!
1) DLLs are libraries - not all the code within them will be utilised by your code. Most embedded systems are statically linked and the linker will link only modules that are actually referenced in your by your code. So taking the sum of the DLL dependencies is likley to lead to a gross over estimation of memory requirement.
2) Windows memory management is profligate with memory use - because it can be and to do so generally improves performance of typical desktop systems. For example, an thread stack in Windows is typically of the order to 2Mb - you may seldom use that much, but Windows gives it to you in any case because it can and to do so errs on the side of safety. A thread stack in an embedded system will typically range from a few tens of bytes to a few tens of kilobytes - it depends on your application.
Windows task manager shows what Windows allocates to your process, that may not relate to what your process needs. Also your application is using Windows services - all the memory used for kernel and device services will not show up as part of your process, but your embedded system may still need those.
If you do use your Windows prototype code to assess the embedded system requirements, then your best place to start is by getting the linker to generate a map file, which will give a detailed description of memory usage in terms of statically allocated data and code size.
Code size depends not only on the performance of the compiler, but also on the efficiency of the instruction set. Some architectures achieve higher code density than others. Windows application code size is never a good indicator of embedded code size because its execution environment is likley to be so much different. For example an pre-emptive multitasking RTOS kernel on a 32bit ARM can be implemented in less than 10Kb of code, a file system perhaps another 10, and network stack anything from 10 to 30K, USB another 10. As you can see this is a different world to desktop code.
Data memory usage is more easily determined perhaps; but you do that through analysis of your application rather than observing what Windows does. There is the data your application instantiates directly, and then there is data instantiated by libraries and device drivers you might call - in Windows the latter is likley to be relatively large and out of your control. Typical embedded systems libraries for things such a s network stacks, USB, file systems etc. are fall smaller and far more deterministic in both performance and size.
Your better bet is to describe your application in terms of its general purpose, performance requirements, real-time constraints, and its hardware requirements (display, networking, I/O, mass storage etc.), and then look at comparable solutions or at the libraries you will need to implement your solution; most embedded systems are "bare board" and do not have the services you find in Windows unless you write them or use third-party solutions - Windows is seldom a comparable solution to an embedded system.
If it is just a library rather than an application, then build it for a likley target using a Windows hosted GCC cross-compiler and see how big it ends up. You don't need hardware for that or even expend any money.

Resources