MPU not triggering faults on Cortex-M4 - memory

I want to write-protect a memory region. I've configured the MPU, but it is not generating any faults.
The base address of the region that I want to protect is 0x20000000. The region size is 64 bytes.
Here's code that compiles and demonstrates the issue.
#define MPU_CTRL (*((volatile unsigned long*) 0xE000ED94))
#define MPU_RNR (*((volatile unsigned long*) 0xE000ED98))
#define MPU_RBAR (*((volatile unsigned long*) 0xE000ED9C))
#define MPU_RASR (*((volatile unsigned long*) 0xE000EDA0))
#define SCB_SHCSR (*((volatile unsigned long*) 0xE000ED24))
void Registers_Init(void)
{
    MPU_RNR = 0x00000000; // using region 0
    MPU_RBAR = 0x20000000; // base address is 0x20000000
    MPU_RASR = 0x0700110B; // Size is 64 bytes, no sub-regions, permission=7(ro,ro), s=b=c= 0, tex=0
    MPU_CTRL = 0x00000001; // enable MPU
    SCB_SHCSR = 0x00010000; // enable MemManage Fault
}
void MemManage_Handler(void)
{
    __asm(
        "MOV R4, 0x77777777\n\t"
        "MOV R5, 0x77777777\n\t"
    );
}
int main(void)
{
    Registers_Init();
    __asm(
        "LDR R0, =0x20000000\n\t"
        "MOV R1, 0x77777777\n\t"
        "STR R1, [R0,#0]"
    );
    return (1);
}
void SystemInit(void)
{
}
So, in the main function, I am writing to the restricted area, i.e. 0x20000000, but the MPU does not generate a fault: instead of MemManage_Handler() being called, the write succeeds.

This looks fine to me. Make sure your hardware has an MPU. The MPU has a read-only register called MPU_TYPE that tells you whether an MPU is present. If bits 15:8 (the DREGION field) of MPU_TYPE read 0, there is no MPU.
Also, avoid raw numbers when dealing with registers. They make the code hard to read, both for you and for anyone else. Instead, define named bit masks. See tutorials on how to do that.
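As a rough illustration of the bit-mask advice, here is a minimal sketch for the registers used in the question. The macro and function names are my own (hypothetical, not taken from a vendor header; CMSIS ships its own equivalents), and the field positions assume the ARMv7-M MPU register layout. The helpers are pure functions so they can be checked on a host machine; on the target you would pass in the value read from MPU_TYPE at 0xE000ED90.

```c
#include <assert.h>
#include <stdint.h>

/* MPU_TYPE: DREGION (bits 15:8) holds the number of supported regions. */
#define MPU_TYPE_DREGION_SHIFT  8u
#define MPU_TYPE_DREGION_MASK   (0xFFu << MPU_TYPE_DREGION_SHIFT)

/* MPU_RASR fields used in the question. */
#define MPU_RASR_ENABLE         (1u << 0)
#define MPU_RASR_SIZE_SHIFT     1u   /* region size = 2^(SIZE + 1) bytes */
#define MPU_RASR_SRD_SHIFT      8u   /* sub-region disable bits */
#define MPU_RASR_AP_SHIFT       24u
#define MPU_RASR_AP_RO_RO       (7u << MPU_RASR_AP_SHIFT)

#define MPU_CTRL_ENABLE         (1u << 0)
#define SCB_SHCSR_MEMFAULTENA   (1u << 16)

/* Number of MPU regions reported by a raw MPU_TYPE value; 0 means no MPU. */
static inline uint32_t mpu_region_count(uint32_t mpu_type)
{
    return (mpu_type & MPU_TYPE_DREGION_MASK) >> MPU_TYPE_DREGION_SHIFT;
}

/* Encode the RASR SIZE field for a region of 2^log2_bytes bytes. */
static inline uint32_t mpu_rasr_size(uint32_t log2_bytes)
{
    assert(log2_bytes >= 5u && log2_bytes <= 32u); /* smallest region is 32 bytes */
    return (log2_bytes - 1u) << MPU_RASR_SIZE_SHIFT;
}
```

Written this way, the RASR value from the question would read `MPU_RASR_AP_RO_RO | (0x11u << MPU_RASR_SRD_SHIFT) | mpu_rasr_size(6u) | MPU_RASR_ENABLE`. That spelling also makes it visible that 0x0700110B carries a non-zero SRD field, even though the comment in the question says no sub-regions are used.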


lldb - how to read the permissions of a memory region for a thread?

Apple says that on ARM64 Macs a memory region can have either write or execute permission for a given thread. How can I find out the current permissions of a memory region for a thread in lldb? I have tried 'memory region <address>', but that returns rwx. I am working on a Just-In-Time compiler that will run on my M1 Mac. For testing, I made a small simulation of a Just-In-Time compiler.
#include <cstdio>
#include <sys/mman.h>
#include <pthread.h>
#include <libkern/OSCacheControl.h>
#include <stdlib.h>
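int main(int argc, const char * argv[]) {
    size_t size = 1024 * 1024 * 640;
    // prot and flags are assumed: rwx and private match the vmmap output, MAP_JIT is needed for per-thread toggling
    int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT;
    int fd = -1;
    int offset = 0;
    unsigned *addr = 0;
    // allocate a mmap'ed region of memory
    addr = (unsigned *)mmap(0, size, prot, flags, fd, offset);
    if (addr == MAP_FAILED){
        printf("failure detected\n");
        return 1;
    }
    // Make the region writable for this thread
    pthread_jit_write_protect_np(0);
    // Write instructions to the memory
    addr[0] = 0xd2800005; // mov x5, #0x0
    addr[1] = 0x910004a5; // add x5, x5, #0x1
    addr[2] = 0x17ffffff; // b <address>
    // Make the region executable for this thread
    pthread_jit_write_protect_np(1);
    sys_icache_invalidate(addr, size);
    // Execute the code
    int(*f)() = (int (*)()) addr;
    (*f)();
    return 0;
}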
Once the assembly instructions start executing through the (*f)() call, I can pause execution in Xcode and type
memory region {address of instructions}
into the debugger. For some reason it keeps returning 'rwx'. Am I using the right command or could this be a bug with lldb?
When I run your little program on a Mac where I can poke around (I'm on x86_64 but it shouldn't matter, I don't actually need to run the instructions...) I see in lldb:
Process 43209 stopped
* thread #1, queue = '', stop reason = breakpoint 1.1
frame #0: 0x0000000100003f20 protectit`main at protectit.cpp:31
28 addr[2] = 0x17ffffff; // b <address>
30 pthread_jit_write_protect_np(1);
-> 31 sys_icache_invalidate(addr, size);
33 // Execute the code
34 int(*f)() = (int (*)()) addr;
Target 0: (protectit) stopped.
(lldb) memory region addr
[0x0000000101000000-0x0000000129000000) rwx
which is as you report. I then double-checked with vmmap:
> vmmap 43209 0x0000000101000000
0x101000000 is in 0x101000000-0x129000000; bytes after start: 0 bytes before end: 671088639
MALLOC_SMALL 100800000-101000000 [ 8192K 8K 8K 0K] rw-/rwx SM=PRV MallocHelperZone_0x1001c4000
---> VM_ALLOCATE 101000000-129000000 [640.0M 4K 4K 0K] rwx/rwx SM=PRV
GAP OF 0x5ffed7000000 BYTES
MALLOC_NANO 600000000000-600008000000 [128.0M 88K 88K 0K] rw-/rwx SM=PRV DefaultMallocZone_0x1001f1000
so vmmap agrees with lldb that the region is rwx.
Whatever pthread_jit_write_protect_np is doing, it doesn't seem to be changing the underlying memory region protections.
I found out the answer to my question is to read an undocumented Apple register called S3_6_c15_c1_5.
This code reads the raw value from the register:
// Returns the S3_6_c15_c1_5 register's value
uint64_t read_S3_6_c15_c1_5_register(void)
{
    uint64_t v;
    __asm__ __volatile__("isb sy\n"
                         "mrs %0, S3_6_c15_c1_5\n"
                         : "=r"(v)::"memory");
    return v;
}
This code tells you what your thread's current mode is:
// Returns the mode for a thread.
// Returns "Executable" or "Writable".
// Remember to free() the value returned by this function.
char *get_thread_mode()
{
    uint64_t value = read_S3_6_c15_c1_5_register();
    char *return_value = (char *) malloc(50);
    switch (value) {
    case 0x2010000030300000:
        sprintf(return_value, "Writable");
        break;
    case 0x2010000030100000:
        sprintf(return_value, "Executable");
        break;
    default:
        sprintf(return_value, "Unknown state: %llx", (unsigned long long) value);
    }
    return return_value;
}
This is a small test program to demonstrate these two functions:
int main(int argc, char *argv[]) {
    pthread_jit_write_protect_np(1);
    printf("Thread's mode: %s\n", get_thread_mode());
    // The mode is Executable
    pthread_jit_write_protect_np(0);
    printf("Thread's mode: %s\n", get_thread_mode());
    // The mode is Writable
    return 0;
}
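A possible variation on the decoding step, sketched under the same assumptions: it separates decoding from the register read and returns an enum instead of a malloc'ed string, so nothing has to be freed and the decoding can be exercised on any machine. The two constants are the ones reported in the answer; they come from an undocumented register and may well differ on other chips or OS versions. The names `jit_mode`, `decode_jit_mode`, and `jit_mode_name` are hypothetical.

```c
#include <stdint.h>

enum jit_mode { JIT_MODE_UNKNOWN, JIT_MODE_WRITABLE, JIT_MODE_EXECUTABLE };

/* Map a raw S3_6_c15_c1_5 value to a mode, using the two values reported above. */
static enum jit_mode decode_jit_mode(uint64_t raw)
{
    switch (raw) {
    case 0x2010000030300000ULL: return JIT_MODE_WRITABLE;
    case 0x2010000030100000ULL: return JIT_MODE_EXECUTABLE;
    default:                    return JIT_MODE_UNKNOWN;
    }
}

/* Human-readable name; the returned string is static and must not be freed. */
static const char *jit_mode_name(enum jit_mode mode)
{
    switch (mode) {
    case JIT_MODE_WRITABLE:   return "Writable";
    case JIT_MODE_EXECUTABLE: return "Executable";
    default:                  return "Unknown";
    }
}
```

On an Apple silicon Mac this would be combined with the register read as `jit_mode_name(decode_jit_mode(read_S3_6_c15_c1_5_register()))`.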

Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory should be limited only by the GPU device, no matter whether it is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that with the __device__ __managed__ approach, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)
__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];
__global__ void swapAB()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for(int j = 0; j < NY; j++)
        for(int i = 0; i < NX; i++)
            A[j][i][tid] = B[j][i][tid];
}
int main()
{
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
It uses 64^5 * 2 * 4 / 2^30 = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12 GB of global memory.
Compiler cmd:
nvcc -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I ran the generated executable, an error says:
GPUassert: an illegal memory access was encountered
Surprisingly, if I instead allocate the same amount of global memory (8 GB) dynamically via the cudaMalloc API, there is no compiler warning and no runtime error.
I'm wondering whether there is any special limit on the size of statically allocated global device memory in CUDA.
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.
This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
void __cudaRegisterVar(
    void **fatCubinHandle,
    char *hostVar,
    char *deviceAddress,
    const char *deviceName,
    int ext,
    int size,
    int constant,
    int global
);
which only accepts a signed int for the size of any statically declared device variable. This limits the maximum size to 2^31 - 1 (2147483647) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing __cudaRegisterVar-style registration calls like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 (2^32, i.e. 64^5 floats of 4 bytes each) which is the source of the problem. The size overflows the signed integer and causes the API call to blow up. So it seems you are limited to 2 GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.
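The arithmetic behind that limit can be checked on the host with a small sketch. This is plain C rather than CUDA, the helper names `array_bytes` and `fits_in_int` are hypothetical, and it assumes a platform where `int` is 32 bits wide, as it is on the x64 Linux system in the question.

```c
#include <limits.h>
#include <stdint.h>

/* Total size in bytes of an array with ny * nx * m elements, computed in 64 bits. */
static uint64_t array_bytes(uint64_t ny, uint64_t nx, uint64_t m, uint64_t elem_size)
{
    return ny * nx * m * elem_size;
}

/* Returns 1 if the byte count can be passed through an `int size` parameter, else 0. */
static int fits_in_int(uint64_t bytes)
{
    return bytes <= (uint64_t)INT_MAX;
}
```

For the arrays in the question, `array_bytes(64, 64, 64 * 64 * 64, 4)` is 4294967296, and `fits_in_int` rejects it. Shrinking each static array below 2^31 bytes, or switching to cudaMalloc as the question already does, stays clear of the `int` parameter.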

malloc using 4 bytes for char

I am writing code for a piece of coursework to examine how memory is managed between the stack and the heap.
#include <stdio.h>
#include <stdlib.h>
#define NUM_OF_CHARS 100
// function prototype
void f(void);
int main()
{
    f();
    return 0;
}
void f(void)
{
    char *ptr1;
    ptr1 = (char *) malloc(NUM_OF_CHARS * sizeof(int));
    printf("Address array 1: %016lx\n", (long)ptr1);
    char *ptr2;
    ptr2 = (char *) malloc(NUM_OF_CHARS * sizeof(int));
    printf("Address array 2: %016lx\n", (long)ptr2);
}
When I run this code I get the following:
Address array 1: 000000000209e010
Address array 2: 000000000209e1b0
My expectation was to see a difference of 100 bytes between the addresses, but the difference is 416 bytes. When I changed NUM_OF_CHARS to other values (200, 300, ...), the result was always (NUM_OF_CHARS*4 + 16), so it seems like malloc is allocating 4 bytes for each char rather than one byte, plus 16 bytes of some overhead.
Can anyone explain what is happening here?
Memory allocation is platform- and compiler-dependent. The only thing malloc guarantees is that it allocates enough memory for what you ask for, and nothing more.
There is no guarantee that consecutive allocations will be contiguous, because of memory alignment and allocator bookkeeping.
Also, your code allocates NUM_OF_CHARS * sizeof(int) bytes, not NUM_OF_CHARS * sizeof(char). This is most likely why you see a NUM_OF_CHARS*4 difference; the remaining 16 bytes can be attributed to padding.
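To see the effect of the element size directly, a minimal sketch along these lines measures the distance between two live allocations of a given byte count. The helper name `gap_between_allocations` is hypothetical, and the exact gap depends on the allocator, so only a lower bound can be relied on: two live blocks of n bytes cannot overlap, so their start addresses are at least n bytes apart.

```c
#include <stdint.h>
#include <stdlib.h>

/* Distance in bytes between the start addresses of two live malloc blocks
 * of nbytes each. Returns 0 if either allocation fails. */
static size_t gap_between_allocations(size_t nbytes)
{
    char *p1 = malloc(nbytes);
    char *p2 = malloc(nbytes);
    size_t gap = 0;

    if (p1 != NULL && p2 != NULL) {
        uintptr_t a = (uintptr_t)p1;
        uintptr_t b = (uintptr_t)p2;
        gap = (size_t)(a > b ? a - b : b - a);
    }
    free(p1);
    free(p2);
    return gap;
}
```

Comparing `gap_between_allocations(100 * sizeof(char))` with `gap_between_allocations(100 * sizeof(int))` shows how much of the observed distance comes from the requested size and how much from the allocator's own overhead.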

Why does a 64-bit aligned XOR not run faster than a 32-bit aligned XOR?

I want to compare the speed of XORing two blocks of memory. I ran an experiment on a 64-bit machine (4M cache), XORing two memory regions with 32-bit aligned and 64-bit aligned accesses respectively. I thought the 64-bit aligned XOR would be much faster than the 32-bit aligned XOR, but the two run at nearly the same speed.
void region_xor_w32( unsigned char *r1, /* Region 1 */
                     unsigned char *r2, /* Region 2 */
                     unsigned char *r3, /* Sum region */
                     int nbytes) /* Number of bytes in region */
{
    uint32_t *l1;
    uint32_t *l2;
    uint32_t *l3;
    uint32_t *ltop;
    unsigned char *ctop;
    ctop = r1 + nbytes;
    ltop = (uint32_t *) ctop;
    l1 = (uint32_t *) r1;
    l2 = (uint32_t *) r2;
    l3 = (uint32_t *) r3;
    while (l1 < ltop) {
        *l3 = ((*l1) ^ (*l2));
        l1++;
        l2++;
        l3++;
    }
}
void region_xor_w64( unsigned char *r1, /* Region 1 */
                     unsigned char *r2, /* Region 2 */
                     unsigned char *r3, /* Sum region */
                     int nbytes) /* Number of bytes in region */
{
    uint64_t *l1;
    uint64_t *l2;
    uint64_t *l3;
    uint64_t *ltop;
    unsigned char *ctop;
    ctop = r1 + nbytes;
    ltop = (uint64_t *) ctop;
    l1 = (uint64_t *) r1;
    l2 = (uint64_t *) r2;
    l3 = (uint64_t *) r3;
    while (l1 < ltop) {
        *l3 = ((*l1) ^ (*l2));
        l1++;
        l2++;
        l3++;
    }
}
I believe this is due to data starvation. That is, your CPU is so fast and your code is so efficient that your memory subsystem simply can't keep up. Even XORing in a 32-bit aligned way takes less time than fetching data from memory. That's why both 32-bit and 64-bit aligned approaches have the same speed — that of your memory subsystem.
To demonstrate, I reproduced your experiment, but this time with four different ways of XORing:
non-aligned (i.e. byte-aligned) XORing;
32-bit aligned XORing;
64-bit aligned XORing;
128-bit aligned XORing.
The last one was implemented via _mm_xor_si128(), which is a part of the SSE2 instruction set.
In my measurements, switching to 128-bit processing gave no performance boost. Switching to per-byte processing, on the other hand, slowed everything down, because in that case the memory subsystem can still outpace the CPU, so the CPU becomes the bottleneck.
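A comparison of this kind could be set up roughly as follows. This is a sketch, not the code used for the measurements above: the function names and the `clock()`-based timer are my own choices, the 128-bit SSE2 variant is left out to keep it portable, and `memcpy` is used for the word loads and stores to sidestep alignment and aliasing problems (compilers typically turn it into a single load or store).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*xor_fn)(const unsigned char *, const unsigned char *,
                       unsigned char *, size_t);

/* XOR one byte per iteration. */
static void xor_w8(const unsigned char *a, const unsigned char *b,
                   unsigned char *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(a[i] ^ b[i]);
}

/* XOR 4 bytes per iteration; n must be a multiple of 4. */
static void xor_w32(const unsigned char *a, const unsigned char *b,
                    unsigned char *out, size_t n)
{
    assert(n % sizeof(uint32_t) == 0);
    for (size_t i = 0; i < n; i += sizeof(uint32_t)) {
        uint32_t x, y;
        memcpy(&x, a + i, sizeof x);
        memcpy(&y, b + i, sizeof y);
        x ^= y;
        memcpy(out + i, &x, sizeof x);
    }
}

/* XOR 8 bytes per iteration; n must be a multiple of 8. */
static void xor_w64(const unsigned char *a, const unsigned char *b,
                    unsigned char *out, size_t n)
{
    assert(n % sizeof(uint64_t) == 0);
    for (size_t i = 0; i < n; i += sizeof(uint64_t)) {
        uint64_t x, y;
        memcpy(&x, a + i, sizeof x);
        memcpy(&y, b + i, sizeof y);
        x ^= y;
        memcpy(out + i, &x, sizeof x);
    }
}

/* CPU seconds spent on `reps` passes of fn over n bytes. */
static double time_xor(xor_fn fn, const unsigned char *a, const unsigned char *b,
                       unsigned char *out, size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(a, b, out, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To observe the memory-bound behaviour described above, the buffers would need to be considerably larger than the last-level cache; with small buffers that stay in cache, the wider variants may well come out ahead.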

How is a pointer stored in memory?

I am working on an OS project and I am just wondering how a pointer is stored in memory? I understand that a pointer is 4 bytes, so how is the pointer spread amongst the 4 bytes?
My issue is, I am trying to store a pointer to a 4 byte slot of memory. Lets say the pointer is 0x7FFFFFFF. What is stored at each of the 4 bytes?
A pointer is stored the same way as any other multi-byte value. The 4 bytes are stored according to the endianness of the system. Say the pointer is stored starting at address 0x1000:
Big endian (most significant byte first):
Address Byte
0x1000 0x7F
0x1001 0xFF
0x1002 0xFF
0x1003 0xFF
Little endian (least significant byte first):
Address Byte
0x1000 0xFF
0x1001 0xFF
0x1002 0xFF
0x1003 0x7F
By the way, 4-byte addresses imply a 32-bit system. A 64-bit system has 8-byte addresses.
To reference each individual byte of the pointer, you need to use a pointer. :)
Say you have:
int i = 0;
int *pi = &i; // say pi == 0x7fffffff
int **ppi = &pi; // from the above example, ppi == 0x1000
Simple pointer arithmetic would get you the pointer to each byte.
You should read up on Endianness. Normally you wouldn't work with just one byte of a pointer at a time, though, so the order of the bytes isn't relevant.
Update: Here's an example of making a fake pointer with a known value and then printing out each of its bytes:
#include <stdio.h>
int main(int argc, char* argv[]) {
    int *p = (int *) 0x12345678;
    unsigned char *cp = (unsigned char *) &p;
    int i;
    for (i = 0; i < (int) sizeof(p); i++)
        printf("%d: %.2x\n", i, cp[i]);
    return 0;
}
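If the goal is to write a 32-bit pointer value into a 4-byte slot with a known byte order, regardless of the host's endianness, one option is to split it with shifts. The following is a minimal sketch assuming a little-endian slot layout; `store_le32` and `load_le32` are hypothetical helper names.

```c
#include <stdint.h>

/* Store v into slot[0..3], least significant byte first. */
static void store_le32(uint8_t slot[4], uint32_t v)
{
    slot[0] = (uint8_t)(v & 0xFFu);
    slot[1] = (uint8_t)((v >> 8) & 0xFFu);
    slot[2] = (uint8_t)((v >> 16) & 0xFFu);
    slot[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Reassemble the 32-bit value from slot[0..3]. */
static uint32_t load_le32(const uint8_t slot[4])
{
    return (uint32_t)slot[0]
         | ((uint32_t)slot[1] << 8)
         | ((uint32_t)slot[2] << 16)
         | ((uint32_t)slot[3] << 24);
}
```

Storing 0x7FFFFFFF this way puts 0xFF in the first three bytes and 0x7F in the last one, which matches the little-endian table above.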
