What lives above the last accessible address in the stack? - memory

I've asked people before about why the stack doesn't start at 0x7fff...c before, and was told that typically 0x800... onwards is for the kernel, and the cli args and environment variables live at the top of the user's stack which is why it starts below 0x7fff...c. But I recently tried to examine all the strings with the following program
#include <stdio.h>
#include <string.h>
int main(int argc, const char **argv) {
const char *ptr = argv[0];
while (1) {
printf("%p: %s\n", ptr, ptr);
size_t len = strlen(ptr);
ptr = (void *)ptr + len + 1;
}
}
However, after displaying all my environment variables, I see the following (I compiled the program to an executable called ./t):
0x7ffc19f84fa0: <final env variable string>
0x7ffc19f84fee: _=./t
0x7ffc19f84ff4: ./t
0x7ffc19f84ff8:
0x7ffc19f84ff9:
0x7ffc19f84ffa:
0x7ffc19f84ffb:
0x7ffc19f84ffc:
0x7ffc19f84ffd:
0x7ffc19f84ffe:
0x7ffc19f84fff:
So it appears there's one extra empty byte after the null terminator for the ./t string at bytes 0x7ffc19f84ff4..0x7ffc19f84ff7, and after that I segfault so I guess that's the base of the stack. What actually lives in the remaining "empty" space before kernel memory starts?
Edit: I also tried the following:
global _start
extern print_hex, fgets, puts, print, exit
section .text
_start:
pop rdi
mov rcx, 0
_start_loop:
mov rdi, rsp
call print_hex
pop rdi
call puts
jmp _start_loop
mov rdi, 0
call exit
where print_hex is a routine I wrote elsewhere. It seems this is all I can get
0x00007ffcd272de28
./bin/main
0x00007ffcd272de30
abc
0x00007ffcd272de38
def
0x00007ffcd272de40
ghi
0x00007ffcd272de48
make: *** [Makefile:47: run] Segmentation fault
so it seems that even in _start we don't begin at 0x7fff...

Related

How to write to shared library code segment loaded into RAM memory?

I would like you to ask why I cannot write to the loaded shared library code segment in RAM memory in Linux 2.6.28.9 on MIPS CPU platform (LG TV). I am able to read bytes but not able to write anything. In the example source code below (cross-compiled in gcc) I get ERROR 22: Invalid argument (EINVAL) when write() function is called.
// this app tries to replace 4 bytes in code segment memory of loaded shared library
#include <stdio.h> // printf
#include <stdlib.h> // off_t
#include <dlfcn.h> // dlopen, dlclose
#include <fcntl.h> // open, O_RDWR
#include <unistd.h> // lseek, close, read
#include <errno.h> // errno
#include <string.h> // strerror
#include <sys/mman.h> // mprotect, PROT_READ, PROT_WRITE, PROT_EXEC
#define BYTES_TO_REPLACE 4
int main (int argc, char *argv[])
{
int fd, pid;
unsigned *handle;
unsigned long pagesize;
off_t fun_addr, pa_fun_addr;
unsigned char buf[BYTES_TO_REPLACE];
char s[100];
// initialize
pagesize = sysconf(_SC_PAGESIZE); // memory page size from system
pid = getpid(); // PID of current process
// open shared library file, OK
handle = dlopen("/path_to_library_files/shared_library.so", RTLD_LAZY | RTLD_GLOBAL);
// get function address, OK
fun_addr = (off_t)dlsym(handle, "function_name_in_lib");
// open memory device (pseudo-file), OK
sprintf(s, "/proc/%d/mem", pid); // memory space of our process (/proc/self/mem)
//strcpy(s, "/dev/mem"); // in that case when reading from memory ==> ERROR 14: Bad address
fd = open(s, O_RDWR); // open for reading and writing
// go to starting address of the library function loaded earlier, OK
lseek(fd, fun_addr, SEEK_SET);
// read from memory, OK
read(fd, buf, BYTES_TO_REPLACE);
printf("old replaced bytes = [%02X %02X %02X %02X]\n", buf[0], buf[1], buf[2], buf[3]);
// move back, OK
lseek(fd, fun_addr, SEEK_SET);
// unprotect memory page - no error, but does not help
pa_fun_addr = (fun_addr / pagesize) * pagesize; // page-aligned address
mprotect((void *)pa_fun_addr, pagesize, PROT_READ | PROT_WRITE | PROT_EXEC);
// write new data to memory: ERROR 22: Invalid argument
buf[0] = 0x08; buf[1] = 0x00; buf[2] = 0xE0; buf[3] = 0x03; // replacing 4-byte command: jr $ra (MIPS CPU)
if (write(fd, buf, BYTES_TO_REPLACE) != BYTES_TO_REPLACE) printf("ERROR %d: %s!\n", errno, strerror(errno));
// close memory device and shared library
close(fd);
dlclose(handle);
return 0;
}
This is because process code in memory doesn't have write permission by default. To see the permissions of a process's memory, use pmap:
For example, shared libraries below have only rx permission at most:
sudo pmap 5869
5869: vim supervisor_meeting-2017-05-22.txt
000055b391f62000 2604K r-x-- vim
000055b3923ed000 56K r---- vim
000055b3923fb000 100K rw--- vim
000055b392414000 60K rw--- [ anon ]
000055b393377000 2868K rw--- [ anon ]
00007fc59ef5a000 40K r-x-- libnss_files-2.24.so
00007fc59ef64000 2048K ----- libnss_files-2.24.so
00007fc59f164000 4K r---- libnss_files-2.24.so
00007fc59f165000 4K rw--- libnss_files-2.24.so
00007fc59f166000 24K rw--- [ anon ]
<..snip..>
I understand that you're trying to change this with mprotect - but you're also not checking the return value from the mprotect() call - this is probably failing for some reason.
Also, as an aside - write() is not guaranteed to write all bytes given to it, and is quite within it's design to return with no or a partial number of bytes written - I'd suggest you also change the code to reflect this.

Getting Beaglebone PRUs to work using PASM

I have been trying to get the PRU to work in a way that makes sense to me and at this point I am completely clueless. I can get the examples to work, but anytime I make a change or try to write things from scratch I just beat my head against the wall. I just want to as a start access any of the USRLEDS and turn them off or on at some speed, or as first pass turn on a LED and leave it on. Here is a PASM code I got off the internet (Will post link when I find it):
.origin 0
.entrypoint START
#define PRU0_ARM_INTERRUPT 19
#define AM33XX
#define GPIO1 0x4804c000 //Trying to access the GPIO1
#define GPIO_CLEARDATAOUT 0x190 //writing 1 to the bit you want cleared in GPIO_DATAOUT register (what does that mean?)
#define GPIO_SETDATAOUT 0x194 (set a value for GPIO output pins, which pins am I even writing to? GPIO1?
#define GPIO_OE 0x134 //enable the pins output capabilities
START:
//clear that bit
lbco r0, c4, 4, 4 //This creates a constant offset and stores in c4, but why do you need that?
CLR r0, r0, 4 //if you copied the data why do you need to clear it?
SBCO r0, C4, 4, 4 //What is this for?
//MOV r1, 10
MOV r2, 0x00000000 //store address 0x00 into r2, why?
MOV r3, GPIO1 //Store GPIO1 address in r3
MOV r4, GPIO_OE //place address of GPIO_OE into r4
MOV r5, GPIO_SETDATAOUT //store address of GPIO_SETDATAOUT in r5
MOV r6, GPIO_CLEARDATAOUT //store addres of GPIOCLEARDATAOUT in r6
SBBO r2, r3, r4,4 //What is this even doing? Copying 4 bytes from r2 into r3+r4, but why do you want to copy that way and if not why not?
MOV r1, 10
MOV r2, 0xFFFFFFFF //Suppossedly this turn the GPIO1 ON and OFF?
SBBO r2, r3, r6, 4 and again the storage stuff?
HALT
I am also attaching the C code that I am using:
#include <stdio.h>
#include <pruss/prussdrv.h>
#include <pruss/pruss_intc_mapping.h>
#define PRU_NUM 0 //defining which PRU to use
int main() {
int ret;
tpruss_intc_initdata intc = PRUSS_INTC_INITDATA;
//initialize the PRU by using init command from prussdrv.h
ret = prussdrv_init();
if(ret != 0) {
printf("Error returned: %d\n",ret);
printf("PRU unable to be initialized");
return -1;
}
ret = prussdrv_open(PRU_EVTOUT_0);
if(ret != 0) {
printf("Error returned for prussdrv_open(): %d\n",ret);
printf("PRU can't open PRU_EVTOUT_0");
return -1;
}
//Map PRUS's INTC
ret = prussdrv_pruintc_init(&intc);
if (ret != 0) {
printf("Error returned for prussdrv_pruintc_int\n");
printf("PRU doesn't work");
return -1;
}
//load and execute binary on PRU
prussdrv_exec_program(PRU_NUM, "./ashwini_test.bin");
prussdrv_pru_wait_event(PRU_EVTOUT_0);
prussdrv_pru_clear_event(PRU_EVTOUT_0,PRU0_ARM_INTERRUPT);
/*Disable PRU and close memory mappings*/
prussdrv_pru_disable(PRU_NUM);
prussdrv_exit();
//prussdrv_pru_wait_event(PRU_EVTOUT_0);
return 0;
}
I have gone through THE TRM and https://groups.google.com/forum/#!topic/beaglebone/98eF1wQE_QA, and elinux and derekmolloy, I just feel like I am missing something very basic about how address scheme work or how to think about these things. Thanks again for your help!
When you say that's your PASM code... do you mean it's some code you got from somewhere else that you're trying to use? Because the comments on most lines asking what they do makes it seem unlikely that it's actually your code...
Anyways, can't really answer unless you have a specific question, but there's plenty of info out there about how to use the GPIO subsystem on the BeagleBone's AM335x processor. I talked about it some in a post a while back here: https://graycat.io/tutorials/beaglebone-io-using-python-mmap/
I've also got a few documented PRU assembly examples here: https://github.com/alexanderhiam/PRU-stuffs

Where is the global memory replay overhead coming from?

Running the code below to write 1 GB in global memory in the NVIDIA Visual Profiler, I get:
- 100% storage efficiency
- 69.4% (128.6 GB/s) DRAM utilization
- 18.3% total replay overhead
- 18.3% global memory replay overhead.
The memory writes are supposed to be coalesced and there is no divergence in the kernel, so the question is where is the global memory replay overhead coming from? I am running this on Ubuntu 13.04, with nvidia-cuda-toolkit version 5.0.35-4ubuntu1.
#include <cuda.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <ctype.h>
#include <sched.h>
#include <assert.h>
static void
HandleError( cudaError_t err, const char *file, int line )
{
if (err != cudaSuccess) {
printf( "%s in %s at line %d\n", cudaGetErrorString(err), file, line);
exit( EXIT_FAILURE );
}
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))
// Global memory writes
__global__ void
kernel_write(uint32_t *start, uint32_t entries)
{
uint32_t tid = threadIdx.x + blockIdx.x*blockDim.x;
while (tid < entries) {
start[tid] = tid;
tid += blockDim.x*gridDim.x;
}
}
int main(int argc, char *argv[])
{
uint32_t *gpu_mem; // Memory pointer
uint32_t n_blocks = 256; // Blocks per grid
uint32_t n_threads = 192; // Threads per block
uint32_t n_bytes = 1073741824; // Transfer size (1 GB)
float elapsedTime; // Elapsed write time
// Allocate 1 GB of memory on the device
HANDLE_ERROR( cudaMalloc((void **)&gpu_mem, n_bytes) );
// Create events
cudaEvent_t start, stop;
HANDLE_ERROR( cudaEventCreate(&start) );
HANDLE_ERROR( cudaEventCreate(&stop) );
// Write to global memory
HANDLE_ERROR( cudaEventRecord(start, 0) );
kernel_write<<<n_blocks, n_threads>>>(gpu_mem, n_bytes/4);
HANDLE_ERROR( cudaGetLastError() );
HANDLE_ERROR( cudaEventRecord(stop, 0) );
HANDLE_ERROR( cudaEventSynchronize(stop) );
HANDLE_ERROR( cudaEventElapsedTime(&elapsedTime, start, stop) );
// Report exchange time
printf("#Delay(ms) BW(GB/s)\n");
printf("%10.6f %10.6f\n", elapsedTime, 1e-6*n_bytes/elapsedTime);
// Destroy events
HANDLE_ERROR( cudaEventDestroy(start) );
HANDLE_ERROR( cudaEventDestroy(stop) );
// Free memory
HANDLE_ERROR( cudaFree(gpu_mem) );
return 0;
}
The nvprof profiler and the API profiler are giving different results:
$ nvprof --events gst_request ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
13.345920 80.454690
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8388608 8388608 8388608 gst_request
$ nvprof --events global_store_transaction ./app
======== NVPROF is profiling app...
======== Command: app
#Delay(ms) BW(GB/s)
9.469216 113.392892
======== Profiling result:
Invocations Avg Min Max Event Name
Device 0
Kernel: kernel_write(unsigned int*, unsigned int)
1 8257560 8257560 8257560 global_store_transaction
I had the impression that global_store_transation could not be lower than gst_request. What is going on here? I can't ask for both events in the same command, so I had to run the two separate commands. Could this be the problem?
Strangely, the API profiler shows different results with perfect coalescing. Here is the output, I had to run twice to get the proper counters:
$ cat config.txt
inst_issued
inst_executed
gst_request
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log.csv COMPUTE_PROFILE_CONFIG=config.txt ./app
$ cat log.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eaca946b8
method,gputime,cputime,occupancy,inst_issued,inst_executed,gst_request,gld_request
_Z12kernel_writePjj,7771.776,7806.000,1.000,4737053,3900426,557058,0
$ cat config2.txt
global_store_transaction
$ COMPUTE_PROFILE=1 COMPUTE_PROFILE_CSV=1 COMPUTE_PROFILE_LOG=log2.csv COMPUTE_PROFILE_CONFIG=config2.txt ./app
$ cat log2.csv
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 580
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff67eea92d0e8
method,gputime,cputime,occupancy,global_store_transaction
_Z12kernel_writePjj,7807.584,7831.000,1.000,557058
Here gst_request and global_store_transactions are exactly the same, showing perfect coalescing. Which one is correct (nvprof or the API profiler)? Why does NVIDIA Visual Profiler says that I have non-coalesced writes? There are still significant instruction replays, and I have no idea where they are coming from :(
Any ideas? I don't think this is hardware malfunctioning, since I have two boards on the same machine and both show the same behavior.

Printing memory accesses in GDB

I am new to gdb. I want to print the memory addresses used with the actual sequence during execution of a c program. Let’s explain my question with an example. Let’s assume that we have the following c code with two functions main() and test(). I know that, inside gdb, I can use "disassemble main" to disassemble main() function, or "disassemble test" to disassemble test() function separately. My question is, how can I disassemble these two functions as a single code; so that, I can see all the memory addresses used during execution and their sequence of accesses? To be specific, as main() is calling test() and test() is also calling itself multiple times, I want to see something like example 2. I am also wandering, the addresses shown in gdb disassembler, are they virtual or physical memory addresses? Any help or guidance will be appreciated.
Example 1:
#include "stdio.h"
int test(int q)
{
if(q<16)
test(q+5);
return q;
}
void main()
{
unsigned int a=5;
unsigned int b=5;
unsigned int c=5;
test(a);
}
Example 2:
<Memory Address> <assembly instruction> <c instructions>
0x12546a mov //for unsigned int a=5;
0x12546b mov //for unsigned int b=5;
0x12546c mov //for unsigned int c=5;
0x12546d jmp //for test(q=a=5);
0x12546e cmpl //for if(q<16)
0x12546f jmp //for test(q+5);
0x12546d jmp //for test(q=10);
0x12546e cmpl //for if(q<16)
0x12546f jmp //for test(q+5);
0x12547a jmp //for test(q=15);
0x12547b cmpl //for if(q<16)
0x12547c jmp //for test(q+5);
0x12547d jmp //for test(q=20);
0x12547e cmpl //for if(q<16)
0x12547f jmp //return q);
0x12548a jmp //return q);
0x12548b jmp //return q);
0x12548c jmp //return q);
There's really no pretty way to do this. You're just going to have to step through the code:
(gdb) stepi
(gdb) x/i $pc
(gdb) info registers
(gdb) stepi
(gdb) x/i $pc
(gdb) info registers
.....
You could script that up so that it does it quickly and dumps the data to a file, but that's about all.
I suppose you may have more luck with valgrind. If there's no existing tool to do so, it is possible to add your own instrumentation to report memory accesses (and not only that), or alter an existing one.
E.g. see http://valgrind.org/docs/manual/lk-manual.html
--trace-mem= [default: no]
When enabled, Lackey prints the size and address of almost every memory access made by the program.

activeX control crash on alt+tab

I've implemented an Delphi acitveX control. Everything runs fine on html. After that, I embed that activeX in an MFC application. It's good, except that when I test it, if I alt+tab to another windows(Chrome, for example...) and alt+tab back to my application, it just crashed.
I know that this is a non-clear question, but I have no clues. Any clues to solve this situation? What may go wrong or what events I should take a look at?
Thanks.
Edit 1:
This is what I got when I use spy++ on it.
<04518> 00100B7C S WM_ACTIVATE fActive:WA_INACTIVE fMinimized:False hwndPrevious:(null)
<04519> 00100B7C S WM_ACTIVATETOPLEVEL fActive:False dwThreadID:0025F40C
<04520> 00100B7C R WM_ACTIVATETOPLEVEL
Edit 2:
This is what I got from call stack when I break all at the time it hangs
mfc100d.dll!CThreadSlotData::GetThreadValue(int nSlot) Line 248 C++
mfc100d.dll!CThreadLocalObject::GetData(CNoTrackObject * (void)* pfnCreateObject) Line 420 + 0x11 bytes C++
mfc100d.dll!CThreadLocal<AFX_MODULE_THREAD_STATE>::GetData() Line 179 + 0xd bytes C++
mfc100d.dll!AfxGetModuleThreadState() Line 477 + 0x11 bytes C++
mfc100d.dll!afxMapHWND(int bCreate) Line 289 + 0x5 bytes C++
mfc100d.dll!CWnd::FromHandlePermanent(HWND__ * hWnd) Line 324 + 0x7 bytes C++
mfc100d.dll!AfxWndProc(HWND__ * hWnd, unsigned int nMsg, unsigned int wParam, long lParam) Line 405 + 0x9 bytes C++
mfc100d.dll!AfxWndProcBase(HWND__ * hWnd, unsigned int nMsg, unsigned int wParam, long lParam) Line 420 + 0x15 bytes C++
Edit 3:
I'm using Visual studio 2010, the call stack tell that application hangs in this function, line 2, from afxtls.cpp. Please anyone shed some lights on how to solve this.
inline void* CThreadSlotData::GetThreadValue(int nSlot)
{
EnterCriticalSection(&m_sect);
ASSERT(nSlot != 0 && nSlot < m_nMax);
ASSERT(m_pSlotData != NULL);
ASSERT(m_pSlotData[nSlot].dwFlags & SLOT_USED);
ASSERT(m_tlsIndex != (DWORD)-1);
if( nSlot <= 0 || nSlot >= m_nMax ) // check for retail builds.
{
LeaveCriticalSection(&m_sect);
return NULL;
}
CThreadData* pData = (CThreadData*)TlsGetValue(m_tlsIndex);

Resources