Getting the nearest free memory with VirtualAllocEx

I want to get the nearest free memory address to allocate memory for a code cave, but I want it to stay within the jmp instruction limit (a signed 32-bit displacement, i.e. roughly ±2 GB). I'm trying the following code, but without much luck.
DWORD64 MemAddr = 0;
DWORD64 Address = 0x0000000140548AE6 & 0xFFFFFFFFFFFFF000;
HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, NULL, ProcessID);
if (hProc)
{
    for (DWORD offset = 0; (Address + 0x000000007FFFEFFF) > ((Address - 0x000000007FFFEFFF) + offset); offset += 100)
    {
        MemAddr = (DWORD64)VirtualAllocEx(hProc, (DWORD64*)((Address - 0x000000007FFFEFFF) + offset), MemorySize, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
        if (MemAddr)
            break;
    }
    CloseHandle(hProc);
    return MemAddr;
}
return 0;
The target process is 64-bit.

If the target process is x64, make sure you're compiling for x64 as well.
I have used this code for the same purpose, to find free memory within a 4 GB address range for doing x64 jmps for an x64 hook.
char* AllocNearbyMemory(HANDLE hProc, char* nearThisAddr)
{
    char* begin = nearThisAddr;
    char* end = nearThisAddr + 0x7FFF0000;  // stay within rel32 jmp reach

    MEMORY_BASIC_INFORMATION mbi{};
    auto curr = begin;
    while (curr < end && VirtualQueryEx(hProc, curr, &mbi, sizeof(mbi)))
    {
        if (mbi.State == MEM_FREE)
        {
            char* addr = (char*)VirtualAllocEx(hProc, mbi.BaseAddress, 0x1000,
                                               MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
            if (addr) return addr;
        }
        // advance to the first address after the region just queried
        curr = (char*)mbi.BaseAddress + mbi.RegionSize;
    }
    return 0;
}
Keep in mind there is no error checking; it's just a simple PoC.
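
To put the two pieces together for a code cave, you would allocate near the instruction you plan to patch and then write a 5-byte relative jmp whose rel32 displacement points into the cave. A minimal sketch of that second step (the E9 rel32 encoding and the range check are standard x64 details; the helper name WriteJmpToCave is made up here):

#include <windows.h>
#include <cstdint>
#include <cstring>

// Write a 5-byte "jmp rel32" at patchSite in the target process, aimed at cave.
bool WriteJmpToCave(HANDLE hProc, char* patchSite, char* cave)
{
    // rel32 is measured from the end of the 5-byte jmp instruction
    intptr_t delta = cave - (patchSite + 5);
    if (delta > INT32_MAX || delta < INT32_MIN)
        return false;                          // cave is out of jmp range after all

    unsigned char jmp[5] = { 0xE9 };           // 0xE9 = jmp rel32; bytes 1..4 hold the displacement
    int32_t rel = (int32_t)delta;
    std::memcpy(jmp + 1, &rel, sizeof(rel));

    DWORD oldProt = 0;
    VirtualProtectEx(hProc, patchSite, sizeof(jmp), PAGE_EXECUTE_READWRITE, &oldProt);
    BOOL ok = WriteProcessMemory(hProc, patchSite, jmp, sizeof(jmp), NULL);
    VirtualProtectEx(hProc, patchSite, sizeof(jmp), oldProt, &oldProt);
    return ok != FALSE;
}

Typical use would be char* cave = AllocNearbyMemory(hProc, target); followed by WriteJmpToCave(hProc, target, cave);. Note that AllocNearbyMemory only scans upward from nearThisAddr, so for a full ±2 GB search you would run a second pass scanning downward.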

Related

How to measure CPU memory bandwidth in C on an RTOS?

I am working on an ARMv7 embedded system that runs an RTOS on a Cortex-A7 SoC (2 cores, 256 KB L2 cache).
I want to measure the memory bandwidth of the CPU on this system, so I wrote the following functions to do the measurement.
The function allocates 32 MB of memory and reads it in 10 loops, timing each read pass to derive the memory read bandwidth. (I know the D-cache is involved, so the measurement is not precise.)
#define T_MEM_SIZE 0x2000000    /* 32 MB */

static void print_summary(char *tst, uint64_t ticks)
{
    uint64_t msz = T_MEM_SIZE / 1000000;    /* buffer size in MB */
    float msec = (float)ticks / 24000;      /* 24 MHz counter ticks -> milliseconds */

    printf("%s: %.2f MB/Sec\n", tst, 1000 * msz / msec);
}
static void memrd_cache(void)
{
    int *mptr = malloc(T_MEM_SIZE);
    register uint32_t i = 0;
    register int va = 0;
    uint32_t s, e, diff = 0, maxdiff = 0;
    uint16_t loop = 0;

    if (mptr == NULL)
        return;

    while (loop++ < 10) {
        s = read_cntpct();
        for (i = 0; i < T_MEM_SIZE/sizeof(int); i++) {
            va = mptr[i];
        }
        e = read_cntpct();
        diff = e - s;
        if (diff > maxdiff) {
            maxdiff = diff;
        }
    }
    free(mptr);
    print_summary("memrd", maxdiff);
}
Below is the read measurement that tries to remove the caching effect of the D-cache.
It reads 4 bytes from each cache line (the CPU may still fill the whole cache line) until the 256 KB L2 cache is full and has to be flushed/reloaded, so I think the D-cache effect should be minimized. (I may be wrong; please correct me.)
#define CLINE_SIZE 16 // 16 * 4B

static void memrd_nocache(void)
{
    int *mptr = malloc(T_MEM_SIZE);
    register uint32_t col = 0, ln = 0;
    register int va = 0;
    uint32_t s, e, diff = 0, maxdiff = 0;
    uint16_t loop = 0;

    if (mptr == NULL)
        return;

    while (loop++ < 10) {
        s = read_cntpct();
        for (col = 0; col < CLINE_SIZE; col++) {
            for (ln = 0; ln < T_MEM_SIZE/(CLINE_SIZE*sizeof(int)); ln++) {
                va = *(mptr + ln * CLINE_SIZE + col);
            }
        }
        e = read_cntpct();
        diff = e - s;
        if (diff > maxdiff) {
            maxdiff = diff;
        }
    }
    free(mptr);
    print_summary("memrd_nocache", maxdiff);
}
After running these two functions, I found the bandwidth is about:
memrd: 1973.04 MB/Sec
memrd_nocache: 1960.67 MB/Sec
The CPU runs at 1 GHz with DDR3 on die, yet the two tests give nearly the same number!? That is a big surprise to me.
I have worked with lmbench on a Linux ARM server, but I don't think it can be run on this embedded system.
So I want a software tool to measure memory bandwidth on this embedded system, either one from the community or something I write myself.
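
Two things are worth ruling out before trusting either number: va is never used, so an optimizing compiler is free to delete the read loops entirely, and the "cached" test streams a 32 MB buffer, far larger than the 256 KB L2, so it is already mostly a DRAM test. Below is a minimal sketch of a read loop that cannot be optimized away; the memrd_volatile name is made up, the 64-byte cache line is an assumption, and the 24 MHz counter frequency is the one implied by the /24000 in print_summary().

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE   0x2000000u      /* 32 MB, same as T_MEM_SIZE                */
#define LINE_BYTES 64u             /* assumed cache-line size on the Cortex-A7 */

extern uint32_t read_cntpct(void); /* platform counter, as used in the question */

static void memrd_volatile(void)
{
    /* volatile forces the compiler to issue every load instead of
       removing the loop because the value is never used */
    volatile int *mptr = (volatile int *)malloc(BUF_SIZE);
    int sink = 0;
    uint32_t i, s, e;

    if (mptr == NULL)
        return;

    s = read_cntpct();
    /* touch one int per cache line; the hardware still fetches the whole
       64-byte line, so the DRAM traffic is the full 32 MB buffer */
    for (i = 0; i < BUF_SIZE / sizeof(int); i += LINE_BYTES / sizeof(int))
        sink += mptr[i];
    e = read_cntpct();

    free((void *)mptr);
    /* bandwidth ~= BUF_SIZE / ((e - s) / 24e6) bytes per second */
    printf("memrd_volatile: %u ticks (sink=%d)\n", (unsigned)(e - s), sink);
}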

Performance of ADD and OR instruction in X86

Basically, the question is which instruction takes less time to execute (or whether they take exactly the same time):
add rax, rbx
; or
or rax, rbx
For example, if I want to access a virtual CPU core efficiently and OR executes faster, then the code I should write would be something like this (C):
#include <stdlib.h>
#include <stdint.h>

struct CPUCore { /* ... */ };
// sizeof(CPUCore) == 64, for example

// normal allocation
CPUCore* allocNormal() {
    CPUCore* inst = (CPUCore*)malloc(sizeof(CPUCore));
    return inst;
}

// aligning allocation
struct Result {
    // release the memory with 'free(res.block)'
    // use the object through 'res.ptr'
    CPUCore* block;
    CPUCore* ptr;
};
Result allocAlign() {
    // allocating 2 times the bytes in order to ensure a 64-byte block
    // with address & 63 == 0, so instead of 'ptr + shift'
    // we can use 'ptr | shift'
    Result ret;
    ret.block = (CPUCore*)malloc(sizeof(CPUCore) * 2);
    uintptr_t p = (uintptr_t)ret.block;
    ret.ptr = (CPUCore*)(p + 0x40 - (p & 0x3F));  // round up to the next 64-byte boundary
    return ret;
}
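
For what it's worth, on the x86 cores covered by Agner Fog's instruction tables, register-register add and or have the same one-cycle latency and the same throughput, so ptr | shift is unlikely to buy anything over ptr + shift. If the real goal is just the 64-byte alignment, the over-allocate-and-round trick can also be replaced with aligned_alloc; a small sketch, assuming a C11/C++17 runtime that provides it (MSVC would use _aligned_malloc instead) and a CPUCore whose size is a multiple of 64:

#include <stdlib.h>

// aligned_alloc requires the size to be a multiple of the alignment,
// which holds if sizeof(CPUCore) == 64 as in the example above.
CPUCore* allocAligned(void) {
    return (CPUCore*)aligned_alloc(64, sizeof(CPUCore));  // (uintptr_t)ptr & 63 == 0
}

The whole object is then released with a single free(ptr); there is no separate 'block' pointer to carry around.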

realloc() in a 64bit iOS device

When I use the C function realloc(p, size) in my project, the code runs well both in the simulator and on an iPhone 5.
However, when the code runs on an iPhone 6 Plus, something odd happens: the contents a points to are changed! What's worse, the output is different every time the function runs!
Here is the test code:
#define LENGTH (16)

- (void)viewDidLoad {
    [super viewDidLoad];
    char* a = (char*)malloc(LENGTH+1);
    a[0] = 33;
    a[1] = -14;
    a[2] = 76;
    a[3] = -128;
    a[4] = 25;
    a[5] = 49;
    a[6] = -45;
    a[7] = 56;
    a[8] = -36;
    a[9] = 56;
    a[10] = 125;
    a[11] = 44;
    a[12] = 26;
    a[13] = 79;
    a[14] = 29;
    a[15] = 66;
    // print the binary contents pointed to by a; printing each char as an int also shows the difference
    [self printInByte:a];
    realloc(a, LENGTH);
    [self printInByte:a];
    free(a);
}

- (void)printInByte:(char*)t {
    for (int i = 0; i < LENGTH; ++i) {
        for (int j = -1; j < 16; (t[i] >> ++j) & 1 ? printf("1") : printf("0"));
        printf(" ");
        if (i % 4 == 3) {
            printf("\n");
        }
    }
}
One more thing, when LENGTH is not 16, it runs well with assigning up to a[LENGTH-1]. However, if LENGTH is exactly 16, things go wrong.
Any ideas will be appreciated.
From the realloc man doc:
The realloc() function tries to change the size of the allocation pointed to by ptr to size, and returns ptr. If there is not enough room to enlarge the memory allocation pointed to by ptr, realloc() creates a new allocation, copies as much of the old data pointed to by ptr as will fit to the new allocation, frees the old allocation, and returns a pointer to the allocated memory.
So you must write a = realloc(a, LENGTH);, because when the already-allocated memory block can't be made larger, the zone will be moved to another location and so the address of the zone will change. That is why realloc returns a memory pointer.
You need to do:
a = realloc(a, LENGTH);
The reason is that realloc() may move the block, so you need to update your pointer to the new address:
http://www.cplusplus.com/reference/cstdlib/realloc/
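
Also be careful about assigning the result of realloc straight back to the only copy of the pointer: if realloc fails and returns NULL, the original block is leaked. The usual pattern goes through a temporary; a small sketch (the resizeBuffer name is made up):

#include <stdlib.h>

// Resize *buf to newSize without losing the original block when realloc fails.
static int resizeBuffer(char **buf, size_t newSize) {
    char *tmp = (char *)realloc(*buf, newSize);
    if (tmp == NULL)
        return -1;   /* *buf is untouched and still valid; the caller frees it later */
    *buf = tmp;      /* realloc may have moved the block to a new address */
    return 0;
}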

Cuda-memcheck not reporting out of bounds shared memory access

I am running the following code using shared memory:
__global__ void computeAddShared(int *in, int *out, int sizeInput){
    // not named gidata and godata, to emphasize that parameters get a copy of the address and are different from the pointers in host code
    extern __shared__ float temp[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int ltid = threadIdx.x;
    temp[ltid] = 0;
    while(tid < sizeInput){
        temp[ltid] += in[tid];
        tid += gridDim.x * blockDim.x; // to handle an array of any size
    }
    __syncthreads();
    int offset = 1;
    while(offset < blockDim.x){
        if(ltid % (offset * 2) == 0){
            temp[ltid] = temp[ltid] + temp[ltid + offset];
        }
        __syncthreads();
        offset *= 2;
    }
    if(ltid == 0){
        out[blockIdx.x] = temp[0];
    }
}
int main(){
    int size = 16; // size of the present input array; changes after every loop iteration
    int cidata[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    /*FILE *f;
    f = fopen("invertedList.txt" , "w");
    a[0] = 1 + (rand() % 8);
    fprintf(f, "%d,",a[0]);
    for( int i = 1 ; i< N; i++){
        a[i] = a[i-1] + (rand() % 8) + 1;
        fprintf(f, "%d,",a[i]);
    }
    fclose(f);*/
    int* gidata;
    int* godata;
    cudaMalloc((void**)&gidata, size * sizeof(int));
    cudaMemcpy(gidata, cidata, size * sizeof(int), cudaMemcpyHostToDevice);
    int TPB = 4;
    int blocks = 10; // to get things kicked off
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    while(blocks != 1){
        if(size < TPB){
            TPB = size; // size is 2^sth
        }
        blocks = (size + TPB - 1) / TPB;
        cudaMalloc((void**)&godata, blocks * sizeof(int));
        computeAddShared<<<blocks, TPB, TPB>>>(gidata, godata, size);
        cudaFree(gidata);
        gidata = godata;
        size = blocks;
    }
    //printf("The error by cuda is %s",cudaGetErrorString(cudaGetLastError()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("time is %f ms", elapsedTime);
    int *output = (int*)malloc(sizeof(int));
    cudaMemcpy(output, gidata, sizeof(int), cudaMemcpyDeviceToHost);
    // Can't free either earlier as both point to the same location
    cudaError_t chk = cudaFree(godata);
    if(chk != 0){
        printf("First chk also printed error. Maybe error in my logic\n");
    }
    printf("The error by threadsyn is %s", cudaGetErrorString(cudaGetLastError()));
    printf("The sum of the array is %d\n", output[0]);
    getchar();
    return 0;
}
Clearly, the first while loop in computeAddShared is causing an out-of-bounds error, because I am allocating only 4 bytes of shared memory. Why does cuda-memcheck not catch this? Below is the output of cuda-memcheck:
========= CUDA-MEMCHECK
time is 12.334816 msThe error by threadsyn is no errorThe sum of the array is 136
========= ERROR SUMMARY: 0 errors
Shared memory allocation granularity. The hardware undoubtedly has a page size for allocations (probably the same as the L1 cache line size). With only 4 threads per block, there will "accidentally" be enough shared memory in a single page to let your code work. If you used a sensible number of threads per block (i.e. a round multiple of the warp size), the error would be detected because there would not be enough allocated memory.
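
The root cause is in the launch configuration itself: the third launch parameter is a byte count, so <<<blocks, TPB, TPB>>> requests only TPB bytes of dynamic shared memory while the kernel indexes TPB elements of temp. A sketch of the corrected launch as it would sit inside the loop in main (sizing by the declared element type, plus an immediate error check, which is assumed to be what you want):

// Dynamic shared memory is requested in bytes, one element per thread here.
// The kernel declares `extern __shared__ float temp[]`, so size by float
// (or change temp to int to match the int data it accumulates).
size_t shmemBytes = TPB * sizeof(float);
computeAddShared<<<blocks, TPB, shmemBytes>>>(gidata, godata, size);

// Checking the launch right away surfaces configuration errors that the
// memcheck pass alone may not flag:
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("launch failed: %s\n", cudaGetErrorString(err));
}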

What does the DeviceIoControl API function do?

Please explain what this VC++ code does. Is it possible to convert this code to Delphi 2010?
void CDMOnLineView::OnActionGetdata()
{
    bool retCode;
    DWORD retByte = 0;
    int TmpHigh, TmpLow;
    UCHAR HIDData[64];
    int LastX, LastY;
    UCHAR Button;
    CDC* pViewDC = GetDC();

    if(yPos > 500) yPos = 0;
    else yPos = yPos + 16;

    if(hDriver == NULL)
    {
        pViewDC->TextOut(10, yPos, "Driver not connect yet.");
    }
    else
    {
        IO_Param.CallerHandle = m_hWnd;
        IO_Param.Model = DM_A4;
        retCode = DeviceIoControl(hDriver, IOCTL_DM_READ_DATA, &IO_Param, sizeof(DM_PARAM),
                                  HIDData, 6, &retByte, NULL);
        if(retCode)
        {
            if(retByte != 0)
            {
                Button = HIDData[1] & 0x01;
                TmpLow  = (int)HIDData[2];
                TmpHigh = (int)HIDData[3];
                LastX = (TmpLow & 0x00FF) | ((TmpHigh << 8) & 0xFF00);
                TmpLow  = (int)HIDData[4];
                TmpHigh = (int)HIDData[5];
                LastY = (TmpLow & 0x00FF) | ((TmpHigh << 8) & 0xFF00);
                sprintf(szStringBuffer, "Button: %d, X: %.5d, Y: %.5d", Button, LastX, LastY);
                pViewDC->TextOut(10, yPos, szStringBuffer, strlen(szStringBuffer));
            }
            else pViewDC->TextOut(10, yPos, "Return bytes incorrect.");
        }
        else
        {
            ErrorCode = GetLastError();
            sprintf(szStringBuffer, "Call IOCTL_DM_READ_DATA fail. Error: %d", ErrorCode);
            pViewDC->TextOut(10, yPos, szStringBuffer, strlen(szStringBuffer));
        }
    }
    ReleaseDC(pViewDC);
}
What does the DeviceIoControl function do? Please try to explain the parameters as well.
Thanks, all.
Here's the "translation" of all those bitwise operations in the code, hopefully those would get you going:
The operators you need to know about:
& as the bitwise AND operator.
| is the bitwise OR operator
<< is the bitwise SHIFT LEFT operator
The translations:
Button = HIDData[1] & 0x01; // C
Button := HIDData[1] and $01; // Delphi
TmpLow = (int)HIDData[2]; // C
TmpLow := Integer(HIDData[2]); // Delphi
TmpHigh = (int)HIDData[3]; // C
TmpHigh := Integer(HidData[3]); // Delphi
LastX = (TmpLow & 0x00FF) | ((TmpHigh << 8) & 0xFF00); // C
LastX := (TmpLow and $00FF) or ((TmpHigh shl 8) and $FF00); // Delphi
TmpLow = (int)HIDData[4]; // C
TmpLow := Integer(HIDData[4]); // Delphi
TmpHigh = (int)HIDData[5]; // C
TmpHigh := Integer(HIDData[5]); // Delphi
LastY = (TmpLow & 0x00FF) | ((TmpHigh << 8) & 0xFF00); // C
LastY := (TmpLow and $00FF) or ((TmpHigh shl 8) and $FF00); // Delphi
sprintf(szStringBuffer, "Button: %d, X: %.5d, Y: %.5d", Button, LastX, LastY); // C
pViewDC->TextOut(10,yPos,szStringBuffer, strlen(szStringBuffer)); // C
Caption := Format('Button: %d, x: %.5d, y: %.5d', [Button, LastX, LastY]); // Delphi
DeviceIoControl calls a custom driver function. A driver is a kernel-mode program representing some device in the computer. Drivers have standard operations (such as open, close, read and write, which are invoked through the CreateFile, CloseHandle, ReadFile and WriteFile APIs) and custom driver-specific operations, which are invoked through DeviceIoControl. Details of these operations are described in the driver's documentation.
Every custom operation has a generic interface: an operation code plus input and output buffers, which may contain any information.
The DeviceIoControl function is documented on MSDN. User-mode programs use it to interact with device drivers.
Converting this code is pretty simple. The call to DeviceIoControl maps across trivially. The only area that you are likely to struggle with is the C bitwise operations. If you don't have a copy of K&R to hand, then you should!
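
For reference, here is the call from the question with each DeviceIoControl parameter annotated against the documented Win32 signature (IOCTL_DM_READ_DATA, IO_Param and DM_PARAM are driver-specific names from the original code, not standard Windows identifiers):

// BOOL DeviceIoControl(HANDLE, DWORD, LPVOID, DWORD, LPVOID, DWORD, LPDWORD, LPOVERLAPPED)
retCode = DeviceIoControl(
    hDriver,              // handle to the device, opened earlier (typically via CreateFile)
    IOCTL_DM_READ_DATA,   // driver-specific control code: which custom operation to run
    &IO_Param,            // input buffer the driver reads (here a DM_PARAM structure)
    sizeof(DM_PARAM),     // size of that input buffer, in bytes
    HIDData,              // output buffer the driver fills with the result
    6,                    // number of output bytes the caller allows
    &retByte,             // receives how many bytes the driver actually wrote
    NULL);                // no OVERLAPPED structure, so the call completes synchronously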
