Proper PE base relocation - memory

I'm trying to run a WIN32 PE executable from memory (not for malware just for software protection purposes). When I allocate at the desired image base address (0x00400000) it works perfectly. But this is not ideal since this address is not always available, sometimes even already in use by the current process depending on ASLR.
Instead I have to relocate the image with the new address obtained from VirtualAlloc() using this generic code.
while (pIBR->VirtualAddress)
{
if (pIBR->SizeOfBlock >= sizeof(IMAGE_BASE_RELOCATION))
{
count = (pIBR->SizeOfBlock - sizeof(IMAGE_BASE_RELOCATION)) / sizeof(WORD);
list = (PWORD)(pIBR + 1);
for (i = 0; i < count; i++)
{
if (list[i])
{
ptr = (PDWORD)((LPBYTE)image + (pIBR->VirtualAddress + (list[i] & 0xFFF)));
*ptr += delta;
}
}
}
pIBR = (PIMAGE_BASE_RELOCATION)((LPBYTE)pIBR + pIBR->SizeOfBlock);
}
which works fine for simple executable's, but more complex executable's with resources, TLS, and various other things, don't load correctly or at all.
My question, is there a better way of doing image relocation, or how can I always reserve the address 0x00400000 for my new PE image.
Thanks.

Related

writing to flash memory dspic33e

I have some questions regarding the flash memory with a dspic33ep512mu810.
I'm aware of how it should be done:
set all the register for address, latches, etc. Then do the sequence to start the write procedure or call the builtins function.
But I find that there is some small difference between what I'm experiencing and what is in the DOC.
when writing the flash in WORD mode. In the DOC it is pretty straightforward. Following is the example code in the DOC
int varWord1L = 0xXXXX;
int varWord1H = 0x00XX;
int varWord2L = 0xXXXX;
int varWord2H = 0x00XX;
int TargetWriteAddressL; // bits<15:0>
int TargetWriteAddressH; // bits<22:16>
NVMCON = 0x4001; // Set WREN and word program mode
TBLPAG = 0xFA; // write latch upper address
NVMADR = TargetWriteAddressL; // set target write address
NVMADRU = TargetWriteAddressH;
__builtin_tblwtl(0,varWord1L); // load write latches
__builtin_tblwth(0,varWord1H);
__builtin_tblwtl(0x2,varWord2L);
__builtin_tblwth(0x2,varWord2H);
__builtin_disi(5); // Disable interrupts for NVM unlock sequence
__builtin_write_NVM(); // initiate write
while(NVMCONbits.WR == 1);
But that code doesn't work depending on the address where I want to write. I found a fix to write one WORD but I can't write 2 WORD where I want. I store everything in the aux memory so the upper address(NVMADRU) is always 0x7F for me. The NVMADR is the address I can change. What I'm seeing is that if the address where I want to write modulo 4 is not 0 then I have to put my value in the 2 last latches, otherwise I have to put the value in the first latches.
If address modulo 4 is not zero, it doesn't work like the doc code(above). The value that will be at the address will be what is in the second set of latches.
I fixed it for writing only one word at a time like this:
if(Address % 4)
{
__builtin_tblwtl(0, 0xFFFF);
__builtin_tblwth(0, 0x00FF);
__builtin_tblwtl(2, ValueL);
__builtin_tblwth(2, ValueH);
}
else
{
__builtin_tblwtl(0, ValueL);
__builtin_tblwth(0, ValueH);
__builtin_tblwtl(2, 0xFFFF);
__builtin_tblwth(2, 0x00FF);
}
I want to know why I'm seeing this behavior?
2)I also want to write a full row.
That also doesn't seem to work for me and I don't know why because I'm doing what is in the DOC.
I tried a simple write row code and at the end I just read back the first 3 or 4 element that I wrote to see if it works:
NVMCON = 0x4002; //set for row programming
TBLPAG = 0x00FA; //set address for the write latches
NVMADRU = 0x007F; //upper address of the aux memory
NVMADR = 0xE7FA;
int latchoffset;
latchoffset = 0;
__builtin_tblwtl(latchoffset, 0);
__builtin_tblwth(latchoffset, 0); //current = 0, available = 1
latchoffset+=2;
__builtin_tblwtl(latchoffset, 1);
__builtin_tblwth(latchoffset, 1); //current = 0, available = 1
latchoffset+=2;
.
. all the way to 127(I know I could have done it in a loop)
.
__builtin_tblwtl(latchoffset, 127);
__builtin_tblwth(latchoffset, 127);
INTCON2bits.GIE = 0; //stop interrupt
__builtin_write_NVM();
while(NVMCONbits.WR == 1);
INTCON2bits.GIE = 1; //start interrupt
int testaddress;
testaddress = 0xE7FA;
status = NVMemReadIntH(testaddress);
status = NVMemReadIntL(testaddress);
testaddress += 2;
status = NVMemReadIntH(testaddress);
status = NVMemReadIntL(testaddress);
testaddress += 2;
status = NVMemReadIntH(testaddress);
status = NVMemReadIntL(testaddress);
testaddress += 2;
status = NVMemReadIntH(testaddress);
status = NVMemReadIntL(testaddress);
What I see is that the value that is stored in the address 0xE7FA is 125, in 0xE7FC is 126 and in 0xE7FE is 127. And the rest are all 0xFFFF.
Why is it taking only the last 3 latches and write them in the first 3 address?
Thanks in advance for your help people.
The dsPIC33 program memory space is treated as 24 bits wide, it is
more appropriate to think of each address of the program memory as a
lower and upper word, with the upper byte of the upper word being
unimplemented
(dsPIC33EPXXX datasheet)
There is a phantom byte every two program words.
Your code
if(Address % 4)
{
__builtin_tblwtl(0, 0xFFFF);
__builtin_tblwth(0, 0x00FF);
__builtin_tblwtl(2, ValueL);
__builtin_tblwth(2, ValueH);
}
else
{
__builtin_tblwtl(0, ValueL);
__builtin_tblwth(0, ValueH);
__builtin_tblwtl(2, 0xFFFF);
__builtin_tblwth(2, 0x00FF);
}
...will be fine for writing a bootloader if generating values from a valid Intel HEX file, but doesn't make it simple for storing data structures because the phantom byte is not taken into account.
If you create a uint32_t variable and look at the compiled HEX file, you'll notice that it in fact uses up the least significant words of two 24-bit program words. I.e. the 32-bit value is placed into a 64-bit range but only 48-bits out of the 64-bits are programmable, the others are phantom bytes (or zeros). Leaving three bytes per address modulo of 4 that are actually programmable.
What I tend to do if writing data is to keep everything 32-bit aligned and do the same as the compiler does.
Writing:
UINT32 value = ....;
:
__builtin_tblwtl(0, value.word.word_L); // least significant word of 32-bit value placed here
__builtin_tblwth(0, 0x00); // phantom byte + unused byte
__builtin_tblwtl(2, value.word.word_H); // most significant word of 32-bit value placed here
__builtin_tblwth(2, 0x00); // phantom byte + unused byte
Reading:
UINT32 *value
:
value->word.word_L = __builtin_tblrdl(offset);
value->word.word_H = __builtin_tblrdl(offset+2);
UINT32 structure:
typedef union _UINT32 {
uint32_t val32;
struct {
uint16_t word_L;
uint16_t word_H;
} word;
uint8_t bytes[4];
} UINT32;

How to send cluster in separated node ros pcl

Hi i'm new in pointcloud library. I'm trying to show clustering result point on rviz or pcl viewer, and then show nothing. And i realize that my data show nothing too when i subcsribe and cout that. Hopefully can help my problem, thanks
This is my code for clustering and send node
void cloudReceive(const sensor_msgs::PointCloud2ConstPtr& inputMsg){
mutex_lock.lock();
pcl::fromROSMsg(*inputMsg, *inputCloud);
cout<<inputCloud<<endl;
pcl::search::KdTree<pcl::PointXYZRGB>::Ptr tree (new pcl::search::KdTree<pcl::PointXYZRGB>);
tree->setInputCloud(inputCloud);
std::vector<pcl::PointIndices> cluster_indices;
pcl::EuclideanClusterExtraction<pcl::PointXYZRGB> ec;
ec.setClusterTolerance(0.03);//2cm
ec.setMinClusterSize(200);//min points
ec.setMaxClusterSize(1000);//max points
ec.setSearchMethod(tree);
ec.setInputCloud(inputCloud);
ec.extract(cluster_indices);
if(cluster_indices.size() > 0){
std::vector<pcl::PointIndices>::const_iterator it;
int i = 0;
for (it = cluster_indices.begin(); it != cluster_indices.end(); ++it){
if(i >= 10)
break;
cloud_cluster[i]->points.clear();
std::vector<int>::const_iterator idx_it;
for (idx_it = it->indices.begin(); idx_it != it->indices.end(); idx_it++)
cloud_cluster[i]->points.push_back(inputCloud->points[*idx_it]);
cloud_cluster[i]->width = cloud_cluster[i]->points.size();
// cloud_cluster[i]->height = 1;
// cloud_cluster[i]->is_dense = true;
cout<<"PointCloud representing the Cluster: " << cloud_cluster[i]->points.size() << " data points"<<endl;
std::stringstream ss;
ss<<"cobaa_pipecom2_cluster_"<< i << ".pcd";
writer.write<pcl::PointXYZRGB> (ss.str(), *cloud_cluster[i], false);
pcl::toROSMsg(*cloud_cluster[i], outputMsg);
// cout<<"data = "<< outputMsg <<endl;
cloud_cluster[i]->header.frame_id = FRAME_ID;
pclpub[i++].publish(outputMsg);
// i++;
}
}
else
ROS_INFO_STREAM("0 clusters extracted\n");
}
And this one is the main
int main(int argc, char** argv){
for (int z = 0; z < 10; z++) {
// std::cout << " - clustering/" << z << std::endl;
cloud_cluster[z] = pcl::PointCloud<pcl::PointXYZRGB>::Ptr(new pcl::PointCloud<pcl::PointXYZRGB>);
cloud_cluster[z]->height = 1;
cloud_cluster[z]->is_dense = true;
// cloud_cluster[z]->header.frame_id = FRAME_ID;
}
ros::init(argc,argv,"clustering");
ros::NodeHandlePtr nh(new ros::NodeHandle());
pclsub = nh->subscribe("/pclsegmen",1,cloudReceive);
std::string pub_str("clustering/0");
for (int z = 0; z < 10; z++) {
pub_str[11] = z + 48;//48=0(ASCII)
// z++;
pclpub[z] = nh->advertise <sensor_msgs::PointCloud2> (pub_str, 1);
}
// pclpub = nh->advertise<sensor_msgs::PointCloud2>("/pclcluster",1);
ros::spin();
}
This isn't an exact answer, but I think it addresses your issue & may ease your debugging.
RViz can directly subscribe to a published point cloud, the one I'm assuming you're trying to see in the cloud_receive callback. If you set the Frame to whichever frame it's being published at, and add it from the available topics, you should see the points. (Easier than trying to rebroadcast it as different topics).
Also, I recommend looking at the rostopic command line tool. You can do rostopic list to check if it's being published, rostopic bw to see if it's really publishing the expected volume of data (ex bytes vs kilobytes vs megabytes), rostopic hz to see how frequently (if ever) it's publishing, and (briefly) rostopic echo to look at the data itself. (This is me assuming from your question it's more an issue with the data coming into your node).
If you're having trouble, not with data coming into the node, nor with the visualization of pointcloud data in general, but with the transformed data that's supposed to come out of the node, I would check that the clustering worked, & reduce your code moreso to just having 1 publisher publish something. You may be doing something weird. Like messing up your pointers. You could also turn on stronger compilation warnings for your node with -Wall -Wextra -Werror or step through the execution of it via gdb (launch-prefix="xterm -e gdb --args").
The solution is, i change the ASCII number into lexical_cast. Thanks for your response, i hope this can help other
for (int z = 0; z < CLOUD_QTD; z++) {
// pub_str[11] = z + 48;
std::string topicName = "/pclcluster/" + boost::lexical_cast<std::string>(z);
global::pub[z] = n.advertise <sensor_msgs::PointCloud2> (topicName, 1);
}

Is it possible to use GPU just for calculations without copying the variables to GPU. Can we use RAM or even HardDisk for variable stoorage? [duplicate]

I tried the code in this link Is CUDA pinned memory zero-copy?
The one who asked claims the program worked fine for him
But does not work the same way on mine
the values does not change if I manipulate them in the kernel.
Basically my problem is, my GPU memory is not enough but I want to do calculations which require more memory. I my program to use RAM memory, or host memory and be able to use CUDA for calculations. The program in the link seemed to solve my problem but the code does not give output as shown by the guy.
Any help or any working example on Zero copy memory would be useful.
Thank you
__global__ void testPinnedMemory(double * mem)
{
double currentValue = mem[threadIdx.x];
printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
mem[threadIdx.x] = currentValue+10;
}
void test()
{
const size_t THREADS = 8;
double * pinnedHostPtr;
cudaHostAlloc((void **)&pinnedHostPtr, THREADS, cudaHostAllocDefault);
//set memory values
for (size_t i = 0; i < THREADS; ++i)
pinnedHostPtr[i] = i;
//call kernel
dim3 threadsPerBlock(THREADS);
dim3 numBlocks(1);
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(pinnedHostPtr);
//read output
printf("Data after kernel execution: ");
for (int i = 0; i < THREADS; ++i)
printf("%f ", pinnedHostPtr[i]);
printf("\n");
}
First of all, to allocate ZeroCopy memory, you have to specify cudaHostAllocMapped flag as an argument to cudaHostAlloc.
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
Still the pinnedHostPointer will be used to access the mapped memory from the host side only. To access the same memory from device, you have to get the device side pointer to the memory like this:
double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
Pass this pointer as kernel argument.
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
Also, you have to synchronize the kernel execution with the host to read the updated values. Just add cudaDeviceSynchronize after the kernel call.
The code in the linked question is working, because the person who asked the question is running the code on a 64 bit OS with a GPU of Compute Capability 2.0 and TCC enabled. This configuration automatically enables the Unified Virtual Addressing feature of the GPU in which the device sees host + device memory as a single large memory instead of separate ones and host pointers allocated using cudaHostAlloc can be passed directly to the kernel.
In your case, the final code will look like this:
#include <cstdio>
__global__ void testPinnedMemory(double * mem)
{
double currentValue = mem[threadIdx.x];
printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
mem[threadIdx.x] = currentValue+10;
}
int main()
{
const size_t THREADS = 8;
double * pinnedHostPtr;
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
//set memory values
for (size_t i = 0; i < THREADS; ++i)
pinnedHostPtr[i] = i;
double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
//call kernel
dim3 threadsPerBlock(THREADS);
dim3 numBlocks(1);
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
cudaDeviceSynchronize();
//read output
printf("Data after kernel execution: ");
for (int i = 0; i < THREADS; ++i)
printf("%f ", pinnedHostPtr[i]);
printf("\n");
return 0;
}

C++ memory issue

I'm currently building a prime number finder, and am having a memory problem:
This may be due to a corruption of the heap, which indicates a bug in PrimeNumbers.exe or any of the DLLs it has loaded.
PS. Please don't say to me if this isn't the way to find prime numbers, I want to figure it out myself!
Code:
// PrimeNumbers.cpp : main project file.
#include "stdafx.h"
#include <vector>
using namespace System;
using namespace std;
int main(array<System::String ^> ^args)
{
Console::WriteLine(L"Until what number do you want to stop?");
signed const int numtstop = Convert::ToInt16(Console::ReadLine());
bool * isvalid = new bool[numtstop];
int allattempts = numtstop*numtstop; // Find all the possible combinations of numbers
for (int currentnumb = 0; currentnumb <= allattempts; currentnumb++) // For each number try to find a combination
{
for (int i = 0; i <= numtstop; i++)
{
for (int tnumb = 0; tnumb <= numtstop; tnumb++)
{
if (i*tnumb == currentnumb)
{
isvalid[currentnumb] = false;
Console::WriteLine("Error");
}
}
}
}
Console::WriteLine(L"\nAll prime number in the range of:" + Convert::ToString(numtstop));
for (int pnts = 0; pnts <= numtstop; pnts++)
{
if (isvalid[pnts] != false)
{
Console::WriteLine(pnts);
}
}
return 0;
}
I don't see the memory problem.
Please help.
You are allocating numtstop booleans, but you index that array using a variable that ranges from zero to numtstop*numtstop. This will be severely out of bounds for all numstop values greater than 1.
You should either allocate more booleans (numtstop*numtstop) or use a different variable to index into isvalid (for example, i, which ranges from 0 to numstop). I am sorry, I cannot be more precise than that because of your request not to comment on your algorithm of finding primes.
P.S. If you would like to read something on the topic of finding small primes, here is a link to a great book by Dijkstra. He teaches you how to construct a program for the first 1000 primes on pages 35..49.
Problem is that you use native C++ in managed C++/CLI code. And use new without delete of course.
`currentnumb` :
is bigger than the size of the array, which is just numtstop. You are probably going out of bound, this might be your issue.
You never delete[] your isvalid local, this is a memory leak.

Can the STREAM and GUPS (single CPU) benchmark use non-local memory in NUMA machine

I want to run some tests from HPCC, STREAM and GUPS.
They will test memory bandwidth, latency, and throughput (in term of random accesses).
Can I start Single CPU test STREAM or Single CPU GUPS on NUMA node with memory interleaving enabled? (Is it allowed by the rules of HPCC - High Performance Computing Challenge?)
Usage of non-local memory can increase GUPS results, because it will increase 2- or 4- fold the number of memory banks, available for random accesses. (GUPS typically limited by nonideal memory-subsystem and by slow memory bank opening/closing. With more banks it can do update to one bank, while the other banks are opening/closing.)
Thanks.
UPDATE:
(you may nor reorder the memory accesses that the program makes).
But can compiler reorder loops nesting? E.g. hpcc/RandomAccess.c
/* Perform updates to main table. The scalar equivalent is:
*
* u64Int ran;
* ran = 1;
* for (i=0; i<NUPDATE; i++) {
* ran = (ran << 1) ^ (((s64Int) ran < 0) ? POLY : 0);
* table[ran & (TableSize-1)] ^= stable[ran >> (64-LSTSIZE)];
* }
*/
for (j=0; j<128; j++)
ran[j] = starts ((NUPDATE/128) * j);
for (i=0; i<NUPDATE/128; i++) {
/* #pragma ivdep */
for (j=0; j<128; j++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
The main loop here is for (i=0; i<NUPDATE/128; i++) { and the nested loop is for (j=0; j<128; j++) {. Using 'loop interchange' optimization, compiler can convert this code to
for (j=0; j<128; j++) {
for (i=0; i<NUPDATE/128; i++) {
ran[j] = (ran[j] << 1) ^ ((s64Int) ran[j] < 0 ? POLY : 0);
Table[ran[j] & (TableSize-1)] ^= stable[ran[j] >> (64-LSTSIZE)];
}
}
It can be done because this loop nest is perfect loop nest. Is such optimization prohibited by rules of HPCC?
As far as I can tell it is allowed given that the memory interleaving
is a system setting rather than a code modification (you may nor reorder
the memory accesses that the program makes).
If GUPS actually gets better performance with non-local memory on a
NUMA machine seems doubtful to me. Will bank conflict-induced latency
really be greater than the off-node memory access latency?
STREAM should not be limited by bank conflicts but will probably
benefit from off-node accesses if the CPU has an on-chip memory
controller (like the Opterons) since the bandwidth is then shared
between the local memory controller and the NUMA interconnect.

Resources