I have an array of size 3000 that contains only 0s and 1s. I want to find the first position, starting from index 0, that holds a 1. The array is computed on the device, so I copy it to the host and then search for the index sequentially there. My program repeats this computation 4000 or more times, and I want to reduce the time it takes. Is there a better way to do this, given that the array is computed on the GPU and I currently have to transfer it every time?
int main()
{
for(int i=0;i<4000;i++)
{
cudaMemcpy(A,dev_A,sizeof(int)*3000,cudaMemcpyDeviceToHost);
int k;
for(k=0;k<3000;k++)
{
if(A[k]==1)
{
break;
}
}
printf("got k is %d",k);
}
}
The complete code looks like this:
#include"cuda.h"
#include <stdio.h>   // printf
#include <limits.h>  // INT_MIN, INT_MAX
#define SIZE 2688
#define BLOCKS 14
#define THREADS 192
__global__ void kernel(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
while(thread_id<SIZE)
{
if(A[thread_id]==INT_MIN)
{
*d_pos=thread_id;
return;
}
thread_id+=1;
}
}
__global__ void kernel1(int *A,int *d_pos)
{
int thread_id=threadIdx.x+blockIdx.x*blockDim.x;
if(A[thread_id]==INT_MIN)
{
atomicMin(d_pos,thread_id);
}
}
int main()
{
int pos=INT_MAX,i;
int *d_pos;
int A[SIZE];
int *d_A;
for(i=0;i<SIZE;i++)
{
A[i]=78;
}
A[SIZE-1]=INT_MIN;
cudaMalloc((void**)&d_pos,sizeof(int));
cudaMemcpy(d_pos,&pos,sizeof(int),cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_A,sizeof(int)*SIZE);
cudaMemcpy(d_A,A,sizeof(int)*SIZE,cudaMemcpyHostToDevice);
cudaEvent_t start_cp1,stop_cp1;
cudaEventCreate(&stop_cp1);
cudaEventCreate(&start_cp1);
cudaEventRecord(start_cp1,0);
kernel1<<<BLOCKS,THREADS>>>(d_A,d_pos);
cudaEventRecord(stop_cp1,0);
cudaEventSynchronize(stop_cp1);
float elapsedTime_cp1;
cudaEventElapsedTime(&elapsedTime_cp1,start_cp1,stop_cp1);
cudaEventDestroy(start_cp1);
cudaEventDestroy(stop_cp1);
printf("\nTime taken by kernel is %f\n",elapsedTime_cp1);
cudaDeviceSynchronize();
cudaEvent_t start_cp,stop_cp;
cudaEventCreate(&stop_cp);
cudaEventCreate(&start_cp);
cudaEventRecord(start_cp,0);
cudaMemcpy(A,d_A,sizeof(int)*SIZE,cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp,0);
cudaEventSynchronize(stop_cp);
float elapsedTime_cp;
cudaEventElapsedTime(&elapsedTime_cp,start_cp,stop_cp);
cudaEventDestroy(start_cp);
cudaEventDestroy(stop_cp);
printf("\ntime taken by copy of an array is %f\n",elapsedTime_cp);
cudaEvent_t start_cp2,stop_cp2;
cudaEventCreate(&stop_cp2);
cudaEventCreate(&start_cp2);
cudaEventRecord(start_cp2,0);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
cudaEventRecord(stop_cp2,0);
cudaEventSynchronize(stop_cp2);
float elapsedTime_cp2;
cudaEventElapsedTime(&elapsedTime_cp2,start_cp2,stop_cp2);
cudaEventDestroy(start_cp2);
cudaEventDestroy(stop_cp2);
printf("\ntime taken by copy of a variable is %f\n",elapsedTime_cp2);
cudaMemcpy(&pos,d_pos,sizeof(int),cudaMemcpyDeviceToHost);
printf("\nminimum index is %d\n",pos);
return 0;
}
How can I reduce the total time taken by this code? Are there any other options for improving performance?
If you are running your kernel 4000 times on the GPU, it may help to execute the kernels asynchronously in different streams. Using cudaMemcpyAsync, which does not block the host, may also be quicker when you launch the kernel many times.
A quick introduction to streams and asynchronous execution:
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
Streams and concurrency:
http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
Hope this can help...
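Separately from streams, note that kernel1 already leaves the answer in d_pos, so only that single int needs to be copied back on each iteration rather than the whole array (d_pos has to be reset to INT_MAX before every launch). To check the value the kernel returns, a small host-side reference can help. This is only a sketch, and first_index_of is a hypothetical helper name, not something from the posted code:

```c
#include <limits.h>
#include <stddef.h>

/* Host-side reference: returns the lowest index holding `target`,
   or INT_MAX if it is absent (the same sentinel d_pos starts with). */
int first_index_of(const int *a, size_t n, int target)
{
    int pos = INT_MAX;
    for (size_t i = 0; i < n; i++) {
        /* Mirrors kernel1: every matching element proposes its index
           and the minimum wins, so the visiting order does not matter. */
        if (a[i] == target && (int)i < pos) {
            pos = (int)i;
        }
    }
    return pos;
}
```

Comparing first_index_of(A, SIZE, INT_MIN) with the pos copied back from the device should give the same number.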
My goal is to control the position and speed of a Nema 17 stepper motor based on the euler angle of a BNO055 inertial measurement unit. I am using an ESP32 to flash the code via WIFI to rosserial. I am powering the Nema 17 with a 12V power source and the BNO055 with a small external 5V battery pack.
In summary, the stepper motor should move between 0-4100 steps which would be mapped to -90 and 90 degrees of the BNO055's y-axis.
For this, I need to read the output of the BNO055 sensor as often as possible and only change directions of the Nema 17 when the BNO055 has changed position relative to the mapping.
The PROBLEM I am having is that when I incorporate reading the sensor in my code, my motor starts to shake and does not rotate smoothly. I am wondering how I can get both things to work simultaneously (reading sensor and moving nema 17).
PS: I will control speed by calculating a PI control with the BNO055 sensor and adjusting the delayMicroseconds() accordingly... but first thing is to get the readings and motor movement smooth.
Below is a code snippet I am using to debug this problem:
#include <WiFi.h>
#include <ros.h>
#include <Wire.h>
#include <std_msgs/Header.h>
#include <std_msgs/String.h>
#include <geometry_msgs/Quaternion.h>
#include <HardwareSerial.h>
#include <analogWrite.h>
#include <MultiStepper.h>
#include <AccelStepper.h>
#include <Stepper.h>
#include <Adafruit_Sensor.h>
#include <Adafruit_BNO055.h>
#include <utility/imumaths.h>
#include <math.h>
//////////////////////
// BNO055 //
//////////////////////
Adafruit_BNO055 bno_master = Adafruit_BNO055(55, 0x29);
Adafruit_BNO055 bno_slave = Adafruit_BNO055(55, 0x28);
geometry_msgs::Quaternion Quaternion;
std_msgs::String imu_msg;
#define I2C_SDA 21
#define I2C_SCL 22
TwoWire I2Cbno = TwoWire(0); // I2C connection will increase 6Hz data transmission
float ax_m, ay_m, az_m, ax_s, ay_s, az_s; // accelerometer
float gw_m, gx_m, gy_m, gz_m, gw_s, gx_s, gy_s, gz_s; // gyroscope
float ex_m, ey_m, ez_m, ex_s, ey_s, ez_s; // euler
float qw_m, qx_m, qy_m, qz_m, qw_s, qx_s, qy_s, qz_s; // quaternions
//////////////////////
// WiFi Definitions //
//////////////////////
const char* ssid = "FRITZ!Box 7430 PN"; // Sebas: "WLAN-481774"; Paula: "FRITZ!Box 7430 PN"; ICS: ICS24; Hotel Citadelle Blaye
const char* password = "37851923282869978396"; // Sebas: "Kerriganrocks!1337"; Paula: "37851923282869978396"; ICS: uZ)7xQ*0; citadelle
IPAddress server(192,168,178,112); // ip of your ROS server
IPAddress ip_address;
WiFiClient client;
int status = WL_IDLE_STATUS;
//long motorTimer = 0, getImuDataTimer = 0, millisNew = 0; //millisOld = 0,
//////////////////////
// Stepper motor //
//////////////////////
int stepPin = 4;
int stepPinState = LOW;
int dirPin = 2;
int dirPinState = HIGH;
unsigned long millisOld1 = 0;
unsigned long millisOld2 = 0;
long motorTimer = 1; // in milliseconds
long getImuDataTimer = 10; // in milliseconds
double maxPosition = 4100;
double stepsMoved = 0;
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
class WiFiHardware {
public:
WiFiHardware() {};
void init() {
// do your initialization here. this probably includes TCP server/client setup
client.connect(server, 11411);
}
// read a byte from the serial port. -1 = failure
int read() {
// implement this method so that it reads a byte from the TCP connection and returns it
// you may return -1 is there is an error; for example if the TCP connection is not open
return client.read(); //will return -1 when it will works
}
// write data to the connection to ROS
void write(uint8_t* data, int length) {
// implement this so that it takes the arguments and writes or prints them to the TCP connection
for(int i=0; i<length; i++)
client.write(data[i]);
}
// returns milliseconds since start of program
unsigned long time() {
return millis(); // easy; did this one for you
}
};
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
int i;
void chatterCallback(const std_msgs::String& msg) {
i = atoi(msg.data);
// s.write(i);
}
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
void setupWiFi()
{
// WIFI setup
WiFi.begin(ssid, password);
Serial.print("\nConnecting to "); Serial.println(ssid);
uint8_t i = 0;
while (WiFi.status() != WL_CONNECTED && i++ < 20) delay(500);
if(i == 21){
Serial.print("Could not connect to"); Serial.println(ssid);
while(1) delay(500);
}
Serial.print("Ready! Use ");
Serial.print(WiFi.localIP());
Serial.println(" to access client");
}
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
ros::Subscriber<std_msgs::String> sub("message", &chatterCallback);
ros::Publisher pub("imu_data/", &imu_msg);
ros::NodeHandle_<WiFiHardware> nh;
void setup() {
// set the digital pins as outputs
pinMode(stepPin, OUTPUT);
pinMode(dirPin, OUTPUT);
Serial.begin(57600);
setupWiFi();
// I2C connection IMUs
Wire.begin(I2C_SDA, I2C_SCL);
I2Cbno.begin(I2C_SDA, I2C_SCL, 400000);
bno_master.begin();
bno_slave.begin();
// get imu calibrations
uint8_t system, gyro, accel, mg = 0;
bno_master.getCalibration(&system, &gyro, &accel, &mg);
bno_slave.getCalibration(&system, &gyro, &accel, &mg);
bno_master.setExtCrystalUse(true);
bno_slave.setExtCrystalUse(true);
nh.initNode();
nh.advertise(pub);
}
/////////////////////////////
/// GET IMU DATA FUNCTION ///
/////////////////////////////
int get_imu_data(){
imu::Vector<3> Euler_s = bno_slave.getVector(Adafruit_BNO055::VECTOR_EULER); // 100 Hz capacity by BNO055 // IF I COMMENT THIS LINE OUT AND SET VARIABLES BELOW TO SET VALUES, MY MOTOR RUNS PERFECTLY
// Euler
float ex_s = Euler_s.x();
float ey_s = Euler_s.y();
float ez_s = Euler_s.z();
// putting data into string since adding accel, gyro, and both imu data becomes too cumbersome for rosserial buffer size. String is better for speed of data
String data = String(ex_s) + "," + String(ey_s) + "," + String(ez_s) + "!";
int length_data = data.indexOf("!") + 1;
char data_final[length_data + 1];
data.toCharArray(data_final, length_data + 1);
imu_msg.data = data_final;
pub.publish(&imu_msg);
nh.spinOnce();
Serial.println(ey_s);
return ey_s; // ex_s, ey_s, ez_s
}
/////////////////////////////
// MAIN LOOP //
/////////////////////////////
void loop() {
unsigned long currentMillis = millis();
//////////////////
// GET IMU DATA //
//////////////////
if(currentMillis - millisOld2 >= getImuDataTimer)
{
ey_s = get_imu_data();
Serial.print(ey_s);
}
////////////////
// MOVE MOTOR //
////////////////
// later, the direction will depend on the output of ey_s
if((dirPinState == HIGH) && (currentMillis - millisOld1 >= motorTimer))
{
if(stepsMoved <= maxPosition)
{
digitalWrite(dirPin, dirPinState);
millisOld1 = currentMillis; // update time
stepsMoved += 5;
for(int i =0; i<=5; i++)
{
digitalWrite(stepPin, HIGH);
delayMicroseconds(1200); // constant speed
digitalWrite(stepPin, LOW);
}
Serial.println(stepsMoved); // checking
}
else if(stepsMoved > maxPosition)
{
dirPinState = LOW;
millisOld1 = currentMillis; // update time
stepsMoved = 0;
}
}
if((dirPinState == LOW) && (currentMillis - millisOld1 >= motorTimer))
{
if(stepsMoved <= maxPosition)
{
digitalWrite(dirPin, dirPinState);
millisOld1 = currentMillis; // update time
stepsMoved += 5;
for(int i =0; i<=5; i++)
{
digitalWrite(stepPin, HIGH);
delayMicroseconds(1200); // constant speed
digitalWrite(stepPin, LOW);
}
Serial.println(stepsMoved); // checking
}
else if(stepsMoved > maxPosition)
{
dirPinState = HIGH;
millisOld1 = currentMillis; // update time
stepsMoved = 0;
}
}
}
I have tried the AccelStepper library, but I am not getting the desired results in terms of position control and speed updates.
Arduino's all-in-one loop() is not the correct architecture for controlling real-time systems. Motor control requires fairly accurate timing. For example, it looks like you want to update the motor control output at about 833 Hz (from the 1.2 ms delay), and that rate should then be accurate and stable.
Unfortunately you're not getting anywhere near this, because each pass through loop() does a bunch of non-critical work that can take a very long and non-deterministic amount of time: waiting for the IMU to give you a sample, printing to the serial port, talking to some ROS component, etc. Meanwhile the real-time critical control signal to your motor waits for all of this to finish before it can do its work. Note that printing a few lines to the serial port could already take dozens of milliseconds, so your delayMicroseconds(1200); is analogous to measuring a cut with a caliper and then making the cut with an axe with your eyes closed.
A real-time critical process should execute in its own thread which has higher priority than the non-real-time critical stuff. In your case it should probably run off a timer with a 1.2 ms period. The timer handler should execute with higher priority than all the other stuff, calculate desired output to motor using last received sensor input (i.e. don't go asking the IMU for a fresh reading when it's time to move the motor) and exit.
Then you can run all the other stuff from the loop() in idle priority which simply gets pre-empted when the motor control does its work.
Depending on how critical the accurate timing of IMU input is, you may want to run this also in a separate thread with a priority somewhere between the motor control interrupt and idle (remember to yield some CPU cycles to loop() or it'll starve).
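As a rough illustration of the idea, here is a non-blocking step scheduler in plain C. It is only a sketch: stepper_t and stepper_poll are hypothetical names, and the caller is assumed to supply a microsecond clock (for example from a hardware timer). The point is that the step decision uses the last known target and never waits on the sensor. Note also that in the posted loop() millisOld2 is never updated, so once 10 ms have elapsed the IMU is read on every pass.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical state for one stepper; not part of any library. */
typedef struct {
    uint32_t period_us;    /* desired time between steps */
    uint32_t last_step_us; /* timestamp of the previous step */
    int32_t  position;     /* current position in steps */
    int32_t  target;       /* latest target, updated by slower code */
} stepper_t;

/* Call at high frequency (e.g. from a timer handler) with the current
   microsecond clock. Returns true when one step pulse should be issued.
   It never blocks and never talks to the sensor. */
bool stepper_poll(stepper_t *s, uint32_t now_us)
{
    if (s->position == s->target) {
        return false;
    }
    /* Unsigned subtraction stays correct across clock wrap-around. */
    if ((uint32_t)(now_us - s->last_step_us) < s->period_us) {
        return false;
    }
    s->last_step_us = now_us;
    s->position += (s->target > s->position) ? 1 : -1;
    return true;
}
```

The slower code (IMU read, ROS publish, serial prints) would then only write s->target, and the caller would pulse stepPin and set dirPin whenever stepper_poll returns true.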
I am trying to use GRUB to get the memory map instead of going through the BIOS route. The problem is that GRUB seems to be giving me very weird values. Can anyone help with this?
Relevant code:
This is how I parse the mmap
void mm_init(mmap_entry_t *mmap_addr, uint32_t length)
{
mmap = mmap_addr;
/* Loop through mmap */
printk("-- Scanning memory map --");
for (size_t i = 0; mmap < (mmap_addr + length); i++) {
/* RAM is available! */
if (mmap->type == 1) {
uint64_t starting_addr = (((uint64_t) mmap->base_addr_high) << 32) | ((uint64_t) mmap->base_addr_low);
uint64_t length = (((uint64_t) mmap->length_high) << 32) | ((uint64_t) mmap->length_low);
printk("Found segment starting from 0x%x, with a length of %i", starting_addr, length);
}
/* Next entry */
mmap = (mmap_entry_t *) ((uint32_t) mmap + mmap->size + sizeof(mmap->size));
}
}
This is my mmap_entry_t struct (not the one in multiboot.h):
struct mmap_entry {
uint32_t size;
uint32_t base_addr_low, base_addr_high;
uint32_t length_low, length_high;
uint8_t type;
} __attribute__((packed));
typedef struct mmap_entry mmap_entry_t;
And this is how I call mm_init()
/* Kernel main function */
void kmain(multiboot_info_t *info)
{
/* Check if grub can give us a memory map */
/* TODO: Detect manually */
if (!(info->flags & (1<<6))) {
panic("couldn't get memory map!");
}
/* Init mm */
mm_init((mmap_entry_t *) info->mmap_addr, info->mmap_length);
for(;;);
}
This is the output I get on qemu:
-- Scanning memory map --
Found segment starting from 0x0, with a length of 0
Found segment starting from 0x100000, with a length of 0
And yes, I am pushing eax and ebx before calling kmain. Any ideas on what is going wrong here?
It turns out that the bit masking stuff was the problem. If we drop that, we can still have 32-bit addresses and the memory map works just fine.
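For reference, a possible explanation, assuming printk's %x and %i each consume a 32-bit argument: passing 64-bit values would make %i print the upper half of starting_addr, which would match the zero lengths in the output. Also note that mmap_length is a byte count, while mmap_addr + length in the loop condition is scaled by the entry size. Below is a minimal sketch of walking the map with a byte-based bound, using the entry layout from the Multiboot (version 1) specification. The names mb_mmap_entry and mmap_available_bytes are made up for this example:

```c
#include <stdint.h>
#include <stddef.h>

/* Entry layout per the Multiboot (version 1) specification.
   `size` does not count itself, hence the "+ sizeof(size)" below. */
struct mb_mmap_entry {
    uint32_t size;
    uint64_t base_addr;
    uint64_t length;
    uint32_t type;      /* 1 = available RAM */
} __attribute__((packed));

/* Sums the bytes of available RAM described by a map of
   `map_bytes` bytes starting at `map`. */
uint64_t mmap_available_bytes(const void *map, uint32_t map_bytes)
{
    const uint8_t *p   = (const uint8_t *)map;
    const uint8_t *end = p + map_bytes;   /* bound in bytes, not entries */
    uint64_t total = 0;

    while (p < end) {
        const struct mb_mmap_entry *e = (const struct mb_mmap_entry *)p;
        if (e->type == 1) {
            total += e->length;
        }
        p += e->size + sizeof(e->size);   /* advance to the next entry */
    }
    return total;
}
```

When printing the 64-bit values, it is safest to print the high and low 32-bit halves separately unless printk is known to support a 64-bit format specifier.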
I am working on a project in which I have to store the data from an ADC stream on a µSD card. However, even though I use a 16-bit buffer, I lose data from the ADC stream. My ADC is used with DMA, and I use FATFS (without DMA) and the SDMMC1 peripheral to fill a .bin file with the data.
Do you have any idea how to avoid this loss?
Here is my project : https://github.com/mathieuchene/STM32H743ZI
I use a nucleo-h743zi2 Board, CubeIDE, and CubeMx in their last version.
EDIT 1
I tried to implement Colin's solution. It is better, but something strange happens in the middle of my acquisition. Also, when I increase the maximum count value or try to debug, I end up in the HardFault_Handler. I modified the main.c file by creating 2 blocks (uint16_t blockX[BUFFERLENGTH/2]) and 2 flags that are set when adcBuffer is half filled or completely filled.
I also changed the while(1) part of the main function like this:
if (flagHlfCplt){
//flagCplt=0;
res = f_write(&SDFile, block1, strlen((char*)block1), (void *)&byteswritten);
memcpy(block2, adcBuffer, BUFFERLENGTH/2);
flagHlfCplt = 0;
count++;
}
if (flagCplt){
//flagHlfCplt=0;
res = f_write(&SDFile, block2, strlen((char*)block2), (void *)&byteswritten);
memcpy(block1, adcBuffer[(BUFFERLENGTH/2)-1], BUFFERLENGTH/2);
flagCplt = 0;
count++;
}
if (count == 10){
f_close(&SDFile);
HAL_ADC_Stop_DMA(&hadc1);
while(1){
HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin);
HAL_Delay(1000);
}
}
}
EDIT 2
I modified my program. I gave block1 and block2 a length of BUFFERLENGTH, and I added a pointer (*idx) to select which buffer is being filled. I no longer end up in the HardFault_Handler, but I still lose some data from my ADC stream.
Here are the modifications I made:
// my pointer and buffers
uint16_t block1[BUFFERLENGTH], block2[BUFFERLENGTH], *idx;
// init of pointer and adc start
idx=block1;
HAL_ADC_Start_DMA(&hadc1, (uint32_t*)idx, BUFFERLENGTH);
// while(1) part
while (1)
{
if (flagCplt){
if (flagToChangeBuffer) {
idx=block1;
res = f_write(&SDFile, block2, strlen((char*)block2), (void *)&byteswritten);
flagCplt = 0;
flagToChangeBuffer=0;
count++;
}
else {
idx=block2;
res = f_write(&SDFile, block1, strlen((char*)block1), (void *)&byteswritten);
flagCplt = 0;
flagToChangeBuffer=1;
count++;
}
}
if (count == 150){
f_close(&SDFile);
HAL_ADC_Stop_DMA(&hadc1);
while(1){
HAL_GPIO_TogglePin(LD1_GPIO_Port, LD1_Pin);
HAL_Delay(1000);
}
}
}
Does anyone know how to solve this data loss?
Best Regards
Mathieu
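One thing that stands out in the snippets above: the write length comes from strlen(), which stops at the first zero byte, and binary ADC samples can contain zero bytes anywhere. In the first edit, assuming adcBuffer is a uint16_t array, memcpy is also handed adcBuffer[(BUFFERLENGTH/2)-1] (a sample value) where it expects a pointer, which could explain the hard fault. Here is a minimal sketch of picking the finished half of a circular DMA buffer with an explicit byte count. chunk_t and half_chunk are hypothetical names used only for illustration:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor for the region that is ready to be written. */
typedef struct {
    const uint16_t *data;   /* first sample of the finished half */
    size_t          nbytes; /* size of that half in bytes */
} chunk_t;

/* ring:        the circular DMA buffer the ADC fills
   ring_len:    its length in samples
   second_half: 0 after the half-complete event, 1 after the complete event */
chunk_t half_chunk(const uint16_t *ring, size_t ring_len, int second_half)
{
    chunk_t c;
    size_t half_len = ring_len / 2;

    c.data = second_half ? ring + half_len : ring;
    /* The byte count comes from sizeof, never from strlen. */
    c.nbytes = half_len * sizeof(uint16_t);
    return c;
}
```

The result would then be passed to f_write as the buffer and byte count, so one half is written to the card while the DMA fills the other half.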
// defines pins numbers
const int trigPin = 9;
const int echoPin = 10;
// defines variables
long duration;
int distance; // float distance ;
void setup() {
pinMode(trigPin, OUTPUT); // Sets the trigPin as an Output
pinMode(echoPin, INPUT); // Sets the echoPin as an Input
Serial.begin(9600); // Starts the serial communication
}
void loop() {
// Clears the trigPin
digitalWrite(trigPin, LOW);
delayMicroseconds(2);
// Sets the trigPin on HIGH state for 10 micro seconds
digitalWrite(trigPin, HIGH);
delayMicroseconds(10);
digitalWrite(trigPin, LOW);
// Reads the echoPin, returns the sound wave travel time in microseconds
duration = pulseIn(echoPin, HIGH);
// Calculating the distance
distance= duration*0.034/2;
// Prints the distance on the Serial Monitor
Serial.println(distance);
}
I want to print
1 as 01 for int
2.54 as 02.54 for float
in my Arduino Serial Monitor. How do I go about it? My sensor sends the value without a leading zero, which is normal. How can I change the print format?
Thank you all
The easiest way is simply:
if (distance < 10) Serial.write('0');
Serial.println(distance);
This does not handle negative ints, which is probably fine for a distance.
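For the more general case, including the float, the printf-style width and zero-pad flags can do the padding. Below is a sketch in plain C with hypothetical helper names. Be aware that on AVR-based Arduino boards the %f conversion is typically not available in snprintf by default, so the float variant is an assumption that holds on a desktop C library and on cores with float printf support:

```c
#include <stdio.h>

/* Formats an int with at least two digits, zero padded: 1 -> "01". */
int format_int_padded(char *out, size_t out_size, int value)
{
    return snprintf(out, out_size, "%02d", value);
}

/* Formats a float as NN.NN, zero padded to a total width of 5:
   2.54 -> "02.54". */
int format_float_padded(char *out, size_t out_size, double value)
{
    return snprintf(out, out_size, "%05.2f", value);
}
```

The resulting buffer can then be sent with Serial.println(buffer).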
I am working on the STM32L475 IoT kit, which is an ARM Cortex-M4 device. I want to swap two regions of flash memory. The board I am using has two flash banks of 512 KB each, so I have 1 MB of flash memory. I read that to swap the contents of flash memory you first have to unlock it, then erase it, then write it, and finally lock the flash memory again once the operation is over.
There is another restriction: only 2 KB of memory, which is defined as a page, can be copied at a time. So memory can only be copied page by page. For my application I have to swap applications 1 and 2, which are stored in flash memory, if some conditions are met. Both applications have been allotted 384 KB of memory each, but they actually use less than that (say 264 KB, for example).
I tried to follow the above steps but it's not working. Here is the code I tried:
#define APPLICATION_ADDRESS 0x0800F000
#define APPLICATION2_ADDRESS 0x0806F800
#define SWAP_ADDRESS 0x0806F000
boolean swap(void)
{
char *app_1=( char*) APPLICATION_ADDRESS;//points to the 1st address of application1
char *app_2=(char*) APPLICATION2_ADDRESS;//points to the 1st address of application2
int mem1 = getMemorySize((unsigned char*)APPLICATION_ADDRESS);//returns the number of bytes in Application1
int mem2 = getMemorySize((unsigned char*)APPLICATION2_ADDRESS);//returns the number of bytes in Application2
int limit;
if(mem1>mem2)
limit= mem1;
else
limit= mem2;
Unlock_FLASH();
int lm = limit/2048;
for(int i=1; i<=lm; i++,app_1+=2048,app_2+=2048)
{
int *swap = (int *)SWAP_ADDRESS;
Erase_FLASH(swap);
Write_FLASH(app_1, swap);
Erase_FLASH(app_1);
Write_FLASH(app_2, app_1);
Erase_FLASH(app_2);
Write_FLASH(swap, app_2);
}
Lock_FLASH();
return TRUE;
}
void Unlock_FLASH(void)
{
while ((FLASH->SR & FLASH_SR_BSY) != 0 );
// Check if the controller is unlocked already
if ((FLASH->CR & FLASH_CR_LOCK) != 0 ){
// Write the first key
FLASH->KEYR = FLASH_FKEY1;
// Write the second key
FLASH->KEYR = FLASH_FKEY2;
}
}
void Erase_FLASH(int *c)
{
FLASH->CR |= FLASH_CR_PER; // Page erase operation
FLASH->ACR = c; // Set the address to the page to be written
FLASH->CR |= FLASH_CR_STRT;// Start the page erase
// Wait until page erase is done
while ((FLASH->SR & FLASH_SR_BSY) != 0);
// If the end of operation bit is set...
if ((FLASH->SR & FLASH_SR_EOP) != 0){
// Clear it, the operation was successful
FLASH->SR |= FLASH_SR_EOP;
}
//Otherwise there was an error
else{
// Manage the error cases
}
// Get out of page erase mode
FLASH->CR &= ~FLASH_CR_PER;
}
void Write_FLASH(int *a, int *b)
{
for(int i=1;i<=2048;i++,a++,b++)
{
FLASH->CR |= FLASH_CR_PG; // Programing mode
*(__IO uint16_t*)(b) = *a; // Write data
// Wait until the end of the operation
while ((FLASH->SR & FLASH_SR_BSY) != 0);
// If the end of operation bit is set...
if ((FLASH->SR & FLASH_SR_EOP) != 0){
// Clear it, the operation was successful
FLASH->SR |= FLASH_SR_EOP;
}
//Otherwise there was an error
else{
// Manage the error cases
}
}
FLASH->CR &= ~FLASH_CR_PG;
}
void Lock_FLASH(void)
{
FLASH->CR |= FLASH_CR_LOCK;
}
Here swap buffer is used to store each page(2KB) temporarily as a buffer while swapping. Also the variable limit stores the maximum size out of application 1 and 2 so that there is no error while swapping in case of unequal memory sizes as mentioned before. So basically I am swapping page by page, that is only 2 KB at a time.
Can anyone figure out what's wrong in the code?
Thanks,
Shetu
2K is 2048 bytes, not 2024. Fix the increments all over the code.
"There is another restriction that at a time only 2KB of memory can be copied"
and yet another: these memory blocks must be aligned to 2KB.
This address
#define APPLICATION2_ADDRESS 0x08076400
is not properly aligned, it should have a value that is evenly divisible by 2048 (0x800).
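A quick way to check this, and to avoid dropping a partial last page (the integer division limit/2048 in the question rounds down), is sketched below. PAGE_BYTES, is_page_aligned and pages_needed are hypothetical names chosen for this example:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BYTES 0x800u  /* 2 KB page, as stated in the question */

/* True when addr sits on a page boundary. */
bool is_page_aligned(uint32_t addr)
{
    return (addr % PAGE_BYTES) == 0;
}

/* Number of pages needed to cover nbytes, rounding up so that a
   partial last page is not skipped. */
uint32_t pages_needed(uint32_t nbytes)
{
    return (nbytes + PAGE_BYTES - 1) / PAGE_BYTES;
}
```

With these, is_page_aligned(0x08076400) is false, because 0x6400 is 12.5 pages, while an address such as 0x0806F800 passes the check.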