Basically, the question is which instruction takes less time to execute (or whether they take exactly the same time):
add rax, rbx
; or
or rax, rbx
For example, if I want to access a virtual CPU core efficiently and OR executes faster, then the code I should write would be something like this (C):
#include <stdint.h>
#include <stdlib.h>

typedef struct CPUCore { /* ... */ } CPUCore;
// assume sizeof(CPUCore) == 64
// normal allocation
CPUCore* allocNormal() {
CPUCore* inst = (CPUCore*)malloc(sizeof(CPUCore));
return inst;
}
// aligning allocation
typedef struct Result {
// free the memory with 'free(res.block)'
// use the object through 'res.ptr'
CPUCore* block;
CPUCore* ptr;
} Result;
Result allocAlign() {
// allocate twice the size to guarantee that a 64-byte-aligned
// block (address & 63 == 0) fits inside, so instead of 'ptr + shift'
// we can use 'ptr | shift'
Result ret;
ret.block = (CPUCore*)malloc(sizeof(CPUCore) * 2);
uintptr_t addr = (uintptr_t)ret.block;
ret.ptr = (CPUCore*)(addr + 0x40 - (addr & 0x3F));
return ret;
}
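For illustration, a minimal usage sketch under the stated assumption that sizeof(CPUCore) == 64, so any member offset below 64 can be OR-ed into the aligned base (the offset 8 here is hypothetical):
int main(void) {
    Result res = allocAlign();
    // the low 6 bits of res.ptr are zero, so OR behaves like ADD here
    uintptr_t base = (uintptr_t)res.ptr;
    uintptr_t member = base | 8; // same value as base + 8
    (void)member;
    free(res.block); // free the original block, not the aligned pointer
    return 0;
}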
Related
How can I get the current milliseconds from MQL4 using an Expert Advisor?
I.e., in Java we can get the current milliseconds using System.currentTimeMillis().
This MT4 "Get millisecond" problem has been around for ages. This is a hack I created to solve this problem.
//+------------------------------------------------------------------+
//| timeInMs.mq4 |
//| Copyright 2017, Joseph Lee |
//+------------------------------------------------------------------+
#property copyright "Copyright 2017, Joseph Lee"
#property link "https://www.facebook.com/joseph.fhlee"
#property version "1.00"
#property strict
datetime prevSecondTime = 0;
uint prevSecondTick = 0;
int OnInit() {
// Create an Event that triggers every 1 millisecond.
// **NOTE: GetTickCount() is accurate to 16ms only, so
// in practice, no need to trigger every 1ms.
EventSetMillisecondTimer(1);
return(INIT_SUCCEEDED);
}
void OnTick() {
Comment( "Now: " + TimeLocal() + " :: " + getCurrentMs() + " ms. +- 16ms accuracy.");
}
int getCurrentMs() {
return((int)(GetTickCount() - prevSecondTick));
}
// This is an EVENT function that will be called every
// x milliseconds [see EventSetMillisecondTimer() in OnInit()]
void OnTimer() {
// If a new "second" occurs, record GetTickCount()
if(TimeLocal() > prevSecondTime) {
prevSecondTick = GetTickCount();
prevSecondTime = TimeLocal();
}
}
You can have relative [ms] or even [us] timestamps:
Be careful, as both are relative, but one is measured from the system start, the other from the start of the MQL4 code-execution unit.
The GetTickCount() function returns the number of milliseconds that elapsed since the system start.
uint GetTickCount();
The counter is limited by the restrictions of the system timer. Time is stored as an unsigned integer, so it overflows every 49.7 days if the computer runs uninterrupted.
The GetMicrosecondCount() function returns the number of microseconds that have elapsed since the start of the MQL program.
ulong GetMicrosecondCount();
You can also have absolute [ms] or even [us] timestamps, with a constant(!) ABSOLUTE ERROR,
one that exhibits neither drift nor jitter in exact time measurement.
Isn't this great for the FOREX domain, where milliseconds are "full of events" and microseconds (nanoseconds in recent professional-grade designs) matter?!
// -----------------------------------------------------------------
ulong system_currenttimemillis(){
return( OnStart_GLOB_MILLISECONDS // ABS [ms] SYNC-ed OnStart() WITH [s]-EDGE-ALIGNMENT
+ ( GetMicrosecondCount() // + DELTA ------------------
- OnStart_BASE_MICROSECONDS // since SYNC-ing OnStart()
) / 1000 // =================== // DELTA [ms] =============
);
}
// -----------------------------------------------------------------
static ulong OnStart_GLOB_MICROSECONDS;
static ulong OnStart_GLOB_MILLISECONDS;
static ulong OnStart_EoDY_MICROSECONDS;
static datetime OnStart_EoDY_DATETIME;
static datetime OnStart_BASE_DATETIME;
static uint OnStart_BASE_MILLISECONDS;
static ulong OnStart_BASE_MICROSECONDS;
// -----------------------------------------------------------------
void OnStart(){ /* { SCRIPT | EXPERT ADVISOR | CUSTOM INDICATOR } CALL
THIS */
OnStart_BASE_DATETIME = TimeLocal(); // .SET int == the number of seconds elapsed since January 01, 1970.
while( OnStart_BASE_DATETIME == TimeLocal() ){ // ---- // EDGE-ALIGNMENT -------------------------------------------------------
OnStart_BASE_MICROSECONDS = GetMicrosecondCount(); // .SET ulong, since MQL4 program launch
OnStart_BASE_MILLISECONDS = GetTickCount(); // .SET uint, since system start
} // ==[ MAX 1 SECOND ]=============================== NOW // EDGE-ALIGNED TO [s] ==================================================
OnStart_BASE_DATETIME = TimeLocal(); // .SET date and time as the number of seconds elapsed since January 01, 1970.
OnStart_GLOB_MICROSECONDS = ( (ulong) OnStart_BASE_DATETIME ) * 1000000;
OnStart_GLOB_MILLISECONDS = ( (ulong) OnStart_BASE_DATETIME ) * 1000;
OnStart_EoDY_DATETIME = OnStart_BASE_DATETIME
- ( OnStart_BASE_DATETIME % 86400 );
OnStart_EoDY_MICROSECONDS = ( TimeSecond( OnStart_BASE_DATETIME )
+ ( TimeMinute( OnStart_BASE_DATETIME )
+ TimeHour( OnStart_BASE_DATETIME ) * 60 ) * 60 ) * 1000000;
}
// -----------------------------------------------------------------
int OnInit() {
OnStart(); /* HACK 4 { EXPERT ADVISOR | CUSTOM INDICATOR } CALL
... THAT */
// ...
return( INIT_SUCCEEDED );
}
// -----------------------------------------------------------------
ulong Get_a_Microsecond_of_a_Day(){ // THIS HAS A !!_CONSTANT_!! ONLY ABSOLUTE SYSTEMATIC TIMING ERROR
return( ( OnStart_EoDY_MICROSECONDS // EDGE-SYNC OnStart_EoDY + DELTA-SINCE-OnStart-SYNC-ed:
+ ( GetMicrosecondCount() // == NOW ( 8B ulong ) ROLL-OVER ~ 213M504 DAYS AFTER THE PROGRAM START, WAY LONGER, THAN WE WILL LIVE UNDER THE SUN
- OnStart_BASE_MICROSECONDS // // - OnStart_BASE_MICROSECONDS
)
) // ================== // SECONDS-EDGE-SYNC-ed DISTANCE FROM EoDY-EDGE
% 86400000000 // MODULO DAY-LENGTH ROLL-OVER
); // ALL DST-MOVs TAKE PLACE OVER WEEKENDS, SO NOT DURING TRADING-HOURS, SHOULD BE JUST-ENOUGH GOOD SOLUTION :o)
}
// -----------------------------------------------------------------
uint Get_a_Millisecond_of_a_Day(){ // IMMUNE TO A uint ROLL-OVER ~ 49.7 DAYS
return( Get_a_Microsecond_of_a_Day()
/ 1000
);
}
This solution runs in all { Script | Expert Advisor | Custom Indicator }
Or you can do something like this:
datetime dt1, dt2;
ulong time = GetTickCount();
// function();
time = GetTickCount() - time;
dt1 = TimeLocal() + 2;
do
{
dt2 = TimeLocal();
}
while(TimeSeconds(dt2) < TimeSeconds(dt1));
After it finishes, you may count time from 0.000.
Just cast the values to ulong and make sure to multiply TimeGMT() by 1000.
To print the result, cast it to string:
ulong time = (ulong) TimeGMT()*1000 - (ulong) GetTickCount() ;
Print("milliseconds: ", (string)time);
Another way seems to be using the WinAPI (kernel32.dll). This is better integrated with MQL5, which provides #include <WinAPI/windef.mqh>, but it can be used with MQL4 too by defining the required structs.
The following snippet shows how to get it in MQL4 (in MQL5 you don't need to define the structs, as they can be included):
// MQL5 seems to have Include/WinAPI/windef.mqh - constants, structures and enumerations
#ifdef __MQL4__
//+------------------------------------------------------------------+
//| MQL4 equivalent of Windows _FILETIME struct: GetSystemTimeAsFileTime().
//| #see: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime
//+------------------------------------------------------------------+
struct FILETIME {
uint dwLowDateTime;
uint dwHighDateTime;
};
//+------------------------------------------------------------------+
//| MQL4 equivalent of Windows GetSystemTime()/GetLocalTime()/... struct.
//| Credits: https://www.mql5.com/en/forum/146837/page3#comment_3702187
//+------------------------------------------------------------------+
struct SYSTEMTIME {
ushort wYear; // 2014 etc
ushort wMonth; // 1 - 12
ushort wDayOfWeek; // 0 - 6 with 0 = Sunday
ushort wDay; // 1 - 31
ushort wHour; // 0 - 23
ushort wMinute; // 0 - 59
ushort wSecond; // 0 - 59
ushort wMilliseconds; // 0 - 999
string toString() {
return StringFormat("%hu-%02hu-%02hu %02hu:%02hu:%02hu.%03hu", wYear, wMonth, wDay, wHour, wMinute, wSecond, wMilliseconds);
}
};
#endif
#import "kernel32.dll"
// Millisecond (ms) precision, but possibly limited by the Windows timer configuration (15-16 ms by default)
void GetSystemTime(SYSTEMTIME &time); // (struct version)
void GetLocalTime(SYSTEMTIME &time); // (struct version)
// Microsecond (us) precision, possibly limited by the Windows timer configuration (15-16 ms by default)
void GetSystemTimeAsFileTime(FILETIME& t); // (ulong version) Returns the system time in 100-nsec units
#import
Then you can define user-friendly functions around the ones above to get the time, e.g.
//+------------------------------------------------------------------+
//| Get the system time (UTC) in milliseconds.
//| Credits: https://www.mql5.com/en/forum/146837/page4#comment_13522420
//+------------------------------------------------------------------+
ulong systemTimeMillis(){
FILETIME t; // time measured in 100-nano second units
GetSystemTimeAsFileTime(t);
ulong time = (ulong)t.dwHighDateTime << 32 | t.dwLowDateTime;
ulong diffTo1970 = 11644473600000; // FILETIME starts at January 1, 1601 (UTC)
return time / 10000 - diffTo1970; // 100-ns units to ms (divide by 10,000)
}
ulong localTimeMillis() {
return systemTimeMillis() - TimeGMTOffset() * 1000;
}
More references
https://www.mql5.com/en/forum/146837/page3#comment_3702187
Windows Timer Resolution: Megawatts Wasted
Timers, Timer Resolution, and Development of Efficient Code
Here I have written code to find the number of cycles taken by a function, but I am getting an error at the first MCR instruction. Can anyone suggest how to solve this problem? This code is written in Xcode and runs on iOS.
#include <stdio.h>
static inline unsigned int get_cyclecount (void)
{
unsigned int value;
// Read CCNT Register
asm volatile ("MRC p15, 0, %0, c9, c13, 0\t\n": "=r"(value));
return value;
}
static inline void init_perfcounters (int do_reset, int enable_divider)
{
// in general enable all counters (including cycle counter)
int value = 1;
// perform reset:
if (do_reset)
{
value |= 2; // reset all counters to zero.
value |= 4; // reset cycle counter to zero.
}
if (enable_divider)
value |= 8; // enable "by 64" divider for CCNT.
value |= 16;
// program the performance-counter control-register:
asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));
// enable all counters:
asm volatile ("MCR p15, 0, %0, c9, c12, 1\t\n" :: "r"(0x8000000f));
// clear overflows:
asm volatile ("MCR p15, 0, %0, c9, c12, 3\t\n" :: "r"(0x8000000f));
}
int main () {
float x = 100.0f;
float y = 0.00000f;
float inst,cycl,cycl_inst;
int do_reset=0;
int enable_divider=0;
init_perfcounters (1, 0);
// measure the counting overhead:
unsigned int overhead = get_cyclecount();
overhead = get_cyclecount() - overhead;
unsigned int t = get_cyclecount();
// do some stuff here..
log_10_c_function(x); // the user's function under test (definition not shown)
t = get_cyclecount() - t;
printf ("Totaly %d cycles (including function call) ", t - overhead);
return 0;
}
Recently, I've stumbled upon an interview question where you need to write code that's optimized for ARM, especially for the iPhone:
Write a function which takes an array of char (ASCII symbols) and find
the most frequent character.
char mostFrequentCharacter(char* str, int size)
The function should be optimized to run on dual-core ARM-based
processors, with an infinite amount of memory.
On the face of it, the problem looks pretty simple, and here is the simple implementation of the function that came to mind:
#define RESULT_SIZE 128 // covers the full 7-bit ASCII range 0-127
inline int set_char(char c, int result[])
{
int count = result[c];
result[c] = ++count;
return count;
}
char mostFrequentChar(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char current_char;
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(size_t i = 0; i<size; i++)
{
current_char = str[i];
current_char_frequency = set_char(current_char, result);
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
}
return frequent_char;
}
Firstly, I did some basic code optimization: I moved the code that calculates the most frequent char on every iteration into an additional for loop and got a significant increase in speed. Instead of evaluating the following block of code size times
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = current_char;
}
we can find the most frequent char in O(RESULT_SIZE), where RESULT_SIZE == 128.
char mostFrequentCharOpt1(char str[], int size)
{
int result[RESULT_SIZE] = {0};
char frequent_char = '\0';
int current_char_frequency = 0;
int char_frequency = 0;
for(int i = 0; i<size; i++)
{
set_char(str[i], result);
}
for(int i = 0; i<RESULT_SIZE; i++)
{
current_char_frequency = result[i];
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 7.842381
char mostFrequentChar(char str[], int size)
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
On average, mostFrequentCharOpt1 runs ~24% faster than the basic implementation.
Type optimization
ARM core registers are 32 bits wide. Therefore, changing all local variables of type char to type int saves the processor the additional instructions needed to account for the size of the local variable after each assignment.
Note: ARM64 provides 31 general-purpose registers (x0-x30), each 64 bits wide and also accessible in a 32-bit form (w0-w30). Hence, there is no need to do anything special to operate on the int data type.
infocenter.arm.com - ARMv8 Registers
While comparing the assembly versions of the functions, I noticed a difference between how ARM works with the int type and the char type. ARM uses the LDRB instruction to load a byte and the STRB instruction to store a byte to individual bytes in memory. From my point of view, LDRB is a bit slower than LDR, because LDRB zero-extends on every memory access when loading into a register. In other words, we can't just load a byte into a 32-bit register; the byte has to be widened to a word.
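mostFrequentCharOpt2 is referenced in the benchmarks but never shown in the post; assuming it is simply Opt1 with the remaining char locals widened to int and an int return type, it would look something like this:
int mostFrequentCharOpt2(char str[], int size)
{
    int result[RESULT_SIZE] = {0};
    int frequent_char = 0;            // int instead of char
    int current_char_frequency = 0;
    int char_frequency = 0;
    for(int i = 0; i < size; i++)
    {
        set_char(str[i], result);
    }
    for(int i = 0; i < RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];
        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }
    return frequent_char;
}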
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)
Changing the char type to int didn't give me a significant increase in speed on the iPhone 5s; by way of contrast, running the same code on the iPhone 4 gave a different result:
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)
// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)
Loop optimization
Next, I did a loop optimization where, instead of incrementing the i value, I decremented it.
before
for(int i = 0; i<size; i++) { ... }
after
for(int i = size; i--; ) { ... }
Again, comparing the assembly code gave me a clear distinction between the two approaches.
mostFrequentCharOpt2 | mostFrequentCharOpt3
0x10001250c <+88>: ldr w8, [sp, #28] ; w8 = i | 0x100012694 <+92>: ldr w8, [sp, #28] ; w8 = i
0x100012510 <+92>: ldr w9, [sp, #44] ; w9 = size | 0x100012698 <+96>: sub w9, w8, #1 ; w9 = i - 1
0x100012514 <+96>: cmp w8, w9 ; if i<size | 0x10001269c <+100>: str w9, [sp, #28] ; save w9 to memory
0x100012518 <+100>: b.ge 0x100012548 ; if true => end loop | 0x1000126a0 <+104>: cbz w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine | 0x1000126a4 <+108>: ... set_char start routine
... | ...
0x100012534 <+128>: ... set_char end routine | 0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr w8, [sp, #28] ; w8 = i | 0x1000126c0 <+136>: b 0x100012694 ; back to the first line
0x10001253c <+136>: add w8, w8, #1 ; i++ | 0x1000126c4 <+140>: ...
0x100012540 <+140>: str w8, [sp, #28] ; save i to $sp+28 |
0x100012544 <+144>: b 0x10001250c ; back to the first line |
0x100012548 <+148>: str ... |
Here, instead of loading size from memory and comparing it with the incrementing i variable, we just decrement i by 0x1 and compare the register holding i with 0.
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size) //Type optimization
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
Threading optimization
Reading the question carefully gives us at least one more optimization. The line "..optimized to run on dual-core ARM-based processors..." especially drops a hint to optimize the code using pthread or GCD. The thread_info struct and the _mostFrequentChar worker used below are not shown in the post; a sketch of them follows.
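A minimal sketch of the assumed declarations, inferred from how they are used in the code below (the real definitions are not in the post):
#include <pthread.h>

#define THREAD_COUNT 2  // "by default 2", per the comment below

// RESULT_SIZE is the histogram size defined earlier in the post.
struct thread_info {
    pthread_t thread_id;
    int       thread_num;
    int       startIndex;
    int       str_size;
    char     *str;
    int       resultArray[RESULT_SIZE]; // per-thread histogram, zeroed by the calloc() below
};

// Worker: histogram the characters of this thread's slice of the string.
static void *_mostFrequentChar(struct thread_info *tinfo)
{
    for (int i = tinfo->startIndex; i < tinfo->startIndex + tinfo->str_size; i++)
        tinfo->resultArray[(unsigned char)tinfo->str[i]]++;
    return NULL;
}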
int mostFrequentCharThreadOpt(char str[], int size)
{
int s;
int tnum;
int num_threads = THREAD_COUNT; //by default 2
struct thread_info *tinfo;
tinfo = calloc(num_threads, sizeof(struct thread_info));
if (tinfo == NULL)
exit(EXIT_FAILURE);
int minCharCountPerThread = size/num_threads;
for (tnum = num_threads; tnum--;)
{
tinfo[tnum].thread_num = tnum + 1;
tinfo[tnum].startIndex = minCharCountPerThread*tnum;
// the last slice (highest tnum) also gets the remainder of size/num_threads
tinfo[tnum].str_size = (tnum == num_threads - 1) ? (size - minCharCountPerThread*tnum) : minCharCountPerThread;
tinfo[tnum].str = str;
s = pthread_create(&tinfo[tnum].thread_id, NULL,
(void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
if (s != 0)
exit(EXIT_FAILURE);
}
int frequent_char = 0;
int char_frequency = 0;
int current_char_frequency = 0;
for (tnum = num_threads; tnum--; )
{
s = pthread_join(tinfo[tnum].thread_id, NULL);
}
for(int i = RESULT_SIZE; i--; )
{
current_char_frequency = 0;
for (int z = num_threads; z--;)
{
current_char_frequency += tinfo[z].resultArray[i];
}
if(current_char_frequency >= char_frequency)
{
char_frequency = current_char_frequency;
frequent_char = i;
}
}
free(tinfo);
return frequent_char;
}
Benchmarks: iPhone 5s
size = 1000000
iterations = 500
// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.
Benchmarks: iPhone 4
size = 1000000
iterations = 500
// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization
// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization
Question
How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words: are there any other ways to optimize these two functions?
Source code
Alright, here are some things you can try. I can't say with 100% certainty what will be effective in your situation, but from experience, setting all possible optimizations aside, the fact that even the loop optimization worked for you suggests your compiler is pretty numb.
It depends a bit on your THREAD_COUNT; you say it's 2 by default, but you may be able to save some time if you are 100% sure it's 2. You know the platform you work on; don't make anything dynamic without a reason if speed is your priority.
If THREAD_COUNT == 2, num_threads is an unnecessary variable and can be removed.
int minCharCountPerThread = size/num_threads;
And the old, much-discussed trick of bit-shifting; try it:
int minCharCountPerThread = size >> 1; //divide by 2
The next thing you can try is unrolling your loops: several of the loops run only 2 iterations, so if code size isn't a problem, why not remove the loop aspect altogether (see the sketch below)?
This is really something you should just try, see what happens, and decide whether it is useful to you. I've seen cases where loop unrolling works great, and I've seen cases where loop unrolling slows down my code.
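For instance, with THREAD_COUNT known to be exactly 2, the pthread_join loop from the threaded version could be unrolled by hand (a sketch, not a measured win):
/* Unrolled equivalent of:
     for (tnum = num_threads; tnum--; )
         s = pthread_join(tinfo[tnum].thread_id, NULL);
   Valid only when THREAD_COUNT is known to be exactly 2. */
pthread_join(tinfo[1].thread_id, NULL);
pthread_join(tinfo[0].thread_id, NULL);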
Last thing: try using unsigned numbers instead of signed ints (unless you really need signed). It is known that some tricks/instructions are only available for unsigned variables.
There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different from the newer 64-bit devices: a totally different hardware architecture and a different instruction set. Older 32-bit ARM hardware had some real "tricks" that could make things a lot faster, like multiple-register read/write operations. One example optimization: instead of loading bytes, you load whole 32-bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, another approach is to break up the memory access so that one memory page is processed by one thread while the second thread operates on the second memory page, and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (memory access is typically the slow part). I would also suggest that you start with a good timing framework; I built a timing framework for ARM+iOS that you might find useful for that purpose.
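A sketch of the word-load idea described above (my reading of the suggestion, assuming 4-byte-aligned input whose length is a multiple of 4; count_by_words is a hypothetical helper, not from the answer):
#include <stdint.h>

// Count byte frequencies by loading one aligned 32-bit word at a time
// and extracting the four bytes with shifts, instead of four byte loads.
void count_by_words(const char *str, int size, int result[256])
{
    const uint32_t *words = (const uint32_t *)str; // assumes str is 4-byte aligned
    for (int i = 0; i < size / 4; i++) {
        uint32_t w = words[i];
        result[w         & 0xFF]++;
        result[(w >> 8)  & 0xFF]++;
        result[(w >> 16) & 0xFF]++;
        result[(w >> 24) & 0xFF]++;
    }
    // a real version would also process the size % 4 tail bytes one at a time
}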
I want to get the nearest free memory address to allocate memory for a CodeCave, but I want it to be within the jmp instruction limit (the ±2 GB range reachable by a rel32 displacement). I'm trying the following code, but without much luck.
DWORD64 MemAddr = 0;
DWORD64 Address = 0x0000000140548AE6 & 0xFFFFFFFFFFFFF000;
HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, NULL, ProcessID);
if (hProc){
for (DWORD offset = 0; (Address + 0x000000007FFFEFFF)>((Address - 0x000000007FFFEFFF) + offset); offset += 100)
{
MemAddr = (DWORD64)VirtualAllocEx(hProc, (DWORD64*)((Address - 0x000000007FFFEFFF) + offset),MemorySize, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
if ((DWORD64)MemAddr){
break;
}
}
CloseHandle(hProc);
return (DWORD64)MemAddr;
}
return 0;
The target process is 64-bit.
If the target process is x64 then make sure you're compiling for x64 as well.
I have used this code for the same purpose: to find free memory within a 4GB address range for doing x64 jmps in an x64 hook.
char* AllocNearbyMemory(HANDLE hProc, char* nearThisAddr)
{
char* begin = nearThisAddr;
char* end = nearThisAddr + 0x7FFF0000;
MEMORY_BASIC_INFORMATION mbi{};
auto curr = begin;
while (curr < end && VirtualQueryEx(hProc, curr, &mbi, sizeof(mbi))) // stay within the +2GB window
{
if (mbi.State == MEM_FREE)
{
char* addr = (char*)VirtualAllocEx(hProc, mbi.BaseAddress, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
if (addr) return addr;
}
curr += mbi.RegionSize;
}
return 0;
}
Keep in mind there is no error checking; it's just a simple PoC.
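A hypothetical usage sketch, assuming <windows.h> is included and that ProcessID and the target address come from the question:
HANDLE hProc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, ProcessID);
if (hProc)
{
    char* target = (char*)0x0000000140548AE6; // address from the question
    char* cave = AllocNearbyMemory(hProc, target);
    if (cave)
    {
        // 'cave' lies within ~2 GB above 'target', reachable by a rel32 jmp
    }
    CloseHandle(hProc);
}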
I am trying to load a flattened 2D matrix into shared memory, shift the data along x, and write it back to global memory, shifting also along y, so the input data ends up shifted along both x and y. What I have:
__global__ void test_shift(float *data_old, float *data_new)
{
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
// load from global to shared
VAR = data_old[glob_index];
// do some stuff on VAR
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
__syncthreads();
// write to global memory
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
data_new[glob_index] = VAR2[threadIdx.x];
}
}
The call to the kernel:
test_shift <<< grid, block >>> (data_old, data_new);
and the grid and block dimensions (blockDim.x is equal to the matrix width, i.e. 64):
dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny);
I am not able to get this to work. Could someone please point out what's wrong with it? Should I use a strided index or an offset?
VAR should not have been declared as shared, because in the current form all threads scribble over each other's data when you load from global memory: VAR = data_old[glob_index];.
You also have an out-of-bounds access when you access VAR2[threadIdx.x + 1], so your kernel never finishes (depending on the compute capability of the device - 1.x devices didn't check shared memory accesses as rigorously).
You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.
Shared variables are, well, shared by all threads in a single block. This means that you don't have blockDim.y copies of the shared variables, but only a single copy per block.
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
VAR = data_old[glob_index];
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
This instructs all threads in a block to write data into a single variable (VAR). Next, you have no synchronization, and then you use this variable in the second assignment. This will have undefined results, because threads from the first warp are reading from this variable while threads from the second warp are still trying to write something there.
You should change VAR to be local, or create an array of shared-memory variables, one for each thread in the block.
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x;
data_new[glob_index] = VAR2[threadIdx.x];
}
In VAR2[0] you still have garbage (you've never written anything there). threadIdx.y is always zero in your blocks.
And avoid using uints. They have (or used to have) some performance problems.
Actually, for such a simple task you don't need to use shared memory:
__global__ void test_shift(float *data_old, float *data_new)
{
int glob_index = threadIdx.x + blockIdx.y*blockDim.x;
float VAR;
// load from global to local
VAR = data_old[glob_index];
int glob_index_new;
// calculate only if we are going to output something
if ( (blockIdx.y < gridDim.y - 1) && ( threadIdx.x < blockDim.x - 1 ))
{
glob_index_new = threadIdx.x + 1 + (blockIdx.y + 1)*blockDim.x;
// do some stuff on VAR
} else // just write 0.0 to remove garbage
{
glob_index_new = ( (blockIdx.y == gridDim.y - 1) && ( threadIdx.x == blockDim.x - 1 ) ) ? 0 : ((blockIdx.y == gridDim.y - 1) ? threadIdx.x : (blockIdx.y)*blockDim.x );
VAR = 0.0;
}
// write to global memory
data_new[glob_index_new] = VAR;
}