Is there an 8/16/32-bit checksum algorithm that will yield a result that is not all FFs or all zeros?

I would like to calculate a checksum (preferably 8-bit) whose result is never FF and never 0. It is to be used in a circular-log SPI flash file system for a microcontroller. In the file system, 0 marks the start of a record and FF indicates erased memory. Thus, when I calculate the checksum, I do not want the result to be confused with the start of a record or with unused memory.
I have looked at Fletcher's checksums, but those could still yield 0 as a result. Alternatively, I thought of using a 7-bit checksum and using the last bit to make sure I do not get a zero or FF result.
Does anyone know about such an implementation?

I ended up doing the following:
uint8_t CrcCalc(uint8_t* buffer, size_t len)
{
// .... some calculation here with polynomial of own choice
}
uint8_t CrcCalcNon0orFF(uint8_t* buffer, size_t len)
{
uint8_t tempCrc = CrcCalc(buffer,len);
if (tempCrc == 0xFF) tempCrc++;
if (tempCrc == 0) tempCrc++;
return tempCrc;
}
The above can be extended to 16 and 32 bit problems as well.
I am not sure if it will satisfy the math purists, but it worked for me.
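For illustration only, here is a minimal self-contained sketch of the same idea. The original leaves the polynomial open, so the choice of 0x07 with an initial value of 0 is purely an assumption, and the names crc8_poly07 and crc8_non0_or_ff are hypothetical. As in the code above, both reserved values end up as 0x01.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, MSB first. Polynomial 0x07 and initial value 0 are
 * assumptions made for this sketch; the original leaves them open. */
uint8_t crc8_poly07(const uint8_t *buffer, size_t len)
{
    assert(buffer != NULL || len == 0);
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= buffer[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                               : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* Fold the two reserved values (0x00 = record start, 0xFF = erased flash)
 * onto 0x01, which is what the two increments in the original achieve. */
uint8_t crc8_non0_or_ff(const uint8_t *buffer, size_t len)
{
    uint8_t crc = crc8_poly07(buffer, len);
    if (crc == 0x00 || crc == 0xFF) {
        crc = 0x01;
    }
    return crc;
}
```

The price of the folding is that three different raw CRC values (0x00, 0x01 and 0xFF) now share one stored value, so error detection is very slightly weaker than with the unmodified CRC.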

Related

Buffer Overflow Not Overflowing Return Address

Below is the C code
#include <stdio.h>
#include <unistd.h> /* declares read() */
void read_input()
{
char input[512];
int c = 0;
while (read(0, input + c++,1) == 1);
}
int main ()
{
read_input();
printf("Done !\n");
return 0;
}
In the above code, there should be a buffer overflow of the array 'input'. The file we give it has over 600 characters in it, all 2's (e.g. 2222222...); the ASCII code of '2' is 0x32. However, when executing the code with the file, no segmentation fault is thrown, meaning the program counter register was unchanged. Below is a screenshot of the memory of the input array in gdb; highlighted is the address of the ebp (program counter) register, and it's clear that it was skipped when writing:
LINK
The writing of the characters continues after the program counter, which is maybe why segmentation fault is not shown. Please explain why this is happening, and how to cause the program counter to overflow.
This is tricky! Both input[] and c are on the stack, with c following the 512 bytes of input[]. Just before the 513th byte is stored, c = 0x00000201 (513). Because input[] has ended, that byte, 0x32 (50), lands on c itself, which afterwards reads c = 0x00000232 (562). This is a little-endian machine, so the least significant byte comes first in memory; on a big-endian architecture c would have become 0x32000201, and the program would almost certainly have segfaulted.
So you are actually jumping 562 - 513 = 49 bytes ahead. Then there is the ++, which makes it 50. That is why you have exactly 50 bytes not overwritten with 0x32 (again, 0x3232ab64 is little endian: if you display memory as bytes instead of dwords you will see 0x64 0xab 0x32 0x32).
So you are writing into an unassigned area of the stack. It doesn't segfault because that area is still within the process's legal address space (up to the imposed limit), and nothing vital is being overwritten.
Nice example of how things can go horribly wrong without exploding! Is this a real-life example or an assignment?
Ah yes... for the second question, try declaring c before input[], or declaring c as static, so that it is not overwritten.
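The effect on c can be reproduced safely with a small model. The sketch below is an assumption-laden simulation, not the asker's program: a plain byte array stands in for the stack, the counter is assumed to sit directly after the 512-byte buffer, and its bytes are stored little-endian by hand. The name simulate_counter is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 512u

/* Returns the counter value after `writes` one-byte stores of `value`,
 * modelling `read(0, input + c++, 1)` with c placed right after input[]. */
uint32_t simulate_counter(unsigned writes, uint8_t value)
{
    uint8_t frame[1024];      /* stand-in for the stack; never overrun */
    uint32_t c = 0;

    assert(writes <= 600);
    memset(frame, 0, sizeof frame);

    for (unsigned n = 0; n < writes; n++) {
        uint32_t index = c++;                 /* old value is the offset */
        if (index >= sizeof frame) {
            break;                            /* keep the model in bounds */
        }
        /* Mirror the incremented counter into the model, little-endian. */
        frame[BUF_LEN + 0] = (uint8_t)(c & 0xFF);
        frame[BUF_LEN + 1] = (uint8_t)((c >> 8) & 0xFF);
        frame[BUF_LEN + 2] = (uint8_t)((c >> 16) & 0xFF);
        frame[BUF_LEN + 3] = (uint8_t)((c >> 24) & 0xFF);

        frame[index] = value;                 /* the one-byte read() */

        /* Reload the counter: it may just have been overwritten. */
        c = (uint32_t)frame[BUF_LEN + 0]
          | ((uint32_t)frame[BUF_LEN + 1] << 8)
          | ((uint32_t)frame[BUF_LEN + 2] << 16)
          | ((uint32_t)frame[BUF_LEN + 3] << 24);
    }
    return c;
}
```

Calling simulate_counter(513, 0x32) models the moment the first out-of-bounds byte lands on the low byte of the counter; the next store then goes to whatever offset the corrupted counter holds. A real compiler is free to lay out the frame differently, so the exact size of the gap on a given machine may not match this model.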

How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
const int length,
const int height,
and a bunch of other parameters) {
//declare some local arrays to be shared by all 100 work item in this group
__local float LP [length];
__local float LT [height];
__local int bitErrors = 0;
__local bool failed = false;
//here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with memalloc? How do I handle this so that the memory is allocated only once for the entire workgroup and not once per work item?
All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...
It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height* sizeof(cl_float), NULL);
local memory is always shared by the workgroup (as opposed to private), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs at least don't particularly like work-group sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD). A work-group of 100 items will probably be executed as if it had 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use work-groups of size 128 directly (and change the global work size accordingly).
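Adjusting the global work size usually means rounding it up to the next multiple of the work-group size. A minimal host-side sketch, where the helper name round_up_to_multiple is hypothetical and not part of the OpenCL API:

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size, so that the
 * global work size divides evenly into work-groups. */
size_t round_up_to_multiple(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}
```

For example, round_up_to_multiple(100, 32) gives 128. The kernel then has to ignore the padding work items itself, for instance by comparing its global ID against the real problem size that is passed in as an argument.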
As a side note: I never understood why everyone uses the underscore variant for kernel, local and global, seems much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
And pass the LENGTH as a define in your kernel compile.
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
Another reason your code cannot compile is that OpenCL does not allow variables in local memory to be initialized in their declaration. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). Declare them without an initializer, let one work item assign the starting values, and call barrier(CLK_LOCAL_MEM_FENCE) before the other work items read them. It is also not possible in CUDA (see "Is there a way of setting default value for shared memory array?").
ps:The answer from Grizzly is good enough and it would be better if I can post it as a comment, but I am restricted by the reputation policy. Sorry.

memset() behaving undesirably

I am using memset function in C and having a problem. Here is my problem:
char* tail;
tail = //some memory address
int pbytes = 5;
When I call memset like:
memset(tail+pbytes, 0 , 8); // It gives no error
When I call memset like:
memset(tail+pbytes, 0 , 9); // It goes into infinite loop
When I call memset like:
memset(tail+pbytes, 0 , 10); // last parameter (10 or above). It gives Segmentation fault
What can be the reason of this? The program runs and gives output as desired but it gives segmentation fault in the end. I am using Linux 64 virtual machine.
Any help would be appreciated.
OK. Let me clarify what I am doing. I am building 128 bytes of data (0-127 in an array). I write 0 (NULL) to bytes 112 through 119 (that goes well), but when I try to write 0 to the 120th byte and run the program, it goes into an infinite loop. If I write 1, 2, 4 or 6 to the 120th byte, the program runs well. If I write other numbers to the 120th byte, the program gives a segmentation fault. Basically there is something wrong with bytes 120 to 127.
There is nothing wrong with memset. There is something wrong with how you defined your pointer variable tail.
If you simply wrote
char tail[128];
memset(tail+5, 0, 9);
of course it would work fine. Your problem is that you're not doing anything that simple and correct; you're doing something obscure and incorrect, such as
char tail[1];
memset(tail+5, 0, 9);
or
void foo(int x) {
char *tail = (char *)&x;
memset(tail+5, 0, 9);
}
To paraphrase Charles Babbage: When you put wrong code into the machine, wrong answers come out.
The segfault is probably because you're trying to write to a virtual address that has not yet been allocated. The infinite loop might be because you're overwriting some part of the memset's return address, so that it returns to the wrong place (such as into the middle of an infinite loop) instead of returning to the place it was called from.
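One way to catch this class of bug early is to carry the buffer size along and check the range before clearing it. The following is only a sketch; zero_range is a hypothetical helper, not a standard function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero `count` bytes starting at `offset`, but only if the whole range
 * lies inside the buffer. Returns 0 on success, -1 if it would overrun. */
int zero_range(char *buf, size_t buf_size, size_t offset, size_t count)
{
    assert(buf != NULL);
    if (offset > buf_size || count > buf_size - offset) {
        return -1;            /* range would leave the buffer */
    }
    memset(buf + offset, 0, count);
    return 0;
}
```

With a 128-byte buffer, zero_range(data, 128, 112, 8) succeeds, while zero_range(data, 128, 120, 9) is rejected instead of silently overwriting whatever follows the buffer.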

arm asm/neon optimisation for image processing

I'm currently working on a painting app on iOS.
I draw directly into an NSMutableData buffer and apply blending with my brush like this:
- (void) combineColorDestination:(unsigned char*) dest source:(unsigned char*) src
{
const unsigned char sra = ((unsigned char *)src)[3];
const float oneminusalpha = 1.0f - (sra / 255.f);
int d[4];
for (int i=0;i<4;i++)
{
d[i] = oneminusalpha * ((unsigned char *)dest)[i] + ((unsigned char *)src)[i];
if (d[i]>255)
d[i] = 255;
((unsigned char *)dest)[i] = (unsigned char)d[i];
}
}
Any suggestions for optimisations?
I previously tried to use NEON, but I had a bug I wasn't able to fix (the bordering pixels were wrong).
I was iterating over the pixels 2 by 2 like this:
uint8x8_t va = vld1_u8(dest);
uint8x8_t vb = vld1_u8(src);
uint8x8_t res = vqadd_u8(va,vb);
vst1_u8(dest, res);
Suggestions? Alright. Note that these are valid for whichever multimedia manipulation you are doing and are hardly restricted to your case.
First, before you even do NEON, you should change your code to have one function that changes a bunch of pixels (at least a row, a rectangle if you can) at once, instead of a function (or method - even worse) that changes one pixel and is called a bunch of times: somehow I doubt the brush is only 1x1 pixel.
Second, except for the column loop (and eventual row loop), there should be no branch (that is, flow control structures). No for (i=0;i<4;i++); just write the code for the four channels in sequence (use a macro if necessary). No if (d[i]>255); express that as an alternative: dest[i] = (temp>255?255:temp); at the very least, if not replacing it by a more efficient way to do saturation (tricks using subtractions, shifts, and masks exist).
Third, avoid any conversion between floating-point and integer; this is always valid advice, but float->int conversions are particularly devastating on ARM. Since you're manipulating integers, this means foregoing floating-point here.
And once you've done that, surprise, besides making your code faster you have in fact done the preparation work for NEON: NEON is only remotely useful if you process a bunch of pixels at once, if there is no branch, and if you don't convert between floating-point and integer all over the place. So only then will we talk about NEON, if it is even necessary at this point.
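As a rough sketch of the first three points in plain C: one call handles a whole run of pixels, the four channels are written out in sequence, and all arithmetic is integer. The names blend_row, div255 and saturate_u8 are hypothetical, and the pixel layout (4 bytes per pixel, alpha in byte 3) is taken from the question. Note one deliberate difference: the division by 255 here rounds to nearest, whereas the float version in the question truncates, so results may differ by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x / 255 rounded to nearest, valid for x in 0..65025 (255 * 255). */
static inline unsigned div255(unsigned x)
{
    unsigned t = x + 128;
    return (t + (t >> 8)) >> 8;
}

/* Clamp to 255; compilers typically emit a conditional move here. */
static inline uint8_t saturate_u8(unsigned v)
{
    return (uint8_t)(v > 255 ? 255 : v);
}

/* dest = dest * (255 - src_alpha) / 255 + src, per channel, saturated. */
void blend_row(uint8_t *dest, const uint8_t *src, size_t pixel_count)
{
    assert(dest != NULL && src != NULL);
    for (size_t p = 0; p < pixel_count; p++, dest += 4, src += 4) {
        unsigned inv = 255u - src[3];   /* one minus alpha, scaled by 255 */
        dest[0] = saturate_u8(div255(inv * dest[0]) + src[0]);
        dest[1] = saturate_u8(div255(inv * dest[1]) + src[1]);
        dest[2] = saturate_u8(div255(inv * dest[2]) + src[2]);
        dest[3] = saturate_u8(div255(inv * dest[3]) + src[3]);
    }
}
```

A loop of this shape has no data-dependent branches and no float conversions, which is the form that maps most directly onto widening multiplies and saturating adds if NEON turns out to be needed.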

What's the best way to load 2 unaligned 64-bit values into an sse register with SSSE3?

There are 2 pointers to 2 unaligned 8-byte chunks that need to be loaded into one xmm register. If possible, using intrinsics. And if possible, without using an auxiliary register. Without pinsrd. (SSSE3, Core 2)
From the msvc specs, it looks like you can do the following:
__m128d xx = _mm_setzero_pd(); // start from a defined value rather than an uninitialised register
xx = _mm_loadh_pd(xx, (const double *)ptra); // load the higher 64 bits from (unaligned) ptra
xx = _mm_loadl_pd(xx, (const double *)ptrb); // load the lower 64 bits from (unaligned) ptrb
Loading from unaligned storage is (in my experience) very much slower than loading from aligned pointers, so you probably wouldn't want to be doing this type of operation too often if you really want high performance.
Hope this helps.
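A variation on the same idea, sketched here under the assumption of an x86 target with SSE2: load the low half with a 64-bit integer load, which zeroes the upper half, and then fill in the high half, so no separate zeroing step is needed. The function name load_two_unaligned_u64 is hypothetical; the intrinsics are the standard ones from emmintrin.h.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Build one 128-bit value from two independent 8-byte chunks, neither of
 * which needs any particular alignment. */
__m128i load_two_unaligned_u64(const void *low_src, const void *high_src)
{
    assert(low_src != NULL && high_src != NULL);

    /* 64-bit load into the low half; the high half is zeroed. */
    __m128i low = _mm_loadl_epi64((const __m128i *)low_src);

    /* Replace the high half with the second chunk, keep the low half. */
    __m128d both = _mm_loadh_pd(_mm_castsi128_pd(low),
                                (const double *)high_src);

    return _mm_castpd_si128(both);   /* reinterpret only, no conversion */
}
```

The casts between the integer and double vector types only change the type the compiler sees; they do not convert the data.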
Unaligned access is much slower than aligned access (at least pre-Nehalem);
you may get better speed by loading the aligned 128-bit words that contain the desired unaligned 64-bit words, then shuffling them to make the result you want.
Assumes:
you have memory read access to the full 128-bit word
the 64-bit words are aligned on at least 32-bit boundaries
e.g. (not tested)
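int aoff = (int)((uintptr_t)ptra & 15); /* uintptr_t comes from <stdint.h> */
int boff = (int)((uintptr_t)ptrb & 15);
__m128 va = _mm_load_ps( (const float*)((const char*)ptra - aoff) );
__m128 vb = _mm_load_ps( (const float*)((const char*)ptrb - boff) );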
switch ( (aoff<<4) | boff )
{
case 0: _mm_shuffle_ps(va,vb, ...
The number of cases depends on whether you can assume 64 bit alignment
