I'm trying to implement a simple data transfer over UDP. I'm stuck on the checksum: given a packet containing the data, how should I compute it? Also, any ideas on how to implement timeouts that trigger retransmission? Thanks.
Why not try Reliable UDP? See http://en.wikipedia.org/wiki/Reliable_User_Datagram_Protocol
It has a standard.
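Here's one approach for the Internet checksum, treating the packet as an array of 16-bit words:
/* data: the packet as 16-bit words; len: number of words
   (pad an odd trailing byte with zero before calling) */
unsigned short checkSum(const unsigned short *data, int len) {
    unsigned long sum = 0;
    int i;
    for (i = 0; i < len; i++) {
        sum += (data[i] & 0xFFFF);
    }
    /* fold any carries back into the low 16 bits */
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    sum = ~sum;
    return ((unsigned short) sum);
}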
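For the retransmission, you can set an alarm that triggers a timeout when a packet is lost. Install a handler (timeout_function here stands for your own function) with
signal(SIGALRM, timeout_function);
and arm the timer with alarm() before waiting for the acknowledgement.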
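As an alternative to SIGALRM, a receive timeout can also be sketched with select(), which waits on the socket for a bounded time. This is only a minimal sketch: the names wait_readable and send_with_retry, the use of a connected socket, and the "any reply counts as an ACK" policy are assumptions for illustration, not part of the answer above.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0) return -1;
    return rc > 0 ? 1 : 0;
}

/* Send pkt on a connected socket and wait for a reply, retransmitting
 * on each timeout. Returns the reply length, or -1 on error or when
 * max_tries attempts all timed out. */
ssize_t send_with_retry(int sock, const void *pkt, size_t len,
                        void *reply, size_t reply_cap,
                        int timeout_ms, int max_tries)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (send(sock, pkt, len, 0) < 0) return -1;
        int rc = wait_readable(sock, timeout_ms);
        if (rc < 0) return -1;
        if (rc == 1) return recv(sock, reply, reply_cap, 0);
        /* rc == 0: no reply in time, loop around and retransmit */
    }
    return -1;
}
```

A real protocol would additionally check that the reply acknowledges the packet that was sent (for example by sequence number) before treating it as success.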
Hope it helps!
I have two functions: one calculates the difference between successive elements of a row, and the other calculates the difference between successive values in a column. That is, the first computes M[i][j+1] - M[i][j] and the second computes M[i+1][j] - M[i][j], M being the matrix. I implemented them as follows:
inline void firstFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i=0; i < M; i++){
        for(int j=0; j <=N - 33; j+=32){
            auto pos = i*N + j;
            _mm256_storeu_epi8(output + pos, _mm256_sub_epi8(_mm256_loadu_epi8(input + pos + 1), _mm256_loadu_epi8(input + pos)));
        }
    }
}
void secondFunction(uchar* input, uchar* output, size_t M, size_t N){
    for(int i = 0; i < M-1; i++){
        //#pragma prefetch input : (i+1)*N : (i+1)*N + N
        for(int j = 0; j <N-33; j+=32){
            auto idx = i * N + j;
            auto idx_1 = (i+1)*N + j;
            _mm256_storeu_epi8(output + idx, _mm256_sub_epi8(_mm256_loadu_epi8(input + idx_1), _mm256_loadu_epi8(input + idx)));
        }
    }
}
However, when I benchmark them, the average runtimes of the first and second function are as follows:
firstFunction = 21.1432 ms
secondFunction = 166.851 ms
where the matrix size is M = 9024 and N = 12032.
This is a huge increase in the runtime for a similar operation. I suspect this has something to do with memory accesses and caching, where way more cycles are spent in getting the memory from another row in the second case.
So my question has two parts.
Is my reasoning for the difference in runtimes correct?
How do I alleviate it? My first idea is to prefetch the second row into the cache and carry on, but I am not able to prefetch a dynamically calculated position. Would _mm_prefetch help if the issue is indeed memory access time?
I am using the dpcpp compiler with the compile options -g -O3 -fsycl -fsycl-targets=spir64 -mavx512f -mavx512vl -mavx512bw -qopenmp -liomp5 -lpthread. This compiler has a prefetch pragma, but it does not allow prefetch addresses calculated at run time. I would really appreciate a solution that is not specific to this compiler; one specific to GCC would also be fine.
Edit 1: I just tried _mm_prefetch, but that also fails, with error: argument to '__builtin_prefetch' must be a constant integer, for the call _mm_prefetch(input + (i+1) * N, N);. So an additional question: is there any way to prefetch memory locations calculated at run time?
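On the error in the edit: the second parameter of _mm_prefetch is a hint that must be a compile-time constant (such as _MM_HINT_T0), and the failing call passes N there; the address argument itself can be computed at run time. A minimal sketch of that distinction using GCC/Clang's __builtin_prefetch, written in scalar C for brevity. The function name column_diff and the 64-byte cache-line stride are assumptions, and whether a software prefetch helps at all has to be measured, since hardware prefetchers may already cover sequential row access.

```c
#include <assert.h>
#include <stddef.h>

/* Column-wise difference: out[i][j] = in[i+1][j] - in[i][j]
 * for i in [0, rows-2], with wraparound in 8 bits.
 * The prefetch ADDRESS is computed at run time; only the two hint
 * arguments (0 = read, 3 = high temporal locality) must be constants. */
void column_diff(const unsigned char *in, unsigned char *out,
                 size_t rows, size_t cols)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i + 1 < rows; i++) {
        const unsigned char *cur  = in + i * cols;
        const unsigned char *next = in + (i + 1) * cols;
        for (size_t j = 0; j < cols; j++) {
            /* once per assumed 64-byte line, request the row after next */
            if (j % 64 == 0 && i + 2 < rows) {
                __builtin_prefetch(in + (i + 2) * cols + j, 0, 3);
            }
            out[i * cols + j] = (unsigned char)(next[j] - cur[j]);
        }
    }
}
```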
Currently I'm working with a custom implementation of the Mersenne Twister, and I'd like to improve my understanding of vector operations.
I have the following code:
#define N 624
#define M 397
for (k = N - 1; k; k--) {
    array[i] = (array[i] ^ ((array[i-1] ^ (array[i-1] >> 30)) * 1566083941UL)) - i;
    array[i] &= 0xffffffffUL;
    i++;
    if (i >= N) {
        array[0] = array[N-1];
        i = 1;
    }
}
Here I'm working with 32-bit integers only, so as I understand it, I could perform 8 times as many operations at once using AVX2 instructions? How can I do that in practice?
I know how to add two vectors, but this case seems more complicated. I don't know where to begin.
For a simple element-wise loop I would go from the scalar version to a vector version like this (pseudocode), but I'd like to be sure how to do the same in my case.
/* scalar */
for (i = 0; i < 1024; i++)
    C[i] = A[i]*B[i];
/* vector, 4 elements per step */
for (i = 0; i < 1024; i+=4)
    C[i:i+3] = A[i:i+3]*B[i:i+3];
Unfortunately my university offers no lessons on intrinsics, but I'm curious and would like to improve.
I'm also wondering how to create the array using vectors. Maybe as a matrix? (Maybe with _mm256_setr_epi32?)
I hope to get some advice on this topic!
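The C[i:i+3] pseudocode in the question can be made concrete without committing to specific intrinsics. Below is a minimal sketch using the GCC/Clang vector_size extension with eight 32-bit lanes; the type name u32x8 and function name mul_u32 are assumptions. When built for an AVX2 target, a compiler can typically lower each lane-wise operation to one 256-bit instruction. Note that this only covers loops whose iterations are independent: the Mersenne Twister loop in the question reads array[i-1], which the previous iteration has just written, so it does not have this shape as it stands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 32-bit lanes in one 256-bit value (GCC/Clang vector extension). */
typedef uint32_t u32x8 __attribute__((vector_size(32)));

/* c[i] = a[i] * b[i] for i in [0, n), eight elements per step. */
void mul_u32(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n)
{
    size_t i = 0;
    assert(a != NULL && b != NULL && c != NULL);
    for (; i + 8 <= n; i += 8) {
        u32x8 va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* unaligned load of 8 lanes */
        memcpy(&vb, b + i, sizeof vb);
        vc = va * vb;                    /* lane-wise multiply, wraps mod 2^32 */
        memcpy(c + i, &vc, sizeof vc);   /* unaligned store */
    }
    for (; i < n; i++) {                 /* scalar tail for leftover elements */
        c[i] = a[i] * b[i];
    }
}
```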
I want to use the PayMaya EMV Merchant-Presented QR Code Specification for Payment Systems. Everything is fine except the CRC: I don't understand how to generate it.
This is all the specification says about it, but I still can't work out how to generate it:
The checksum shall be calculated according to [ISO/IEC 13239] using the polynomial '1021' (hex) and initial value 'FFFF' (hex). The data over which the checksum is calculated shall cover all data objects, including their ID, Length and Value, to be included in the QR Code, in their respective order, as well as the ID and Length of the CRC itself (but excluding its Value).
Following the calculation of the checksum, the resulting 2-byte hexadecimal value shall be encoded as a 4-character Alphanumeric Special value by converting each nibble to an Alphanumeric Special character.
Example: a CRC with a two-byte hexadecimal value of '007B' is included in the QR Code as "6304007B".
This converts a string to its UTF-8 representation as a sequence of bytes, and returns the 16-bit Cyclic Redundancy Check of those bytes (CRC-16/CCITT-FALSE).
import 'dart:convert';
import 'dart:typed_data';

int crc16_CCITT_FALSE(String data) {
  int initial = 0xFFFF; // initial value
  int polynomial = 0x1021; // 0001 0000 0010 0001 (0, 5, 12)
  Uint8List bytes = Uint8List.fromList(utf8.encode(data));
  for (var b in bytes) {
    for (int i = 0; i < 8; i++) {
      bool bit = ((b >> (7 - i) & 1) == 1);
      bool c15 = ((initial >> 15 & 1) == 1);
      initial <<= 1;
      if (c15 ^ bit) initial ^= polynomial;
    }
  }
  return initial &= 0xffff;
}
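For comparison, here is a minimal C sketch of the same parameters (polynomial 0x1021, initial value 0xFFFF, most significant bit first), together with the last step from the quoted specification: the CRC is computed over the payload plus the CRC's own ID and Length ("6304") and then appended as four hex characters. Reading the specification as this unreflected variant is an assumption, as are the function names and the "000201" payload fragment used in the note below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CRC-16 with polynomial 0x1021, initial value 0xFFFF, MSB first,
 * no reflection and no final XOR (often called CRC-16/CCITT-FALSE). */
uint16_t crc16_ccitt_false(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Copy payload to out, append "6304", then append the CRC of everything
 * so far as 4 uppercase hex characters.
 * Returns 0 on success, -1 if out is too small. */
int append_crc(const char *payload, char *out, size_t out_size)
{
    size_t n = strlen(payload);
    if (out_size < n + 9) return -1;   /* "6304" + 4 hex digits + NUL */
    memcpy(out, payload, n);
    memcpy(out + n, "6304", 4);
    uint16_t crc = crc16_ccitt_false((const uint8_t *)out, n + 4);
    snprintf(out + n + 4, 5, "%04X", (unsigned)crc);
    return 0;
}
```

Calling append_crc("000201", out, sizeof out) leaves out holding "0002016304" followed by the four hex digits of the CRC. The standard check string "123456789" gives 0x29B1 for this variant, which is a quick way to validate any port of the routine.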
The CRC for ISO/IEC 13239 is CRC-16/ISO-HDLC, per the notes in the CRC catalog. The code below implements that CRC and prints the check value 0x906e:
import 'dart:typed_data';

int crc16ISOHDLC(Uint8List bytes) {
  int crc = 0xffff;
  for (var b in bytes) {
    crc ^= b;
    for (int i = 0; i < 8; i++)
      crc = (crc & 1) != 0 ? (crc >> 1) ^ 0x8408 : crc >> 1;
  }
  return crc ^ 0xffff;
}

void main() {
  Uint8List msg = Uint8List.fromList([0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39]);
  print("0x" + crc16ISOHDLC(msg).toRadixString(16));
}
I'm working on a hardware-communication app in which I send data to, or request data from, an external device. I have the request part done.
I've just found that I could use some help calculating the checksum.
A packet is created as NSMutableData and is then converted into a byte array before being sent out.
A package looks like this:
0x1E 0x2D 0x2F DATA checksum
I'm thinking I could convert the hex values into binary and add them up one by one, but I don't know if that's a good idea. Please let me know whether this is the only way to do it, or whether there are built-in functions I don't know about.
Any suggestions will be appreciated.
BTW, I just found some C# code in someone else's post. I'll try to make it work in my app and, if I can, I'll share it. Any suggestions are still appreciated.
package org.example.checksum;

public class InternetChecksum {

  /**
   * Calculate the Internet Checksum of a buffer (RFC 1071 - http://www.faqs.org/rfcs/rfc1071.html)
   * Algorithm is
   * 1) apply a 16-bit 1's complement sum over all octets (adjacent 8-bit pairs [A,B], final odd length is [A,0])
   * 2) apply 1's complement to this final sum
   *
   * Notes:
   * 1's complement is bitwise NOT of positive value.
   * Ensure that any carry bits are added back to avoid off-by-one errors
   *
   * @param buf The message
   * @return The checksum
   */
  public long calculateChecksum(byte[] buf) {
    int length = buf.length;
    int i = 0;
    long sum = 0;
    long data;

    // Handle all pairs
    while (length > 1) {
      // Corrected to include @Andy's edits and various comments on Stack Overflow
      data = (((buf[i] << 8) & 0xFF00) | ((buf[i + 1]) & 0xFF));
      sum += data;
      // 1's complement carry bit correction in 16-bits (detecting sign extension)
      if ((sum & 0xFFFF0000) > 0) {
        sum = sum & 0xFFFF;
        sum += 1;
      }
      i += 2;
      length -= 2;
    }

    // Handle remaining byte in odd length buffers
    if (length > 0) {
      // Corrected to include @Andy's edits and various comments on Stack Overflow
      sum += (buf[i] << 8 & 0xFF00);
      // 1's complement carry bit correction in 16-bits (detecting sign extension)
      if ((sum & 0xFFFF0000) > 0) {
        sum = sum & 0xFFFF;
        sum += 1;
      }
    }

    // Final 1's complement value correction to 16-bits
    sum = ~sum;
    sum = sum & 0xFFFF;
    return sum;
  }
}
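Since Objective-C can call plain C directly, the same RFC 1071 steps can also be sketched in C, along with the receiver-side check: summing a message that already has its checksum appended yields zero. The function names here are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over len bytes, read as big-endian
 * 16-bit words; an odd trailing byte is treated as [A,0]. */
uint16_t internet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;
    assert(buf != NULL || len == 0);
    for (; i + 1 < len; i += 2) {
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
        sum = (sum & 0xFFFF) + (sum >> 16);   /* end-around carry */
    }
    if (i < len) {                            /* odd length: pad low byte */
        sum += (uint32_t)buf[i] << 8;
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* Receiver side: a buffer that includes its own checksum field
 * (at an even offset) sums to zero when intact. */
int internet_checksum_ok(const uint8_t *buf, size_t len)
{
    return internet_checksum(buf, len) == 0;
}
```

For the example bytes 00 01 F2 03 F4 F5 F6 F7 from RFC 1071, the one's complement sum is 0xDDF2, so the checksum is 0x220D.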
When I posted this question a year ago, I was still quite new to Objective-C. It turned out to be very easy to do.
How you calculate the checksum depends on how it is defined in your communication protocol. In my case, the checksum is just the sum of all the preceding bytes, that is, of the data you want to send.
So if I have an NSMutableData *cmd that has five bytes:
0x10 0x14 0xE1 0xA4 0x32
the checksum is the low byte of 0x10+0x14+0xE1+0xA4+0x32.
The sum is 0x01DB, so the checksum is 0xDB.
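The arithmetic above can be checked with a few lines of plain C. This is a minimal sketch; the function name sum8 is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low byte of the arithmetic sum of all bytes (the sum modulo 256). */
uint8_t sum8(const uint8_t *data, size_t len)
{
    uint8_t cs = 0;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        cs = (uint8_t)(cs + data[i]);   /* wraps modulo 256 */
    }
    return cs;
}
```

For the five bytes above, sum8 returns 0xDB.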
// i is the length of cmd
- (Byte)CalcCheckSum:(Byte)i data:(NSMutableData *)cmd
{
    Byte *cmdByte = (Byte *)malloc(i);
    memcpy(cmdByte, [cmd bytes], i);
    Byte local_cs = 0;
    int j = 0;
    while (i > 0) {
        local_cs += cmdByte[j];
        local_cs = local_cs & 0xff;
        i--;   // one byte fewer to process
        j++;   // move to the next byte
    }
    free(cmdByte);   // release the temporary copy
    return local_cs;
}
To use it:
Byte checkSum = [self CalcCheckSum:[command length] data:command];
Hope it helps.
If I try to send my CUDA device a struct which is larger than the available memory, will CUDA give me any kind of warning or error?
I'm asking because my GPU has 1024 MBytes (1073414144 bytes) of total global memory, but I don't know how I should handle an eventual problem.
Here's my code:
#define VECSIZE 2250000
#define WIDTH 1500
#define HEIGHT 1500

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
struct Matrix
{
    int width;
    int height;
    int* elements;
};

int main()
{
    Matrix M;
    M.width = WIDTH;
    M.height = HEIGHT;
    M.elements = (int *) calloc(VECSIZE,sizeof(int));
    int row, col;

    // define Matrix M
    // Matrix generator:
    for (int i = 0; i < M.height; i++)
    {
        for(int j = 0; j < M.width; j++)
        {
            row = i;
            col = j;
            if (i == j)
                M.elements[row * M.width + col] = INFINITY;
            else
            {
                M.elements[row * M.width + col] = (rand() % 2); // because 'rand() % 1' just does not seem to work at all.
                if (M.elements[row * M.width + col] == 0) // can't have zero weight.
                    M.elements[row * M.width + col] = INFINITY;
                else if (M.elements[row * M.width + col] == 2)
                    M.elements[row * M.width + col] = 1;
            }
        }
    }

    // Declare & send device Matrix to Device.
    Matrix d_M;
    d_M.width = M.width;
    d_M.height = M.height;
    size_t size = M.width * M.height * sizeof(int);
    cudaMalloc(&d_M.elements, size);
    cudaMemcpy(d_M.elements, M.elements, size, cudaMemcpyHostToDevice);

    int *d_k= (int*) malloc(sizeof(int));
    cudaMalloc((void**) &d_k, sizeof (int));

    int *d_width=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_width, sizeof(int));
    unsigned int *width=(unsigned int*)malloc(sizeof(unsigned int));
    width[0] = M.width;
    cudaMemcpy(d_width, width, sizeof(int), cudaMemcpyHostToDevice);

    int *d_height=(int*)malloc(sizeof(int));
    cudaMalloc((void**) &d_height, sizeof(int));
    unsigned int *height=(unsigned int*)malloc(sizeof(unsigned int));
    height[0] = M.height;
    cudaMemcpy(d_height, height, sizeof(int), cudaMemcpyHostToDevice);

    /* et cetera ... */
}
While you may not currently be sending enough data to the GPU to max out its memory, when you do, cudaMalloc will return the error code cudaErrorMemoryAllocation, which, per the CUDA API docs, signals that the memory allocation failed. I note that your example code does not check the return values of the CUDA calls. These return codes need to be checked to make sure your program is running correctly. The CUDA API does not throw exceptions: you must check the return codes. See this article for info on checking the errors and getting meaningful messages about them.
If you are using cutil.h, then it provides two very useful macros:
CUDA_SAFE_CALL (used while issuing functions like cudaMalloc, cudaMemcpy etc.)
CUT_CHECK_ERROR (used after executing a kernel to check for errors in kernel execution).
They take care of the errors, if any, by using the error checking mechanism detailed in the article provided by flipchart.