Dealing with fragmented UDP - network-programming

Ok, before anyone asks, TCP is not an option.
So, I'm sending some messages over UDP. Each message has a 4 byte length field at the beginning. Thus far, I've been using this field to determine if I have a complete message.
But I wondered, if I have two large messages, large enough that they are both fragmented like this:
Message 1
Length 1 | Fragment 1-1 | Fragment 1-2 | Fragment 1-3
Message 2
Length 2 | Fragment 2-1 | Fragment 2-2 | Fragment 2-3
and I send one immediately after the other, is it possible for them to be delivered interleaved like this:
Length 1 | Length 2 | Fragment 1-1 | Fragment 2-1 | Fragment 1-2 | Fragment 2-2 | Fragment 1-3 | Fragment 2-3
And if so, how can I possibly reassemble these if I don't have any control of how the message is fragmented?
EDIT: Also.. It just occurred to me that UDP might not fragment, and the "fragmentation" I'm seeing might be from calling the .receive() method with a fixed size buffer. So maybe this is not even a problem. Can anyone confirm if UDP fragments?

No, it isn't possible. UDP datagrams are delivered entire and intact or not at all; IP may fragment a large datagram on the wire, but it is reassembled before it is handed to your socket. You don't have to worry about interleaving or reassembly. All you have to worry about is non-delivery, duplicate delivery, and out-of-order delivery ;-)
If you think you're seeing fragmentation, you're really seeing a bug in your code - most likely a receive buffer that is smaller than the datagram, which silently truncates it.
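Since the question mentions calling .receive() with a fixed-size buffer, here is a minimal sketch of the receive side in C with BSD sockets (the port and buffer size are illustrative; the same rule applies to Java's DatagramSocket): size the buffer for the largest datagram you ever send, and one receive call then returns exactly one whole datagram.

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAX_DATAGRAM 65535   /* a UDP datagram can never exceed 64 KiB */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9000);               /* illustrative port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    unsigned char buf[MAX_DATAGRAM];
    /* One recvfrom() returns exactly one datagram. If buf were smaller than
       the datagram, the excess bytes would be silently discarded, which looks
       like "fragmentation" to the application. */
    ssize_t n = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
    printf("received one complete datagram of %zd bytes\n", n);
    return 0;
}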

Related

Algorithms for correlation of events/issues

We are working on a system that aims to help development teams, SRE, and DevOps team members by debugging many well-known infrastructure issues (k8s to begin with) on their behalf and generating a detailed report covering the specifics of the issue, possible root causes, and clear next steps for the users facing the problem. In short, instead of you having to open a terminal and run several commands to arrive at an issue, the system does it for you and shows it in a neat UI. We plan to leverage AI to provide better user experiences.
Questions:
1. There are several potential use cases like predictive analytics, anomaly detection, forecasting, etc. We will not analyze application logs or metrics (we may include metrics in the future). Unlike application-level logs, the platform logs are more unified. What is a good starting point for AI usage, especially for platform-based logs?
2. We plan to use AI to analyze issue correlations. We ran Apyori and FP-Growth and got output that looks like the following:
| antecedent                 | consequent         | confidence | lift |
|----------------------------|--------------------|------------|------|
| [Failed, FailedScheduling] | [BackOff]          | 0.75       | 5.43 |
| [NotTriggerScaleUp]        | [FailedScheduling] | 0.64       | 7.29 |
| [Failed]                   | [BackOff]          | 0.52       | 3.82 |
| [FailedCreatePodSandBox]   | [FailedScheduling] | 0.51       | 5.88 |
FP-Growth is a data-mining algorithm, and from the output we can identify patterns of events. One potential use case is to save the previous output and compare it with the latest output to detect abnormal patterns in the latest output. Can we use the output to infer issue correlations, or are there other scenarios where we can use it? (A worked sketch of how confidence and lift are computed follows this list.)
3. Some logs seem irrelevant but actually have connections; for example, when one host has an issue it will impact the applications running on it, and the time span may be long. How can we figure out this kind of relationship?
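For reference, here is a small worked sketch in C of how the confidence and lift columns relate to raw co-occurrence counts (the counts are hypothetical, chosen to reproduce the first row of the table above):

#include <stdio.h>

int main(void)
{
    /* Hypothetical counts over 1000 analyzed event windows. */
    double n_total      = 1000.0;  /* windows analyzed                               */
    double n_antecedent =  120.0;  /* windows containing {Failed, FailedScheduling}  */
    double n_consequent =  138.0;  /* windows containing {BackOff}                   */
    double n_both       =   90.0;  /* windows containing both itemsets               */

    double support_consequent = n_consequent / n_total;
    double confidence = n_both / n_antecedent;        /* P(consequent | antecedent)     */
    double lift = confidence / support_consequent;    /* > 1 means positive correlation */

    printf("confidence = %.2f, lift = %.2f\n", confidence, lift);  /* 0.75, 5.43 */
    return 0;
}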
Any comments and suggestions will be greatly appreciated, thank you in advance.

I'm failing to understand how the stack works

I'm building an emulator for the MOS6502 processor, and at the moment I'm trying to simulate the stack in code, but I'm really failing to understand how the stack works in the context of the 6502.
One of the features of the 6502's stack structure is that when the stack pointer reaches the end of the stack it will wrap around, but I don't get how this feature even works.
Let's say we have a stack with 64 maximum values. If we push the values x, y and z onto the stack, we now have the structure below, with the stack pointer pointing at address 0x62, because that was the last value pushed onto the stack.
+-------+
| x | 0x64
+-------+
| y | 0x63
+-------+
| z | 0x62 <-SP
+-------+
| | ...
+-------+
All well and good. But now, if we pop those three values off the stack, we have an empty stack, with the stack pointer pointing at address 0x64:
+-------+
| | 0x64 <-SP
+-------+
| | 0x63
+-------+
| | 0x62
+-------+
| | ...
+-------+
If we pop the stack a fourth time, the stack pointer wraps around to point at address 0x00, but what's even the point of doing this when there isn't a value at 0x00?? There's nothing in the stack, so what's the point in wrapping the stack pointer around????
I can understand this process when pushing values: if the stack is full and a value needs to be pushed, it'll overwrite the oldest value present on the stack. This doesn't work for popping.
Can someone please explain this because it makes no sense.
If we pop the stack a fourth time, the stack pointer wraps around to point at address 0x00, but what's even the point of doing this when there isn't a value at 0x00?? There's nothing in the stack, so what's the point in wrapping the stack pointer around????
It is not done for a functional reason. The 6502 architecture was designed so that pushing and popping could be done by decrementing or incrementing an 8-bit SP register without any additional checking. Checks for overflow or underflow of the SP register would involve more silicon to implement them, more silicon to implement the stack overflow / underflow handling ... and extra gate delays in a critical path.
The 6502 was designed to be cheap and simple using 1975-era chip technology1. Not fast. Not sophisticated. Not easy to program2.
1 - According to Wikipedia, the original design had ~3200 or ~3500 transistors. One of the selling points of the 6502 was that it was cheaper than its competitors. Fewer transistors meant smaller dies, better yields and lower production costs.
2 - Of course, this is relative. Compared to some ISAs, the 6502 is easy because it is simple and orthogonal, and you have so few options to choose from. But compared to others, the limitations that make it simple actually make it difficult; for example, the fact that there are at most 256 bytes in the stack page, which has to be shared by everything. It gets awkward if you are implementing threads or coroutines. Compare this with an ISA where the SP is a 16-bit register or the stack can be anywhere.
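In an emulator this behaviour falls out for free if you model SP as an 8-bit unsigned integer. A minimal sketch (the struct is illustrative, but the stack really does live in page one, 0x0100-0x01FF, on the 6502):

#include <stdint.h>

/* Illustrative emulator state: 64 KiB of memory and an 8-bit stack pointer. */
typedef struct {
    uint8_t memory[0x10000];
    uint8_t sp;              /* offset into page one (0x0100-0x01FF) */
} Cpu;

static void push(Cpu *cpu, uint8_t value)
{
    cpu->memory[0x0100 + cpu->sp] = value;
    cpu->sp--;               /* uint8_t arithmetic wraps 0x00 -> 0xFF with no checks */
}

static uint8_t pull(Cpu *cpu)
{
    cpu->sp++;               /* wraps 0xFF -> 0x00 with no checks */
    return cpu->memory[0x0100 + cpu->sp];
}

The wrap-around is simply what incrementing or decrementing that 8-bit register naturally does; the hardware never asks whether the location being read still holds a meaningful value.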

How to prevent "partial write" data corruption during power loss?

In an embedded environment (using MSP430), I have seen some data corruption caused by partial writes to non-volatile memory. This seems to be caused by power loss during a write (to either FRAM or info segments).
I am validating data stored in these locations with a CRC.
My question is, what is the correct way to prevent this "partial write" corruption? Currently, I have modified my code to write to two separate FRAM locations. So, if one write is interrupted causing an invalid CRC, the other location should remain valid. Is this a common practice? Do I need to implement this double write behavior for any non-volatile memory?
A simple solution is to maintain two versions of the data (in separate pages for flash memory): the current version and the previous version. Each version has a header comprising a sequence number and a word that validates the sequence number - simply the 1's complement of the sequence number, for example:
---------
| seq |
---------
| ~seq |
---------
| |
| data |
| |
---------
The critical thing is that when the data is written the seq and ~seq words are written last.
On start-up you read the data that has the highest valid sequence number (accounting for wrap-around perhaps - especially for short sequence words). When you write the data, you overwrite and validate the oldest block.
The solution you are already using is valid so long as the CRC is written last, but it lacks simplicity and imposes a CRC calculation overhead that may not be necessary or desirable.
On FRAM you have no concern about endurance, but this is an issue for flash memory and EEPROM. In that case I use a write-back cache method: the data is maintained in RAM, and when it is modified a timer is started (or restarted if it is already running); when the timer expires, the data is written. This prevents burst writes from thrashing the memory, and it is useful even on FRAM since it minimises the software overhead of data writes.
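A minimal sketch of that two-copy scheme in C (the nv_* functions, bank numbering and payload size are placeholders for whatever your FRAM/flash driver provides):

#include <stdint.h>
#include <stddef.h>

/* Illustrative layout: two fixed banks, each holding a header (seq and ~seq,
   written last) followed by the payload. */
typedef struct {
    uint16_t seq;
    uint16_t seq_inverted;   /* ~seq validates the sequence number */
    uint8_t  payload[64];
} Record;

/* Placeholders for the real non-volatile memory driver. */
extern void nv_read(int bank, Record *out);
extern void nv_write_payload(int bank, const uint8_t *data, size_t len);
extern void nv_write_header(int bank, uint16_t seq, uint16_t seq_inverted);

static int bank_valid(const Record *r)
{
    return r->seq_inverted == (uint16_t)~r->seq;
}

void save(const uint8_t *data, size_t len)
{
    Record a, b;
    nv_read(0, &a);
    nv_read(1, &b);

    /* Overwrite the invalid or older bank; the newer one stays intact as the fallback. */
    uint16_t newest = 0;
    int target = 0;
    if (bank_valid(&a)) { newest = a.seq; target = 1; }
    if (bank_valid(&b) && (!bank_valid(&a) || (int16_t)(b.seq - a.seq) > 0)) {
        newest = b.seq;
        target = 0;
    }

    uint16_t seq = (uint16_t)(newest + 1);
    nv_write_payload(target, data, len);
    /* The header goes last: a power cut before this point leaves the old copy intact. */
    nv_write_header(target, seq, (uint16_t)~seq);
}

On start-up the same bank_valid() test picks the copy with the highest valid sequence number.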
Our engineering team takes a two-pronged approach to this problem: solve it in hardware and software!
The first is a diode-and-capacitor arrangement to provide a few milliseconds of power during a brown-out. If we notice we've lost external power, we prevent the code from starting any non-volatile writes.
Second, our data is particularly critical for operation: it updates often and we don't want to wear out our non-volatile flash storage (it only supports so many writes), so we actually store the data 16 times in flash and protect each record with a CRC. On boot, we find the newest valid record and then start our erase/write cycles.
We've never seen data corruption since implementing our frankly paranoid system.
Update:
I should note that our flash is external to our CPU, so the CRC helps validate the data if there is a communication glitch between the CPU and the flash chip. Furthermore, if we experience several glitches in a row, the multiple writes protect against data loss.
We've used something similar to Clifford's answer but written in one write operation. You need two copies of the data and alternate between them. Use an incrementing sequence number so that effectively one location has even sequence numbers and one has odd.
Write the data like this (in one write command if you can):
---------
| seq |
---------
| |
| data |
| |
---------
| seq |
---------
When you read it back make sure both the sequence numbers are the same - if they are not then the data is invalid. At startup read both locations and work out which one is more recent (taking into account the sequence number rolling over).
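A minimal sketch of that record layout and the read-back checks (field names and payload size are illustrative):

#include <stdint.h>

/* Illustrative layout: the sequence number bookends the payload, and the
   whole record is written in a single write operation. */
typedef struct {
    uint16_t seq_head;
    uint8_t  data[64];
    uint16_t seq_tail;
} Record;

/* Valid only if the write completed: a power cut mid-write leaves
   seq_head and seq_tail disagreeing. */
static int record_valid(const Record *r)
{
    return r->seq_head == r->seq_tail;
}

/* Of two valid copies, the newer one wins; the signed difference copes
   with the sequence number rolling over. */
static int is_newer(uint16_t a, uint16_t b)
{
    return (int16_t)(a - b) > 0;
}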
Always store data with some kind of framing, like START_BYTE, total bytes to write, data, END_BYTE.
Before writing to external/internal memory, always check the power-monitor registers/ADC.
If your data gets corrupted, the END byte will usually be corrupted as well, so that entry will not be valid once the whole frame is checked.
A plain checksum is not a good idea; choose a CRC16 instead if you want to include a CRC in your protocol.
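If you do add a CRC16 to the frame, something like CRC-16/CCITT-FALSE is a common choice. A minimal bit-by-bit sketch (table-driven versions are faster; the polynomial and initial value here are just one widespread convention):

#include <stdint.h>
#include <stddef.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}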

Automatic people counting + twittering

I want to develop a system that accurately counts people going through a normal 1-2 m wide door, tweets whenever someone goes in or out, and reports how many people remain inside.
Now, the Twitter part is easy, but people counting is difficult. There are some semi-existing counting solutions, but they do not quite fit my needs.
My idea/algorithm:
Should I mount some infra-red camera on top of my door, constantly monitor it, divide the camera image into a grid, and work out who is entering and leaving?
Can you give me some suggestions and a starting point?
How about having two sensors about 6 inches apart? They could be those little beam sensors (you know, the ones that chime when you walk into some shops) placed on either side of the door jamb. We'll call the sensors S1 and S2.
If they are triggered in the order of S1 THEN S2 - this means a person came in
If they are triggered in the order of S2 THEN S1 - this means a person left.
-----------------------------------------------------------
| sensor | door jam | sensor |
-----------------------------------------------------------
| |
| |
| |
| |
S1 S2 this is inside the store
| |
| |
| |
| |
-----------------------------------------------------------
| sensor | door jam | sensor |
-----------------------------------------------------------
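A minimal sketch of the pairing logic in C (the event source is left abstract, and a real implementation would also debounce the sensors and time out incomplete pairs):

#include <stdio.h>

typedef enum { S1_TRIGGER, S2_TRIGGER } Event;

static int occupancy = 0;
static int pending = -1;   /* -1 = no first trigger seen yet */

/* Pairs consecutive triggers: S1 then S2 = entry, S2 then S1 = exit. */
void on_sensor(Event e)
{
    if (pending == -1) {
        pending = (int)e;
        return;
    }
    if (pending == S1_TRIGGER && e == S2_TRIGGER)
        occupancy++;                              /* someone came in */
    else if (pending == S2_TRIGGER && e == S1_TRIGGER && occupancy > 0)
        occupancy--;                              /* someone left */
    pending = -1;

    printf("people inside: %d\n", occupancy);     /* here you would post the tweet */
}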
If you would like to have the people filmed by a camera, you can try to segment the people in the image and track them using a Particle Filter for multi-object tracking.
http://portal.acm.org/citation.cfm?id=1561072&preflayout=flat
This is a paper by one of my professors. Maybe you wanna have a look at it.
If your camera is mounted and doesn't move, you can use a subtraction method to segment the moving people (basically just subtract two consecutive images, and all that remains is the things that moved). Then do some morphological operations on the result so only big parts (people) remain. Maybe even identify them by checking for rectangularity so you only keep "standing" objects.
Then use a Particle Filter to track the people in the scene automatically... and each new object would increase the counter...
If you want, I could maybe send you a presentation I held a while ago (unfortunately it's in German, but you can translate it).
Hope that helps...

Rationalizing what is going on in my simple OpenCL kernel in regards to global memory

const char programSource[] =
    "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
    "{"
    "    int gid = get_global_id(0);"
    "    for (int i = 0; i < 10; i++) {"
    "        a[gid] = b[gid] + c[gid];"
    "    }"
    "}";
The kernel above is a vector addition repeated ten times in a loop. I have used the programming guide and Stack Overflow to figure out how global memory works, but I still can't figure out by looking at my code whether I am accessing global memory in a good way. I am accessing it in a contiguous fashion and, I am guessing, in an aligned way. Does the card load 128-byte chunks of global memory for arrays a, b, and c? Does it then load the 128-byte chunks for each array once for every 32 gid indexes processed? (4*32=128) It seems like I am not wasting any global memory bandwidth then, right?
BTW, the compute profiler shows a gld and gst efficiency of 1.00003, which seems weird; I thought it would just be 1.0 if all my stores and loads were coalesced. How is it above 1.0?
Yes, your memory access pattern is pretty much optimal. Each half-warp is accessing 16 consecutive 32-bit words. Furthermore, the access is 64-byte aligned, since the buffers themselves are aligned and the start index for each half-warp is a multiple of 16. So each half-warp will generate one 64-byte transaction, and you shouldn't waste memory bandwidth through uncoalesced accesses.
Since you asked for examples in your last question, let's modify this code for another, less optimal access pattern (since the loop doesn't really do anything, I will ignore it):
kernel void vecAdd(global int *a, global int *b, global int *c)
{
    int gid = get_global_id(0);
    a[gid + 1] = b[gid * 2] + c[gid * 32];
}
First, let's see how this works on compute 1.3 (GT200) hardware.
For the writes to a this will generate a slightly suboptimal pattern (the following lists the half-warps by their gid range and the corresponding access pattern):
gid    | addr. offset | accesses     | reasoning
 0- 15 |   4- 67      | 1x128B       | in aligned 128-byte block
16- 31 |  68-131      | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
32- 47 | 132-195      | 1x128B       | in aligned 128-byte block
48- 63 | 196-259      | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
So basically we are wasting about half our bandwidth (the less-than-doubled access width for the odd half-warps doesn't help much, because it generates more accesses, which isn't faster than wasting more bytes, so to speak).
For the reads from b the threads access only even elements of the array, so for each half-warp all accesses lie in a 128-byte-aligned block (the first element is at a 128B boundary, since for that element the gid is a multiple of 16, so the index is a multiple of 32, which for 4-byte elements means the address offset is a multiple of 128B). The access pattern stretches over the whole 128B block, so this will do a 128B transfer for every half-warp, again wasting half the bandwidth.
The reads from c generate one of the worst-case scenarios, where each thread indexes into its own 128B block, so each thread needs its own transfer. On one hand this is a bit of a serialization scenario (although not quite as bad as usual, since the hardware should be able to overlap the transfers). What's worse is that this will transfer a 32B block for each thread, wasting 7/8 of the bandwidth (we access 4B/thread, and 32B/4B = 8, so only 1/8 of the bandwidth is utilized). Since this is the access pattern of naive matrix transposes, it is highly advisable to do those using local memory (speaking from experience); a sketch follows below.
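As a side note, the local-memory approach for transpose-like patterns typically looks something like the sketch below (the kernel name, 16x16 tile size and padding are illustrative and assume a 16x16 work-group; this is not taken from the question):

__kernel void transpose_tiled(__global const int *in, __global int *out,
                              int width, int height)
{
    __local int tile[16][16 + 1];          /* +1 pads away local-memory bank conflicts */

    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int lx = get_local_id(0);
    int ly = get_local_id(1);

    if (gx < width && gy < height)
        tile[ly][lx] = in[gy * width + gx];              /* coalesced read */

    barrier(CLK_LOCAL_MEM_FENCE);

    /* Swap the roles of x and y so the write is coalesced as well. */
    int ox = get_group_id(1) * 16 + lx;
    int oy = get_group_id(0) * 16 + ly;
    if (ox < height && oy < width)
        out[oy * height + ox] = tile[lx][ly];            /* coalesced write */
}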
Compute 1.0 (G80)
Here the only pattern which will create good accesses is the original; all patterns in the example will create completely uncoalesced accesses, wasting 7/8 of the bandwidth (32B transfer/thread, see above). For G80 hardware, every access where the nth thread in a half-warp doesn't access the nth element creates such uncoalesced accesses.
Compute 2.0 (Fermi)
Here every access to memory creates 128B transactions (as many as necessary to gather all data, so 16x128B in the worst case); however, those are cached, making it less obvious where data will be transferred. For the moment let's assume the cache is big enough to hold all data and there are no conflicts, so every 128B cacheline will be transferred at most once. Let's furthermore assume a serialized execution of the half-warps, so we have a deterministic cache occupation.
Accesses to b will still always transfer 128B blocks (no other thread indexes into the corresponding memory area). Accesses to c will generate 128B transfers per thread (worst access pattern possible).
For accesses to a it is the following (treating them like reads for the moment):
gid    | offset  | accesses | reasoning
 0- 15 |   4- 67 | 1x128B   | bringing 128B block to cache
16- 31 |  68-131 | 1x128B   | offsets 68-127 already in cache, bring 128B for 128-131 to cache
32- 47 | 132-195 | -        | block already in cache from last half-warp
48- 63 | 196-259 | 1x128B   | offsets 196-255 already in cache, bringing in 256-383
So for large arrays the accesses to a will waste almost no bandwidth theoretically.
For this example the reality is of course not quite as good, since the accesses to c will trash the cache pretty thoroughly.
For the profiler I would assume that the efficiencies over 1.0 are simply the result of floating-point inaccuracies.
Hope that helps
