why 4 in TCP timeout formula - network-programming

In the formula for TCP timeout calculation, we multiply the deviation by 4. My question is: why do we multiply by 4 and not some other number?
FORMULA
Timeout_Interval = Estimated_RTT + 4*(Deviation_RTT)

If you think in terms of standard deviations, then +1σ covers about 84.1% of a normal distribution, +2σ about 97.7%, +3σ about 99.9%, and +4σ about 99.997%.
Deviation_RTT plays much the same role, so this equation avoids spurious retransmission for the vast majority of packets that are sent. You could make the multiplier larger to get slightly fewer unnecessary retransmissions, at the cost of being slower to pull the trigger when a packet really is lost.

Related

Bit error rate uncoded vs Bit error rate digital communication

[Figure: BER vs Eb/No for uncoded BPSK and Hamming(7,4)-coded BPSK over an AWGN channel]
Above is a graph showing the BER (bit error rate) at different Eb/No values using BPSK over an AWGN channel. The pink curve shows the BER of the uncoded system (without channel encoder and decoder), while the black curve represents the BER of the system that uses a Hamming(7,4) code for channel encoding. However, I can't explain why the two curves intersect and cross over at 6 dB.
I started writing this as a comment and it got long. It makes sense to me, but you may want to do more research beyond this.
Note: I am aware BER is normally measured over seconds of data, but for our purposes we will look at something much smaller.
My first assumption (based on your graph) is that the plotted BER is measured on the decoded data, not on the raw signal. If the channel gives 1 error every 7 bits, the Hamming-encoded system can correct that single error per 7-bit codeword, so it delivers 0 errors, compared to 1 in 7 for the uncoded system.
Initial:
Unencoded: 1 error every 7 bits received
Hamming(7,4): 0 errors every 4 bits (if corrected)
Now let's increase the noise, thereby increasing the error rate of the entire signal.
Highly increased BER:
Unencoded: 3.5 errors in 7 bits (50%) (multiple sequences to get an average)
Hamming(7,4): 2 errors in 4 bits (50%)
Somewhere during that increase in BER the two curves must cross over, as you are seeing on your graph. Beyond the crossover I would expect the Hamming side to be worse, because each error costs more: there is less actual data per transmitted bit (lower data density). I am sure you could calculate this mathematically; unfortunately, it would take me more time than I care to spend, since it intuitively makes sense to me.
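If you want to check this intuition numerically, here is a rough Python sketch (not tied to your simulation) using the textbook expressions: the uncoded BPSK BER, the raw channel BER of the coded system at the reduced energy per coded bit (code rate 4/7), and the common bounded-distance-decoding approximation for the post-decoding BER of a single-error-correcting code. Treat the printed numbers as model estimates, not a reproduction of your curves.

from math import erfc, sqrt, comb

def Q(x):                        # Gaussian tail probability
    return 0.5 * erfc(x / sqrt(2))

def uncoded_ber(ebn0):           # BPSK over AWGN
    return Q(sqrt(2 * ebn0))

def hamming74_ber(ebn0, n=7, k=4, t=1):
    p = Q(sqrt(2 * (k / n) * ebn0))   # channel BER at reduced energy per coded bit
    # common approximation for post-decoding BER with t-error-correcting hard decoding
    return sum((i + t) * comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(t + 1, n + 1)) / n

for db in range(0, 11):
    ebn0 = 10 ** (db / 10)
    print(db, uncoded_ber(ebn0), hamming74_ber(ebn0))

For this simplified model the two columns swap order in the same neighbourhood as the crossover in your plot: below it the coded system is worse (its raw channel errors overwhelm the single-error correction), above it the coding gain wins.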

CUDA: Best number of pixel computed per thread (grayscale)

I'm working on a program to convert an image to grayscale. I'm using the CImg library. For each pixel, I have to read the three R, G, B values, compute the corresponding gray value, and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:
Microarchitecture: Fermi
Compute capability (version): 2.0
Cores per SM (warp size): 32
Streaming Multiprocessors: 15
Maximum number of resident warps per multiprocessor: 48
Maximum amount of shared memory per multiprocessor: 48KB
Maximum number of resident threads per multiprocessor: 1536
Number of 32-bit registers per multiprocessor: 32K
I'm using a square grid with blocks of 256 threads.
This program can take input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that increasing the number of pixels assigned to each thread increases performance, so it is better than computing one pixel per thread. The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to take this number into account? The following is the code executed by the kernel.
for (i = (blockIdx.x * blockDim.x) + threadIdx.x;    // global thread index
     i < width * height;
     i += gridDim.x * blockDim.x) {                   // grid-stride loop over all pixels
    float r = static_cast<float>(inputImage[i]);                         // R plane
    float g = static_cast<float>(inputImage[(width * height) + i]);      // G plane
    float b = static_cast<float>(inputImage[(2 * width * height) + i]);  // B plane
    float grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);  // luminance weights
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast<unsigned char>(grayPix);
}
The problem is: how can I determine the number of pixels to assign to each thread statically? By running tests with every possible number?
You've mentioned an observed characteristic:
I observed that increasing the number of pixels assigned to each thread increases performance
This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.
The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.
So the steps are fairly simple:
Launch as many warps as the machine can handle instantaneously
Put all remaining work in the thread code, if possible.
Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.
My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 elements per thread.
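Plugging the GTX 480 numbers from the question into that same arithmetic (15 SMs, 48 resident warps per SM, 256-thread blocks, and the 10000x10000 image as the larger example), and assuming occupancy isn't otherwise limited by registers or shared memory:

# Sizing sketch (Python) using the deviceQuery figures quoted in the question.
sms              = 15      # streaming multiprocessors
max_warps_per_sm = 48      # maximum resident warps per SM
warp_size        = 32
block_size       = 256     # threads per block, as in the question

resident_threads = sms * max_warps_per_sm * warp_size    # 23040 threads fit at once
blocks           = resident_threads // block_size        # 90 blocks saturate residency

width, height = 10000, 10000
pixels_per_thread = (width * height + resident_threads - 1) // resident_threads
print(blocks, pixels_per_thread)   # ~90 blocks, ~4341 pixels handled per thread

So a grid of roughly 90 blocks of 256 threads fills the machine, and the grid-stride loop then naturally hands each thread a few thousand pixels for the large image.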
Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.
Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.

CAN busoff with CAN_High and CAN_Low shorted

It is a known fact that short-circuiting CAN_High and CAN_Low on a CAN bus leads to a bus-off condition.
With respect to the physical layer, how does this short lead to the bus-off condition?
CAN is a differential protocol: whether a bit is a 0 or a 1 (to be precise, recessive or dominant) is decided by the voltage difference between the CANH and CANL lines.
When you short these two lines together, there is no voltage difference, and that falls within the voltage range of a recessive bit. In other words, a shorted bus is read as a continuous stream of recessive bits.
For a transmitting node this immediately produces errors: when it drives a dominant bit but reads back a recessive level, that is a bit error, and six or more consecutive identical bits where a stuff bit is expected is a stuff error.
Each such transmit error increments the node's transmit error counter (TEC), and when the counter exceeds 255, the CAN controller goes into the BUS_OFF state.
Because the lines are shorted, every transmission attempt fails, so the error counter exceeds 255 in no time, which leads to BUS_OFF.
The CAN protocol does have a bus-recovery mechanism, in which the controller waits to observe 11 consecutive recessive bits 128 times (which it will, since the shorted bus reads as recessive); but the next transmission attempt produces the same errors, and the node ends up back in BUS_OFF.
This cycle will continue!
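As a toy illustration (not a full CAN error state machine), assume the standard counting rule that a transmitter adds 8 to its transmit error counter for each failed transmission; on a shorted bus every attempt fails, so bus-off is reached within a few dozen frames:

# Toy Python model of the counter behaviour described above.
def transmissions_until_bus_off(tec_increment=8, bus_off_threshold=255):
    tec, attempts = 0, 0
    while tec <= bus_off_threshold:   # every attempt fails on a shorted bus
        tec += tec_increment          # transmit error detected -> TEC += 8
        attempts += 1
    return attempts

print(transmissions_until_bus_off())  # 32 failed attempts push the TEC past 255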

Jitter calculation in Wireshark

I have a query regarding the Jitter calculation method in Wireshark.
Wireshark calculates jitter according to RFC3550 (RTP):
If Si is the RTP timestamp from packet i, and Ri is the time of arrival in RTP timestamp units for packet i, then for two packets i and j, D may be expressed as
D(i,j) = (Rj - Ri) - (Sj - Si) = (Rj - Sj) - (Ri - Si)
The interarrival jitter SHOULD be calculated continuously as each data packet i is received from source SSRC_n, using this difference D for that packet and the previous packet i-1 in order of arrival (not necessarily in sequence), according to the formula
J(i) = J(i-1) + (|D(i-1,i)| - J(i-1))/16
Now, here the absolute value of the inter-arrival difference is used. My question is: why take the absolute value when the difference can also be negative? I think that if we kept the negative values as well, we would get a value closer to the actual jitter than the one we compute at present.
Also, when we plot the jitter distribution graph using the above method, it won't be centered around zero, since we have made all the values positive, and that graph won't look realistic.
Can someone clarify my query?
Wikipedia has a good definition of jitter:
Jitter is the undesired deviation from true periodicity of an assumed
periodic signal...
A jitter value of zero means the signal has no variation from the expected value. As the variation increases (the packets are getting bunched up and spread out) the jitter increases in magnitude.
The bunching and spreading are really the same effect; bunching in one place causes spreading in another so this 'bunching and spreading' doesn't have a direction, just a magnitude.
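The quoted formula is only a few lines of code. Here is a minimal Python sketch of the running estimate, with made-up arrival times and timestamps (both already in RTP timestamp units):

# Sketch of the RFC 3550 estimator quoted above.
def interarrival_jitter(packets):
    # packets: list of (arrival_time R_i, rtp_timestamp S_i) pairs, in timestamp units
    j = 0.0
    for (r_prev, s_prev), (r, s) in zip(packets, packets[1:]):
        d = (r - r_prev) - (s - s_prev)   # D(i-1, i)
        j += (abs(d) - j) / 16.0          # J(i) = J(i-1) + (|D| - J(i-1))/16
    return j

# perfectly periodic arrivals -> jitter stays 0; perturbed arrivals -> positive jitter
print(interarrival_jitter([(0, 0), (160, 160), (320, 320), (480, 480)]))
print(interarrival_jitter([(0, 0), (190, 160), (310, 320), (505, 480)]))

If you drop the abs(), the positive and negative differences largely cancel and J hovers near zero no matter how badly the arrivals vary, so the signed version would not measure the magnitude of the variation at all.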
I hope this helps - it was the best explanation I could come up with.

Can FFT length affect filtering accuracy?

I am designing a fractional-delay filter; my order-5 Lagrange coefficients h(n) have 6 taps in the time domain. I tested convolving h(n) with x(n), a 5000-sample signal, in MATLAB, and the result seems fine. When I tried the FFT and IFFT method instead, the output was totally wrong. My FFT is computed with 8192 points in the frequency domain, which is the nearest power of 2 for my 5000-sample signal. For the IFFT portion, I convert the 8192 frequency-domain points back to 5000 samples in the time domain. So the problem is: why does this work with convolution but not with FFT multiplication? Does converting my 6-tap h(n) to 8192 points in the frequency domain cause this problem?
I have also tried the overlap-save method, which performs the FFT and multiplication on smaller chunks of x(n), doing it 5 times separately. The result seems slightly better than before, and at least I can see the waveform pattern, but it is still slightly distorted. So, any idea what goes wrong, and what is the solution? Thank you.
The reason I am implementing the circular convolution in the frequency domain instead of the time domain is that I am trying to merge the Lagrange filter with another low-pass filter in the frequency domain, so that the implementation can be more efficient. Of course, I do believe filtering in the frequency domain will be much faster than convolution in the time domain. The LP filter has 120 taps in the time domain. Due to memory constraints, the raw data including the padding will be limited to 1024 samples in length, and so will the FFT bins.
Because my Lagrange filter has only 6 taps, which is hugely different from 1024 taps, I suspect that taking the FFT of the 6 taps to 1024 bins in the frequency domain may cause the error. Here is my MATLAB code for the Lagrange filter only. This is just test code, not implementation code. It's a bit messy, sorry about that. I would really appreciate any further advice on this problem. Thank you.
t = 1:5000;
fs = 2.5*(10^12);
A = 70000;
x = A*sin(2*pi*10.*t.*(10^6).*t./fs);        % 5000-sample test chirp

% order-5 Lagrange fractional-delay filter (6 taps)
delay = 0.4;
N = 5;
n = 0:N;
h = ones(1, N+1);
for k = 0:N
    index = find(n ~= k);
    h(index) = h(index) * (delay-k) ./ (n(index)-k);
end

pad = zeros(1, length(h)-1);
out = [];
H = fft([h zeros(1, 1024-length(h))]);       % zero-pad the 6 taps to 1024 bins

% block-wise frequency-domain filtering
for i = 0:1:ceil(length(x)/(1024-length(h)+1))-1
    if (i ~= ceil(length(x)/(1024-length(h)+1))-1)
        a = x(1, i*(1024-length(h)+1)+1 : (i+1)*(1024-length(h)+1));
    else
        temp = x(1, i*(1024-length(h)+1)+1 : length(x));
        a = [temp zeros(1, 1024-length(h)+1-length(temp))];   % zero-pad the last block
    end
    xx = [pad a];                            % prepend the carried-over samples
    X = fft(xx, 1024);
    Y = H.*X;
    y = abs(ifft(Y, 1024));                  % magnitude of the IFFT output
    out = [out y(1, length(h):length(y))];
    pad = y(1, length(a)+1:length(y));       % carry the tail into the next block
end
Some comments:
The nearest power of two is actually 4096. Do you expect the remaining 904 samples to contribute much? I would guess that they are significant only if you are looking for relatively low-frequency features.
How did you pad your signal out to 8192 samples? Padding your signal out to 8192 implies that approximately 40% of your data is "fictional". If you used zeros to lengthen your dataset, you likely injected a step change at the pad point - which implies a lot of high-frequency content.
A short code snippet demonstrating your methods couldn't hurt.
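Not your MATLAB, but here is a minimal NumPy sketch of the underlying point: FFT multiplication reproduces ordinary (linear) convolution only when both sequences are zero-padded to at least length(x) + length(h) - 1, so that the circular wrap-around falls outside the useful output. The signals below are random stand-ins for your x and h:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)          # stand-in for the 5000-sample signal
h = rng.standard_normal(6)             # stand-in for the 6-tap fractional-delay filter

n = int(2 ** np.ceil(np.log2(len(x) + len(h) - 1)))    # 8192 here
y_fft  = np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(h, n)).real[:len(x) + len(h) - 1]
y_time = np.convolve(x, h)             # reference time-domain convolution

print(np.max(np.abs(y_fft - y_time)))  # should be at floating-point round-off level

Also note that abs(ifft(...)) in your MATLAB loop discards the sign of the output samples; when the filter and input are real, taking the real part instead is usually what you want.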
