Dropped packets even if rte_eth_rx_burst does not return a full burst

I have a weird drop problem, and to understand my question the best way is to have a look at this simple snippet:
while( 1 )
{
    if( config->running == false ) {
        break;
    }

    num_of_pkt = rte_eth_rx_burst( config->port_id,
                                   config->queue_idx,
                                   buffers,
                                   MAX_BURST_DEQ_SIZE);

    if( unlikely( num_of_pkt == MAX_BURST_DEQ_SIZE ) ) {
        rx_ring_full = true; //probably not the best name
    }

    if( likely( num_of_pkt > 0 ) )
    {
        pk_captured += num_of_pkt;

        num_of_enq_pkt = rte_ring_sp_enqueue_bulk(config->incoming_pkts_ring,
                                                  (void*)buffers,
                                                  num_of_pkt,
                                                  &rx_ring_free_space);
        //if num_of_enq_pkt == 0 free the mbufs..
    }
}
This loop is retrieving packets from the device and pushing them into a queue for further processing by another lcore.
When I run a test with a Mellanox card sending 20M (20878300) packets at 2.5 Mpps, the loop seems to miss some packets and pk_captured always ends up around 19M or so.
rx_ring_full is never true, which means that num_of_pkt is always < MAX_BURST_DEQ_SIZE, so according to the documentation I should not have drops at the HW level. Also, num_of_enq_pkt is never 0, which means that all the packets are enqueued.
Now, if I remove the rte_ring_sp_enqueue_bulk call from that snippet (and make sure to release all the mbufs), then pk_captured is always exactly equal to the number of packets I've sent to the NIC.
So it seems (though I can't quite accept the idea) that rte_ring_sp_enqueue_bulk is somehow too slow, and between one call to rte_eth_rx_burst and the next some packets are dropped due to a full ring on the NIC. But then why is num_of_pkt (from rte_eth_rx_burst) always smaller than MAX_BURST_DEQ_SIZE (much smaller), as if there were always sufficient room for the packets?
Note, MAX_BURST_DEQ_SIZE is 512.
edit 1:
Perhaps this information might help: the drops also seem to be visible via rte_eth_stats_get, or to be more correct, no drops are reported (imissed and ierrors are 0) but the value of ipackets equals my counter pk_captured (have the missing packets just disappeared??)
edit 2:
According to ethtool, rx_crc_errors_phy is zero and all the packets are received at the PHY level (rx_packets_phy is updated with the correct number of transferred packets).
The value of rx_nombuf from rte_eth_stats seems to contain garbage (this is a print from our test application):
OUT(4): Port 1 stats: ipkt:19439285,opkt:0,ierr:0,oerr:0,imiss:0, rxnobuf:2061021195718
For a transfer of 20M packets, as you can see, rxnobuf is either garbage OR it has a meaning which I do not understand. The log is generated by:
log("Port %"PRIu8" stats: ipkt:%"PRIu64",opkt:%"PRIu64",ierr:%"PRIu64",oerr:%"PRIu64",imiss:%"PRIu64", rxnobuf:%"PRIu64,
config->port_id,
stats.ipackets, stats.opackets,
stats.ierrors, stats.oerrors,
stats.imissed, stats.rx_nombuf);
where stats comes from rte_eth_stats_get.
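(For completeness, the stats structure is filled along these lines; the error handling shown here is only illustrative, and config/log are the same names used in the snippets above.)

struct rte_eth_stats stats;

/* rte_eth_stats_get() returns 0 on success and a negative value on failure. */
if (rte_eth_stats_get(config->port_id, &stats) != 0)
    log("Failed to read stats for port %"PRIu8, config->port_id);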
The packets are not generated on the fly but replayed from an existing PCAP.
edit 3:
After the answer from Adriy (thanks!) I've included the xstats output for the Mellanox card. While reproducing the same problem with a smaller set of packets, I can see that rx_mbuf_allocation_errors gets updated, but it seems to contain garbage:
OUT(4): rx_good_packets = 8094164
OUT(4): tx_good_packets = 0
OUT(4): rx_good_bytes = 4211543077
OUT(4): tx_good_bytes = 0
OUT(4): rx_missed_errors = 0
OUT(4): rx_errors = 0
OUT(4): tx_errors = 0
OUT(4): rx_mbuf_allocation_errors = 146536495542
These counters also seem interesting:
OUT(4): tx_errors_phy = 0
OUT(4): rx_out_of_buffer = 257156
OUT(4): tx_packets_phy = 9373
OUT(4): rx_packets_phy = 8351320
Where rx_packets_phy is the exact number of packets I've been sending, and summing rx_out_of_buffer and rx_good_packets gives exactly that amount. So it seems that the mbufs get depleted and some packets are dropped.
I made a tweak in the original code: now I'm making a copy of the mbuf from the RX ring (using link) and then releasing the memory immediately; further processing is done on the copy by another lcore. Sadly this does not fix the problem: it turns out that to solve it I have to disable the packet processing and also release the packet copy (on the other lcore), which makes no sense.
Well, I will do a bit more investigation, but at least rx_mbuf_allocation_errors seems to need a fix here.

I guess debugging the rx_nombuf counter is the way to go. It might look like garbage, but in fact this counter does not reflect the number of dropped packets (as ierrors or imissed do), but rather the number of failed RX attempts.
Here is a snippet from the MLX5 PMD:
uint16_t
mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
{
    [...]
    while (pkts_n) {
        [...]
        rep = rte_mbuf_raw_alloc(rxq->mp);
        if (unlikely(rep == NULL)) {
            ++rxq->stats.rx_nombuf;
            if (!pkt) {
                /*
                 * no buffers before we even started,
                 * bail out silently.
                 */
                break;
So, the plausible scenario for the issue is as follows:
There is a packet in the RX queue.
There are no buffers in the corresponding mempool.
The application polls for new packets, i.e. calls in a loop: num_of_pkt = rte_eth_rx_burst(...)
Each time we call rte_eth_rx_burst(), the rx_nombuf counter gets increased.
Please also have a look at rte_eth_xstats_get(). For MLX5 PMD there is a hardware rx_out_of_buffer counter, which might confirm this theory.
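For example, here is a minimal sketch (an illustration, not code from the PMD or the question) of dumping the extended stats and checking rx_out_of_buffer; the port is assumed to be already configured and started:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rte_ethdev.h>

static void dump_xstats(uint16_t port_id)
{
    /* First call with NULL asks only for the number of available counters. */
    int n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;

    struct rte_eth_xstat *xstats = calloc(n, sizeof(*xstats));
    struct rte_eth_xstat_name *names = calloc(n, sizeof(*names));

    if (xstats != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xstats, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (int i = 0; i < n; i++) {
            printf("%s = %" PRIu64 "\n", names[i].name, xstats[i].value);
            /* On MLX5, a growing rx_out_of_buffer means packets were
             * dropped because no mbufs were available in the mempool. */
            if (strcmp(names[i].name, "rx_out_of_buffer") == 0 && xstats[i].value > 0)
                printf("  -> mempool ran out of mbufs\n");
        }
    }
    free(xstats);
    free(names);
}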

The solution to the missing packets is to change the ring API from bulk to burst. In DPDK there are two modes of ring operation: bulk and burst. In bulk dequeue mode, if 32 elements are requested and only 31 are available, the API returns 0 and nothing is transferred.
I have faced a similar issue too.
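As an illustration only (assuming the ring API where both the bulk and burst enqueue calls take a free_space pointer, as in the question), the RX loop could switch to the burst variant, which enqueues as many mbufs as fit and returns that count; the leftover mbufs then have to be freed explicitly:

num_of_pkt = rte_eth_rx_burst(config->port_id, config->queue_idx,
                              buffers, MAX_BURST_DEQ_SIZE);
if (likely(num_of_pkt > 0)) {
    pk_captured += num_of_pkt;

    /* Burst variant: enqueues 0..num_of_pkt mbufs and returns how many made it. */
    unsigned int enq = rte_ring_sp_enqueue_burst(config->incoming_pkts_ring,
                                                 (void **)buffers,
                                                 num_of_pkt,
                                                 &rx_ring_free_space);

    /* Free only the mbufs that did not fit into the ring. */
    for (unsigned int i = enq; i < num_of_pkt; i++)
        rte_pktmbuf_free(buffers[i]);
}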

Related

pyserial issues with high baudrate FTDI

I have the following setup:
An FPGA sending out data on UART at a baud rate of 3 Mbps. The data transmitted is a chunk of 1024 bytes sent at a variable periodicity ranging from 20 ms to 200 ms. (So even in the worst case, the data rate is far from 3 Mbps.)
An FTDI 232RG
A piece of Python running on my computer (Windows), basically doing the following: opening a COM port with pyserial at 3 Mbps, polling in_waiting until it reaches the size of a packet (1024 bytes), then formatting the received packet and printing it on the screen.
The script works well for a low repetition frequency, but I face issues with higher repetition rates (typically 20 ms). When the periodicity is 20 ms I eventually end up getting a buffer overflow somewhere before in_waiting. I checked the timing of my Python loop and it takes about 4 ms. So it looks like something upstream (in the FTDI or Windows) feeds the pyserial buffer with more than one packet within the 4 ms following a packet.
I tried changing the FTDI latency in the driver (from the 16 ms default down to a few ms) but it does not seem to help.
I am currently clueless about what is happening. Would you have any advice on how to better understand what is going on?
Thanks for your help!
You could create a "loop" between TX and RX and run the following code (tested with an FT2232H, so most likely you need to change the identifier string):
import time
import serial
import serial.tools.list_ports

print([(x[0], x[2]) for x in serial.tools.list_ports.comports()])
port = [x[0] for x in serial.tools.list_ports.comports() if "FT4Q1LJFB" in x[2]][0]
ser = serial.Serial(port, 12000000)

while True:
    t0 = time.time()
    counter = 0
    for i in range(1000):
        ser.write([1]*3000)
        recv = ser.read(ser.inWaiting())
        delta_t = time.time() - t0
        counter += len(recv)
    print(counter / delta_t)
For me, the following output is shown:
[('COM7', 'USB VID:PID=0403:6010 SER=FT4Q1LJFA'), ('COM8', 'USB VID:PID=0403:6010 SER=FT4Q1LJFB')]
0.0
0.0
0.0
0.0
96787.81184093593
1201991.0268273412
1201197.0857713912
1201166.9350959768
1201445.4072856384
You will notice that it is 0.0 in the beginning. This is because I connected RX and TX after starting the program, resulting in a ramp-up of the received bytes. This is the "default" mode, meaning 8 data bits + 1 start bit + 1 stop bit = 10 bits per word, which explains why "only" about 1.2 Mbytes per second are transmitted (12 Mbaud / 10 bits per word = 1.2 MB/s).

Why is my program reporting more captured packets than Wireshark?

I am writing a packet sniffer using pcap and Visual Studio. I've taken sample code for offline capturing and combined it with code that looks for an interface and captures packets live. This is what I have to display the packet information obtained from the capture:
while (int returnValue = pcap_next_ex(pcap, &header, &data) >= 0)
{
    // Print using printf. See printf reference:
    // http://www.cplusplus.com/reference/clibrary/cstdio/printf/

    // Show the packet number
    printf("Packet # %i\n", ++packetCount);

    // Show the size in bytes of the packet
    printf("Packet size: %d bytes\n", header->len);

    // Show a warning if the length captured is different
    if (header->len != header->caplen)
        printf("Warning! Capture size different than packet size: %ld bytes\n", header->len);

    // Show Epoch Time
    printf("Epoch Time: %d:%d seconds\n", header->ts.tv_sec, header->ts.tv_usec);

    // loop through the packet and print it as hexadecimal representations of octets
    // We also have a function that does this similarly below: PrintData()
    for (u_int i = 0; i < header->caplen; i++)
    {
        // Start printing on the next line after every 16 octets
        if ((i % 16) == 0) printf("\n");

        // Print each octet as hex (x), make sure there are always two characters (.2).
        printf("%.2x ", data[i]);
    }

    // Add two lines between packets
    printf("\n\n");
}
The problem I'm having is that if I run a Wireshark live capture and also run my program, both capture packets live, but Wireshark will show that it's capturing packet 20 while VS will show packetCount = 200. (Note: arbitrary numbers chosen to show that Wireshark hasn't captured many packets, but VS is counting extremely fast.)
From what I understand, it seems the while loop is just running much faster than the packets are coming in, so it's just printing information from the last packet it received over and over again until a new one comes in.
How can I get VS to only capture packets as they come in?
I was unaware that Visual Studio included a packet sniffer; did you mean "how can I get my application, which I'm building with Visual Studio, to only capture packets as they come in?"
If that's what you meant, then that's what your code is doing.
Unfortunately, what it's not doing is "checking whether a packet has actually come in".
To quote the man page for pcap_next_ex() (yeah, UN*X, but this applies to Windows and WinPcap as well):
pcap_next_ex() returns 1 if the packet was read without problems,
0 if packets are being read from a live capture, and the timeout
expired, -1 if an error occurred while reading the packet, and -2
if packets are being read from a ``savefile'', and there are no
more packets to read from the savefile. If -1 is returned,
pcap_geterr() or pcap_perror() may be called with p as an argument
to fetch or display the error text.
Emphasis on "0 if packets are being read from a live capture, and the timeout expired"; 0 means that you did not get a packet.
Do not act as if a packet were captured if pcap_next_ex() returned 0, only do so if it returned 1:
while (int returnValue = pcap_next_ex(pcap, &header, &data) >= 0)
{
    if (returnValue == 1) {
        // Print using printf. See printf reference:
        // http://www.cplusplus.com/reference/clibrary/cstdio/printf/

        // Show the packet number
        printf("Packet # %i\n", ++packetCount);

        ...

        // Add two lines between packets
        printf("\n\n");
    }
}
SOLUTION: So apparently, adding parentheses so that the return value is assigned first and only then compared against 0 fixes the issue (the declaration has to move out of the condition for this to compile):
int returnValue;
while ((returnValue = pcap_next_ex(pcap, &header, &data)) >= 0)
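Putting it together, here is a minimal self-contained sketch of the corrected loop (an illustration, not the poster's exact code; pcap is assumed to be an open live-capture handle), which skips the timeout case and reports errors:

#include <stdio.h>
#include <pcap.h>

struct pcap_pkthdr *header;
const u_char *data;
int returnValue;
int packetCount = 0;

while ((returnValue = pcap_next_ex(pcap, &header, &data)) >= 0)
{
    if (returnValue == 0)
        continue;   // live-capture timeout expired: no packet arrived, nothing to print

    printf("Packet # %i\n", ++packetCount);
    printf("Packet size: %u bytes\n", header->len);

    for (u_int i = 0; i < header->caplen; i++)
    {
        if ((i % 16) == 0) printf("\n");
        printf("%.2x ", data[i]);
    }
    printf("\n\n");
}

if (returnValue == -1)
    fprintf(stderr, "Error reading packet: %s\n", pcap_geterr(pcap));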

lua dissector for custom protocol

I have written several Lua Dissectors for custom protocols we use and they work fine. In order to spot problems with missing packets I need to check the custom protocol sequence numbers against older packets.
The IP source and destination addresses are always the same, from device A to device B.
Inside this packet we have one custom ID.
Each ID has a sequence number so device B can determine if a packet is missing. The sequence number increments by 256 and rolls over when it reaches 65535 (the field is 16 bits).
I have tried using a global dictionary, but when you scroll up and down the trace the dissector is rerun and the values change.
A couple of lines below show where the information is stored:
ID = buffer(0,6):bitfield(12,12)
SeqNum = buffer(0,6):bitfield(32,16)
Ideally I would like to flag, in each decoded frame, whether the previous sequence number is more than 256 away, and to produce a table listing all these bad frames.
No.  Src IP      Dst IP      ID  Seq
1    10.12.1.2   10.12.1.3   10  0
2    10.12.1.2   10.12.1.3   11  0
3    10.12.1.2   10.12.1.3   12  0
4    10.12.1.2   10.12.1.3   11  255
5    10.12.1.2   10.12.1.3   12  255
6    10.12.1.2   10.12.1.3   10  511   <- packet with seq 255 is missing
I have now managed to get the dissector to check the current packet against previous packets by using a global array where I store specific information about each frame. In the packet currently being dissected I start from the most recent packet and work my way back to the start to find a suitable packet.
dict[pinfo.number] = {frame = pinfo.number, dID = ID, dSEQNUM = SeqNum}

local frameCount = 0
local frameFound = false

while frameFound == false do
    if pinfo.number > frameCount then
        frameCount = frameCount + 1
        if dict[(pinfo.number - frameCount)] ~= nil then
            if dict[(pinfo.number - frameCount)].dID == dict[pinfo.number].dID then
                seq_difference = (dict[(pinfo.number)].dSEQNUM - dict[(pinfo.number - frameCount)].dSEQNUM)
                if seq_difference > 256 then
                    pinfo.cols.info = string.format('ID-%d SeqNum-%d missing packet(s) %d last frame %d ',
                                                    ID, SeqNum, seq_difference, dict[(pinfo.number - frameCount)].frame)
                end
                frameFound = true
            end
        end
    else
        frameFound = true
    end
end
I'm not sure I see a question to answer? If you're asking "how can I avoid having to deal with the dissector being invoked multiple times and screwing up the previous decoding of the values" - the answer to that is to use the pinfo.visited boolean. It will be false the first time a given packet is dissected, and true thereafter no matter how much clicking around the user does - until the file is reloaded or a new one is loaded.
To handle the reloading/new-file case, you'd hook into the init() call for your proto by defining a myproto.init() function, and in that you'd clear your entire array table.
Also, you might want to google for related questions/answers on ask.wireshark.org, as that site is more frequently used for Wireshark Lua API questions. For example, this question/answer is similar and related to your case.

Reassembling packets in a Lua Wireshark Dissector

I'm trying to write a dissector for the Safari Remote Debug protocol, which is based on bplists, and have been reasonably successful (current code is here: https://github.com/andydavies/bplist-dissector).
I'm running into difficulty with reassembling packets though.
Normally the protocol sends a packet with 4 bytes containing the length of the next packet, then the packet with the bplist in it.
Unfortunately some packets from the iOS simulator don't follow this convention, and the four bytes are either tacked onto the front of the bplist packet, or onto the end of the previous bplist packet, or the data is multiple bplists.
I've tried reassembling them using desegment_len and desegment_offset as follows:
function p_bplist.dissector(buf, pkt, root)
    -- length of data packet
    local dataPacketLength = tonumber(buf(0, 4):uint())
    local desiredPacketLength = dataPacketLength + 4

    -- if not enough data, indicate how much more we need
    if desiredPacketLength > buf:len() then
        pkt.desegment_len = dataPacketLength
        pkt.desegment_offset = 0
        return
    end

    -- have more than needed, so set offset for next dissection
    if buf:len() > desiredPacketLength then
        pkt.desegment_len = DESEGMENT_ONE_MORE_SEGMENT
        pkt.desegment_offset = desiredPacketLength
    end

    -- copy data needed
    buffer = buf:range(4, dataPacketLength)
    ...
What I'm attempting to do here is always force the size bytes to be the first four bytes of a packet to be dissected, but it doesn't work: I still see a 4-byte packet followed by an x-byte packet.
I can think of other ways of managing the extra four bytes on the front, but the protocol contains a lookup table that's 32 bytes from the end of the packet, so I need a way of accurately splicing the packet into bplists.
Here's an example cap: http://www.cloudshark.org/captures/2a826ee6045b #338 is an example of a packet where the bplist size is at the start of the data and there are multiple plists in the data.
Am I doing this right (from other questions on SO and examples around the web it seems I am), or is there a better way?
The TCP dissector packet-tcp.c has tcp_dissect_pdus(), which:
Loop for dissecting PDUs within a TCP stream; assumes that a PDU
consists of a fixed-length chunk of data that contains enough information
to determine the length of the PDU, followed by rest of the PDU.
There is no such function in the Lua API, but it is a good example of how to do it.
One more example. I used this a year ago for tests:
local slicer = Proto("slicer", "Slicer")

function slicer.dissector(tvb, pinfo, tree)
    local offset = pinfo.desegment_offset or 0
    local len = get_len() -- for tests I used a constant, but it can be taken from tvb

    while true do
        local nxtpdu = offset + len

        if nxtpdu > tvb:len() then
            pinfo.desegment_len = nxtpdu - tvb:len()
            pinfo.desegment_offset = offset
            return
        end

        tree:add(slicer, tvb(offset, len))
        offset = nxtpdu

        if nxtpdu == tvb:len() then
            return
        end
    end
end

local tcp_table = DissectorTable.get("tcp.port")
tcp_table:add(2506, slicer)

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels, then check if some value is the one expected (memcpy to the host); if it is I stop, if it isn't I launch the two kernels again.
The first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
    int obj = threadIdx.x;
    int ant = blockIdx.x;
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    *(data->added) = 1;

    while(*(data->added) == 1)
    {
        *(data->added) = 0;

        //check if obj fits
        int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
        fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));

        if(obj == 0)
            printf("ant %d going..\n", ant);
        __syncthreads();
        ...
The code goes on after this, but that printf never gets printed; the __syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking whether any variable within an array has some value on the host and deciding whether to keep iterating or not.
getElement simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After that printf it sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But with 8 blocks and 1024 threads this happens: the kernel doesn't work, and neither does 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work: nothing gets printed, even if the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can set this value.
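A minimal sketch of querying that limit at runtime with the CUDA runtime API (device 0 is assumed to be the GTX 570 here) would be something like:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query properties of device 0 and print its per-block thread limit.
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess)
        printf("%s: max threads per block = %d\n",
               prop.name, prop.maxThreadsPerBlock);
    return 0;
}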
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers from a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by testing the return codes of all CUDA function calls for errors.
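For example, a minimal error-checking sketch around the launch (the launch configuration and the device_data pointer are illustrative assumptions; aco_step is the kernel from the question):

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative launch: 8 blocks of 1024 threads, as in the failing case.
aco_step<<<8, 1024>>>(device_data);   // device_data: assumed KPDeviceData* in device memory

// Launch-time errors (e.g. too many threads or resources per block) show up here.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

// Errors that occur while the kernel runs show up after synchronizing.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));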
