When does librdkafka free its message payload?

My librdkafka consumer app's memory usage keeps increasing while it is busy consuming a huge volume of messages from Kafka (> 300,000 messages/s).
Messages are freed by "rd_kafka_message_destroy", but I found that it does not free the message payload because "rkm->rkm_flags" is equal to zero:
if (rkm->rkm_flags & RD_KAFKA_MSG_F_FREE && rkm->rkm_payload)
        rd_free(rkm->rkm_payload);
Strangely, when the consumer is idle, its memory usage declines quickly. Some docs say that librdkafka manages its payload buffers by reference count, but I cannot find any count decrease in "rd_kafka_message_destroy".
Does librdkafka free its buffers asynchronously? Or how can I free the payload immediately?


Does cacheline size affect memory access latency?

Intel architectures have had 64-byte cache lines for a long time. I am curious: if instead of 64-byte cache lines a processor had 32-byte or 16-byte cache lines, would this improve the RAM-to-register data transfer latency? If so, by how much? If not, why not?
Thank you.
Transferring a larger amount of data of course increases the communication time. But the increase is very small due to the way memory is organized, and it does not impact memory-to-register latency.
Memory access operations are done in three steps:
bitline precharge: the row address is sent and the internal buses of the memory are precharged (duration tRP)
row access: an internal row of the memory is read and written to internal latches. During that time, the column address is sent (duration tRCD)
column access: the selected columns are read from the row latches and start to be sent to the processor (duration tCL)
Row access is a long operation.
A memory is a matrix of cell elements. To increase the capacity of a memory, the cells must be made as small as possible, and when reading a row of cells, one has to drive a long, highly capacitive bus that runs along a memory column. The voltage swing is very low, and sense amplifiers are needed to detect the small voltage variations.
Once this operation is done, a complete row is held in latches; reading from the latches is fast, and the data is generally sent in burst mode.
Considering a typical DDR4 memory with a 1 GHz I/O clock, we generally have tRP/tRCD/tCL = 12-15cy/12-15cy/10-12cy, and the complete access time is around 40 memory cycles (if the processor frequency is 4 GHz, this is ~160 processor cycles). Data is then sent in burst mode twice per cycle, with 2x64 bits delivered every cycle. So the data transfer adds 4 cycles for 64 bytes, and it would add only 2 cycles for 32 bytes.
So reducing the cache line from 64 B to 32 B would reduce the transfer time by ~2/40 = 5%.
If the row address does not change, precharging and reading the memory row are not required and the access time is ~15 memory cycles. In that case, the relative time saved by transferring 32 B instead of 64 B is larger, but still limited: ~2/15 ≈ 13%.
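As a back-of-the-envelope summary of the two cases, using the cycle counts assumed above:

\[
\text{closed row: } \frac{t_{64\,\mathrm{B}} - t_{32\,\mathrm{B}}}{t_{RP} + t_{RCD} + t_{CL} + t_{64\,\mathrm{B}}} \approx \frac{4 - 2}{40} = 5\%
\qquad
\text{open row: } \frac{4 - 2}{15} \approx 13\%
\]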
Both evaluations ignore the extra time required to process a miss in the memory hierarchy, so the actual percentage would be even smaller.
Data can be sent "critical word first" by the memory. If the processor requires a given word, the address of this word is sent to the memory. Once the row is read, the memory sends this word first, then the other words in the cache line. So caches can serve the processor's request as soon as the first word is received, whatever the line size, and decreasing the line width would have no impact on cache latency. With this feature, the memory-to-register time would not change at all.
In recent processors, however, exchanges between the different cache levels are based on the full cache line width, and sending the critical word first does not bring any gain.
Besides that, large line sizes reduce compulsory misses thanks to spatial locality, and reducing the line size would have a negative impact on the cache miss rate.
Last, using larger cache lines increases the data transfer rate between cache and memory.
The only negative aspects of large cache lines (besides the small increase in transfer time) are that the number of lines in the cache is reduced and conflict misses may increase. But with the large associativity of modern caches, this effect is limited.
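A rough way to see the spatial-locality effect on a real machine is a stride microbenchmark. The sketch below (assuming 64 B lines and POSIX clock_gettime; sizes are illustrative) performs the same number of byte accesses with an 8 B and a 64 B stride, so the second run misses on nearly every access:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        (64u * 1024u * 1024u)   /* 64 MiB working set, power of two */
#define ACCESSES (16u * 1024u * 1024u)   /* same access count for both runs  */

static double walk(volatile unsigned char *a, size_t stride) {
    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ACCESSES; i++) {
        a[idx]++;
        idx = (idx + stride) & (N - 1);  /* wrap inside the array */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    unsigned char *a = calloc(N, 1);
    if (!a) return 1;
    printf("stride  8 B: %.3f s\n", walk(a, 8));   /* ~1 miss per 8 accesses */
    printf("stride 64 B: %.3f s\n", walk(a, 64));  /* ~1 miss per access     */
    free(a);
    return 0;
}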

How do I receive arbitrary length data using a UdpSocket?

I am writing an application which sends and receives packets using UDP. However, the documentation of recv_from states:
If a message is too long to fit in the supplied buffer, excess bytes may be discarded.
Is there any way to receive all bytes and write them into a vector? Do I really have to allocate an array with the maximum packet length (which, as far as I know, is 65,507 bytes for IPv4) in order to be sure to receive all data? That seems a bit much for me.
Check out the next method in the docs, UdpSocket::peek_from (emphasis mine):
Receives a single datagram message on the socket, without removing it from the queue.
You can use this method to read a known fixed amount of data, such as a header which contains the length of the entire packet. You can use crates like byteorder to decode the appropriate part of the header, use that to allocate exactly the right amount of space, then call recv_from.
This does require that the protocol you are implementing always provides that total size information at a known location.
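Rust's UdpSocket::peek_from corresponds to recvfrom with the MSG_PEEK flag underneath. As a sketch of the same header-first pattern in C (the 4-byte big-endian length field at offset 0 is an assumed protocol detail, not something UDP provides):

#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Returns the datagram length and a malloc'd buffer in *out, or -1 on error. */
ssize_t recv_exact(int sock, uint8_t **out) {
    uint8_t hdr[4];
    uint32_t len;
    /* Peek at the length header without consuming the datagram. */
    if (recvfrom(sock, hdr, sizeof hdr, MSG_PEEK, NULL, NULL) != (ssize_t)sizeof hdr)
        return -1;
    memcpy(&len, hdr, sizeof len);
    len = ntohl(len);                    /* hypothetical length field */
    /* Real code should clamp len to a sane maximum before trusting it. */
    uint8_t *buf = malloc(len);
    if (!buf)
        return -1;
    /* Now consume the whole datagram into the exactly-sized buffer. */
    ssize_t n = recvfrom(sock, buf, len, 0, NULL, NULL);
    if (n < 0) {
        free(buf);
        return -1;
    }
    *out = buf;
    return n;
}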
Now, is this a good idea?
As ArtemGr states:
Because extra system calls are much more expensive than getting some space from the stack.
And from the linked question:
Obviously at some point you will start wondering if doubling the number of system calls to save memory is worth it. I think it isn't.
With the recent Spectre / Meltdown events, now's a pretty good time to be reminded to avoid extra syscalls.
You could, as suggested, just allocate a "big enough" array ahead of time. You'll need to track how many bytes you've actually read vs allocated though. I recommend something like arrayvec to make it easier.
You could instead implement a pool of pre-allocated buffers on the heap. When you read from the socket, you take a buffer from the pool or create a new one. When you are done with the buffer, you put it back in the pool for reuse. That way, you incur each memory allocation only once and are only passing around small Vecs on the stack.
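A minimal single-threaded sketch of such a pool, written in C for concreteness (the names and the intrusive free list are made up; a Rust version would more likely keep a Vec of recycled Vec<u8>s):

#include <stddef.h>
#include <stdlib.h>

#define BUF_SIZE 65507   /* max IPv4 UDP payload */

/* A tiny free list of fixed-size heap buffers. Not thread-safe. */
struct buf { struct buf *next; unsigned char data[BUF_SIZE]; };
static struct buf *pool;

unsigned char *buf_get(void) {
    if (pool) {                          /* reuse a returned buffer */
        struct buf *b = pool;
        pool = b->next;
        return b->data;
    }
    struct buf *b = malloc(sizeof *b);   /* pool empty: allocate once */
    return b ? b->data : NULL;
}

void buf_put(unsigned char *data) {
    /* Recover the containing struct and push it back on the free list. */
    struct buf *b = (struct buf *)(data - offsetof(struct buf, data));
    b->next = pool;
    pool = b;
}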
See also:
How can I create a stack-allocated vector-like container?
How large should my recv buffer be when calling recv in the socket library
How to read UDP packet with variable length in C

Flash Memory Management

I'm collecting data on an ARM Cortex M4 based evaluation kit in a remote location and would like to log the data to persistent memory for access later.
I would be logging roughly 300 bytes once every hour, and would want to come collect all the data with a PC after roughly 1 week of running.
I understand that I should attempt to minimize the number of writes to flash, but I don't have a great understanding of the best way to do this. I'm looking for a resource that would explain memory management techniques for this kind of situation.
I'm using the ADUCM350, which looks like it has 3 separate flash sections (128 kB, 256 kB, and a 16 kB EEPROM).
For logging applications the simplest and most effective wear leveling tactic is to treat the entire flash array as a giant ring buffer.
Define an entry size that is an integer fraction of the smallest erasable flash unit. Say a sector is 4 kB (4096 bytes); let the entry size be 256 bytes.
This keeps sector boundaries aligned with log entry boundaries and will allow you to erase any sector without cutting a log entry in half.
At boot, walk the memory and find the first empty entry. This is the write_pointer.
When a log entry is written, simply write it at write_pointer and increment write_pointer.
If write_pointer lands on a sector boundary, erase the sector at write_pointer to make room for the next writes. Essentially, this guarantees that there is always at least one empty log entry for you to find at boot, which lets you restore the write_pointer.
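A minimal sketch of this scheme in C; flash_read(), flash_write() and flash_erase_sector() are hypothetical HAL primitives standing in for whatever the ADUCM350 flash driver provides:

#include <stdint.h>

#define SECTOR_SIZE 4096u
#define ENTRY_SIZE  256u
#define LOG_BASE    0x00000000u        /* start of the dedicated region */
#define LOG_SIZE    (128u * 1024u)     /* 128 kB reserved for the log   */
#define NUM_ENTRIES (LOG_SIZE / ENTRY_SIZE)

extern void flash_read(uint32_t addr, uint8_t *buf, uint32_t len);
extern void flash_write(uint32_t addr, const uint8_t *buf, uint32_t len);
extern void flash_erase_sector(uint32_t addr);

static uint32_t write_ptr;             /* entry index, not byte address */

/* Erased flash reads back as all 0xFF; use that to recognize empty entries. */
static int entry_is_empty(uint32_t idx) {
    uint8_t buf[ENTRY_SIZE];
    flash_read(LOG_BASE + idx * ENTRY_SIZE, buf, ENTRY_SIZE);
    for (uint32_t i = 0; i < ENTRY_SIZE; i++)
        if (buf[i] != 0xFF)
            return 0;
    return 1;
}

/* At boot, walk the memory and find the first empty entry. */
void log_init(void) {
    for (write_ptr = 0; write_ptr < NUM_ENTRIES; write_ptr++)
        if (entry_is_empty(write_ptr))
            return;
    write_ptr = 0;                     /* no empty entry found: start over */
}

void log_append(const uint8_t entry[ENTRY_SIZE]) {
    flash_write(LOG_BASE + write_ptr * ENTRY_SIZE, entry, ENTRY_SIZE);
    write_ptr = (write_ptr + 1) % NUM_ENTRIES;
    /* On a sector boundary, erase ahead so at least one empty entry
       always exists for log_init() to find. */
    if ((write_ptr * ENTRY_SIZE) % SECTOR_SIZE == 0)
        flash_erase_sector(LOG_BASE + write_ptr * ENTRY_SIZE);
}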
If you dedicate 128 kB to the log entries and the flash has an endurance of 20,000 write/erase cycles, this gives you a total of 10,240,000 entries written before failure, or about 1,168 years of continuous logging at one entry per hour...

How to detect that the XMIT FIFO is full on a UART 16550 or higher

I have already read a lot of specs and code about the UART, but I cannot find any way to determine through the software interface whether the transmit FIFO is full. There is an interrupt when the FIFO is empty; then I can write at least N characters, where N is the FIFO size. But by the time I have written these N characters, a number of them have already been sent, so I can in fact write more than N characters; there is just no FIFO-full interrupt. The specs say that when the FIFO is full, the TXREADY pin on the chip is inverted. Is there a way to find this out in software? The Line Status Register bit only tells me that the FIFO is not empty, which does not mean it is full...
Can anyone help? I want to write characters until the FIFO is full...
It looks to me, too, like they neglected this, but most people get by with the chip as it is. The usual way to use it is to take the interrupt, fill the FIFO (normally very fast compared to the serial data rate) and return.
There is a situation where what you are asking for could be nice: transmitting in polling mode. Say you want to send 10 bytes and your polling shows the FIFO is not empty; then you have no way to know whether you can send them all or not. Either you wait there until it is empty, which sort of defeats the purpose of the FIFO, or you continue polling other stuff until you get back to checking for FIFO-empty, and maybe that slows your overall transmission rate. I guess it is not a very common way to operate, so nobody worries about it.
The 16550D datasheet says the following:
The transmitter holding register interrupt (02) occurs when the XMIT FIFO is empty; it is cleared as soon as the transmitter holding register is written to (1 to 16 characters may be written to the XMIT FIFO while servicing this interrupt) or the IIR is read.
This means that when the Line Status Register (port base + 5) indicates the Transmitter Empty condition (bit 5), the transmit FIFO is completely empty and you may write up to 16 bytes to the transmitter holding register (port base + 0). It is important not to write more than 16 bytes between occurrences of the transmitter-empty bit being set.
If you don't need to write 16 bytes at the point when you receive the IRQ (or see the transmitter-empty bit set, if polling), you can either keep track of how many bytes you have written since the last transmitter-empty state, or just defer writing further bytes until the next transmitter-empty state.
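As a sketch of the polled variant (the 0x3F8 base address and the inb()/outb() helpers are platform assumptions, not part of the 16550 itself):

#include <stdint.h>

#define UART_BASE 0x3F8
#define THR       (UART_BASE + 0)   /* transmitter holding register */
#define LSR       (UART_BASE + 5)   /* line status register */
#define LSR_THRE  0x20              /* bit 5: transmitter holding register empty */

extern uint8_t inb(uint16_t port);             /* hypothetical port I/O */
extern void    outb(uint16_t port, uint8_t val);

void uart_send(const uint8_t *data, unsigned len) {
    while (len) {
        while (!(inb(LSR) & LSR_THRE))
            ;                        /* FIFO not known to be empty yet */
        /* THRE set: the FIFO is completely empty, so 16 bytes are safe. */
        unsigned burst = len < 16 ? len : 16;
        for (unsigned i = 0; i < burst; i++)
            outb(THR, *data++);
        len -= burst;
    }
}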

Measuring end-to-end latency with the Paho sample pub/sub app

My aim is to measure MQTT device-to-device message latency (not throughput) and I'm looking for feedback on my code hacks. The setup is simple: just one device serving as two end-points (an old Linux PC with two terminal sessions, one running the subscriber and the other running the publisher sample app) and the default broker at tcp://m2m.eclipse.org:1883. I inserted time-capturing code fragments into the C-language publish/subscribe sample apps in the src/samples folder.
Below are the changes. Please provide feedback.
Changes to the subscribe sample app (MQTTAsync_subscribe.c)
Inserted the lines below at the top of the msgarrvd (message arrived) function (this needs #include <sys/time.h> at the top of the file):
//print arrival time
struct timeval tv;
gettimeofday (&tv, NULL);
printf("Message arrived: %ld.%06ld\n", tv.tv_sec, tv.tv_usec);
Changes to the publish sample app (MQTTAsync_publish.c)
Inserted the lines below at the top of the onSend (callback) function (again with #include <sys/time.h> at the top of the file):
struct timeval tv;
gettimeofday (&tv, NULL);
printf("Message with token value %d delivery confirmed at %ld.%06ld\n",
response->token, tv.tv_sec, tv.tv_usec);
With these changes (after subtracting the time the message arrived at the subscriber from the time delivery was confirmed at the publisher), I get a time anywhere between 0.5 and 1 millisecond.
Questions
Does this make sense as a rough benchmark on latency?
Is this the round-trip time?
Is the round-trip time in the right ballpark? Should it be less? More?
Is it the one-way time?
Should I design the latency benchmark in a different way? I need a rough measurement (I'm comparing with XMPP).
I'm using the default QoS value (1). Should I change it?
The publisher takes a finite amount of time to connect (and disconnect). Should these be added?
Does this make sense as a rough benchmark on latency?
-- Yes, it makes sense. But a better approach is to automate the subtraction by carrying the send timestamp inside the published message, with both endpoints synchronized to NTP.
Is this the round-trip time? Is it the one-way time?
-- The message got published, you received the ACK at the publisher, and the same message got transferred to the subscribed client.
Is the round-trip time in the right ball-park? Should be less? more?
-- It should be less.
Should I design the latency benchmark in a different way? I need a rough measurement (I'm comparing with XMPP).
I'm using the default QoS value (1). Should I change it?
-- Try with QoS 0 ( fire and forget )
The publisher takes a finite amount of time to connect (and disconnect). Should these be added?
-- Yes, it needs to be added, but this time should be very small.
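A sketch of the automated approach suggested above: the publisher embeds its gettimeofday() timestamp in the payload, and the subscriber subtracts it on arrival (both hosts NTP-synchronized). The text payload format is an assumption, not part of the Paho samples:

#include <stdio.h>
#include <sys/time.h>

/* Publisher side: write the send time into the payload before
   calling MQTTAsync_sendMessage(). */
void stamp_payload(char *payload, size_t cap) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    snprintf(payload, cap, "%ld.%06ld", (long)tv.tv_sec, (long)tv.tv_usec);
}

/* Subscriber side: call from msgarrvd() to print one-way latency. */
void print_latency(const char *payload) {
    struct timeval now;
    long sec, usec;
    gettimeofday(&now, NULL);
    if (sscanf(payload, "%ld.%ld", &sec, &usec) != 2)
        return;
    long one_way_us = (now.tv_sec - sec) * 1000000L + (now.tv_usec - usec);
    printf("one-way latency: %ld us\n", one_way_us);
}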
