Double system call to write() causes massive network slowdown - network-programming

In a partially distributed network app I'm working on in C++ on Linux, I have a message-passing abstraction which will send a buffer over the network. The buffer is sent in two steps: first a 4-byte integer containing the size is sent, and then the buffer is sent afterwards. The receiving end then receives in 2 steps as well - one call to read() to get the size, and then a second call to read in the payload. So, this involves 2 system calls to read() and 2 system calls to write().
On localhost, I set up two test processes. Both processes send and receive messages to each other continuously in a loop. The size of each message was only about 10 bytes. For some reason, the test performed incredibly slowly - about 10 messages sent/received per second. And this was on localhost, not even over a network.
If I change the code so that there is only 1 system call to write, i.e. the sending process packs the size at the head of the buffer and then only makes 1 call to write, the whole thing speeds up dramatically - about 10000 messages sent/received per second. This is an incredible difference in speed for only one less system call to write.
Is there some explanation for this?

You might be seeing the effects of the Nagle algorithm, though I'm not sure it is turned on for loopback interfaces.
If you can combine your two writes into a single one, you should always do that. No sense taking the overhead of multiple system calls if you can avoid it.

Okay, well I'm using TCP/IP (SOCK_STREAM) sockets. The example code is pretty straightforward. Here is a basic snippet that reproduces the problem. This doesn't include all the boilerplate setup code, error-checking, or ntohl code:
On the sending end:
// Send size
uint32_t size = strlen(buffer);
int res = write(sock, &size, sizeof(size));
// Send payload
res = write(sock, buffer, size);
And on the receiving end:
// Receive size
uint32_t size;
int res = read(sock, &size, sizeof(size));
// Receive payload
char* buffer = (char*) malloc(size);
read(sock, buffer, size);
Essentially, if I change the sending code by packing the size into the send buffer, and only making one call to write(), performance improves by almost 1000x.
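For illustration, the single-write variant described here might look roughly like the sketch below. The helper names pack_message and send_packed are made up for this example, and the 4-byte length is assumed to travel in network byte order; neither detail comes from the original code.

```c
#include <arpa/inet.h>  /* htonl */
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build [4-byte big-endian length][payload] in one
 * heap buffer. Returns NULL on allocation failure; the caller frees it. */
unsigned char *pack_message(const void *payload, uint32_t len, size_t *out_len)
{
    unsigned char *msg = malloc(sizeof(uint32_t) + (size_t)len);
    if (msg == NULL)
        return NULL;
    uint32_t be_len = htonl(len);
    memcpy(msg, &be_len, sizeof be_len);
    if (len > 0)
        memcpy(msg + sizeof be_len, payload, len);
    *out_len = sizeof be_len + (size_t)len;
    return msg;
}

/* Hypothetical helper: send the framed message, retrying on short writes.
 * Returns 0 on success, -1 on error. */
int send_packed(int sock, const void *payload, uint32_t len)
{
    size_t total = 0, off = 0;
    unsigned char *msg = pack_message(payload, len, &total);
    if (msg == NULL)
        return -1;
    while (off < total) {
        ssize_t n = write(sock, msg + off, total - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: try again */
            free(msg);
            return -1;
        }
        off += (size_t)n;
    }
    free(msg);
    return 0;
}
```

For a 10-byte payload the loop normally finishes in one write() call, which is the case measured above.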

This is essentially the same question: C# socket abnormal latency.
In short, you'll want to use the TCP_NODELAY socket option. You can set it with setsockopt.
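A minimal sketch of that setsockopt call in C, assuming a POSIX/Linux TCP socket; the wrapper name disable_nagle is only illustrative:

```c
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY */
#include <sys/socket.h>

/* Hypothetical wrapper: turn off Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int disable_nagle(int sock)
{
    int on = 1;
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
```

The option is per socket, so it would be set on each connected socket that sends small messages.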

You don't give enough information to say for sure. You don't even say which protocol you're using.
Assuming TCP/IP, the socket could be configured to send a packet on every write, instead of buffering output in the kernel until the buffer is full or the socket is explicitly flushed. This means that TCP sends the two pieces of data in separate segments that have to be put back together at the other end.
You might also be seeing the effect of the TCP slow-start algorithm. The TCP congestion window starts small and is ramped up as more data is transmitted and acknowledged, until it approaches what the path and the receiver can sustain. This is useful in long-lived connections but a big performance hit in short-lived ones. Note that slow-start is part of congestion control and is not what TCP_NODELAY switches off; that option only disables Nagle's algorithm.
Have a look at the TCP_NODELAY and TCP_NOPUSH socket options.
An optimization you can use to avoid multiple system calls and extra segments is scatter/gather I/O. Using the writev or sendmsg system call you can send the 4-byte size and variable-sized buffer in a single syscall, so both pieces of data can be sent in the same segment by TCP.
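As a rough sketch of the gather approach, assuming POSIX writev and a length header in network byte order (the function name send_gathered is invented for this example):

```c
#include <arpa/inet.h>  /* htonl */
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>    /* writev, struct iovec */

/* Hypothetical helper: hand the header and the payload to the kernel in one
 * writev() call. Returns the number of bytes written or -1 on error.
 * A short count is possible; a production version would loop. */
ssize_t send_gathered(int sock, const void *payload, uint32_t len)
{
    uint32_t be_len = htonl(len);
    struct iovec iov[2];
    iov[0].iov_base = &be_len;
    iov[0].iov_len = sizeof be_len;
    iov[1].iov_base = (void *)payload;  /* writev only reads from it */
    iov[1].iov_len = len;
    return writev(sock, iov, 2);
}
```

Unlike packing into a temporary buffer, this avoids copying the payload in user space.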

The problem is that with the first call to send, the system has no idea the second call is coming, so it sends the data immediately. With the second call to send, the system has no idea a third call isn't coming, so it delays the data in hopes that it can combine the data with a subsequent call.
The correct fix is to use a 'gather' operation such as writev if your operating system supports it. Otherwise, allocate a buffer, copy the two chunks in, and make a single call to write. (Some operating systems have other solutions, for example Linux has a 'TCP cork' operation.)
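For the Linux-only cork alternative mentioned in parentheses, a sketch could look like the following. It assumes the Linux TCP_CORK socket option; set_cork and send_corked are illustrative names, and short writes are simply treated as failure here.

```c
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_CORK (Linux-specific) */
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical wrapper: set (on = 1) or clear (on = 0) TCP_CORK. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Cork the socket, issue both writes, then uncork so the kernel can send
 * the queued bytes together. Returns 0 on success, -1 on any failure. */
int send_corked(int sock, const void *hdr, size_t hdr_len,
                const void *body, size_t body_len)
{
    if (set_cork(sock, 1) < 0)
        return -1;
    int ok = write(sock, hdr, hdr_len) == (ssize_t)hdr_len &&
             write(sock, body, body_len) == (ssize_t)body_len;
    if (set_cork(sock, 0) < 0)
        return -1;
    return ok ? 0 : -1;
}
```

This costs two extra setsockopt calls per message, so writev is usually the cheaper choice when it is available.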
It's not as important, but you should optimize your receiving code too. Call 'read' asking for as many bytes as possible and then parse them yourself. You're trying to teach the operating system your protocol, and that's not a good idea.
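One way to do the receive-side parsing is sketched below: read whatever is available into a buffer, then peel complete frames off the front. The function name parse_frame and the big-endian length header are assumptions made for this example.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parser: look for one complete
 * [4-byte big-endian length][payload] frame at the start of buf.
 * On success, stores the payload pointer and length and returns the number
 * of bytes consumed. Returns 0 if more data is needed. */
size_t parse_frame(const unsigned char *buf, size_t avail,
                   const unsigned char **payload, uint32_t *len)
{
    uint32_t be_len;
    if (avail < sizeof be_len)
        return 0;                      /* header not complete yet */
    memcpy(&be_len, buf, sizeof be_len);
    uint32_t n = ntohl(be_len);
    if (avail - sizeof be_len < n)
        return 0;                      /* payload not complete yet */
    *payload = buf + sizeof be_len;
    *len = n;
    return sizeof be_len + n;
}
```

The caller would append each read() result to the buffer, call parse_frame in a loop until it returns 0, and keep any leftover bytes for the next read.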

Related

Read Timeout TIdTCPClient

Good day. I use the TIdTCPClient component to send requests to the server and read the response. I know the size of the response for certain requests, but not for others.
When I know the size of the response, then my data reading code looks like this:
IdTCPClient1->Socket->Write(requestBuffer);
IdTCPClient1->Socket->ReadBytes(answerBuffer, expectSize);
When the size of the response is not known to me, then I use this code:
IdTCPClient1->Socket->Write(requestBuffer);
IdTCPClient1->Socket->ReadBytes(answerBuffer, -1);
In both cases, I ran into problems.
In the first case, if the server does not return all the data (less than expectSize), then IdTCPClient1 will wait for ReadTimeout to finish, but there will be no data at all in the answerBuffer (even if the server sent something). Is this the logic behind TIdTCPClient? Is that right?
In the second case, ReadTimeout does not work at all. That is, the ReadBytes function ends immediately and nothing is written to the answerBuffer, or several bytes from the server are written. However, I expected that since this function in this case does not know the number of bytes to read, it must wait for ReadTimeout and read the bytes that arrived during this time. For the experiment, I inserted Sleep(500) between writing and reading, and then I read all the data that arrived.
May I ask you to answer why this is happening?
Good day. I use the TIdTCPClient component to send requests to the server and read the response. I know the size of the response for certain requests, but not for others.
Why do you not know the size of all of the responses? What does your protocol actually look like? TCP is a byte stream, each message MUST be framed in such a way that a receiver can know where each message begins and ends in order to read the messages correctly and preserve the integrity of the stream. As such, messages MUST either include their size in their payload, or be uniquely delimited between messages. So, which is the case in your situation? It doesn't sound like you are handling either possibility.
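Outside of Indy, the same framing rule can be sketched in plain POSIX C: once a message's size is known, keep reading until exactly that many bytes have arrived. This is not Indy code, and read_exact is a hypothetical helper.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes, looping over short reads.
 * Returns 0 on success, -1 on error or if the peer closes early. */
int read_exact(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: try again */
            return -1;
        }
        if (n == 0)
            return -1;      /* stream ended before the message did */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

With a length-prefixed protocol, a receiver would call this once for the fixed-size header and once more for the payload size the header announces.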
When the size of the response is not known to me, then I use this code:
IdTCPClient1->Socket->Write(requestBuffer);
IdTCPClient1->Socket->ReadBytes(answerBuffer, -1);
When you set AByteCount to -1, that tells ReadBytes() to return whatever bytes are currently available in the IOHandler's InputBuffer. If the InputBuffer is empty, ReadBytes() waits, up to the ReadTimeout interval, for at least 1 byte to arrive, and then it returns whatever bytes were actually received into the InputBuffer, up to the maximum specified by the IOHandler's RecvBufferSize. So it may still take multiple reads to read an entire message in full.
In general, you should NEVER set AByteCount to -1 when dealing with an actual protocol. -1 is good to use only when proxying/streaming arbitrary data, where you don't care what the bytes actually are. Any other use requires knowledge of the protocol's details of how messages are framed.
In the first case, if the server does not return all the data (less than expectSize), then IdTCPClient1 will wait for ReadTimeout to finish, but there will be no data at all in the answerBuffer (even if the server sent something). Is this the logic behind TIdTCPClient? Is that right?
Yes. When AByteCount is > 0, ReadBytes() waits for the specified number of bytes to be available in the InputBuffer before then extracting that many bytes into your output TIdBytes. Your answerBuffer will not be modified unless all of the requested bytes are available. If the ReadTimeout elapses, an EIdReadTimeout exception is raised, and your answerBuffer is left untouched.
If that is not the behavior you want, then consider using ReadStream() instead of ReadBytes(), using a TIdMemoryBufferStream or TBytesStream to read into.
In the second case, ReadTimeout does not work at all. That is, the ReadBytes function ends immediately and nothing is written to the answerBuffer.
I have never heard of ReadBytes() not waiting for the ReadTimeout. What you describe should only happen if there are no bytes available in the InputBuffer and the ReadTimeout is set to some very small value, like 0 msecs.
or several bytes from the server are written.
That is a perfectly reasonable outcome given you are asking ReadBytes() to read an arbitrary number of bytes between 1..RecvBufferSize, inclusive, or read no bytes if the timeout elapses.
However, I expected that since this function in this case does not know the number of bytes to read, it must wait for ReadTimeout and read the bytes that arrived during this time.
That is how it should be working, yes. And how it has always worked. So I suggest you debug into ReadBytes() at runtime and find out why it is not working the way you are expecting. Also, make sure you are using an up-to-date version of Indy to begin with (or at least a version from the last few years).
Why do you not know the size of all of the responses?
Because, in fact, I'm doing a survey of an electronic device. This device has its own network IP address and port. So, the device can respond to the same request in different ways, depending on its status. Strictly speaking, there can be two answers to some queries and they have different lengths. It is in these cases, when reading, I specify AByteCount = -1 to read any device response.
I have never heard of ReadBytes() not waiting for the ReadTimeout.
You're right! I was wrong. When specifying AByteCount = -1, I get one byte. As you said, if at least one byte arrives, it returns and ReadBytes() ends.
Also, make sure you are using an up-to-date version of Indy to begin with (or at least a version from the last few years).
I am working with C++ Builder 10.3 Community Edition, Indy version 10.6.2.5366.

How do I receive arbitrary length data using a UdpSocket?

I am writing an application which sends and receives packages using UDP. However, the documentation of recv_from states:
If a message is too long to fit in the supplied buffer, excess bytes may be discarded.
Is there any way to receive all bytes and write them into a vector? Do I really have to allocate an array with the maximum packet length (which, as far as I know, is 65,507 bytes for IPv4) in order to be sure to receive all data? That seems a bit much for me.
Check out the next method in the docs, UdpSocket::peek_from (emphasis mine):
Receives a single datagram message on the socket, without removing it from the queue.
You can use this method to read a known fixed amount of data, such as a header which contains the length of the entire packet. You can use crates like byteorder to decode the appropriate part of the header, use that to allocate exactly the right amount of space, then call recv_from.
This does require that the protocol you are implementing always provides that total size information at a known location.
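The peek-then-receive idea is not specific to Rust. Below is a rough C equivalent using recv with MSG_PEEK. It assumes a made-up wire format in which every datagram starts with a 4-byte big-endian total length that counts the header itself; recv_sized_datagram is an illustrative name.

```c
#include <arpa/inet.h>  /* ntohl */
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: peek at the length header, allocate exactly that
 * much, then receive the whole datagram. Returns a heap buffer (caller
 * frees) and stores its length, or returns NULL on error. */
unsigned char *recv_sized_datagram(int sock, size_t *out_len)
{
    uint32_t be_total;
    /* MSG_PEEK copies the first bytes but leaves the datagram queued. */
    ssize_t n = recv(sock, &be_total, sizeof be_total, MSG_PEEK);
    if (n != (ssize_t)sizeof be_total)
        return NULL;  /* error or runt datagram (a runt stays queued) */
    size_t total = ntohl(be_total);
    if (total < sizeof be_total)
        return NULL;  /* header claims less than its own size */
    unsigned char *buf = malloc(total);
    if (buf == NULL)
        return NULL;
    n = recv(sock, buf, total, 0);  /* second syscall removes the datagram */
    if (n < 0) {
        free(buf);
        return NULL;
    }
    *out_len = (size_t)n;
    return buf;
}
```

Note the two recv calls per datagram, which is exactly the cost discussed next.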
Now, is this a good idea?
As ArtemGr states:
Because extra system calls are much more expensive than getting some space from the stack.
And from the linked question:
Obviously at some point you will start wondering if doubling the number of system calls to save memory is worth it. I think it isn't.
With the recent Spectre / Meltdown events, now's a pretty good time to be reminded to avoid extra syscalls.
You could, as suggested, just allocate a "big enough" array ahead of time. You'll need to track how many bytes you've actually read vs allocated though. I recommend something like arrayvec to make it easier.
You could instead implement a pool of pre-allocated buffers on the heap. When you read from the socket, you use a buffer or create a new one. When you are done with the buffer, you put it back in the pool for reuse. That way, you incur the memory allocation once and are only passing around small Vecs on the stack.
See also:
How can I create a stack-allocated vector-like container?
How large should my recv buffer be when calling recv in the socket library
How to read UDP packet with variable length in C

libpcap: what is the efficiency of pcap_dispatch or pcap_next

I use libpcap to capture a lot of packets, and then process/modify these packets and send them to another host.
First, I create a libpcap handle named handle, set it to non-blocking mode, and use pcap_get_selectable_fd(handle) to get a corresponding file descriptor pcap_fd.
Then I add an event for this pcap_fd to a libevent loop (it is like select() or epoll()).
In order to avoid frequently polling this file descriptor, each time a packet arrival event fires, I use pcap_dispatch to collect a bufferful of packets and put them into a queue packet_queue, and then call process_packet to process/modify/send each packet in the queue packet_queue.
pcap_dispatch(handle, -1, collect_pkt, (u_char *)packet_queue);
process_packet(packet_queue);
I use tcpdump to capture the packets that are sent by process_packet(packet_queue), and notice:
at the very beginning, the interval between sent packets is small
after several packets are sent, the interval becomes around 0.055 second
after 20 packets are sent, the interval becomes 0.031 second and keeps on being 0.031 second
I carefully checked my source code and found no suspicious blocks or logic that would lead to such large intervals. So I wonder whether it is due to a problem in the function pcap_dispatch.
Are there any efficiency problems with pcap_dispatch or pcap_next, or even with the libpcap file descriptor?
thanks!
On many platforms libpcap uses platform-specific implementations for faster packet capture, so YMMV. Generally they involve a shared buffer between the kernel and the application.
At the very beginning you have a time window between the moment packets start piling up on the RX buffer and the moment you start processing. The accumulation of these packets may cause the higher frequency here. This part is true regardless of implementation.
I haven't found a satisfying explanation for this. Maybe you got behind and missed a few packets, so the time between resent packets becomes higher.
This is what you'd expect in normal operation, I think.
pcap_dispatch is pretty much as good as it gets, at least in libpcap. pcap_next, on the other hand, incurs two penalties (at least on Linux, but I think it does on other mainstream platforms too): a syscall per packet (libpcap calls poll for error checking, even in non-blocking mode) and a copy (libpcap releases the "slot" in the shared buffer ASAP, so it can't just return that pointer). An implementation detail is that, on Linux, pcap_next just calls pcap_dispatch for one packet and with a copy callback.

Why is it not safe to use Socket.ReceiveLength?

Well, even Embarcadero states that it is not guaranteed to return accurate result of the bytes ready to read in the socket buffer, but if you look at it, when you place -1 at Socket.ReceiveBuf (this is what ReceiveLength wraps) it calls ioctlsocket with FIONREAD to determine the amount of data pending in the network's input buffer that can be read from socket s.
so, how is it not safe or bad ?
e.g: ioctlsocket(Socket.SocketHandle, FIONREAD, Longint(i));
The documentation you mention specifically says (emphasis mine)
Note: ReceiveLength is not guaranteed to be accurate for streaming socket connections.
This means that the length is not known ahead of time because it's being supplied by a stream of data. Obviously, if you don't know how big the data is that's being sent ahead of time, you can't properly set the length the client should expect.
Consider it like generic code to copy a file. If you don't know ahead of time how big the file is you'll be copying, you can't predict how many bytes you'll be copying. In the case of the socket, the stream size that's supplying the socket isn't known in advance (for instance, for data being generated real-time and sent), so there's no way to inform the client socket how much to expect.

Does a transmitted-bytes event exist in the Linux kernel?

I need to write a rate limiter, that will perform some stuff each time X bytes were transmitted.
The straightforward way is to check the length of each transmitted packet, but I think it will be too slow for me.
Is there a way to use some kind of network event, that will be triggered by transmitted packets/bytes?
I think you may look at netfilter.
Using its (kernel level) API, you can have your custom code triggered by network events, modify received messages before passing them to the application, and so on.
http://www.netfilter.org/
It's protocol dependent, actually. But for TCP, you can setsockopt the SO_RCVLOWAT option to define the minimum number of bytes (watermark) to permit the read operation.
If you need to enforce the maximum size too, adjust the receive buffer size using SO_RCVBUF.
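A minimal sketch of setting that watermark, assuming a Linux socket; the wrapper name set_receive_low_water is illustrative:

```c
#include <sys/socket.h>  /* setsockopt, SOL_SOCKET, SO_RCVLOWAT */

/* Hypothetical wrapper: ask the kernel not to wake a blocking read until
 * at least min_bytes are queued. Returns 0 on success, -1 on error. */
int set_receive_low_water(int sock, int min_bytes)
{
    return setsockopt(sock, SOL_SOCKET, SO_RCVLOWAT,
                      &min_bytes, sizeof min_bytes);
}
```

SO_RCVBUF is set the same way, with SO_RCVBUF in place of SO_RCVLOWAT. Both options act on the receive side of a socket.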
