I'm currently in the process of performance profiling. We have a basic client/server application. Would the TCP transfer speed be different if I ran client/server on the same machine (localhost) vs across two computers on a LAN?
TCP transfer speed will be! because if you ran it on same computer it will forward packets locally without even touching LAN and network adapter.
But overall speed of client+server may be better on different machines, especially if you do not communicate with server too often.
When using localhost, local resources are more likely to be the performance bottleneck because of memory, disk, cpu, etc. When using two computers, its more likely the network will be the bottleneck because of latency, bandwidth, throughput, packet loss, etc.
It depends on what your application does and how it uses the network, client, and server.
Yes definitely, the latency of sending it across the network would slow the program down. The throughput wouldn't but if you're waiting for replies before sending data then this builds up because of extra latency.
I just hit this issue on a project at work. Using UDP with localhost is at least an order of magnitude faster than over a network connection (maybe two orders of magnitude), and I believe with localhost there is no MTU ceiling of 1500 as normally exists for network ports.
One unconfirmed suspicion is the built in network ports on PCs are not all the same quality, so even if they claim to be gigabit, you may not be able to really go that fast. But it may also be making lots of Windows system calls (one OS call per packet) may be a significant overhead. With TCP I can hand the OS a large chunk of data to write in a single call. With UDP I have to hand it a packet at a time, limited by the MTU size, resulting in a much larger number of OS calls. But unconfirmed as yet.
Have not tried Linux yet.
It really depends on what your application does....
As an example:
If it transfers 10GB files from client to server, then yes, it will make a difference.
I don't know if it measurable (that also depends on your LAN's speed) but from a logical point of view, of course there is a difference. Localhost will always be the fastest as the data has not be sent through another medium (like air or copper wire).
But depending on what your application does, this might or might not matter.
The transfer times would almost certainly be faster if the client and server were on the same machine. That may not actually matter to the performance of your program as a whole depending on the other resources consumed by the client and server.
Related
Problem Statement:
I have a very high bandwidth data link that is UDP based. The source of this data is not configurable, and sends on UDP a stream of datagrams. We have code that uses the standard methods for receiving data on the UDP socket that works adequately. I wanted to know if
Does there exist a command interface to extract multiple UDP datagrams at a time? to improve efficiency?
If one doesn't exist, does it make sense to create a kernel module to provide the capability?
I am a novice, and i wanted to understand what thought process has to happen when writing your own kernel module seems appropriate. I know that such a surgical procedure isn't meant to done lightly, but there must be a set of criteria where that action is prudent. Maybe not in my case, but in general.
HW / Kernel Module Perspective
A typical network adapter these days would be capable of distributing received packets across multiple hardware Rx queues thus letting the host run multiple software Rx queues bound to different CPU cores reading out packets in parallel. From a single HW/SW queue perspective, the host may poll it for new packets (see Linux NAPI), with each poll ideally yielding a batch of packets, and, alternatively, the host may still use interrupt-driven approach for Rx signalling with interrupt coalescing turned on for improved efficiency.
Existing NIC drivers in Linux kernel strive to stick with the most performant techniques, and the kernel itself should be able to leverage all of that properly.
Userland / Application Perspective
There's PACKET_MMAP interface provided by Linux kernel for improved Rx/Tx efficiency on the application side. Long story short, an application can set up a memory buffer shared between the kernel- and userspace and read out incoming packets from it, ideally in batches, or blocks, thus avoiding costly kernel-to-userspace copies and context switches so customary when using regular methods.
For added efficiency, the application may have multiple sockets bound to the NIC in separate threads / processes and demand that packet reception be load balanced across these sockets (see AF_PACKET fanout mode description).
DPDK Perspective
Kernel bypass framework that allows an application to seize full control of a network adapter by means of a vendor-specific poll-mode driver, or PMD, effectively running in userspace as part of the application and by its very nature not needing any kernel-to-userspace copies, context switches and, most likely, locking. Multi-queue receive operation, load balancing (round robin, RSS, you name it) and more cutting edge offloads are likely to be available, too (it's vendor specific).
Summary
The short of it, given the fact that multiple network acceleration techniques already exist, one need never write their own kernel module to solve the problem in question. By the looks of it, your application, which, as you say, uses standard methods, is not aware of PACKET_MMAP technique. So I'd be tempted to suggest looking at this one closely. DPDK approach might require that the application be effectively re-implemented from scratch, so I would first go for PACKET_MMAP approach as a low-hanging fruit.
I would like to understand networking services with a large user base a bit better so that I know how to approach a project I am busy with.
The following statements that I make may be incorrect but they still lead to the question that I want to ask...
Please consider Skype and TeamViewer clients. It seems that both keep persistent network connections open to their respective servers. They use these persistent connections to initiate additional connections. Some of these connections are created by means of Hole Punching if the clients are behind NATs. They are then used for direct Peer-to-Peer communications.
Now according to http://expandedramblings.com/index.php/skype-statistics/ there are 300 million users using Skype and 4.9 million daily active users. I would assume that most of that 4.9 million users will most probably have their client apps running most of the day. That is a lot of connections to the Skype servers that are open at any given time.
So to my question; Is this feasible or at least acceptable? I mean, wouldn't it be better to not have a network connection open while idle and aspecially when there are so many connections open to the servers at once? The only reason I can think is that it would be the only way to properly do Hole Punching. Techically, how is this achieved on the server side?
Is this feasible or at least acceptable?
Feasible it certainly is, you mention already two popular apps that do it, so it is very doable in practice.
As for acceptable, to start no internet authority (e.g. IETF) has ever said it is unacceptable to have long-lived connections even with low traffic.
Furthermore, the only components for which this matters are network elements that keep connection/flow state. These are for sure the endpoints and so-called middleboxes like NAT and firewalls. For the client this is only one connection, the server is usually fine tuned by the application developers (who made this choice) themselves, so for these it is acceptable. For middleboxes it's simple: they have no choice, they're designed to just work with all kind of flows, including long-lived persistent connections.
I mean, wouldn't it be better to not have a network connection open while idle and aspecially when there are so many connections open to the servers at once?
Not at all. First of all, that could be 'much' slower as you'd need to set up a full connection before each control-plane call. This is especially noticeable if your RTT is big or if the servers do some complicated connection proxying/redirection for load-balancing/localization purposes.
Next to that this would historically make incoming calls difficult for a huge amount of users. Many ISP's block/blocked unknown incoming connections from the internet by means of a firewall. Similar, if you are behind a NAT device that does not support UPnP or PCP you can't open a port to listen on for your public IP address. So you need it even aside from hole-punching.
The only reason I can think is that it would be the only way to
properly do Hole Punching. Techically, how is this achieved on the
server side?
Technically you can't do proper hole-punching as soon as the NAT devices maintain a full <src-ip,src-port,dest-ip,dest-port,protocol> (classical 5-tuple) flow match. Then the best you can do with 'hole punching' is set up a proxy between peers.
What hole-punching relies on is that the NAT flow lookup is only looking at <src-ip,src-port,protocol> upstream and <dest-ip,dest-port,protocol> downstream to do the translation. In that case both clients just set up a connection to the server, their ip and port gets translated and the server passes this to the other client. The other client can now start sending packets to that translated <ip,port> combination which should work because NAT ignores the server's ip/port. But even if the particular NAT would work like this, some security device (e.g. stateful firewall) might detect session hi-jacking and drop this anyway.
Nowadays you rather use UPnP to open up a port to listen on your public IP which is much easier if supported.
We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) worth of Xeon E5645 # 2.4GHz with 48GB RAM, connected by Gigabit LAN with ~150μs latency, running RHEL 5.6, RabbitMQ 3.1, Erlang R16B with HiPE off. We've tried with HiPE on but it made no noticeable performance impact, and was very crashy.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve throughput overall, just gives that particular queue a bigger slice of this apparent "pool" of resource.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistant way. We notice an ADSL-like asymmetry in the rates too; if we manage to publish a high rate of messages the deliver rate drops to high double digits. Testing with an un-mirrored queue has much higher throughput, as expected. Queues and Exchanges are durable, messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine, beam takes a core and a half for 1 process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20GB of RAM in userland with system cache filling the rest. IO rates are fine. Network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is the default 16, could someone explain what this does in a bit more detail please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using and so on. As discussed on irc (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain java load test tool that ships with the RabbitMQ java client. You can configure multiple producers/consumers, message sizes and so on. I can certainly get 5Khz out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.
I am designing an iOS app for a customer who wants to allow real-time (with minimum lag, max 50ms) conversations between users (a sort of Teamspeak). The lag must be low because the audio can also be live music, played with instruments, so all the users need to synchronize. I need a server, which will request audio recordings to every client and send to others (and make them hear the same sound at the same time).
HTTP is easy to manage/implement and easy to scale, but very low-performing because an average HTTP request takes > 50ms... (with a mid-level hardware), so I was thinking of TCP/UDP connections kept open between clients and server.
But I have some questions:
If I develop the server in Python (using TwistedMatrix, for example), how are its performance ?
I can't develop the server in C++ because it is hard to manage (scalable) and to develop.
Anyone used Nodejs (which is easy to scale) to manage TCP/UDP connections?
If I use HTTP, will it be fast enough with Keep-Alive? Becuase usually the time required for an HTTP Request to be performed is > 50ms (because opening-closing connection is hard), and I want the total procedure to be less than that time.
The server will be running on a Linux machine.
And finally: which type of compression can you suggest me? I thought Ogg Vorbis would be nice, but if there's anything better (and can be used in iOS), I am open to changes.
Thank you,
Umar.
First off, you are not going to get sub 50 ms latency. Others have tried this. See for example http://ejamming.com/ a service that attempts to do what you are doing, but has a musically noticeable delay over the line and is therefore, in the ears of many, completely unusable. They use special routing techniques to get the latency as low as possible and last I heard their service doesn't work with some router configurations.
Secondly, what language you use on server probably doesn't make much difference, as the delay from client to server will be worse than any delay caused by your service, but if I understand your service correctly, you are going to need a lot of servers (or server threads) just relaying audio data between clients or doing some sort of minimal mixing. This is a small amount of work per connection, but a lot of connections, so you need something that can handle that. I would lean towards something like Java, Scala, or maybe Go. I could be wrong, but I don't think this is a good use-case for node, which, as I understand it, does not do multithreading well at this time. Also, don't poo-poo C++, scalable services have been built C++. You could also build the relay part of the service in C++ and the rest in whatever.
Third, when choosing a compression format, you'll have to choose one that can survive packet loss if you plan to use UDP, and I think UDP is the only way to go for this. I don't think vorbis is up to this task, but I could be wrong. Off the top of my head, I'm not sure of anything that works on the iPhone and is UDP friendly, but I'm sure there are lots of things. Speex is an example and is open-source. Not sure if the latency and quality meet your needs.
Finally, to be blunt, I think there are som other things you should research a bit more. eg. DNS is usually cached locally and not checked every http call (though it may depend on the system/library. At least most systems cache dns locally). Also, there is no such protocol as TCP/UDP. There is TCP/IP (sometimes just called TCP) and UDP/IP (sometimes just called UDP). You seem to refer to the two as if they are one. The difference is very important for what you are doing. For example, HTTP runs on top of TCP, not UDP, and UDP is considered "unreliable", but has less overhead, so it's good for streaming.
Edit: speex
What concerns the server, the request itself is not a bottleneck. I guess you have sufficient time to set up the connection, as it happens only in the beginning of the session. Therefore the protocol is not of much relevance.
But consider that HTTP is a stateless protocol and not suitable for audio streaming. There are a couple of real time streaming protocols you can choose from. All of them will work over TCP or UDP (e.g. use raw sockets), and there are plenty of implementations.
In your case, the bottleneck with latency is not the server but the network itself. The connection between an iOS device and a wireless access point (AP) eats up about 40ms if the AP is not misconfigured and connection is good. (ping your iPhone.) In total, you'd have a minimum of 80ms for the path iOS -> AP -> Server -> AP -> iOS. But it is difficult to keep that latency stable. (Typical latency of AirPlay on my local network is about 300ms.)
I think live music over iOS devices is not practicable today. Try skype between two iOS devices and look how close you can get to 50ms. I'd bet no one can do it significantly better, what concerns latency.
Update: New research result!
I have to revise my claims regarding the latency of wifi connections of the iDevice. Apparently when you first ping your device, latency will be bad. But if I ping again no later than 200ms after that, I see an average latency 2ms-3ms between AP and iDevice.
My interpretation is that if there is no communication between AP and iDevice for more than 200ms, the network adapter of the iDevice will go to a less responsive sleep mode, probably to save battery power.
So it seems, live music is within reach again... :-)
Update 2
The ping-interval required for keep alive of low latency apparently differs from device to device. The reported 200ms is for an 3rd gen. iPad. For my iPhone 4 it's more like 50ms.
While streaming audio you probably don't need to bother with this, as data is exchanged on a more frequent basis. In my own context, I have sparse communication between an iDevice and a server, but low latency is crucial. A keep alive therefore is the way to go.
Best, Peter
I am developing a custom network driver for a PHY media which doesn't support full duplex mode.
I want to use TCP/IP traffic with this network driver and on top of this half-duplex PHY media.
But TCP/IP traffic can be full duplex. I would like to implement some mechanism/algorithm in this driver so that this custom network driver will convert TCP/IP traffic to Half duplex in linux.
Please let me know if this can be achieved or how to do it.
So you are trying to write a driver which supports full duplex traffic on a card which actually does NOT support the feature...
Well..you must be aware that the networking subsystem is one of the largest subsystems in the kernel and one of the few which actually uses softirqs (because it is always looking at getting scaled appropriately in this day and age of multiprocessers) and still had to resort to some trickery (NAPI) ir order to manage the deluge of interrupt requests generated by the ever increasing rates of the present day media...why im saying all this is because I just would want to remind you of the real life complexities involved in writing a 'regular' network driver, let alone a 'pseudo full duplex' driver.
Now I believe what you pretty much want is to give an illusion of 'full duplexity' to the...TCP/IP stack ( is it ??) i.e your driver is just another full duplex driver and it (any one of this driver's clients, be it the MAC layer or something like ethtool) can go have a ball with it (in terms of dumping & retrieving packets) in the same manner as it does with (and expects results out of) a 'regular full duplex' driver...
So if this is really the case, I wonder what good giving such an illusion might be? Perhaps you are just experimenting? IN any case, TCP is by default full duplex anyways and by using a half duplex media the data rates anyway are a bit lower (although not exactly half) than those obtained by using a full duplex adapter. I don't think it even matters at the higher layers (in terms of functionality) whether the media is full or half duplex (except may be in the MAC layer?) correct me if Im wrong.
There were (and still are) quite a few half duplex media in use currently and as such there are many media which support both full duplex and half duplex traffic..I fail to see how it will affect the clients of the driver (besides lowering the over all data rate as the only tangible effect)...which means you can pretty much look at any netwrk driver in the kernel and see that it has ways to configure the adapter to use either full or half duplex (and the user space can, say ethtool as one of the ways to toggle this...)...
Anyways, you may want to have a look and perhaps take a few tips from MODBus driver (the bus being by default half duplex) here.
I'm not sure how you're relating MAC layer with TCP layer. Duplex mode is a Ethernet domain and it doesn't propagate to IP and not event to TCP, in Ethernet terms duplexity means you can send or receive a MAC frames exclusively at different times (half duplex) or at the same time (full duplex).
The upper layers of the network stack are completely (at least they should) unaware of this process. Consider the following example, you're sending a huge file over the network using FTP, assuming most normal network systems the stack would be FTP/TCP/IP/Ethernet. From FTP perspective you have a virtual session, from TCP you have a virtual pipe, from IP you just know how to reach the end system and from Ethernet perspective you just know how to reach the next node in the network.
TCP doesn't care your packets are chopped during the transmission nor if your packet is delayed within a certain threshold due an incoming packet arriving It only cares to receive a confirmation receipt that the packet made it to the final destination. I hope I can show my point.