How fast are state-of-the-art HFT trading systems today? - low-latency

All the time you hear about high frequency trading (HFT) and how damn fast the algorithms are. But I'm wondering - what is fast these days?
Update
I'm not thinking about the latency caused by the physical distance between an exchange and the server running a trading application, but the latency introduced by the program itself.
To be more specific: what is the time from an event arriving on the wire at an application to that application putting an order/price on the wire? I.e. the tick-to-trade time.
Are we talking sub-millisecond? Or sub-microsecond?
How do people achieve these latencies? Coding in assembly? FPGAs? Good-old C++ code?
Update
An interesting article was recently published on ACM, providing a lot of detail on today's HFT technology, which is an excellent read:
Barbarians at the Gateways: High-frequency Trading and Exchange Technology

I'm the CTO of a small company that makes and sells FPGA-based HFT systems. Building our systems on top of the Solarflare Application Onload Engine (AOE), we have been consistently delivering latency from an "interesting" market event on the wire (10Gb/s UDP market data feed from ICE or CME) to the first byte of the resultant order message hitting the wire in the 750 to 800 nanosecond range (yes, sub-microsecond). We anticipate that our next-version systems will be in the 704 to 710 nanosecond range. Some people have claimed slightly less, but that's in a lab environment and not actually sitting at a COLO in Chicago and clearing the orders.
The comments about physics and the "speed of light" are valid but not relevant. Everybody who is serious about HFT has their servers at a COLO in the room next to the exchange's servers.
To get into this sub-microsecond domain you cannot do very much on the host CPU except feed strategy implementation commands to the FPGA; even with technologies like kernel bypass you have 1.5 microseconds of unavoidable overhead. So in this domain everything is playing with FPGAs.
One of the other answers is very honest in saying that in this highly secretive market very few people talk about the tools they use or their performance. Every one of our clients requires that we not even tell anybody that they use our tools nor disclose anything about how they use them. This not only makes marketing hard, but it really prevents the good flow of technical knowledge between peers.
Because of this need to get into exotic systems for the "wicked fast" part of the market, you'll find that the Quants (the folks who come up with the algorithms that we make go fast) are dividing their algos into event-to-response time layers. At the very top of the technology heap are the sub-microsecond systems (like ours). The next layer is the custom C++ systems that make heavy use of kernel bypass; they're in the 3-5 microsecond range. The next layer is the folks who cannot afford to be on a 10Gb/s wire only one router hop from the "exchange"; they may still be at COLOs, but because of a nasty game we call "port roulette" they're in the dozens-to-hundreds-of-microseconds domain. Once you get into milliseconds it's almost not HFT any more.
Cheers

You've received very good answers. There's one problem, though - most algotrading is secret. You simply don't know how fast it is. This goes both ways - some may not tell you how fast they work, because they don't want to. Others may, let's say "exaggerate", for many reasons (attracting investors or clients, for one).
Rumors about picoseconds, for example, are rather outrageous. 10 nanoseconds and 0.1 nanoseconds are exactly the same thing, because the time it takes for the order to reach the trading server is so much more than that.
And, most importantly, although not what you've asked, if you go about trying to trade algorithmically, don't try to be faster, try to be smarter. I've seen very good algorithms that can handle whole seconds of latency and make a lot of money.

A good article which describes the state of HFT (in 2011) and gives some examples of hardware solutions which make nanoseconds achievable: Wall Street's Need for Trading Speed: The Nanosecond Age
With the race for the lowest "latency" continuing, some market participants are even talking about picoseconds, trillionths of a second.
EDIT: As Nicholas kindly mentioned:
The link mentions a company, Fixnetix, which can "prepare a trade" in 740 ns (i.e. the time from an input event occurring to an order being sent).

"sub-40 microseconds" if you want to keep up with Nasdaq. This figure is published here
http://www.nasdaqomx.com/technology/

For what it's worth, TIBCO's FTL messaging product is sub-500 ns within a machine (shared memory) and a few microseconds using RDMA (Remote Direct Memory Access) inside a data center. After that, physics becomes the main part of the equation.
So that is the speed at which data can get from the feed to the app that makes decisions.
At least one system has claimed ~30 ns inter-thread messaging, which is probably a tweaked-up benchmark, so anyone talking about lower numbers is using some kind of magic CPU.
Once you are in the app, it is just a question of how fast the program can make decisions.
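For a sense of how such inter-thread numbers are measured, here is a minimal ping-pong benchmark sketch of my own (not TIBCO's or anyone else's actual benchmark): two threads hand a value back and forth through std::atomic, and the average one-way latency is reported.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        constexpr int kIters = 1'000'000;
        std::atomic<int> ping{0}, pong{0};

        std::thread echo([&] {
            for (int i = 1; i <= kIters; ++i) {
                while (ping.load(std::memory_order_acquire) != i) {}  // spin-wait
                pong.store(i, std::memory_order_release);
            }
        });

        auto start = std::chrono::steady_clock::now();
        for (int i = 1; i <= kIters; ++i) {
            ping.store(i, std::memory_order_release);
            while (pong.load(std::memory_order_acquire) != i) {}  // spin-wait
        }
        auto elapsed = std::chrono::steady_clock::now() - start;
        echo.join();

        double ns = std::chrono::duration<double, std::nano>(elapsed).count();
        // Each iteration is a full round trip, so halve for one-way latency.
        std::printf("avg one-way latency: %.1f ns\n", ns / kIters / 2.0);
        return 0;
    }

On typical hardware a spin-based ping-pong like this lands somewhere in the tens to low hundreds of nanoseconds one-way, which puts the ~30 ns claim above into context.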

Every single answer here is at least four years old and I thought I would share some perspective and experience from someone in the HFT / algorithmic trading field in 2018.
(This is not to say that any of these answers are poor, as they most definitely are not; however, I believe it is necessary to provide more up-to-date insight on the topic.)
To directly answer the first question: we are talking approximately 300 billionths of a second (300 nanoseconds) from event to response.
There is always going to be some variance from firm to firm regarding the latency of systems; however, the numbers I am going to provide are common values for internal HFT engine latency.
On average, one third of this time (roughly 100 nanoseconds) is attributed to latency introduced by the program itself, as you stated in your question.
The rest is latency that exists due to co-location and other variables relating to the exchange, the matching engines, fibre optics, etc.
The question is about how fast high-frequency trading systems are and what the infrastructure looks like in terms of the hardware involved. The technology has advanced since 2014; however, contrary to a great deal of what the literature in the field discusses, FPGAs are not necessarily the go-to choice for the big players in the HFT space. Large companies such as Intel and Nvidia cater to these firms with specialized hardware to ensure they get everything they need from the trading system. With Intel the system is obviously going to be built more around CPUs and the kinds of computations best performed by CPUs, and with Nvidia the system will be more GPU-oriented.
For systems on field-programmable gate arrays (FPGAs), hardware description languages such as Verilog and VHDL are commonly used. However, not everything is in assembly even for FPGA systems; most of it is highly optimized C++ with embedded inline assembly, which is where the speed often comes from. Note that this is the case for firms using all sorts of hardware (FPGAs, specialized Intel systems, etc.).
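To make "highly optimized C++ with embedded inline assembly" slightly more concrete, here is a minimal, hedged sketch (handle_market_event is a hypothetical placeholder, not real strategy code) that timestamps a hot path with the x86 rdtsc instruction via inline assembly:

    #include <cstdint>
    #include <cstdio>

    // Read the x86 time-stamp counter via inline assembly (GCC/Clang syntax).
    static inline uint64_t rdtsc() {
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return (static_cast<uint64_t>(hi) << 32) | lo;
    }

    // Hypothetical hot-path handler standing in for real strategy logic.
    static void handle_market_event() {
        asm volatile("" ::: "memory");  // keep the compiler from optimizing this away
    }

    int main() {
        uint64_t start = rdtsc();
        handle_market_event();
        uint64_t end = rdtsc();
        // Result is in cycles, not nanoseconds: divide by the TSC frequency to convert.
        std::printf("hot path took %llu cycles\n",
                    static_cast<unsigned long long>(end - start));
        return 0;
    }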
It is unfortunate, however, that the top answer here states something completely false:
10 nanoseconds and 0.1 nanoseconds are exactly the same thing, because the time it takes for the order to reach the trading server is so much more than that.
This is completely false, as the co-location aspect of high-frequency trading has become completely standardized. Everyone is just as close to the matching engine as you are, thus the internal latency of the system is of great importance.

These days, single-digit-microsecond tick-to-trade is the bar for competitive HFT firms. You should be able to do high single digits using only software, and get below 5 microseconds with additional hardware.

According to the Wikipedia page on high-frequency trading, the delay is down to microseconds:
High-frequency trading has taken place at least since 1999, after the U.S. Securities and Exchange Commission (SEC) authorized electronic exchanges in 1998. At the turn of the 21st century, HFT trades had an execution time of several seconds, whereas by 2010 this had decreased to milli- and even microseconds.

It will never be under a few microseconds end to end, because of the electromagnetic-wave/speed-of-light limit, and only a lucky few, who must be less than a kilometre away, can even dream of getting close to that.
Also, at that point there is no coding; to achieve that speed you must go physical. (As for the article with the 300 ns switch: that is only the added latency of that switch, which corresponds to about 90 m of travel at the speed of light in vacuum, roughly 60 m through optical fibre, and a bit less in copper.)

Related

Is it possible/easy to determine how much power a program is using?

Is it possible to determine or even reasonably estimate how much power a program is using? The idea being to profile my code in terms of power consumption instead of just typical performance.
Is it enough to measure CPU use, GPU use and memory access?
There are many aspects that can influence the power consumption of an application, and these will vary a lot depending on the hardware used.
The easiest way to get an idea is to measure it. If your program is doing heavy calculations, it is quite simple to measure the difference: just read out the usage while the app is running, and subtract the usage while it is not.
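If you are on a reasonably recent Intel/AMD Linux machine, one way to take such before/after readings without a measuring device is the kernel's RAPL powercap counters. A rough sketch, assuming the /sys/class/powercap/intel-rapl:0/energy_uj file exists and is readable on your system (it isn't on all hardware, and often needs root):

    #include <cstdint>
    #include <fstream>
    #include <iostream>

    // Read the cumulative package energy counter in microjoules.
    // The path assumes the intel-rapl powercap driver is loaded; it may
    // require root and does not exist on all hardware.
    static uint64_t read_energy_uj() {
        std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
        uint64_t uj = 0;
        f >> uj;
        return uj;
    }

    // Placeholder for the heavy calculation you actually want to measure.
    static void workload() {
        volatile double x = 0;
        for (long i = 0; i < 200'000'000L; ++i) x += i * 0.5;
    }

    int main() {
        uint64_t before = read_energy_uj();
        workload();
        uint64_t after = read_energy_uj();
        // The counter wraps eventually; ignore that for a short measurement.
        std::cout << "package energy used: " << (after - before) / 1e6 << " J\n";
        return 0;
    }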
If your app is not of the heavy-calculations kind, then the challenge is a lot bigger, since a simple point-in-time comparison will not do the trick. You could get a measuring device that can log usage over time; you would then need to compare that log with the logged process activity of your machine and try to filter out all other scheduled tasks (checks for updates, etc.).
Just a tip if you want to go this way: APC's UPSes come with this functionality built in, and the PowerChute software stores a power consumption log in an Access database (C:\Program Files\APC\PowerChute Personal Edition\EnergyLog.mdb). I'm not sure if this is true for all models, but it was a nice extra feature that came with mine (Pro 550). I would lay the data alongside an Xperf trace (the free built-in profiler in Windows; look here for an overview), in order to correlate power variations with your application's activity and to filter out scheduled jobs, etc.
That said, remember you will get different results on different hardware. An SSD will differ from a traditional hard disk, and the graphics adapter used can also make a difference, so you can only get a rough overall estimation by measuring on a "typical" system. Desktop systems will consume a lot more than laptops, etc. (also see this blog post).
Power consumption profiling tools are a lot more common for mobile devices. I'm not an expert in that field, but I know there are quite a few tools out there.
There is a project that lets you get the power consumption of a given program or PID on Linux, requiring no hardware: scaphandre
With that you could create Grafana dashboards to follow the power consumption of a given project from one release to another. Examples can be seen here.

How Scalable is ZeroMQ?

How scalable is ZeroMQ? I'm especially interested in understanding its potential for running on a large number (10,000 - 15,000) of cores.
We've tried to make it as scalable as possible, but I personally tested it only on boxes with up to 16 cores. Up to that limit we've seen almost linear scaling.
You don't mention whether your 10k or 15k cores are on the same box or not.
Let's assume they are. Every two years the number of cores on a box can, theoretically, double. So if we have 16-core boxes today, it'll be 16K cores in 20 years.
So now, your question is maybe, "will ZeroMQ help my application scale to such huge numbers of cores, so that it will scale over the next 20+ years?" The answer is "yes, but only if you use it properly". This means designing your application around inproc sockets and patterns that properly divide the work and the flow of data. You will need to adjust the architecture over time.
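As a sketch of what "inproc sockets and patterns that divide the work" can look like, here is a minimal PUSH/PULL pipeline using the libzmq C API from C++ (the endpoint name and 4-worker count are my own illustration; note that inproc peers must share a single ZeroMQ context, and the bind side should be set up before peers connect):

    #include <zmq.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        void* ctx = zmq_ctx_new();

        // PUSH socket fans work out round-robin; bind before inproc peers connect.
        void* push = zmq_socket(ctx, ZMQ_PUSH);
        zmq_bind(push, "inproc://work");

        std::vector<std::thread> workers;
        for (int w = 0; w < 4; ++w) {
            workers.emplace_back([ctx, w] {
                void* pull = zmq_socket(ctx, ZMQ_PULL);  // same context: required for inproc
                zmq_connect(pull, "inproc://work");
                int task = 0;
                while (zmq_recv(pull, &task, sizeof task, 0) >= 0 && task >= 0) {
                    std::printf("worker %d got task %d\n", w, task);
                }
                zmq_close(pull);
            });
        }

        for (int task = 0; task < 16; ++task)
            zmq_send(push, &task, sizeof task, 0);
        int stop = -1;  // one poison pill per worker shuts the pipeline down
        for (size_t w = 0; w < workers.size(); ++w)
            zmq_send(push, &stop, sizeof stop, 0);

        for (auto& t : workers) t.join();
        zmq_close(push);
        zmq_ctx_term(ctx);
        return 0;
    }

The same pattern moves to one process per core or to many machines by swapping the inproc:// endpoint for ipc:// or tcp://, which is exactly the seamless migration described a little further down.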
If your question is, "can I profitably use that many cores between multiple applications", the answer lies with your O/S more than ZeroMQ. Can your I/O layer handle the load? Probably, yes.
And if your question is, "can I use ZeroMQ across a cloud of 10K-16K boxes", then the answer is "yes, this has already been proven in practice".
Note that although ZeroMQ is multithreaded internally, it may not be wise to rely solely on that to scale up to large numbers of cores. However, because ZeroMQ uses the same API for inter-machine, inter-process and inter-thread communication, it is easy to write applications with ZeroMQ that can move seamlessly to a one-process-per-core scenario or to a grid fabric of many, many machines.
ZeroMQ already has a reputation for being the fastest structured messaging protocol around so if you were going to do benchmarks to choose a technology, ZeroMQ should definitely be one of them.
The two big reasons for using ZeroMQ are its easy-to-use cross-language API (see all the examples on the ZeroMQ Guide site) and its low overhead, both in terms of bytes on the wire and in terms of latency. For instance, ZeroMQ can leverage UDP multicast to run faster than any TCP protocol, but the application programmer doesn't need to learn a new API; it is all included.

What is the significance of NCSU's double floating gate FET technology for the future of OS design?

this is my first post, and I certainly hope that it's appropriate to the forum - I've been lurking for some time, and I believe that it is, but my apologies if this is not the case.
I presume that most if not all readers here have encountered the story regarding the discovery (popularization?) of "double floating gate FETs" by NCSU, which is being hailed as a potential "universal memory."
http://www.engadget.com/2011/01/23/scientists-build-double-floating-gate-fet-believe-it-could-revo/
I've been slowly digesting a wide range of material related to software development, including OS design (I am, perhaps obviously, strictly an amateur). Given that one of the most basic functions any OS performs is managing the ever-present exchange of data between persistent storage and volatile memory, it strikes me that if this technology matures and becomes widely available, it would represent such a sea change in the role of the OS that it would almost necessitate rewriting our operating systems from the ground up to make full use of its potential. Am I correct in my estimation of the role of the OS, and in the potential ramifications of the new technology? Or have I perhaps failed to understand some critical distinction regarding the logical (as opposed to physical) relationships between processor, memory, and storage?
Until we know what it'll cost in mass production, it's impossible to say. Unless it can be made cheaply enough to replace hard drives, so that you erase the distinction between "main memory" and "long-term storage", the impact will be minimal. I'd be surprised to see that happen, though.
Even if the economics allowed that to happen, I doubt it really would. Some mobile devices already use identical battery-backed RAM for both main memory and long-term storage. This technology would eliminate the battery backing for the memory, but doesn't seem likely to impact OS design at all. Since (at least most) such devices use the same battery for normal operation and for maintaining long-term storage, this would simply mean an extra 10% (or whatever) of life from the same battery, but only if its normal operation required about the same power as normal RAM (which strikes me as unlikely; most flash draws extra power when writing).

What really is scaling?

I've heard people say that they've made a scalable web application...
What really is scaling?
What can be done by developers to make their application scalable?
What are the factors that are looked after by developers during scaling?
Any tips and tricks about scaling web applications with ASP.NET and SQL Server...
What really is scaling?
Scaling is an increase in the capacity and/or usage of your application.
What do developers do to make their application scalable?
Allow their applications to scale vertically, horizontally, or both.
Horizontal scaling is about doing things in parallel.
Vertical scaling is about doing things faster. This typically means more powerful hardware.
Often when people talk about horizontal scalability the ideal is to have (near-)linear scalability. This means that if one $5k production box can handle 2,000 concurrent users then adding 4 more should handle 10,000 concurrent users. The closer it is to that figure the better.
The ideal for highly scalable apps is to have near-limitless near-linear horizontal scalability such that you can just plug in another box and your capacity increases by an expected amount with little or no diminishing returns.
Ideally redundancy is part of the equation too but that's typically a separate issue.
The poster child for this kind of scalability is, of course, Google.
What are the factors that are looked after by developers during scaling?
How much scale should be planned for? There's no point spending time and money on a problem you'll never have;
Is it possible and/or economical to scale vertically? This is the preferred option as it is typically much, much cheaper (in the short term);
Is it worth the (often significant) cost to enable your application to scale horizontally? Distributed/multithreaded apps are significantly more difficult and expensive to write.
Any tips and tricks about scaling web applications...
Yes:
1. Don't worry about problems you'll never have;
2. Don't worry much about problems you're unlikely to have. Chances are things will have changed long before you have them;
3. Don't be afraid to throw away code and start again. Having automated tests makes this far easier; and
4. Think in terms of developer time being expensive.
(4) is a key point. You might have a poorly written app that $20,000 of hardware will essentially fix. Nowadays $20,000 buys a lot of power (64+ GB of RAM, four quad-core CPUs, etc.), probably more than 99% of people will ever need. Is it cheaper just to do that, or to spend 6 months rewriting and debugging a new app to make it faster?
It's easily the first option.
So I'll add another item to my list: be pragmatic.
My 2c definition of "scalable" is a system whose throughput grows linearly (or at least predictably) with resources. Add a machine and get 2x throughput. Add another machine and get 3x throughput. Or move from a two-processor machine to a four-processor machine and get 2x throughput.
It rarely works linearly, but a well-designed system can approach linear scalability. Add $1 of HW and get 1 unit worth of additional performance.
This is important in web apps because the potential user base is around a billion people.
Contention for resources within the app, when it is subjected to many concurrent requests, is what causes scalability to suffer. The end result of such a system is that no matter how much hardware you use, you cannot get it to deliver more throughput. It "tops out". The HW-cost versus performance curve goes asymptotic.
For example, if there's a single app-wide in-memory structure that needs to be updated for each web transaction or interaction, that structure will become a bottleneck, and will limit scalability of the app. Adding more CPUs or more memory or (maybe) more machines won't help increase throughput - you will still have requests lining up to lock that structure.
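A hedged sketch of the difference (illustrative names only, not from any particular codebase): the first handler below serializes every request on one mutex, while the sharded variant spreads updates across independent counters and aggregates only when a total is actually needed.

    #include <array>
    #include <atomic>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    // Anti-pattern (shown for contrast): one app-wide structure every request must lock.
    std::mutex g_lock;
    long g_hits = 0;

    void handle_request_contended() {
        std::lock_guard<std::mutex> guard(g_lock);  // all threads queue here
        ++g_hits;
    }

    // Sharded variant: spread updates across independent counters so
    // threads rarely touch the same cache line.
    std::array<std::atomic<long>, 16> g_shards{};

    void handle_request_sharded(unsigned thread_id) {
        g_shards[thread_id % g_shards.size()].fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < 8; ++t)
            threads.emplace_back([t] {
                for (int i = 0; i < 1'000'000; ++i) handle_request_sharded(t);
            });
        for (auto& th : threads) th.join();

        long total = 0;  // aggregate only when the total is needed
        for (auto& s : g_shards) total += s.load();
        std::printf("total hits: %ld\n", total);
        return 0;
    }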
Often in a transactional app, the bottleneck is the database, or a particular table in the database.
What really is scaling?
Scaling means accommodating increases in usage and data volume, and ideally the implementation should be maintainable.
What do developers do to make their applications scalable?
Use a database, but cache as much as possible while accommodating user experience (possibly in the session).
Any tips and tricks about scaling web applications...
There are lots, but it depends on the implementation. What programming language(s), what database, etc. The question needs to be refined.
Scalable means that your app is prepared for (and capable of handling) future growth. It can handle higher traffic, more activity, etc. Making your site more scalable can entail various things. You may work on storing more in cache, rather than querying the database(s) unnecessarily. It may entail writing better queries, to keep connections to a minimum, and resources freed up.
Resources:
Seattle Conference on Scalability (Video)
Improving .NET Application Performance and Scalability (Video)
Writing Scalable Applications with PHP
Scalability, on Wikipedia
Books have been written on this topic. An excellent one which targets internet applications but describes principles and practices that can be applied in any development effort is Scalable Internet Architectures
May I suggest a "user-centric" definition:
Scalable applications provide a consistent level of experience to each user irrespective of the number of users.
For web applications this means 24/7 anywhere in the world. However, given the diversity of the available bandwidth around the world and developer's lack of control over its performance and availability, we may re-define it as follows:
Scalable web applications provide a consistent response time, measured at the server TCP port in use, irrespective of the number of requests.
To achieve this the developer must avoid or remove all performance bottlenecks. Currently the most challenging issue is the scalability of distributed RDBMS systems.

Evaluate software minimum requirements

Is there a way to evaluate the minimum requirements of a software? I mean, how can I discover, for example, the minimum amount of RAM that my application will need?
Thanks!
A profiler will not help you here. Neither will estimating the size of data structures.
A profiler can certainly tell you where your code is spending the most CPU time, but it will not tell you if you are missing performance targets - e.g. if your users will be happy, or unhappy with the performance of your application on any given system.
Simply computing the size of data structures, and how many may be allocated at any one time will not at all give you an accurate picture of memory usage over time. The reason is that memory usage is determined by many other factors including how much I/O your application does, what OS services your application uses, and most importantly the temporal nature of how your application uses memory.
The most effective way to understand minimum requirements is to:
1. Make sure you have an effective way of measuring performance, using metrics that are important to your users. The best metric is response time; depending on your app, a rate such as throughput or operations per second may be applicable (a minimal sketch of recording response times appears after this list). Your measurements could be empirical (e.g. just try it), but that is least effective. This is best done with some kind of instrumentation; on Windows, the choice is ETW (Event Tracing for Windows, http://msdn.microsoft.com/en-us/library/ms751538.aspx). Other operating systems have other suitable mechanisms.
2. Have some kind of automated method of exercising your application. This will let you make repeated and reliable measurements.
3. Measure your application using various memory sizes and see where performance begins to suffer. This may also expose performance bugs that prevent your application from performing well. If you have access to platforms of various performance levels, use those as well. You didn't indicate what your app does, but testing on a netbook with 1 GB of memory is great for many (not all) client applications.
You can do the same with the CPU and other components such as disk, networking or the GPU.
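Here is the sketch promised in point 1: a minimal, illustrative way to record per-operation response times and report percentiles rather than a single average (do_operation is a hypothetical stand-in for your real code path):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for the operation whose latency you care about.
    static void do_operation() {
        volatile long sink = 0;
        for (int i = 0; i < 10'000; ++i) sink += i;
    }

    int main() {
        constexpr int kSamples = 10'000;
        std::vector<double> samples_us;
        samples_us.reserve(kSamples);

        for (int i = 0; i < kSamples; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            do_operation();
            auto t1 = std::chrono::steady_clock::now();
            samples_us.push_back(
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }

        // Percentiles describe user-visible latency far better than the mean.
        std::sort(samples_us.begin(), samples_us.end());
        std::printf("p50 %.1f us, p95 %.1f us, p99 %.1f us\n",
                    samples_us[kSamples * 50 / 100],
                    samples_us[kSamples * 95 / 100],
                    samples_us[kSamples * 99 / 100]);
        return 0;
    }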
Also note that there is no simple answer here; doing an effective job of setting minimum requirements is real work. This is especially true if your application is particularly sensitive to one platform aspect or another.
There are other factors as well - for example, your app may run fine in one configuration until the user opens another application that may be memory hungry or a CPU pig. Users rarely only have one application open.
This means that in addition to specifying minimum requirements you must do an effective job in setting user expectations - that is explaining when your application will perform well, and when it won't, and what the factors are that impact performance.
Ideally, you'd decide on the minimum requirements of a piece of software based on your target audience, and then test your software during development on that configuration to ensure it delivers a satisfactory experience.
You can look at a system running your software, see how much memory is being consumed by your application, and use that to guide your memory requirement. CPU is a bit more complex: you could try to model your CPU requirements, but doing this accurately can be challenging.
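On Linux, one simple way to observe your own application's memory consumption from inside the process is to parse the VmRSS line of /proc/self/status. A rough, Linux-specific sketch:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Return the process's resident set size in kB, or 0 if unavailable.
    // Linux-specific: parses the VmRSS line of /proc/self/status.
    long resident_kb() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0) {
                return std::stol(line.substr(6));  // value is reported in kB
            }
        }
        return 0;
    }

    int main() {
        std::cout << "resident memory: " << resident_kb() << " kB\n";
        return 0;
    }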
But ultimately, you need to test your app on the base system you are targeting.
Given the data structures used by the application, estimate how much space they will take up in normal use. Using that estimate, set up a number of machines (virtual or physical) to test it in different scenarios (e.g. different target operating systems, different virtual memory settings, etc.).
Then measure the performance of the application in the different scenarios. Your minimum requirements will correspond to the least capable machine whose performance is still acceptable.
You could try using a performance profiler on your software while stress testing it.
You could use virtualization to repeatedly run a representative test suite with different amounts of RAM in the virtual machine...when the performance falls below acceptable levels due to swapping, you've found the memory requirement.
