System trading applications - caching historical data in local database?

Most trading applications receive a data feed from commercial providers such as IQFeed or from brokerages that support a trading API. Is there merit in storing it in a local database? Intraday data is massive, and the database would grow very quickly with 1-minute data for just 50 stocks, never mind tick-by-tick data. I suspect this would be a nightmare for database backups and may impact performance.
If you get historical data in text files on DVD or online, then storing it in a database is the only logical choice, but would it still be a good idea if you get it through an API?

It's all about storage space, really. You can definitely do it through the API, but make sure you don't do it in the same application that is doing the automated trading for you.
As you said, tick data is pretty much out of the question; 1-minute data would mean approximately 400 bars per day per symbol, or about 20,000 bars per day for 50 symbols.
The storage requirement can be calculated from that: if you are storing OHLC bars, each bar can be held in four values of type int.
As the other answer pointed out, performance may be an issue with more and more symbols but shouldn't be a problem with 50 symbols on 1 minute bars.
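To make the back-of-the-envelope numbers above concrete, here is a minimal sketch of a local 1-minute bar cache, assuming plain SQLite and prices stored as scaled integers (the table and column names are illustrative, not from any particular feed API):

```python
import sqlite3

# Rough sketch: a local cache of 1-minute OHLC bars in SQLite.
# Prices are assumed to be stored as integers scaled by 10,000
# (e.g. 123.45 -> 1234500) so four ints per bar suffice.
conn = sqlite3.connect("bars_1min.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS bars (
        symbol   TEXT    NOT NULL,
        bar_time INTEGER NOT NULL,   -- Unix timestamp of the bar open
        open     INTEGER NOT NULL,
        high     INTEGER NOT NULL,
        low      INTEGER NOT NULL,
        close    INTEGER NOT NULL,
        volume   INTEGER NOT NULL,
        PRIMARY KEY (symbol, bar_time)
    )
""")

# Back-of-the-envelope storage estimate using the figures from above.
bars_per_day = 400      # ~6.5 hour session of 1-minute bars
symbols = 50
bytes_per_row = 60      # rough guess including index overhead
per_year = bars_per_day * symbols * 252 * bytes_per_row
print(f"~{per_year / 1e6:.0f} MB per year for {symbols} symbols")  # a few hundred MB
```

At that scale (a few hundred MB per year), neither backups nor query performance should be a real concern; it is tick data that changes the picture.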

This is a performance question. If the API is fast enough, then use that. If it's not and caching will help, then cache it. Only your application and your usage patterns can tell you whether caching is really necessary.

Related

ios: large number of database records - possible?

I may need to process a large number of database records, stored locally on an iPad, using Swift. I've yet to choose the database (but it will likely be SQLite unless suggestions point elsewhere). The records could number up to 700,000 and would need to be processed by adding up totals, working out percentages, etc.
Is this even possible on an iPad with limited processing power? Is there a software storage limit? Is the iPad up to processing such large amounts of data on the fly?
Another option may be to split the data into smaller chunks of around 30,000 records and work with those. Even then I am not sure it's a practical thing to attempt.
Any advice on how, or if, to approach this and what limitations may apply?
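For what it's worth, the totals and percentages can usually be pushed into SQLite itself rather than looping over 700,000 records in application code. A minimal sketch of that idea, using Python's sqlite3 purely as a stand-in for whatever SQLite wrapper you end up with on iOS (the records table and its columns are made up):

```python
import sqlite3

conn = sqlite3.connect("data.db")

# Let SQLite do the aggregation: totals and percentages over ~700,000 rows
# are computed inside the database engine, not by iterating in app code.
rows = conn.execute("""
    SELECT category,
           SUM(amount) AS total,
           100.0 * SUM(amount) / (SELECT SUM(amount) FROM records) AS pct_of_all
    FROM records
    GROUP BY category
""").fetchall()

for category, total, pct in rows:
    print(category, total, round(pct, 2))
```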

NoSQL (BigTable...) and TimeSeries Data

I work in an organization that collects/stores a lot of time series data (time=value, time=value...). Today we use a historian to collect and process this data. The main advantage of using a historian was to compress the data and be more efficient in terms of data storage. However, with technologies such as Big Data and NoSQL, it seems the effort to compress data (because of storage $$) is fading and the trend is to store "lots" of data.
Has anyone experimented with replacing a time-series historian with a Big Data solution? I'm aware of OpenTSDB; has anyone used it in a non-IT role?
Would a NoSQL database (Cassandra...) be a good fit for time-series data? If so, what might an implementation look like?
Is the priority just collecting and storing the data, or are speed and ease of analysis essential?
For most reasonable data sizes standard SQL will suffice.
Above that, and especially for analysis, you would ideally want an in-memory, column-oriented database. At the high end this means kdb from kx.com, which is used by all the major banks ($$ expensive). However, since you ask specifically about open source, I'd consider MonetDB or MySQL in-memory, depending on your data size and access requirements.
Cassandra is one of the more appropriate choices from the nosql bunch and people have tried using it already:
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://synfin.net/sock_stream/technology/advanced-time-series-metric-data-with-cassandra
I found I was spending a lot of time hacking around at the smallest data level to get things to work, and writing a lot of verbose code, which then spread my data over multiple servers and tried to make up for the inefficient storage by using multiple machines. When I evaluated it, its time support and functions for manipulating time were poor, and I couldn't do much more than easily pull out ranges. For those reasons I moved on from Cassandra.
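For reference, the pattern those DataStax-style articles describe is usually to bucket samples by series and calendar day so that no single row or partition grows without bound. A rough illustration of the key layout only, with invented names and no particular client library:

```python
from datetime import datetime, timezone

def partition_key(metric: str, ts: datetime) -> str:
    """Bucket a time series by metric and calendar day so no single
    row/partition grows without bound (the wide-row pattern)."""
    return f"{metric}:{ts.strftime('%Y%m%d')}"

def column_name(ts: datetime) -> int:
    """Within a bucket, order samples by timestamp (ms since epoch)."""
    return int(ts.timestamp() * 1000)

now = datetime(2013, 5, 1, 14, 30, tzinfo=timezone.utc)
print(partition_key("sensor42.temperature", now))  # sensor42.temperature:20130501
print(column_name(now))
```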

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separate database for reporting purposes, but knowing that I will need to store a huge amount of data, I have a lot of questions:
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about results of operations) and I will need, for example, to run a report to know how many users failed an operation during the previous month.
I know that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end users want from reporting, or how they want to (or should) visualize the data. Once you have some concepts in mind, start working backwards to how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RDBMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database, provided the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job at aggregation and scale pretty far. This would also give you the ability to join data together and return complex results as the users request them.
Just remember, replication isn't easy or without its own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
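For the kind of report mentioned in the question (how many users failed an operation during the previous month), a replicated copy plus plain SQL aggregation goes a long way. A small sketch against an assumed operations table (SQLite and the table/column names are stand-ins, not the actual schema):

```python
import sqlite3  # stand-in; in practice this would be the reporting DB connection

conn = sqlite3.connect("reporting.db")

# "How many users failed an operation during the previous month?"
# assuming an operations table with user_id, status and created_at columns.
failed_users = conn.execute("""
    SELECT COUNT(DISTINCT user_id)
    FROM operations
    WHERE status = 'failed'
      AND created_at >= date('now', 'start of month', '-1 month')
      AND created_at <  date('now', 'start of month')
""").fetchone()[0]
print(failed_users)
```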
Batching
On the other hand, if reporting falls under the scheme of sending out standardized reports with little interaction, I wouldn't necessarily recommend backing it with an RDBMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RDBMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RDBMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or summarized results would then be shipped to a key/value store or RDBMS to make reporting easier and achieve higher performance, at the cost of latency, compute, and possibly storage.
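As a toy illustration of that "aggregate once, then ship denormalized results to a key/value store" step (a plain dict stands in for the real store, and the record shapes are invented):

```python
from collections import defaultdict

# Toy batch job: aggregate raw events once, then ship denormalized,
# report-ready results to a key/value store.
raw_events = [
    {"user_id": 1, "op": "export", "ok": False},
    {"user_id": 1, "op": "export", "ok": True},
    {"user_id": 2, "op": "export", "ok": False},
]

summary = defaultdict(lambda: {"attempts": 0, "failures": 0})
for e in raw_events:
    s = summary[(e["user_id"], e["op"])]
    s["attempts"] += 1
    s["failures"] += 0 if e["ok"] else 1

kv_store = {}  # e.g. Redis or memcached in practice
for (user_id, op), s in summary.items():
    kv_store[f"report:{user_id}:{op}"] = s  # precomputed, no joins at read time

print(kv_store["report:1:export"])  # {'attempts': 2, 'failures': 1}
```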
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend using a pre-built reporting service rather than writing reports manually if you need a large set of reports.
You might want to look at Tableau (http://www.tableausoftware.com/) and others that are available.
Database: yes, a separate one seems safer. Reporting is generally for old, consolidated data, and your live data might be too large to perform analysis on.
Database type: you have to choose based on the reporting services used, though I think MongoDB is not supported by any of the reporting services; MySQL is preferred.
If there are only one or two reports, you could just build them in Rails.

Suitability of Amazon SimpleDB for large temporal data sets emanating from thousands of separate devices

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search for solutions to this, I did come across the Stack Overflow question "What is the best open source solution for storing time series data?", which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and eventually got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range, e.g. timestamps. This gives you much greater confidence in search, i.e. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
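As a sketch of that hash-key plus range-key access pattern, here is roughly what the "all data for device X in a time window" query looks like with boto3 (an assumption on my part; the original boto library predates it), using the row shape from the question with device_id as the hash key and utc_timestamp as the range key:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hash key: device_id, range key: utc_timestamp (epoch seconds).
# Table and attribute names follow the row layout in the question.
table = boto3.resource("dynamodb").Table("sensor_readings")

resp = table.query(
    KeyConditionExpression=(
        Key("device_id").eq("device-123")
        & Key("utc_timestamp").between(1338508800, 1338940800)  # window bounds
    )
)
for item in resp["Items"]:
    print(item["utc_timestamp"], item["value1"], item["value2"])
```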
In my opinion, Amazon SimpleDB, as well as Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting, it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time-series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that lets most of your queries be executed from a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here.
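As a small sketch of one possible partition strategy (the naming scheme is illustrative, not a SimpleDB requirement): hash the device to a bucket so one device's data stays together, and add the month so no single domain grows without bound.

```python
import zlib
from datetime import datetime, timezone

def domain_for(device_id: str, utc_timestamp: int, n_buckets: int = 16) -> str:
    """Pick a SimpleDB domain (partition) for a reading: hash the device
    into a bucket and append the calendar month."""
    bucket = zlib.crc32(device_id.encode()) % n_buckets
    month = datetime.fromtimestamp(utc_timestamp, tz=timezone.utc).strftime("%Y%m")
    return f"readings_{bucket:02d}_{month}"

print(domain_for("device-123", 1338508800))  # e.g. readings_<bucket>_201206
```

Queries for one device over a date range then touch only a handful of domains instead of the whole data set.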

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored procedure that returns order history is painfully slow due to the amount of data and the numerous joins that must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text search and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you split the dimensions into tables so that the dimensions only grow slowly over time. The choice of dimensions is entirely practical and based on how the data behaves.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to update a reporting database from scratch several times a day (no history, just repopulating from a 3NF for a different model of the same data). There are certain realtime data warehousing techniques which just apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized Views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for combined with fast access of aggregate data. Since you didn't mention any specifics (platform, server specs, number of rows, number of hits/second, etc) of your platform, I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.
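As a small illustration of that "denormalize just enough" idea, an order-history summary table can be precomputed so the interactive query reads one indexed table instead of joining Order/OrderItem/Payment/Customer/Address every time. SQLite and the table/column names below are stand-ins for whatever the site actually runs on:

```python
import sqlite3  # stand-in for the real RDBMS

conn = sqlite3.connect("shop.db")

# One flat row per order, refreshed periodically (or by trigger)
# from the normalized tables.
conn.executescript("""
    CREATE TABLE IF NOT EXISTS order_history_summary (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT    NOT NULL,
        item_count  INTEGER NOT NULL,
        order_total NUMERIC NOT NULL
    );
    CREATE INDEX IF NOT EXISTS idx_ohs_customer
        ON order_history_summary (customer_id, order_date);
""")

conn.execute("""
    INSERT OR REPLACE INTO order_history_summary
    SELECT o.id, o.customer_id, o.order_date,
           COUNT(oi.id), SUM(oi.quantity * oi.unit_price)
    FROM "Order" o JOIN OrderItem oi ON oi.order_id = o.id
    GROUP BY o.id, o.customer_id, o.order_date
""")
```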
