How to handle a very large amount of data in a running project? - ruby-on-rails

I have a Rails project with a product table that stores a very large number of records, and the data grows every day. When we run queries on that table they take almost 2 to 4 minutes.
My question is how I can handle this. The project is already built and is no longer in a state where we can redesign its architecture.
I guess this is a problem of handling big data, but I am not aware of the standards or practices for doing that.
I tried applying indexing, and it helps, but there is still too much data. I need another technique to handle it; maybe I need to split the table into two tables, but I'm not sure.
The goal is to keep storing incoming data daily and to run queries in milliseconds, not minutes.

If you're committed to PostgreSQL tables, as it seems you are, here's a course of action you should take.
Start by reading this: https://stackoverflow.com/tags/query-optimization/info
Identify the three queries in your application that take the most elapsed time. Those could be long-running report-generation queries, or frequently-run transactional queries.
Use PostgreSQL's EXPLAIN ANALYZE command to figure out how your server satisfies them (a short sketch follows at the end of this answer). You'll have to learn a lot the first few times you gaze upon this SQL execution plan stuff, but it is worth your time to learn.
Come back here and ask query-optimization questions as you need to.
Add or adjust table indexes.
Repeat these steps for the next three slowest queries.
Plan on doing this every few months as a routine maintenance operation as long as your app continues to grow (hopefully for many years 😀).
Investigate your slowest queries before you commit to restructuring your data. For example, splitting your data into tables by month, or whatever, will be a lot of work and will add testing and ops complexity. Be sure that is justified before doing it, and that it will actually help. Many properly indexed SQL tables scale up to hundreds of megarows very well indeed.
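For reference, here is a minimal sketch of the EXPLAIN / indexing step from a Rails console and a migration. The product table, its columns, and the example query are assumptions made for illustration; adapt the names to your actual schema.

    # Rails console: ask ActiveRecord to print the PostgreSQL query plan
    # for one of the slow queries (the conditions here are made up).
    Product.where(category_id: 42).order(created_at: :desc).limit(100).explain

    # Roughly equivalent raw SQL, runnable in psql:
    #   EXPLAIN ANALYZE SELECT * FROM products
    #   WHERE category_id = 42 ORDER BY created_at DESC LIMIT 100;

    # If the plan shows a sequential scan over the whole table, a migration
    # adding an index that matches the query's filter/sort is often the fix.
    class AddCategoryCreatedAtIndexToProducts < ActiveRecord::Migration[6.1]
      def change
        add_index :products, [:category_id, :created_at]
      end
    end

Re-run the EXPLAIN after each index change; the plan, not guesswork, should tell you whether the new index is actually being used.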

Related

Do dynamic tables make sense for my use case? Database/Model architecture

My Rails app allows users to set up a data feed (typically a REST API), and pulls in results at specific intervals to allow the user to later filter/sort/chart/export the data. An example could be pulling a stock price every 15 minutes and saving its value and a timestamp as a row in a table.
Since there could be many users with many feeds setup, I'm trying to determine the best way to handle all of this data in Rails.
I feel like I should stay away from one large mega table with a feed_id on each row since there could be millions and millions of rows very quickly (50 users with 5 feeds running every 15 minutes would be 25,000 rows per day). Will this get unwieldy too quickly or am I underestimating the power of Rails/Postgres? What is the limit?
Another option I came up with was giving each feed its own table – create a table when the feed is added and save the data there. In discussions I've read it seems like dynamic table creation is frowned upon except in special circumstances and I'm wondering if this one fits the mold.
The last option would be adding a second database - potentially NoSQL like MongoDB. I'd rather keep everything in one DB if possible but if that really will yield the best performance and reliability I'd give it a go.
I would love to hear people's experiences and opinions on tackling something like this with Rails.
25,000 rows per day works out to roughly 9 million per year. In this case you're well within the limits of PostgreSQL for many years. Stock prices are mostly numeric, so, if I were you, I'd have a single simple SQL table for all this data. Just avoid extra-long rows (long text columns) and you should be fine.
In the future you could further extend your solution with partitioning (e.g. monthly or yearly) or move older data to an archive.
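A plausible shape for that single table, sketched as a Rails migration; the table and column names are invented for this example rather than taken from the question.

    class CreateFeedEntries < ActiveRecord::Migration[6.1]
      def change
        create_table :feed_entries do |t|
          t.references :feed, null: false, index: false
          t.decimal    :value, precision: 18, scale: 6
          t.datetime   :recorded_at, null: false
        end
        # Most queries will filter by feed and a time range, so index that pair.
        add_index :feed_entries, [:feed_id, :recorded_at]
      end
    end

With that composite index, a lookup like FeedEntry.where(feed_id: id, recorded_at: from..to) should remain an index range scan even at tens of millions of rows.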

Which approach promises better performance - a megaquery or several targeted queries?

I am creating an SSRS report that returns data for several "Units", which are all to be displayed in a single row, with Unit 1 first, Unit 2's data to its right, etc.
I can either get all this data using a Stored Proc that queries the database using an "IN" clause, or with multiple targeted ("Unit = Bla") queries.
So I'm thinking I can either filter each "Unit" segment with something like "=UNIT:[Unit1]", OR I can assign a different Dataset to each segment (with the targeted data).
Which way would be more "performant" - getting a big chunk of data, and then filtering the same thing in various locations, or getting several instances/datasets of targeted data?
My guess is the latter, but I don't know if maybe SSRS is smart enough to make the former approach work just as well or better by doing some optimizing "behind the scenes".
I think it really depends on how big the big chunk of data is. My experience has been that SSRS can process quite a large amount of data after it comes back from the database, and it does so quickly. If the report is going to aggregate the data in the end, I try to do as much of that as I can on the database end, because the database server usually has more resources to do all that work. But if the detail is needed, and you can aggregate on the report server end easily enough, pull the 10K records and do it there.
I lean toward hitting the database as few times as possible, but sometimes it just makes sense to get the data I need with individual queries. I have built reports with over 20 datasets, each for very specific measures that just didn't union up well. Breaking it up like this took the report run time from 3 minutes to 20 seconds.
Not a great answer if you were looking for which exact solution to go with. It depends on the situation. Often, trial and error gets you to the answer for the report in question.
SSRS is not going to do any "optimizing", and the rendering requirements sound trivial, so you should probably treat this as a SQL query issue, not really an SSRS one.
I would expect the single SELECT with an IN clause to be faster, as it will require fewer I/Os on the database files. A stored procedure is not required; you can just write a SELECT statement.
A further benefit is that you will be left with N times less code to maintain (where N = the number of Units), and you can guarantee the consistency of the code/logic across Units.

Ruby on Rails database and application design

We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for the user to move from version to version; the versions are a kind of "snapshot" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications with an API.
Considering the large amount of data, we thought of splitting the database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know whether this kind of data separation makes any sense. If it does, how do we provide proper communication between the different applications and the individual services, and between the services themselves? This will also be required.
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot (a rough schema sketch follows the three models below). In this, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this approach are important to recognize and understand.
Event model. This is append-only, with no updates. In this model only inserts occur; metadata used by some queries is updated only when absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes, provided that the integrity of the other models is not compromised. In a financial application this might be things like customer or vendor data, employee data, and the like.
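As a rough sketch of the first two models in Rails migration terms, using the financial analogy above; all table and column names are invented for illustration.

    # Event model: append-only journal entries (application code never
    # updates or deletes these rows).
    class CreateJournalEntries < ActiveRecord::Migration[6.1]
      def change
        create_table :journal_entries do |t|
          t.references :account, null: false
          t.decimal    :amount, precision: 15, scale: 2, null: false
          t.boolean    :posted, null: false, default: false
          t.datetime   :entered_at, null: false
        end
        add_index :journal_entries, [:account_id, :entered_at]
        # Partial index (PostgreSQL): pull only the not-yet-posted entries cheaply.
        add_index :journal_entries, :account_id, where: "posted = false",
                  name: "index_journal_entries_unposted"
      end
    end

    # Aggregate closing model: one closing balance per account per period.
    class CreateClosingBalances < ActiveRecord::Migration[6.1]
      def change
        create_table :closing_balances do |t|
          t.references :account, null: false
          t.date       :period_end, null: false
          t.decimal    :balance, precision: 15, scale: 2, null: false
        end
        add_index :closing_balances, [:account_id, :period_end], unique: true
      end
    end

A current balance is then the latest closing_balances row plus the sum of journal_entries after that period_end, which keeps the expensive aggregation bounded instead of growing with the full history.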

Suitability of Amazon SimpleDB for large temporal data sets emanating from thousands of separate devices

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search for solutions to this, I did come across the Stack Overflow question What is the best open source solution for storing time series data?, which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it at the time.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range key - e.g. timestamps. This allows much more targeted searches, e.g. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
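To make the hash key + range key idea concrete, here is a minimal sketch using the Ruby AWS SDK (the asker mentions Python and boto, where the query shape is the same). The table name, key names, region, and timestamp format are all assumptions for this example.

    require "aws-sdk-dynamodb"  # gem: aws-sdk-dynamodb

    client = Aws::DynamoDB::Client.new(region: "us-east-1")

    # Assumes a table "sensor_readings" with hash key device_id (string)
    # and range key utc_timestamp (ISO 8601 string).
    resp = client.query(
      table_name: "sensor_readings",
      key_condition_expression:
        "device_id = :d AND utc_timestamp BETWEEN :from AND :to",
      expression_attribute_values: {
        ":d"    => "device-123",
        ":from" => "2012-06-04T00:00:00Z",
        ":to"   => "2012-06-08T23:59:59Z"
      }
    )

    resp.items.each { |item| puts item.inspect }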
In my opinion, Amazon SimpleDB, like Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that lets most of your queries be executed against a single domain, without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
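As a toy illustration of one possible domain-per-month partitioning scheme (the naming is entirely hypothetical; to SimpleDB these are just separate domains):

    require "date"
    require "time"

    # Route each reading to a domain named after its calendar month, e.g.
    # "sensor_readings_2012_06". A query such as "device X between Monday
    # and Friday" then only touches the one or two domains covering that range.
    def domain_for(timestamp)
      t = Time.parse(timestamp).utc
      format("sensor_readings_%04d_%02d", t.year, t.month)
    end

    def domains_for_range(from_date, to_date)
      (Date.parse(from_date)..Date.parse(to_date))
        .map { |d| format("sensor_readings_%04d_%02d", d.year, d.month) }
        .uniq
    end

    puts domain_for("2012-06-04T10:15:00Z")              # sensor_readings_2012_06
    p    domains_for_range("2012-06-28", "2012-07-03")   # two domains spanning the range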

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored proc. which returns order history is painfully slow due to the amount of data + the numerous joins which must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers perhaps). I've heard of Lucene but it seems geared more toward text searches and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, and then identify your facts and the dimensions associated with those facts. Then you divide the dimensions into tables which allow the dimensions to grow only slowly over time. The choice of dimensions is completely practical and based on the data behavior.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to update a reporting database from scratch several times a day (no history, just repopulating from a 3NF for a different model of the same data). There are certain realtime data warehousing techniques which just apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized Views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregate data. Since you didn't mention any specifics of your platform (server specs, number of rows, number of hits/second, etc.), I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema, just enough to serve up your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.
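As a rough sketch of that kind of targeted denormalization, expressed in Rails/ActiveRecord terms to match the rest of this page (the same idea applies on any stack): keep a flat, read-optimized table that pre-joins just the columns the order-history screen needs, and repopulate it on a schedule. All table and column names here are invented for illustration.

    # Flat table covering only what the order-history screen displays.
    class CreateOrderHistoryEntries < ActiveRecord::Migration[6.1]
      def change
        create_table :order_history_entries do |t|
          t.references :customer, null: false
          t.references :order, null: false
          t.string     :product_name
          t.decimal    :line_total, precision: 12, scale: 2
          t.datetime   :ordered_at
        end
        add_index :order_history_entries, [:customer_id, :ordered_at]
      end
    end

    # Periodic refresh, e.g. run from a scheduled job. The source tables and
    # columns are assumed to exist with these names.
    class RefreshOrderHistory
      def self.run
        ActiveRecord::Base.transaction do
          ActiveRecord::Base.connection.execute("DELETE FROM order_history_entries")
          ActiveRecord::Base.connection.execute(<<~SQL)
            INSERT INTO order_history_entries
              (customer_id, order_id, product_name, line_total, ordered_at)
            SELECT o.customer_id, o.id, p.name, oi.quantity * oi.unit_price, o.created_at
            FROM orders o
            JOIN order_items oi ON oi.order_id = o.id
            JOIN products p ON p.id = oi.product_id
          SQL
        end
      end
    end

The order-history query then becomes a single indexed read against one table, at the cost of some staleness between refreshes.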
