I am trying to understand the fundamentals of data warehousing.
While loading fact tables, I found two recommendations:
Separate inserts and updates during the load.
Drop indexes and rebuild them after the load.
What is the advantage of following them?
Simple answers without going into details:
Typically you want to do different things with new data (inserts) and with changed data (an update or an insert, depending on how you treat changes), so it helps to handle the two separately.
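A rough sketch of keeping the two paths separate (all table and column names here are made up, and the UPDATE ... FROM form is PostgreSQL-style; other databases use MERGE or their own join-update syntax):

```sql
-- Land the incoming batch in a staging table first (hypothetical names).
-- Insert rows that are not in the fact table yet.
INSERT INTO fact_sales (sale_id, customer_id, amount)
SELECT s.sale_id, s.customer_id, s.amount
FROM stg_sales AS s
LEFT JOIN fact_sales AS f ON f.sale_id = s.sale_id
WHERE f.sale_id IS NULL;

-- Update rows that already exist but have changed.
UPDATE fact_sales AS f
SET amount = s.amount
FROM stg_sales AS s
WHERE s.sale_id = f.sale_id
  AND s.amount <> f.amount;
```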
An index has to be maintained whenever you modify the table, which results in a lot of slow index updates as each row of data is loaded. When you load a lot of data (a typical scenario in a data warehouse), this slows the loading process down significantly for no good reason. It is therefore strongly advised to drop the indexes before you load a large amount of data and recreate them only once, when the load is done.
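A minimal sketch of the drop/reload/recreate pattern (index and table names are hypothetical; the exact DROP INDEX syntax varies by database):

```sql
-- Drop the index once, before the bulk load.
DROP INDEX idx_fact_sales_customer;

-- Bulk-load the data (in practice often done with COPY, BULK INSERT, or a loader utility).
INSERT INTO fact_sales (sale_id, customer_id, amount)
SELECT sale_id, customer_id, amount
FROM stg_sales;

-- Recreate the index a single time, after all rows are in.
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);
```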
I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, at some arbitrary cutoff, I want to "offline" the user's query: perform it in the background, save the result set to a new table, and use a job to email the user a link that treats this as a cached result and skips straight to rendering the graphs in the browser. These jobs/results could be saved for a long time in case people wanted to revisit the particular problem they were working on, or emailed to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
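To be concrete, the "save the result set to a new table" step I have in mind is essentially just materializing the ad-hoc query, something like this (names are made up for illustration):

```sql
-- Materialize the ad-hoc query's results so the report page can re-read them months later.
CREATE TABLE cached_result_42 AS
SELECT item_id, category, score
FROM mined_items
WHERE score > 0.9;   -- the user's ad-hoc filters would go here
```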
Presumably the data is being loaded through lots of joined tables, which is why it takes so long to load.
You are also performing calculation/visualization tasks on the data you fetch from the DB before showing it in the UI.
I'd like to recommend a few approaches to your problem:
Minimize the number of joins/nested joins in your DB queries.
Add some denormalized tables/columns. For example, if you are showing the count of a user's comments, you can add a new column to the users table to store that count directly. A scheduled job or a callback can keep the count up to date (see the SQL sketch after this list).
Also try to minimize any calculations performed on the UI side.
You can also use lazy loading to fetch the data in chunks.
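Here is a hedged SQL sketch of the counter-column idea from the second point (all names are invented; in Rails you could get the same effect with a counter cache or a callback):

```sql
-- Add a denormalized counter column to the users table (hypothetical schema).
ALTER TABLE users ADD COLUMN comments_count INTEGER DEFAULT 0;

-- Backfill it once, and re-run this from the scheduled job
-- (or keep it current from a callback whenever a comment is created or destroyed).
UPDATE users
SET comments_count = (
  SELECT COUNT(*) FROM comments WHERE comments.user_id = users.id
);
```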
Thanks, I hope this helps you decide where to start 🙂
By using both the synchronous=OFF and journal_mode=MEMORY options, I am able to reduce the time taken by updates from 15 ms to around 2 ms, which is a major performance improvement. These updates happen one at a time, so many other optimizations (like wrapping a bunch of them in a transaction) are not applicable.
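For reference, the two settings are just PRAGMA statements issued on the connection (shown here as plain SQL; on iOS they would go through whatever SQLite wrapper the app uses):

```sql
-- Don't wait for the OS to confirm that writes have reached the disk.
PRAGMA synchronous = OFF;

-- Keep the rollback journal in memory instead of in a file next to the DB.
PRAGMA journal_mode = MEMORY;
```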
According to the SQLite documentation, the DB can go 'corrupt' in the worst case if there is a power outage of some type. However, isn't the worst that can happen that some data is lost, or possibly part of a transaction (which I guess is a form of corruption)? Is it really possible for arbitrary corruption to occur with either of these options? If so, why?
I am not using any transactions, so partially written data from transactions is not a concern, and I can handle losing data once in a blue moon. But if 'corruption' means that all the data in the DB can be randomly changed in an unpredictable way, that would be a strong reason not to use these options.
Does anyone know what the real worst-case behavior would be on iOS?
Tables are organized as B-trees with the rowid as the key.
If some writes get lost while SQLite is updating the tree structure, the entire table might become unreadable.
(The same can happen with indexes, but those could be simply dropped and recreated.)
Data is organized in pages (typically 1 KB or 4 KB). If some page update gets lost while a tree is being reorganized, all the data in those pages (i.e., some random rows from the table with nearby rowid values) might become corrupted.
If SQLite needs to allocate a new page, and that page contains plausible data (e.g., deleted data from the same table), and the writing of that page gets lost, then you have incorrect data in the table, without the ability to detect it.
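If you do run with those settings, SQLite's built-in checks can at least detect the structural damage described above after the fact (though not the silently wrong data in the last case):

```sql
-- Walks the B-trees and pages; returns 'ok' when no structural corruption is found.
PRAGMA integrity_check;

-- A faster, less thorough variant of the same check.
PRAGMA quick_check;
```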
I searched a lot but couldn't quite find what I was specifically looking for. The question is simple and straightforward.
I have a database table, which gets populated every second!
Next, I have mostly defined the analysis methods/classes in the Apache Storm Spout/Bolt classes.
All I wish to do is send the new rows being inserted every second to the Spout class as a stream input.
How Do I do this?
Thanks,
There are several ways you could accomplish this, but without knowing more about the nature of the data it's hard to give a good answer. One way would be to use another table to track which records have already been processed by storm based on some field in the original table. For instance, if you used a timestamp column you could track the maximum timestamp you have already processed. There are some potential race conditions you have to be careful of with both the reading/updating of the metadata table as well as the actual data table, but both of those can be managed with transactions and proper time synchronization.
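A rough SQL sketch of that bookkeeping, with made-up names (the parameter binding and exact transaction handling depend on your driver):

```sql
-- One-row-per-source bookkeeping table holding the high-water mark.
CREATE TABLE storm_progress (
  source_table      VARCHAR(64) PRIMARY KEY,
  last_processed_at TIMESTAMP NOT NULL
);

-- Each spout poll, inside a transaction: read everything newer than the mark.
SELECT *
FROM readings
WHERE created_at > (SELECT last_processed_at
                    FROM storm_progress
                    WHERE source_table = 'readings')
ORDER BY created_at;

-- After emitting those rows, advance the mark to the largest created_at actually
-- read (bound as a parameter here), so rows inserted mid-poll are not skipped.
UPDATE storm_progress
SET last_processed_at = :max_emitted_created_at
WHERE source_table = 'readings';
```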
Teradata provides Queue table functionality. These tables support a "select and consume" operation, which removes rows from the table as soon as you select them. For more information: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1146_111A/ch01.032.045.html#ww798205
This approach assumes that the table in Teradata is used as a buffer and nobody else needs it.
If you need both a permanent full table (for some other application) and a stream of this data into Storm, you may want to modify your loading process to populate a queue table as well as the permanent table. Other applications can then use the full data depth in the permanent table, while Storm consumes data from the queue table with minimal space impact.
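Roughly, the queue-table approach looks like this (column names are invented; check the linked manual for the exact rules, e.g. the first column must be the queue insertion timestamp and SELECT AND CONSUME only supports TOP 1):

```sql
-- Teradata queue table: the QUEUE option plus a QITS timestamp as the first column.
CREATE MULTISET TABLE stream_buffer, QUEUE (
  qits       TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
  reading_id INTEGER,
  value_read DECIMAL(18,4)
);

-- Returns and removes one row; waits if the table is currently empty.
SELECT AND CONSUME TOP 1 * FROM stream_buffer;
```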
We have a core data DB (sqlite store) which, for some users, is about 100-150 MB. I wouldn't think that would be too big for a storage system to deal with (even on a mobile device), but we've found that with that size core data DB, ANY lightweight migration takes ~10+ seconds to complete. Even something as simple as adding a completely new independent entity (not related to any other entity). With raw sqlite this would be a create table statement. So, my question is whether anyone else has seen this and, if so, have they found a workaround to make such simple migrations faster? Specifically, I'm looking for a way to handle adding a new independent entity to an existing core data DB that's ~100-150 MB and have it be quick (i.e., under 5 seconds).
I believe that core data migrations ALWAYS have to read all of the data from the source and write it all to a destination for a migration (which is terrible BTW), but I'm hoping someone can prove me wrong. :) I couldn't find any way to do it with a custom migration either.
I've considered munging the DB with SQL directly to basically make the model look like what Core Data would expect (I've done stuff like this to manually "downgrade" a Core Data DB for debugging purposes), but we really want to avoid doing something like that in production.
UPDATE:
For reference, this is the current approach we are taking. This is not a generic solution, but will work for our use case. Unless I get a better answer I'll add this as an answer at some point in the future and accept it.
We're going to deal with this by essentially making the DB smaller. There are 2 out of 15 entities that take up the bulk of the space in the DB (~95%). We're going to create completely separate data models, each with one of those entities. This is done without changing the main model at all (hence, no Core Data migration). We'll then create a task that runs at background priority in GCD and, if any of those entities are found in the main DB, moves them to the appropriate separate DB and removes them from the main DB. This is done in batches, with some sleeps between batches, so it's less resource intensive and doesn't affect normal app operation. We'll modify the code that accesses those entities to try to get them from the new DB and fall back to the main DB if they're not there.
In a future update after we find that all, or at least most, of our users have updated their data in the new DBs we'll drop those entities from the main DB.
This leaves us with a small main DB that can have migrations applied quickly and two large DBs that have migrations done more slowly. Those large DBs, in our case, should change less often (maybe never?) and even if they do change there are limited places in the app that require them so we can work around it in the UI (e.g., report some feature as unavailable until we move data).
A 10-20 second delay for an update to a huge dataset seems perfectly reasonable to me. Just don't do it on the main thread.
This means you'll have to modify the boilerplate Core Data stack setup that you get in the usual Xcode templates. Instead of always setting up the stack on the main thread at launch time, check to see if migration is needed. If so, put up appropriate UI, do the migration in a background thread, and be ready to invoke beginBackgroundTaskWithExpirationHandler: if needed.
I'm currently building a little admin platform with statistics and graphs using Highcharts and Highstock. Right now the graphs fetch the data from the database every time they load. But since the data will grow substantially in the future, this is very inefficient and slows the database down. My question is what the best approach would be to store or precompute the data so the graphs don't have to fetch it from the database every time they load.
If the amount of data is going to be really huge and you can sacrifice some real-time-ness in presenting it, the best way would be to compute the data for the charts into a separate database table and draw the charts from it. You can set up a background process (using whenever or delayed_job or whatever you like) to periodically update the pre-processed data table with fresh values.
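For example, the pre-processed table could be a simple daily roll-up that the background job refreshes (names are invented; date arithmetic shown PostgreSQL-style):

```sql
-- Pre-aggregated table the charts read from (hypothetical schema).
CREATE TABLE daily_stats (
  day          DATE PRIMARY KEY,
  events_count INTEGER NOT NULL,
  revenue      DECIMAL(12,2)
);

-- Refresh step run by the periodic job: recompute the last few days from the raw table.
DELETE FROM daily_stats WHERE day >= CURRENT_DATE - 7;

INSERT INTO daily_stats (day, events_count, revenue)
SELECT CAST(created_at AS DATE), COUNT(*), SUM(amount)
FROM events
WHERE created_at >= CURRENT_DATE - 7
GROUP BY CAST(created_at AS DATE);
```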
Another option would be caching the chart response by any means you like (using built-in Rails caching, writing your custom cache, etc.) to deliver the same data to a large number of users with a reduced DB hit.
However, in general, preprocessing seems to be the winner, as stats tables usually contain very "sparse" data which can be pre-computed into a much smaller set to display on a chart, while still keeping the option to apply some filtering/sorting if needed.
EDIT: I just forgot to mention that there might be some room for optimization on the database side. For instance, if you can limit the periods the user is able to view data for (i.e., cap the amount to be queried per user), then proper indexing and DB setup can deliver substantial performance without man-in-the-middle things like caching or precomputing.
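As a sketch of that last point (again with invented names): if users can only chart, say, the last 90 days, an index on the timestamp column lets the chart query touch a bounded range instead of scanning the whole table.

```sql
-- Index the column used to restrict the charted period.
CREATE INDEX index_events_on_created_at ON events (created_at);

-- A typical chart query then only needs the recent, indexed range (PostgreSQL-style dates).
SELECT CAST(created_at AS DATE) AS day, COUNT(*) AS events_count
FROM events
WHERE created_at >= CURRENT_DATE - 90
GROUP BY CAST(created_at AS DATE)
ORDER BY day;
```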