I am a beginner at data warehousing. We have created a data mart with a star schema design to load quarterly data. We have been loading the current data as and when it is approved by the business for that quarter.
Now we have a requirement to go back and load historical data (3 years' worth, which is around 40GB). The dimensions for loading this data are the same ones used for the quarterly load. However, can we load this historical data into the same fact table, or do we have to create a duplicate fact table for the historical data alone? Is there a DW standard for this? I am trying to find the standard way to do it.
The current fact table is date partitioned on load_cycle_date which specifies the quarter the data was loaded for.
Thanks much!
I don't see why historical data with older load_cycle_dates wouldn't fit in your existing table. This assumes you're able to transform it into that format, which depends on how much the data structures have changed over the years.
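For example, if the fact table were range partitioned in PostgreSQL, loading one historical quarter could look roughly like this. The table, column, and staging names are my assumptions, not from your system; adapt the partitioning syntax to your platform:

```sql
-- Hypothetical sketch: add a partition for a historical quarter,
-- then load the transformed historical rows into the same fact table.
CREATE TABLE fact_sales_2019q1
    PARTITION OF fact_sales
    FOR VALUES FROM ('2019-01-01') TO ('2019-04-01');

INSERT INTO fact_sales (load_cycle_date, date_key, client_key, amount)
SELECT DATE '2019-01-01',             -- older load_cycle_date for that quarter
       s.date_key, s.client_key, s.amount
FROM   staging_historical_sales s     -- hypothetical staging table
WHERE  s.source_quarter = '2019Q1';
```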
There are other areas you need to look into:
Do you have adequate historical values for all your dimensions? Example: Client Rating. There may be clients who ended up with a "Bad" rating, but that wasn't the case previously. There would need to be records for each change. The alternative would be to pull the data from backups.
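If you do need a record for each change in a dimension, a Type 2 slowly changing dimension is the usual pattern. A minimal sketch with hypothetical names (not from your schema):

```sql
-- Hypothetical Type 2 slowly changing dimension for client rating.
CREATE TABLE dim_client (
    client_key     bigserial PRIMARY KEY,   -- surrogate key referenced by the fact table
    client_id      int     NOT NULL,        -- natural/business key
    client_name    text    NOT NULL,
    client_rating  text    NOT NULL,        -- e.g. 'Good', 'Bad'
    effective_from date    NOT NULL,
    effective_to   date    NOT NULL DEFAULT DATE '9999-12-31',
    is_current     boolean NOT NULL DEFAULT true
);

-- A client whose rating changed gets one row per rating period;
-- historical facts join to the row whose date range covers the fact date.
```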
Approval process - often a lot of data discrepancies aren't discovered until this process starts. As a result, there may have been changes made to the application to correct them, so you may find that some reports run against this pre-warehouse data will not be accurate.
There's no reason you shouldn't be able to do this for one quarter and test it out; it's the only way you'll know for sure. The current data warehouse I work with went through the same process of adding data from before the warehouse was started. Conversions like this are very common.
My Rails app allows users to setup a data feed (typically a REST API), and pulls in results at specific intervals to allow the user to later filter/sort/chart/export the data. An example could be pulling a stock price every 15 minutes and saving its value and a timestamp as a row in a table.
Since there could be many users with many feeds setup, I'm trying to determine the best way to handle all of this data in Rails.
I feel like I should stay away from one large mega table with a feed_id on each row since there could be millions and millions of rows very quickly (50 users with 5 feeds running every 15 minutes would be 25,000 rows per day). Will this get unwieldy too quickly or am I underestimating the power of Rails/Postgres? What is the limit?
Another option I came up with was giving each feed its own table – create a table when the feed is added and save the data there. In discussions I've read it seems like dynamic table creation is frowned upon except in special circumstances and I'm wondering if this one fits the mold.
The last option would be adding a second database - potentially NoSQL like MongoDB. I'd rather keep everything in one DB if possible but if that really will yield the best performance and reliability I'd give it a go.
I would love to hear people's experience and opinions on tackling something like this with Rails.
25,000 rows per day makes about 10 million per year. In this case you're well within the limits of PostgreSQL for many years. Stock prices are mostly numeric, so if I were you, I'd have a simple SQL table for all this data. Just avoid extra-long rows (long text values) and you should be fine.
In the future you could further extend your solution with partitioning (e.g. monthly or yearly) or move older data to an archive.
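As a rough sketch of what that could look like, assuming PostgreSQL 11 or later (declarative partitioning) and table/column names of my own invention:

```sql
-- One row per sample, partitioned by month so old data is easy to archive or drop.
CREATE TABLE feed_values (
    feed_id     int         NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       numeric     NOT NULL
) PARTITION BY RANGE (recorded_at);

-- Partitions are created ahead of time (or from a scheduled job).
CREATE TABLE feed_values_2024_01
    PARTITION OF feed_values
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Typical access pattern: latest values for one feed.
CREATE INDEX ON feed_values (feed_id, recorded_at DESC);
```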
I am building an iOS application that will randomly generate sentences (think Mad Libs) where the data used for generation is spread across multiple tables. This will be used to generate scenarios for training lifeguards. Each table contains an item name, the words that will be used when it is selected, and different values that determine what can go together.
Using two of the 10 tables shown above, the application may pick a location of Deep Water. Then it needs to pick an appropriate activity for in the water, such as Breath holding, but not Running.
I have been looking at Core Data for storage, but that seems geared more toward data that the user changes often, whereas users would never change the data stored here. I do want to be able to update the tables myself fairly easily. What would be the optimal solution for this? The ways I can think of are:
Some kind of SQL DB, though my tables aren't changing and don't really have relationships between them (a rough sketch follows this list).
2-D arrays written into the source code. Not pretty to work with or read, but my knowledge of regex makes converting from TSV to array fairly easy.
TSV files attached to the project. Better organized in themselves, but it would take some research on how to access them.
Some other method Apple has that I do not know about.
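For the SQL option, the data could ship as a small read-only SQLite file bundled with the app. A rough sketch with hypothetical table and column names and a deliberately simplified compatibility rule (the real tables would differ):

```sql
-- Hypothetical scenario tables; the app would only ever read these.
CREATE TABLE locations (
    location_id INTEGER PRIMARY KEY,
    name        TEXT    NOT NULL,   -- e.g. 'Deep Water'
    is_in_water INTEGER NOT NULL    -- 1 = water location, 0 = land
);

CREATE TABLE activities (
    activity_id    INTEGER PRIMARY KEY,
    name           TEXT    NOT NULL, -- e.g. 'Breath holding', 'Running'
    requires_water INTEGER NOT NULL  -- 1 = only makes sense in the water
);

-- Pick a random activity compatible with a chosen location.
SELECT a.name
FROM   activities a
JOIN   locations  l ON l.location_id = ?   -- bound to the chosen location
WHERE  a.requires_water = l.is_in_water
ORDER  BY RANDOM()
LIMIT  1;
```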
I am working on a project to implement an historian.
I can't really find a difference between an historian and a data warehouse.
Any details would be useful.
Data Historian
Data historians are groups of tables within a database that store historical information about a process or information system.
Data historians are used to keep historical data regarding a manufacturing system. This data can be changes in the state of a data point, current values, and summary data for those points. Usually this data comes from automated systems like PLCs, DCSs, or other process control systems; however, some historian data can be entered by humans.
There are several historians available for commercial use; however, the most common historians have tended to be custom developed. The commercial versions would be products like OsiSoft’s PI or GE’s Data Historian.
Some examples of data that could be stored in a data historian are items (or tags) like:
- Total products manufactured for the day
- Total defects created on a particular crew shift
- Current temperature of a motor on the production line
- Set point for the maximum allowable value being monitored by another tag
- Current speed of a conveyor
- Maximum flow rate of a pump over a period of time
- Human-entered marker showing a manual event occurred
- Total amount of a chemical added to a tank
These items are some of the important data tags that might be captured. However, once captured, the next step is the presentation or reporting of that data, and this is where the work of analysis is of great importance. The date/time stamp of one tag can have a strong correlation to other tags, so carefully storing this in the historian's database is critical to good reporting.
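As a rough illustration of how such tag data is commonly laid out (generic table and column names of my own choosing, not from any particular historian product):

```sql
-- Tag definitions: one row per measured point.
CREATE TABLE tags (
    tag_id      bigint PRIMARY KEY,
    tag_name    text NOT NULL,        -- e.g. 'Line3.Conveyor.Speed'
    units       text,                 -- e.g. 'm/s', 'degC'
    description text
);

-- Time-stamped values for each tag.
CREATE TABLE tag_values (
    tag_id     bigint      NOT NULL REFERENCES tags(tag_id),
    sampled_at timestamptz NOT NULL,  -- the date/time stamp discussed above
    value      numeric,
    quality    smallint,              -- e.g. good/bad/uncertain from the control system
    PRIMARY KEY (tag_id, sampled_at)
);
```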
The retrieval of data stored in a data historian is often the slowest part of the system to be implemented. Many companies do a great job of putting data into a historian, but then never go back and retrieve any of it. Many times this author has gone into a site that claims to have a historian, only to find that the data is “in there somewhere” but no report has ever been run against it to validate its accuracy.
The rule of thumb should be to provide feedback on any tag as soon as possible after it is stored in the historian. Reporting on the first few entries of a newly added tag is important, but ongoing review matters too. Once the data is incorporated into both a detailed listing and a summarized list, it can be reviewed for accuracy by operations personnel on a regular basis.
This regular review process by the operational personnel is very important. The finest data gathering systems that might historically archive millions of data points will be of little value to anyone if the data is not reviewed for accuracy by those that are experts in that information.
Data Warehouse
Data warehousing combines data from multiple, usually varied, sources into one comprehensive and easily manipulated database. Different methods can then be used by a company or organization to access this data for a wide range of purposes. Analysis can be performed to determine trends over time and to create plans based on this information. Smaller companies often use more limited formats to analyze more precise or smaller data sets, though warehousing can also utilize these methods.
Accessing Data Through Warehousing
Common methods for accessing data warehousing systems include queries, reporting, and analysis. Because warehousing creates one database, the number of sources can be nearly limitless, provided the system can handle the volume. The final result, however, is homogeneous data, which can be more easily manipulated. Queries are used to pull information from the warehouse, and the results feed reports for analysis.
Uses for Data Warehouses
Companies commonly use data warehousing to analyze trends over time. They might use it to view day-to-day operations, but its primary function is often strategic planning based on long-term data overviews. From such reports, companies make business models, forecasts, and other projections. Because the data stored in data warehouses is intended for this kind of overview-style reporting, it is routinely read-only.
We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for a user to move from version to version; these versions are essentially "snapshots" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications through an API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know whether this kind of data separation makes sense. If it does, how should we set up communication between the applications and the individual services, and between the services themselves, since this will also be required?
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution (a rough schema sketch follows the three models below). However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. In this model, rows are inserted; metadata used by some queries is updated only when absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.
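A minimal PostgreSQL sketch of the first two models, using hypothetical financial table names (the auxiliary model would simply be ordinary small tables alongside these):

```sql
-- Event model: append-only journal; rows are inserted, never updated.
CREATE TABLE journal_entries (
    entry_id    bigserial PRIMARY KEY,
    entry_date  date      NOT NULL,
    description text
);

CREATE TABLE journal_lines (
    line_id    bigserial PRIMARY KEY,
    entry_id   bigint        NOT NULL REFERENCES journal_entries(entry_id),
    account_id int           NOT NULL,
    amount     numeric(14,2) NOT NULL
);

-- Aggregate closing model: append-only closing balances per period.
-- A current balance = last closing balance + journal lines after that period end.
CREATE TABLE closing_balances (
    account_id int           NOT NULL,
    period_end date          NOT NULL,
    balance    numeric(14,2) NOT NULL,
    PRIMARY KEY (account_id, period_end)
);

-- Partial index covering only the open period (recreated after each close),
-- so "roll forward from the last close" queries stay cheap.
CREATE INDEX journal_entries_open_period_idx
    ON journal_entries (entry_date)
    WHERE entry_date >= DATE '2024-01-01';  -- hypothetical start of the open period
```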
I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored proc. which returns order history is painfully slow due to the amount of data + the numerous joins which must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut - perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text search and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you organize the dimensions into tables in a way that lets them grow only slowly over time. The choice of dimensions is completely practical and based on the behavior of the data.
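For the order-history case, a sketch might look like this, assuming the grain is one order line per row; all names here are illustrative, not taken from your schema:

```sql
-- Conformed dimensions plus a narrow fact table at the order-line grain.
CREATE TABLE dim_date (
    date_key  int  PRIMARY KEY,    -- e.g. 20240131
    full_date date NOT NULL,
    year      int  NOT NULL,
    month     int  NOT NULL
);

CREATE TABLE dim_customer (
    customer_key bigint PRIMARY KEY,  -- surrogate key
    customer_id  bigint NOT NULL,     -- natural key from the OLTP Customer table
    name         text,
    city         text
);

CREATE TABLE fact_order_item (
    date_key     int    NOT NULL REFERENCES dim_date(date_key),
    customer_key bigint NOT NULL REFERENCES dim_customer(customer_key),
    product_key  bigint NOT NULL,
    quantity     int    NOT NULL,
    net_amount   numeric(12,2) NOT NULL
);
```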
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to repopulate a reporting database from scratch several times a day (no history, just rebuilding a different model of the same data from the 3NF source). There are also real-time data warehousing techniques which apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
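A rough sketch of that "repopulate from scratch" style, assuming PostgreSQL, a pre-created denormalized reporting table, and hypothetical source/target names, run a few times a day from a scheduled job:

```sql
BEGIN;
-- Throw away the previous copy and rebuild the denormalized reporting table
-- from the normalized (3NF) source tables.
TRUNCATE reporting_order_history;

INSERT INTO reporting_order_history
       (order_id, order_date, customer_name, item_count, order_total)
SELECT o.order_id,
       o.order_date,
       c.name,
       COUNT(oi.order_item_id),
       SUM(oi.unit_price * oi.quantity)
FROM   orders o
JOIN   customers   c  ON c.customer_id = o.customer_id
JOIN   order_items oi ON oi.order_id   = o.order_id
GROUP  BY o.order_id, o.order_date, c.name;
COMMIT;
```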
Materialized views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregate data. Since you didn't mention any specifics (platform, server specs, number of rows, number of hits/second, etc.), I can't really help much more than that.
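As a rough Oracle sketch (table and column names are my assumptions, and a fast-refreshable aggregate has prerequisites beyond what is shown, so treat this as a starting point rather than a finished design):

```sql
-- Materialized view log so the view can be refreshed incrementally.
CREATE MATERIALIZED VIEW LOG ON orders
    WITH SEQUENCE, ROWID (customer_id, order_date, total_amount)
    INCLUDING NEW VALUES;

-- Aggregate kept in sync on commit; reports read the view instead of joining live tables.
CREATE MATERIALIZED VIEW mv_order_history
    BUILD IMMEDIATE
    REFRESH FAST ON COMMIT
AS
SELECT customer_id,
       order_date,
       COUNT(*)            AS order_count,
       COUNT(total_amount) AS amount_count,   -- required for fast refresh of the SUM
       SUM(total_amount)   AS total_spent
FROM   orders
GROUP  BY customer_id, order_date;
```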
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.