After searching the web for how to use ODI to extract, load and transform data into a data warehouse, I'm still confused about one step.
After loading the tables into the staging area and doing the mapping to load the data, how do we, in the second step, load the data into the data warehouse (fact and dimension tables)?
Thank you.
I currently have a tricky problem and need ideas for the most efficient way to solve it.
We periodically iterate through large CSV files (~50,000 to 2 million rows), and for each row we need to check a database table for matching columns.
So, for example, each CSV row could have details about an event (artist, venue, date/time, etc.), and for each row we check our database (PostgreSQL) for the rows that best match the artist, venue and date/time, and then perform operations if a match is found.
Currently the entire process is highly CPU-, memory- and time-intensive because we pull row by row. We already perform the matching in batches, but we're still looking for a way to make the comparison efficient, both memory-wise and time-wise.
Thanks.
Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. How to import CSV file data into a PostgreSQL table?)
Perform matching and operations in-database, i.e. in SQL
If necessary, truncate the temporary table afterwards
This would move most of the load into the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation etc.)
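For example, a minimal sketch of this approach in PostgreSQL (the table and column names staging_events, events, artist, venue and starts_at are assumptions, not your actual schema):

-- Assumed table/column names; adjust to your schema.
CREATE TEMPORARY TABLE staging_events (
    artist    text,
    venue     text,
    starts_at timestamptz
);

-- Bulk-load the CSV server-side (or use \copy from psql on the client).
COPY staging_events FROM '/tmp/events.csv' WITH (FORMAT csv, HEADER true);

-- Match staged rows against the existing table in one set-based query,
-- instead of issuing one lookup per CSV row.
SELECT s.*, e.id AS matched_event_id
FROM staging_events s
JOIN events e
  ON  e.artist    = s.artist
  AND e.venue     = s.venue
  AND e.starts_at = s.starts_at;

-- Clean up when done (temp tables are dropped at session end anyway).
TRUNCATE staging_events;

The matching then runs as a single join rather than tens of thousands of individual lookups.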
I'm trying to understand the difference between a Data Mart and a DSS (Decision Support System).
When I look for information on the Internet about DSS vs. DWH, I find statements like:
"A data warehouse is often the component that stores data for a DSS."
The problem is that, as far as I know, the DWH is also the component that stores data for a Data Mart.
So what is the difference between a DSS and a Data Mart?
Thanks in advance, Enrique
A more appropriate question would be: what is similar between a Data Mart and a DSS?
A data mart is a subject-oriented set of related tables where you have one fact table (transactions) and multiple dimension tables (categories). Example: a sales data mart with a fact table (SalesID, AgentID, CategoryID, DateID, Amount, Quantity) and a dimension table Agent (AgentID, AgentName, AgentType, etc.).
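As a rough illustration in SQL (the names FactSales and DimAgent and the column types are just for the example):

-- Dimension table: one row per agent; date and category dimensions work the same way.
CREATE TABLE DimAgent (
    AgentID   int PRIMARY KEY,
    AgentName varchar(100),
    AgentType varchar(50)
);

-- Fact table: one row per sale, referencing the dimensions.
CREATE TABLE FactSales (
    SalesID    int PRIMARY KEY,
    AgentID    int REFERENCES DimAgent (AgentID),
    CategoryID int,
    DateID     int,
    Amount     decimal(12, 2),
    Quantity   int
);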
A Data Warehouse (it is a database) is a centralised repository of aggregated data from one or more sources, intended to serve reporting purposes. It is usually denormalised. It can be based on data marts, or on one logical data model in third normal form (3NF).
A DSS is an information system; it is neither a database nor an entity. It relies on data, but it also has its own model and user interface. The model is critical for the decision recommendation engine.
What may have led to your confusion is that some DSSs are built on top of DWHs, specifically on Kimball-style (data mart based) DWHs.
I have a service that generates a large map through multiple iterations and calculations over multiple tables. My problem is that I cannot use a pagination offset to slice the data, because the data comes from multiple tables and different modifications happen to it. To display this on screen, I have to send the map with 10,000-20,000 records to the view, and that is problematic with such a large dataset.
At the moment I have on-page pagination, but this is very slow and inefficient.
One thing I thought of is to dump the data into a table and query that each time, but then I have to deal with concurrent users.
My question is: what is the best approach to displaying this list when I cannot use database slicing (offset, max)?
I am using
grails 1.0.3
datatables and jquery
Maybe SlickGrid is an option for you. One of their examples works with 50,000 rows and it seems to be fast.
Christian
I ended up writing the result of the map to a table and using data slicing on that table for pagination. It takes some time to save the data, but at least I don't have to worry about performance with the large dataset. I use a timestamp to differentiate between requests; each request is saved and retrieved with its own timestamp.
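In SQL terms the idea looks roughly like this (report_result, requested_at and row_data are invented names, and the exact pagination syntax depends on your database):

-- One row per record of one computed map, keyed by the request timestamp.
CREATE TABLE report_result (
    id           bigint PRIMARY KEY,
    requested_at timestamp NOT NULL,
    row_data     text      NOT NULL
);

-- Normal offset/max slicing now works against this table.
SELECT row_data
FROM report_result
WHERE requested_at = :requestTimestamp
ORDER BY id
LIMIT 20 OFFSET 40;   -- e.g. page 3 with a page size of 20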
I'm wondering what the best way is to handle a huge matrix in Rails 3. The matrix stores distances between points (it's symmetric).
Points can be added at any time, so the matrix could be updated frequently.
I see two ways:
storing the values in the database and getting distances through DB requests (easy, but a bit slow)
storing the values in a file and putting that file in a cache (could be hard to update)
Thoughts?
PS: I'm packaging this for a new release of my gmaps4rails gem (dedicated to making gmaps easy for Rails users).
If you have to store a single, big matrix, I would recommend doing it in a separate table (column / line / value), as sketched after this list. It will scale better than a file, and:
You can access and update individual cells more easily
You mentioned using a file to cache your matrix, but if the need arises you can also fetch the entire table to cache the matrix
You can update rows, columns and sub-matrices with well-formed queries
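A possible layout in SQL (names and types are illustrative only):

-- One row per matrix cell: (row point, column point, distance).
CREATE TABLE distances (
    point_a_id integer NOT NULL,
    point_b_id integer NOT NULL,
    distance   double precision NOT NULL,
    PRIMARY KEY (point_a_id, point_b_id)
);

-- Because the matrix is symmetric, store each pair only once (point_a_id < point_b_id)
-- and normalise lookups accordingly:
SELECT distance
FROM distances
WHERE point_a_id = LEAST(7, 42)
  AND point_b_id = GREATEST(7, 42);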
If you encounter performance problems as your matrix grows, take a look at the activerecord-import library. It will help you batch-insert data into your matrix.
I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 (SCD2) handling in the ETL, whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce a LastUpdated column into the source system (OLTP) tables. This way you have less to extract, using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte counter that is unique within the database).
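For example, in T-SQL (the Students table and the @LastEtlRun parameter are placeholders; your application or a trigger has to set LastUpdated on every change):

-- One-time schema change in the OLTP database.
ALTER TABLE Students ADD LastUpdated datetime NOT NULL DEFAULT GETDATE();

-- Extraction step: pull only the rows touched since the previous ETL run.
SELECT *
FROM Students
WHERE LastUpdated >= @LastEtlRun;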
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimensions, type 2). For an SSIS example, see here. The transformation does exactly what you are trying to do: all you have to do is specify which columns to monitor and what to do when it detects a change.
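If you ever need to hand-roll it instead, the SCD2 logic such a component implements boils down to something like the following sketch (StagingStudents, IsCurrent, ValidFrom/ValidTo and the monitored columns are assumed names):

-- 1. Expire the current dimension row when a monitored column has changed.
UPDATE d
SET d.ValidTo = GETDATE(), d.IsCurrent = 0
FROM DimStudents d
JOIN StagingStudents s ON s.StudentID = d.StudentID
WHERE d.IsCurrent = 1
  AND (d.Email <> s.Email OR d.Address <> s.Address);

-- 2. Insert a fresh current row for every student without one
--    (covers both changed and brand-new students).
INSERT INTO DimStudents (StudentID, Email, Address, ValidFrom, ValidTo, IsCurrent)
SELECT s.StudentID, s.Email, s.Address, GETDATE(), NULL, 1
FROM StagingStudents s
WHERE NOT EXISTS (
    SELECT 1 FROM DimStudents d
    WHERE d.StudentID = s.StudentID AND d.IsCurrent = 1
);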
It sounds like you are approaching this somewhat backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" data from your OLTP database
Compare ("Test") your extracted data against the dimensional data to determine if there are changes or whatever other validation needs to be performed
Insert the data ("Load") in to your dimension table.
Effectively, in step #1 you create a flattened record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine whether a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes; any RDBMS and ETL process should be able to work through that in no more than a few seconds. SQL Server has DTS (replaced by SSIS in SQL Server 2005 and later), which is a perfect tool for doing something like this.
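In SQL terms, the extract and compare steps might look roughly like this (Addresses, NotificationSettings, IsCurrent and the listed columns are all invented for illustration):

-- "Extract": flatten the normalised OLTP tables into one record per student.
SELECT s.StudentID, s.Email, a.Street, a.City, n.NotifyByEmail
INTO #StudentSnapshot                      -- temp table (SQL Server syntax)
FROM Students s
JOIN Addresses a            ON a.StudentID = s.StudentID
JOIN NotificationSettings n ON n.StudentID = s.StudentID;

-- Compare: keep only rows whose current dimension row is missing or different.
SELECT snap.*
FROM #StudentSnapshot snap
LEFT JOIN DimStudents d
  ON d.StudentID = snap.StudentID AND d.IsCurrent = 1
WHERE d.StudentID IS NULL
   OR d.Email  <> snap.Email
   OR d.Street <> snap.Street
   OR d.City   <> snap.City;

The rows returned by the second query are the ones you then load into the dimension as new records.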
Does your OLTP database have an audit trail?
If so, you can query the audit trail for just the records that have been touched since the last ETL run.