How can we load a fact table in a star schema using Informatica PowerCenter? Can you please provide an example of the mappings/transformations for this?
To load a fact table when the star schema's dimension tables are independent, put a lookup on every dimension the fact references. Override each lookup query to return only active records, match on the natural key (the business key held in the dimension), and from that lookup take the surrogate key that was generated artificially when the dimension table was loaded. Bring those surrogate keys through, along with whatever fields you want to load into the fact table.
Take the staging tables as source tables, use the dimensions as lookups, and then load the data into the fact table.
eg. http://www.folkstalk.com/2012/11/how-to-load-rows-into-fact-table-in.html
I was not able to find an example when I was learning, so I am adding this screenshot as a reference for new learners.
The mapping basically looks up each of the dimension tables and loads the dimension keys into the fact as foreign keys; the rest of the active records come from the Source Qualifier (SQ). I used a SQL override to perform all the joins and conditions required for loading the fact records.
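To make the lookup-override idea above concrete, here is a minimal sketch of what such an override query might look like; the table and column names (dim_customer, customer_sk, customer_code, active_flag) are hypothetical, not from the original post:

-- Lookup override on one dimension: return only active rows,
-- keyed on the natural (business) key so the surrogate key can be fetched.
SELECT dim_customer.customer_sk   AS customer_sk,
       dim_customer.customer_code AS customer_code
FROM   dim_customer
WHERE  dim_customer.active_flag = 'Y'

In the mapping, the lookup condition then compares customer_code coming from the Source Qualifier with customer_code returned by the lookup, and customer_sk is the value written to the fact target.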
Here's the situation: in the source database we have more than 600K active rows for a dimension, but in reality the business only uses 100 of them.
Unfortunately the list of values that they might use is not known and we can't manually filter on those values to populate the dimension table.
I was thinking: what if I include the dimension columns for that table in the fact table, and then, when we send that to the staging area, just separate it from the fact and send it to its own table?
This way, I will only capture the values that are actually used.
P.S. They have a search function in the application that helps users navigate through the 600K values; it's not a drop-down field!
Do you have a better recommendation?
Yes - you could build the Dimension from the fact staging table. A couple of things to consider:
If the only attribute for the dimension is the field in the fact staging table, then you can keep this as a degenerate dimension in the fact table; there is no need to build a dimension table for it - unless you have other requirements that call for a standalone dimension table, such as your BI tool needing one.
If there are other attributes you need to include in the dimension, then you will still need to bring in the source dimension table - but you can filter it using the values in the fact staging table and only load the used values into your dimension.
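As a rough sketch of that second option (all names here - src_product, stg_fact_sales, dim_product - are assumptions for illustration), the dimension load is filtered by the natural keys that actually occur in the fact staging table:

-- Load only the dimension members referenced by the fact staging data.
INSERT INTO dim_product (product_code, product_name, category)
SELECT DISTINCT s.product_code, s.product_name, s.category
FROM   src_product s
WHERE  s.product_code IN (SELECT f.product_code FROM stg_fact_sales f)
AND    NOT EXISTS (SELECT 1 FROM dim_product d
                   WHERE  d.product_code = s.product_code);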
I have a "catalog" that I am trying to display information on. This information will be pulled from a few different tables that a user will be able to set a preference to hide a record from the respective table on their "catalog". I am running a Postgres database
So, my question is:
Would it be better (performance-wise) to create a new table (table_a_to_catalog) that stores the table_a_id and the catalog_id for each record from table_a that the user wants to hide for that catalog, then have another table (table_b_to_catalog) to hold that connection... and so on?
OR
Would it be better to store the hide preference as a JSON value in the record of the catalog? So it would be something like {"table_a" => [id1, id2, id3], "table_b" => [id1, id2, id3]}
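For reference, a minimal sketch of the first alternative, using the table_a_to_catalog name from the question (the column types and the anti-join query are assumptions):

-- One mapping table per source table whose records can be hidden.
CREATE TABLE table_a_to_catalog (
    catalog_id bigint NOT NULL REFERENCES catalog (id),
    table_a_id bigint NOT NULL REFERENCES table_a (id),
    PRIMARY KEY (catalog_id, table_a_id)
);

-- Showing table_a for one catalog then becomes an anti-join:
SELECT a.*
FROM   table_a a
WHERE  NOT EXISTS (SELECT 1
                   FROM   table_a_to_catalog h
                   WHERE  h.table_a_id = a.id
                   AND    h.catalog_id = 42);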
It really depends on the use case for this catalog. If the information is read-only and you are running a job once a day to update the catalog, then JSON would be better. However, if you want to update information on the catalog live and allow it to be editable, then having a separate table would be best.
As a personal preference, I think keeping the data in a table allows more flexibility when you want to use it for other features.
Having very large tables negatively impacts performance. Keeping "hide" view data in a Postgres table means having a DB entry for each hidden entry in each catalog. Each client application will need to filter that table for the information relevant to its user, and with many users this could take considerable time.
If one simply adds a field to the user table containing an hstore, JSON or CSV of view data (e.g. hide preferences), that will reduce the initial load time marginally. JSON would make more sense if "hiding" means simply not displaying it client-side, whereas hstore makes more sense if you wish to not send the data to the client in the first place.
I say marginally because many other factors (caching) will impact performance more than this. You may want to look into using Redis for the application runtime and Postgres for data warehousing.
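As a rough sketch of that "single field on the user table" approach, assuming a PostgreSQL jsonb column (plain json or hstore would work similarly); the table and column names are illustrative only:

-- Hide preferences kept directly on the user row.
ALTER TABLE users ADD COLUMN hide_prefs jsonb NOT NULL DEFAULT '{}'::jsonb;

-- Example value: which ids from each source table this user has hidden.
UPDATE users
SET    hide_prefs = '{"table_a": [1, 2, 3], "table_b": [7]}'::jsonb
WHERE  id = 42;

-- The application reads the lists once and filters the hidden records.
SELECT hide_prefs -> 'table_a' AS hidden_table_a_ids
FROM   users
WHERE  id = 42;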
First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a bridge table; it is the recommended solution for a many-to-many relationship between a fact and a dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
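In outline, the bridge pattern from that link looks like this (the names dim_feature, bridge_feature_group and fact_event are hypothetical): the fact row carries a group key, and the bridge maps that group to the individual dimension members.

-- Dimension of individual features.
CREATE TABLE dim_feature (
    feature_key  bigint PRIMARY KEY,
    feature_name text
);

-- Bridge: one row per (group, feature) pair.
CREATE TABLE bridge_feature_group (
    feature_group_key bigint NOT NULL,
    feature_key       bigint NOT NULL REFERENCES dim_feature (feature_key),
    PRIMARY KEY (feature_group_key, feature_key)
);

-- Fact references the group, not a single feature.
CREATE TABLE fact_event (
    event_id          bigint PRIMARY KEY,
    feature_group_key bigint NOT NULL,
    metric1           numeric
);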
Edit after example added to question:
OK, maybe it is not a bridge; the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id, time_id, search_id, criteria_id, criteria_value)
Thinking about the types of query I might want to run against search data, this design is my best choice. The only issue I see is with the data type of criteria_value: it would have to be a choice/text value, and would definitely be non-additive.
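For example, with the grain at criteria level, a question like "how often is each criterion used per day?" becomes a straightforward aggregate; dim_criteria and its columns below are assumed for illustration:

-- Count the searches using each criterion per day.
SELECT f.date_id,
       c.criteria_name,
       COUNT(DISTINCT f.search_id) AS searches_using_criterion
FROM   fact_search f
JOIN   dim_criteria c ON c.criteria_id = f.criteria_id
GROUP BY f.date_id, c.criteria_name
ORDER BY f.date_id, searches_using_criterion DESC;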
We've got a data warehouse design with four dimension tables and one fact table:
dimUser (id, email, firstName, lastName)
dimAddress (id, city)
dimLanguage (id, language)
dimDate (id, startDate, endDate)
factStatistic (id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount)
Our problem: we want to build the fact table, which involves calculating the statistics (per userId and date range) and filling in the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs' load() step we do bulk inserts with INSERT IGNORE to eliminate duplicates, so we don't know which surrogate keys were generated
if we create metadata (a set of dimension_name, surrogate_key, natural_key), this will not work either, because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, and that is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower, to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys in a mapping table, or in the dimension as an attribute, then for each row to load into the fact it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on the source record that you're loading into the fact.
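A minimal sketch of that join-based lookup for the fact load, in MySQL syntax; the staging table stg_user_activity and the dimUser attribute columns (sourceUserId, currentAddressId, currentLanguageId) are assumptions, not part of the original schema:

-- Resolve surrogate keys via the business key, and build the date key
-- in YYYYMMDD form straight from the source record.
INSERT INTO factStatistic
       (dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount)
SELECT u.id,
       u.currentAddressId,
       u.currentLanguageId,
       DATE_FORMAT(s.activityDate, '%Y%m%d'),
       s.loginCount,
       s.pageCalledCount
FROM   stg_user_activity s
JOIN   dimUser u ON u.sourceUserId = s.userId;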
Do not force everything into a single query; try to load the data with separate queries and combine it in some provider...
I am currently building a Rails app where there is a "documents" data table that stores references to pdfs living on an S3 server. These documents could have 100 different types. Each type can have up to 20 attributes or meta info.
My dilemma is: do I make 100 relational tables, one for every doc type, or just create one key/value data table with a reference to the doc_id?
My gut tells me to go key/value for flexibility for searching and supporting more and more document types over time without having to create new migrations. However, I know there are pitfalls with this technique. My first concern of course is the size of the table. The key/value table could end up with millions of rows.
On the other hand, having 100 attribute tables would be a nightmare to query against in a full-text search situation.
So the bottom line is: by going with key/value, is performance on a 3-column Postgres table with potentially millions of rows a scaling problem? Also, what about joins on the value field?
This data would almost never change by the way. So it would be 90% reads.
Consider a single table with an hstore column. It is a PostgreSQL data type designed for storing key/value pairs.
http://www.postgresql.org/docs/9.1/static/hstore.html
There are also multiple Ruby gems that add hstore support to ActiveRecord. Here is one that I wrote: https://github.com/JackC/surus. You can search RubyGems for about a dozen more alternatives as well.
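As a rough illustration of the single-table hstore approach (the documents table layout below is an assumption, not from the original post):

-- hstore ships as a contrib extension (PostgreSQL 9.1+).
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE documents (
    id         bigserial PRIMARY KEY,
    doc_type   text NOT NULL,
    s3_key     text NOT NULL,
    attributes hstore NOT NULL DEFAULT ''::hstore
);

-- A GIN index keeps key/value lookups fast even with millions of rows.
CREATE INDEX documents_attributes_idx ON documents USING gin (attributes);

-- Example: contracts whose "status" attribute is "signed".
SELECT id, s3_key
FROM   documents
WHERE  doc_type = 'contract'
AND    attributes @> 'status => signed'::hstore;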