Let's say there is a data warehouse created from a shop data. A fact is a single purchase of a product. There is a dimension that describes a customer.
There is a need to create a measure that stores a number of distinct customers. Can this measure be created based on customer identifier in the dimension table, or it needs to be fact table? In which cases one or the other solution is better?
Below I post a visualization based on an AdventureWorks2016 database:
I have a college assignment requirement to built a Data warehouse for product Inventory management which can help inventory management understand in-hand value and using historical data they can predict when to bring new inventory. I have been reading to find out best way to do it using Cubes or Data mart. My question here is do I have to create a Data warehouse first and on top of that built Cube, Data mart or I can directly extract transactional data into Cube/Data Mart.
Next, Is it mandatory to built a Star Schema(or other DW schema) for doing this assignment as after reading multiple articles my understanding is OLAP cube can have multiple facts surrounded by Dimensions.
Your question is far bigger than you know!
As a general principle, you would have a staging database(s) which lands the data from one or more OLTP systems. then the staging database(s) would feed data to a datawarehouse (DWH). On top of a DWH would be built a number of Marts, these typically are subject area specific.
There are several DWH methodologies
Kimball Star Schema - you mention star schema above, this broadly is Kimball Star Schema. Proposed by Ralph Kimball. Also I would include here Snowflake Schemas, which are a variation on Star Schemas.
Inmon Model - Proposed by Bill Inmon
Data Vault - proposed by Dan Linstedt. Has a large user base in the Benelux countries. There are variations on the Data Vault.
It's important not to get confused between a DWH methodology and the technology to implement a DWH, though sometimes there are some technologies that lend themselves to particular methodologies. For example OLAP cubes work easily with Kimball star schemas. There is no particular need to use a relational technology for particular databases. Some NoSQL databases (like Cassandra) lend themselves to staging databases well.
To answer your specific questions
Do I have to create a Data warehouse first and on
top of that built Cube, Data mart or I can directly extract
transactional data into Cube/Data Mart.
OLAP Cubes are optional if you have a specific Mart that is tailored to your reporting but it depends on your reporting and analysis requirements and the speed of access.
A Data Mart could actually be built only using an OLAP cube, coming straight from the DWH.
Specifically on inventory management, all of these DWH methodologies would be suitable.
I can't answer your last question, as that seems to be the point of the assignment and you havn't given enough information to answer the question, but you need to do some research into dimensional modelling, so I hope this has pointed you in the right direction!
The answer is yes, a star model will always help a better analysis, but it is relational, a cube is multidimensional (where it performs all data crossings) and often uses as a data source to star models (recommended).
OLAP cubes are generally used for fast analysis and summaries of data.
So, by standard, I recommend you make all the star models you need and then generate the OLAP cubes for your analysis.
AS this is a 'homework' question, I would guess that the lecturer is looking for pros/cons between Kimball and Inmon which are the two 'default' designs for end-user reporting. In the real world DataVault can also be applied as part of the DWH strategy but it plays a different purpose and is not recommended for end-user consumption.
DataVault is a design pattern to bring data in from source systems unmolested. Data will inevitably need to be cleaned before being presented to the end-user solution and DV allows the DWH ETL process to be re-run if any issues are found or the business requirements change, especially if the granularity level goes down (e.g. the original fact table was for sales and the dimension requirements were for salesman and product category, now they want fact-sales by sales round and salesman for product subcategory and category. Without DV you do not have the granular data to replay the historical information and rebuild the DWH)
I am really confused between the definition of ROLAP and a Data warehouse. When we load aggregate data in relational tables can we call this ROLAP? Or is ROLAP a reporting tool?
Data warehouse: Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence.
Many types of business data are analyzed via data warehouses. The need for a data warehouse often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporary fixed state. This is often untenable for transactional databases.
A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions. The other benefits of a data warehouse are the ability to analyze data from multiple sources and to negotiate differences in storage schema using the ETL process.
ROLAP: Cubes in a data warehouse are stored in three different modes. A relational storage model is called Relational Online Analytical Processing mode or ROLAP, while a Multidimensional Online Analytical processing mode is called MOLAP. When dimensions are stored in a combination of the two modes then it is known as Hybrid Online Analytical Processing mode or HOLAP.
The advantages of ROLAP model is it can handle a large amount of data and can leverage all the functionalities of the relational database. The disadvantages are that the performance is slow and each ROLAP report is an SQL query with all the limitations of the genre. It is also limited by SQL functionalities. ROLAP vendors have tried to mitigate this problem by building into the tool out-of-the-box complex functions as well as providing the users with an ability to define their own functions.
A datawarehouse mainly focuses on the structure and organization of the data whereas ROLAP (or an OLAP) concentrates on the usage of the data. A data warehouse mainly serves as a repository to store data (historical) that can be used for analysis. An OLAP is the processing that can be used to analyze and evaluate data that is stored in a warehouse.
I'm a relatively noob programmer. I am creating web based GIS tool where users can upload custom datasets ranging from 10 rows to 1million. The datasets can have variable columns and datatypes. How do you manage these user submitted datasets?
Is the creation of a table per dataset a bad idea? (BTW - i'll be using postgresql as the database).
My apologies if this is already answered somewhere, but my search did not turn up any good results. I may be using bad keywords in my search.
Thanks!
creating a table per dataset is not a 'bad' idea at all. swivel.com was a very similar app to what you are describing and we used table per dataset and it worked very well for graph generation on user uploaded datasets and comparing data across datasets using joins. we had over 10k datasets and close to a million graphs and some datasets were very large.
you also get lots of free usage out of your orm layer, for instance we could use active record for working with a dataset (each dataset is a generated model class with its table set to the actual table)
pitfall wise is you gotta do a LOT of joins if you have any kind of cross dataset calculations.
My coworkers and I recently tackled a similar problem where we had a poor data model in MySQL and were looking for better ways to implement it. We weighed a few different options, including MongoDB, and ended up using the entity attribute value model. The EAV model is essentially a 3-column model. It allowed us to a single model to represent a variable number of columns and data types.
You can read a little about our problem here, but it sounds like it might be a good fit for you too.
The terms are used all over the place, and I don't know of crisp definitions. I'm pretty sure I know what a data mart is. And I've created reporting cubes with tools like Business Objects and Cognos.
I've also had folks tell me that a datamart is more than just a collection of cubes.
I've also had people tell me that a datamart is a reporting cube, nothing more.
What are the distinctions you understand?
Cube can (and arguably should) mean something quite specific - OLAP artifacts presented through an OLAP server such as MS Analysis Services or Oracle (nee Hyperion) Essbase. However, it also gets used much more loosely. OLAP cubes of this sort use cube-aware query tools which use a different API to a standard relational database. Typically OLAP servers maintain their own optimised data structures (known as MOLAP), although they can be implemented as a front-end to a relational data source (known as ROLAP) or in various hybrid modes (known as HOLAP)
I try to be specific and use 'cube' specifically to refer to cubes on OLAP servers such as SSAS.
Business Objects works by querying data through one or more sources (which could be relational databases, OLAP cubes, or flat files) and creating an in-memory data structure called a MicroCube which it uses to support interactive slice-and-dice activities. Analysis Services and MSQuery can make a cube (.cub) file which can be opened by the AS client software or Excel and sliced-and-diced in a similar manner. IIRC Recent versions of Business Objects can also open .cub files.
To be pedantic I think Business Objects sits in a 'semi-structured reporting' space somewhere between a true OLAP system such as ProClarity and ad-hoc reporting tool such as Report Builder, Oracle Discoverer or Brio. Round trips to the Query Panel make it somewhat clunky as a pure stream-of-thought OLAP tool but it does offer a level of interactivity that traditional reports don't. I see the sweet spot of Business Objects as sitting in two places: ad-hoc reporting by staff not necessarily familiar with SQL and provding a scheduled report delivered in an interactive format that allows some drill-down into the data.
'Data Mart' is also a fairly loosely used term and can mean any user-facing data access medium for a data warehouse system. The definition may or may not include the reporting tools and metadata layers, reporting layer tables or other items such as Cubes or other analytic systems.
I tend to think of a data mart as the database from which the reporting is done, particularly if it is a readily definable subsystem of the overall data warehouse architecture. However it is quite reasonable to think of it as the user facing reporting layer, particularly if there are ad-hoc reporting tools such as Business Objects or OLAP systems that allow end-users to get at the data directly.
The term "data mart" has become somewhat ambiguous, but it is traditionally associated with a subject-oriented subset of an organization's information systems. Data mart does not explicitly imply the presence of a multi-dimensional technology such as OLAP and data mart does not explicitly imply the presence of summarized numerical data.
A cube, on the other hand, tends to imply that data is presented using a multi-dimensional nomenclature (typically an OLAP technology) and that the data is generally summarized as intersections of multiple hierarchies. (i.e. the net worth of your family vs. your personal net worth and everything in between) Generally, “cube” implies something very specific whereas “data mart” tends to be a little more general.
I suppose in OOP speak you could accurately say that a data mart “has-a” cube, “has-a” relational database, “has-a” nifty reporting interface, etc… but it would be less correct to say that any one of those individually “is-a” data mart. The term data mart is more inclusive.
Data mart is a collection of data of a specific business process. It is irrelevant how the data is stored. A cube stores data in a special way, multiple-dimension, unlike a table with row and column. A cube in a olap database is like a table to traditional database. A data mart can have tables or cubes. Cubes make the analysis faster because it pre-calculates aggregations ahead of time.
As the name suggests, a cube is a structured multidimensional data-set, (typically three dimensions each representing three sides of a cube). A data mart is just a container and not a structure by itself, although it contains data-sets flatly organized (as tables) in dimensions and facts.
The structure of a cube makes it easy to visualize or conceptualize data along various dimensions of a cube. Thus most business analysts or developers find it easy to query and interact with the cube.
Since a data mart is just a container with a bunch of tables; users need to first conceptualize and understand dimensional structures before querying and analyzing data.
Data mart traditionally has meant static data, usually date/time oriented, used by analysts for statistics, budgeting, performance and sales reporting, and other planning activities.
A Cube is an OLAP database that pretty exhaustively converts OLTP data into a static, date/time-oriented schema that uses a query language that is not SQL, but built specifically for answering data mart type questions. It uses terms like measures, dimensions, star-schema, etc. rather than tables, columns, and rows. The best familiar analogy might be pivot-tables in a spreadsheet.
Remember:
Data Warehousing is the process of taking data from legacy and transaction database systems and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision making.
A Data Warehouse is a system that extracts, cleans, conforms, and delivers
source data into a dimensional data store and then supports and implements
querying and analysis for the purpose of decision making.
KIMBALL e.g. consistently has defined data mart as a process-oriented subset of the overall organization’s data based on a foundation of atomic data, and that depends only on the physics of the data-measurement events, not on the anticipated user’s questions.
Data marts are based on the source of data, not on a department’s view of data.
Data marts contain all atomic detail needed to support drilling down to the lowest level.
Data marts can be centrally controlled or decentralized.
CORRECT DEFINITION
Process based
Atomic Data Foundation
Data Measurement
MISGUIDED DEFINITION
Department Based
Aggregate Data Only
User Question Based
To me, a datamart is just place where data gets dumped in a relatively flat, unusable format.
Cube is taking that data and making it dance.
I agree with Matthew. We tend to use the term 'Data Mart' for any data source that stores generic data and mappings used across various applications in an enterprize. We don't store measurable data in a data mart, so I see a data mart as one of multiple data sources for a cube. This, however, is how we do it. I am sure there is nothing preventing you from storing measurable data in a data mart.