What is the difference between data mining and data minimization?
Especially in the context of the Internet of Things. Most of the information I have gathered is about the role of data minimization in the Internet of Things, but no one has discussed how the two relate.
For extracting a causal DAG from time series data, I have read some papers that use MLPs/LSTMs as well as other algorithms. For ease of use, though, I want to use the TETRAD software, but I do not understand how to input time series data, such as stock exchange or hospital emergency data, into it. For example, here is a sample of the data I am using. I am having trouble making the software understand that the data has a temporal aspect and that it needs to be modelled as a time series in order to extract the causal DAG. I also cannot find proper instructions on which algorithm to use in TETRAD for time series causal relationship extraction, or on how to model the data. As causal inference is not my primary field of study, any guidance will be helpful.
I tried using the FGES algorithm to extract the relationships, but I am not sure if that is the correct approach.
What is the difference between a data warehouse and a MOLAP server?
Is the data stored at both the data warehouse and on the MOLAP server?
When you pose a query, do you send it to the data warehouse or the MOLAP server?
With ROLAP, it kind of makes sense that the ROLAP server poses SQL queries to the data warehouse (which stores the fact and dimension tables) and then does the analysis. However, I have read somewhere that ROLAP gathers its data directly from the operational database (OLTP); but then, where/when is the data warehouse used?
The 'MOLAP' flavour of OLAP (as distinguished from 'ROLAP') is a data store in its own right, separate from the Data Warehouse.
Usually, the MOLAP server gets its data from the Data Warehouse on a regular basis. So the data does indeed reside in both.
The difference is that a MOLAP server is a specific kind of database (a cube) that precalculates totals at each level of its hierarchies, and furthermore structures the data to be even easier for users to query and navigate than a data warehouse (given the right tools at their disposal).
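To make that concrete, here's a minimal Python sketch (with invented sales figures and a hypothetical year/quarter/month hierarchy) of what a MOLAP server effectively does at load time: compute the totals once, at every level of the hierarchy, so that queries become lookups rather than scans.

```python
# A sketch of MOLAP-style pre-aggregation: totals are calculated up front
# for each level of the Date hierarchy. Data and hierarchy are made up.
import pandas as pd

# Hypothetical fact table: one row per sale
facts = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "month":   ["Jan", "Feb", "Apr", "Jan"],
    "sales":   [100.0, 150.0, 200.0, 120.0],
})

# Pre-calculate totals at each level of the hierarchy (year -> quarter -> month)
aggregates = {
    ("year",):                    facts.groupby(["year"])["sales"].sum(),
    ("year", "quarter"):          facts.groupby(["year", "quarter"])["sales"].sum(),
    ("year", "quarter", "month"): facts.groupby(["year", "quarter", "month"])["sales"].sum(),
}

# "Total sales in 2023 Q1" is now a lookup against stored totals, not a scan
print(aggregates[("year", "quarter")][(2023, "Q1")])  # 250.0
```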
Although a data warehouse may be dimensionally modelled, it is still often stored in a relational data model in an RDBMS.
Hence MOLAP cubes (or other modern alternatives) provide both performance gains and a 'semantic layer' that makes it easier to understand data stored in a data warehouse.
The user can then query the MOLAP server rather than the Data Warehouse. That doesn't stop users querying the Data Warehouse directly, if that's what your solution needs.
You're right that when the user queries a ROLAP server, it passes the queries on to the underlying database. That may be an OLTP system, but it is more often a data warehouse, because warehouses are designed with reporting performance and understandability in mind. ROLAP therefore provides the user-friendly 'semantic layer' but relies on the data warehouse for query speed.
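To illustrate the ROLAP pattern, here's a toy Python/sqlite3 sketch (the star schema and its names are invented): the ROLAP layer's whole job is to translate the user's dimensional question into SQL and run it against the warehouse at query time, rather than against pre-built aggregates.

```python
# A sketch of ROLAP: dimensional question in, SQL against the star schema out.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (1, 10.0), (1, 15.0), (2, 40.0);
""")

# "Total sales by category" becomes a join-and-group-by over fact and dimension tables
query = """
    SELECT p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
"""
for category, total in conn.execute(query):
    print(category, total)  # Books 25.0 / Games 40.0
```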
I am really confused about the difference between ROLAP and a data warehouse. When we load aggregate data into relational tables, can we call this ROLAP? Or is ROLAP a reporting tool?
Data warehouse: Data warehousing is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed for greater business intelligence.
Many types of business data are analyzed via data warehouses. The need for a data warehouse often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporary fixed state. This is often untenable for transactional databases.
A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions. The other benefits of a data warehouse are the ability to analyze data from multiple sources and to negotiate differences in storage schema using the ETL process.
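As a rough illustration of what the ETL process negotiates, here is a toy Python sketch; the two source systems, their column names, and their units are all hypothetical.

```python
# A minimal ETL sketch: two operational sources that disagree on schema and
# units are extracted, transformed to one conformed shape, and loaded.
import pandas as pd

# Extract: two hypothetical source systems
crm    = pd.DataFrame({"cust": ["Ann", "Bob"], "rev_usd": [100.0, 50.0]})
webapp = pd.DataFrame({"customer_name": ["ann", "carol"], "revenue_cents": [2500, 7500]})

# Transform: align column names, units, and casing
a = crm.rename(columns={"cust": "customer", "rev_usd": "revenue"})
b = webapp.rename(columns={"customer_name": "customer"})
b["revenue"] = b.pop("revenue_cents") / 100.0
b["customer"] = b["customer"].str.title()

# Load: one conformed table for the warehouse
warehouse = pd.concat([a, b], ignore_index=True)
print(warehouse)
```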
ROLAP: Cubes in a data warehouse are stored in three different modes. Relational storage is called Relational Online Analytical Processing (ROLAP); multidimensional storage is called Multidimensional Online Analytical Processing (MOLAP). When dimensions are stored in a combination of the two modes, it is known as Hybrid Online Analytical Processing (HOLAP).
The advantages of the ROLAP model are that it can handle large amounts of data and can leverage all the functionality of the relational database. The disadvantages are that performance is slow and that each ROLAP report is an SQL query, with all the limitations of the genre; it is also limited by SQL's functionality. ROLAP vendors have tried to mitigate this problem by building complex out-of-the-box functions into their tools, as well as by letting users define their own functions.
A data warehouse mainly focuses on the structure and organization of the data, whereas ROLAP (or OLAP generally) concentrates on the usage of the data. A data warehouse mainly serves as a repository for (historical) data that can be used for analysis. OLAP is the processing used to analyze and evaluate the data stored in the warehouse.
People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
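To show ML in that narrow sense, here's a deliberately contrived Python example: nothing but the pattern-extraction algorithm itself, with no retrieval or processing pipeline around it.

```python
# Pattern extraction and nothing else: fit a classifier, apply the pattern.
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # toy feature vectors
y = [0, 1, 1, 0]                      # labels forming an XOR pattern

model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[1, 0]]))        # [1] -- the extracted pattern, applied
```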
Of the terms you mentioned, "Machine Learning" is the one most used by universities to describe their curricula, their academic departments, and their research programs, as well as the term most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications often are, but that's not a formal requirement. In addition, the term Data Mining usually seems to refer to the application of some process flow to big data (i.e., > 2 GB) and therefore usually includes a distributed-processing (map-reduce) component near the front of that workflow.
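For flavour, here's a toy map-reduce-style word count in plain Python: map each chunk to partial counts, then reduce the partials into totals. (A real DM workflow would distribute the map step across machines; the chunks here are invented.)

```python
# Map-reduce in miniature: per-chunk partial counts, then a merge.
from collections import Counter
from functools import reduce

chunks = ["data mining data", "mining big data"]

def mapper(chunk):
    return Counter(chunk.split())      # partial counts for one chunk

def reducer(a, b):
    return a + b                       # merge two partial counts

print(reduce(reducer, map(mapper, chunks)))
# Counter({'data': 3, 'mining': 2, 'big': 1})
```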
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an infrastructure-algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval--but only one source, and IR doesn't depend on ML. For instance, a particular IR project might be the storage and rapid retrieval of fully indexed data responsive to a user's search query, the crux of which is optimizing the performance of the data flow, i.e., the round trip from query to delivery of the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps) which deliver a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
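A toy sketch of that IR concern, in Python with an invented three-document corpus: all the work goes into indexing up front, so the query-to-results round trip is a cheap lookup, and no learning happens anywhere.

```python
# Inverted index: built once at ingestion time, queried by set intersection.
docs = {
    1: "data mining discovers hidden patterns",
    2: "machine learning generalizes to new data",
    3: "information retrieval finds documents fast",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Serving a query is a set intersection, not a scan of the corpus
query = ["data", "patterns"]
hits = set.intersection(*(index.get(t, set()) for t in query))
print(hits)  # {1}
```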
Lastly, consider the Netflix Prize. This competition was directed solely at Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: the accuracy of the predictions returned by the algorithm. Imagine if the Netflix Prize were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately assess the algorithm's performance in an actual commercial setting--for instance, overall execution speed (how quickly the recommendations are delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning comprises techniques to generalize existing knowledge to new data, as accurately as possible.
Data mining is primarily about discovering something hidden in your data that you did not know before, as "new" as possible.
They intersect and often use one another's techniques. DM and IR both use index structures to accelerate processing. DM uses a lot of ML techniques; for example, a pattern in the data set that is useful for generalization might constitute new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data; the last of these is sketched below.
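Here is a toy sketch of that last intention in Python (the shopping baskets are invented): it surfaces a pattern--items frequently bought together--that was not stated anywhere in the input.

```python
# Data mining in miniature: discover frequently co-occurring items.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "beer"},
    {"beer", "chips"},
    {"bread", "milk"},
]

# Count every pair of items that appears together in a basket
pairs = Counter(p for b in baskets for p in combinations(sorted(b), 2))
print(pairs.most_common(1))  # [(('bread', 'butter'), 3)] -- "new" knowledge
```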
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used for decision making by people.
Machine learning is about learning a model to classify new objects.
The terms are used all over the place, and I don't know of crisp definitions. I'm pretty sure I know what a data mart is. And I've created reporting cubes with tools like Business Objects and Cognos.
I've had folks tell me that a data mart is more than just a collection of cubes.
I've also had people tell me that a data mart is a reporting cube, nothing more.
What are the distinctions you understand?
'Cube' can (and arguably should) mean something quite specific--OLAP artifacts presented through an OLAP server such as MS Analysis Services or Oracle (née Hyperion) Essbase. However, it also gets used much more loosely. OLAP cubes of this sort are queried with cube-aware tools which use a different API from that of a standard relational database. Typically OLAP servers maintain their own optimised data structures (known as MOLAP), although they can be implemented as a front end to a relational data source (known as ROLAP) or in various hybrid modes (known as HOLAP).
I try to be specific and use 'cube' specifically to refer to cubes on OLAP servers such as SSAS.
Business Objects works by querying data from one or more sources (which could be relational databases, OLAP cubes, or flat files) and creating an in-memory data structure called a MicroCube, which it uses to support interactive slice-and-dice activities. Analysis Services and MSQuery can make a cube (.cub) file which can be opened by the AS client software or Excel and sliced-and-diced in a similar manner. IIRC, recent versions of Business Objects can also open .cub files.
To be pedantic, I think Business Objects sits in a 'semi-structured reporting' space somewhere between a true OLAP system such as ProClarity and ad-hoc reporting tools such as Report Builder, Oracle Discoverer or Brio. Round trips to the Query Panel make it somewhat clunky as a pure stream-of-thought OLAP tool, but it does offer a level of interactivity that traditional reports don't. I see the sweet spot of Business Objects as sitting in two places: ad-hoc reporting by staff not necessarily familiar with SQL, and providing scheduled reports delivered in an interactive format that allows some drill-down into the data.
'Data Mart' is also a fairly loosely used term and can mean any user-facing data access medium for a data warehouse system. The definition may or may not include the reporting tools and metadata layers, reporting layer tables or other items such as Cubes or other analytic systems.
I tend to think of a data mart as the database from which the reporting is done, particularly if it is a readily definable subsystem of the overall data warehouse architecture. However it is quite reasonable to think of it as the user facing reporting layer, particularly if there are ad-hoc reporting tools such as Business Objects or OLAP systems that allow end-users to get at the data directly.
The term "data mart" has become somewhat ambiguous, but it is traditionally associated with a subject-oriented subset of an organization's information systems. Data mart does not explicitly imply the presence of a multi-dimensional technology such as OLAP and data mart does not explicitly imply the presence of summarized numerical data.
A cube, on the other hand, tends to imply that data is presented using a multi-dimensional nomenclature (typically an OLAP technology) and that the data is generally summarized as intersections of multiple hierarchies. (i.e. the net worth of your family vs. your personal net worth and everything in between) Generally, “cube” implies something very specific whereas “data mart” tends to be a little more general.
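A rough way to picture those intersections in code: a pivot table (here in pandas, with made-up family finances) is the everyday analogue of a cube's summarized cells, right down to the grand totals in the margins.

```python
# Each cell is an intersection of two hierarchies; margins hold the rollups.
import pandas as pd

facts = pd.DataFrame({
    "person":  ["you", "you", "spouse", "spouse"],
    "account": ["savings", "stocks", "savings", "stocks"],
    "value":   [10_000, 5_000, 8_000, 12_000],
})

cube = pd.pivot_table(facts, values="value", index="person",
                      columns="account", aggfunc="sum", margins=True)
print(cube)  # the 'All' row/column are the family-level summaries
```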
I suppose in OOP speak you could accurately say that a data mart “has-a” cube, “has-a” relational database, “has-a” nifty reporting interface, etc… but it would be less correct to say that any one of those individually “is-a” data mart. The term data mart is more inclusive.
A data mart is a collection of data about a specific business process; how the data is stored is irrelevant. A cube stores data in a special, multidimensional way, unlike a table with rows and columns. A cube is to an OLAP database what a table is to a traditional database. A data mart can contain tables or cubes. Cubes make analysis faster because they pre-calculate aggregations ahead of time.
As the name suggests, a cube is a structured multidimensional data-set (picture three dimensions, each represented by one side of a cube). A data mart is just a container, not a structure in itself, although it contains data-sets organized flatly (as tables) into dimensions and facts.
The structure of a cube makes it easy to visualize or conceptualize data along various dimensions of a cube. Thus most business analysts or developers find it easy to query and interact with the cube.
Since a data mart is just a container with a bunch of tables, users need to first conceptualize and understand the dimensional structures before querying and analyzing the data.
Data mart traditionally has meant static data, usually date/time oriented, used by analysts for statistics, budgeting, performance and sales reporting, and other planning activities.
A Cube is an OLAP database that pretty exhaustively converts OLTP data into a static, date/time-oriented schema that uses a query language that is not SQL, but built specifically for answering data mart type questions. It uses terms like measures, dimensions, star-schema, etc. rather than tables, columns, and rows. The best familiar analogy might be pivot-tables in a spreadsheet.
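For flavour, here's what a query in that language (MDX) can look like; the cube, measure and dimension names are invented, and it's wrapped in a Python string only to keep these examples in one language. Note the measures-and-dimensions vocabulary in place of tables, columns, and rows.

```python
# A hypothetical MDX query: measures on one axis, a hierarchy on the other.
mdx = """
SELECT [Measures].[Sales Amount]      ON COLUMNS,
       [Date].[Calendar Year].MEMBERS ON ROWS
FROM   [Sales Cube]
"""
print(mdx)
```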
Remember:
Data Warehousing is the process of taking data from legacy and transaction database systems and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision making.
A Data Warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
Kimball, for example, has consistently defined a data mart as a process-oriented subset of the overall organization's data, based on a foundation of atomic data, and dependent only on the physics of the data-measurement events, not on the anticipated user's questions.
Data marts are based on the source of data, not on a department’s view of data.
Data marts contain all atomic detail needed to support drilling down to the lowest level.
Data marts can be centrally controlled or decentralized.
Correct definition: process-based, built on a foundation of atomic data, driven by data measurement.
Misguided definition: department-based, aggregate data only, driven by anticipated user questions.
To me, a data mart is just a place where data gets dumped in a relatively flat, unusable format.
Cube is taking that data and making it dance.
I agree with Matthew. We tend to use the term 'data mart' for any data source that stores generic data and mappings used across various applications in an enterprise. We don't store measurable data in a data mart, so I see a data mart as one of multiple data sources for a cube. This, however, is how we do it; I am sure there is nothing preventing you from storing measurable data in a data mart.