Automatic denormalization by query

I wonder if it's possible to create logic that automatically creates a denormalized table and its data (and maintains it) from a specific SQL-like query.
Consider a system where the user can maintain his data model and data. All data is stored in "relational" tables, but those tables are only used by the user to maintain his data. If he wants to display data on a webpage, he has to write an SQL query, which will automatically be turned into a denormalized table that is also kept up to date when the relational data is updated or deleted.
Let's say I've got a query like this:
select t1.a, t1.b from t1 where t1.c = 1
The logic will automatically create a denormalized table with a copy of the needed data according to the query. It's much like a view (I wonder whether views would be more performant than my approach). Whenever this query (give it a name) is needed by some business logic, it will be replaced by a simple query on that new table.
Any update in t1 will search for all queries where t1 is involved and update the denormalized data automatically, but as a performance win it will only update the affected rows (in this example, just one row). That's the point where I'm not sure whether it's achievable automatically. The example query is simple, but what about queries with joins, aggregations or even subqueries?
Does an approach like this exist in the NoSQL world, and can somebody share their experience with it?
I would also like to know whether creating one table per query conflicts with any best practices for NoSQL databases.
I have an idea for handling simple queries: when data is updated, find the involved entity by its primary key and run the query on that specific entity again (so that joins are updated, too). But with aggregations and subqueries I don't really know how to determine which rows of the denormalized table are affected.
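For illustration, here is a minimal sketch of that per-row maintenance for the example query, assuming PostgreSQL and a hypothetical primary key id on t1 (all names are illustrative):
-- Materialize the query result once:
CREATE TABLE q1_denormalized AS
SELECT id, a, b FROM t1 WHERE c = 1;

-- Keep it in sync row by row; only the affected row is touched:
CREATE FUNCTION q1_sync() RETURNS trigger AS $$
BEGIN
  -- Drop the stale copy of the affected row, if present.
  DELETE FROM q1_denormalized WHERE id = OLD.id;
  -- Re-insert it only if the updated row still satisfies the predicate.
  IF TG_OP = 'UPDATE' THEN
    IF NEW.c = 1 THEN
      INSERT INTO q1_denormalized (id, a, b) VALUES (NEW.id, NEW.a, NEW.b);
    END IF;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER q1_sync_trg
AFTER UPDATE OR DELETE ON t1
FOR EACH ROW EXECUTE FUNCTION q1_sync();
As the question notes, this per-row strategy breaks down once the query contains joins, aggregations or subqueries, because a single source row can no longer be mapped to a single denormalized row.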

Related

Single vs. multiple ID columns in data warehouse/lake

I have set up a time-series / events database using the AWS Firehose -> S3/Glue -> Athena stack. It is being used to track various user actions (session started, action performed, etc.) across a number of our products. My question is about how best to store different types of IDs in this system.
The existing schema is one big 'fact table' with a bunch of different columns. Two of the most important columns are event_type_id and object_id. To use StackOverflow as an example, two events might be:
question_asked - in this case I would be storing the question id in the object_id column.
tag_created - in this case I would be storing the tag id in the object_id column.
My question is - is storing multiple different types of IDs in the same column bad practice? It's working OK for us at the moment, but it does require the person/system performing queries to know what type of object the object_id column refers to, based on the event they are querying.
If it is bad practice, what other approaches might be better? Multiple columns that are NULL when not relevant for the event in that row? Or is this where dimension tables would be a better fit?
This isn't necessarily bad practice, depending on how you use it.
It sounds like you're aware of the potential pitfalls of such an approach (i.e. users of the data have to be aware of the context, in this case "event type", to use the values correctly). Since you're using Athena, you could mitigate that by creating views over the source table for different event types, inserting a WHERE clause filter on event type and possibly renaming object_id to something more context-specific, e.g. question_id.
This makes it easier for users to work with the data and understand exactly what values they're working with.
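For example, such a view might look like this (a sketch assuming Athena SQL; the fact table name events is hypothetical):
CREATE VIEW question_events AS
SELECT object_id AS question_id  -- context-specific name instead of the generic object_id
FROM events
WHERE event_type_id = 'question_asked';  -- or the numeric id for that event type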
In a big data environment I wouldn't recommend creating dimension tables if it can be avoided, as JOINs between tables start to get expensive. Having multiple columns for different IDs is possible, but then you create new problems for users, such as having to account for NULL values in an ID column. It also potentially makes it harder to add new event types and IDs, since you have to change the schema to accommodate them.

Bitemporal master-satellite relationship

I am a DB newbie to the bitemporal world and have a naive question.
Say you have a master-satellite relationship between two tables, where the master stores essential information and the satellite stores information that is relevant to only a few of the records in the master table. An example would be 'trade' as the master table and 'trade_support' as the satellite table, where 'trade_support' only houses supporting information for non-electronic trades (which will be a small minority).
In a non-bitemporal landscape, we would model this as a parent-child relationship. My question is: in a bitemporal world, should such a use case still be modeled as a two-table parent-child relationship with four temporal columns on both tables? I don't see a reason why it can't be done, but the question of "should it be done" is quite hazy in my mind. Any gurus to help me out with the rationale behind the choice?
Pros:
Normalization
Cons:
Additional table and temporal columns to maintain and manage via DAOs
Defining performant join conditions
I believe this should be a pretty common use-case and wanted to know if there are any best practices that I can benefit from.
Thanks in advance!
Bitemporal data management and foreign keys can be quite tricky. For a master-satellite relationship between bitemporal tables, an "artificial key" needs to be introduced in the master table: it is not unique, but is identical across the different temporal or historical versions of an object. This key is referenced from the satellite. When joining the two tables, a bitemporal context (T_TIME, V_TIME), where T_TIME is the transaction time and V_TIME is the valid time, must be given for the join. The join would look something like the following:
SELECT m.*, s.*
FROM master m
LEFT JOIN satellite s
ON m.key = s.master_key
AND <V_TIME> between s.valid_from and s.valid_to
AND <T_TIME> between s.t_from and s.t_to
WHERE <V_TIME> between m.valid_from and m.valid_to
AND <T_TIME> between m.t_from and m.t_to
In this query, the valid period is given by the columns valid_from and valid_to, and the transaction period by the columns t_from and t_to, for both the master and the satellite table. The artificial key in the master is given by the column m.key, and the reference to this key by s.master_key. A left outer join is used so that entries of the master table that have no corresponding entry in the satellite table are also retrieved.
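For illustration, the table layout behind this join could look like the following sketch (only the key and temporal columns are shown; the types are illustrative):
CREATE TABLE master (
  key        INTEGER   NOT NULL,  -- artificial key: identical for all temporal versions of an object, hence not unique
  valid_from DATE      NOT NULL,  -- valid-time period
  valid_to   DATE      NOT NULL,
  t_from     TIMESTAMP NOT NULL,  -- transaction-time period
  t_to       TIMESTAMP NOT NULL
  -- ... essential trade columns ...
);

CREATE TABLE satellite (
  master_key INTEGER   NOT NULL,  -- references master.key; not enforceable as a plain foreign key, since master.key is not unique
  valid_from DATE      NOT NULL,
  valid_to   DATE      NOT NULL,
  t_from     TIMESTAMP NOT NULL,
  t_to       TIMESTAMP NOT NULL
  -- ... supporting columns for non-electronic trades ...
);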
As you have noted above, this join condition is likely to be slow.
On the other hand, this layout may be more space efficient: if only the master data (in table trade) or only the satellite data (in table trade_support) is updated, this only requires a new entry in the respective table. When using one table for all data, a new entry for all columns of the combined table would be necessary. Also, you will end up with a table with many null values.
So the question you are asking boils down to a trade-off between space requirements and concise code. The amount of space you are sacrificing with the single-table solution depends on the number of columns of your satellite table. I would probably go for the single-table solution, since it is much easier to understand.
If you have any chance to switch database technology, a document-oriented database might make more sense. I have written a prototype of a bitemporal Scala layer based on MongoDB, which is available here:
https://github.com/1123/bitemporaldb
This will allow you to work without joins, and with a more flexible structure of your trade data.

Grid sorting based on properties on the view model

I am developing an ASP.NET MVC4 web application that uses the Entity Framework for data access. Many of the pages contain grids that need to support paging, sorting, filtering and grouping. For performance, the grid filtering, sorting, paging, etc. needs to occur in the database (i.e. the Entity Framework needs to generate a suitable SQL query). One complication is that the view model representing the grid rows is built by combining data from multiple business entities (tables). This could be as simple as getting the data from an entity a couple of levels down, or it could be calculated from the values of related business entities. What approach is recommended to handle this scenario? Does anyone know of a good example on the web? Most examples have a simple mapping between the view model and the business domain model.
Update 28/11 - To further clarify: the initial display of the grid and the paging perform well. (See comment below.) The problem is how to handle sorting/ordering (and filtering) when the column the user clicked on does not map directly to a column on the underlying business table. I am looking for a general solution, as the system will have approximately 100 grids with a number of columns each, and trying to handle this on a per-column basis would not be maintainable.
If you want to be able to order by a calculated field that isn't precalculated in the database, or perform any database operation against it, then you are going to have to precalculate the value and store it in the database. I don't know of any way around that.
The only other solution is to move the paging, sorting, etc. to the web server. I am sure you don't really want to do that, as you would have to calculate ALL the values just to find what order they go in.
So if you want to achieve this, I think you will have to do the following. I would love to hear alternate solutions, though:
Database Level Changes:
Add a Nullable Column for each calculated field you have in your View Model.
Write a SQL script that calculates these values (see the sketch after this list).
Set the column to Not Null if necessary
App Level Changes:
In your Add and Edit pages you will have to calculate these values and commit them with the rest of the data.
You can now query against these at the database level and use Queryable as you wanted.
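A minimal sketch of the database-level changes, assuming SQL Server and a hypothetical Orders grid with a calculated TotalValue column:
-- 1. Add a nullable column for the calculated field:
ALTER TABLE Orders ADD TotalValue DECIMAL(18, 2) NULL;

-- 2. Backfill the existing rows:
UPDATE Orders SET TotalValue = Quantity * UnitPrice;

-- 3. Lock it down once every row has a value, if necessary:
ALTER TABLE Orders ALTER COLUMN TotalValue DECIMAL(18, 2) NOT NULL;
After that, sorting and filtering on TotalValue happens entirely in the database, just like for any plain column.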

How to make sure that it is possible to update a database table column only in one way?

I am using Ruby on Rails v3.2.2 and I would like to "protect" a class/instance attribute so that a database table column value can be updated in only one way. That is, for example, given that I have two database tables:
table1
- full_name_column
table2
- name_column
- surname_column
and I manage table1 so that full_name_column is updated via a callback declared in the related table2 class/model, I would like to make sure that the full_name_column value can be updated only through that callback.
In other words, I should ensure that the table1.full_name_column value is always
"#{table2.name_column} #{table2.surname_column}"
and that it can't be any other value. So, for example, if I try to "directly" update table1.full_name_column, it should raise something like an error. Of course, the value must remain readable.
Is it possible? What do you advise for handling this situation?
Reasons for this approach...
I want to use this approach because I am planning to perform database searches on table1 columns, where table1 contains other values related to a "profile"/"person" object... otherwise I would probably have to resort to some hack (maybe a complex one) to direct those searches at table2 so as to look for "#{table2.name_column} #{table2.surname_column}" strings.
So I think a simple way is to denormalize the data as explained above, but it requires implementing an "uncommon" way of handling that data.
BTW: an answer should aim to "solve" the related processes or propose a better approach to handling the search functionality.
Here are two approaches for maintaining the data at the database level...
Views and materialized tables.
If possible, table1 could be a VIEW or, for example, a MATERIALIZED QUERY TABLE (MQT). The terminology differs slightly depending on the RDBMS: I think Oracle has MATERIALIZED VIEWs, whereas DB2 has MATERIALIZED QUERY TABLEs.
A VIEW is simply access to data that is physically stored in a different table, whereas a MATERIALIZED VIEW/QUERY TABLE is a physical copy of the data, and is therefore not necessarily in sync with the source data in real time.
Anyway, these approaches provide read-only access to data that is owned by table2 but accessible via table1.
Example of a very simple view:
CREATE VIEW table1 AS
SELECT surname||', '||name AS full_name
FROM table2;
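For comparison, the materialized flavor might look like this (a sketch assuming Oracle; the name and refresh strategy are illustrative):
CREATE MATERIALIZED VIEW full_name_mv
REFRESH COMPLETE ON DEMAND AS
SELECT surname||', '||name AS full_name
FROM table2;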
Triggers
Sometimes views are not convenient, as you might actually want to have some data in table1 that is not available anywhere else. In these cases you could consider using database triggers, i.e. create a trigger so that when table2 is updated, table1 is also updated within the same database transaction.
With triggers, the problem might be that you then have to grant the client privileges to update table1 as well. Some RDBMSs provide ways to tune the access control of triggers, i.e. the operations performed by a TRIGGER run with different privileges from the operations that initiate the TRIGGER.
In this case the TRIGGER could look something like this:
CREATE TRIGGER UPDATE_NAME
AFTER UPDATE OF NAME, SURNAME ON TABLE2
REFERENCING NEW AS NEWNAME
FOR EACH ROW
BEGIN ATOMIC
  -- Rebuild the denormalized full name whenever name or surname changes.
  UPDATE TABLE1
    SET FULL_NAME = NEWNAME.SURNAME||', '||NEWNAME.NAME
    WHERE SOME_KEY = NEWNAME.SOME_KEY;
END
By replicating the data from table2 into table1 you've already denormalized it. As with any denormalization, you must be disciplined about maintaining sync. This means not updating things you're not supposed to.
Although you can wall off things with attr_accessible to prevent accidental assignment, the way Ruby works means there's no way to guarantee that value will never be modified. If someone's determined enough, they will find a way. This is where the discipline comes in.
The best approach is to document that the column should not be modified directly, block mass-assignment with attr_accessible, and leave it at that. There's no concept of a write-protected attribute, really, as far as I know.

Ruby dynamically tied to table

I've got a huge monster of a database (okay, that's not quite true, but there are over 8 million records in one product table).
This table is fed by 13 suppliers.
Even with the best indexing I could come up with, searching for the top 10,000 records that are ready for supplier 8 is crazy slow.
What I'd like to do is create a product table for each supplier, splitting the big table into smaller ones.
Now in C++ or what have you, I'd just switch the table that I'm working with inside the class.
In Ruby, it seems I'll have to create a new class for each table and do a migration.
Also, as I plan to have some in-session tables, I'd be interested in getting Ruby to work with them.
Oh, and the 8 million is set to grow to 20 million in the next 6 months.
A question posed was: what's my DB engine? Right now it's SQL, but I'm open to pushing my DB to another engine if it means I can use temp tables and "partitioned" tables.
One additional point on indexing: indexing fields that change frequently, like price and quantity, isn't practical, since I'd have to re-index the changed items each time I made a change.
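For illustration, the partitioned variant might look like the following sketch (assuming MySQL; all column names are illustrative):
CREATE TABLE products (
  id          BIGINT NOT NULL,
  supplier_id INT    NOT NULL,
  price       DECIMAL(10, 2),
  quantity    INT,
  -- MySQL requires the partitioning column in every unique key:
  PRIMARY KEY (id, supplier_id)
)
PARTITION BY LIST (supplier_id) (
  PARTITION p1  VALUES IN (1),
  PARTITION p2  VALUES IN (2),
  -- ... one partition per supplier ...
  PARTITION p13 VALUES IN (13)
);
Queries that filter on supplier_id would then only scan the matching partition.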
By Ruby, I assume you mean a class inheriting from ActiveRecord::Base in a Ruby on Rails application. By convention, you are correct in that each class is meant to represent a separate table.
You can easily execute arbitrary SQL using the ActiveRecord::Base.connection.execute method, passing a string containing your SQL query. This bypasses having to create separate Ruby classes to represent transient tables. It is not the "Rails approach"; however, it does address your question of allowing switching of the tables inside a class file.
More information on ActiveRecord database statements can be found here: http://api.rubyonrails.org/classes/ActiveRecord/ConnectionAdapters/DatabaseStatements.html
However, as other people have pointed out, you should be able to optimize your query so that splitting it across multiple tables is not necessary. You may want to analyze your SQL query's execution plan using various tools to optimize the execution. If you are using MySQL, check out its query execution planning functionality: http://dev.mysql.com/doc/refman/5.5/en/execution-plan-information.html
By introducing indexes, changing join methods between tables, etc., you should be able to reduce your query execution time.
