Does a dimension table support concurrent writing? - data-warehouse

I have created a dimension table in a DolphinDB database, and I now want to write data to it concurrently. Does a dimension table support concurrent writes?

Starting from version 1.30.13, a new configuration parameter, enableConcurrentDimensionalTableWrite, enables concurrent writes to dimension tables.
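If your deployment uses the usual key=value configuration file, enabling it might look like the sketch below (the file name and placement are assumptions; check the manual for your version):

```ini
# dolphindb.cfg (hypothetical excerpt)
enableConcurrentDimensionalTableWrite=true
```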

Related

Index and DB memory measuring of HSQLDB with JDBC

When working with the in memory version of HSQLDB with JDBC, is it possible to measure the size that the database (all tables + indexes) takes?
I would ideally want to access the sizes of the tables and indexes independently.
Is there a query suitable for this task or any other way that this can be measured?
You can calculate a rough estimate of memory use for each table based on the documentation: http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#dec_table_mem_use
Use the CARDINALITY column of the INFORMATION_SCHEMA.SYSTEM_TABLESTATS view for the row count of each table.
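As a rough illustration of the kind of estimate the documentation describes, you can multiply the row count from CARDINALITY by an estimated bytes-per-row figure. The per-column byte sizes below are illustrative assumptions only, not HSQLDB's exact accounting:

```python
# Rough per-table memory estimate: cardinality * estimated bytes per row.
# The per-type byte counts are illustrative assumptions; consult the HSQLDB
# deployment guide for the real figures.

def estimate_table_bytes(cardinality, column_types, overhead_per_row=80):
    # Hypothetical average in-memory sizes per column type.
    sizes = {"INTEGER": 16, "BIGINT": 24, "VARCHAR": 72, "TIMESTAMP": 32}
    per_row = overhead_per_row + sum(sizes[t] for t in column_types)
    return cardinality * per_row

# e.g. 10,000 rows of (INTEGER, VARCHAR, TIMESTAMP):
est = estimate_table_bytes(10_000, ["INTEGER", "VARCHAR", "TIMESTAMP"])
```

You would feed the CARDINALITY value queried from INFORMATION_SCHEMA.SYSTEM_TABLESTATS into a function like this, once per table, and sum the results.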

pub/sub on daily table

I am looking to implement a Pub/Sub + Dataflow connection from my App Engine application to BigQuery, and I am trying to understand exactly how to define it.
My problem is that I have a daily table: a new table is created on BigQuery once a day.
When I try to set up the Dataflow job, it only gives me the option to choose a single table.
Based on what you said in the comment section, you could use Daily Sharded tables or Time/day Partitioned tables.
According to the documentation, you can stream in both types. However, I must point out some differences, you would have to consider.
Time/Day Partitioned Tables:
These tables are internally divided into segments/partitions, which are easier to manage and improve query performance. You can find more information about it here.
There are quotas, such as the maximum number of partitions per table, which you should check against your needs.
When querying against day/time-partitioned tables you can use the pseudo columns _PARTITIONTIME or _PARTITIONDATE; each one has its own format, which you can read more about here.
You can stream individual rows using insertAll requests.
According to the documentation, partitioned tables perform better than sharded tables because BigQuery does not need to maintain a copy of the schema and metadata, or verify permissions, for each individual table.
Daily Sharded table:
There is no pseudo column you can use to manage/query the set of tables.
There is no limit on the number of tables you can create; you can read more about the quotas here.
You create daily tables using templates, named <targeted_table_name> + <templateSuffix>, all with the same schema.
If you choose partitioned tables, you could create a date-partitioned table and stream into it. Alternatively, if you prefer sharded tables, you can use a template to create them.
In addition, I would encourage you to read more about the differences and characteristics of each of them here.
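For the sharded-table route, a streaming tabledata.insertAll request carries a templateSuffix field, so BigQuery creates the day's table from the base table's schema on first use. A sketch of how such a request body could be assembled (the field names follow the insertAll REST API; the table contents and suffix are made up):

```python
import json

# Sketch of a tabledata.insertAll request body that streams into a
# template-based daily shard. "templateSuffix" makes BigQuery create
# <base_table> + suffix with the template's schema on first use.
def insert_all_body(rows, day_suffix):
    return {
        "kind": "bigquery#tableDataInsertAllRequest",
        "templateSuffix": day_suffix,          # e.g. "_20190826"
        "rows": [
            {"insertId": str(i), "json": row}  # insertId enables best-effort dedup
            for i, row in enumerate(rows)
        ],
    }

body = insert_all_body([{"user": "a", "value": 1.5}], "_20190826")
payload = json.dumps(body)  # what would be POSTed to the insertAll endpoint
```

For partitioned tables you would drop templateSuffix and simply stream into the base table (optionally addressing a partition decorator).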

How to use kdb+ to track an arbitrary number of IOT scalar streams?

I am trying to use kdb+ to capture and do aggregations on a number of sensory streams collated from iot sensors.
Each sensor has a unique identifier, a time component (.z.z), and a scalar value:
percepts:([]time:`datetime$(); id:`symbol$(); scalar:`float$())
However because the data is temporal in nature, it would seem logical to maintain separate perceptual/sensory streams in different columns, i.e.:
time  id_1  id_2  ...
15    0.15  ...
16    ...   1.5
However, appending to a table only supports row operations in the insert fashion, i.e. `percepts insert (.z.z;`id_1;0.15)
Seeing as I would like to support a large and non-static number of sensors in this setup, it seems like an anti-pattern to append rows in the aforementioned format and then transform the rows into columns based on their id. Would it be possible/necessary to create a table with a dynamic (growing) number of columns based upon new feature streams?
How would one most effectively implement logic that allows the insertion of columnar time series data averting the need to do a transform on row based data?
You can add data to a specific column. To do that, make the following changes:
Make the time column a key, either permanently or during the update operation.
Use upsert to add data, passing the data in table format.
The upd function below is specific to your example, but you can make it more generic. It takes a sensor name and sensor data as input and performs 3 steps:
It first checks if the table is empty; in that case, it sets the table schema to the input dataset's schema (which, according to your example, should be the time and sensor name columns) and makes time the primary key.
If the table has data but the column for the new sensor is missing, it first adds a column with null float values and then upserts the data.
If the column is already there, it just upserts the data.
q)t:() / table to store all sensors data
q)upd:{[s;tbl] `t set $[0=count t;`time xkey 0#tbl;not s in cols t;![t;();0b;enlist[s]!enlist count[t]#0Nf];t] upsert tbl}
q)upd[`id1;([]time:1#.z.z;id1:1#14.4)]
q)upd[`id2;([]time:1#.z.z;id2:1#2.3)]
time                    id1  id2
--------------------------------
2019.08.26T13:35:43.203 14.4
2019.08.26T13:35:46.861      2.3
Some points regarding your design:
If not every sensor sends data for every time entry, the table will contain a lot of null values (similar to a sparse matrix), which wastes memory and has some impact on queries as well.
In that case, you can consider other designs depending on your use case. For example, instead of storing each time entry, store data in time buckets. Another option is to group related sensors into separate tables instead of storing them all in one.
Another point to consider is that you will end up with a fat table if you keep adding sensors, and that has its own issues. It will also become a single bottleneck, which could be a problem in the future, and scaling it would be hard.
For small sensor sets the current design is fine, but if you are planning to add many sensors in the future then look into other design options.
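For readers less familiar with q, the same keyed-upsert-with-growing-columns idea can be sketched in Python using a dict keyed by timestamp (illustrative only; the sensor names and values are made up, and absent keys play the role of q's null floats):

```python
from collections import defaultdict

# Sparse wide table: {time: {sensor_id: value}}. A new sensor "column"
# appears implicitly the first time it is upserted; other rows simply
# lack the key, the analogue of the null floats in the q version.
table = defaultdict(dict)

def upd(sensor_id, time, value):
    table[time][sensor_id] = value   # upsert keyed on time

upd("id1", "2019.08.26T13:35:43.203", 14.4)
upd("id2", "2019.08.26T13:35:46.861", 2.3)

# The current set of sensor columns, derived from the data:
columns = sorted({s for row in table.values() for s in row})
```

This makes the sparse-matrix concern above concrete: each row only stores the sensors that actually reported at that time.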

Surrogate keys in fact-less fact tables

Why do you need surrogate keys in fact-less fact tables (or many-to-many dimensional relation tables)?
A few circumstances in which assigning a surrogate key to the rows of a fact table is beneficial:
Sometimes the business rules of the organization legitimately allow multiple identical rows to exist for a fact table. Normally as a designer, you try to avoid this at all costs by searching the source system for some kind of transaction time stamp to make the rows unique. But occasionally you are forced to accept this undesirable input. In these situations it will be necessary to create a surrogate key for the fact table to allow the identical rows to be loaded.
Certain ETL techniques for updating fact rows are only feasible if a surrogate key is assigned to the fact rows. Specifically, one technique for loading updates to fact rows is to insert the rows to be updated as new rows, then to delete the original rows as a second step, within a single transaction. The advantages of this technique from an ETL perspective are improved load performance, improved recovery capability and improved audit capabilities. The surrogate key for the fact table rows is required because multiple identical primary keys will often exist for the old and new versions of the updated fact rows between the insert of the updated row and the delete of the old row.
A similar ETL requirement is to determine exactly where a load job was suspended, either to resume loading or to back out the job entirely. A sequentially assigned surrogate key makes this task straightforward.
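The insert-then-delete update technique can be sketched with SQLite (the table and column names are made up; the point is that the surrogate key fact_sk keeps the old and new versions distinguishable while both exist inside the transaction):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fact_sales (
    fact_sk  INTEGER PRIMARY KEY AUTOINCREMENT,  -- sequential surrogate key
    order_id INTEGER,                            -- natural/business key
    amount   REAL)""")
con.execute("INSERT INTO fact_sales (order_id, amount) VALUES (1, 10.0)")
con.commit()

# Update order 1 by inserting the new version first, then deleting the old
# row, as a single transaction. Both versions share order_id, but the
# surrogate key tells them apart while they coexist.
with con:
    con.execute("INSERT INTO fact_sales (order_id, amount) VALUES (1, 12.5)")
    con.execute("""DELETE FROM fact_sales
                   WHERE order_id = 1
                     AND fact_sk < (SELECT MAX(fact_sk) FROM fact_sales
                                    WHERE order_id = 1)""")

rows = con.execute("SELECT order_id, amount FROM fact_sales").fetchall()
# rows == [(1, 12.5)]
```

Without fact_sk, the DELETE would have no way to target only the old version of the row.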

Checking for updated dimension data

I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 ETL whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce a LastUpdated column into the source system (OLTP) tables. This way you have less to extract, using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte database-scope-unique counter).
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For an SSIS example, see here. The transformation does exactly what you are trying to do: all you have to do is specify which columns to monitor and what to do when it detects a change.
It sounds like you are approaching this somewhat backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" data from your OLTP database.
Compare your extracted data against the dimensional data to determine whether there are changes, and perform whatever other validation is needed (the "Transform" step).
Insert the data ("Load") into your dimension table.
Effectively, in step #1 you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine whether a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes. Any RDBMS and ETL process should be able to process that in no more than a few seconds. I know SQL Server has DTS, although I'm not sure if the name has changed in more recent versions. That is the perfect tool for doing something like this.
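The compare step above can be sketched in plain Python: given the latest OLTP snapshot and the current dimension rows, find the students whose tracked attributes changed, so that a new Type 2 row can be inserted for each. The keys and field names are made up for illustration:

```python
# Sketch of the SCD2 change-detection ("Transform") step.
# snapshot:  {business_key: current OLTP attributes}
# dimension: {business_key: latest dimension-row attributes}
def changed_rows(snapshot, dimension, tracked=("address", "email")):
    changed = []
    for key, src in snapshot.items():
        cur = dimension.get(key)
        # New student, or a tracked attribute differs -> needs a new version row.
        if cur is None or any(src[c] != cur[c] for c in tracked):
            changed.append((key, src))
    return changed

snap = {1: {"address": "1 New Rd", "email": "a@x"},
        2: {"address": "5 Same St", "email": "b@x"}}
dim  = {1: {"address": "1 Old Rd", "email": "a@x"},
        2: {"address": "5 Same St", "email": "b@x"}}
delta = changed_rows(snap, dim)   # only student 1 changed
```

In a real DW load the snapshot would come from the multi-table OLTP query and the changed rows would be loaded as new dimension records with fresh effective dates.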
Does your OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.
