dbt table coverage metric - data-warehouse

Given that I have a data warehouse with various tables being created from various sources, many of them by dbt, I want to measure a concept like 'dbt table coverage', which I define as:
dtc = count(tables and views that exist) / count(non-ephemeral models and sources)
This would be really useful for maintaining a sense of quality/completeness, especially during transition projects. Is there a dbt command like:
dbt report table-coverage --schemas=['reporting','example']
>>> 96% coverage, 48/50 tables in the schemas provided are captured in dbt.
If not, how can we add this to the project?
What alternate approaches could I take to solving the problem?

To do this, I would probably create a model (view) that queries the information_schema and makes some assumptions about a 1-to-1 mapping of {sourceTableName} to stg_{sourceTableName} (assuming that's what coverage means for you).
Additionally, I would look into using the graph.sources.values() Jinja function to iterate through all of the documented sources in your project, and then compare that with each of the models in {target.schema}:
https://docs.getdbt.com/reference/dbt-jinja-functions/graph#accessing-sources
If you're comparing the existence of entries in your sources' schema.yml against the source database's information_schema, I would alter the approach to map each of the items in the graph against the total count of items in the information_schema on the source database.
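For illustration, here is a minimal sketch of that iteration (hedged: it assumes a dbt version that exposes adapter.get_relation, and the log message is just an example):

{% if execute %}
    {% for source_node in graph.sources.values() %}
        {# each source node carries database / schema / identifier #}
        {%- set relation = adapter.get_relation(
                database=source_node.database,
                schema=source_node.schema,
                identifier=source_node.identifier) -%}
        {% if relation is none %}
            {# documented source with no matching table or view #}
            {{ log("No relation found for source " ~ source_node.name, info=true) }}
        {% endif %}
    {% endfor %}
{% endif %}

Counting the sources with and without a relation gives you the denominator and numerator of the coverage metric.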

A couple of thoughts here, since this is pretty intriguing for my current case as well:
dbt doesn't give outputs of queries or return a result to the command line (that I know of!), so that is one inherently unsupported feature at this time, i.e. dbt report or dbt query doesn't exist yet. If that's desired, I'd recommend opening a feature request here:
https://github.com/fishtown-analytics/dbt/issues
If you're OK with making a model in dbt and then just executing that via your client of choice, let's give that a shot. (I'm using Postgres, so convert accordingly.)
WITH schema_map AS
(
    SELECT schemaname AS schema,
           tablename AS name,
           'Table' AS type,
           CASE WHEN schemaname LIKE '%dbt%' THEN 1 ELSE 0 END AS dbt_created
    FROM pg_tables
    WHERE NOT schemaname = ANY('{information_schema,pg_catalog}')
    UNION
    SELECT schemaname AS schema,
           viewname AS name,
           'View' AS type,
           CASE WHEN schemaname LIKE '%dbt%' THEN 1 ELSE 0 END AS dbt_created
    FROM pg_views
    WHERE NOT schemaname = ANY('{information_schema,pg_catalog}')
)
SELECT count(name) AS total_tables_and_views,
       sum(dbt_created) AS dbt_created,
       to_char((sum(dbt_created)::dec / count(name)::dec) * 100, '999D99%') AS dbt_coverage
FROM schema_map
Gives the result:
total_tables_and_views | dbt_created | dbt_coverage
391 | 292 | 74.68%

Just to feed back to the community, with thanks to Jordan and Gscott for the inspiration: the solution I implemented for SQL Server/Synapse was:
1. A daily execution that counts models in INFORMATION_SCHEMA.TABLES and in the dbt graph, as one table.
2. An incremental table built on (1) that selects schemas of interest and aggregates. In my case below I filter out staging and testing.
DbtModelCounts:
{% set models = [] -%}
{% if execute %}
    {% for node in graph.nodes.values()
        | selectattr("resource_type", "equalto", "model") %}
        {%- do models.append(node.name) -%}
    {% endfor %}
{% endif %}

WITH tables AS
(
    SELECT table_catalog [db], table_schema [schema_name], table_name [name], table_type [type]
    FROM INFORMATION_SCHEMA.TABLES
),
dbt_tables AS
(
    SELECT *
    FROM tables
    WHERE name IN (
        {%- for model in models %}
        '{{ model }}'{% if not loop.last %},{% endif %}
        {%- endfor %}
    )
)
SELECT
    tables.db,
    tables.schema_name,
    tables.type,
    COUNT(tables.name) AS ModelCount,
    COUNT(dbt_tables.name) AS DbtModelCount
FROM tables
LEFT JOIN dbt_tables ON
    tables.name = dbt_tables.name AND
    tables.schema_name = dbt_tables.schema_name AND
    tables.db = dbt_tables.db AND
    tables.type = dbt_tables.type
GROUP BY
    tables.db,
    tables.schema_name,
    tables.type
Dbt Coverage:
{{
    config(
        materialized='incremental',
        unique_key='DateCreated'
    )
}}
SELECT
    CAST(GETDATE() AS DATE) AS DateCreated,
    GETDATE() AS DateTimeCreatedUTC,
    SUM(DbtModelCount) AS DbtModelCount,
    SUM(ModelCount) AS TotalModels,
    SUM(DbtModelCount) * 100.0 / SUM(ModelCount) AS DbtCoveragePercentage
FROM {{ ref('DbtModelCounts') }}
WHERE schema_name NOT LIKE 'testing%' AND schema_name NOT LIKE 'staging%'
To do: add logic for defined sources, to also compute the percentage of sources that map to my staging or raw schema tables.
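As a rough sketch of that to-do (an assumption on my part: source identifiers match the table names in the staging/raw schemas, and the pattern mirrors the models loop above):

{% set sources = [] -%}
{% if execute %}
    {% for node in graph.sources.values() %}
        {# collect the table identifier of each documented source #}
        {%- do sources.append(node.identifier) -%}
    {% endfor %}
{% endif %}
-- then restrict INFORMATION_SCHEMA.TABLES to the staging/raw schemas and
-- count how many table names appear in the list above, exactly as
-- DbtModelCounts does for models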

Related

How to pass a table name as a parameter in BigQuery procedure?

I am trying to build a BigQuery stored procedure where I need to pass the table name as a parameter. My code is:
CREATE OR REPLACE PROCEDURE `MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES` (table_name STRING)
BEGIN
----step 1
CREATE OR REPLACE TABLE `MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES_01` AS
SELECT DISTINCT XX.HH_ID, A.ECR_PRTY_ID, XX.ANCHOR_DT
FROM table_name XX
LEFT JOIN
(
SELECT DISTINCT HH_ID, ECR_PRTY_ID
FROM `analytics-mkt-cleanroom.Master.EDW_ECR_ECR_MAPPING`
WHERE HH_ID NOT LIKE 'U%'
AND ECR_PRTY_ID IS NOT NULL
)A
ON XX.HH_ID = A.HH_ID----one (HH) to many (ecr)
;
END;
CALL MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES(`analytics-mkt-cleanroom.MKT_DS.Home_Services_Multi_Class_Aesthetic_Baseline_Final_Training_Sample`);
I followed a couple of similar questions here and here, and tried writing an EXECUTE IMMEDIATE version of the above, but was not able to work out the right syntax.
I think the issue is that the SELECT statement in my code selects multiple columns (XX.HH_ID, A.ECR_PRTY_ID, XX.ANCHOR_DT) and the EXECUTE IMMEDIATE setup is meant to work only for one column. But I'm not sure. Please advise. Thank you.
I am basically trying to write stored procedures for data pipeline building.
Hope the below is helpful.
1. Pass the parameter as a string:
CALL MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES(`analytics-mkt-cleanroom.MKT_DS.Home_Services_Multi_Class_Aesthetic_Baseline_Final_Training_Sample`);
-->
CALL MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES('analytics-mkt-cleanroom.MKT_DS.Home_Services_Multi_Class_Aesthetic_Baseline_Final_Training_Sample');
2. Use EXECUTE IMMEDIATE, since a table name can't be parameterized as a variable in a query:
----step 1
EXECUTE IMMEDIATE FORMAT("""
CREATE OR REPLACE TABLE `MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES_01` AS
SELECT DISTINCT XX.HH_ID, A.ECR_PRTY_ID, XX.ANCHOR_DT
FROM `%s` XX
LEFT JOIN
(
SELECT DISTINCT HH_ID, ECR_PRTY_ID
FROM `analytics-mkt-cleanroom.Master.EDW_ECR_ECR_MAPPING`
WHERE HH_ID NOT LIKE 'U%%'
AND ECR_PRTY_ID IS NOT NULL
)A
ON XX.HH_ID = A.HH_ID----one (HH) to many (ecr)
;
""", table_name);
3. Escape % in a format string with an additional %:
LIKE 'U%'
-->
LIKE 'U%%'
See PARSE_DATE not working in FORMAT() in BigQuery.
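Putting the three fixes together, a minimal sketch of the complete procedure (same objects as in the question; only the string argument, the FORMAT placeholder, and the %% escaping change):

CREATE OR REPLACE PROCEDURE `MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES`(table_name STRING)
BEGIN
  ----step 1
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE TABLE `MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES_01` AS
    SELECT DISTINCT XX.HH_ID, A.ECR_PRTY_ID, XX.ANCHOR_DT
    FROM `%s` XX
    LEFT JOIN
    (
      SELECT DISTINCT HH_ID, ECR_PRTY_ID
      FROM `analytics-mkt-cleanroom.Master.EDW_ECR_ECR_MAPPING`
      WHERE HH_ID NOT LIKE 'U%%' ---- %% escapes the literal % inside FORMAT
      AND ECR_PRTY_ID IS NOT NULL
    ) A
    ON XX.HH_ID = A.HH_ID ----one (HH) to many (ecr)
  """, table_name);
END;

CALL MKT_DS.PXV2DWY_CREATE_PROPERTY_FEATURES('analytics-mkt-cleanroom.MKT_DS.Home_Services_Multi_Class_Aesthetic_Baseline_Final_Training_Sample');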

Event-Time Temporal Table Join requires both primary key and row time attribute in versioned table, but no row time attribute can be found

I have tried to use a lookup join, but I run into this problem:
SELECT
    e.isFired,
    e.eventMrid,
    e.createDateTime,
    r.id AS eventReference_id,
    r.type
FROM Event e
JOIN EventReference FOR SYSTEM_TIME AS OF e.createDateTime AS r
ON r.id = e.eventReference_id;
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.ValidationException: Event-Time Temporal Table Join requires both primary key and row time attribute in versioned table, but no row time attribute can be found.
Whether that query will be interpreted by the Flink SQL planner as a temporal join or a lookup join depends on the type of the table on the right-hand side. In this case I guess you haven't used a lookup source, and your time attribute might not be defined correctly.
Temporal (time-versioned) joins require (see the sketch after this list):
- an equality predicate on the primary key of the versioned table
- a time attribute
Lookup joins require:
- a lookup source connector (e.g., JDBC, HBase, Hive, or something custom)
- an equality join predicate
- a processing-time attribute in combination with FOR SYSTEM_TIME AS OF (to prevent needing to update the join results)
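For illustration, a hedged sketch of a versioned-table DDL that satisfies the event-time temporal join. The column names come from the query above; update_time, the watermark interval, the topic, and the connector settings are assumptions:

CREATE TABLE EventReference (
    id STRING,
    type STRING,
    update_time TIMESTAMP(3),
    -- the watermark makes update_time a row time (event-time) attribute
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND,
    -- a versioned table also needs a primary key
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',                      -- hypothetical source
    'topic' = 'event_reference',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format' = 'json',
    'value.format' = 'json'
);

Note that the left side's createDateTime must likewise be declared as an event-time attribute (with its own watermark) for FOR SYSTEM_TIME AS OF e.createDateTime to be accepted.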

proc sql inner join behavior and required select statements

I recently started using SAS, having only received a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS SQL when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulty understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all of their records. I made use of a previously written snippet of code that I thought I understood; I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When removing the comment markers around 'select distinct x.subject from' in the create table statement, the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
This is similar, but didn't require the additional select statement: the first piece of code has three select statements, while the second only requires two.
Can someone explain why this is required?
Or link me to some good documentation that lists the specifications of these types of joins? Can anyone also tell me the specific name of this type of join where you only use a comma?
While writing this, I also see that I could've used the code I initially wrote (to find subjects that have only one type of record) and tweaked it for my current issue >.< but I'd still like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today, and the following construct is recommended:
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a Cartesian product, and the number of rows is the product of the number of rows in the cross-joined tables.
When the query has criteria (WHERE clause), the join is an INNER JOIN.
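A small illustration of the two forms (one and two are hypothetical tables sharing a subject column):

proc sql;
    /* comma syntax with no criteria: a cross join (cartesian product) */
    create table crossed as
    select a.subject, b.subject as subject2
    from one a, two b;

    /* the same comma syntax plus a WHERE clause: an inner join */
    create table joined as
    select a.subject
    from one a, two b
    where a.subject = b.subject;
quit;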
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
General tip: if you want to fool around (fiddle) with SQL queries in a browser, try visiting the SQL Fiddle web site.

Is it possible to use literal data as stream source in Sumologic?

Is it possible for a Sumo Logic user to define data source values inside a query and use them in a subquery condition?
For example, in SQL one can use literal data as a source table.
-- example in MySQL
SELECT * FROM (
SELECT 1 as `id`, 'Alice' as `name`
UNION ALL
SELECT 2 as `id`, 'Bob' as `name`
-- ...
) as literal_table
I wonder if Sumo Logic also has this kind of functionality.
I believe combining such literals with subqueries would make users' lives easier.
I believe the equivalent in a Sumo Logic query would be using the save operator to create a lookup table in a subquery: https://help.sumologic.com/05Search/Subqueries#Reference_data_from_child_query_using_save_and_lookup
Basically something like this:
_sourceCategory=katta
[subquery:(_sourceCategory=stream explainJSONPlan.ETT) error
| where !(statusmessage="Finished successfully" or statusmessage="Query canceled" or isNull(statusMessage))
| count by sessionId, statusMessage
| fields -_count
| save /explainPlan/neededSessions
| compose sessionId keywords]
| parse "[sessionId=*]" as sessionId
| lookup statusMessage from /explainPlan/neededSessions on sessionid=sessionid
Where /explainPlan/neededSessions is your literal data table that you select from later on in the query (using lookup).
Alternatively, you can define a lookup table from some static map/dictionary that you don't update very often (you can even point to a file on the internet in case you change the mapping often), and then use the | lookup operator. There's nothing specific to subqueries about it.
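For example, a hedged sketch of that static variant (the URL and the CSV's field names are placeholders; classic lookup syntax):

_sourceCategory=katta
| parse "[sessionId=*]" as sessionId
// enrich each message from a hosted CSV keyed by sessionId
| lookup statusMessage from https://example.com/sessions.csv on sessionId=sessionId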
Disclaimer: I am currently employed by Sumo Logic.

PSQL group by vs. aggregate speed

So, the general question is: what's faster, taking an aggregate of a field or having extra expressions in the GROUP BY clause? Here are the two queries.
Query 1 (extra expressions in GROUP BY):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN some_date_field IS NOT NULL
THEN
FLOOR(((some_date_field - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
some_integer * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY some_id_1, some_date_field, some_integer
) subquery
Query 2 (using an arbitrary aggregate function; arbitrary because, in this dataset, each record for the table_2 fields in question has the same value):
SELECT sum(subquery.what_i_want)
FROM (
SELECT table_1.some_id,
(
CASE WHEN MAX(some_date_field) IS NOT NULL
THEN
FLOOR(((MAX(some_date_field) - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
ELSE
MAX(some_integer) * MAX(some_other_integer)
END
) what_i_want
FROM table_1
JOIN table_2 on table_1.some_id = table_2.id
WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0) -- per the data and what i want, one of these will always be true
GROUP BY some_id_1
) subquery
As far as I can tell, psql doesn't provide good benchmarking tools: \timing only times one query at a time, so running a benchmark with enough trials for meaningful results is tedious at best.
For the record, I did do this at about n=50 and saw the aggregate method (Query 2) run faster on average, but with a p-value of ~0.13, so not quite conclusive.
'sup with that?
The general answer: they should be more or less the same. There's a chance to hit or miss a function-based index when using (or not using) functions on a field, though that matters more in the WHERE clause than in the column list, and not for aggregate functions. But this is speculation only.
What you should use for analyzing execution is EXPLAIN ANALYZE. In the plan you see not only the scan types, but also the number of iterations, the cost, and the time of individual operations. And of course you can use it from psql.
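For example, applied to Query 2 verbatim (BUFFERS is optional but often useful):

EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(subquery.what_i_want)
FROM (
    SELECT table_1.some_id,
        (
            CASE WHEN MAX(some_date_field) IS NOT NULL
            THEN FLOOR(((MAX(some_date_field) - current_date)::numeric / 7) + 1) * MAX(some_other_integer)
            ELSE MAX(some_integer) * MAX(some_other_integer)
            END
        ) what_i_want
    FROM table_1
    JOIN table_2 ON table_1.some_id = table_2.id
    WHERE ((some_date_field IS NOT NULL AND some_date_field > current_date) OR some_integer > 0)
    GROUP BY some_id_1
) subquery;

Each plan node reports its actual time and row count, so comparing the two queries this way is far more informative than wall-clock timing with \timing.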
