I want to apply joins on 2 dbt models, and those 2 models have the same column names. Hence, when I try to apply any join, I get the following error:
column "rateplan_amendmenttype" specified more than once
Here is a snippet of the code I am trying:
with a_in as (
select * from {{source('dbt_alice', 'common_a')}}
)
select * from a_in cross join {{source('dbt_alice', 'common_c')}}
Using dbt's adapter functions, we can achieve the result:
-- store the columns from common_a and common_c as lists in jinja
{%- set common_a_cols = adapter.get_columns_in_relation(source('dbt_alice', 'common_a')) -%}
{%- set common_c_cols = adapter.get_columns_in_relation(source('dbt_alice', 'common_c')) -%}
-- select every field, dynamically applying a rename to ensure there are no conflicts
select
{% for col in common_a_cols %}
common_a.{{col.name}} as a_{{col.name}},
{% endfor %}
{% for col2 in common_c_cols %}
common_c.{{col2.name}} as b_{{col2.name}}{{ "," if not loop.last }}
{% endfor %}
from
{{source('dbt_alice', 'common_a')}} as common_a
cross join
{{source('dbt_alice', 'common_c')}} as common_c
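For reference, the compiled SQL would look roughly like the following, assuming each table has just id and rateplan_amendmenttype (the column from the error; the database and schema names here are illustrative). With the a_/b_ prefixes no column name appears twice, so the error disappears:
select
    common_a.id as a_id,
    common_a.rateplan_amendmenttype as a_rateplan_amendmenttype,
    common_c.id as b_id,
    common_c.rateplan_amendmenttype as b_rateplan_amendmenttype
from "analytics"."dbt_alice"."common_a" as common_a
cross join "analytics"."dbt_alice"."common_c" as common_c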
Given that I have a data warehouse with various tables being created from various sources, many of them by dbt, I want to measure a concept like 'dbt table coverage', which I define as:
dtc = count(tables and views that exist) / count(non ephemeral models and sources)
This would be really useful in order to maintain a sense of quality/completeness, especially during transition projects. Is there a dbt command like:
dbt report table-coverage --schemas=['reporting','example']
>>> 96% coverage, 48/50 tables in the schemas provided are captured in dbt.
If not, how can we add this to the project?
What alternative approaches could I take to solve the problem?
To do this, I would probably create a model (view) that queries the information_schema and makes some assumptions about a 1-to-1 mapping of {sourceTableName} to stg_{sourceTableName} (assuming this is what coverage means for you).
Additionally, I would look into using the graph.sources.values() Jinja function to iterate through all of the documented sources in your project, and then compare that with the models in {{ target.schema }}:
https://docs.getdbt.com/reference/dbt-jinja-functions/graph#accessing-sources
If you're comparing the sources defined in schema.yml against the source database's information_schema, I would alter the approach to map each of the items in the graph against the total count of items in the information_schema on the source database.
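A minimal sketch of that comparison as a dbt model (untested; 'my_source_schema' is a placeholder for wherever your sources land, and names are lower-cased before comparing):
{%- set source_tables = [] -%}
{%- if execute -%}
    {%- for src in graph.sources.values() -%}
        {%- do source_tables.append(src.name | lower) -%}
    {%- endfor -%}
{%- endif -%}

select
    count(*) as total_tables,
    sum(case when lower(table_name) in ('{{ source_tables | join("', '") }}')
             then 1 else 0 end) as documented_in_dbt
from information_schema.tables
where table_schema = 'my_source_schema'  -- assumption: the schema your sources live in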
A couple of thoughts here, since this is pretty intriguing for my current case as well:
dbt doesn't give outputs of queries or return a result to the command line (that I know of!), so that is one inherently unsupported feature at this time; i.e. dbt report or dbt query doesn't exist yet. If that's desired, I'd recommend opening a feature request here:
https://github.com/fishtown-analytics/dbt/issues
If you're OK with making a model in dbt and then just executing that via your client of choice, let's give that a shot. (I'm using Postgres, so convert accordingly.)
WITH schema_map AS (
    SELECT schemaname AS schema,
           tablename AS name,
           'Table' AS type,
           CASE WHEN schemaname LIKE '%dbt%' THEN 1 ELSE 0 END AS dbt_created
    FROM pg_tables
    WHERE NOT schemaname = ANY('{information_schema,pg_catalog}')
    UNION
    SELECT schemaname AS schema,
           viewname AS name,
           'View' AS type,
           CASE WHEN schemaname LIKE '%dbt%' THEN 1 ELSE 0 END AS dbt_created
    FROM pg_views
    WHERE NOT schemaname = ANY('{information_schema,pg_catalog}')
)
SELECT count(name) AS total_tables_and_views,
       sum(dbt_created) AS dbt_created,
       to_char((sum(dbt_created)::dec / count(name)::dec) * 100, '999D99%') AS dbt_coverage
FROM schema_map
Gives the result:
total_tables_and_views | dbt_created | dbt_coverage
391                    | 292         | 74.68%
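If you only want coverage for specific schemas, as in the hypothetical dbt report table-coverage --schemas=['reporting','example'] command above, a sketch of the same query with a schema filter (tables only here; UNION in pg_views as above to include views):
WITH schema_map AS (
    SELECT tablename AS name,
           CASE WHEN schemaname LIKE '%dbt%' THEN 1 ELSE 0 END AS dbt_created
    FROM pg_tables
    WHERE schemaname = ANY('{reporting,example}')  -- the schemas from the question
)
SELECT count(name) AS total_tables,
       sum(dbt_created) AS dbt_created,
       to_char((sum(dbt_created)::dec / count(name)::dec) * 100, '999D99%') AS dbt_coverage
FROM schema_map;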
Just to feed back to the community, and thanks to Jordan and Gscott for the inspiration: the solution I executed for SQL Server/Synapse was:
1. A daily execution that counts the models in INFORMATION_SCHEMA.TABLES and in the dbt graph, materialized as one table.
2. An incremental table built on (1) that selects the schemas of interest and aggregates. In my case below I filter out staging and testing.
DbtModelCounts:
{% set models = [] -%}
{% if execute %}
{% for node in graph.nodes.values()
| selectattr("resource_type", "equalto", "model")
%}
{%- do models.append(node.name) -%}
{% endfor %}
{% endif %}
with tables AS
(
SELECT table_catalog [db], table_schema [schema_name], table_name [name], table_type [type]
FROM INFORMATION_SCHEMA.TABLES
),
dbt_tables AS
(
SELECT *
FROM tables
WHERE name in (
{%- for model in models %}
('{{ model }}')
{% if not loop.last %},
{% endif %}
{% endfor %}
)
)
SELECT
tables.db,
tables.schema_name,
tables.type,
COUNT(tables.name) ModelCount,
COUNT(dbt_tables.name) DbtModelCount
FROM tables
LEFT JOIN dbt_tables ON
tables.name=dbt_tables.name AND
tables.schema_name = dbt_tables.schema_name AND
tables.db = dbt_tables.db AND
tables.type = dbt_tables.type
GROUP BY
tables.db,
tables.schema_name,
tables.type
Dbt Coverage:
{{
config(
materialized='incremental',
unique_key='DateCreated'
)
}}
SELECT
CAST(GETDATE() AS DATE) AS DateCreated,
GETDATE() AS DateTimeCreatedUTC,
SUM(DbtModelCount) AS DbtModelCount,
SUM(ModelCount) AS TotalModels,
SUM(DbtModelCount)*100.0/SUM(ModelCount) as DbtCoveragePercentage
FROM {{ref('DbtModelCounts')}}
WHERE schema_name NOT LIKE 'testing%' AND schema_name NOT LIKE 'staging%'
To do: add logic for defined sources, to also compute the percentage of sources that map to my staging or raw schema tables.
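For that to-do, the same graph trick may work with graph.sources.values(); an untested sketch, assuming the 1-to-1 stg_{sourceTableName} naming mentioned earlier and at least one defined source:
{% set source_tables = [] -%}
{% if execute %}
    {% for node in graph.sources.values() %}
        {%- do source_tables.append(node.name) -%}
    {% endfor %}
{% endif %}

SELECT COUNT(table_name) AS SourcesWithStagingTable,
       {{ source_tables | length }} AS TotalSources
FROM INFORMATION_SCHEMA.TABLES
WHERE table_name IN (
    {%- for s in source_tables %}
    ('stg_{{ s }}'){% if not loop.last %},{% endif %}
    {% endfor %}
)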
--Specific Informatica PowerCenter Qs--
I have an incoming data field like this, and I need to extract the substrings from either side of the hyphens and store them in individual fields of the target table. I am getting the correct results from the database, but the same is not working in Informatica. In the Expression transformation my code parses successfully, but nothing gets loaded.
It would be great if someone could assist me with all 8 REGEXP code lines, as the pattern seems to differ quite a bit as I traverse deeper into the string.
select replace(regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*' ,'[^-]*(-|$)',1,1), '-', '' ) from dual;
select replace(regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*' ,'[^-]*(-|$)',1,2), '-', '' ) from dual;
select replace(regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*' ,'[^-]*(-|$)',1,3), '-', '' ) from dual;
select regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*','[^-]+',1,1) from dual;
select regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*','[^-]+',1,2) from dual;
select regexp_substr('ABC-10000-DEF-200-*-*-XYZ-*','[^-]+',1,3) from dual;
INFA Case 1: When I use the below, it succeeds for the first occurrence but comes back as nulls for the other 7 substring extracts.
REG_EXTRACT(String_Input,'([^-]*),?([^-]*),?([^-]*).*',1) --> Succeeds
REG_EXTRACT(String_Input,'([^-]*),?([^-]*),?([^-]*).*',2) --> Null
REG_EXTRACT(String_Input,'([^-]*),?([^-]*),?([^-]*).*',3) --> Null, and so on till 8.
Case 2: When I use the below, I get all Nulls.
REG_EXTRACT('String_Input','[^-]+',1,1) --> Null
REG_EXTRACT('String_Input','[^-]+',1,2) --> Null
REG_EXTRACT('String_Input','[^-]+',1,3) --> Null
select sum(table_3.col_1) as number,
(CASE
when table_1.line_1 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_1, '(?i)City\s*(\d+)'))[1])
when table_1.line_2 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_2, '(?i)City\s*(\d+)'))[1])
when table_1.line_3 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_3, '(?i)City\s*(\d+)'))[1])
when table_1.line_4 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_4, '(?i)City\s*(\d+)'))[1])
when table_1.line_5 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_5, '(?i)City\s*(\d+)'))[1])
when table_1.line_6 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_6, '(?i)City\s*(\d+)'))[1])
when table_1.line_7 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_7, '(?i)City\s*(\d+)'))[1])
when table_1.line_8 ~* '(?i)City\s*(\d+)' then 'C' || (select (regexp_matches(table_1.line_8, '(?i)City\s*(\d+)'))[1])
else 'City'::varchar
end
) as string_result
from table_1
join table_2 on table_1.id = table_2.table_1_id
join table_3 on table_2.id = table_3.table_2_id
join table_4 on table_3.table_4_id = table_4.id
where table_4.id = 2
group by string_result
When I run the above query in pgAdmin, I get all the information I am expecting, over 20 rows. But when I run this in Rails using a heredoc and ActiveRecord::Base.connection.execute(sql), I only get the last row returned. I've done almost exactly the same thing (different data, of course) in other projects with great success. I'm at a loss as to why only one row is returned here when run in Rails.
UPDATE: I was able to figure out why this is happening, but I'm still having issues correcting it. The problem is that when the query is passed into ActiveRecord::Base.connection.execute(sql), it is escaping the space and digit requirements of my regex. I'm still going through the Postgres pattern-matching docs, but I have a hard time with regexes at the moment since I'm still pretty new at using them. I'm trying to capture the digits in strings matching something like 'City 24', where it matches 'City' case-insensitively.
You cannot access the results like an array. You need to use the each method.
Example:
ActiveRecord::Base.connection.execute(sql).each do |row|
p row
end
What you can do is:
ActiveRecord::Base.connection.select_all(sql)
This will return you an array of the results.
For some reason, the results from PostgreSQL are not an Array, while other DBs return an Array. Either of those should work.
I'm finally starting to wrap my head around regexes, and I was able to solve my issue. It was the digit capture. It worked in pgAdmin with (\d+), but when passed through ActiveRecord::Base.connection.execute(sql) it didn't work for some reason. The fix was to give it a range with [0-9], and since the + is simply one-or-more, I gave it {1,2}. So now the regex is (?i)City\s*([0-9]{1,2}). This will match the string case-insensitively for 'City' followed by any number of spaces, and will grab the first one or two digits after that.
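For anyone who wants to sanity-check that pattern directly in Postgres, a quick example (the sample strings are made up):
-- both return the captured digits: '24' and '9'
SELECT (regexp_matches('City 24', '(?i)City\s*([0-9]{1,2})'))[1];
SELECT (regexp_matches('the city9 office', '(?i)City\s*([0-9]{1,2})'))[1];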
I have a collector which collects three fields from a log file and saves them to InfluxDB in the following format:
FeildA  FeildB  FeildC
------  ------  ------
A       00      123
B       02      100
A       00      13
A       00      123
I want to plot a graph in Grafana such that I get the count of occurrences of "A" and "B" (FeildA).
IMP: FeildA can have multiple values, not known beforehand. Hence, writing a query with a hard-coded "where" clause is not an option.
If FeildA is only defined as a field in the measurement schema, you can use a regexp in the "where" clause, and these queries might work for you:
```
SELECT COUNT(FeildA) FROM "logdata" WHERE $timeFilter and FeildA::field =~ /^A$/
SELECT COUNT(FeildA) FROM "logdata" WHERE $timeFilter and FeildA::field =~ /^B$/
SELECT COUNT(FeildA) FROM "logdata" WHERE $timeFilter and FeildA =~ /^(A|B)$/
```
If the number of expected distinct values of FeildA (its cardinality) is reasonable, the real solution would be to make FeildA a "tag" instead of a "field". Then you can use "group by tag" in the query. For example, the query:
```
SELECT COUNT(FeildA) FROM "logdata" WHERE $timeFilter AND "FeildA" =~ /^(A|B|C|D)$/ GROUP BY time(1m), FeildA fill(null)
```
will give the counts of occurrences of "A", "B", "C", "D". But this requires changes in the collector.
FeildA can be both a "tag" and a "field" in InfluxDB, but it is better when the names are different, to avoid collisions and to simplify the syntax in queries.
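Once FeildA is a tag, you also no longer need to enumerate its values at all; a sketch like this counts every distinct value, which fits the "values not known beforehand" requirement (InfluxQL can't aggregate a tag directly, so count a field such as FeildB):
```
SELECT COUNT("FeildB") FROM "logdata" WHERE $timeFilter GROUP BY time(1m), "FeildA" fill(null)
```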
I use PostgreSQL 9.3.3 and I have a table with one column named title (character varying(50)).
When I execute the following query:
select * from test
order by title asc
I get the following results:
#
A
#Example
Why "#Example" is in the last position? In my opinion "#Example" should be in the second position.
Sort behaviour for text (including char and varchar as well as the text type) depends on the current collation of your locale.
See previous closely related questions:
PostgreSQL Sort
https://stackoverflow.com/q/21006868/398670
If you want to do a simplistic sort by ASCII value, rather than a properly localized sort following your local language rules, you can use the COLLATE clause:
select *
from test
order by title COLLATE "C" ASC
or change the database collation globally (requires dump and reload, or full reindex). On my Fedora 19 Linux system, I get the following results:
regress=> SHOW lc_collate;
lc_collate
-------------
en_US.UTF-8
(1 row)
regress=> WITH v(title) AS (VALUES ('#a'), ('a'), ('#'), ('a#a'), ('a#'))
SELECT title FROM v ORDER BY title ASC;
title
-------
#
a
#a
a#
a#a
(5 rows)
regress=> WITH v(title) AS (VALUES ('#a'), ('a'), ('#'), ('a#a'), ('a#'))
SELECT title FROM v ORDER BY title COLLATE "C" ASC;
title
-------
#
#a
a
a#
a#a
(5 rows)
PostgreSQL uses your operating system's collation support, so it's possible for results to vary slightly from host OS to host OS. In particular, at least some versions of Mac OS X have significantly broken unicode collation handling.
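If you want the "C" ordering by default rather than per query, two hedged options (assuming you can recreate the objects; the names here are illustrative):
-- per database: must be created from template0 with the C locale
CREATE DATABASE mydb TEMPLATE template0 LC_COLLATE 'C' LC_CTYPE 'C';
-- or per column, so ORDER BY on it uses "C" unless overridden
CREATE TABLE test_c (title varchar(50) COLLATE "C");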
It seems that, when sorting, Oracle as well as Postgres simply ignores non-alphanumeric chars, e.g.
select '*'
union all
select '#'
union all
select 'A'
union all
select '*E'
union all
select '*B'
union all
select '#C'
union all
select '#D'
order by 1 asc
returns the following (note that the DBMS doesn't pay any attention to the prefix before 'A'..'E'):
*
#
A
*B
#C
#D
*E
In your case, what Postgres actually sorts is
'', 'A' and 'Example'
If you put '#' in the middle of the string, the behaviour will be the same:
select 'A#B'
union all
select 'AC'
union all
select 'A#D'
union all
select 'AE'
order by 1 asc
returns the following ('#' is ignored, so 'AB', 'AC', 'AD' and 'AE' are actually compared):
A#B
AC
A#D
AE
To change the comparison rules you should use collation, e.g.
select '#' collate "POSIX"
union all
select 'A' collate "POSIX"
union all
select '#Example' collate "POSIX"
order by 1 asc
returns (as required in your case):
#
#Example
A