Event-Time Temporal Table Join requires both primary key and row time attribute in versioned table, but no row time attribute can be found - join

I tried to use a lookup join, but I ran into this problem:
SELECT
  e.isFired,
  e.eventMrid,
  e.createDateTime,
  r.id AS eventReference_id,
  r.type
FROM Event e
JOIN EventReference FOR SYSTEM_TIME AS OF e.createDateTime AS r
  ON r.id = e.eventReference_id;
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.table.api.ValidationException: Event-Time Temporal Table Join requires both primary key and row time attribute in versioned table, but no row time attribute can be found.

Whether the Flink SQL planner interprets that query as a temporal join or a lookup join depends on the type of the table on the right-hand side. In this case, the error suggests it was planned as an event-time temporal join, so I guess EventReference is not backed by a lookup source, and its time attribute might not be defined correctly.
Temporal (time-versioned) joins require:
- an equality predicate on the primary key of the versioned table
- a time attribute
and lookup joins require:
- a lookup source connector (e.g., JDBC, HBase, Hive, or something custom)
- an equality join predicate
- using a processing time attribute in combination with FOR SYSTEM_TIME AS OF (to prevent needing to update the join results)
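
To see why the versioned table needs both a primary key and a row time attribute, here is a minimal sketch in plain Python (not Flink code) of what an event-time temporal join computes. The sample data and the names reference_changes, events and temporal_join are made up for illustration, and the sketch ignores watermarks, state cleanup and retractions.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical changelog of the versioned table: (row_time, primary_key, type).
reference_changes = [
    (1, "r1", "alarm"),
    (5, "r1", "warning"),
    (3, "r2", "info"),
]

# Hypothetical probe-side events: (createDateTime, eventReference_id).
events = [(2, "r1"), (6, "r1"), (1, "r2"), (4, "r2")]

# The primary key identifies which row a change belongs to; the row time
# orders the versions of that row.
versions = defaultdict(list)
for row_time, key, value in sorted(reference_changes):
    versions[key].append((row_time, value))


def temporal_join(events, versions):
    """For each event, pick the version that was valid at the event's time."""
    joined = []
    for event_time, key in events:
        history = versions.get(key, [])
        times = [t for t, _ in history]
        # Index of the latest version whose row_time <= event_time.
        idx = bisect_right(times, event_time) - 1
        if idx >= 0:  # inner join: skip events with no version yet
            joined.append((event_time, key, history[idx][1]))
    return joined


result = temporal_join(events, versions)
```

Without a key there is no way to group changes into versions of one row, and without a row time there is no way to decide which version applied at e.createDateTime, which matches the two requirements named in the error message.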

Ruby - ActiveRecord - Select one record per 'group' based on a specific column value

I have this table:
User
Name     Role
Mason    Engineer
Jackson  Engineer
Mason    Supervisor
Jackson  Supervisor
Graham   Engineer
Graham   Engineer
There can be exact duplicates (same Name/Role combination). Ignore comments about primary key.
I am writing a query that will give the distinct values from 'Name' column, with the corresponding 'Role'. To select the corresponding 'Role', if there is a 'Supervisor' role for a name, that record is returned. Otherwise, a record with the 'Engineer' role should be returned if it exists.
For the above table, the expected result is:
Name     Role
Mason    Supervisor
Jackson  Supervisor
Graham   Engineer
I tried ordering 'Role' in descending order, so that I can group by Name, Role and pick the first item: it will be a 'Supervisor' role if present, else an 'Engineer' role, which matches my expectation.
I also tried User.select('DISTINCT ON (name) *').order(Role: :desc), but I am not seeing this clause in the SQL query that gets executed.
I also tried another approach: get all valid Name/Role combinations and then process them offline, iterating over the result set and using if-else to decide which row to display.
However, I am interested in anything that is efficient and does not overdo this handling.
I am new to Ruby and therefore reaching out.
If I wanted to do this in pure SQL, I would have to use GROUP BY.
SELECT Name, MAX(Role) FROM User GROUP BY Name
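This relies on 'Supervisor' sorting after 'Engineer' alphabetically, so MAX(Role) returns 'Supervisor' whenever it is present. One method would be to execute this SQL statement against the base connection.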
ActiveRecord::Base.connection.execute("SELECT Name, MAX(Role) FROM User GROUP BY Name")
That would provide exactly the data you need, though it wouldn't be returned as ActiveRecord models. If you need those models then I would use find_by_sql and do an inner join to provide the records.
User.find_by_sql("SELECT User.* FROM User INNER JOIN (SELECT Name AS n, MAX(Role) AS r FROM User GROUP BY Name) U2 ON User.Name = U2.n AND User.Role = U2.r")
Unfortunately, that would return both records for Graham, because the two rows are exact duplicates.
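
As a rough check of the queries above, here is a small Python sketch that runs them against an in-memory SQLite database. The table name users and the SQLite backend are assumptions for the demo (the real app presumably uses a different database), and the DISTINCT variant is only a suggestion for collapsing exact duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")  # hypothetical table name
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("Mason", "Engineer"), ("Jackson", "Engineer"),
        ("Mason", "Supervisor"), ("Jackson", "Supervisor"),
        ("Graham", "Engineer"), ("Graham", "Engineer"),
    ],
)

# MAX(role) prefers 'Supervisor' only because 'S' sorts after 'E'.
grouped = conn.execute(
    "SELECT name, MAX(role) FROM users GROUP BY name ORDER BY name"
).fetchall()

JOIN_SQL = """
    SELECT {distinct} u.name, u.role
    FROM users u
    INNER JOIN (SELECT name AS n, MAX(role) AS r FROM users GROUP BY name) u2
      ON u.name = u2.n AND u.role = u2.r
    ORDER BY u.name
"""

# Joining back to the base table keeps both duplicate Graham rows ...
joined = conn.execute(JOIN_SQL.format(distinct="")).fetchall()
# ... while DISTINCT collapses rows that are identical in every column.
deduped = conn.execute(JOIN_SQL.format(distinct="DISTINCT")).fetchall()
```

If more roles are added later, alphabetical order may no longer match the intended priority, so an explicit ranking (for example a CASE expression) would be safer than MAX.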

Convert long SQL query into ActiveRecord expression

I need to convert the SQL query below to an ActiveRecord query expression. I have come across a number of related posts that involve joining two different models using .joins(), but in this case I am joining the model with a subset of itself. In particular, I could not find how to express the SELECT a.some_column, ... FROM my_model a part.
SELECT a.city_id, a.statistic_id, a.value, a.year
FROM city_values a
INNER JOIN (
  SELECT city_id, statistic_id, MAX(year) AS year
  FROM city_values
  GROUP BY city_id, statistic_id) b
ON a.city_id = b.city_id AND a.year = b.year AND a.statistic_id = b.statistic_id
ORDER BY city_id, statistic_id;
I have a model named CityValue corresponding to data for various cities, statistics and years. The model has attributes city_id, statistic_id, value, year. The ActiveRecord query is intended to get the most recent year's statistic values for each city and statistic and return an ActiveRecord::Relation object.
From my understanding of the SQL query provided, you are trying to select the rows that contain the most recent year's statistics for each unique pair of (city_id, statistic_id).
I would not use a join for this problem. Note that DISTINCT ON is PostgreSQL-specific. cvs stands for city_values.
CityValue.
from('city_values cvs').
select('DISTINCT ON (cvs.city_id, cvs.statistic_id) cvs.*').
order('cvs.city_id, cvs.statistic_id, cvs.year DESC')
Explanations about DISTINCT ON can be found here:
https://www.vertabelo.com/blog/technical-articles/postgresql-select-distinct-on.
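
To illustrate how the two approaches relate, here is a small Python sketch with made-up sample rows. It runs the self-join from the question in SQLite (which has no DISTINCT ON) and emulates DISTINCT ON in plain Python by sorting and keeping the first row per (city_id, statistic_id). The data and variable names are assumptions for the demo.

```python
import sqlite3

# Hypothetical sample rows: (city_id, statistic_id, value, year).
rows = [
    (1, 10, 5.0, 2018),
    (1, 10, 7.5, 2020),
    (1, 11, 3.0, 2019),
    (2, 10, 9.0, 2017),
    (2, 10, 8.0, 2021),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_values (city_id INT, statistic_id INT, value REAL, year INT)"
)
conn.executemany("INSERT INTO city_values VALUES (?, ?, ?, ?)", rows)

# The self-join from the question: join each row to the max year of its group.
via_join = conn.execute("""
    SELECT a.city_id, a.statistic_id, a.value, a.year
    FROM city_values a
    INNER JOIN (
        SELECT city_id, statistic_id, MAX(year) AS year
        FROM city_values
        GROUP BY city_id, statistic_id) b
      ON a.city_id = b.city_id AND a.year = b.year
     AND a.statistic_id = b.statistic_id
    ORDER BY a.city_id, a.statistic_id
""").fetchall()

# What DISTINCT ON (city_id, statistic_id) ... ORDER BY ..., year DESC does:
# sort by the key, newest year first, then keep the first row of each key.
latest = {}
for row in sorted(rows, key=lambda r: (r[0], r[1], -r[3])):
    latest.setdefault((row[0], row[1]), row)
via_distinct_on = list(latest.values())
```

One difference to keep in mind: if two rows of the same group share the maximum year, the self-join returns both, while DISTINCT ON keeps exactly one of them.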

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One workaround is to use NOT EXISTS.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got a Cartesian join because that is what Hive does in this case. The vegetables table is very small (just one row) and it is broadcast to perform the cross join (most probably a map join; check the plan). Hive does the cross (map) join first and then applies the filter. The explicit left join syntax with a filter, as @VamsiPrabhala suggested, will force a left join, but in this case it works the same, because the table is very small and the CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.
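
One thing worth knowing when swapping NOT IN for NOT EXISTS or a left join: the three forms agree on the sample data, but they diverge once the subquery column contains a NULL. The Python sketch below shows this using SQLite as a stand-in for Hive, which is an assumption made so the example can run anywhere. The NULL behavior shown is standard SQL three-valued logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (name TEXT);
    CREATE TABLE vegetables (name TEXT);
    INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
    INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
""")

QUERIES = {
    "not_in": """SELECT name FROM foods
                 WHERE name NOT IN (SELECT name FROM vegetables)
                 ORDER BY name""",
    "not_exists": """SELECT f.name FROM foods f
                     WHERE NOT EXISTS (SELECT 1 FROM vegetables v
                                       WHERE v.name = f.name)
                     ORDER BY f.name""",
    "left_join": """SELECT f.name FROM foods f
                    LEFT JOIN vegetables v ON v.name = f.name
                    WHERE v.name IS NULL
                    ORDER BY f.name""",
}


def run_all():
    """Run every variant and return {label: [names]}."""
    return {label: [r[0] for r in conn.execute(sql)] for label, sql in QUERIES.items()}


before = run_all()

# A single NULL in the subquery makes `x NOT IN (...)` evaluate to UNKNOWN
# for every non-matching x, so the NOT IN variant returns no rows at all.
conn.execute("INSERT INTO vegetables VALUES (NULL)")
after = run_all()
```

This also means a planner cannot treat NOT IN exactly like IN, since it has to account for NULLs in the subquery. Whether that extra check is what triggers the Cartesian product error in Hive 2.3.4 is an assumption here; the EXPLAIN output would confirm it.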

Ambiguous column error creating table in Aster Studio 6.0

I am new to databases and am posting a problem from work. I am creating a table in Aster Studio 6.0, but got an error about an ambiguous column. I ran the same query in Teradata SQL Assistant and did not get an error.
I have six tables with millions of rows named EDW.SWIFTIQ_TRANS_DTL, EDW.SWIFTIQ_STORE, EDW.SWIFTIQ_PROD, EDW.STORE_XREF, EDW.TDLNX_STR_OUTLT, and EDW.SURV_CWC.
EDW represents the original database, but the columns were labeled with aliases.
I did a trim() on the VARCHAR columns for saving spool space. For the error about TDLNX_RTL_OUTLT_NBR, I performed an INNER JOIN on similar columns from two different tables. Doing a preview in SQL Assistant, there was a temporary table with only one column called TDLNX_RTL_OUTLT_NBR.
Here’s the SQL query:
CREATE TABLE public.table_name
DISTRIBUTE BY HASH (SRC_SYS_PROD_ID) AS (
SELECT * FROM load_from_teradata(
ON public.load_from_teradata_dummy
TDPID('database_name')
USERNAME('user_name')
PASSWORD('ss')
QUERY ('SELECT e.TDLNX_RTL_OUTLT_NBR, e.OUTLT_ST_ADDR_TXT, e.STORE_OUTLT_ZIP_CD, d.TRANS_ID, d.TRANS_DT,
d.TRANS_TM, d.UNIT_QTY, d.SRC_SYS_STORE_ID, d.SRC_SYS_PROD_ID, d.SRC_SYS_NM, a.SRC_SYS_STORE_ID, a.SRC_SYS_NM, a.STORE_NM,
a.CITY_NM, a.ZIP_CD, a.ST_cd, p.SRC_SYS_PROD_ID, p.SRC_SYS_NM, p.UPC_CD, p.PROD_ID, f.SRC_SYS_STORE_ID, f.SRC_SYS_NM,
f.TDLNX_RTL_OUTLT_NBR, g.SURV_CWC_WSLR_CUST_PARTY_ID, g.AGE_CD, g.HIGH_END_ACCT_FLG, g.RACE_ETHNC_CD, g.OCCPN_CD
FROM EDW.SWIFTIQ_TRANS_DTL d
INNER JOIN EDW.SWIFTIQ_STORE a
ON trim( a.SRC_SYS_STORE_ID) = trim(d.SRC_SYS_STORE_ID)
INNER JOIN EDW.SWIFTIQ_PROD p
ON trim(p.SRC_SYS_PROD_ID) = trim(d.SRC_SYS_PROD_ID)
and p.SRC_SYS_NM = d.SRC_SYS_NM
INNER JOIN EDW.STORE_XREF f
ON trim(f.SRC_SYS_STORE_ID) = trim(a.SRC_SYS_STORE_ID)
INNER JOIN EDW.TDLNX_STR_OUTLT e
ON trim(e.TDLNX_RTL_OUTLT_NBR)= trim(f.TDLNX_RTL_OUTLT_NBR)
INNER JOIN EDW.SURV_CWC g
ON g.SURV_CWC_WSLR_CUST_PARTY_ID = e.WSLR_CUST_PARTY_ID
WHERE TRANS_DT between ''2015-01-01'' and ''2015-03-31''')
num_instances('4') ) );
ERROR: column reference 'TDLNX_RTL_OUTLT_NBR' is ambiguous.
EDIT: Forgot to include a description about the table aliases. a stands for EDW.SWIFTIQ_STORE, p for EDW.SWIFTIQ_PROD, f for EDW.STORE_XREF, e for EDW.TDLNX_STR_OUTLT, g for EDW.SURV_CWC, and d for EDW.SWIFTIQ_TRANS_DTL.
You will get the same error when you try CREATE TABLE AS SELECT in Teradata. Besides TDLNX_RTL_OUTLT_NBR (selected from both e and f), there are three more column names, SRC_SYS_NM, SRC_SYS_PROD_ID and SRC_SYS_STORE_ID, which are used multiple times (with different table aliases) within the SELECT.
Add column aliases to make those names unique, e.g. d.SRC_SYS_NM AS trans_SRC_SYS_NM instead of d.SRC_SYS_NM.
Additionally, the TRIMs in the joins are a very bad idea. You will probably not save much spool, but you will force the optimizer to redistribute all spools for join preparation.
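
To find every clashing output name rather than fixing them one error at a time, a small script can scan the select list. The sketch below is plain Python with hypothetical helper names (output_name, with_aliases), and it assumes every item is a simple alias.column reference, as in the query from the question.

```python
from collections import Counter

# Select list copied from the query in the question.
select_list = """
e.TDLNX_RTL_OUTLT_NBR, e.OUTLT_ST_ADDR_TXT, e.STORE_OUTLT_ZIP_CD, d.TRANS_ID, d.TRANS_DT,
d.TRANS_TM, d.UNIT_QTY, d.SRC_SYS_STORE_ID, d.SRC_SYS_PROD_ID, d.SRC_SYS_NM, a.SRC_SYS_STORE_ID, a.SRC_SYS_NM, a.STORE_NM,
a.CITY_NM, a.ZIP_CD, a.ST_cd, p.SRC_SYS_PROD_ID, p.SRC_SYS_NM, p.UPC_CD, p.PROD_ID, f.SRC_SYS_STORE_ID, f.SRC_SYS_NM,
f.TDLNX_RTL_OUTLT_NBR, g.SURV_CWC_WSLR_CUST_PARTY_ID, g.AGE_CD, g.HIGH_END_ACCT_FLG, g.RACE_ETHNC_CD, g.OCCPN_CD
"""


def output_name(expr):
    """Name a plain alias.column reference gets in the result (no AS handling)."""
    return expr.strip().split(".")[-1].upper()


def with_aliases(select_list):
    """Append 'AS <table alias>_<column>' to every column whose name repeats."""
    parts = [e.strip() for e in select_list.split(",") if e.strip()]
    counts = Counter(output_name(p) for p in parts)
    aliased = []
    for part in parts:
        table_alias, column = part.split(".")
        if counts[column.upper()] > 1:
            aliased.append(f"{part} AS {table_alias}_{column}")
        else:
            aliased.append(part)
    return aliased


names = [output_name(e) for e in select_list.split(",") if e.strip()]
duplicates = {name: n for name, n in Counter(names).items() if n > 1}
unique_select = with_aliases(select_list)
```

The generated aliases such as d_SRC_SYS_NM are only a mechanical naming scheme; more descriptive names like trans_SRC_SYS_NM are probably nicer in the final table.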

Alternative way of joining two datasets in SAS

I have two datasets, DS1 and DS2. DS1 is 100,000 rows x 40 cols, DS2 is 20,000 rows x 20 cols. I actually need to pull COL1 from DS1 if some fields match DS2.
Since I am very new to SAS, I am trying to stick to SQL logic.
So basically I did (short version)
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
OR DS1.COL3=DS2.COL3
OR DS1.COL4=DS2.COL2
...
After an hour or so, it was still running, but I was getting emails from SAS saying that I was using 700 GB or so. Is there a better and faster SAS way of doing this operation?
I would use three separate queries combined with UNION:
proc sql;
...
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
on DS1.COL2=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL3=DS2.COL3
UNION
SELECT DS1.col1
FROM DS1 INNER JOIN DS2
ON DS1.COL4=DS2.COL2
...
You may have null or blank values in the columns you are joining on. Your query is probably matching all the null/blank columns together resulting in a very large result set.
I suggest adding additional clauses to exclude null results.
Also - if the same row happens to exist in both tables, then you should also prevent the row from joining to itself.
Either of these could effectively result in a Cartesian product join (or something close to one).
EDIT: By the way, a good way of debugging this type of problem is to limit both datasets to a certain number of rows, say 100 in each, then run it and check that the output is what you expect. You can do this using the SQL options inobs=, outobs=, and loops=.
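
Both suggestions so far can be tried on a toy scale. The Python sketch below uses SQLite as a stand-in for PROC SQL, which is an assumption made so the example can run anywhere, with made-up rows. It checks that the OR join and the three-way UNION return the same distinct col1 values, then shows how blank join keys inflate the number of matched pairs. Blanks are modeled as empty strings because SAS compares missing character values as equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ds1 (col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT);
    CREATE TABLE ds2 (col2 TEXT, col3 TEXT);
    INSERT INTO ds1 VALUES ('a', 'x', 'p', 'm'), ('b', 'y', 'z', 'n'), ('c', 'q', 'r', 's');
    INSERT INTO ds2 VALUES ('m', 'x'), ('k', 'z');
""")

OR_ON = "ds1.col2 = ds2.col3 OR ds1.col3 = ds2.col3 OR ds1.col4 = ds2.col2"
UNION_SQL = """
    SELECT ds1.col1 FROM ds1 INNER JOIN ds2 ON ds1.col2 = ds2.col3
    UNION
    SELECT ds1.col1 FROM ds1 INNER JOIN ds2 ON ds1.col3 = ds2.col3
    UNION
    SELECT ds1.col1 FROM ds1 INNER JOIN ds2 ON ds1.col4 = ds2.col2
"""
# Same predicates, but blank keys are not allowed to match each other.
GUARDED_ON = (
    "(ds1.col2 = ds2.col3 AND ds2.col3 <> '') OR "
    "(ds1.col3 = ds2.col3 AND ds2.col3 <> '') OR "
    "(ds1.col4 = ds2.col2 AND ds2.col2 <> '')"
)


def matched_pairs(on_clause):
    """Number of (ds1 row, ds2 row) pairs the join condition produces."""
    sql = f"SELECT COUNT(*) FROM ds1 INNER JOIN ds2 ON {on_clause}"
    return conn.execute(sql).fetchone()[0]


or_names = sorted(
    r[0] for r in conn.execute(f"SELECT DISTINCT ds1.col1 FROM ds1 INNER JOIN ds2 ON {OR_ON}")
)
union_names = sorted(r[0] for r in conn.execute(UNION_SQL))
pairs_clean = matched_pairs(OR_ON)

# Add 50 rows with a blank col3 to each table: every blank matches every blank.
conn.executemany("INSERT INTO ds1 VALUES (?, 'u', '', 'v')", [(f"blank{i}",) for i in range(50)])
conn.executemany("INSERT INTO ds2 VALUES (?, '')", [("w",)] * 50)
pairs_with_blanks = matched_pairs(OR_ON)
pairs_guarded = matched_pairs(GUARDED_ON)
```

With only 50 blank keys on each side the join already produces 2,500 extra pairs, so a few thousand blanks in the real datasets could plausibly explain the runaway disk usage. Note that UNION removes duplicates, so it matches the OR join only when duplicates in the output do not matter.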
First, sort the datasets that you are trying to merge using proc sort. Then merge the datasets by ID.
Here is how you can do it.
I have assumed your match field is ID.
proc sort data=DS1;
by ID;
proc sort data=DS2;
by ID;
data out;
merge DS1 DS2;
by ID;
run;
You can use proc sort for DS3 and DS4 and then include them in the merge statement if you need to join them as well.
