Google BigQuery SELECT timeouts with simple joins

This simple query is timing out. Any ideas how to optimise it using some BigQuery tricks?
SELECT
s.typeFlight s_type, r.distance, r.price, (d.booking_token IS NULL) clicked
FROM [search.searches] s
LEFT JOIN [search.search_results] r ON r.searchid=s.searchid
LEFT JOIN [search.clicks] d ON d.booking_token=r.booking_token
WHERE s.saved_at BETWEEN TIMESTAMP('2016-03-01 00:00:00')
AND TIMESTAMP('2016-03-05 00:00:00')
Query settings:
Query Priority: Batch
Destination Table: bucket-984:search.result
Write Preference: Overwrite table
Allow Large Results: true
The data comes from a search engine, so the clicks table is small (under a million rows), but the searches and search_results tables are huge. The query processes about 5 TB of data.

You could push the WHERE filtering into the first SELECT so there's less data to join:
SELECT
s.typeFlight s_type, r.distance, r.price, (d.booking_token IS NULL) clicked
FROM (
SELECT typeFlight, searchid
FROM [search.searches]
WHERE saved_at BETWEEN TIMESTAMP('2016-03-01 00:00:00')
AND TIMESTAMP('2016-03-05 00:00:00')
) s
LEFT JOIN [search.search_results] r ON r.searchid=s.searchid
LEFT JOIN [search.clicks] d ON d.booking_token=r.booking_token
Sometimes it is helpful to look at the Query Plan Explanation (https://cloud.google.com/bigquery/query-plan-explanation) to see where your query is spending time.

Related

How to use the except join in Cognos 11?

I can't get an except join to work in Cognos 11. What am I missing?
Some guidance for a beginner in this area would be nice ;-)
What I've tried so far is making two queries. The first one holds data items like "customer", "BeginningDate" and "Purpose". The second query holds data items like "customer", "Address" and "Community".
What I'd like to accomplish in query3 is to get the "customers" from query1 that are not present in query2. To me that sounds like an except join.
I went to the query work area, created query3, and dragged an "except join" icon onto it. Then I dragged query1 into the upper space and query2 into the lower. With other joins I'm used to getting a dialog to set the link, cardinality and so on, but double-clicking this join doesn't open any pop-up. The properties of the except join show "Set operation = Except", "Duplicates = remove", "Projection list = Manual".
How do I get query3 filled with a "customer" data item that holds only the customers appearing solely in query1?
In SQL terms, you want
select T1.C1
from T1
left outer join T2 on T1.C1 = T2.C1
where T2.C1 is null
So, in the query pane of a Cognos report...
Use a regular join.
Join using customer from both queries.
Change the cardinality to 1..1 on the query1 side and 0..1 on the query2 side.
In the filters for query3, add a filter for query2.customer is null.
EXCEPT is not a join. It is used to compare two data sets.
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/set-operators-except-and-intersect-transact-sql?view=sql-server-2017
What you need is an INNER JOIN. That would be the join tool in the Toolbox in Cognos.
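For reference, here is a minimal sketch of EXCEPT as a set operation, treating query1 and query2 as if they were tables (the "customer" data item name is taken from the question):
select customer from query1
EXCEPT
select customer from query2
This returns the distinct customers from query1 that do not appear in query2, matching the "Duplicates = remove" behaviour noted above.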

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One workaround is to use NOT EXISTS.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
                  FROM vegetables v
                  WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got a cartesian join because this is what Hive does in this case: the vegetables table is very small (just a few rows) and is broadcast to perform the cross join (most probably a map join; check the plan). Hive does the cross (map) join first and then applies the filter. Explicit left join syntax with a filter, as @VamsiPrabhala said, will force a left join, but in this case it works out the same because the table is very small and the CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see exactly what is happening.
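For example, running EXPLAIN on the query from the question:
EXPLAIN
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
The plan should show whether Hive rewrites the subquery into a map-side cross join followed by a filter.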

Emulating an interval join in hive

I am using hive 0.13.
I have two tables:
data table. columns: id, time. 1E10 rows.
mymap table. columns: id, name, start_time, end_time. 1E6 rows.
For each row in the data table I want to get the name from the mymap table matching the id and the time interval. So I want to do a join like:
select data.id, time, name from data left outer join mymap on data.id = mymap.id and time>=start_time and time<end_time
It is known that for every row in data there are 0 or 1 matches in mymap.
The above query is not supported in Hive, as it is a non-equi-join. Moving the inequality conditions into a WHERE filter does not work because the join explodes before the filter is applied:
select data.id, time, name from data left outer join mymap on data.id = mymap.id where mymap.id is null or (time>=start_time and time<end_time)
(I am aware that the queries are not exactly equivalent due to cases where there is a match for id but no matching interval. This can be solved as I describe here: Hive: work around for non equi left join)
How can I go about this?
You could perform your join first and then filter the result in an outer query. I didn't test this code, but it would read something like:
select id
      ,time
      ,name
from (
    select d.id
          ,d.time
          ,m.name
          ,m.start_time
          ,m.end_time
    from data as d
    LEFT OUTER JOIN mymap as m
        ON d.id = m.id
) x
where time >= start_time
  AND time < end_time
You could potentially get around this issue by flattening out the data structure in mymap and using a UDF to process the joined records.
select
    a.id,
    a.time,
    nameFinderUDF(b.name_list, a.time) as name
from
    data a
LEFT OUTER JOIN
(
    select
        id,
        collect_set(array(name, cast(start_time as string), cast(end_time as string))) as name_list
    from
        mymap
    group by
        id
) b
ON (a.id = b.id)
With a UDF that does something like:
public String evaluate(ArrayList<ArrayList<String>> name_list, Long time) {
    for (int i = 0; i < name_list.size(); i++) {
        // each entry is [name, start_time, end_time]
        if (time >= Long.parseLong(name_list.get(i).get(1))
                && time < Long.parseLong(name_list.get(i).get(2))) {
            return name_list.get(i).get(0);
        }
    }
    return null;
}
This approach should make the merge 1 to 1, but it could create a fairly large data structure repeated many times. It is still quite a bit more efficient than a straight join.
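To call the UDF from Hive it first has to be registered. A minimal sketch, assuming the class above is compiled into a jar (the jar path and class name here are hypothetical):
-- hypothetical jar path and class name; adjust to your build
ADD JAR /path/to/name-finder-udf.jar;
CREATE TEMPORARY FUNCTION nameFinderUDF AS 'com.example.NameFinderUDF';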

Get incremental changes between Hive partitions

I have a nightly job that runs and computes some data in Hive. The output is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday: output/dt=2013-10-31
Today: output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I'm not sure what that looks like since it's the same table.
This is what it might look like if they were different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31' ) WHERE a.rank!=b.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this spans only one MapReduce job.
So I figured it out, using subqueries and joins:
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff.
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE:
UPDATE if a.rank is not null and b.rank is not null, i.e. the rank changed
DELETE if a.rank is null and b.rank is not null, i.e. the user is no longer ranked
ADD if a.rank is not null and b.rank is null, i.e. this is a new user
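A minimal sketch that folds this classification into the query itself with a CASE expression (same table and column names as above):
select coalesce(a.id, b.id) as id,
       case
         when a.rank is not null and b.rank is not null then 'UPDATE'
         when a.rank is null then 'DELETE'
         else 'ADD'
       end as change_type
from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null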

using SQL aggregate functions with JOINs

I have two tables, tool_downloads and tool_configurations, and I am trying to retrieve the most recent build date for each tool in my database. The layout of the DB is simple: tool_downloads keeps track of when a tool is downloaded, and tool_configurations stores the actual data about the tool. They are linked by tool_conf_id.
If I run the following query which omits dates, I get back 200 records.
SELECT DISTINCT a.tool_conf_id, b.tool_conf_id
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
When I try to add in date information I get back hundreds of thousands of records! Here is the query that fails horribly.
SELECT DISTINCT a.tool_conf_id, max(a.configured_date) as config_date, b.configuration_name
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
ORDER BY a.tool_conf_id
I know the problem has something to do with GROUP BYs/aggregate data and joins. I can't really search Google since I don't know the name of the problem I'm encountering. Any help would be appreciated.
The solution is to GROUP BY every non-aggregated column in the SELECT list:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name
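If the ordering from the original queries is still wanted, an ORDER BY can be appended after the GROUP BY:
SELECT b.tool_conf_id, b.configuration_name, max(a.configured_date) as config_date
FROM tool_downloads a
JOIN tool_configurations b
ON a.tool_conf_id = b.tool_conf_id
GROUP BY b.tool_conf_id, b.configuration_name
ORDER BY b.tool_conf_id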
