Performance Issues with SQLite joins

I've been working on building a database structure when I ran into a problem. I have three tables: A, B, and C. Tables B and C have a 1-1 relationship, while tables A and B have a many-1 relationship. I attempted to run the following query.
SELECT A.id
FROM A
INNER JOIN B ON A.B_id = B.id
INNER JOIN C ON B.id = C.B_id
LIMIT 0,40
The query never completed; the program ran for several seconds and then stopped responding. Seeing as this query will eventually need to return thousands of records, I was rather distraught that it didn't work even when limited to only 40 records. I then remembered that indices exist, so I created an index on each of the join criteria: one for A.B_id, one for B.id, and one for C.B_id. The result was a query that worked, and it kept working after I removed the LIMIT clause as well, so I proceeded to the next query.
SELECT A.id
FROM A
INNER JOIN B ON A.B_id = B.id
LEFT OUTER JOIN C ON B.id = C.B_id
LIMIT 0,40
Note that the only difference is that the second join is now a LEFT OUTER JOIN. I thought that since the keys are all the same, the indices should still speed this one up. I was incorrect: the query above completed, but it was rather slow. I removed the LIMIT clause, and the query didn't complete. I then removed the indices I had added previously and tried the limited statement again; it ran in about the same time.
The problem ended up being that without indices, the INNER JOIN query never completed and the LEFT OUTER JOIN query worked only when limited, though even then it was slow. With the indices, the INNER JOIN query completed as quickly as I need it to, but the LEFT OUTER JOIN query performed about the same as before.
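(For reference, the indices were along these lines; the index names below are illustrative, and B.id is assumed to be B's INTEGER PRIMARY KEY, as the query plans further down suggest, so it needs no separate index.)
CREATE INDEX a_bid ON A(B_id);
CREATE INDEX c_bid ON C(B_id);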
Table A has 200,000 records
Table B has 50,000 records
Table C has 10,000 records
The LEFT OUTER JOIN query above should produce 40,000 records
Does anyone know a way to speed this up? Am I missing an index or something that would increase the performance of the LEFT OUTER JOIN?
As requested, here is the output of EXPLAIN QUERY PLAN:
For the two INNER JOINs:
"0" "0" "2" "SCAN TABLE C USING COVERING INDEX c.bid (~1000000 rows)"
"0" "1" "1" "SEARCH TABLE B USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)"
"0" "2" "0" "SEARCH TABLE A USING COVERING INDEX a.bid (B_id=?) (~10 rows)"
For the INNER JOIN and OUTER JOIN:
"0" "0" "0" "SCAN TABLE A USING COVERING INDEX a.bid (~1000000 rows)"
"0" "1" "1" "SEARCH TABLE B USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)"
"0" "2" "2" "SCAN TABLE C USING COVERING INDEX c.bid (~100000 rows)"

Related

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation, it supports NOT IN subqueries in a WHERE clause, provided the subquery is uncorrelated (it does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get the error "FAILED: SemanticException Cartesian products are disabled for safety reasons."
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables);
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
I'm not sure why you would get that error. One workaround is to use NOT EXISTS:
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
                  FROM vegetables v
                  WHERE v.name = f.name)
or a LEFT JOIN:
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name IS NULL
You got a Cartesian join because that is what Hive does in this case. The vegetables table is very small (just a few rows) and is being broadcast to perform the cross join (most probably a map-join; check the plan). Hive does the cross (map) join first and then applies the filter. Explicit left-join syntax with a filter, as @VamsiPrabhala said, will force Hive to perform a left join, but in this case it works out the same, because the table is very small and the cross join does not multiply the rows.
Execute EXPLAIN on your query and you will see exactly what is happening.
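For example, using the query from the question:
EXPLAIN
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables);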

SQL Server Query Performance - Normal Join vs Subquery

I have two queries that return the same data.
Query 1, which uses a normal join, takes a long time to execute:
SELECT TOP 1000 bigtable.*, tbl1.name, tbl2.name
FROM bigtable
INNER JOIN tbl1 ON bigtable.id1 = tbl1.id1
INNER JOIN tbl2 ON tbl1.id1 = tbl2.id1
ORDER BY bigtable.id DESC
Query 2, which uses a subquery, returns fairly quickly:
SELECT subtable.*, tbl1.name, tbl2.name
FROM (SELECT TOP 1000 * FROM bigtable) subtable
INNER JOIN tbl1 ON subtable.id1 = tbl1.id1
INNER JOIN tbl2 ON tbl1.id1 = tbl2.id1
ORDER BY subtable.id DESC
bigtable contains 100k rows or so; tbl1 is a very small table (fewer than 10 rows). I would rather not use subqueries. If I skip the ORDER BY clause, both queries run quickly. I have tried adding indexes on the fields being joined, adding a descending index on id, etc., but nothing seems to help.
Any help is appreciated!
===> Update:
This turned out to be a non-issue. After creating another table similar to tbl1 with the same rows, I found that Query 1 ran in under a second with the copied table. Rebuilding the statistics on tbl1 fixed it.
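(For reference, a minimal way to rebuild the statistics in SQL Server; WITH FULLSCAN is optional and scans the whole table instead of sampling:)
UPDATE STATISTICS tbl1;
-- or
UPDATE STATISTICS tbl1 WITH FULLSCAN;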
I think that the two queries are not equivalent; try writing the second one as:
SELECT subtable.*, tbl1.name, tbl2.name
FROM (SELECT TOP 1000 * FROM bigtable ORDER BY bigtable.id DESC) subtable
INNER JOIN tbl1 ON subtable.id1 = tbl1.id1
INNER JOIN tbl2 ON tbl1.id1 = tbl2.id1
ORDER BY subtable.id DESC
I expect the expensive operation to be the ordering of the big table, which is now present in both versions.

Redshift - Efficient JOIN clause with OR

I have the need to join a huge table (10 million plus rows) to a lookup table (15k plus rows) with an OR condition. Something like:
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c OR t1.d = t2.d;
This is because table1 can have c or d as NULL, and I'd like to join on whichever is available, leaving out the rest. The query plan says there is a Nested Loop, which I realize is because of the OR condition. Is there a clean, efficient way of solving this problem? I'm using Redshift.
EDIT: I am trying to run this with a UNION, but it doesn't seem to be any faster than before.
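(For reference, the UNION rewrite mentioned in the edit would look roughly like this, assuming, as the question describes, that each table1 row has either c or d populated:)
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.c = t2.c
UNION ALL
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2 ON t1.d = t2.d
WHERE t1.c IS NULL;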
If you have a preferred column you can NVL() (aka COALESCE()) them and join on that.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t2.d)
FROM table1 t1
JOIN table2 t2
ON t1.c = NVL(t2.c,t2.d);
I'd also suggest setting the lookup table to DISTSTYLE ALL to ensure that the larger table is not redistributed.
[ Also, 10 million rows isn't big for Redshift. Not trying to be snotty just saying that we get excellent performance on Redshift even when querying (and joining) tables with hundreds of billions of rows. ]
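(For illustration only, using the table name from the question, the distribution style can be changed in place:)
ALTER TABLE table2 ALTER DISTSTYLE ALL;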
How about doing two (left) joins? With the small lookup table, performance shouldn't be too bad.
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d and t1.c is null
LEFT JOIN table2 t3 ON t1.c = t3.c and t1.d is null
Your original query only returns rows that match at least one of c or d in the lookup table. If that's not guaranteed, you may need to add filters, for example for rows in t1 where both c and d are NULL or have values not present in table2.
You don't really need the NULL checks in the joins, but they might make it slightly faster.
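(A sketch of such a filter, keeping only the rows that matched on at least one side; the names follow the query above:)
SELECT t1.a, t1.b, nvl(t1.c, t2.c), nvl(t1.d, t3.d)
FROM table1 t1
LEFT JOIN table2 t2 ON t1.d = t2.d AND t1.c IS NULL
LEFT JOIN table2 t3 ON t1.c = t3.c AND t1.d IS NULL
WHERE t2.d IS NOT NULL OR t3.c IS NOT NULL;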

select multiple columns from different tables and join in hive

I have a Hive table A with 5 columns; the first column (A.key) is the key, and I want to keep all 5 columns. I want to select 2 columns from B, say B.key1 and B.key2, and 2 columns from C, say C.key1 and C.key2. I want to join these tables with A.key = B.key1 and B.key2 = C.key1.
What I want is a new external table D with the following columns; the B.key2 and C.key2 values should be NULL if no match happened.
A.key, A_col1, A_col2, A_col3, A_col4, B.key2, C.key2
What would the correct Hive query be? I got a max split error on my initial try.
Does this work?
create external table D as
select A.key, A.col1, A.col2, A.col3, A.col4, B.key2, C.key2
from A left outer join B on A.key = B.key1 left outer join C on B.key2 = C.key1;
If not, could you post more info about the "max split error" you mentioned? Copy+paste specific error message text is good.

Get incremental changes between Hive partitions

I have a nightly job that runs and computes some data in Hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I'm not sure what that looks like since it's the same table.
This is what it might look like when there are two different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31') WHERE a.rank!=b.rank
But on the same table it would be:
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31') WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiency-wise, this would require only one MapReduce job.
So I figured it out, using subqueries and joins:
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff.
You can take the diff and figure out what you need to ADD/UPDATE/REMOVE (see the sketch after this list):
UPDATE if a.rank IS NOT NULL and b.rank IS NOT NULL, i.e. the rank changed
DELETE if a.rank IS NULL and b.rank IS NOT NULL, i.e. the user is no longer ranked
ADD if a.rank IS NOT NULL and b.rank IS NULL, i.e. this is a new user
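(Building on that list, a sketch that folds the classification into the query itself; coalesce and the CASE labels are my own additions, and the table name is kept from the query above:)
select coalesce(a.id, b.id) as id,
       case
         when a.rank is not null and b.rank is not null then 'UPDATE'
         when a.rank is null then 'DELETE'
         else 'ADD'
       end as change_type
from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN (select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null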
