Redshift performance tuning on a JOIN query

I'm having trouble with performance on the following query:
SELECT [COLUMNS] FROM TABLE A JOIN TABLE B ON [KEYS]
If I remove the join, leaving only the select, the query takes seconds. With the join, it takes 30 minutes.
Table sizes are A (844,082,912) & B (1,540,379,815) rows.
Distribution and sort keys are equivalent to the join KEYS.
Looking at the AWS graphs, I see (attached) one node that hits 100% CPU utilisation for a short time.
Looking at the system table (svv_diskusage), I am not sure what I see (attached), as it does not indicate (as far as I can tell) whether one node has much more data than the others.
If the issue is faulty distribution, how can I see it?
Is it something else?

At https://aws.amazon.com/articles/8341516668711341 (Uneven Distribution) you can see an example of the same graph style: one node is working harder than the others, which indicates your data is not evenly distributed.
Regarding svv_diskusage: it describes the values stored in each slice. If the slices are not used relatively evenly, that indicates a bad distribution key. Try the following query to get a higher-level view of distribution among nodes rather than slices:
select owner, host, diskno, used, capacity,
(used-tossed)/capacity::numeric *100 as pctused
from stv_partitions order by owner;
set search_path to '$user', 'public', 'ic';
select * from pg_table_def where tablename = '{TableNameHere}';
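Once you have a per-slice (or per-node) figure from those system tables, one rough way to judge evenness is to compare the fullest slice with the average. The sketch below is only an illustration: the function name skew_ratio and the sample counts are hypothetical, and the result is a plain ratio, not an official Redshift metric.

```python
def skew_ratio(rows_per_slice):
    """Ratio of the fullest slice to the average slice (1.0 means perfectly even)."""
    if not rows_per_slice:
        raise ValueError("need at least one slice")
    average = sum(rows_per_slice) / len(rows_per_slice)
    if average == 0:
        return 1.0  # every slice is empty, so nothing is skewed
    return max(rows_per_slice) / average


# Hypothetical per-slice row counts, for illustration only.
even_counts = [100, 101, 99, 100]
skewed_counts = [100, 100, 100, 700]

print(skew_ratio(even_counts))
print(skew_ratio(skewed_counts))
```

A ratio close to 1 suggests the distribution key spreads rows evenly; a ratio well above 1 means one slice holds far more than its share and will do far more of the join work.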

Related

How does one perform a "range query"?

Google Cloud Dataflow supports what I would call a "full outer join" SQL-like statement through its "CoGroupByKey" method. However, is there any way to implement in Dataflow what would in SQL be a "range join"? For example, suppose I had a table called "people" with a floating-point field called "age", and I wanted all the pairs of people whose ages were within, say, five years of each other. I could write the following statement:
select p1.name, p1.age, p2.name, p2.age
from people p1, people p2
where p1.age between (p2.age - 5.0) and (p2.age + 5.0);
I couldn't determine whether there is a way to accomplish this in Dataflow. (Again, if I wanted strict equality I could use a CoGroupByKey, but in this case it's not a strict equality condition.)
For my particular use case, the "people" table is not too large: maybe 500,000 rows and approximately 50 MB of RAM required. So I could, I think, simply call an asList() method to create a single object that sits in a single computer's RAM, sort the people by age, and then write some sort of routine that walks through the list from the lowest age to the highest and outputs those people whose ages are within five years of each other. This would work, but it would be single-threaded. I was wondering if there is a "better" way of doing it using the Dataflow architecture. (Other developers may need a "Dataflow" way of doing this operation if the object they are dealing with does not fit nicely into the memory of a single computer, e.g. a people table of maybe 1 billion rows.)
The trick to making this work efficiently at scale is to partition your data into sets of potential matches. In your case, you could assign each person to two different keys: age rounded up to a multiple of 5, and age rounded down to a multiple of 5. Then do a GroupByKey on these buckets, and emit all the pairs within each bucket that are actually close enough in age. You'll need to eliminate duplicates, since two records can both end up in the same two buckets.
With this solution, the entire data does not need to fit in memory, just each subset of the data.
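To make the bucketing idea concrete, here is a minimal single-machine sketch in plain Python (not Dataflow code). The function close_pairs and the sample names are hypothetical. One deliberate variation: instead of strict round-down/round-up, each person gets the key of their own bucket and of the next bucket, so an age that is exactly a multiple of 5 still receives two keys and a pair such as 20 and 25 is not missed. It also assumes names are unique and returns each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations


def close_pairs(people, width=5.0):
    """Return the set of (name, name) pairs whose ages differ by at most `width`.

    `people` is an iterable of (name, age) tuples with unique names.
    """
    buckets = defaultdict(list)
    for name, age in people:
        k = int(age // width)
        # Two keys per person: their own bucket and the next one up.
        buckets[k].append((name, age))
        buckets[k + 1].append((name, age))

    pairs = set()  # a set removes pairs that meet in two shared buckets
    for members in buckets.values():
        for (n1, a1), (n2, a2) in combinations(members, 2):
            if abs(a1 - a2) <= width:
                pairs.add(tuple(sorted((n1, n2))))
    return pairs


people = [("ann", 20.0), ("bob", 25.0), ("cat", 31.0), ("dan", 24.9), ("eve", 60.0)]
print(sorted(close_pairs(people)))
```

In a distributed setting, the first loop corresponds to emitting two keyed records per person, the grouping to GroupByKey, and the set to a final de-duplication step. Only one bucket at a time has to fit in memory.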

Does bucketing two *large* tables in Hive *in the same way* help perform much more efficient joins?

Imagine the following situation I am planning:
1. Have two rather large tables stored in Hive, both containing different types of customer-related information (say, although this is not exactly the case, a record of customer transactions in one and customer-owned data in the other). Let's call the tables A and B.
2. The tables are large in the sense that neither fits completely in memory. (There are 10 million customers and there are a few kilobytes of info associated with each of them in each of the two tables.)
3. Be careful to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), using the same number of buckets: 100.
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between the two tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are stored on the same set of nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes, and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different: the first essentially determines the structure of subdirectories, whereas the second only determines which file each record goes to. This question is about bucketing only.
Also, because of number 2 above, map-side joins are not an option here since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join entirely there.
Bucketing is used when the column you want to partition by has too many distinct values, or when there are no good candidate partition columns.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers. Each partition would contain a small amount of data, which is inefficient, and having that many partitions is also inefficient. However, you can hash the customerID and split the data into, for example, 50 buckets. Then when you are joining on customerID, the job only has to scan what is in the matching bucket rather than all of your data.
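The reason identical bucketing helps a join can be sketched in a few lines of Python. This is an illustration only: the names bucketize and bucketed_join are hypothetical, and a plain modulo stands in for whatever hash function the engine really applies. Because the same function assigns buckets in both tables, a given customer ID always lands in the same bucket number on both sides, so bucket i of one table only ever needs to meet bucket i of the other.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Group (customer_id, payload) rows by bucket number."""
    buckets = defaultdict(list)
    for customer_id, payload in rows:
        # Assumption: modulo stands in for the real hash function.
        buckets[customer_id % num_buckets].append((customer_id, payload))
    return buckets


def bucketed_join(a_rows, b_rows, num_buckets):
    """Inner join on customer_id, comparing only matching bucket numbers."""
    a_buckets = bucketize(a_rows, num_buckets)
    b_buckets = bucketize(b_rows, num_buckets)
    joined = []
    for bucket_no in range(num_buckets):
        # Index one side of this bucket by key, then probe with the other.
        lookup = defaultdict(list)
        for customer_id, payload in b_buckets.get(bucket_no, []):
            lookup[customer_id].append(payload)
        for customer_id, payload in a_buckets.get(bucket_no, []):
            for other in lookup.get(customer_id, []):
                joined.append((customer_id, payload, other))
    return joined


a = [(1, "a1"), (5, "a5"), (102, "a102"), (7, "a7")]
b = [(5, "b5"), (102, "b102"), (102, "b102x"), (8, "b8")]
print(sorted(bucketed_join(a, b, num_buckets=4)))
```

Each iteration of the outer loop is independent of the others, which is what lets an engine hand each bucket pair to a separate task.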
With ideal bucketing, each bucket should contain some multiple of the file system block size. Remember also that too many buckets, or buckets built over variables that are not used as keys, can be detrimental to other queries.
I have used them when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to use them only when my data is very big, and big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the bucketed partitions are of similar size. If you partition over State, for example, California will have huge partitions while other states have very small ones.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
When both tables are bucketed and sorted the same way, they can be joined with a merge, which works in linear time, O(n). Otherwise the tables have to be sorted the same way first, which is usually O(n log n).
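As a rough illustration of why pre-sorted inputs make the join linear, here is a minimal merge join in Python. The function merge_join and the sample rows are hypothetical, and both inputs are assumed to be lists of (key, value) tuples already sorted by key. Each list is walked once from start to end; the only extra work is pairing up rows inside a run of equal keys.

```python
def merge_join(left, right):
    """Inner join of two lists of (key, value) tuples, both sorted by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of rows with this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every combination within the two runs.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return joined


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(left, right))
```

If the inputs were not already sorted, a sort of each side would have to run first, which is where the n log n cost comes from.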

What is the default MapReduce join used by Apache Hive?

What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper or in the Hive wiki page on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
The 'default' join would be the shuffle join, also known as the common join. See JoinOperator.java. It relies on the M/R shuffle to partition the data, and the join is done on the reduce side. Because the shuffle copies the full data set, it is slow.
A much better option is the MapJoin, see MapJoinOperator.java. This works if you have only one big table and one or more small tables to join against (e.g. a typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache, and then the M/R job is launched, which only needs to split one table (the big table). It is much more efficient than the shuffle join, but requires the small table(s) to fit in the memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use MapJoin, but it depends on your configuration.
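The core of the map-side approach can be sketched in Python as a hash join: build an in-memory lookup from the small table once, then stream the big table past it. This is an illustration of the idea only, not Hive's implementation; the function map_join and the sample rows are hypothetical.

```python
def map_join(big_rows, small_rows):
    """Inner join that holds only the small table in memory.

    Both inputs are iterables of (key, value) tuples. The big table is
    streamed row by row and never needs to be sorted or shuffled.
    """
    # Build phase: hash table over the small table.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    # Probe phase: one pass over the big table.
    for key, value in big_rows:
        for small_value in lookup.get(key, ()):
            yield key, value, small_value


big = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
small = [(1, "x"), (3, "y"), (3, "z")]
print(list(map_join(big, small)))
```

The memory requirement mentioned above shows up here directly: lookup must hold the whole small table, while the big table is consumed lazily.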
A specialized join is the bucket-sort-merge join, also known as SMBJoin, see SMBJoinOperator.java. This works if you have 2 big tables whose bucketing matches on the join key. The M/R job splits can then be arranged so that a map task gets only splits from the two big tables that are guaranteed to overlap on the join key, so the map task can use a hash table to do the join.
There are more details, like skew join support and fallback in out-of-memory conditions, but this should get you started on investigating your needs.
A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast and a presentation from 2011 is a bit outdated.
Do an explain on the Hive query and you can see the execution plan.

Reducers stopped working at 66.68% while running HIVE Join query

I am trying to join 6 tables, each with approximately 5 million rows, on an account number that is sorted in ascending order in all tables. The map tasks finish successfully, but the reducers stop at 66.68%. I tried increasing the number of reducers and also tried set hive.auto.convert.join = true; set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L; but the result is the same. With a small number of records (around 5,000 rows) the query works well.
Please suggest what can be done here to make it work.
Reducers start doing the actual reduce at 66% (0-33% is shuffle, 33-66% is sort). In a Hive join, the reducer performs a Cartesian product, for each join key, of the matching rows from the data sets.
I'm going to guess that there is at least one foreign key that is appearing frequently in all of the data sets. Watch for NULL and default values.
For example, in a join, imagine the key "abc" appears ten times in each of the six tables (10^6). That's a million output records for that one key. If "abc" appears 1000 times in one table, 1000 in another, 1000 in another, then twice in the other three tables, you get 8 billion records (1000^3 * 2^3). You can see how this gets out of hand. I'm guessing there is at least one key that is resulting in a massive number of output records.
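The arithmetic above generalises: for an inner join on one key across several tables, the number of output rows for a key is the product of how often that key appears in each table. The sketch below counts this per key so the hot keys stand out. The function rows_per_key is hypothetical, and each table is represented simply as a list of its join-key values.

```python
from collections import Counter
from math import prod


def rows_per_key(tables):
    """Output rows each key produces in an inner join across all tables.

    `tables` is a list of lists; each inner list holds the join-key value
    of every row in one table.
    """
    counts = [Counter(keys) for keys in tables]
    # Only keys present in every table survive an inner join.
    shared = set(counts[0]).intersection(*counts[1:])
    return {key: prod(c[key] for c in counts) for key in shared}


# "abc" appears ten times in each of six tables.
print(rows_per_key([["abc"] * 10 for _ in range(6)]))

# 1000 times in three tables, twice in the other three.
skewed = [["abc"] * 1000] * 3 + [["abc"] * 2] * 3
print(rows_per_key(skewed))
```

Running the same count over the real key columns (including NULL and default values) and sorting by the result is a cheap way to find the key that is stalling the reducer.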
Avoiding this is good practice in any RDBMS, not just Hive. Doing multiple inner joins across many-to-many relationships can get you into a lot of trouble.
For debugging this now, and in the future, you could use the JobTracker to find and examine the logs for the reducer(s) in question. You can then instrument the reduce operation to get a better handle on what's going on. Be careful you don't blow it up with logging, of course!
Try looking at the number of records input to the reduce operation for example.

Optimizing JOINs : comparison with indexed tables

Let's say we have a time-consuming query, described below:
(SELECT ...
FROM ...) AS FOO
LEFT JOIN (
SELECT ...
FROM ...) AS BAR
ON FOO.BarID = BAR.ID
Let's suppose that
(SELECT ...
FROM ...) AS FOO
Returns many rows (let's say 10 M). Every single row has to be joined with data in BAR.
Now let's say we insert the result of
(SELECT ...
FROM ...) AS BAR
into a table, and add the ad hoc index(es) to it.
My question:
How would the performance of the "JOIN" with a live query differ from the performance of the "JOIN" to a table containing the result of that live query, to which ad hoc indexes have been added?
Another way to put it:
If a JOIN is slow, would there be any gain in actually storing and indexing the table we JOIN to?
The answer is 'Maybe'.
It depends on the statistics of the data in question. The only way you'll find out for sure is to actually load the first query into a temp table, stick a relevant index on it, then run the second part of the query.
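The experiment described above can be sketched with Python's built-in sqlite3 module. Everything here is a stand-in: the tables foo_src and bar_src, their columns, and the sample rows are hypothetical, and SQLite is used only because it runs anywhere. The steps are the same on other systems: store the subquery result, index the join column, then join against the stored table.

```python
import sqlite3


def materialise_and_join():
    """Store the BAR subquery in a table, index it, then join FOO against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE foo_src (id INTEGER, bar_id INTEGER);
        CREATE TABLE bar_src (id INTEGER, label TEXT);
        INSERT INTO foo_src VALUES (1, 10), (2, 20), (3, 99);
        INSERT INTO bar_src VALUES (10, 'ten'), (20, 'twenty');

        -- Step 1: store the result of the BAR subquery in its own table.
        CREATE TABLE bar_mat AS SELECT id, label FROM bar_src;
        -- Step 2: index the column used in the join condition.
        CREATE INDEX bar_mat_id ON bar_mat (id);
    """)
    # Step 3: run the outer query against the stored, indexed table.
    rows = conn.execute("""
        SELECT f.id, b.label
        FROM foo_src AS f
        LEFT JOIN bar_mat AS b ON f.bar_id = b.id
        ORDER BY f.id
    """).fetchall()
    conn.close()
    return rows


print(materialise_and_join())
```

To compare the two variants on real data, time this version against the original query with the subquery inline, and look at the execution plan of each to confirm whether the index is actually used.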
I can tell you that if speed is what you want, and it's possible for you to load the results of your first query permanently into a table, then of course your query is going to be quicker.
If you want it to be even faster, depending on which DBMS you are using you could consider creating an index which crosses both tables - if you're using SQL Server they're called 'Indexed Views' or you can also look up 'Reified indexes' for other systems.
Finally, if you want the ultimate in speed, consider denormalising your data and eliminating the join that is occurring on the fly - basically you move the pre-processing (the join) offline at the cost of storage space and data consistency (your live table will be a little behind depending on how frequently you run your updates).
I hope this helps.

Resources