I'm trying to process this query.
SELECT
r.src, r.dst, ROUND(r.price/50)*50 AS pb, COUNT(*) AS results
FROM [search.interesting_routes] ovr
LEFT JOIN [search.search_results2] r ON ovr.src=r.src AND ovr.dst=r.dst
WHERE DATE(r.saved_at) >= '2015-10-01' AND DATE(r.saved_at) <= '2015-10-01' AND r.price < 20000
GROUP BY pb, r.src, r.dst
ORDER BY pb
The table search_results2 contains a huge amount of search results with prices for routes (a route is defined by src and dst).
I need to count all records in search_results2 for each record in interesting_routes for different price buckets.
The query works fine on a small sample of data, but once the data is huge it fails with:
Error: Shuffle reached broadcast limit for table __I0 (broadcasted at
least 176120970 bytes). Consider using partitioned joins instead of
broadcast joins.
I'm having difficulty rewriting the SELECT to use the suggested partitioned join, or at least getting the result some other way.
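A possible direction, untested: in legacy BigQuery SQL a partitioned (shuffled) join is requested with JOIN EACH, and large aggregations analogously with GROUP EACH BY. A minimal sketch of the same query rewritten that way:
SELECT
  r.src, r.dst, ROUND(r.price/50)*50 AS pb, COUNT(*) AS results
FROM [search.interesting_routes] ovr
-- JOIN EACH shuffles both sides on the join keys instead of
-- broadcasting the right-hand table to every worker
LEFT JOIN EACH [search.search_results2] r
ON ovr.src = r.src AND ovr.dst = r.dst
WHERE DATE(r.saved_at) >= '2015-10-01'
  AND DATE(r.saved_at) <= '2015-10-01'
  AND r.price < 20000
GROUP EACH BY pb, r.src, r.dst
ORDER BY pb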
I recently started using SAS, having received only basic training that didn't cover PROC SQL. I'd like to read up a bit more on SAS SQL when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I reused a previously written snippet of code that I thought I understood, adding a couple more variables and GROUP BY statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When the commented-out 'select distinct x.subject from' in the CREATE TABLE statement is uncommented, the above code works as it should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
This is similar, but didn't require the additional SELECT statement: the first code has three SELECT statements, while the second needs only two.
Can someone explain why the extra SELECT is required in the first case?
Could you also link me some good documentation that lists the specifications of these types of joins, and tell me the specific name of the join where you only use a comma?
While writing this I also realize I could have taken the code I initially wrote to find subjects with only one type of record and tweaked it for my current issue, but I'd still like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today, and the following construct is recommended:
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a Cartesian product, and the number of rows is the product of the number of rows in the cross-joined tables.
When the query has criteria (a WHERE clause), the join is effectively an INNER JOIN.
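For example, the following two PROC SQL queries should produce the same result; the table and column names are illustrative:
proc sql;
  /* comma (cross) join restricted by a WHERE clause --
     with criteria this behaves as an inner join */
  select x.subject
  from one x, two y
  where x.subject = y.subject;

  /* the same result with explicit join syntax */
  select x.subject
  from one x
  inner join two y
  on x.subject = y.subject;
quit;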
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip: if you want to fool around (fiddle) with SQL queries in a browser, try the SQL Fiddle website.
I have a Stream Analytics job with
INPUTS:
1) "InputStreamCSV" - linked to an Event Hub, receiving incoming data.
2) "InputStreamHistory" - an input stream linked to Blob Storage.
OUTPUTS:
1) "AlertOUT" - linked to Table Storage, inserting an alarm event as a row in the table.
I want to calculate the AVERAGE amount of all transactions for the year 2018 (one number, e.g. 5.2) and compare it with each transaction coming in during 2019:
If new transaction amount is bigger than average - put that transaction in "AlertOUT" output.
I am calculating average as :
SELECT AVG(Amount) AS TresholdAmount
FROM InputStreamHistory
group by TumblingWindow(minute, 1)
Receiving a new transaction as:
SELECT * INTO AlertOUT FROM InputStreamCSV TIMESTAMP BY EventTime
How can I combine these two queries to check whether a new transaction's amount is bigger than the average transaction amount for last year?
You can use the JOIN operator in ASA SQL; the query below shows one way to combine the two queries (jsoninput and jsoninput2 stand in for your InputStreamCSV and InputStreamHistory inputs):
WITH
t2 AS
(
SELECT AVG(Amount) AS TresholdAmount
FROM jsoninput2
group by TumblingWindow(minute, 1)
)
select t2.TresholdAmount
from jsoninput t1 TIMESTAMP BY EntryTime
JOIN t2
ON DATEDIFF(minute,t1,t2) BETWEEN 0 AND 5
where t1.Amount > t2.TresholdAmount
If the history data is stable, you could also join the history data as reference data. Please refer to the official sample.
If you are comparing last year's average with the current stream, it would be better to use reference data. Compute the 2018 averages into a storage blob, using either ASA itself or a different query engine; you can then use the blob as reference data in the ASA query, replacing the average computation in your example.
After that you can do a reference-data join with InputStreamCSV to produce alerts.
Even if you want to update the averages once in a while, the above pattern works. Based on the refresh frequency, you can use either another ASA job or a batch analytics solution.
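A minimal sketch of that reference-data join, assuming a hypothetical reference input named AvgRef whose blob holds a single row with the columns JoinKey (a constant 1) and ThresholdAmount (the precomputed 2018 average):
WITH Tagged AS
(
    -- tag each live event with a constant key so it can be joined
    -- to the single-row reference data set
    SELECT *, 1 AS JoinKey
    FROM InputStreamCSV TIMESTAMP BY EventTime
)
SELECT Tagged.EventTime, Tagged.Amount
INTO AlertOUT
FROM Tagged
JOIN AvgRef R ON Tagged.JoinKey = R.JoinKey
WHERE Tagged.Amount > R.ThresholdAmount
Note that a reference-data join needs no DATEDIFF window, since reference data is not a stream.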
I am facing an issue executing a bucketed map join.
I am using Hive 0.10.
Table1 is partitioned by year, month, and day. Each partition's data is bucketed by column c1 into 128 buckets. I have almost 100 million records per day.
Table 1
create table table1
(
....
....
)
partitioned by (year int,month int,day int)
CLUSTERED BY(c1) INTO 128 BUCKETS;
Table2 is a large lookup table bucketed on column c1. I have 80 million records loaded into 128 buckets.
Table 2
create table table2
(
c1
c2
...
)
CLUSTERED BY(c1) INTO 128 BUCKETS;
I have checked the data, and it is loaded into the buckets as expected.
Now I am trying to enforce a bucketed map join, and that's where I am stuck:
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.mapjoin.bucket.cache.size=1000000;
select a.c1 as c1_tb2, a.c2,
b.c1, b....
from table2 a
JOIN table1 b
ON (a.c1=b.c1);
I am still not getting a bucketed map join. Am I missing something? I even tried executing the join on only one partition, but I still get the same result.
Or does bucketed map join not work with partitioned tables?
Please help. Thanks.
This explanation is for Hive 0.13. AFAICT, bucketed map join doesn't take effect for auto-converted map joins. You will need to explicitly call out the map join in the syntax, like this:
set hive.optimize.bucketmapjoin = true;
explain extended select /*+ MAPJOIN(b) */ count(*)
from nation_b1 a
join nation_b2 b on (a.n_regionkey = b.n_regionkey);
Note that only explain extended shows you the flag that indicates if bucket map join is being used or not. Look for this line in the plan.
BucketMapJoin: true
Tables are bucketed in Hive so that portions of the data can be managed and processed individually, which makes processing easier to manage and more efficient.
Let's understand the join when the data is stored in buckets:
Say there are two tables, user and user_visits, and both tables' data is bucketed by user_id into 4 buckets. This means bucket 1 of user will contain rows with the same set of user ids as bucket 1 of user_visits. If a join is performed on these two tables on the user_id column, and it is possible to send bucket 1 of both tables to the same mapper, a good amount of optimization can be achieved. This is exactly what happens in a bucketed map join.
Prerequisites for a bucket map join:
The tables being joined are bucketized on the join columns, and
the number of buckets in one table is the same as, or a multiple of, the number of buckets in the other table.
The buckets can be joined with each other if the tables being joined are bucketized on the join columns. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched: for the query above, the mapper processing bucket 1 of A will only fetch bucket 1 of B. This is not the default behavior; it is governed by the following parameter:
set hive.optimize.bucketmapjoin = true
If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for a given bucket of A will traverse the corresponding bucket of B. This is not the default behavior, and the following parameters need to be set:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
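For completeness, a sketch of DDL that satisfies the sort-merge prerequisites (table and column names are illustrative):
-- both tables bucketed AND sorted on the join key, with the same
-- number of buckets, so a sort-merge bucket join is possible
CREATE TABLE a (key INT, value STRING)
CLUSTERED BY (key) SORTED BY (key ASC) INTO 4 BUCKETS;

CREATE TABLE b (key INT, value STRING)
CLUSTERED BY (key) SORTED BY (key ASC) INTO 4 BUCKETS;

-- on these Hive versions, bucketing and sorting must also be
-- enforced at load time when populating via INSERT ... SELECT
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;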
I have a table that gets populated every day with records from reporting systems.
I have a list of the serial numbers I am interested in returning in an asset list.
How do I get Grails to return the records that match the maximum "epoch" entry for each asset? In SQL I would join the table back to a grouped copy of itself after picking out the maximum, such as:
select a.*
from assetTable a
inner join (select sn, max(epoch) epoch from assetTable group by sn) b
on a.sn = b.sn and a.epoch = b.epoch
but I cannot figure out how to get this done efficiently with Grails...
From a domain class perspective it is pretty simple. Consider, for the sake of example, that I have a single domain class "AssetTable" with an Integer epoch, a String sn, ...
Literally, all I want to do is get the latest entry (all fields) for a subset of serial numbers (sn) that I have in a List.
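One way to express this in Grails without dropping to raw SQL is an HQL query via executeQuery. A rough sketch, assuming the AssetTable domain class above and the list of serial numbers bound as the named parameter :sns (the parameter name is illustrative):
select a
from AssetTable a
where a.sn in (:sns)
and a.epoch = (select max(b.epoch) from AssetTable b where b.sn = a.sn)
Run as something like AssetTable.executeQuery(hql, [sns: snList]); the correlated subquery plays the role of the grouped self-join in the SQL version.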
Suppose you want to find the last record entered into the database (highest ID) matching a string: Model.where(:name => 'Joe'). There are 100,000+ records, and many matches (say thousands).
What is the most efficient way to do this? Does PostgreSQL need to find all the records, or can it just find the last one? Is this a particularly slow query?
Working in Rails 3.0.7, Ruby 1.9.2 and PostgreSQL 8.3.
The important part here is to have a matching index. You can try this small test setup:
Create schema x for testing:
-- DROP SCHEMA x CASCADE; -- to wipe it all for a retest or when done.
CREATE SCHEMA x;
CREATE TABLE x.tbl(id serial, name text);
Insert 10000 random rows:
INSERT INTO x.tbl(name) SELECT 'x' || generate_series(1,10000);
Insert another 10000 rows with repeating names:
INSERT INTO x.tbl(name) SELECT 'y' || generate_series(1,10000)%20;
Delete random 10% to make it more real life:
DELETE FROM x.tbl WHERE random() < 0.1;
ANALYZE x.tbl;
Query can look like this:
SELECT *
FROM x.tbl
WHERE name = 'y17'
ORDER BY id DESC
LIMIT 1;
--> Total runtime: 5.535 ms
CREATE INDEX tbl_name_idx on x.tbl(name);
--> Total runtime: 1.228 ms
DROP INDEX x.tbl_name_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id);
--> Total runtime: 0.053 ms
DROP INDEX x.tbl_name_id_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id DESC);
--> Total runtime: 0.048 ms
DROP INDEX x.tbl_name_id_idx;
CREATE INDEX tbl_name_idx on x.tbl(name);
CLUSTER x.tbl using tbl_name_idx;
--> Total runtime: 1.144 ms
DROP INDEX x.tbl_name_idx;
CREATE INDEX tbl_name_id_idx on x.tbl(name, id DESC);
CLUSTER x.tbl using tbl_name_id_idx;
--> Total runtime: 0.047 ms
Conclusion
With a fitting index, the query performs more than 100x faster.
Top performer is a multicolumn index with the filter column first and the sort column last.
Matching sort order in the index helps a little in this case.
Clustering helps with the simple index, because many rows still have to be read from the table, and after clustering they sit in adjacent blocks. It doesn't help with the multicolumn index in this case, because only one row has to be fetched from the table.
Read more about multicolumn indexes in the manual.
All of these effects grow with the size of the table. 10000 rows of two tiny columns is just a very small test case.
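To see which plan the planner actually picks, you can run the test query under EXPLAIN ANALYZE after each index change:
EXPLAIN ANALYZE
SELECT *
FROM x.tbl
WHERE name = 'y17'
ORDER BY id DESC
LIMIT 1;
-- with the (name, id DESC) index you should see a plain index scan;
-- with the single-column index, a sort step remains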
You can put the query together in Rails and the ORM will write the proper SQL:
Model.where(:name => 'Joe').order('id DESC').first
This should not result in retrieving all Model records, nor even a table scan.
This is probably the easiest:
SELECT [columns] FROM [table] WHERE [criteria] ORDER BY [id column] DESC LIMIT 1
Note: Indexing is important here. A huge DB will be slow to search no matter how you do it if you're not indexing the right way.
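Concretely, echoing the conclusion further above: a composite index with the filter column first and the sort column last serves the generic query directly. A sketch with hypothetical names:
-- hypothetical table and column names
CREATE INDEX models_name_id_idx ON models (name, id DESC);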