UNION ALL performance on serverless SQL in Synapse is poor

We are moving a database to a delta lake and have copied the tables to Delta tables in Azure Data Lake Storage Gen2.
The users access the data through a serverless SQL pool in Synapse.
We have run into a showstopper.
When running queries like these:
select top 1000
col1
,col2
,...
,col50
from tablea
select top 1000
col1
,col2
,...
,col60
from tableb
each takes about 5 seconds at most. According to the log, the data processed by each is about 100 MB.
When running this query:
select top 1000 * from
(
select
col1
,col2
,...
,col50
from tablea
union all
select top 1000
col1
,col2
,...
,col50
from tableb
) a
it takes minutes, and the log shows memory use of hundreds of gigabytes.
Naively, I would expect the optimizer to just return the first 1000 rows from the first table in the UNION ALL, but for some reason it processes all of the data before returning the first 1000 rows.
The only solution we can see is to create a new set of tables in the delta lake with the tables already unioned, but this seems like a waste of space.
We have separate tables because the dimensionality is slightly different and the measures are wholly different.
Perhaps 80% of the time, the users will only query rows from one of the tables.
Is there a way to optimize union all in serverless queries?
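The behaviour the asker expects (stop reading as soon as 1000 rows are available) can be modelled in Python with lazy iterators. This is only a sketch of that expectation: the `scan` helper and the row counts are hypothetical, and it says nothing about how the Synapse serverless engine actually plans or executes the query.

```python
from itertools import chain, islice

def scan(name, n_rows, counter):
    """Yield rows lazily and count how many were actually read."""
    for i in range(n_rows):
        counter[name] += 1
        yield (name, i)

# Hypothetical table sizes, for illustration only.
reads = {"tablea": 0, "tableb": 0}

# UNION ALL modelled as lazy concatenation, TOP 1000 as an early stop.
union_all = chain(
    scan("tablea", 1_000_000, reads),
    scan("tableb", 1_000_000, reads),
)
top_1000 = list(islice(union_all, 1000))
```

In this model only the first 1000 rows of `tablea` are read and `tableb` is never touched, which is the short-circuit the question hopes the optimizer would apply.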

Related

How can I filter zeros from a Google Sheets Query Function across two tabs?

I have data across two tabs (May22 and Jun22).
Both have identical headings (A to E).
I can amalgamate the two tabs using:
=QUERY({May22!A:E;Jun22!A:E},"select * where Col3 is not NULL",1)
Col3 (C) contains the Amounts, a lot of which are 0.
How can I modify my query above to filter the zeros from column C inside the query? This will dramatically reduce the size of my amalgamated data.
Thanks
Tom
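The filtering logic being asked for (stack both tabs, then drop rows whose third column is empty or zero) can be sketched in Python. The sample rows and the `stack_and_filter` helper are hypothetical; in the QUERY string itself this would presumably correspond to an additional condition such as `and Col3 <> 0`.

```python
# Hypothetical rows standing in for May22!A:E and Jun22!A:E (header row omitted).
may22 = [
    ["2022-05-01", "Rent", 0, "x", "y"],
    ["2022-05-02", "Sales", 125.5, "x", "y"],
    ["2022-05-03", "Misc", None, "x", "y"],
]
jun22 = [
    ["2022-06-01", "Rent", 800, "x", "y"],
    ["2022-06-02", "Sales", 0, "x", "y"],
]

def stack_and_filter(*tabs, amount_index=2):
    """Stack the tabs vertically and keep rows with a non-null, non-zero amount."""
    stacked = [row for tab in tabs for row in tab]
    return [
        row for row in stacked
        if row[amount_index] is not None and row[amount_index] != 0
    ]

filtered = stack_and_filter(may22, jun22)
```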

Google Sheets - Query Pivot - show all results

In Google Sheets I have a pivot table whose columns are text day ranges (1-30, 31-60, 61-90, 91-120, >120), and each record falls under one of those day ranges.
This is sample data:
| Unique | Account Doc | Amount | Day Range |
| 1 | 123456 | 1000 | 1-30 |
| 2 | 561530 | 2000 | >120 |
| 3 | 123456 | 1500 | 61-90 |
| 4 | 25106 | 3000 | 1-30 |
I can get this data to pivot using standard pivot tables, but users need it to be clean, without the pivot table buttons and formatting. I am trying to convert it to the Google QUERY function but I am stumped. I could try pivoting first and then pulling the pivot into a query to remove the formatting, but that seems redundant, and there are a lot of other things happening in my sheet, so I am afraid of making updates slow.
The end result would look like the table below, where the day ranges are pivoted and the amount is shown for each record.
All columns need to be preserved: every day-range column should appear even when its Amount values are null/zero, as with columns 31-60 and 91-120, which have no results. I use a unique id to ensure that all records come back as some of the
I can get the query to pivot with:
=query(rawdata,"Select Z,B,E,D,F,H,J, Sum(N) where B="&$C$31&" Group by Z,B,E,D,F,H,J pivot AA order by F",1)
However if the filter on B only has some day ranges and not others it will only show those columns with data.
Link to google sheet with sample data:
https://docs.google.com/spreadsheets/d/1seX4T3M8Mo9eVZYteyAbUG2zmWM9VCZ2-6oDd76QMA8/edit?usp=sharing
Result Query:
| Unique | Account Doc | 1-30 | 31-60 | 61-90 | 91-120 | >120 |
| 1 | 123456 | 1000 | | | | |
| 2 | 561530 | | | | | 2000 |
| 3 | 123456 | | | 1500 | | |
| 4 | 25106 | 3000 | | | | |
try:
=ARRAYFORMULA(QUERY(QUERY({'Copy of raw data'!A:AA; IFERROR(VLOOKUP(
SEQUENCE(COUNTA(UNIQUE('Copy of raw data'!B2:B)&TRANSPOSE(UNIQUE('Copy of raw data'!AA2:AA)))), {
SEQUENCE(COUNTA(UNIQUE('Copy of raw data'!B2:B)&TRANSPOSE(UNIQUE('Copy of raw data'!AA2:AA)))),
SPLIT(FLATTEN(UNIQUE('Copy of raw data'!B2:B)&"×"&TRANSPOSE(UNIQUE('Copy of raw data'!AA2:AA))), "×"),
TO_TEXT(SPLIT(FLATTEN(UNIQUE('Copy of raw data'!B2:B)&"×"&TRANSPOSE(UNIQUE('Copy of raw data'!AA2:AA))), "×"))},
{0, 2, SEQUENCE(1, 24, 0, 0), 5}, 0))},
"select Col26,Col2,Col5,Col4,Col6,Col8,Col10,sum(Col14)
where 1=1 "&IF(B1="",, " and Col2="&B1)&"
group by Col26,Col2,Col5,Col4,Col6,Col8,Col10
pivot Col27
order by Col6", 1), "where Col1 is not null", 1))
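The underlying idea, pivoting against a fixed list of day ranges so that empty ranges still get a column, can be sketched in Python using the sample data from the question. The `pivot_all_ranges` helper is hypothetical and is not a translation of the formula above.

```python
# Fixed column set: every range appears even if no record falls into it.
DAY_RANGES = ["1-30", "31-60", "61-90", "91-120", ">120"]

# Sample data from the question: (Unique, Account Doc, Amount, Day Range).
rows = [
    (1, 123456, 1000, "1-30"),
    (2, 561530, 2000, ">120"),
    (3, 123456, 1500, "61-90"),
    (4, 25106, 3000, "1-30"),
]

def pivot_all_ranges(records, ranges=DAY_RANGES):
    """Sum Amount per (Unique, Account Doc), with one cell per known range."""
    table = {}
    for unique_id, account, amount, day_range in records:
        cells = table.setdefault((unique_id, account), {r: None for r in ranges})
        cells[day_range] = (cells[day_range] or 0) + amount
    return [
        [uid, acct] + [cells[r] for r in ranges]
        for (uid, acct), cells in table.items()
    ]

pivoted = pivot_all_ranges(rows)
```

Because the columns come from `DAY_RANGES` rather than from the data, filtering the records down to a single account would not remove any of the range columns.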

Find frequency of words in a column in Google Sheets and lookup another value from a different column using formulae

I have 2 columns of data in a Google Sheet. Column1 is unique words or sentences (words are repeated in sentences) and the Column2 is a numeric value next to each (say votes). I am trying to get a list of unique words from Column1 and then the sum of values (votes) from Column2 when the word was present either on its own or in a sentence.
Here is a sample of the data I am working with in Google Sheets:
Term Votes
apple 20
apple eat 100
orange 30
orange rules 40
rule why 50
This is what the end result looks like:
Word Votes
apple 120
eat 100
orange 70
rules 40
rule 50
why 50
The way I am doing it now is quite long and I am not sure if this is the best solution.
Here's my solution:
JOIN values in Column1 using a delimiter " " and then SPLIT them using the same delimiter and then TRANSPOSE them into a column all in one step. This way I have a list of all the words used in Column1 in say Column3.
In Column4 pull out all the UNIQUE values and then do a COUNTIF for the unique values from Column3. This way I am able to get the frequency of each unique word by referencing the list of all words.
In order to get the sum of Votes I have to TRANSPOSE Column4 and then QUERY Column1 and Column2 by using dynamic text in the formula. The formula looks like =QUERY(Column1:Column2, "SELECT SUM(Column2) WHERE Column1 CONTAINS '" & referenceToUniqueWord & "'", 1). The reason I have to transpose first is because the query formula outputs 2 cells of data ie Text: sumColumn1 and Number: 'sum of votes'. Since for one cell of unique word I get two cells of data I am not able to drag the formula down and hence I have to do it horizontally.
I finally get three rows of data after the last step:
One row is just transposed Column4 (all the unique words). Second row is just the text sumColumn2 from using the QUERY formula. And third row is the actual sum of votes, resulting from individual QUERY formulae. I then transpose these rows to columns and to get my final table I VLOOKUP the frequency values arrived at earlier.
This approach is lengthy and prone to errors. It also doesn't work if the list is large, because the initial JOIN fails with an error that the 50,000 limit was reached. Any ideas to make it better are welcome. I know this can be done much more easily using Scripts, but I'd prefer to have it done using only formulae.
try:
=ARRAYFORMULA(QUERY(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(
IF(IFERROR(SPLIT(A:A, " "))="",,"♠"&SPLIT(A:A, " ")&"♦"&B:B)
,,999^99)),,999^99), "♠")), "♦"),
"select Col1,sum(Col2)
group by Col1
order by sum(Col2) desc
label sum(Col2)''"))
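For comparison, the same split-then-aggregate idea can be sketched in Python with the sample data from the question. The `votes_per_word` helper is hypothetical; it matches whole words, so "rule" and "rules" are counted separately, as in the expected result.

```python
from collections import defaultdict

# Sample data from the question: (Term, Votes).
data = [
    ("apple", 20),
    ("apple eat", 100),
    ("orange", 30),
    ("orange rules", 40),
    ("rule why", 50),
]

def votes_per_word(rows):
    """Split each term into words and add the term's votes to every word."""
    totals = defaultdict(int)
    for term, votes in rows:
        for word in term.split():
            totals[word] += votes
    # Highest totals first, mirroring "order by sum(Col2) desc".
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

result = votes_per_word(data)
```

Note that a word repeated within a single term would have that term's votes added once per occurrence in this sketch.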

Merging two data sets in order to add default values for missing data

I'm trying to merge two datasets in order to insert default rows for missing data. The use case is that I have a list of dates and attendance numbers for training sessions on those dates, but if I have no records at all for a training session then it's missing from the list.
In my sheet at the moment I have a two column set of dates and attendance numbers, and in another sheet I have worked out all the Wednesdays and Fridays (training days) between the start and end dates of all the sessions we have data for.
Is there a way to merge the two datasets together so that the zero attendance for each session is the base set and then I merge in the rows for which I have data? I've tried using some of the query command but if I specify two datasets using {Sheet1!A1:A,Sheet2!B1:B} I get array errors.
The attendance information is currently gathered with a query like this:
=QUERY({Records!A2:B}, "SELECT Col1, COUNT(Col2) WHERE (Col1 IS NOT NULL) GROUP BY Col1 ORDER BY Col1 ASC LABEL Col1 'Session Date', COUNT(Col2) 'Skaters'") where the Records sheet is just dates and names.
If I update it to read from two datasets (=QUERY({Records!A2:B, Scratch!B2:B}, "SELECT Col1, COUNT(Col2) WHERE (Col1 IS NOT NULL) GROUP BY Col1 ORDER BY Col1 ASC LABEL Col1 'Session Date', COUNT(Col2) 'Skaters'")) then I get a REF error of Function ARRAY_ROW parameter 2 has mismatched row size. Expected: 982. Actual: 999. That seems fair, as it has created a misaligned dataset rather than merging based on the date column.
I'm probably treating the spreadsheet a bit too much like a database, and while I would be more comfortable dropping into the script editor to resolve this I'm trying to learn a few spreadsheet techniques.
Data
Records looks like this:
| 2018-05-04 | Bob |
| 2018-05-04 | Fred |
| 2018-05-12 | Bob |
So no-one took attendance on the 9th, and so the stats are skewed as Bob gets a misleading 100% attendance record.
I do not understand the details of what you are trying to do, but since it seems to involve combining one list of just dates and at least two lists of dates and names, I offer the following example:
The formula is:
=ArrayFormula(query({Sheet1!B1:C20;Sheet2!E1:F20;Sheet3!I1:J20},"select * where Col2 is not NULL order by Col1 "))
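The merge described in the question (start from a zero count for every session, then overlay the counts that exist) can be sketched in Python. The `training_days` list and the `attendance_with_defaults` helper are hypothetical; the records are the sample rows from the question.

```python
from collections import Counter
from datetime import date

# Sample rows from the Records sheet: (session date, skater name).
records = [
    (date(2018, 5, 4), "Bob"),
    (date(2018, 5, 4), "Fred"),
    (date(2018, 5, 12), "Bob"),
]

# Hypothetical list of scheduled sessions, including the 9th with no records.
training_days = [date(2018, 5, 4), date(2018, 5, 9), date(2018, 5, 12)]

def attendance_with_defaults(sessions, rows):
    """Start every session at zero, then overlay the recorded counts."""
    counts = Counter(day for day, _name in rows)
    merged = {day: 0 for day in sessions}
    merged.update(counts)
    return sorted(merged.items())

attendance = attendance_with_defaults(training_days, records)
```

The key point is that the two datasets are joined on the date rather than placed side by side, which is what the `{Records!A2:B, Scratch!B2:B}` array attempted.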

Merge multiple tables

I have lots of sheets describing different kinds of expenses and gains for my small company, and I can find no easy way to merge my tables as in this example I made:
I want the last table to be filled automatically with the lines of the other tables when I update them, so I can foresee the expenses and gains over time (by ordering the green table automatically by date ascending).
So far the only temporary solution I have found is to copy references to the other tables' lines (yellow and blue) into the merging table (green) in advance.
Pivot tables do not allow this kind of gathering across several tables.
Use this Query formula in cell I2:
=QUERY({A2:C; E2:G}, "select * where Col1 is not null", 1)
To also order them by Date, add the order by:
=QUERY({A2:C; E2:G}, "select * where Col1 is not null order by Col1", 1)
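What the formula does (stack the tables vertically, drop empty rows, sort by the first column) can be sketched in Python. The sample rows and the `merge_tables` helper are hypothetical.

```python
from datetime import date

# Hypothetical rows standing in for the yellow and blue tables: (date, label, amount).
expenses = [
    (date(2023, 1, 10), "Rent", -800),
    (date(2023, 3, 1), "Insurance", -120),
    (None, None, None),  # an empty row, as in an open-ended range like A2:C
]
gains = [
    (date(2023, 2, 5), "Invoice 1", 1500),
]

def merge_tables(*tables):
    """Stack tables vertically, skip rows without a date, sort by date ascending."""
    merged = [row for table in tables for row in table if row[0] is not None]
    return sorted(merged, key=lambda row: row[0])

merged = merge_tables(expenses, gains)
```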

Resources