BigQuery taking too much time on simple LEFT JOIN

So I'm doing a really basic left join, basically joining different identifiers in my database, described below:
SELECT
main_id,
DT.table_1.mid_id AS mid_id,
final_id
FROM DT.table_1
LEFT JOIN DT.table_2 ON DT.table_1.mid_id = DT.table_2.mid_id
Table 1 is composed of four columns: main_id, mid_id, firstSeen and lastSeen.
There are 17,014,676 rows, for 519 MB of data. Each row holds a unique main_id - mid_id pair, but a given main_id or mid_id can appear multiple times in the table.
Table 2 is composed of four columns: mid_id, final_id, firstSeen and lastSeen.
There are 66,779,079 rows, for 3.86 GB of data. In the same way, each row holds a unique mid_id - final_id pair, but a given mid_id or final_id can appear multiple times in the table.
BigQuery reports only 3.11 GB processed for the query itself.
main_id and mid_id are integers; final_id is a string.
The query result was too big for BigQuery to return directly, so I had to create a "result" table containing the main, mid and final ids with the exact types described above. The "Allow Large Results" option had to be selected, or an error was thrown.
My problem is that this simple query has already been running for an hour and is still not finished! I read that good practice would have been to use a RIGHT JOIN so that the biggest table comes first in the join, but even so, an hour is awfully long, even for that case!
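One sanity check worth doing here (a sketch, reusing the DT.table_1 / DT.table_2 names from the query above) is to look at how often the most frequent mid_id values repeat on each side, since every shared mid_id produces left-count times right-count output rows:
-- most repeated join keys on the left side
SELECT mid_id, COUNT(*) AS c
FROM DT.table_1
GROUP BY mid_id
ORDER BY c DESC
LIMIT 10
-- and the same on the right side
SELECT mid_id, COUNT(*) AS c
FROM DT.table_2
GROUP BY mid_id
ORDER BY c DESC
LIMIT 10
If the top counts are large on both sides, the join fans out into far more rows than either input table, which would explain both the oversized result and the runtime.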
Do you, kind people of Stack Overflow, have an explanation?
Thank you in advance!

Related

Unique List from one column to a limit row length across multiple columns Google Sheets

So ultimately what I'm trying to do is copy-paste some data into another sheet. What I want to do with that data is remove any duplicates and break it into multiple columns of a fixed length. E.g. if I have 14,000 unique entries, it would create 3 columns of up to 5,000 unique entries each.
Short explanation: I have a very long column of data that I want to break up into several columns with a size limit.
What I was able to figure out so far is
=unique(QUERY(Sheet2!A:I,"Select B limit 5000"))
Would it be something like this, by chance?
=unique({QUERY(Sheet2!A:I,"Select B limit 5000");A:A;B:B;C:C})
Not sure how it works, but I got it to work... and I still need to add in the unique aspect to it
=ArrayFormula(TRIM(split(transpose(SPLIT(Query(Sheet2!B2:B30000&","&if(MOD(row(Sheet2!B2:B30000)-row(A2),B2)=0,"|",""),,9^9),"|")),",")))
I found it here:
https://infoinspired.com/google-docs/spreadsheet/split-a-column-into-multiple-n-columns-sheets/

How to reflect multiple cells based on specific criteria

Imagine a list of employees going down the left of the spreadsheet, with headers across the top categorized by infractions an employee might commit. This sheet is connected to another sheet that adds a 1 every time a form is submitted against an employee, accumulating over the quarter. So employee John Smith's row would show a 0 for any infraction he never committed, and 1 would be added to the relevant column each time he did, so a row might look like this: John Smith 0 4 5 0 1
The goal is to show the employee's name and each infraction with how many times it took place, removing the infractions he did not commit, so ideally it would look like John Smith 4 5 1, with the header of each number showing what he did.
The goal is essentially to make it much easier to see who did what. There will be over 100 employees and a lot of 0's, so visually it would be better to distill the list in order to quickly identify who did what and how many times.
Any ideas?
VLOOKUPs and imported ranges based on whether a value is greater than 0 are tedious and do not pull exactly what we want. Essentially, omitting the 0s and showing only what an employee has done, rather than what they have not done, is the goal. INDEX and MATCH formulas do not seem to specifically answer this problem.
Simple INDEX, VLOOKUP and MATCH formulas have been tried.
I have not been able to reflect all three variables (employee/frequency/infraction) while keeping people who did not commit an offense off the master list.
There are a few ways you could set this up. I would set it up like this:
Column A = Employee
Column B = Infraction
Column C = 1
Column D = Date
That way you can do a pivot summary with the employees, their infractions listed below their names, and the months/years they occurred. You can also adjust this table as necessary, such as filtering by employee name, date, or infraction.
The added benefit is that you could create a chart with all of these as filters, such as cutting off a date range or picking an employee or infraction, and it could show a bar graph of all the infractions by month or something like that.
I would agree that listing your infraction data line by line (as it happens) and using a pivot table would probably be the easiest.
You could also use the AGGREGATE function to pull from a large dataset as well. That way you could type in an employee's name, and a list of all their infractions would pop up next to the name (or wherever you want it) with as much detail as you would like. This approach is more complex, but using both a pivot table and the AGGREGATE function might get you the best of both worlds (you could search infraction types, dates, employees, employee types, and get all the details in the world if wanted).
Hope this helps!
JW

PowerBI counts rows in non related table including filtering and non-matches

I have two tables in PowerBI and a slicer, presented below in an abstracted way.
I want to know the number of orders placed for a customer in a given date range. This data is a sample for illustration - there are actually around 10,000 Customers and 500,000 Orders and both tables have many other fields, Ids etc.
My challenge -
Whilst this is easy enough to do by relating the tables and doing a count, the difficulty comes in when I still want to see customers with 0 orders, and on top of that I want this to work within a date range. In other words, instead of customers with no orders disappearing from the list, I want them to appear in the list with a 0 value, depending on the date range. It would also be good if this could act as a measure, so I can see the total number of customers that have not ordered on a month-by-month basis. I have tried outer joins, merge queries, cross joins and lookups and can't seem to crack it.
Example 1: If I set the order date slicer to be: 02/01/2017 to 01/01/2018 I want the following results
Example 2: If I set the order date slicer to be: 03/01/2017 to 06/01/2017 I want the following results
Any help appreciated!
Thanks
This is entirely possible with a Measure. When you're using the Order field to count the rows for each customer, you're essentially doing a COUNTROWS() function.
With your relationship still active, we can wrap this in a measure that checks for blanks and, in those cases, returns 0. Something like this would work:
Measure = IF(ISBLANK(COUNTROWS(Orders)),0,COUNTROWS(Orders))
In this case, 'Orders' is the table containing the Order and Order Date fields

MonetDB - left/right joins far slower than inner join

I have been comparing MySQL with MonetDB. Obviously, queries that took minutes in MySQL got executed in a matter of a few seconds in Monet.
However, I hit a real roadblock with joins.
I have 2 tables, each with 150 columns. Among these (150+150) columns, around 60 are of CHARACTER LARGE OBJECT type. Both tables are populated with around 50,000 rows, with data in all 150 columns. The average length of data in a CLOB column is 9,000 characters (varying from 2 to 20,000 characters). The primary keys of both tables have the same values, and the join is always on the primary key. The rows are by default inserted in ascending order of the primary key.
When I ran an inner join query on these two tables with about 5 criteria and a LIMIT of 1000, Monet processed it in 5 seconds, which is completely impressive compared to MySQL (19 seconds).
But when I ran the same query with the same criteria and limit using a left or right join, Monet took around 5 minutes, which is clearly way behind MySQL (just 22 seconds).
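For concreteness, the two query shapes being compared presumably look something like this (the table, column and criteria names are made up, since the actual schema isn't shown):
-- inner join, ~5 criteria, LIMIT 1000: about 5 seconds
SELECT t1.pk, t1.clob_a, t2.clob_b
FROM table_a AS t1
INNER JOIN table_b AS t2 ON t1.pk = t2.pk
WHERE t1.status = 'open'  -- plus four more similar criteria
LIMIT 1000;
-- identical criteria and limit, but with a LEFT JOIN: around 5 minutes
SELECT t1.pk, t1.clob_a, t2.clob_b
FROM table_a AS t1
LEFT JOIN table_b AS t2 ON t1.pk = t2.pk
WHERE t1.status = 'open'  -- plus four more similar criteria
LIMIT 1000;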
I went through the logs using the TRACE statement, but the traces of the inner and left joins are more or less the same, except that the time for each action is far higher in the left join trace.
Also, the execution time of the same join query varies by 2 or 3 seconds when run at different times.
Having read a lot about Monet's speed compared to traditional row-based relational DBs, I feel I must be missing something, but I couldn't figure out what.
Can anyone please tell me why there is such a huge difference in execution time and how I can avoid it?
Grateful for any help. Thanks a lot in advance.
P.S.: I am running Monet on a MacBook Pro - 2.3 GHz quad-core Core i7 with 8 GB RAM.

Speed of ALTER TABLE ADD COLUMN in Sqlite3?

I have an iOS app that uses sqlite3 databases extensively. I need to add a column to a good portion of those tables. None of the tables are what I'd really consider large (I'm used to dealing with many millions of rows in MySQL tables), but given the hardware constraints of iOS devices I want to make sure it won't be a problem. The largest tables would be a few hundred thousand rows. Most of them would be a few hundred to a few thousand or tens of thousands.
I noticed that sqlite3 can only add columns to the end of a table. I'm assuming that's for some type of speed optimization, though possibly it's just a constraint of the database file format.
What is the time cost of adding a column to an sqlite3 table?
Does it simply update the schema and not change the table data?
Does the time increase with number of rows or number of columns already in the table?
I know the obvious answer to this is "just test" and I'll be doing that soon, but I couldn't find an answer on StackOverflow after a few minutes of searching so I figured I'd ask so others can find this information easier in the future.
From the SQLite ALTER TABLE documentation:
The execution time of the ALTER TABLE command is independent of the
amount of data in the table. The ALTER TABLE command runs as quickly
on a table with 10 million rows as it does on a table with 1 row.
The documentation implies the operation is O(1). It should run in negligible time.
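As a concrete illustration (the table and column names below are made up), the statement itself is a one-liner; note that SQLite only accepts certain column definitions here - for example, the new column cannot be declared PRIMARY KEY or UNIQUE, and a NOT NULL column needs a non-null default:
-- schema-only change: existing rows are not rewritten, they simply
-- report the default (NULL here) for the new column when queried
ALTER TABLE sessions ADD COLUMN last_synced_at TEXT DEFAULT NULL;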
