SQLite3 Database Query Optimization - join

I want to produce a result by combining 4 tables. Previously I was using 4 separate queries, and to improve performance I started joining the tables and querying the combined result. But there was no improvement in performance.
I later learnt that SQLite translates join statements into WHERE-clause constraints internally, so I could use a WHERE clause directly instead of a join, which would save some CPU time.
But the problem with the WHERE-clause approach is that if one condition out of the four fails, the result set is empty. I want a result with the rest of the columns (those that match the other conditions) filled, not an empty table, if one condition fails. Is there a way to achieve this? Thanks!

Have you considered using a LEFT OUTER JOIN?
For example:
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.AcctNumber = catalogsales.AcctNumber
In this example, if there are no matching rows in "catalogsales", the query will still return the data from the "left" table, which in this case is "Customers".
Without example SQL it's hard to know what you've tried.
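Since the question mentions four tables, here is a minimal sketch of chaining LEFT OUTER JOINs (the payments and notes tables and their columns are hypothetical stand-ins, since the original schema wasn't posted). A Customers row survives even when one of the other tables has no match; that table's columns simply come back as NULL:
SELECT c.AcctNumber, c.Custname, s.InvoiceNo, p.PaymentDate, n.NoteText
FROM Customers c
LEFT OUTER JOIN catalogsales s ON c.AcctNumber = s.AcctNumber
LEFT OUTER JOIN payments p ON c.AcctNumber = p.AcctNumber -- hypothetical table
LEFT OUTER JOIN notes n ON c.AcctNumber = n.AcctNumber -- hypothetical table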

Related

ActiveRecord WHERE IN with a separate query—combine to a single INNER JOIN query

I currently have a very complicated query which (simplified) looks like:
User.where(party_id: [Party.cool_parties.pluck(:id)])
It works great, but hits the db with 2 separate queries. I'm just wondering how to combine it into a single query? Some kind of inner join with a temporary table somehow? I'm not sure how to do that with Rails...
User.where(party_id: Party.cool_parties)
Just pass the association/relation and ActiveRecord will construct a subquery.
pluck should only be used in the cases where you actually want to force a query to load the data as an array - which is a lot less often than you think.
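For reference, the single query this generates looks roughly like the following (the subquery's WHERE clause depends entirely on how cool_parties is defined; a boolean cool column is assumed here purely for illustration):
SELECT users.* FROM users
WHERE users.party_id IN (
  SELECT parties.id FROM parties
  WHERE parties.cool = TRUE -- assumed scope condition
)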

Predicate Pushdown vs On Clause

When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
Rows from table a which have some_name = 6 will be filtered before performing the join, if predicate pushdown is enabled (hive.optimize.ppd).
However, I have also learned recently that there is another way of filtering data from a table before joining it with another table (https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).
One can provide the condition in the ON clause, and table a will be filtered before the join is performed
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id AND a.some_name=6
Do both of these provide the predicate pushdown optimization?
Thank you
Both are valid, and in the case of an INNER JOIN with PPD both work the same. But these methods work differently in the case of OUTER JOINs.
The ON join condition is applied before the join.
The WHERE clause is applied after the join.
The optimizer decides whether predicate push-down is applicable, and it may work, but in the case of a LEFT JOIN with, for example, a WHERE filter on the right table, the WHERE filter in
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 --Right table filter
will filter out NULLs, and the LEFT JOIN will be transformed into an INNER JOIN, because if b.some_name=6 then it cannot be NULL.
And PPD does not change this behavior.
You can still do a LEFT JOIN with a WHERE filter if you add an additional OR condition allowing NULLs in the right table:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
And if you have multiple joins with many such filtering conditions, logic like this makes your query difficult to understand and error-prone.
A LEFT JOIN with an ON filter does not require the additional OR condition, because it filters the right table before the join; this query works as expected and is easy to understand:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id AND b.some_name=6
PPD still works for the ON filter, and if table b is stored as ORC, PPD will push the predicate down to the lowest possible level, the ORC reader, which will use built-in ORC indexes for filtering on three levels: rows, stripes and files.
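For reference, the settings involved look like this (a sketch; property names as of Hive 1.x/2.x, so check your distribution's documentation):
SET hive.optimize.ppd=true; -- enable predicate pushdown in the compiler
SET hive.optimize.index.filter=true; -- let the ORC reader use its indexes to skip rows, stripes and files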
More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344
So, PPD or no PPD, it is better to use explicit ANSI join syntax with the ON condition and ON-clause filtering where possible, to keep the query as simple as possible and avoid unintentionally converting a LEFT JOIN into an INNER JOIN.

Optimizing JOINs : comparison with indexed tables

Let's say we have a time consuming query described below :
(SELECT ...
FROM ...) AS FOO
LEFT JOIN (
SELECT ...
FROM ...) AS BAR
ON FOO.BarID = BAR.ID
Let's suppose that
(SELECT ...
FROM ...) AS FOO
Returns many rows (let's say 10 M). Every single row has to be joined with data in BAR.
Now let's say we insert the result of
(SELECT ...
FROM ...) AS BAR
into a table, and add the ad hoc index(es) to it.
My question:
How would the performance of the "JOIN" with a live query differ from the performance of the "JOIN" to a table containing the result of that same live query, to which ad hoc indexes have been added?
Another way to put it:
If a JOIN is slow, would there be any gain in actually storing and indexing the table to which we JOIN?
The answer is 'Maybe'.
It depends on the statistics of the data in question. The only way you'll find out for sure is to actually load the first query into a temp table, stick a relevant index on it, then run the second part of the query.
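As a minimal sketch of that test (PostgreSQL-flavoured syntax; the table and column names are hypothetical stand-ins for the elided subqueries):
-- materialize the BAR subquery
CREATE TEMPORARY TABLE bar_tmp AS
SELECT id, payload FROM bar_source;
-- index the join key and refresh planner statistics
CREATE INDEX bar_tmp_id_idx ON bar_tmp (id);
ANALYZE bar_tmp;
-- re-run the join against the materialized, indexed table
SELECT foo.*, bar_tmp.payload
FROM foo
LEFT JOIN bar_tmp ON foo.BarID = bar_tmp.id;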
I can tell you that if speed is what you want, and it's possible for you to load the results of your first query permanently into a table, then of course your query is going to be quicker.
If you want it to be even faster, depending on which DBMS you are using you could consider creating an index which crosses both tables - in SQL Server these are called 'Indexed Views'; in other systems look up 'materialized views'.
Finally, if you want the ultimate in speed, consider denormalising your data and eliminating the join that is occurring on the fly - basically you move the pre-processing (the join) offline at the cost of storage space and data consistency (your live table will be a little behind depending on how frequently you run your updates).
I hope this helps.

Fetch latest rows grouped by unique field value

I have a table of Books with an author_id field.
I'd like to fetch an array of Books which contain only one Book of every author. The one with the latest updated_at field.
The problem with a straightforward approach like Books.all.group('author_id') on Postgres is that it needs all requested fields in its GROUP BY block. (See https://stackoverflow.com/a/6106195/1245302)
But I need to get whole Book objects, one per author (the most recent one), ignoring all other fields.
It seems to me that there's enough data for the DBMS to find exactly the rows I want;
at least I could do that myself without any other fields in the GROUP BY block. :)
Is there any simple Rails 3 + Postgres (version < 9) or implementation-independent SQL
way to get that?
UPDATE
Nice solution for Postgres:
books.unscoped.select('DISTINCT ON(author_id) *').order('author_id').order('updated_at DESC')
BUT! One problem remains: results are sorted by author_id in the first place, but I need the one-row-per-author results sorted by updated_at (to find, say, the top 10 authors with the most recent books).
And Postgres doesn't allow you to change the order of the ORDER BY arguments in DISTINCT ON queries :(
I don't know Rails, but hopefully showing the SQL for what you want will help you find a way to generate it.
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC;
The DISTINCT ON (author_id) portion should not be confused with part of the result column list -- it just says that there will be one row per author_id. The list in a DISTINCT ON clause must be the leading portion of the ORDER BY clause in such a query, and the row which is kept is the one which sorts first based on the rest of the ORDER BY clause.
With a large number of rows this way of writing the query is usually much faster than any solution based on GROUP BY or window functions, often by an order of magnitude or more. It is a PostgreSQL extension, though; so it should not be used in code which is intended to be portable.
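For code that must stay portable, a roughly equivalent query using a standard window function (supported in PostgreSQL from 8.4, so mind the asker's "version < 9" constraint) would be:
SELECT *
FROM (SELECT b.*,
             ROW_NUMBER() OVER (PARTITION BY author_id
                                ORDER BY updated_at DESC) AS rn
      FROM Books b) ranked
WHERE rn = 1;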
If you want to use this result set inside another query (for example, to find the 10 most recently updated authors), there are two ways to do that. You can use a subquery, like this:
SELECT *
FROM (SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC) w
ORDER BY updated_at DESC
LIMIT 10;
You could also use a CTE, like this:
WITH w AS (
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC)
SELECT * FROM w
ORDER BY updated_at DESC
LIMIT 10;
The usual advice about CTEs holds here: use them only where there isn't another way to write the query or if needed to coerce the planner by introducing an optimization barrier. The plans are very similar, but passing the intermediate results through the CTE scan adds a little overhead. On my small test set the CTE form is 17% slower.
This is belated, but in response to questions about overriding/resetting a default order, use .reorder(nil).order(:whatever_you_want_instead)
(I can't comment, so posting as an answer for now)

Fetch data from multiple tables and sort all by their time

I'm creating a history page. So I was wondering if there is any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
So is there any way to fetch from all tables and sort without having Rails do the sorting for me?
You may get a better answer, but I would presume you would need to
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose [eg Name, Description]
Modify all tables that generate a "history" item to consume this new table via Foreign Key relationship on History.Id
"Mashing up" tables [ie merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to perform SQL like:
SELECT * FROM table ORDER BY created_at ASC
Store this into an array. Do this for each of the data sources, and then perform a merge sort on all the arrays in Ruby. Of course this will work well for small data sets, but once you get a data set that is large (i.e. greater than will fit into memory) then you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set
On different machines: you'll want to test this for performance, but you could use Linked Servers and UNION, ORDER BY. Alternatively, you could have Ruby get the results from each db, and then combine them and sort
EDIT: From your last comment, about different tables rather than different databases, use something like this:
SELECT Created FROM table1
UNION
SELECT Created FROM table2
ORDER BY Created
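If you need more than the timestamp, a sketch along these lines works (table names are hypothetical; every SELECT must return the same column list, and UNION ALL keeps duplicate rows instead of paying for the dedup pass that plain UNION performs):
SELECT 'posts' AS source, id, created_at FROM posts
UNION ALL
SELECT 'comments' AS source, id, created_at FROM comments
ORDER BY created_at DESC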
