how to implement full outer join with lucene query time join? - join

For the example shown here, if i want to find all articles with all the comments present for each article, i.e. like a full outer join,how can it be done with query time join of lucene.
Is it possible that in the syntax of createJoinQuery
JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, searcher);
fromQuery will return all the distict values for fromField?

Related

Predicate Pushdown vs On Clause

When performing a join in Hive and then filtering the output with a where clause, the Hive compiler will try to filter data before the tables are joined. This is known as predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/)
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6
Rows from table a which have some_name = 6 will be filtered before performing the join, if push down predicates are enabled(hive.optimize.ppd).
However, I have also learned recently that there is another way of filtering data from a table before joining it with another table(https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).
One can provide the condition in the ON clause, and table a will be filtered before the join is performed
For example:
SELECT * FROM a JOIN b ON a.some_id=b.some_other_id AND a.some_name=6
Do both of these provide the predicate pushdown optimization?
Thank you
Both are valid and in case of INNER JOIN and PPD both will work the same. But these methods works differently in case of OUTER JOINS
ON join condition works before join.
WHERE is applied after join.
Optimizer decides is Predicate push-down applicable or not and it may work, but in case of LEFT JOIN for example with WHERE filter on right table, the WHERE filter
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 --Right table filter
will restrict NULLs, and LEFT JOIN will be transformed into INNER JOIN, because if b.some_name=6, it cannot be NULL.
And PPD does not change this behavior.
You can still do LEFT JOIN with WHERE filter if you add additional OR condition allowing NULLs in the right table:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id
WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
And if you have multiple joins with many such filtering conditions the logic like this makes your query difficult to understand and error prune.
LEFT JOIN with ON filter does not require additional OR condition because it filters right table before join, this query works as expected and easy to understand:
SELECT * FROM a
LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
PPD still works for ON filter and if table b is ORC, PPD will push the predicate to the lowest possible level to the ORC reader and will use built-in ORC indexes for filtering on three levels: rows, stripes and files.
More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344
So, PPD or not PPD, better use explicit ANSI syntax with ON condition and ON filtering if possible to keep the query as simple as possible and avoid converting to INNER JOIN unintentionally.

What with clause do? Neo4j

I don't understand what WITH clause do in Neo4j. I read the The Neo4j Manual v2.2.2 but it is not quite clear about WITH clause. There are not many examples. For example I have the following graph where the blue nodes are football teams and the yellow ones are their stadiums.
I want to find stadiums where two or more teams play. I found that query and it works.
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with a, count(*) as foaf
where foaf > 1
return a
count(*) says us the numbers of matching rows. But I don't understand what WITH clause do.
WITH allows you to pass on data from one part of the query to the next. Whatever you list in WITH will be available in the next query part.
You can use aggregation, SKIP, LIMIT, ORDER BY with WITH much like in RETURN.
The only difference is that your expressions have to get an alias with AS alias to be able to access them in later query parts.
That means you can chain query parts where one computes some data and the next query part can use that computed data. In your case it is what GROUP BY and HAVING would be in SQL but WITH is much more powerful than that.
here is another example
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with distinct a
order by a.name limit 10
match (a)-[:IN_CITY]->(c:City)
return c.name

solr join - return parent and child document

I am using Solr's (4.0.0-beta) join capability to query an index that has documents with parent/child relationships. The join query works great, but I only get the parent documents in the search results. I believe this is the expected behavior.
Is it possible, though, to get both the parent and the child documents to be returned in the search results? (as separate search hits).
For example:
Parents:
SolrDocument{uid=m_1, media_id=1}<br/>
SolrDocument{uid=m_2, media_id=2}<br/>
SolrDocument{uid=m_3, media_id=3}
Children:
SolrDocument(uid=p_1, page_id=1, fk_media_id=[1], partNumber=[abc, def, xyz]}<br/>
SolrDocument(uid=p_2, page_id=2, fk_media_id=[1,2], partNumber=[123, 456]}<br/>
SolrDocument(uid=p_3, page_id=3, fk_media_id=[1,3], partNumber=[100, 101]}
I query by partNumber like this:
{!join from=fk_media_id to=media_id}partNumber:abc
and I get the parent document (uid=m_1) in the results, as expected. But I would like, in this case, both the parent and the child to be returned in the results. Is that possible?
No, It´s not posible. According to Solr Wiki:
For people who are used to SQL, it's important to note that Joins in Solr are not really equivalent to SQL Joins because no information about the table being joined "from" is carried forward into the final result. A more appropriate SQL analogy would be an "inner query""
http://wiki.apache.org/solr/Join
You have to denormalize all your data to do that or run two different querys.

SQLite3 Database Query Optimization

I want a result by combining 4 tables. Previously I was using 4 different queries and to improve performance, started with joining the tables and querying from single table. But there was no improvement in performance.
I later learnt that SQLite translates join statements to "where clause" and I can directly use "Where" clause instead of join that would save some CPU time.
But the problem with "Where" clause is if one condition out of four fails, the result set is null. I want a table with rest of the columns (that matches other conditions) filled and not an empty table if one condition fails. Is there a way to acheive this? Thanks!
Have you considered using LEFT OUTER JOIN ?
for example
SELECT Customers.AcctNumber, Customers.Custname, catalogsales.InvoiceNo
FROM Customers
LEFT OUTER JOIN catalogsales ON Customers.Acctnumber = catalogsales.AcctNumber
In this example if there are not any matching rows in "catalogsales", then it will still return the data from the "left" table, which in this case is "Customers"
Without example SQL it's hard to know what you've tried.

Fetch latest rows grouped by uniqe field value

I have a table of Books with an author_id field.
I'd like to fetch an array of Books which contain only one Book of every author. The one with the latest updated_at field.
The problem with straightforward approach like Books.all.group('author_id') on Postgres is that it needs all requested field in its GROUP BY block. (See https://stackoverflow.com/a/6106195/1245302)
But I need to get all Book objects one per author, the recent one, ignoring all other fields.
It seems to me that there's enough data for the DBMS to find exactly the rows I want,
at least I could do that myself without any other fields in GROUP BY block. :)
Is there any simple Rails 3 + Postgres (version < 9) or SQL implementation
independent way to get that?
UPDATE
Nice solution for Postgres:
books.unscoped.select('DISTINCT ON(author_id) *').order('author_id').order('updated_at DESC')
BUT! there still problem remains – results are sorted by author_id in the first place, but i need to sort by updated_at inside the same author_id-s (to find, say the top-10 recent book authors).
And Postgres doesn't allow you to change order of ORDER BY arguments in DISTINCT queries :(
I don't know Rails, but hopefully showing you the SQL for what you want will help get you to a way to generate the right SQL.
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC;
The DISTINCT ON (author_id) portion should not be confused with part of the result column list -- it just says that there will be one row per author_id. The list in a DISTINCT ON clause must be the leading portion of the ORDER BY clause in such a query, and the row which is kept is the one which sorts first based on the rest of the ORDER BY clause.
With a large number of rows this way of writing the query is usually much faster than any solution based on GROUP BY or window functions, often by an order of magnitude or more. It is a PostgreSQL extension, though; so it should not be used in code which is intended to be portable.
If you want to use this result set inside another query (for example, to find the 10 most recently updated authors), there are two ways to do that. You can use a subquery, like this:
SELECT *
FROM (SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC) w
ORDER BY updated_at DESC
LIMIT 10;
You could also use a CTE, like this:
WITH w AS (
SELECT DISTINCT ON (author_id) *
FROM Books
ORDER BY author_id, updated_at DESC)
SELECT * FROM w
ORDER BY updated_at DESC
LIMIT 10;
The usual advice about CTEs holds here: use them only where there isn't another way to write the query or if needed to coerce the planner by introducing an optimization barrier. The plans are very similar, but passing the intermediate results through the CTE scan adds a little overhead. On my small test set the CTE form is 17% slower.
This is belated, but in response to questions about overriding/resetting a default order, use .reorder(nil).order(:whatever_you_want_instead)
(I can't comment, so posting as an answer for now)

Resources