How to use SphinxSE table + sort by "weight desc" when there are other joins in query?

By default, when you query a Sphinx table, the Sphinx engine returns rows already sorted by query weight, and it does so really fast.
So, when I do this:
select
article.name
from article
left join article_ft on article._id=article_ft.id
where article_ft.query='some text;mode=any;';
Where:
article is an InnoDB-like table.
article_ft is a Sphinx table.
Both of them (article.name and article_ft) contain this data (1 line = 1 row):
This is text.
This is also some text.
This is another text.
Sphinx engine will return rows like:
This is also some text.
This is text.
This is another text.
But if I do something like this:
select
article.name
from article
left join article_ft on article._id=article_ft.id
left join article_category on article.category=article_category._id
where article_ft.query='some text;mode=any;';
It seems MariaDB sorts the rows its own way here.
Even if I provide Sphinx's 'sort' option like this:
select
article.name
from article
left join article_ft on article._id=article_ft.id
left join article_category on article.category=article_category._id
where article_ft.query='some text;mode=any;sort=extended:#weight desc;';
It still doesn't work.
Changing the order of the joins doesn't work either.
If I use order by article_ft.weight DESC, MariaDB returns an error message like:
Error: ER_ILLEGAL_HA: Storage engine SPHINX of the table `article_ft` doesn't have this option
when article has no rows that match a condition like article.category=50.
article_ft was created using this:
CREATE TABLE article_ft
(
id BIGINT NOT NULL,
weight INTEGER NOT NULL,
query VARCHAR(3072) NOT NULL,
INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://192.168.1.98:9402/article_ft";
How can I use this "magical" sort-by-weight feature, without errors, when the query contains more joins?
Thanks in advance for any reply!
P.S. I can't provide a fiddle for this because I don't know of any online SQL fiddle service that supports Sphinx tables. Also, if you find a more relevant question on this topic, I'd appreciate a pointer.

Put the article_ft table first in the query, i.e. ... article_ft inner join article ...
Or maybe use FORCE INDEX to force the use of the query index. Then it might honour the sort order.
Failing that, use a subquery?
(select name,weight from article_ft ... ) order by weight desc;
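The subquery idea can be sketched as below. This only illustrates the shape of the query, run against plain SQLite tables from Python: article_ft here is an ordinary table with invented weights standing in for the Sphinx table, there is no query column, and nothing in it confirms how SphinxSE or MariaDB behave.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (_id INTEGER PRIMARY KEY, name TEXT, category INTEGER);
    CREATE TABLE article_category (_id INTEGER PRIMARY KEY, title TEXT);
    -- Stand-in for the Sphinx table: the weights are made up for this example.
    CREATE TABLE article_ft (id INTEGER PRIMARY KEY, weight INTEGER);

    INSERT INTO article_category VALUES (1, 'misc');
    INSERT INTO article VALUES
        (1, 'This is text.', 1),
        (2, 'This is also some text.', 1),
        (3, 'This is another text.', NULL);
    INSERT INTO article_ft VALUES (1, 1), (2, 2), (3, 1);
""")

# The joins live in a derived table that starts from article_ft;
# the ordering is stated explicitly on the outer query.
rows = conn.execute("""
    SELECT name
    FROM (
        SELECT article.name AS name,
               article_ft.weight AS weight,
               article_ft.id AS id
        FROM article_ft
        INNER JOIN article ON article._id = article_ft.id
        LEFT JOIN article_category ON article.category = article_category._id
    ) AS ranked
    ORDER BY weight DESC, id ASC
""").fetchall()

names = [row[0] for row in rows]
print(names)
```

The extra id ASC term is only there to make ties between equal weights deterministic in the example.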

Related

How to use join with sort in Solr?

I'm trying to sort documents of type 'Case' by the 'Name' of the 'Contact' they belong to in Solr. But cases have no 'ContactName' field or similar, only 'ContactId'.
The only examples I could find are variations of the example at this link: https://wiki.apache.org/solr/Join
But I couldn't apply it to my situation because of the sorting afterwards. The following gives me the cases I want, but I can't sort them by contact name afterwards because it only returns the fields of the cases.
{!join from=Id to=ContactId}*:*
SQL equivalent of what I want would be something like:
SELECT Case.Id, Contact.Name
FROM Case
LEFT JOIN Contact
ON Case.ContactId = Contact.Id
ORDER BY Contact.Name ASC;
So, to answer my own question after some digging and a Solr training:
It is not best practice to use joins in a NoSQL database like Solr. If you need joins, then your database is structured wrong. You should index everything you need in the document itself, even if it is redundant. So in my case, I should index the 'Contact.Name' field in my 'Case' documents.
Still, if it is absolutely needed, it is apparently possible to run SQL queries in Solr when you're using SolrCloud, although that doesn't support joins. However, it is possible to work around that as follows:
SELECT s1.Id
FROM salesforce s1, salesforce s2
WHERE s1._type_ = 'Case' and s2._type_ = 'Contact' AND s1.ContactId = s2.Id
ORDER BY s2.Name ASC
It should be noted that the fields after '.', like the 'Id' in 's1.Id', must have 'docValues' activated in the schema. See the Solr documentation on docValues for more info.
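The denormalization advice (index the contact's name on each case) could look roughly like this before the documents are sent to Solr. This is a minimal sketch in Python with made-up sample records; the 'ContactName' field name and the denormalize_cases helper are hypothetical, and the actual indexing call is left out.

```python
def denormalize_cases(cases, contacts):
    """Copy each contact's Name onto the cases that reference it."""
    names_by_id = {contact["Id"]: contact["Name"] for contact in contacts}
    return [
        # 'ContactName' is a hypothetical field added for sorting.
        {**case, "ContactName": names_by_id.get(case["ContactId"])}
        for case in cases
    ]


contacts = [{"Id": "c1", "Name": "Zoe"}, {"Id": "c2", "Name": "Adam"}]
cases = [
    {"Id": "k1", "ContactId": "c1"},
    {"Id": "k2", "ContactId": "c2"},
]

docs = denormalize_cases(cases, contacts)

# Sorted client-side here only to show that the field is now available on
# each case; in Solr the sort would be requested on the new field instead.
docs_sorted = sorted(docs, key=lambda doc: doc["ContactName"])
print([doc["Id"] for doc in docs_sorted])
```

The trade-off is redundancy: when a contact is renamed, every case document that carries the name has to be reindexed.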

Why does Hive warn that this subquery would cause a Cartesian product?

According to Hive's documentation it supports NOT IN subqueries in a WHERE clause, provided that the subquery is an uncorrelated subquery (does not reference columns from the main query).
However, when I attempt to run the trivial query below, I get an error FAILED: SemanticException Cartesian products are disabled for safety reasons.
-- sample data
CREATE TEMPORARY TABLE foods (name STRING);
CREATE TEMPORARY TABLE vegetables (name STRING);
INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
-- the problematic query
SELECT *
FROM foods
WHERE foods.name NOT IN (SELECT vegetables.name FROM vegetables)
Note that if I use an IN clause instead of a NOT IN clause, it actually works fine, which is perplexing because the query evaluation structure should be the same in either case.
Is there a workaround for this, or another way to filter values from a query based on their presence in another table?
This is Hive 2.3.4 btw, running on an Amazon EMR cluster.
Not sure why you would get that error. One workaround is to use NOT EXISTS.
SELECT f.*
FROM foods f
WHERE NOT EXISTS (SELECT 1
FROM vegetables v
WHERE v.name = f.name)
or a left join
SELECT f.*
FROM foods f
LEFT JOIN vegetables v ON v.name = f.name
WHERE v.name is NULL
You got a Cartesian join because this is what Hive does in this case. The vegetables table is very small (just one row), so it is broadcast to perform the cross join (most probably a map join; check the plan). Hive does the cross (map) join first and then applies the filter. The explicit left join syntax with a filter, as @VamsiPrabhala said, will force a left join, but in this case it works the same, because the table is very small and the CROSS JOIN does not multiply rows.
Execute EXPLAIN on your query and you will see what is exactly happening.
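To compare the three formulations side by side, here is a small sketch that runs them on the question's sample data. It uses SQLite from Python rather than Hive, so it says nothing about Hive's execution plan; it only checks that the queries return the same rows, and it shows the one case where they differ under standard SQL rules: a NULL in the subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (name TEXT);
    CREATE TABLE vegetables (name TEXT);
    INSERT INTO foods VALUES ('steak'), ('eggs'), ('celery'), ('onion'), ('carrot');
    INSERT INTO vegetables VALUES ('celery'), ('onion'), ('carrot');
""")

QUERIES = {
    "not_in": """
        SELECT f.name FROM foods f
        WHERE f.name NOT IN (SELECT v.name FROM vegetables v)""",
    "not_exists": """
        SELECT f.name FROM foods f
        WHERE NOT EXISTS (SELECT 1 FROM vegetables v WHERE v.name = f.name)""",
    "left_join": """
        SELECT f.name FROM foods f
        LEFT JOIN vegetables v ON v.name = f.name
        WHERE v.name IS NULL""",
}


def run_all(connection):
    """Run every formulation and return its sorted result, keyed by label."""
    return {
        label: sorted(row[0] for row in connection.execute(sql))
        for label, sql in QUERIES.items()
    }


without_null = run_all(conn)

# With a NULL in the subquery, NOT IN can no longer evaluate to true for any
# row, while NOT EXISTS and the anti-join are unaffected.
conn.execute("INSERT INTO vegetables VALUES (NULL)")
with_null = run_all(conn)

print(without_null)
print(with_null)
```

If the subquery column can contain NULLs, NOT EXISTS or the left join is usually the safer choice.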

Properly format an ActiveRecord query with a subquery in Postgres

I have a working SQL query for Postgres v10.
SELECT *
FROM
(
SELECT DISTINCT ON (title) products.title, products.*
FROM "products"
) subquery
WHERE subquery.active = TRUE AND subquery.product_type_id = 1
ORDER BY created_at DESC
The goal of the query is to do a distinct on the title column, then filter and order the results. (I used the subquery in the first place because there seemed to be no way to combine DISTINCT ON with ORDER BY without one.)
I am trying to express said query in ActiveRecord.
I have been doing
Product.select("*")
.from(Product.select("DISTINCT ON (product.title) product.title, meals.*"))
.where("subquery.active IS true")
.where("subquery.meal_type_id = ?", 1)
.order("created_at DESC")
and, that works! But, it's fairly messy with the string where clauses in there. Is there a better way to express this query with ActiveRecord/Arel, or am I just running into the limits of what ActiveRecord can express?
I think the resulting ActiveRecord call can be improved.
But I would start improving with original SQL query first.
Subquery
SELECT DISTINCT ON (title) products.title, products.* FROM products
(I think that instead of meals there should be products?) selects products.title twice, which is unnecessary. Worse, it lacks an ORDER BY clause. As the PostgreSQL documentation says:
Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first
I would rewrite sub-query as:
SELECT DISTINCT ON (title) * FROM products ORDER BY title ASC
which gives us a call:
Product.select('DISTINCT ON (title) *').order(title: :asc)
In the main query, the where calls use the Rails-generated alias for the subquery. I would not rely on Rails' internal convention for aliasing subqueries, as it may change at any time. If that does not concern you, you can merge these conditions into one where call with hash-style arguments.
The final result:
Product.select('*')
.from(Product.select('DISTINCT ON (title) *').order(title: :asc))
.where(subquery: { active: true, meal_type_id: 1 })
.order('created_at DESC')

Return duplicate records (activerecord, postgres)

I have the following query returning duplicate titles, but :id is nil:
Movie.select(:title).group(:title).having("count(*) > 1")
[#<Movie:0x007f81f7111c20 id: nil, title: "Fargo">,
#<Movie:0x007f81f7111ab8 id: nil, title: "Children of Men">,
#<Movie:0x007f81f7111950 id: nil, title: "The Martian">,
#<Movie:0x007f81f71117e8 id: nil, title: "Gravity">]
I tried adding :id to the select and group but it returns an empty array. How can I return the whole movie record, not just the titles?
A SQL-y Way
First, let's just solve the problem in SQL, so that the Rails-specific syntax doesn't trick us.
This SO question is a pretty clear parallel: Finding duplicate values in a SQL Table
The answer from KM (second from the top, non-checkmarked, at the moment) meets your criteria of returning all duplicated records along with their IDs. I've modified KM's SQL to match your table...
SELECT
m.id, m.title
FROM
movies m
INNER JOIN (
SELECT
title, COUNT(*) AS CountOf
FROM
movies
GROUP BY
title
HAVING COUNT(*)>1
) dupes
ON
m.title=dupes.title
The portion inside the INNER JOIN ( ) is essentially what you've generated already. A grouped table of duplicated titles and counts. The trick is JOINing it to the unmodified movies table, which will exclude any movies that don't have matches in the query of dupes.
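To see the join at work, here is a minimal sketch that runs the same SQL against a throwaway SQLite database from Python. The table contents are made up, and an ORDER BY is added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO movies (id, title) VALUES
        (1, 'Fargo'), (2, 'Gravity'), (3, 'Fargo'), (4, 'Alien'), (5, 'Gravity');
""")

# The derived table lists the titles that occur more than once; joining it
# back to movies keeps every row (with its id) that carries such a title.
duplicate_rows = conn.execute("""
    SELECT m.id, m.title
    FROM movies m
    INNER JOIN (
        SELECT title, COUNT(*) AS CountOf
        FROM movies
        GROUP BY title
        HAVING COUNT(*) > 1
    ) dupes ON m.title = dupes.title
    ORDER BY m.title, m.id
""").fetchall()

print(duplicate_rows)
```

'Alien' appears once, so the inner join drops it, while both copies of each duplicated title come back with their own ids.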
Why is this so hard to generate in Rails? The trickiest part is that, because we're JOINing movies to movies, we have to create table aliases (m and dupes in my query above).
Sadly, Rails doesn't provide any clean way of declaring these aliases. Some references:
Rails GitHub issues mentioning "join" and "alias". Misery.
SO Question: ActiveRecord query with alias'd table names
Fortunately, since we've got the SQL in-hand, we can use the .find_by_sql method...
Movie.find_by_sql("SELECT m.id, m.title FROM movies m INNER JOIN (SELECT title, COUNT(*) FROM movies GROUP BY title HAVING COUNT(*)>1) dupes ON m.title=dupes.title")
Because we're calling Movie.find_by_sql, ActiveRecord assumes our hand-written SQL can be bundled into Movie objects. It doesn't massage or generate anything, which lets us do our aliases.
This approach has its shortcomings. It returns an array and not an ActiveRecord Relation, which means it can't be chained with other scopes. And, in the documentation for the find_by_sql method, we get extra discouragement...
This should be a last resort because using, for example, MySQL specific terms will lock you to using that particular database engine or require you to change your call if you switch engines.
A Rails-y Way
Really, what is the SQL doing above? It's getting a list of titles that appear more than once. Then, it's matching that list against the original table. So, let's just do that using Rails.
titles_with_multiple = Movie.group(:title).having("count(title) > 1").count.keys
Movie.where(title: titles_with_multiple)
We call .keys because the first query returns a hash. The keys are our titles. The where() method can take an array, and we've handed it an array of titles. Winner.
You could argue one line of Ruby is more elegant than two. And if that one line of Ruby has an ungodly string of SQL embedded within it, how elegant is it really?
Hope this helps!
You can try to add id in your select:
Movie.select([:id, :title]).group(:title).having("count(title) > 1")

Rails 3 LIKE query raises exception when using a colon and a dot

In rails 3.0.0, the following query works fine:
Author.where("name LIKE :input",{:input => "#{params[:q]}%"}).includes(:books).order('created_at')
However, when I enter the following search string (containing a colon followed by a dot):
aa:.bb
I get the following exception:
ActiveRecord::StatementInvalid: SQLite3::SQLException: ambiguous column name: created_at
In the logs, these are the SQL queries:
with aa as input:
Author Load (0.4ms) SELECT "authors".* FROM "authors" WHERE (name LIKE 'aa%') ORDER BY created_at
Book Load (2.5ms) SELECT "books".* FROM "books" WHERE ("books".author_id IN (1,2,3)) ORDER BY id
with aa:.bb as input:
SELECT DISTINCT "authors".id FROM "authors" LEFT OUTER JOIN "books" ON "books"."author_id" = "authors"."id" WHERE (name LIKE 'aa:.bb%') ORDER BY created_at DESC LIMIT 12 OFFSET 0
SQLite3::SQLException: ambiguous column name: created_at
It seems that with the aa:.bb input, an extra query is made to fetch the distinct author ids.
I thought Rails would escape all the characters. Is this expected behaviour or a bug?
Best Regards,
Pieter
The "ambiguous column" error usually happens when you use includes or joins and don't specify which table you're referring to:
"name LIKE :input"
Should be:
"authors.name LIKE :input"
Just "name" is ambiguous if your books table has a name column too.
Also: have a look at your development.log to see what the generated query looks like. This will show you if it's being escaped properly.
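The ambiguity can be reproduced outside Rails. The sketch below uses SQLite from Python with two made-up tables that both have a created_at column; in this reduced schema it is the unqualified created_at in ORDER BY that triggers the error once the join is present, and qualifying it with the table name resolves it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT, created_at TEXT
    );
    INSERT INTO authors VALUES (1, 'aa one', '2011-01-02'), (2, 'aa two', '2011-01-01');
    INSERT INTO books VALUES (1, 1, 'first book', '2012-01-01');
""")

JOINED = "FROM authors LEFT OUTER JOIN books ON books.author_id = authors.id "

# Unqualified created_at: both joined tables have that column.
try:
    conn.execute(
        "SELECT DISTINCT authors.id " + JOINED
        + "WHERE name LIKE ? ORDER BY created_at",
        ("aa%",),
    )
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

# Qualified with the table name: the same join now runs.
rows = conn.execute(
    "SELECT DISTINCT authors.id, authors.created_at " + JOINED
    + "WHERE name LIKE ? ORDER BY authors.created_at",
    ("aa%",),
).fetchall()
author_ids = [row[0] for row in rows]

print(error)
print(author_ids)
```

In the Rails query from the question, the analogous change would be ordering by 'authors.created_at' instead of 'created_at'.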
Replace
.includes(:books)
with
.preload(:books)
This should force ActiveRecord to use 2 queries instead of the join.
Rails has 2 versions of includes: one which constructs a big query with joins (the 2nd of your 2 queries, and thus more likely to result in ambiguous column references) and one that avoids the joins in favour of a separate query per association.
Rails decides which strategy to use based on whether it thinks that your conditions, order etc. refer to the included tables (since in that case the joins version is required). Where a condition is a string fragment, that heuristic isn't very sophisticated. I seem to recall that it just scans the conditions for anything that might look like a column from another table (i.e. foo.bar), so having a literal of that form could fool it.
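A heuristic of that kind could be sketched as follows. The regular expression and both helper functions are assumptions made for illustration in Python, not Rails' actual implementation; the point is only that a naive "identifier followed by a dot" scan cannot tell a table reference from the contents of a string literal.

```python
import re

# Hypothetical pattern: an identifier, optionally one arbitrary character,
# then a dot. It knows nothing about SQL quoting.
TABLE_REF = re.compile(r"([a-zA-Z_][.\w]+).?\.")


def tables_referenced(sql_fragment):
    """Return the lowercased names that look like table references."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql_fragment)})


def looks_like_it_needs_join(sql_fragment, own_table):
    """Guess whether the fragment mentions any table other than own_table."""
    return any(name != own_table for name in tables_referenced(sql_fragment))


plain = "name LIKE 'aa%'"
tricky = "name LIKE 'aa:.bb%'"
qualified = "authors.name LIKE 'aa%'"

print(tables_referenced(plain), looks_like_it_needs_join(plain, "authors"))
print(tables_referenced(tricky), looks_like_it_needs_join(tricky, "authors"))
print(tables_referenced(qualified), looks_like_it_needs_join(qualified, "authors"))
```

With this pattern, the literal 'aa:.bb%' is mistaken for a reference to a table named aa, which would be enough to push a strategy choice towards the join version.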
You can either qualify your column names so that it doesn't matter which includes strategy is used or you can use preload/eager_load instead of includes. These behave similarly to includes but force a specific include strategy rather than trying to guess which is most appropriate.
