I have this table:
User
Name
Role
Mason
Engineer
Jackson
Engineer
Mason
Supervisor
Jackson
Supervisor
Graham
Engineer
Graham
Engineer
There can be exact duplicates (same Name/Role combination). Ignore comments about primary key.
I am writing a query that will give the distinct values from 'Name' column, with the corresponding 'Role'. To select the corresponding 'Role', if there is a 'Supervisor' role for a name, that record is returned. Otherwise, a record with the 'Engineer' role should be returned if it exists.
For the above table, the expected result is:
Name
Role
Mason
Supervisor
Jackson
Supervisor
Graham
Engineer
I tried ordering 'Role' in descending order, so that I can group by Name,Role and pick the first item - it will be a 'Supervisor' role if present, else 'Engineer' role - which matches my expecation.
I also tried doing User.select('DISTINCT ON (name) \*).order(Role: :desc) - I am not seeing this clause in the SQL query that gets executed.
Also, I tried another approach to get all valid Name, Role combinations and then process it offline iterating the result set and using if-else to decide which row to display.
However, I am interested in anything that is efficient and does not over do this handling.
I am new to Ruby and therefore reaching out.
If I wanted to do this in pure SQL, I would have to use GROUP BY.
SELECT Name, MAX(Role) FROM User GROUP BY Name
So one method would be to execute this SQL statement against the base connection.
ActiveRecord::Base.connection.execute("SELECT Name, MAX(Role) FROM User GROUP BY Name")
That would provide exactly the data you need, though it wouldn't be returned as ActiveRecord models. If you need those models then I would use find_by_sql and do an inner join to provide the records.
User.find_by_sql("SELECT User.* FROM User INNER JOIN (SELECT Name AS n, MAX(Role) AS r FROM User GROUP BY Name) U2 WHERE Name = U2.n AND Role = U2.r")
Unfortunately that would provide both records for Graham.
I'm trying to sort documents of type 'Case' by the 'Name' of the 'Contact' they belong to in Solr. But cases have no 'ContactName' field or similar, only 'ContactId'.
Only examples I could find are iterations of the example on this link: https://wiki.apache.org/solr/Join
But I couldn't apply it to my situation because of the sorting afterwards. The following gives me the cases I want but I can't sort it by the contact name afterwards because it only returns the fields of the cases.
{!join from=Id to=ContactId}*:*
SQL equivalent of what I want would be something like:
SELECT Case.Id, Contact.Name
FROM Case
LEFT JOIN Contact
ON Case.ContactId = Contact.Id
ORDER BY Contact.Name ASC;
So to answer my own question after some digging and a Solr training:
It is not best practice to use joins in a NoSql database like Solr. If you need joins, then your database is structured wrong. You should index everything you need, in the document itself, even if it is redundant. So in my case, I should index 'Contact.Name' field in my 'Case' documents.
Still, it is apparently possible to use SQL queries in Solr in case it is absolutely needed, if you're using SolrCloud but it doesn't support joins. However it is possible to work around that as follows:
SELECT s1.Id
FROM salesforce s1, salesforce s2
WHERE s1._type_ = 'Case' and s2._type_ = 'Contact' AND s1.ContactId = s2.Id
ORDER BY s2.Name ASC
It should be noted that the fields after '.' like the 'Id' in 's1.Id' must have 'docValues' activated in the schema. More info on docValues is here.
We have 2 tables: users and statuses
The status table has a user_id, status and occured_on. The status is either 'removed' or 'added' and occured_on is the date the user was removed or added.
I need the current added users. That is, all the (distinct) users whose newest status record is 'added'.
I'm using Rails, and have tried:
User
.joins(:statuses)
.where('statuses.status = ?', 'added')
.order('statuses.occured_on DESC')
.uniq
Which translates to the SQL:
SELECT DISTINCT users.*
FROM users
INNER JOIN statuses
ON statuses.user_id = users.id
WHERE statuses.status = 'added'
ORDER BY statuses.occured_on DESC
That gives me the error:
PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: ...statuses.status = 'added') ORDER BY statuses.oc...
I'd be happy knowing either the Rails code that would work or the straight SQL.
Also, I'd prefer no sub-selects if possible.
Concider the following database schema change:
StatusTable:
StatusId
Status
UserId
ActiveFrom
ActiveTo
Afterwards you can add additional checks such as:
CONSTRAINT chk_from_to CHECK (ActiveFrom <= ActiveTo)
Then your query would look something like:
SELECT users.*
FROM users
JOIN statuses ON UserId = users.user_id AND ActiveFrom < CURRENT_TIMESTAMP AND ActiveTo > CURRENT_TIMESTAMP
WHERE statuses.Status = 'active'
With such structure you might need to change the way you change statuses, but from my own experience, this structure is much more flexible, and easier to query.
SELECT * FROM users INNER JOIN statuses ON users.id=statuses.user_id WHERE statuses.status='added' ORDER BY statuses.occured_on
After clarification, I don't think the schema is well designed for your goal. Can you clarify why you want the status change history contained in that table? My general approach to this would be that active users should be contained in a table called projects_users, containing project_id, user_id. When they are "removed" they should be removed from that table. Logs of the actions - adding and remove users from projects - should be stored in a separate table.
There's no good way that I'm aware of to write this query given your current design. Even if you fixed the errors, this runs error free in MySQL (which is exactly what you have)
SELECT DISTINCT `users`.* FROM `users`
INNER JOIN `projects_users`
ON `users`.`id`=`projects_users`.`user_id`
WHERE `status`='added'
ORDER BY `projects_users`.`occured_on` DESC
it still won't get you the correct results. The ORDER BY clause will just get you the most recent change to "added", it won't guarantee there is not a more recent "removed" action. To do that you'd need to compare the date of each most recent added record to the date of the most recent removed record, for each user, a nightmare.
I have a relationship between two models, Registers and Competitions. I have a very complicated dynamic query that is being built and if the conditions are right I need to limit Registration records to only those where it's Competition parent meets a certain criteria. In order to do this without select from the Competition table I was thinking of something along the lines of...
Register.where("competition_id in ?", Competition.where("...").collect {|i| i.id})
Which produces this SQL:
SELECT "registers".* FROM "registers" WHERE (competition_id in 1,2,3,4...)
I don't think PostgreSQL liked the fact that the in parameters aren't surrounded by parenthesis. How can I compare the Register foreign key to a list of competition ids?
you can make it a bit shorter and skip the collect (this worked for me in 3.2.3).
Register.where(competition_id: Competition.where("..."))
this will result in the following sql:
SELECT "registers".* FROM "registers" WHERE "registers"."competition_id" IN (SELECT "competitions"."id" FROM "competitions" WHERE "...")
Try this instead:
competitions = Competition.where("...").collect {|i| i.id}
Register.where(:competition_id => competitions)
I have a model of Widgets. Widgets belong to a Store model, which belongs to an Area model, which belongs to a Company. At the Company model, I need to find all associated widgets. Easy:
class Widget < ActiveRecord::Base
def self.in_company(company)
includes(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
Which will generate this beautiful query:
> Widget.in_company(Company.first).count
SQL (50.5ms) SELECT COUNT(DISTINCT "widgets"."id") FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1
=> 15088
But, I later need to use this scope in more complex scope. The problem is that AR is expanding the query by selecting individual fields, which fails in PG because selected fields must in the GROUP BY clause or the aggregate function.
Here is the more complex scope.
def self.sum_amount_chart_series(company, start_time)
orders_by_day = Widget.in_company(company).archived.not_void.
where(:print_datetime => start_time.beginning_of_day..Time.zone.now.end_of_day).
group(pg_print_date_group).
select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
end
def self.pg_print_date_group
"CAST((print_datetime + interval '#{tz_offset_hours} hours') AS date)"
end
And this is the select it is throwing at PG:
> Widget.sum_amount_chart_series(Company.first, 1.day.ago)
SELECT "widgets"."id" AS t0_r0, "widgets"."user_id" AS t0_r1,<...BIG SNIP, YOU GET THE IDEA...> FROM "widgets" LEFT OUTER JOIN "stores" ON "stores"."id" = "widgets"."store_id" LEFT OUTER JOIN "areas" ON "areas"."id" = "stores"."area_id" LEFT OUTER JOIN "companies" ON "companies"."id" = "areas"."company_id" WHERE "companies"."id" = 1 AND "widgets"."archived" = 't' AND "widgets"."voided" = 'f' AND ("widgets"."print_datetime" BETWEEN '2011-04-24 00:00:00.000000' AND '2011-04-25 23:59:59.999999') GROUP BY CAST((print_datetime + interval '-7 hours') AS date)
Which generates this error:
PGError: ERROR: column
"widgets.id" must appear in the
GROUP BY clause or be used in an
aggregate function LINE 1: SELECT
"widgets"."id" AS t0_r0,
"widgets"."user_id...
How do I rewrite the Widget.in_company scope so that AR does not expand the select query to include every Widget model field?
As Frank explained, PostgreSQL will reject any query which doesn't return a reproducible set of rows.
Suppose you've a query like:
select a, b, agg(c)
from tbl
group by a
PostgreSQL will reject it because b is left unspecified in the group by statement. Run that in MySQL, by contrast, and it will be accepted. In the latter case, however, fire up a few inserts, updates and deletes, and the order of the rows on disk pages ends up different.
If memory serves, implementation details are so that MySQL will actually sort by a, b and return the first b in the set. But as far as the SQL standard is concerned, the behavior is unspecified -- and sure enough, PostgreSQL does not always sort before running aggregate functions.
Potentially, this might result in different values of b in result set in PostgreSQL. And thus, PostgreSQL yields an error unless you're more specific:
select a, b, agg(c)
from tbl
group by a, b
What Frank highlighted is that, in PostgreSQL 9.1, if a is the primary key, than you can leave b unspecified -- the planner has been taught to ignore subsequent group by fields when applicable primary keys imply a unique row.
For your problem in particular, you need to specify your group by as you currently do, plus every field that you're basing your aggregate onto, i.e. "widgets"."id", "widgets"."user_id", [snip] but not stuff like sum(amount), which are the aggregate function calls.
As an off topic side note, I'm not sure how your ORM/model works but the SQL it's generating isn't optimal. Many of those left outer joins seem like they should be inner joins. This will result in allowing the planner to pick an appropriate join order where applicable.
PostgreSQL version 9.1 (beta at this moment) might fix your problem, but only if there is a functional dependency on the primary key.
From the release notes:
Allow non-GROUP BY columns in the
query target list when the primary key
is specified in the GROUP BY clause
(Peter Eisentraut)
Some other database system already
allowed this behavior, and because of
the primary key, the result is
unambiguous.
You could run a test and see if it fixes your problem. If you can wait for the production release, this can fix the problem without changing your code.
Firstly simplify your life by storing all dates in a standard time-zone. Changing dates with time-zones should really be done in the view as a user convenience. This alone should save you a lot of pain.
If you're already in production write a migration to create a normalised_date column wherever it would be helpful.
nrI propose that the other problem here is the use of raw SQL, which rails won't poke around for you. To avoid this try using the gem called Squeel (aka Metawhere 2) http://metautonomo.us/projects/squeel/
If you use this you should be able to remove hard coded SQL and let rails get back to doing its magic.
For example:
.select("#{pg_print_date_group} as print_date, sum(amount) as total_amount")
becomes (once your remove the need for normalising the date):
.select{sum(amount).as(total_amount)}
Sorry to answer my own question, but I figured it out.
First, let me apologize to those who thought I might be having an SQL or Postgres issue, it is not. The issue is with ActiveRecord and the SQL it is generating.
The answer is... use .joins instead of .includes. So I just changed the line in the top code and it works as expected.
class Widget < ActiveRecord::Base
def self.in_company(company)
joins(:store => {:area => :company}).where(:companies => {:id => company.id})
end
end
I'm guessing that when using .includes, ActiveRecord is trying to be smart and use JOINS in the SQL, but it's not smart enough for this particular case and was generating that ugly SQL to select all associated columns.
However, all the replies have taught me quite a bit about Postgres that I did not know, so thank you very much.
sort in mysql:
> ids = [11,31,29]
=> [11, 31, 29]
> Page.where(id: ids).order("field(id, #{ids.join(',')})")
in postgres:
def self.order_by_ids(ids)
order_by = ["case"]
ids.each_with_index.map do |id, index|
order_by << "WHEN id='#{id}' THEN #{index}"
end
order_by << "end"
order(order_by.join(" "))
end
User.where(:id => [3,2,1]).order_by_ids([3,2,1]).map(&:id)
#=> [3,2,1]