Adding unique two-column index with already not unique data

I have a rails app and need to add a unique constraint, so that a :record never has the same (:user, :hour) combination.
I imagine the best way to do this is by adding a unique index:
add_index :records, [:user_id, :hour], :unique => true
The problem is, the migration I wrote to do that fails, because my database already has non-unique combinations. How do I find those combinations?
This answer suggests "check with GROUP BY and COUNT" but I'm a total newbie, and I would love some help interpreting that.
Do I write a helper method to do that? Where in my app would that go?
It's too complex to do it in the console, right?
Or should I be looking at some sort of a script?
Thank you!

1. Run this query in your database console, grouping by both columns of the intended index: SELECT user_id, hour, COUNT(*) AS n FROM records GROUP BY user_id, hour HAVING n > 1
2. Fix the duplicate rows.
3. Re-run your migration.
IMHO, you should edit your duplicate data manually so that you can be sure the data is correctly fixed.
Update:
OP didn't mention he/she is using Postgres and I gave a solution for MySQL.
For Postgres:
Based on this solution: Find duplicate rows with PostgreSQL
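Run this query (it keeps the table and column names from that answer; in your case substitute records, user_id and hour):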
SELECT * FROM (
SELECT id,
ROW_NUMBER() OVER(PARTITION BY merchant_Id, url ORDER BY id asc) AS Row
FROM Photos
) dups
WHERE
dups.Row > 1
More explanation:
Before you can run the migration and add the unique constraint to your columns, you need to fix the existing data. There is usually no automatic step for this, precisely so that you don't end up with incorrect data.
That's why you need to find the duplicate rows manually and fix them. The given query will show you which rows are duplicates. From there, fix the data and you should be able to run the migration.
Mooore update:
The query lists only the extra rows; the first row of each duplicate group (Row = 1) is not shown. For example, if you get this kind of result:
ID ROW
235 2
236 3
2 2
3 3
Select the row with id=235, then select every row that has the same column values as that row. That shows you every id that duplicates id=235. Then edit them one by one.
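If it helps to see what the two queries are doing, here is a minimal plain-Ruby sketch of the same logic on an in-memory array. The sample rows and the method names are made up for illustration; only user_id and hour come from the question.

```ruby
# Hypothetical stand-in for the records table (sample data is invented).
SAMPLE_RECORDS = [
  { id: 1, user_id: 10, hour: 9 },
  { id: 2, user_id: 10, hour: 9 },
  { id: 3, user_id: 10, hour: 10 },
  { id: 4, user_id: 11, hour: 9 },
  { id: 5, user_id: 10, hour: 9 }
].freeze

# Mirrors GROUP BY user_id, hour HAVING COUNT(*) > 1:
# returns a hash of [user_id, hour] => rows, only for groups with duplicates.
def duplicate_groups(rows)
  rows.group_by { |r| [r[:user_id], r[:hour]] }
      .select { |_key, group| group.size > 1 }
end

# Mirrors ROW_NUMBER() OVER (PARTITION BY ... ORDER BY id) > 1:
# returns the ids of every row after the first (lowest id) in each group.
def extra_row_ids(rows)
  duplicate_groups(rows).flat_map do |_key, group|
    group.sort_by { |r| r[:id] }.drop(1).map { |r| r[:id] }
  end
end

p duplicate_groups(SAMPLE_RECORDS).keys # => [[10, 9]]
p extra_row_ids(SAMPLE_RECORDS)         # => [2, 5]
```

In a Rails console the grouping step could probably be written as Record.group(:user_id, :hour).having("COUNT(*) > 1").count, assuming the model is called Record.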

Related

Updating existing records with IDs of new rows while using a "with" clause

Platform: Ruby on Rails with PostgreSQL database.
Problem:
We are doing some backfilling to migrate our data to a new structure. It's created a rather convoluted situation, and we'd like to handle it as efficiently as possible. It's partially addressed with SQL similar to this:
with rows as (
insert into responses (prompt_id, answer, received_at, user_id, category_id)
select prompt_id, null as answer, received_at, user_id, category_id
from prompts
where user_status = 0 and skipped is not true
returning id, category_id
)
insert into category_responses (category_id, response_id)
select category_id, id as response_id
from rows;
The tables and columns have been obfuscated/simplified so the reasoning behind it may not be as clear, but category_responses is a many-to-many join table. What we're doing is grabbing existing prompts, and creating a set of empty responses (answer is NULL) for each.
The piece that's missing is to then associate the records in prompts with the newly created responses. Is there a way to do this within the query? I would like to avoid adding a prompt_id column to answers if possible, but I am guessing this would be one way to handle that, including it in the returning clause, then issuing a second query to update the prompts table - and anyway I'm not even sure you can run more than one query with the results of a single with clause.
What's the best way to accomplish this?

I have settled on adding the needed column, and updated the query as follows:
with tab1 as (
insert into responses (prompt_id, answer, received_at, user_id, category_id)
select prompt_id, null as answer, received_at, user_id, category_id
from prompts
where user_status = 0 and skipped is not true
returning id, category_id, prompt_id
),
tab2 as (
update prompts
-- tab1 returns the new response's id as "id", not "response_id"
set response_id = tab1.id,
category_id = tab1.category_id
from tab1
where prompts.id = tab1.prompt_id
returning prompts.response_id as response_id, prompts.category_id as category_id
)
insert into category_responses (category_id, response_id)
-- tab2 already exposes the column as response_id
select category_id, response_id
from tab2;

How to delete all logs except last 100 for each user in single table?

I have a single logs table which contains entries for users. I want to (prune) delete all but the last 100 for each user. I'd like to do this in the most efficient way (one statement using ActiveRecord if possible).
I know I can use the following:
.order(created_at: :desc) to get the records sorted
.offset(100) to get all records except the ones I want to keep
.ids to pluck the record ids
select(:user_id).distinct to get a list of all users in the table
The table has id, user_id, created_at columns (and others not pertinent to this question).
Each user should have at least the last 100 log entries remaining in the logs table.
Not really sure how to do this using ruby syntax with my Log model. If it can't be done efficiently using ruby then I'll resort to using the SQL equivalent.
Any help much appreciated.

In SQL, you could do this:
DELETE FROM logs
USING (SELECT id
FROM (SELECT id,
row_number()
OVER (PARTITION BY user_id
ORDER BY created_at DESC)
AS rownr
FROM logs
) AS a
WHERE rownr > 100
) AS b
WHERE logs.id = b.id;
If the table is large, this will be slow.
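To make the window-function logic easier to follow, here is a rough plain-Ruby sketch of the same rule (keep the newest N rows per user, collect the ids of the rest). The method name and the keep parameter are made up for illustration, and it works on an in-memory array, so it only illustrates the logic and is not a replacement for the single DELETE statement.

```ruby
# Returns the ids of log rows that fall outside the newest `keep`
# entries of their user. Each row is a hash with :id, :user_id and
# :created_at (the column names from the question).
def prunable_log_ids(logs, keep: 100)
  logs.group_by { |log| log[:user_id] }.flat_map do |_user_id, user_logs|
    user_logs.sort_by { |log| log[:created_at] } # oldest first
             .reverse                            # newest first
             .drop(keep)                         # everything after the newest `keep`
             .map { |log| log[:id] }
  end
end

# Invented sample data: user 1 has five entries, user 2 has one.
logs = (1..5).map { |i| { id: i, user_id: 1, created_at: Time.at(i) } }
logs << { id: 6, user_id: 2, created_at: Time.at(1) }

p prunable_log_ids(logs, keep: 2).sort # => [1, 2, 3]
```

With ActiveRecord, ids found this way could be removed with something like Log.where(id: ids).delete_all, but for a large table the SQL above avoids loading the rows into Ruby at all.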

Rails 4 - Order by the presence of an attribute

Using Rails 4. Psql DB.
I have a model Article with an attribute amazon_title. I am having trouble understanding how I can order my articles so articles with the amazon_title present are first, and ones without are second.
I've tried ordering them like this with no success:
Article.all.order(amazon_title: :desc)
The above orders it alphabetically, showing blank first, present second, and nil third.
I feel like this is very simple, but for some reason I cannot find the answer. Thanks!

For PostgreSQL (non-empty titles come first, then empty strings, then NULLs):
Article.order('amazon_title DESC NULLS LAST')
Another option (database agnostic), since amazon_title is a string and cannot be used directly as a boolean condition:
Article.order("CASE WHEN amazon_title IS NULL THEN 3 WHEN amazon_title = '' THEN 2 ELSE 1 END ASC")
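The ranking idea behind the CASE expression can be sketched in plain Ruby. This is only an illustration on an in-memory array; the method names are made up, and the rank order (present, then blank, then nil) is my assumption about what the question wants.

```ruby
# Rank used for ordering: 1 = title present, 2 = empty string, 3 = nil.
def presence_rank(title)
  return 3 if title.nil?
  title.empty? ? 2 : 1
end

# Sorts by rank and uses the original position as a tie-breaker,
# so rows with the same rank keep their relative order.
def order_by_title_presence(articles)
  articles.sort_by.with_index { |article, i| [presence_rank(article[:amazon_title]), i] }
end

articles = [
  { id: 1, amazon_title: nil },
  { id: 2, amazon_title: "Book" },
  { id: 3, amazon_title: "" },
  { id: 4, amazon_title: "Another" }
]

p order_by_title_presence(articles).map { |a| a[:id] } # => [2, 4, 3, 1]
```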

In PostgreSQL you can pass NULLS FIRST or NULLS LAST depending on your requirement. That's why I asked you about your DB.
Article.order('amazon_title DESC NULLS FIRST')
will list the NULL records first, and
Article.order('amazon_title DESC NULLS LAST')
will list the NULL records last.
Hope that helps!

Delete all records that are not the latest

I have a table that deliberately has duplicates in it. In this instance the things that will be duplicated are a deviceId and the datetime. Sometimes the customer updates their data. The table has three columns, deviceId, datetime and value (plus an incremental primary key). Sometimes when the customer re-evaluates their data, they notice that a value is incorrect; they then update it and send the data for re-processing. As a consequence, I need to be able to delete records that are not the very latest records. I can't do it by datetime, as this will also be duplicated in some cases, and I can't truncate the staging table.
To delete the dupes I have the following:
;WITH DupeData AS (
SELECT ROW_NUMBER() OVER(PARTITION BY tblMeterData_Id,fldDateTime, fldValue, [fldBatchId],[fldProcessed] ORDER BY fldDateTime) AS ROW
FROM [Stage.tblMeterData])
DELETE FROM DupeData
WHERE ROW > 1
The problem with this is that it seems to delete a random duplicate.
I want to keep the latest record that is in the staging area and delete any others that are not the latest record. I can then update the relevant row with the new value, with the latest data, when I take it from staging into prod.

Is there a primary or unique key on the table?
If there is a unique id, the easiest way is below. I'm not sure about performance, but it should work fine on small amounts of data. Ordering by id DESC gives the newest row ROW = 1, so that is the row that is kept:
DELETE FROM [Stage.tblMeterData]
WHERE id IN
(SELECT id FROM
(SELECT id,
ROW_NUMBER() OVER(PARTITION BY tblMeterData_Id, fldDateTime, fldValue, [fldBatchId], [fldProcessed] ORDER BY id DESC) AS ROW
FROM [Stage.tblMeterData]) q
WHERE q.ROW > 1)
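For what it's worth, here is a small plain-Ruby sketch of the "keep only the latest row per group" rule, using the incremental primary key as the tie-breaker. Note the assumptions: it groups only by device id and datetime (not by value, unlike the PARTITION BY list above), and the hash keys and method name are invented for the example.

```ruby
# Returns the ids of every row that is not the newest (highest id)
# in its [device_id, datetime] group.
def stale_ids(rows)
  rows.group_by { |r| [r[:device_id], r[:datetime]] }.flat_map do |_key, group|
    ids = group.map { |r| r[:id] }
    ids - [ids.max] # keep the highest id, report the rest
  end
end

# Invented sample: the reading for device "A" at t1 was corrected twice.
rows = [
  { id: 1, device_id: "A", datetime: "t1", value: 5 },
  { id: 2, device_id: "A", datetime: "t1", value: 6 },
  { id: 3, device_id: "A", datetime: "t1", value: 7 },
  { id: 4, device_id: "A", datetime: "t2", value: 9 }
]

p stale_ids(rows) # => [1, 2]
```

Grouping by value as well would put each corrected reading into its own group, so the older values would never be picked up as stale.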

Rails Postgres Error GROUP BY clause or be used in an aggregate function

In SQLite (development) I don't have any errors, but in production with Postgres I get the following error. I don't really understand the error.
PG::Error: ERROR: column "commits.updated_at" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...mmits"."user_id" = 1 GROUP BY mission_id ORDER BY updated_at...
^
: SELECT COUNT(*) AS count_all, mission_id AS mission_id FROM "commits" WHERE "commits"."user_id" = 1 GROUP BY mission_id ORDER BY updated_at DESC
My controller method:
def show
  @user = User.find(params[:id])
  @commits = @user.commits.order("updated_at DESC").page(params[:page]).per(25)
  @missions_commits = @commits.group("mission_id").count.length
end
UPDATE:
So I dug further into this PostgreSQL-specific annoyance, and I am surprised that this exception is not mentioned in the Ruby on Rails Guide.
I am using psql (PostgreSQL) 9.1.11
So from what I understand, I need to specify which column should be used whenever I use the GROUP BY clause. I thought using SELECT would help, which can be annoying if you need to SELECT a lot of columns.
Interesting discussion here
Anyway, when I look at the error, the cursor always points to updated_at. In the SQL query, Rails will always ORDER BY updated_at. So I have tried this horrible query:
@commits.group("mission_id, date(updated_at)")
.select("date(updated_at), count(mission_id)")
.having("count(mission_id) > 0")
.order("count(mission_id)").length
which gives me the following SQL
SELECT date(updated_at), count(mission_id)
FROM "commits"
WHERE "commits"."user_id" = 1
GROUP BY mission_id, date(updated_at)
HAVING count(mission_id) > 0
ORDER BY updated_at DESC, count(mission_id)
LIMIT 25 OFFSET 0
the error is the same.
Note that no matter what, it will ORDER BY updated_at, even if I wanted to order by something else.
Also, I don't want to group the records by updated_at, just by mission_id.
This PostgreSQL error is misleading and gives little explanation of how to solve it. I have tried many formulas from the Stack Overflow sidebar; nothing works and it is always the same error.
UPDATE 2:
So I got it to work, but it needs to group by updated_at because of the automatic ORDER BY updated_at. How do I count only by mission_id?
@missions_commits = @commits.group("mission_id, updated_at").count("mission_id").size

I guess you want to show the total number of distinct Missions related to the user's Commits; in any case, that won't be the number on the current page.
Try this:
@commits = @user.commits.order("updated_at DESC").page(params[:page]).per(25)
@missions_commits = @user.commits.distinct.count(:mission_id)
However, if you want the number of distinct Missions on the current page, I suppose it should be:
@missions_commits = @commits.collect(&:mission_id).uniq.count
Update
In Rails 3, distinct did not exist, so the SQL-side count should be written this way:
@missions_commits = @user.commits.count(:mission_id, distinct: true)

See the docs for PostgreSQL GROUP BY here:
http://www.postgresql.org/docs/9.3/interactive/sql-select.html#SQL-GROUPBY
Basically, unlike SQLite (and MySQL), Postgres requires that any column selected or ordered on must appear in an aggregate function or in the GROUP BY clause.
If you think it through, you'll see that this actually makes sense. SQLite and MySQL (when ONLY_FULL_GROUP_BY is off) cheat under the hood and silently pick the value from an arbitrary row in each group.
Or, thinking about it another way: if you are grouping by one field, what's the point of ordering by another? How would that even make sense unless you also had an aggregate function on the ordered field?
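A tiny plain-Ruby sketch may make the point clearer: once commits are grouped by mission_id, each group holds several updated_at values, so ordering only makes sense after reducing them to one value, for example the maximum. In SQL that would be roughly SELECT mission_id, COUNT(*), MAX(updated_at) FROM commits GROUP BY mission_id ORDER BY MAX(updated_at) DESC. The sample data and the method name below are made up.

```ruby
# Groups commits by mission_id, then reduces each group to a single row:
# the commit count plus the latest updated_at (the aggregate that makes
# ordering by updated_at well defined).
def mission_summaries(commits)
  commits.group_by { |c| c[:mission_id] }
         .map do |mission_id, group|
           {
             mission_id: mission_id,
             count: group.size,
             last_updated: group.map { |c| c[:updated_at] }.max
           }
         end
         .sort_by { |summary| summary[:last_updated] }
         .reverse # newest first, like ORDER BY MAX(updated_at) DESC
end

commits = [
  { mission_id: 1, updated_at: Time.at(100) },
  { mission_id: 1, updated_at: Time.at(300) },
  { mission_id: 2, updated_at: Time.at(200) }
]

p mission_summaries(commits).map { |s| [s[:mission_id], s[:count]] } # => [[1, 2], [2, 1]]
```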
