Delete all records that are not the latest - stored-procedures

I have a table that deliberately has duplicates in it. In this instance the things that will be duplicated are a deviceId and the datetime. The table has three columns, deviceId, datetime and value (plus an incremental primary key). Sometimes when the customer re-evaluates their data, they notice that a value is incorrect; they update it and send the data for re-processing. As a consequence, I need to be able to delete the records that are not the very latest. I can't do it by datetime, as that will also be duplicated in some cases, and I can't truncate the staging table.
To delete the duplicates I have the following:
;WITH DupeData AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY tblMeterData_Id, fldDateTime, fldValue, [fldBatchId], [fldProcessed]
                              ORDER BY fldDateTime) AS ROW
    FROM [Stage.tblMeterData])
DELETE FROM DupeData
WHERE ROW > 1
The problem with this is that it seems to delete a random duplicate.
I want to keep the latest record in the staging area and delete any others that are not the latest. I can then update the relevant row with the latest value when I take it from staging into prod.

Is there any primary or unique key on the table?
If there is a unique id, the easiest way is below.
I'm not sure about performance, but it should work fine on small amounts of data.
DELETE FROM [Stage.tblMeterData]
WHERE id IN
    (SELECT id FROM
        (SELECT id,
                ROW_NUMBER() OVER (PARTITION BY tblMeterData_Id, fldDateTime, fldValue, [fldBatchId], [fldProcessed]
                                   ORDER BY fldDateTime) AS ROW
         FROM [Stage.tblMeterData]) q
     WHERE q.ROW > 1)
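If the goal is to keep only the newest row for each (device, datetime) pair, a sketch of that idea (assuming the incremental primary key column is called id and grows with insert order) is to partition only on the columns that are allowed to repeat and order by that key descending:
-- Sketch, assuming the incremental primary key is named id:
-- partition only on the duplicated columns and keep the highest id per group.
;WITH Ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY tblMeterData_Id, fldDateTime
                              ORDER BY id DESC) AS rn
    FROM [Stage.tblMeterData])
DELETE FROM Ranked
WHERE rn > 1;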

Related

Updating existing records with IDs of new rows while using a "with" clause

Platform: Ruby on Rails with PostgreSQL database.
Problem:
We are doing some backfilling to migrate our data to a new structure. It's created a rather convoluted situation, and we'd like to handle it as efficiently as possible. It's partially addressed with SQL similar to this:
with rows as (
insert into responses (prompt_id, answer, received_at, user_id, category_id)
select prompt_id, null as answer, received_at, user_id, category_id
from prompts
where user_status = 0 and skipped is not true
returning id, category_id
)
insert into category_responses (category_id, response_id)
select category_id, id as response_id
from rows;
The tables and columns have been obfuscated/simplified so the reasoning behind it may not be as clear, but category_responses is a many-to-many join table. What we're doing is grabbing existing prompts, and creating a set of empty responses (answer is NULL) for each.
The piece that's missing is to then associate the records in prompts with the newly created responses. Is there a way to do this within the query? I would like to avoid adding a prompt_id column to answers if possible, but I am guessing that would be one way to handle it: include it in the returning clause, then issue a second query to update the prompts table. In any case, I'm not even sure you can run more than one statement off the results of a single with clause.
What's the best way to accomplish this?
I have settled on adding the needed column, and updated the query as follows:
with tab1 as (
  insert into responses (prompt_id, answer, received_at, user_id, category_id)
  select prompt_id, null as answer, received_at, user_id, category_id
  from prompts
  where user_status = 0 and skipped is not true
  returning id, category_id, prompt_id
),
tab2 as (
  update prompts
  set response_id = tab1.id,
      category_id = tab1.category_id
  from tab1
  where prompts.id = tab1.prompt_id
  returning prompts.response_id as response_id, prompts.category_id as category_id
)
insert into category_responses (category_id, response_id)
select category_id, response_id
from tab2;
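Stripped of the domain tables, the pattern is simply that each data-modifying CTE exposes its RETURNING rows to the statements after it. A minimal sketch with hypothetical parent/child tables (names made up for illustration):
-- Minimal sketch with hypothetical tables, to show the chaining pattern only:
-- the INSERT's RETURNING rows feed the UPDATE, whose RETURNING rows feed the
-- final statement.
WITH new_children AS (
    INSERT INTO child (parent_id, label)
    SELECT id, 'placeholder'
    FROM parent
    RETURNING id, parent_id
),
linked AS (
    UPDATE parent
    SET latest_child_id = new_children.id
    FROM new_children
    WHERE parent.id = new_children.parent_id
    RETURNING parent.id, parent.latest_child_id
)
SELECT * FROM linked;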

How to delete all logs except last 100 for each user in single table?

I have a single logs table which contains entries for users. I want to prune it, deleting all but the last 100 entries for each user. I'd like to do this in the most efficient way (one statement, using ActiveRecord if possible).
I know I can use the following:
.order(created_at: :desc) to get the records sorted
.offset(100) to get all records except the ones I want to keep
.ids to pluck the record ids
select(:user_id).distinct to get a list of all users in the table
The table has id, user_id, created_at columns (and others not pertinent to this question).
Each user should have at least the last 100 log entries remaining in the logs table.
I'm not really sure how to do this using Ruby syntax with my Log model. If it can't be done efficiently in Ruby, then I'll resort to the SQL equivalent.
Any help much appreciated.
In SQL, you could do this:
DELETE FROM logs
USING (SELECT id
       FROM (SELECT id,
                    row_number() OVER (PARTITION BY user_id
                                       ORDER BY created_at DESC) AS rownr
             FROM logs
            ) AS a
       WHERE rownr > 100
      ) AS b
WHERE logs.id = b.id;
If the table is large, this will be slow.
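If pruning speed matters, one mitigation (a sketch, assuming Postgres and the columns above) is an index that matches the window's partitioning and ordering, so each user's rows can be read in order rather than sorted:
-- Sketch, assuming Postgres: supports the PARTITION BY user_id
-- ORDER BY created_at DESC window without a full sort of the table.
CREATE INDEX IF NOT EXISTS index_logs_on_user_id_and_created_at
    ON logs (user_id, created_at DESC);
From Rails, the raw DELETE statement can still be sent in one round trip with ActiveRecord::Base.connection.execute.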

Order with DISTINCT ids in rails with postgres

I have the following code to join the microposts and activities tables on the micropost_id column, and then order by created_at from the activities table with distinct micropost ids.
Micropost.joins("INNER JOIN activities ON
(activities.micropost_id = microposts.id)").
where('activities.user_id= ?',id).order('activities.created_at DESC').
select("DISTINCT (microposts.id), *")
which should return whole micropost rows. This is not working in my development environment:
PG::InvalidColumnReference: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
If I add activities.created_at to the SELECT DISTINCT, I get repeated micropost ids because they have distinct activities.created_at values. I have done a lot of searching to get here, but the problem persists because of this Postgres rule against random selection.
I want to select distinct micropost_ids ordered by activities.created_at.
Please help.
To start with, we need to quickly cover what SELECT DISTINCT is actually doing. It looks like just a nice keyword to make sure you only get back distinct values, which shouldn't change anything, right? Except as you're finding out, behind the scenes, SELECT DISTINCT is actually acting more like a GROUP BY. If you want to select distinct values of something, you can only order that result set by the same values you're selecting -- otherwise, Postgres doesn't know what to do.
To explain where the ambiguity comes from, consider this simple set of data for your activities:
CREATE TABLE activities (
id INTEGER PRIMARY KEY,
created_at TIMESTAMP WITH TIME ZONE,
micropost_id INTEGER REFERENCES microposts(id)
);
INSERT INTO activities (id, created_at, micropost_id)
VALUES (1, current_timestamp, 1),
       (2, current_timestamp - interval '3 hours', 1),
       (3, current_timestamp - interval '2 hours', 2);
You stated in your question that you want "distinct micropost_id" "based on order of activities.created_at". It's easy to order these activities by descending created_at (1, 3, 2), but both 1 and 2 have the same micropost_id of 1. So if you want the query to return just micropost IDs, should it return 1, 2 or 2, 1?
If you can answer the above question, you need to take your logic for doing so and move it into your query. Let's say that, and I think this is pretty likely, you want this to be a list of microposts which were most recently acted on. In that case, you want to sort the microposts in descending order of their most recent activity. Postgres can do that for you, in a number of ways, but the easiest way in my mind is this:
SELECT micropost_id
FROM activities
JOIN microposts ON activities.micropost_id = microposts.id
GROUP BY micropost_id
ORDER BY MAX(activities.created_at) DESC
Note that I've dropped the SELECT DISTINCT bit in favor of using GROUP BY, since Postgres handles them much better. The MAX(activities.created_at) bit tells Postgres to, for each group of activities with the same micropost_id, sort by only the most recent.
You can translate the above to Rails like so:
Micropost.select('microposts.*')
.joins("JOIN activities ON activities.micropost_id = microposts.id")
.where('activities.user_id' => id)
.group('microposts.id')
.order('MAX(activities.created_at) DESC')
Hope this helps! You can play around with this sqlFiddle if you want to understand more about how the query works.
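Another route, as a sketch (Postgres-specific, assuming the same schema as above), is DISTINCT ON, which keeps exactly one row per micropost_id and lets you pick which one via the inner ORDER BY:
-- Sketch using Postgres's DISTINCT ON: the inner query keeps the newest
-- activity per micropost_id, the outer one orders those survivors by recency.
SELECT micropost_id, created_at
FROM (
    SELECT DISTINCT ON (micropost_id) micropost_id, created_at
    FROM activities
    ORDER BY micropost_id, created_at DESC
) latest
ORDER BY created_at DESC;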
Try the below code
Micropost.select('microposts.*, activities.created_at')
.joins("INNER JOIN activities ON (activities.micropost_id = microposts.id)")
.where('activities.user_id= ?',id)
.order('activities.created_at DESC')
.uniq

DB2 joins difficulties

I have the following situation (simplified):
2 BiTemp Tables
basicdata (id, btmp_tsd, name, prename)
extendeddata (id, btmp_tsd, basicid, codename, codevalue)
In extendeddata, there can be multiple entries for one basicdata row, each with a different codename and value.
I have to create an SQL to select all rows which have changed since a specified time. For the basicdata table this is relatively simple:
SELECT ID, BTMP_TSD, NAME, PRENAME
FROM BASICDATA BD
WHERE BTMP_TSD =
    (SELECT MAX(BTMP_TSD)
     FROM BASICDATA BD2
     WHERE BD2.ID = BD.ID
       AND BD2.BTMP_TSD > :MINTSD
       AND BD2.BTMP_TSD <= :MAXTSD
    )
ORDER BY ID
WITH UR
Now I need to join the second table to get the codevalue for the codename 'test'. The problem is that it may not exist; in that case the row should be returned anyway. But if there is a row and it is not within the time range, I should not get a result.
I hope I was able to explain my issue. Joins are one of the things I still don't quite see through...
Edit:
Okay here's a sample
basicdata:
id,btmp_tsd,name,prename
1,2013-05-25,test,user
2,2013-06-26,user,two
3,2013-06-26,peter,hans
1,2013-06-20,test,us3r
2,2013-10-30,us3r,two
extendeddata:
id,btmp_tsd,basicid,codename,codevalue
1,2013-05-25,1,superadmin,1
2,2013-06-26,3,admin,1
3,2013-11-25,1,superadmin,0
Given these entries, if I want all user ids which have had any changes since 2013-10-01, I should get:
User1 (because the extendeddata superadmin entry changed)
User2 (had a name change, and I want him even though he has no entry in the extendeddata table)
but not User3 (he has entries in both tables, but none within the specified range)
The following query should do what you want.
select *
from basicdata b left outer join extendeddata e on b.id=e.basicid
where b.btmp_tsd >= '2013-10-01'
or e.btmp_tsd >= '2013-10-01'
DISCLAIMER: I didn't test the sql. So syntax might not be 100% perfect.
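Building on that, a sketch of one way to also pull the 'test' codevalue while keeping basicdata rows that have no matching extendeddata at all (untested, DB2 syntax and the simplified column names assumed):
-- Sketch only: latest basicdata row per id, left-joined to the latest 'test'
-- extendeddata row; a row is kept when either side changed inside the range.
SELECT BD.ID, BD.BTMP_TSD, BD.NAME, BD.PRENAME, ED.CODEVALUE
FROM BASICDATA BD
LEFT OUTER JOIN
     (SELECT E.BASICID, E.BTMP_TSD, E.CODEVALUE
      FROM EXTENDEDDATA E
      WHERE E.CODENAME = 'test'
        AND E.BTMP_TSD = (SELECT MAX(E2.BTMP_TSD)
                          FROM EXTENDEDDATA E2
                          WHERE E2.BASICID = E.BASICID
                            AND E2.CODENAME = E.CODENAME)
     ) ED
  ON ED.BASICID = BD.ID
WHERE BD.BTMP_TSD = (SELECT MAX(BD2.BTMP_TSD)
                     FROM BASICDATA BD2
                     WHERE BD2.ID = BD.ID)
  AND (   (BD.BTMP_TSD > :MINTSD AND BD.BTMP_TSD <= :MAXTSD)
       OR (ED.BTMP_TSD > :MINTSD AND ED.BTMP_TSD <= :MAXTSD))
ORDER BY BD.ID
WITH UR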

Adding unique two-column index with already not unique data

I have a rails app and need to add a unique constraint, so that a :record never has the same (:user, :hour) combination.
I imagine the best way to do this is by adding a unique index:
add_index :records, [:user_id, :hour], :unique => true
The problem is, the migration I wrote to do that fails, because my database already has non-unique combinations. How do I find those combinations?
This answer suggests "check with GROUP BY and COUNT" but I'm a total newbie, and I would love some help interpreting that.
Do I write a helper method to do that? Where in my app would that go?
It's too complex to do it in the console, right?
Or should I be looking at some sort of a script?
Thank you!
Run this query in your database console: SELECT user_id, hour, COUNT(*) AS n FROM records GROUP BY user_id, hour HAVING n > 1 (group by both columns of the intended unique index)
Fix the duplicate rows
Re-run your migration
IMHO, you should edit your duplicate data manually so that you can be sure the data is correctly fixed.
Update:
The OP didn't mention they are using Postgres, and the query above uses MySQL syntax (Postgres will not accept the alias n in the HAVING clause).
For Postgres:
Based on this solution: Find duplicate rows with PostgreSQL
Run this query:
SELECT * FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY merchant_Id, url ORDER BY id ASC) AS Row
    FROM Photos
) dups
WHERE dups.Row > 1
More explanation:
In order for you to execute the migration and add the unique constraint to your columns, you need to fix the current data first. There's usually no automatic step for this, to make sure you won't end up with incorrect data.
That's why you need to manually find the duplicate rows and fix it. The given query will show you which rows are duplicates. So, from there, fix the data and you should be able to run the migration.
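Adapted to this question's schema (a sketch, assuming a records table with user_id and hour columns as in the migration), the same idea looks like:
-- Sketch for the question's own table: every row beyond the first for each
-- (user_id, hour) combination is a duplicate that blocks the unique index.
SELECT *
FROM (
    SELECT id, user_id, hour,
           ROW_NUMBER() OVER (PARTITION BY user_id, hour ORDER BY id) AS row_num
    FROM records
) dups
WHERE dups.row_num > 1;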
One more update:
The duplicated rows do not get marked as such. For example, if you get this kind of result:
ID ROW
235 2
236 3
2 2
3 3
You should select the row with id=235 and then select every row with the same column values as id=235. From there, you'll see every id that is a duplicate of id=235. Then just edit them one by one.
