PostgreSQL gapless sequences - ruby-on-rails

I'm moving from MySQL to Postgres, and I noticed that when you delete rows in MySQL, the unique ids for those rows are re-used when you create new ones. With Postgres, if you create rows and delete them, the unique ids are not used again.
Is there a reason for this behaviour in Postgres? Can I make it act more like MySql in this case?

Sequences have gaps to permit concurrent inserts. Attempting to avoid gaps or to re-use deleted IDs creates horrible performance problems. See the PostgreSQL wiki FAQ.
PostgreSQL SEQUENCEs are used to allocate IDs. These only ever increase, and they're exempt from the usual transaction rollback rules to permit multiple transactions to grab new IDs at the same time. This means that if a transaction rolls back, those IDs are "thrown away"; there's no list of "free" IDs kept, just the current ID counter. Sequences are also usually incremented if the database shuts down uncleanly.
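A quick way to see this for yourself, using a throwaway table (names here are illustrative):
CREATE TABLE t (id serial PRIMARY KEY, v text);
BEGIN;
INSERT INTO t (v) VALUES ('a');  -- consumes id 1 from the sequence
ROLLBACK;                        -- the row is gone, but the sequence is not rewound
INSERT INTO t (v) VALUES ('b');  -- gets id 2
SELECT id, v FROM t;             -- one row: (2, 'b')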
Synthetic keys (IDs) are meaningless anyway. Their order is not significant, their only property of significance is uniqueness. You can't meaningfully measure how "far apart" two IDs are, nor can you meaningfully say if one is greater or less than another. All you can do is say "equal" or "not equal". Anything else is unsafe. You shouldn't care about gaps.
If you need a gapless sequence that re-uses deleted IDs, you can have one, you just have to give up a huge amount of performance for it - in particular, you cannot have any concurrency on INSERTs at all, because you have to scan the table for the lowest free ID, locking the table for write so no other transaction can claim the same ID. Try searching for "postgresql gapless sequence".
The simplest approach is to use a counter table and a function that gets the next ID. Here's a generalized version that uses a counter table to generate consecutive gapless IDs; it doesn't re-use IDs, though.
CREATE TABLE thetable_id_counter ( last_id integer not null );
INSERT INTO thetable_id_counter VALUES (0);
CREATE OR REPLACE FUNCTION get_next_id(countertable regclass, countercolumn text) RETURNS integer AS $$
DECLARE
next_value integer;
BEGIN
EXECUTE format('UPDATE %s SET %I = %I + 1 RETURNING %I', countertable, countercolumn, countercolumn, countercolumn) INTO next_value;
RETURN next_value;
END;
$$ LANGUAGE plpgsql;
COMMENT ON FUNCTION get_next_id(regclass, text) IS 'Increment and return value from integer column $2 in table $1';
Usage:
INSERT INTO dummy(id, blah)
VALUES ( get_next_id('thetable_id_counter','last_id'), 42 );
Note that when one open transaction has obtained an ID, all other transactions that try to call get_next_id will block until the first transaction commits or rolls back. This is unavoidable: for gapless IDs, it is by design.
If you want to store multiple counters for different purposes in one table, add a parameter to the above function, add a column to the counter table that identifies the counter, and add a WHERE clause to the UPDATE that matches the parameter to that column. That way you can have multiple independently-locked counter rows; a sketch follows below. Do not just add extra counter columns to the one row.
This function does not re-use deleted IDs, it just avoids introducing gaps.
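For illustration, the multi-counter variant described above might look like this (an untested sketch; the table and function names are made up):
CREATE TABLE id_counters (
    counter_name text PRIMARY KEY,
    last_id integer NOT NULL DEFAULT 0
);
INSERT INTO id_counters (counter_name) VALUES ('invoices'), ('receipts');

CREATE OR REPLACE FUNCTION get_next_id_for(counter text) RETURNS integer AS $$
DECLARE
    next_value integer;
BEGIN
    -- Each counter row is locked independently, so unrelated
    -- counters don't block each other.
    UPDATE id_counters
       SET last_id = last_id + 1
     WHERE counter_name = counter
    RETURNING last_id INTO next_value;
    RETURN next_value;
END;
$$ LANGUAGE plpgsql;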
To re-use IDs I advise ... not re-using IDs.
If you really must, you can do so by adding an ON INSERT OR UPDATE OR DELETE trigger on the table of interest that adds deleted IDs to a free-list side table and removes them from it when they're INSERTed. Treat an UPDATE as a DELETE followed by an INSERT. Now modify the ID generation function above so that it does a SELECT free_id INTO next_value FROM free_ids FOR UPDATE LIMIT 1 and, if a row is found, DELETEs it; if not found, it gets a new ID from the generator table as normal. Here's an untested extension of the prior function to support re-use:
CREATE OR REPLACE FUNCTION get_next_id_reuse(countertable regclass, countercolumn text, freelisttable regclass, freelistcolumn text) RETURNS integer AS $$
DECLARE
next_value integer;
BEGIN
EXECUTE format('SELECT %I FROM %s FOR UPDATE LIMIT 1', freelistcolumn, freelisttable) INTO next_value;
IF next_value IS NOT NULL THEN
EXECUTE format('DELETE FROM %s WHERE %I = %L', freelisttable, freelistcolumn, next_value);
ELSE
EXECUTE format('UPDATE %s SET %I = %I + 1 RETURNING %I', countertable, countercolumn, countercolumn, countercolumn) INTO next_value;
END IF;
RETURN next_value;
END;
$$ LANGUAGE plpgsql;
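And an equally untested sketch of the free-list side table and trigger described above (ON CONFLICT needs PostgreSQL 9.5+; all names are illustrative):
CREATE TABLE free_ids ( free_id integer PRIMARY KEY );

CREATE OR REPLACE FUNCTION track_free_ids() RETURNS trigger AS $$
BEGIN
    -- Treat UPDATE as DELETE followed by INSERT:
    -- free the old ID, then claim the new one.
    IF TG_OP IN ('UPDATE', 'DELETE') THEN
        INSERT INTO free_ids (free_id) VALUES (OLD.id)
        ON CONFLICT DO NOTHING;
    END IF;
    IF TG_OP IN ('INSERT', 'UPDATE') THEN
        DELETE FROM free_ids WHERE free_id = NEW.id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER thetable_track_free_ids
AFTER INSERT OR UPDATE OR DELETE ON thetable
FOR EACH ROW EXECUTE PROCEDURE track_free_ids();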


BigQuery - parametrize tables and columns in a stored procedure

Consider an enterprise that captures sensor data for different production facilities. Per facility, we create an aggregation query that averages the values into 5-minute timeslots. This query consists of a long list of WITH clauses and writes data to a table (called aggregation_table).
Now my problem: currently we have n queries running that run exactly the same logic; the only thing that differs is the table names (and sometimes the column names, but let's ignore that for now).
Instead of managing n different scripts that are basically the same, I would like to put it in a stored procedure that is able to work like this:
CALL aggregation_query(facility_name) -> resolve the different tables for that facility and then use them in the different WITH clauses
On top of that, instead of having this long set of clauses that give me the end result, I would like to chunk them up into logical blocks that are parametrizable. For example, if I call the aforementioned stored procedure for facility A, I want to be able to pass/use this table name in these different functions, where the output can be re-used in the next statement (like you would do with WITH clauses).
Another reason why I want to chunk this up into re-usable blocks is that we have many "derivatives" of this aggregation query, for example to manage historical data, to correct data, or to have the sensor data at another aggregation level. As these become overly complex, it is much easier to manage them without having to copy-paste and adjust them every time.
In the current set-up, it is useful to know that I am only entitled to use plain BigQuery, as my team is not allowed to access the CI/CD / scheduling and repository (meaning that I cannot solve the issue by having CI/CD that deploys the n different versions of the procedure and functions).
So in the end, I would like to end up with something like this using only BigQuery:
CREATE OR REPLACE PROCEDURE `aggregation_function`()
BEGIN
DECLARE tablename STRING;
DECLARE active_table_name STRING;

## get list of tables
CREATE TEMP TABLE tableNames AS
SELECT
  table_catalog,
  table_schema,
  table_name
FROM
  `catalog.schema.INFORMATION_SCHEMA.TABLES`
WHERE
  table_name = tablename;

WHILE (SELECT COUNT(*) FROM tableNames) >= 1 DO
  ## build dataset + table name
  SET active_table_name = CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`');

  ## use CONCAT to build the string and execute
  EXECUTE IMMEDIATE '''
  INSERT INTO
    `aggregation_table_for_facility` (timeslot, sensor_name, AVG_VALUE)
  WITH
  STEP_1 AS (
    SELECT *
    FROM my_table_function_step_1(active_table_name, parameter1, parameter2)
  ),
  STEP_2 AS (
    SELECT *
    FROM my_table_function_step_2(STEP_1, parameter1, parameter2)
  )
  SELECT * FROM STEP_2
  '''
  USING active_table_name AS active_table_name;

  DELETE FROM tableNames WHERE table_name = tablename;
END WHILE;
END;
I was hoping someone could provide a snippet of how I can do this in standard SQL / BigQuery, so basically:
a stored procedure that takes in a string variable and is able to use that as a table (partly solved in the approach above, but not sure if there are better ways)
a (table) function that is able to take this table_name parameter as well and return a table that can be used in the next WITH clause (or alternatively writes to a temp table)
I think the code snippets below should provide you with some insights for dealing with procedures, inserts, and EXECUTE IMMEDIATE statements.
Here I'm creating a procedure which inserts values into a table that exists in the information schema. Also, since I want to return a value, I use OUT active_table_name to return the value I assigned inside the procedure.
CREATE OR REPLACE PROCEDURE `project-id.dataset`.custom_function(tablename STRING, OUT active_table_name STRING)
BEGIN
DECLARE query STRING;

SET active_table_name = (
  SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`')
  FROM `project-id.dataset.INFORMATION_SCHEMA.TABLES`
  WHERE table_name = tablename
);

# multiline queries can be handled by using ''' or """
SET query = '''
insert into %s (string_field_0,string_field_1,string_field_2,string_field_3,string_field_4,int64_field_5)
with custom_query as (
  select string_field_0,string_field_2,'169 BestCity',string_field_3,string_field_4,55677 from %s limit 1
)
select * from custom_query;
''';

# pass parameters into the string using FORMAT;
# the dynamic statement must be the last operation the procedure performs
EXECUTE IMMEDIATE (FORMAT(query, active_table_name, active_table_name));
END
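A standalone invocation would look something like this (the source table name is illustrative):
DECLARE active_table_name STRING;
CALL `project-id.dataset`.custom_function('my_source_table', active_table_name);
SELECT active_table_name;  # e.g. `project-id.dataset.my_source_table`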
You can also use a loop to iterate through the records of a working table, so it will execute the procedure and also be able to get the value from the procedure to use somewhere else, e.g. in a second procedure that performs a delete operation.
DECLARE tablename STRING;
DECLARE out_value STRING;

FOR record IN (SELECT tablename FROM `my-project-id.dataset.table`)
DO
  SET tablename = record.tablename;
  CALL `project-id.dataset`.custom_function(tablename, out_value);
  SELECT out_value;
END FOR;
To recap, there are some restrictions, such as not being able to call procedures inside an EXECUTE IMMEDIATE, or to nest an EXECUTE IMMEDIATE inside another EXECUTE IMMEDIATE, to name a few. I think these snippets should help you deal with your current situation.
For this sample I use the following documentation:
Data Manipulation Language
Dealing with outputs
Information Schema Tables
Execute Immediate
For...In
Loops

Evaluating datetime register during trigger event

I was testing the following behaviour of the datetime special registers (stated here):
If the SQL statement in which a datetime special register is used is in a user-defined function or stored procedure that is within the scope of a trigger, Db2 uses the timestamp for the triggering SQL statement to determine the special register value.
So I created a table with a timestamp field and a stored procedure (native SQL) that inserts the same 10 rows into the table, where the timestamp column is given the value of CURRENT TIMESTAMP. Then I created a trigger on some other table (an after-insert trigger).
The result is 10 rows with increasing timestamps. I expected the timestamps to be the same, as in my interpretation the stored procedure was in the scope of a trigger.
Can you help me understand what this statement means?
create trigger date_check
after insert on test
for each row mode db2sql
call date_sp2()
create procedure date_sp2()
language sql
BEGIN
declare i smallint default 0;
my_loop: LOOP
insert into empty_char values('Y','Y','Y','Y',current date, current timestamp);
SET I = I + 1;
IF I = 10 THEN LEAVE my_loop;
END IF;
END LOOP my_loop;
END
My guess is that note 1 applies
https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_currenttimestamp.html#fntarg_1
If this special register is used more than one time within a single SQL statement, or used with CURRENT DATE or CURRENT TIME within a single statement, all values are based on a single clock reading. ¹
¹ Except for the case of a non-atomic multiple row INSERT or MERGE statement.
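The single-clock-reading rule itself is easy to check outside a trigger. Assuming access to SYSIBM.SYSDUMMY1, both columns below come back identical, whereas each INSERT in the looped procedure above is its own statement and so gets its own clock reading:
SELECT CURRENT TIMESTAMP AS t1, CURRENT TIMESTAMP AS t2
FROM SYSIBM.SYSDUMMY1;
-- t1 = t2: one statement, one clock reading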

Left join table on multiple tables in SAS

I've got multiple master tables in the same format with the same variables. I now want to left join another variable but I can't combine the master tables due to limited storage on my computer. Is there a way that I can left join a variable onto multiple master tables within one PROC SQL? Maybe with the help of a macro?
The LEFT JOIN code looks like this for one join, but I'm looking for an alternative to copying and pasting this 5 times:
PROC SQL;
CREATE TABLE New AS
SELECT a.*, b.Value
FROM Old a LEFT JOIN Additional b
ON a.ID = b.ID;
QUIT;
You can't do it in one create table statement, as it only creates one table at a time. But you can do a few things, depending on what your actual limiting factor is (you mention a few).
If you simply want to avoid writing the same code five times, but otherwise don't care how it executes, then just write the code in a macro, as you reference.
%macro update_table(old=, new=);
PROC SQL;
CREATE TABLE &new. AS
SELECT a.*, b.Value
FROM &old. a LEFT JOIN Additional b
ON a.ID = b.ID;
QUIT;
%mend update_table;
%update_table(old=old1, new=new1)
%update_table(old=old2, new=new2)
%update_table(old=old3, new=new3)
Of course, if the names of the five tables are in a pattern, you can perhaps automate this further based on that pattern, but you don't give sufficient information to figure that out.
If, on the other hand, you need to do this more efficiently in terms of processing than running the SQL query five times, it can be done a number of ways, depending on the specifics of your additional table and your specific limitations. It looks to me like you have a good use case for a format lookup here; see for example Jenine Eason's paper, Proc Format, a Speedy Alternative to Sort/Merge. If you're just merging on the ID, this is very easy.
data for_format;
set additional;
start = ID;
label = value;
fmtname='AdditionalF'; *or '$AdditionalF' if ID is character-valued;
output;
if _n_=1 then do; *creating an "other" option so it returns missing if not found;
hlo='o';
label = ' ';
output;
end;
run;
And then you just have five data steps with a PUT statement adding the value, or you could even simply format the ID variable with that format, and it would show that value in most PROCs (if this is something like a classifier that you don't truly need "in" the data).
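A minimal sketch of one such data step, assuming the format has been loaded from the dataset above via PROC FORMAT's CNTLIN option (adjust to $AdditionalF and drop the conversion if value is character):
proc format cntlin=for_format;
run;

data new1;
    set old1;
    /* PUT returns character; INPUT converts back since value is numeric */
    value = input(put(ID, AdditionalF.), best32.);
run;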
You can do this in a single pass through the data in a Data Step, using a hash table to look up values.
data new1 new2 new3;
set old1(in=a) old2(in=b) old3(in=c);
format value best.;
if _n_=1 then do;
%create_hash(lk,id,value,"Additional");
end;
value = .;
rc = lk.find();
drop rc;
if a then
output new1;
else if b then
output new2;
else if c then
output new3;
run;
%create_hash() macro available here.
You could, alternatively, use Joe's format with the same Data Step syntax.

PostgreSQL and ActiveRecord subselect for race condition

I'm experiencing a race condition in ActiveRecord with PostgreSQL where I'm reading a value then incrementing it and inserting a new record:
num = Foo.where(bar_id: 42).maximum(:number)
Foo.create!({
bar_id: 42,
number: num + 1
})
At scale, multiple threads will simultaneously read then write the same value of number. Wrapping this in a transaction doesn't fix the race condition because the SELECT doesn't lock the table. I can't use an auto increment, because number is not unique, it's only unique given a certain bar_id. I see 3 possible fixes:
Explicitly use a postgres lock (a row-level lock?)
Use a unique constraint and retry on fails (yuck!)
Override save to use a subselect, i.e.
INSERT INTO foo (bar_id, number) VALUES (42, (SELECT MAX(number) + 1 FROM foo WHERE bar_id = 42));
All these solutions seem like I'd be reimplementing large parts of ActiveRecord::Base#save! Is there an easier way?
UPDATE:
I thought I found the answer with Foo.lock(true).where(bar_id: 42).maximum(:number) but that uses SELECT ... FOR UPDATE, which isn't allowed on aggregate queries.
UPDATE 2:
I've just been informed by our DBA that even if we could do INSERT INTO foo (bar_id, number) VALUES (42, (SELECT MAX(number) + 1 FROM foo WHERE bar_id = 42)); it wouldn't fix anything, since the SELECT runs under a different lock than the INSERT.
Your options are:
Run in SERIALIZABLE isolation. Interdependent transactions will be aborted at commit with a serialization failure. You'll get lots of error log spam, and you'll be doing lots of retries, but it'll work reliably.
Define a UNIQUE constraint and retry on failure, as you noted. Same issues as above.
If there is a parent object, you can SELECT ... FOR UPDATE the parent object before doing your max query. In this case you'd SELECT 1 FROM bar WHERE bar_id = $1 FOR UPDATE. You are using bar as a lock for all foos with that bar_id. You can then know that it's safe to proceed, so long as every query that's doing your counter increment does this reliably. This can work quite well.
This still does an aggregate query for each call, which (per next option) is unnecessary, but at least it doesn't spam the error log like the above options.
Use a counter table. This is what I'd do. Either in bar, or in a side-table like bar_foo_counter, acquire a row ID using
UPDATE bar_foo_counter SET counter = counter + 1
WHERE bar_id = $1 RETURNING counter
or the less efficient option if your framework can't handle RETURNING:
SELECT counter FROM bar_foo_counter
WHERE bar_id = $1 FOR UPDATE;

UPDATE bar_foo_counter SET counter = counter + 1
WHERE bar_id = $1;
Then, in the same transaction, use the generated counter row for the number. When you commit, the counter table row for that bar_id gets unlocked for the next query to use. If you roll back, the change is discarded.
I recommend the counter approach, using a dedicated side table for the counter instead of adding a column to bar. That's cleaner to model, and means you create less update bloat in bar, which can slow down queries to bar.
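If your framework lets you run raw SQL, the counter bump and the insert can even be combined into a single statement with a data-modifying CTE (PostgreSQL 9.1+). A sketch using the question's names and the side table above:
WITH next AS (
    UPDATE bar_foo_counter
       SET counter = counter + 1
     WHERE bar_id = 42
    RETURNING counter
)
INSERT INTO foo (bar_id, number)
SELECT 42, counter FROM next;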

MySQL stored procedure causing problems?

EDIT:
I've narrowed my MySQL wait timeout down to this line:
IF @resultsFound > 0 THEN
INSERT INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
END IF;
Any idea why this would cause a problem? I can't work it out!
I've written a stored proc to search for products in certain categories. Due to certain constraints I came across, I was unable to do what I wanted (limiting, but whilst still returning the total number of rows found, with sorting, etc.).
It's meant to split up a string of category IDs, e.g. 1,2,3, into a temporary table, then build the full-text search query based on sorting options and limits, execute the query string, and then select out the total number of results.
Now, I know I'm no MySQL guru, very far from it. I've got it working, but I keep getting timeouts with product searches etc., so I'm thinking this may be causing some kind of problem?
Does anyone have any ideas how I can tidy this up, or even do it in a much better way that I probably don't know about?
Thanks.
DELIMITER $$
DROP PROCEDURE IF EXISTS `product_search` $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `product_search`(keywords text, categories text, topLevelCategoryId int, sortOrder int, startOffset int, itemsToReturn int)
BEGIN
declare foundPos tinyint unsigned;
declare tmpTxt text;
declare delimLen tinyint unsigned;
declare element text;
declare resultingNum int unsigned;
drop temporary table if exists categoryIds;
create temporary table categoryIds
(
`CategoryId` int
) engine = memory;
set tmpTxt = categories;
set foundPos = instr(tmpTxt, ',');
while foundPos <> 0 do
set element = substring(tmpTxt, 1, foundPos-1);
set tmpTxt = substring(tmpTxt, foundPos+1);
set resultingNum = cast(trim(element) as unsigned);
insert into categoryIds (`CategoryId`) values (resultingNum);
set foundPos = instr(tmpTxt,',');
end while;
if tmpTxt <> '' then
insert into categoryIds (`CategoryId`) values (tmpTxt);
end if;
CASE
WHEN sortOrder = 0 THEN
SET @sortString = "ProductResult_Relevance DESC";
WHEN sortOrder = 1 THEN
SET @sortString = "ProductResult_Price ASC";
WHEN sortOrder = 2 THEN
SET @sortString = "ProductResult_Price DESC";
WHEN sortOrder = 3 THEN
SET @sortString = "ProductResult_StockStatus ASC";
END CASE;
SET @theSelect = CONCAT(CONCAT("
SELECT SQL_CALC_FOUND_ROWS
supplier.SupplierId as Supplier_SupplierId,
supplier.Name as Supplier_Name,
supplier.ImageName as Supplier_ImageName,
product_result.ProductId as ProductResult_ProductId,
product_result.SupplierId as ProductResult_SupplierId,
product_result.Name as ProductResult_Name,
product_result.Description as ProductResult_Description,
product_result.ThumbnailUrl as ProductResult_ThumbnailUrl,
product_result.Price as ProductResult_Price,
product_result.DeliveryPrice as ProductResult_DeliveryPrice,
product_result.StockStatus as ProductResult_StockStatus,
product_result.TrackUrl as ProductResult_TrackUrl,
product_result.LastUpdated as ProductResult_LastUpdated,
MATCH(product_result.Name) AGAINST(?) AS ProductResult_Relevance
FROM
product_latest_state product_result
JOIN
supplier ON product_result.SupplierId = supplier.SupplierId
JOIN
category_product ON product_result.ProductId = category_product.ProductId
WHERE
MATCH(product_result.Name) AGAINST (?)
AND
category_product.CategoryId IN (select CategoryId from categoryIds)
ORDER BY
", @sortString), "
LIMIT ?, ?;
");
set @keywords = keywords;
set @startOffset = startOffset;
set @itemsToReturn = itemsToReturn;
PREPARE TheSelect FROM @theSelect;
EXECUTE TheSelect USING @keywords, @keywords, @startOffset, @itemsToReturn;
SET @resultsFound = FOUND_ROWS();
SELECT @resultsFound as 'TotalResults';
IF @resultsFound > 0 THEN
INSERT INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
END IF;
END $$
DELIMITER ;
Any help is very very much appreciated!
There is little you can do with this query.
Try this:
Create a PRIMARY KEY on categoryIds (categoryId)
Make sure that supplier (supplier_id) is a PRIMARY KEY
Make sure that category_product (ProductID, CategoryID) (in this order) is a PRIMARY KEY, or you have an index with ProductID leading.
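In SQL, those suggestions might look like this (assuming the keys don't already exist):
-- in the procedure, declare the temp table with a primary key:
create temporary table categoryIds
(
`CategoryId` int,
primary key (`CategoryId`)
) engine = memory;

-- and on the permanent tables:
ALTER TABLE supplier ADD PRIMARY KEY (SupplierId);
ALTER TABLE category_product ADD PRIMARY KEY (ProductId, CategoryId);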
Update:
If it's the INSERT that causes the problem and product_search_query is a MyISAM table, the issue can be with MyISAM locking.
MyISAM locks the whole table if it decides to insert a row into a free block in the middle of the table, which can cause the timeouts.
Try using INSERT DELAYED instead:
IF @resultsFound > 0 THEN
INSERT DELAYED INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
END IF;
This will put the records into the insertion queue and return immediately. The record will be added later asynchronously.
Note that you may lose information if the server dies after the command is issued but before the records are actually inserted.
Update:
Since your table is InnoDB, it may be an issue with table locking. INSERT DELAYED is not supported on InnoDB.
Depending on the nature of the query, DML queries on an InnoDB table may place gap locks which will block the inserts.
For instance:
CREATE TABLE t_lock (id INT NOT NULL PRIMARY KEY, val INT NOT NULL) ENGINE=InnoDB;
INSERT
INTO t_lock
VALUES
(1, 1),
(2, 2);
This query performs ref scans and places the locks on individual records:
-- Session 1
START TRANSACTION;
UPDATE t_lock
SET val = 3
WHERE id IN (1, 2)
-- Session 2
START TRANSACTION;
INSERT
INTO t_lock
VALUES (3, 3)
-- Success
This query, while doing the same, performs a range scan and places a gap lock after key value 2, which will not let insert key value 3:
-- Session 1
START TRANSACTION;
UPDATE t_lock
SET val = 3
WHERE id BETWEEN 1 AND 2
-- Session 2
START TRANSACTION;
INSERT
INTO t_lock
VALUES (3, 3)
-- Locks
Try wrapping your EXECUTE with the following:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
EXECUTE TheSelect USING @keywords, @keywords, @startOffset, @itemsToReturn;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
I do something similar in T-SQL for all report stored procs and searches where repeatable reads aren't important, to reduce locking/blocking issues with other processes running on the database.
Turn on the slow query log; that will give you an idea of what is taking so long to execute that there is a timeout.
http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
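On MySQL 5.1+ the log can be enabled at runtime without a restart; the threshold below is just an example:
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                   -- log statements slower than 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';  -- also log queries that use no index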
Pick the slowest query and optimise that, then run for a while and repeat.
There is some excellent information and tools here http://hackmysql.com/nontech
DC
UPDATE:
Either you have a network problem causing the timeout (if you are using a local MySQL instance, that is unlikely), or something is locking a table for far too long, causing a timeout. The process that is locking the table or tables for too long will be listed in the slow log as a slow query. You can also get the slow query log to display any queries that fail to use an index, resulting in an inefficient query.
If you can get the problem to occur while you are present, you can also use a tool like phpMyAdmin or the command line to run SHOW PROCESSLIST\G; this will give you a list of the queries that are running while the problem is occurring.
You think the problem is in your INSERT statement, therefore something is locking that table; therefore you need to find what is locking that table; therefore you need to find what is running so slowly that it locks the table for far too long. The slow query log is one way to do that.
Other things to look at
CPU - is it idle or running at full pelt
IO - is I/O causing holdups
RAM - are you swapping all the time (will cause excessive I/O)
Does the table product_search_query use an index?
What is the primary key?
If your index uses strings that are too long, you may build a huge index file that causes very slow inserts (the slow query log will also show that).
And yes, the problem may be elsewhere, but you must start somewhere, mustn't you?
DC
