Rails ActiveRecord and PostgreSQL Partitioning

I've got a large web app which writes many millions of rows into partitioned tables in PostgreSQL each day (meaning there's a new table for each day's data).
We're using PostgreSQL's table inheritance and partitioning to speed things along.
Because there are years' worth of data in our DB, we can't effectively use insert triggers to route the content to the correct table (the trigger functions are getting very, very long).
Long story short, we need ActiveRecord to know which table to insert and update data on, BUT not change the table that is used for selects and other DB tasks.
Obviously it's simple to define the table name for a model, but is it possible to override the table name for just particular actions?
Here's a little more detail:
Database:
Table: dashboard.impressions (id, host, data, created_on, etc)
Table: data.impressions_20120801 (inherited from dashboard.impressions, with a constraint of created_on being equal to the table's date)
Impression.create :host => "localhost", :data => "{...}", :created_on => DateTime.now should write to the data.impressions_20120801 table, while Impression.where(:host => "localhost") should search the dashboard.impressions table, since that contains all the data.
Edit: I'm running PostgreSQL 9.1 and Rails 3.2.6

I don't do Rails so I can't help with the ActiveRecord side, but I can offer a pure Pg fallback solution in case you can't get ActiveRecord to do what you want. It'll cost you a little insert performance, so it'd be much better to teach ActiveRecord to do the inserts to the right place.
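A rough, untested sketch of what that could look like on the Rails side (the create_partitioned helper and the strftime-based partition naming are assumptions built from the question's scheme, not something this answer proposes):
class Impression < ActiveRecord::Base
  self.table_name = "dashboard.impressions"   # selects keep hitting the parent table

  # Hypothetical helper: write straight into the daily partition.
  def self.create_partitioned(attrs)
    created_on = attrs[:created_on] || Time.now
    partition  = "data.impressions_#{created_on.strftime('%Y%m%d')}"
    cols = attrs.keys.join(", ")
    vals = attrs.values.map { |v| connection.quote(v) }.join(", ")
    connection.execute("INSERT INTO #{partition} (#{cols}) VALUES (#{vals})")
  end
end
Impression.create_partitioned(:host => "localhost", :data => "{...}", :created_on => DateTime.now)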
Personally I'd just do the INSERTs directly via the pg gem and bypass ActiveRecord completely. If you can't do that, or ActiveRecord's caching means you shouldn't, try this alternate partitioning trigger implementation.
Instead of explicitly listing every partition in your trigger function, consider EXECUTE ... USING for the insertion and generate the partition name from your naming scheme. Something like this (untested):
CREATE OR REPLACE FUNCTION partition_trigger() RETURNS trigger AS $$
DECLARE
    target_partition text;
BEGIN
    IF TG_OP = 'INSERT' THEN
        target_partition := ( ... work out the partition name ... );
        EXECUTE 'INSERT INTO '||quote_ident(target_partition)||' (col1,col2) VALUES ($1, $2)'
            USING NEW.col1, NEW.col2;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
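The function then has to be attached to the parent table so it fires for every insert; a minimal sketch, wrapped in a Rails migration for convenience (the migration and trigger names are assumptions):
class AddImpressionsPartitionTrigger < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE TRIGGER impressions_partition_trg
        BEFORE INSERT ON dashboard.impressions
        FOR EACH ROW EXECUTE PROCEDURE partition_trigger();
    SQL
  end

  def down
    execute "DROP TRIGGER impressions_partition_trg ON dashboard.impressions"
  end
end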

Related

How should you backfill a new table in Rails?

I'm creating a new table that needs to be backfilled with data based on User accounts (over a couple dozen thousand), using the following one-time rake task.
What I've decided to do is create a big INSERT string for every 2000 users and execute that query.
Here's what the code roughly looks like:
task :backfill_my_new_table => :environment do
  inserts = []
  User.find_each do |user|
    tuple = # form the tuple based on user and user associations like (1, 'foo', 'bar', NULL)
    inserts << tuple
  end
  # At this point, the inserts array is of size at least 20,000
  conn = ActiveRecord::Base.connection
  inserts.each_slice(2000) do |slice|
    sql = "INSERT INTO my_new_table (ref_id, column_a, column_b, column_c) VALUES #{slice.join(", ")}"
    conn.execute(sql)
  end
end
So I'm wondering, is there a better way to do this? What are some drawbacks of the approach I took? How should I improve it? What if I didn't slice the inserts array and simply executed a single INSERT with over a couple dozen thousand VALUES tuples? What are the drawbacks of that method?
Thanks!
It depends on which PG version you are using, but in most cases this checklist is enough for bulk loading data into a table:
try to use COPY instead of INSERT whenever possible;
if using multiple INSERTs, disable autocommit and wrap all the INSERTs in a single transaction, i.e. BEGIN; INSERT ...; INSERT ...; COMMIT;
disable indexes and checks/constraints on the target table;
disable table triggers;
alter the table so it becomes unlogged (since PG 9.5; don't forget to turn logging back on after the data import), or increase max_wal_size so the WAL won't be flooded.
20k rows is not such a big deal for PG, so 2k-sliced inserts within one transaction will be just fine, unless there are some very complex triggers/checks involved. It is also worth reading the PG manual section on bulk loading.
UPD: and a little bit old, yet wonderful, piece from depesz; an excerpt:
so, if you want to insert data as fast as possible – use copy (or better yet – pgbulkload). if for whatever reason you can't use copy, then use multi-row inserts (new in 8.2!). then if you can, bundle them in transactions, and use prepared transactions, but generally – they don't give you much.
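If COPY is workable from the Rails side, a minimal sketch using the pg gem's copy_data (available in reasonably recent versions of the gem; the per-row CSV formatting is deliberately simplified):
raw = ActiveRecord::Base.connection.raw_connection   # the underlying PG::Connection
raw.copy_data("COPY my_new_table (ref_id, column_a, column_b, column_c) FROM STDIN WITH (FORMAT csv)") do
  User.find_each do |user|
    # one CSV line per user; the values here are placeholders
    raw.put_copy_data("#{user.id},foo,bar,baz\n")
  end
end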

Are there any extensions to TADOQuery that include client indexes

Quick question (hopefully)
I have a large dataset (>100,000 records) that I would like to use as a lookup to determine existence or non-existence of multiple keys. The purpose of this is to find FK violations before trying to commit the records to the database, to avoid the resultant EDatabaseError messing up my transaction.
I had been using TClientDataSet/TDatasetProvider with the FindKey method, as this allowed a client-side index to be set up and was faster (2s to scan each key rather than 10s for ADO). However, with larger datasets, populating the CDS is starting to take far more time than the local index saves.
I see that I have a few options for alternatives:
client cursor with TADOQuery.locate method
ADO SELECT statements for each check (no client cache)
ADO SEEK method
Extend TADOQuery to mimic FindKey
The Locate method seems easiest and doesn't spam the server with the SELECT/SEEK methods. I like the idea of extending the TADOQuery, but was wondering whether anyone knew of any ready-made solutions for this rather than having to create my own?
I would create a temporary table in the database server and insert all 100,000 records into it. Do bulk inserts of, say, 3000 records at a time to minimise round trips to the server. Then run SELECT statements on this temp table to check for foreign key violations etc. If all is okay, do an INSERT ... SELECT from the temp table into the main table.

Is it faster to constantly assign a value or compare

I am scanning an SQLite database looking for all matches and using
OneFound := False;
if tbl1.FieldByName('Name').AsString = 'jones' then
begin
  OneFound := True;
  tbl1.Next;
end;
if OneFound then // Do something
or should I be using
if not(OneFound) then OneFound:=True;
Is it faster to just assign "True" to OneFound no matter how many times it is assigned, or should I do the comparison and only change OneFound the first time?
I know a better way would be to use FTS3, but for now I have to scan the database and the question is more on the approach to setting OneFound as many times as a match is encountered or using the compare-approach and setting it just once.
Thanks
Your question is, which is faster:
if not(OneFound) then OneFound:=True;
or
OneFound := True;
The answer is probably that the second is faster. Conditional statements involve branches, which risk branch mis-prediction.
However, that line of code is trivial compared to what is around it. Running across a database one row at a time is going to be outrageously expensive. I bet that you will not be able to measure the difference between the two options because the handling of that little Boolean is simply swamped by the rest of the code. In which case choose the more readable and simpler version.
But if you care about the performance of this code you should be asking the database to do the work, as you yourself state. Write a query to perform the work.
It would be better to change your SQL statement so that the work is done in the database. If you want to know whether there is a tuple which contains the value 'jones' in the field 'name', then a quicker query would be
with TQuery.Create(nil) do
begin
  SQL.Add('select name from tbl1 where name = :p1 limit 1');
  ParamByName('p1').AsString := 'jones';
  Open;
  OneFound := not IsEmpty;
  Close;
  Free;
end;
Your syntax may vary regarding the 'limit' clause, but the idea is to return only one tuple from the database which matches the 'where' clause - it doesn't matter which one.
I used a parameter to avoid problems delimiting the value.
1. Search one field
If you want to search one particular field content, using an INDEX and a SELECT will be the fastest.
SELECT * FROM MYTABLE WHERE NAME='Jones';
Do not forget to create an INDEX on the column, first!
2. Fast reading
But if you want to search within a field, or within several fields, you may have to read and check the whole content. In this case, what will be slow is calling FieldByName() for each data row: you should use a local TField variable instead.
Or forget about TDataSet and switch to direct access to SQLite3. In fact, using DB.pas and TDataSet requires a lot of data marshalling, so it is slower than direct access.
See e.g. DiSQLite3 or our DB classes, which are very fast, but a bit higher level. Or you can use our ORM on top of those classes. Our classes are able to read more than 500,000 rows per second from an SQLite3 database, including JSON marshalling into object fields.
3. FTS3/FTS4
But, as you guessed, the fastest would be indeed to use the FTS3/FTS4 feature of SQlite3.
You can think of FTS3/FTS4 as a "meta-index" or a "full-text index" on a supplied blob of text. Just as Google is able to find a word in millions of web pages: it does not use a regular database, but full-text indexing.
In short, you create a virtual FTS3/FTS4 table in your database, then you insert the whole text of your main records into the FTS TEXT field of this table, forcing the ID field to be the one of the original data row.
Then you query for some words on your FTS3/FTS4 table, which will give you the matching IDs, much faster than a regular scan.
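Sketched in SQL terms, using the sqlite3 Ruby gem purely for illustration (the FTS table and column names are made up around the question's tbl1/Name example):
require "sqlite3"

db = SQLite3::Database.new("data.db")
# one-off: create the FTS4 virtual table and copy the text over, keeping the original ids
db.execute("CREATE VIRTUAL TABLE docs_fts USING fts4(content)")
db.execute("INSERT INTO docs_fts (rowid, content) SELECT id, name FROM tbl1")
# query: MATCH returns the matching rowids far faster than a full table scan
matching_ids = db.execute("SELECT rowid FROM docs_fts WHERE content MATCH ?", ["jones"]).flatten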
Note that our ORM has dedicated TSQLRecordFTS3 / TSQLRecordFTS4 kind of classes for direct FTS process.

Rails 3 Migration: Autoincrement on (Non-Primary-Key) Column?

I'm looking for a way to create a column that autoincrements the way the automatic :id column does. I could probably handle this somehow in the model, but that seems kludgey. I haven't found anything in stock Rails 3 that handles this; are there gems available that might handle this? I'm surprised it's not already an option, since Rails handles this behavior for primary key columns.
Normally auto-incrementing columns are implemented using database sequences. The advantage of using a sequence over calculating the next increment is that getting the next value from a sequence is atomic. So if you have multiple processes creating new elements, the sequence will make sure your numbers are really unique.
Sequences can be used in PostgreSQL, Oracle and others (MySQL emulates them with AUTO_INCREMENT).
Here is how to implement this if you are using Postgres, for instance:
select the next value from the sequence:
Integer(Operator.connection.select_value("SELECT nextval('#{sequence_name}')"))
create a sequence:
Operator.connection.execute("CREATE sequence #{sequence_name}")
set the start-value of a sequence :
Operator.connection.execute("SELECT setval('#{sequence_name}', #{new_start_serial})")
Hope this helps.
If you really think you need this, you could create a before_create filter in the model to check the last record's attribute value and add 1 to it. Feels hacky though.
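For completeness, a sketch of that filter (the position column is made up); note it is racy under concurrent inserts, which is exactly what a sequence avoids:
class Item < ActiveRecord::Base
  before_create :assign_position

  private

  # Last value + 1; not safe if two processes create records at once.
  def assign_position
    self.position = (self.class.maximum(:position) || 0) + 1
  end
end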

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
This holds regardless of whether you use sequences, as in PostgreSQL and Oracle, or another mechanism such as MySQL's auto-increment.
However, sequence values are often acquired in blocks of, for example, 20 numbers.
So with PostgreSQL you cannot always determine which row was inserted first, and there might even be gaps in the ids of inserted records.
Therefore you shouldn't use a generated id field for a task like that, so as not to rely on database implementation details.
Generating a created or updated field when the command executes is much better for sorting by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES ('something', NOW())
UPDATE A SET data = 'something', updated = NOW()
That depends on your database vendor.
MySQL, I believe, absolutely orders auto-increment keys. SQL Server I don't know for sure, but I believe it does as well.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle that uses sequences, which are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
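In ActiveRecord terms that might look like the following (a sketch; the Event model is hypothetical and a created_at column is assumed):
# creation time first, id as the tie-breaker for rows sharing a timestamp
Event.order("created_at ASC, id ASC")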
I believe the answer to your question is yes... if I read between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' from the sequence. If you had used 1, 2, 3, 5, 6, 7 as ID numbers, then in all the implementations I know of the next ID number will always be 8 (or possibly higher); I don't know of any DB that would try to figure out that record ID #4 is missing and attempt to re-use it.
Though I am most familiar with SQL Server, I don't know of any vendor that would try to fill the gaps in a sequence - think of the overhead of keeping a list of unused IDs, as opposed to just keeping track of the last ID used and adding 1.
I'd say you can safely rely on the next assigned ID always being higher than the last - not just unique.
Yes, the id will be unique, and no, you cannot and should not rely on it for sorting - it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
  id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated TIMESTAMP NULL,
  PRIMARY KEY(id)
) engine=InnoDB; ## whatever :P
Now, that takes care of creation time. For update time I would suggest a BEFORE UPDATE trigger like this one (MySQL does not allow modifying NEW in an AFTER trigger; of course you can do it in a separate query, but the trigger, in my opinion, is a better solution - more transparent):
DELIMITER $$
CREATE TRIGGER foo_b_upd BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  SET NEW.updated = NOW();
END;
$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly I did not specify that this is for MySQL; there might be some differences in the function names (namely NOW) and other subtle details.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an ORDER BY column. E.g. if you delete some early rows and then insert new ones, the new rows may end up living in the same place in the db as the old ones did (albeit with new IDs), and that is what the engine may use as its default sort.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index to a datetime field (which is bigger and therefore slower than a simple integer primary key index), guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in some slightly different order.
This is probably DB engine depended. I would check how your DB implements sequences and if there are no documented problems then I would decide to rely on ID.
E.g. Postgresql sequence is OK unless you play with the sequence cache parameters.
There is a possibility that another programmer will manually create or copy records from a different DB with a wrong ID column. However, I would simplify the problem: do not bother with low-probability cases where someone manually destroys data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory, yes, the highest id number is the last created. Remember though that databases do have the ability to temporarily turn off the insert of the autogenerated value, insert some records manually, and then turn it back on. Such inserts are not typically used on a production system but can happen occasionally when moving a large chunk of data from another system.
