Using an SQL Agent Job to call procedures in a loop

I am putting together a job in SQL Enterprise Manager 2000 to copy and delete records in a couple of database tables. We've run a straight-up mass copy and delete stored procedure, but it could be running on millions of rows, and it therefore hangs the server. I am interested in running the work in chunks of roughly 100 records at a time, so the server doesn't grind to a halt (this is a live web database). I want this to run once a night, which is why I've put it in an agent job. Is there any way to loop the calls to the stored procedures that actually do the copy and delete, and then "sleep" between each call to give the server time to catch up? I know there is the WAITFOR command, but I'm unsure whether it will hold the processor or let other queries run in the meantime.
Thanks!

"Chunkifying" your deletes is the preferred way to delete excessive amounts of data without bloating up transaction log files. BradC's post is a reasonable example of this.
Managing such loops is best done within a single stored procedure. To spread the work out over time, I'd still keep it in the procedure. Inserting a WAITFOR in the loop will put a "pause" between each set of deletes, if you deem that necessary to deal with possible concurrency issues. Use a SQL Agent job to determine when the procedure starts, and if you need to make sure it stops by a certain time, work that into the loop as well.
My spin on this code would be:
-- NOTE: This is a code sample, I have not tested it
CREATE PROCEDURE ArchiveData
    @StopBy datetime
    -- Pass in a cutoff time. If it runs this long, the procedure will stop.
AS

DECLARE @LastBatch int
SET @LastBatch = 1
-- Initialized to make sure the loop runs at least once

WHILE @LastBatch > 0
BEGIN
    WAITFOR DELAY '00:00:02'
    -- Set this to your desired delay factor

    DELETE TOP (1000) FROM SourceTable   -- Or however many per pass are desired
    -- (On SQL 2000, which has no DELETE TOP, use SET ROWCOUNT 1000 before the DELETE instead)
    -- Be sure to add a WHERE clause if you don't want to delete everything!

    SET @LastBatch = @@ROWCOUNT

    IF GETDATE() > @StopBy
        SET @LastBatch = 0
END

RETURN 0
Hmm. Rereading your post implies that you want to copy the data somewhere first before deleting it. To do that, I'd set up a temp table, and inside the loop first truncate the temp table, then copy in the primary keys of the TOP N items, insert into the "archive" table via a join to the temp table, then delete from the source table, also via a join to the temp table. (Just a bit more complex than a straight delete, isn't it?)
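In rough, untested T-SQL, that loop would look something like this (SourceTable, ArchiveTable, ID, and CreatedDate are placeholder names for your own schema):

-- Untested sketch of the copy-then-delete loop described above
DECLARE @LastBatch int
SET @LastBatch = 1

CREATE TABLE #BatchKeys (ID int PRIMARY KEY)

WHILE @LastBatch > 0
BEGIN
    TRUNCATE TABLE #BatchKeys

    -- 1. Grab the keys for this pass
    INSERT INTO #BatchKeys (ID)
    SELECT TOP 1000 ID
    FROM SourceTable
    WHERE CreatedDate < '2009-01-01'   -- whatever your archiving criteria are

    SET @LastBatch = @@ROWCOUNT

    BEGIN TRAN

    -- 2. Copy those rows into the archive table (assumes matching column layouts)
    INSERT INTO ArchiveTable
    SELECT s.*
    FROM SourceTable s
    JOIN #BatchKeys k ON k.ID = s.ID

    -- 3. Delete them from the source table
    DELETE s
    FROM SourceTable s
    JOIN #BatchKeys k ON k.ID = s.ID

    COMMIT

    WAITFOR DELAY '00:00:02'           -- optional pause between batches
END

DROP TABLE #BatchKeys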

Don't worry about waiting between loops; SQL Server should handle the contention between your maintenance job and the regular activity on the server.
What really causes the problem in these types of situations is that the entire delete process happens all at once, inside a single transaction. This blows up the log for the database, and can cause the kinds of problems it sounds like you are experiencing.
Use a loop like this to delete in manageable chunks:
DECLARE @i INT
SET @i = 1
SET ROWCOUNT 10000   -- Limits each DELETE to a batch of 10,000 rows (on SQL 2005+ you could use DELETE TOP (10000) instead)
WHILE @i > 0
BEGIN
    BEGIN TRAN

    DELETE FROM dbo.SuperBigTable
    WHERE RowDate < '2009-01-01'

    SELECT @i = @@ROWCOUNT   -- Capture this before COMMIT, since COMMIT resets @@ROWCOUNT

    COMMIT
END
SET ROWCOUNT 0
You can use similar logic for your copy.

WAITFOR will let other processes 'have a go'. I've used this technique to stop large DELETEs locking up the machine. Create a WHILE loop, delete a block of rows, and then WAITFOR a few seconds (or less, whatever is appropriate).

Is it possible in Ruby to set a specific Active Record call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products have the flag set to false, I can make a call something like this:
Product.where(:exported => false).count
The problem is that even the count takes a long time, because the table of 1 million products is constantly being written to. More specifically, exports are happening, and the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
Oh, and one last thing: this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternatively, use the system statistics to get a fast approximation.
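For the statistics route, something like this (assuming the Product model maps to a products table) returns a rough figure almost instantly:
-- Approximate row count from the planner's statistics, kept fresh by VACUUM/ANALYZE.
-- Note: this estimates the whole table, not just the rows with exported = false.
SELECT reltuples::bigint AS approx_row_count
FROM pg_class
WHERE relname = 'products';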
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows could sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
As soon as a query begins to execute, it runs against a frozen, read-only snapshot, because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
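If the model maps to a products table, the index might look like this (table and column names are assumed from the model; the partial form keeps the index small, since only the unexported rows are ever counted here):
-- A plain index on the column also works; the partial form only covers exported = false rows.
CREATE INDEX index_products_on_not_exported
    ON products (exported)
    WHERE exported = false;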

Stored Procedure Order of Operations

I have a stored procedure which starts by creating a temp table, then it populates the temp table based on a query.
It then does a Merge statement, basically updating a physical table off of the temp table. Lastly it does an Update on the physical table. Pretty basic stuff.
It takes ~8 sec to run. My question is, at what point does it lock the physical table? Since the whole query is compiled prior to its run, is the physical table locked during the entire execution of the stored procedure, or does it wait until it gets to the statements that actually work with the physical table?
I'm not necessarily trying to resolve an issue, as much as make sure I don't cause one. We have other processes that need to be reworked to alleviate blocking, I don't want to create another.
OK, for SQL Server:
a stored procedure is "compiled" (i.e. an execution plan is determined) before its first use (and that execution plan is cached in the plan cache until it's either tossed out due to space constraints, or due to a restart). At the time an execution plan is determined, nothing happens at the table level - no locks, nothing
SQL Server will by default use row-level locks, i.e. a row is locked when it's read, inserted, updated or deleted - then and only then. So your procedure will place shared locks on the tables it selects the data from to insert those rows into the temp table, it will put exclusive locks on the rows being inserted into the temp table (for the duration of the operation / transaction), and the MERGE will also place locks as and when needed
Most locks (not the shared locks, typically) in SQL Server are by default held until the end of the transaction. Unless you explicitly handle transactions yourself, each operation (MERGE, UPDATE, INSERT) runs inside its own implicit transaction, so any locks held will be released at the end of that transaction (i.e. that statement).
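To make the timing concrete, here is a rough skeleton of the kind of procedure described in the question, with comments marking where locks actually come into play (the object and column names are placeholders, not your real ones):

CREATE PROCEDURE dbo.MergeFromStaging
AS
BEGIN
    -- Compiling and caching the plan took no locks on PhysicalTable
    CREATE TABLE #work (ID int PRIMARY KEY, Amount money)

    INSERT INTO #work (ID, Amount)
    SELECT ID, Amount
    FROM dbo.SomeSourceTable         -- shared locks on SomeSourceTable, only while this statement reads

    MERGE dbo.PhysicalTable AS t     -- locks on PhysicalTable start here...
    USING #work AS s ON s.ID = t.ID
    WHEN MATCHED THEN
        UPDATE SET t.Amount = s.Amount
    WHEN NOT MATCHED THEN
        INSERT (ID, Amount) VALUES (s.ID, s.Amount);
    -- ...and are released when the MERGE's own implicit transaction ends

    UPDATE dbo.PhysicalTable         -- a separate implicit transaction; locks held only for this statement
    SET Amount = 0
    WHERE Amount IS NULL
END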
There are lots more aspects to locks - one could write entire books to cover all the details - but does this help you understand locking in SQL Server at least a little bit?

How can I ensure thread safety when using a sequence in the database?

I have a sequence used in a stored procedure that updates multiple tables, like the one below:
-- "theseq" is a placeholder name for the sequence
create procedure update_both_tables()

    -- insert into table_A using a newly generated sequence number
    insert into table_A(theID) values (theseq.nextval);

    -- insert into table_B reusing the same number; currval returns the value
    -- most recently generated by nextval in this session
    insert into table_B(theID) values (theseq.currval);

end procedure;
May I know whether the code above is a thread-safe implementation? For each execution of the procedure, can I guarantee that theID in table_A and table_B always receives the same sequence number when there is more than one execution at a time?
One of Informix's primary objectives is to make sure that the procedure works exactly as you need it to, regardless of how many thousands of users are also running the same procedure at the same time. Indeed, you can have multiple concurrent sessions of your own, each running the procedure, and each session is isolated from all the other sessions.
So, the code shown is 'thread safe'.

Multiple worker threads working on the same database - how to make it work properly?

I have a database that has a list of rows that need to be operated on. It looks something like this:
id   remaining   delivered   locked
====================================
1    10          24          f
2    6           0           f
3    0           14          f
I am using DataMapper with Ruby, but really I think this is a general programming question that isn't specific to the exact implementation I'm using...
I am creating a bunch of worker threads that do something like this (pseudo-ruby-code):
while true do
  t = any_row_in_database_where_remaining_greater_than_zero_and_unlocked
  t.lock # update database to set locked = true
  t.do_some_stuff
  t.delivered += 1
  t.remaining -= 1
  t.unlock
end
Of course, the problem is that these threads compete with each other and the whole thing isn't really thread safe. The first line in the while loop can easily pull out the same row in multiple threads before any of them gets a chance to lock it.
I need to make sure one thread is only working on one row at the same time.
What is the best way to do this?
The key step is when you select an unlocked row from the database and mark it as locked. If you can do that safely then everything else will be fine.
Two ways I know of to make this safe are pessimistic and optimistic locking. They both rely on your database as the ultimate guarantor when it comes to concurrency.
Pessimistic Locking
Pessimistic locking means acquiring a lock upfront when you select the rows you want to work with, so that no one else can read them.
Something like
SELECT * from some_table WHERE ... FOR UPDATE
works with MySQL and Postgres (and possibly others) and will prevent any other connection to the database from reading the rows returned to you (how granular that lock is depends on the engine used, indexes, etc. - check your database's documentation). It's called pessimistic because you are assuming that a concurrency problem will occur and acquire the lock preventatively. It does mean that you bear the cost of locking even when it isn't necessary, and it may reduce your concurrency depending on the granularity of the lock you have.
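Applied to the table in the question, the claim step might look roughly like this (assuming the table is called jobs; the exact SQL your ORM emits will differ):

BEGIN;

-- Pick one available row and lock it so no other transaction can grab it
SELECT id
FROM jobs
WHERE remaining > 0 AND locked = 'f'
LIMIT 1
FOR UPDATE;

-- Mark it as taken, using the id returned above, then go do the work
UPDATE jobs SET locked = 't' WHERE id = :claimed_id;

COMMIT;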
Optimistic Locking
Optimistic locking refers to a technique where you don't want the burden of a pessimistic lock because most of the time there won't be concurrent updates (if you update the row, setting the locked flag to true, as soon as you have read it, the window is relatively small). AFAIK this only works when updating one row at a time.
First add an integer column lock_version to the table. Whenever you update the table, increment lock_version by 1 alongside the other updates you are making. Assume the current lock_version is 3. When you update, change the update query to
update some_table set ... where id=12345 and lock_version = 3
and check the number of rows updated (the DB driver returns this). If this updates 1 row then you know everything was OK. If this updates 0 rows then either the row you wanted was deleted or its lock version has changed, so you go back to step 1 in your process and search for a new row to work on.
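For the question's table, the claiming update might look something like this (jobs and lock_version are assumed names):

-- 1 row affected means we won the race; 0 rows means another worker changed
-- (or deleted) the row since we read lock_version = 3, so pick a different row.
UPDATE jobs
SET locked = 't',
    lock_version = lock_version + 1
WHERE id = 12345
  AND lock_version = 3;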
I'm not a DataMapper user, so I don't know whether it (or plugins for it) supports these approaches. Active Record supports both, so you can look there for inspiration if DataMapper doesn't.
I would use a Mutex:
# outside your threads
worker_updater = Mutex.new

# inside each thread's updater
while true
  worker_updater.synchronize do
    # your code here
  end
  sleep 0.1 # Slow down there, mister!
end
This guarantees that only one thread at a time can enter the code in the synchronize. For optimal performance, consider what portion of your code needs to be thread-safe (first two lines?) and only wrap that portion in the Mutex.

Can TCustomClientDataset apply updates in a batch mode?

I've got a DB Express TSimpleDataset connected to a Firebird database. I've just added several thousand rows of data to the dataset, and now it's time to call ApplyUpdates.
Unfortunately, this results in several thousand database hits as it tries to INSERT each row individually. That's a bit disappointing. What I'd really like to see is the dataset generate a single transaction with a few thousand INSERT statements in it and send the whole thing at once. I could set that up myself if I had to, but first I'd like to know if there's any method for it built in to the dataset or the DBX framework.
Don't know if it is possible with a TSimpleDataset (never used it), but surely you can if you use a TClientDataset + TDatasetProvider + <put your db dataset here>. You can write a BeforeUpdateRecord to handle the apply process yourself. Basically, it allows you to bypass the standard apply process, access the dataset delta with changes made to records, and then use your own code and components to apply changes to the database. For example you could call stored procedures to modify data, and so on.
However, there is a difference between a transaction and what is called "array DML", "bulk insert" or the like. Even if you use a single transaction (and an "apply", AFAIK, happens in a single transaction), within the transaction you may still need to send "n" INSERTs. Some databases support a way of sending a single INSERT (or UPDATE, DELETE) with an array of parameters to be inserted, reducing the number of individual statements to be used - but that may be very database specific, and AFAIK dbExpress/DataSnap do not support it - you could still use the BeforeUpdateRecord event to take advantage of specific database capabilities.
