I'm looking for a good library for processing tasks (or 'operations', as we call them in our domain model) for Java or .NET. We save each operation to be performed in the db, and then we need some mechanism that fetches unprocessed tasks from the db, processes them, and updates the db record with the proper status ('processed OK' / 'process error').
The trick is that operations can depend on one another. For example, while 'Operation Payment' is being processed, the system might discover that 'Operation Check Payment Data' must be performed first. It should then create a new operation row in the db, pause 'Operation Payment', process 'Operation Check Payment Data' in the next turn, and, once that completes, go back to processing 'Operation Payment'.
I'll show you how we manage this at the moment.
We've got a db table 'operations'. A cron-like mechanism runs each minute, fetches the first 100 unprocessed operations from the db and processes them. If, while processing, the system finds that some other operation (B) is needed to perform the current operation (A), then a new record for operation B is created and the processing of operation A is halted. The next minute, cron fetches operations A and B. Operation A is fetched because it is not yet processed, but the system sees that the dependent operation B has already been created, so it does not create it again. Operation B is processed and the status 'processed OK' is saved in the proper row in the db. The next minute, cron fetches operation A from the db and can finally perform it, because the dependent task is completed.
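To make the flow above concrete, here is a minimal sketch of one way to model it, written in Python for brevity. Everything in it is an assumption made for illustration: the in-memory `OperationStore` stands in for the 'operations' table, the `depends_on` field is a hypothetical column linking A to B, and `run_tick` plays the role of one cron run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

PENDING, DONE, FAILED = "unprocessed", "processed OK", "process error"


@dataclass
class Operation:
    op_id: int
    kind: str
    status: str = PENDING
    depends_on: Optional[int] = None  # hypothetical column: id of the prerequisite


class OperationStore:
    """In-memory stand-in for the 'operations' table."""

    def __init__(self) -> None:
        self.rows: Dict[int, Operation] = {}
        self._next_id = 1

    def add(self, kind: str) -> Operation:
        op = Operation(self._next_id, kind)
        self.rows[op.op_id] = op
        self._next_id += 1
        return op

    def fetch_unprocessed(self, limit: int = 100) -> List[Operation]:
        return [op for op in self.rows.values() if op.status == PENDING][:limit]


# A handler returns None when the operation is finished, or the kind of a
# prerequisite operation that has to be created and completed first.
Handler = Callable[[Operation, Optional[Operation]], Optional[str]]


def run_tick(store: OperationStore, handlers: Dict[str, Handler]) -> None:
    """One cron run: process every fetched operation that is not waiting."""
    for op in store.fetch_unprocessed():
        prereq = store.rows.get(op.depends_on) if op.depends_on is not None else None
        if prereq is not None and prereq.status == PENDING:
            continue  # B already exists but is not done yet: leave A paused
        if prereq is not None and prereq.status == FAILED:
            op.status = FAILED
            continue
        try:
            needed = handlers[op.kind](op, prereq)
        except Exception:
            op.status = FAILED
            continue
        if needed is None:
            op.status = DONE
        else:
            op.depends_on = store.add(needed).op_id  # create B, pause A


def handle_payment(op: Operation, prereq: Optional[Operation]) -> Optional[str]:
    # Ask for the check on the first attempt; finish once it has completed.
    return "check_payment_data" if prereq is None else None


def handle_check(op: Operation, prereq: Optional[Operation]) -> Optional[str]:
    return None
```

With these handlers a payment takes three ticks, just like the three cron minutes described above: the first creates the check, the second runs it, the third completes the payment.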
We are looking for ways to make it simpler, better and more elegant.
There is a list of open-source Java workflow engines.
I have a stream created in snowflake and a task which moves stream data into another table. I want the task to execute automatically every time there is new data in stream. How to automatically trigger the task when there is new data in stream?
Tasks can only be triggered on a schedule, but you can have them run as often as every minute.
If a run takes more than 1 minute, then the next run is delayed until 1 minute after the previous run finishes.
While it is not "on demand", a task running every minute should suffice for most situations. Of course it means a warehouse constantly running, but the same would be true for an on-demand service, assuming new data arrives all the time. If data arrives irregularly, you can add a SYSTEM$STREAM_HAS_DATA check so the task does not run if there is no data: https://docs.snowflake.com/en/sql-reference/functions/system_stream_has_data.html
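As a rough sketch of what such a task definition could look like, the helper below only assembles the SQL text; it does not talk to Snowflake. The helper names and the object names in the usage note (ORDERS_TASK, ETL_WH, ORDERS_STREAM, ORDERS_COPY) are made up for illustration. Python is used here purely to build the statement.

```python
def build_stream_task_ddl(task, warehouse, stream, body_sql, every_minutes=1):
    """Return a CREATE TASK statement that runs on a schedule but is
    skipped whenever the stream has no new data."""
    return (
        f"CREATE OR REPLACE TASK {task}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = '{every_minutes} MINUTE'\n"
        f"  WHEN SYSTEM$STREAM_HAS_DATA('{stream}')\n"
        f"AS\n"
        f"  {body_sql}"
    )


def resume_task_sql(task):
    """Tasks are created suspended, so they have to be resumed once."""
    return f"ALTER TASK {task} RESUME"
```

For example, `build_stream_task_ddl("ORDERS_TASK", "ETL_WH", "ORDERS_STREAM", "INSERT INTO ORDERS_COPY (ID, AMOUNT) SELECT ID, AMOUNT FROM ORDERS_STREAM")` yields a statement you could paste into a worksheet, followed by the output of `resume_task_sql("ORDERS_TASK")`.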
What could trigger a deadlock-message on Firebird when there is only a single transaction writing to the DB?
I am building a webapp with a backend written in Delphi 2010 on top of a Firebird 2.1 database. I am getting a concurrent-update error that I cannot make sense of. Maybe someone can help me debug the issue or explain scenarios that may lead to the message.
I am trying an UPDATE to a single field on a single record.
UPDATE USERS SET passwdhash=? WHERE (RECID=?)
The message I am seeing is the standard:
deadlock
update conflicts with concurrent update
concurrent transaction number is 659718
deadlock
Error Code: 16
I understand what it tells me but I do not understand why I am seeing it here as there are no concurrent updates I know of.
Here is what I did to investigate.
I started my application server and checked the result of this query:
SELECT
A.MON$ATTACHMENT_ID,
A.MON$USER,
A.MON$REMOTE_ADDRESS,
A.MON$REMOTE_PROCESS,
T.MON$STATE,
T.MON$TIMESTAMP,
T.MON$TOP_TRANSACTION,
T.MON$OLDEST_TRANSACTION,
T.MON$OLDEST_ACTIVE,
T.MON$ISOLATION_MODE
FROM MON$ATTACHMENTS A
LEFT OUTER JOIN MON$TRANSACTIONS T
ON (T.MON$ATTACHMENT_ID = A.MON$ATTACHMENT_ID)
The result indicates a number of connections, but only one of them has non-NULLs in the MON$TRANSACTIONS fields. This connection is the one I am using from IBExpert to query the monitoring tables.
Am I right to think that connections with no active transaction can be disregarded as not contributing to a deadlock situation?
Next I put a breakpoint on the line submitting the UPDATE-Statement in my application server and executed the request that triggers it. When the breakpoint stopped the application I then reran the Monitor-query above.
This time I could see another transaction active just as I would expect:
Then I let my appserver execute the UPDATE and got the error message shown above.
What can trigger the deadlock-message when there is only one writing transaction? Or are there more and I am misinterpreting the output? Any other suggestions on how to debug this?
Firebird uses MVCC (Multiversion Concurrency Control) for its transaction model. One of its features is that, depending on the transaction isolation level, you only see the record versions that were committed when your transaction started (consistency and concurrency isolation levels) or when your statement started (read committed). A change to a record creates a new version of the record, which only becomes visible to other active transactions once it has been committed (and then only to read committed transactions).
As a basic rule there can only be one uncommitted version of a record. So attempts by two transactions to update the same record will fail for one of those transactions. For historical reasons this type of error is grouped under the deadlock error family, even though it is not actually a deadlock in the normal concurrency vernacular.
This rule is actually a bit more restrictive depending on your transaction isolation: for the consistency and concurrency levels there can also be no newer committed versions of a record that are not visible to your transaction.
My guess is that for you something like this happened:
1. Transaction 1 started
2. Transaction 2 started with concurrency or consistency isolation
3. Transaction 1 modifies record (new version created)
4. Transaction 1 commits
5. Transaction 2 attempts to modify same record
(Note, step 1+3 and 2 could be in a different order (eg 1,3,2, or 2,1,3))
Step 5 fails, because the new version created in step 3 is not visible to transaction 2. If instead read committed had been used then step 5 would succeed as the new version would be visible to the transaction at that point.
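The sequence can be replayed with a toy model. This is only an illustration of the two rules described above, not how Firebird is implemented: the `Database` and `Transaction` classes and the record key are invented, and a conflict with an uncommitted version fails immediately here, whereas a real transaction in wait mode would first wait for the other one to finish.

```python
class UpdateConflict(Exception):
    """Stands in for 'update conflicts with concurrent update'."""


class Database:
    def __init__(self):
        self.commit_counter = 0
        self.committed_at = {}    # record key -> commit number of newest version
        self.uncommitted_by = {}  # record key -> transaction holding a new version


class Transaction:
    def __init__(self, db, isolation):
        self.db = db
        self.isolation = isolation         # "concurrency", "consistency" or "read_committed"
        self.snapshot = db.commit_counter  # what was committed when we started
        self.written = set()

    def update(self, key):
        # Rule 1: only one uncommitted version of a record may exist.
        owner = self.db.uncommitted_by.get(key)
        if owner is not None and owner is not self:
            raise UpdateConflict("uncommitted version held by another transaction")
        # Rule 2: snapshot-style isolation may not overwrite a committed
        # version that is newer than what this transaction can see.
        newest = self.db.committed_at.get(key, 0)
        if self.isolation in ("concurrency", "consistency") and newest > self.snapshot:
            raise UpdateConflict("newer committed version is not visible")
        self.db.uncommitted_by[key] = self
        self.written.add(key)

    def commit(self):
        self.db.commit_counter += 1
        for key in self.written:
            self.db.committed_at[key] = self.db.commit_counter
            del self.db.uncommitted_by[key]
        self.written.clear()


def replay(isolation_of_t2):
    """Run steps 1 to 5 and report how step 5 ends."""
    db = Database()
    t1 = Transaction(db, "concurrency")    # step 1
    t2 = Transaction(db, isolation_of_t2)  # step 2
    t1.update("USERS:42")                  # step 3
    t1.commit()                            # step 4
    try:
        t2.update("USERS:42")              # step 5
        return "ok"
    except UpdateConflict as exc:
        return f"conflict: {exc}"
```

Calling `replay("concurrency")` ends in a conflict even though no other transaction is active at step 5, while `replay("read_committed")` goes through.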
In my process, I need to allow a user to execute the same task (with different params) multiple times.
Looking at the TaskService class, I see that there is only a complete method.
The complete method sets the task to completed, and the user cannot execute it again.
Is there a solution to that?
Thanks!
A task can only be executed once. However, your process could allow the same task (node) to be triggered multiple times, for example by using an intermediate signal start event leading to your task, where the task is triggered depending on the data sent alongside the signal. The process could allow retriggering the same task multiple times (during a certain phase of the process, for example).
See the updated question below.
Original question:
In my current Rails project, I need to parse a large XML/CSV data file and save it into MongoDB.
Right now I use these steps:
Receive uploaded file from user, store the data into mongodb
Use sidekiq to perform async process of the data in mongodb.
After process finished, delete the raw data.
For small and medium data on localhost, the steps above run well. But on Heroku, I use HireFire to dynamically scale the worker dynos up and down. While a worker is still processing the large data, HireFire sees an empty queue and scales down the worker dyno. This sends a kill signal to the process and leaves it in an incomplete state.
I'm searching for a better way to do the parsing, one that allows the parsing process to be killed at any time (saving the current state when it receives the kill signal) and to be re-queued.
Right now I'm using Model.delay.parse_file and it doesn't get re-queued.
UPDATE
After reading the Sidekiq wiki, I found the article about job control. Can anyone explain the code, how it works, and how it preserves its state when it receives a SIGTERM signal and the worker gets re-queued?
Is there any alternative way to handle job termination, save current state, and continue right from the last position?
Thanks,
Might be easier to explain the process and the high level steps, give a sample implementation (a stripped down version of one that I use), and then talk about throw and catch:
Insert the raw csv rows with an incrementing index (to be able to resume from a specific row/index later)
Process the CSV stopping every 'chunk' to check if the job is done by checking if Sidekiq::Fetcher.done? returns true
When the fetcher is done?, store the index of the currently processed item on the user and return so that the job completes and control is returned to sidekiq.
Note that if a job is still running after a short timeout (default 20s) the job will be killed.
Then, when the job runs again, simply start where you left off last time (or at 0)
Example:
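class UserCSVImportWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    # Resume from the last recorded index (nil.to_i is 0 on the first run).
    items = user.raw_csv_items.where(:index => {'$gte' => user.last_csv_index.to_i})

    items.each_with_index do |item, i|
      # Every 100 items, check whether Sidekiq has begun shutting down.
      # The parentheses matter: i+1 % 100 parses as i + (1 % 100).
      if ((i + 1) % 100).zero? && Sidekiq::Fetcher.done?
        user.update(last_csv_index: item.index)
        return
      end

      # Process the item as normal
    end
  end
end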
The class above checks every 100 items whether the fetcher is done (a proxy for whether shutdown has started) and, if so, ends execution of the job. Before execution ends, however, we update the user with the index of the item we stopped at, so that we can start where we left off next time.
throw catch is a way to implement this above functionality a little cleaner (maybe) but is a little like using Fibers, nice concept but hard to wrap your head around. Technically throw catch is more like goto than most people are generally comfortable with.
edit
Alternatively, you could skip the call to Sidekiq::Fetcher.done? and instead record the last_csv_index on each row or on each chunk of rows processed. That way, if your worker is killed without having the opportunity to record the last_csv_index, you can still resume 'close' to where you left off.
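Stripped of Sidekiq and MongoDB, the checkpoint-per-chunk idea looks roughly like the sketch below. It is in Python only to keep it self-contained; `process_rows`, the `checkpoint` dict (standing in for the persisted last_csv_index) and the `should_stop` callback (standing in for the shutdown check) are all hypothetical names.

```python
def process_rows(rows, checkpoint, handle_row, chunk_size=100,
                 should_stop=lambda: False):
    """Process rows from the last checkpoint onwards.

    Returns True when every row has been handled, False when the run
    stopped early and has to be re-queued.
    """
    start = checkpoint.get("next_index", 0)
    for i in range(start, len(rows)):
        handle_row(rows[i])
        processed = i + 1
        if processed % chunk_size == 0 or processed == len(rows):
            # Persist progress at every chunk boundary, so that even a hard
            # kill loses at most one chunk of work.
            checkpoint["next_index"] = processed
            if processed < len(rows) and should_stop():
                return False
    return True
```

If the process dies in the middle of a chunk, the rows handled since the last checkpoint are handled again on the next run, so `handle_row` needs to be safe to repeat.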
You are trying to address the concept of idempotency, the idea that processing a thing multiple times with potential incomplete cycles does not cause problems. (https://github.com/mperham/sidekiq/wiki/Best-Practices#2-make-your-jobs-idempotent-and-transactional)
Possible steps forward
Split the file up into parts and process those parts with a job per part.
Lift the threshold for hirefire so that it will scale when jobs are likely to have fully completed (10 minutes)
Don't allow hirefire to scale down while a job is working (set a redis key on start and clear on completion)
Track progress of the job as it is processing and pick up where you left off if the job is killed.
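For the first option in the list, the splitting itself is only a bit of arithmetic. A small sketch (in Python, with a hypothetical helper name) that turns a row count into half-open index ranges, each of which would be handed to its own job:

```python
def split_into_parts(total_rows, part_size):
    """Return (start, end) index pairs, end exclusive, covering all rows."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [(start, min(start + part_size, total_rows))
            for start in range(0, total_rows, part_size)]
```

Smaller parts mean each job finishes quickly, so a job killed by a scale-down only has to redo its own range.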
With MS Access single user,
Is it good practice or okay to maintain a persistent connection throughout?
pseudocode:
app.start();
access.connect();
domanymanystuff();
access.disconnect();
app.exit();
--- OR ----
app.start();
access.connect();
doonetask();
access.disconnect();
...
access.connect();
doanothertask();
access.disconnect();
...
app.exit();
?
Honestly it won't matter much, since most data connections are pooled and will hang around for reuse after you have closed them. You do want to make sure that your transactions are performed in a 'per unit of work' fashion.
Otherwise, even with a single user DB you could find your application locking itself out.
So, try this:
Open connection
Start transaction
Perform unit of work
Commit transaction
...
Start transaction
Perform unit of work
Commit transaction
...
Start transaction
Perform unit of work
Commit transaction
...
Close connection
You can maintain a persistent connection throughout with a single-user database.
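The "one connection, one transaction per unit of work" pattern can be sketched as follows. Note the assumptions: Python's built-in sqlite3 module is used as a stand-in because it runs anywhere, not because it behaves like the Access/Jet engine, and the `tasks` table and helper names are invented for the example.

```python
import sqlite3
from contextlib import contextmanager


def open_connection():
    """Open the single long-lived connection (in-memory stand-in database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    return conn


@contextmanager
def unit_of_work(conn):
    """Commit if the block succeeds, roll back if it raises."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def add_task(conn, name):
    # One small transaction per unit of work; the connection stays open.
    with unit_of_work(conn):
        conn.execute("INSERT INTO tasks (name) VALUES (?)", (name,))
```

The connection is opened once at startup and closed once at exit, while every change commits or rolls back on its own, so a failed unit of work never leaves a transaction open behind it.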