Task in DAG tries to run while already running

Using Apache Airflow:
We have created a DAG that runs every day at 07:00: schedule_interval='0 7 * * *'
The task is searching for a new row in a certain table. If it sees a new row, it continues to execute more tasks and so on.
We want the task to run for 19 hours. If it did not find a new row in that table, it will skip the rest of the tasks. The task's timeout is: timeout=60 * 60 * 19
Recently we found that after 12 hours of running, we get an error which causes the task to fail. Because we have a retry configured, the task retries and then runs for the full 19 hours.
So instead of 19 hours, we get a run of 31 hours.
Here is the error:
INFO - Dependencies not met for <TaskInstance: DAG_NAME.check_for_new_file 2021-06-28T07:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Has anyone experienced this error? If I understand correctly, the task is trying to run again after 12 hours while it is still running, so its state is being changed from 'running' to 'failed'?
Thanks!
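This does not explain the error itself, but one way to keep the total wait at 19 hours even when a retry happens is to measure the timeout from a fixed start time (for example, the scheduled 07:00 run) rather than from the start of each attempt. Below is a minimal sketch in plain Python, not Airflow's API: `wait_for_row`, `check`, and the injected `now` and `sleep` callables are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Callable

TIMEOUT = timedelta(hours=19)  # same budget as timeout=60 * 60 * 19


def wait_for_row(
    check: Callable[[], bool],
    started_at: datetime,
    now: Callable[[], datetime],
    sleep: Callable[[float], None],
    poke_interval: float = 60.0,
) -> bool:
    """Poll `check` until it returns True or the deadline passes.

    The deadline is anchored to `started_at` (e.g. the scheduled run time),
    so a retry that calls this again does not get a fresh 19 hours.
    """
    deadline = started_at + TIMEOUT
    while now() < deadline:
        if check():
            return True
        sleep(poke_interval)
    return False
```

With this shape, an attempt that starts 12 hours after `started_at` only waits for the remaining 7 hours, so the run ends at hour 19 instead of hour 31.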

Related

Writing graceful timeout for Nagios plugin

From Nagios' Plugin Development Guidelines:
Plugins have a very limited runtime - typically 10 sec. As a result, it is very important for plugins to maintain internal code to exit if runtime exceeds a threshold.
All plugins should timeout gracefully, not just networking plugins.
How can I implement a timeout mechanism in my custom plugin? Basically I want my plugin to return status code 3 - UNKNOWN instead of the default 2 - CRITICAL when the plugin times out, to reduce the number of false positives generated.
EDIT: My plugin is written in Bash.
You can use the timeout command. Here is an example:
timeout 15 ping google.com
if [ $? -eq 124 ]; then
    echo "UNKNOWN - Time limit exceeded."
    exit 3
fi
timeout returns exit status 124 when your command does not finish within the given time (15 seconds here).

Rails memory constantly increasing during minitest testing even after GC.start is being called

I am running rake test in a Docker container, where I need to limit the container's max-memory. This has been achieved by passing the -m 1800m flag to docker run command. Unfortunately, the rake process seems to keep growing wrt memory usage, and is finally killed mid-way due to some sort of OOM killer. I've tried putting the following in my test_helper.rb file...
class ActiveSupport::TestCase
  teardown :perform_gc

  def perform_gc
    puts "==> Before GC: #{ObjectSpace.count_objects}"
    GC.start(full_mark: true)
    puts "==> After GC: #{ObjectSpace.count_objects}"
  end
end
...and am getting this output during my test run.
==> Before GC: {:TOTAL=>1067513, :FREE=>798, :T_OBJECT=>146231, :T_CLASS=>13450, :T_MODULE=>3350, :T_FLOAT=>10, :T_STRING=>448040, :T_REGEXP=>3951, :T_ARRAY=>156744, :T_HASH=>32722, :T_STRUCT=>2162, :T_BIGNUM=>8, :T_FILE=>266, :T_DATA=>113834, :T_MATCH=>20339, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>115807, :T_ICLASS=>9741}
==> After GC: {:TOTAL=>1067920, :FREE=>304019, :T_OBJECT=>92774, :T_CLASS=>13431, :T_MODULE=>3350, :T_FLOAT=>10, :T_STRING=>328707, :T_REGEXP=>3751, :T_ARRAY=>107523, :T_HASH=>25206, :T_STRUCT=>2023, :T_BIGNUM=>7, :T_FILE=>11, :T_DATA=>112605, :T_MATCH=>11, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>64713, :T_ICLASS=>9719}
... test result of test #1 ....
==> Before GC: {:TOTAL=>1598233, :FREE=>338182, :T_OBJECT=>111209, :T_CLASS=>15057, :T_MODULE=>3481, :T_FLOAT=>10, :T_STRING=>570289, :T_REGEXP=>4836, :T_ARRAY=>219746, :T_HASH=>54358, :T_STRUCT=>12047, :T_BIGNUM=>8, :T_FILE=>12, :T_DATA=>138031, :T_MATCH=>2600, :T_COMPLEX=>1, :T_RATIONAL=>389, :T_NODE=>117993, :T_ICLASS=>9984}
==> After GC: {:TOTAL=>1598233, :FREE=>653201, :T_OBJECT=>103708, :T_CLASS=>14275, :T_MODULE=>3426, :T_FLOAT=>10, :T_STRING=>418825, :T_REGEXP=>3773, :T_ARRAY=>137405, :T_HASH=>39734, :T_STRUCT=>7444, :T_BIGNUM=>7, :T_FILE=>12, :T_DATA=>128923, :T_MATCH=>12, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>77590, :T_ICLASS=>9828}
... test result of test #2 ....
==> Before GC: {:TOTAL=>1598233, :FREE=>269630, :T_OBJECT=>114406, :T_CLASS=>14815, :T_MODULE=>3470, :T_FLOAT=>10, :T_STRING=>611637, :T_REGEXP=>4352, :T_ARRAY=>248693, :T_HASH=>58757, :T_STRUCT=>12208, :T_BIGNUM=>8, :T_FILE=>25, :T_DATA=>139671, :T_MATCH=>2288, :T_COMPLEX=>1, :T_RATIONAL=>83, :T_NODE=>108278, :T_ICLASS=>9901}
==> After GC: {:TOTAL=>1598233, :FREE=>635044, :T_OBJECT=>105028, :T_CLASS=>14358, :T_MODULE=>3427, :T_FLOAT=>10, :T_STRING=>429137, :T_REGEXP=>3775, :T_ARRAY=>140654, :T_HASH=>41626, :T_STRUCT=>8085, :T_BIGNUM=>7, :T_FILE=>12, :T_DATA=>129507, :T_MATCH=>15, :T_COMPLEX=>1, :T_RATIONAL=>59, :T_NODE=>77631, :T_ICLASS=>9857}
... test result of test #3 ....
... and so on ....
The value of ObjectSpace.count_objects[:TOTAL] is constantly growing after each test!
1067920 (after GC is run at the end of 1st test)
5250321 (after GC is run at the end of 18th test)
8631313 (after GC is run at the end of last, but ten, test)
8631313 (this number remains the same for the next 10 tests)
8631313 (after GC is run at the end of last, but three, test)
8631313 (after GC is run at the end of last, but two, test)
8631313 (after GC is run at the end of last, but one, test)
8631721 (after GC is run at the end of last test, after which rake aborts)
I'm also monitoring the process's memory usage via docker stats and ps aux --sort -rss. The memory consumption stabilises around the 1.77 GB mark, which is dangerously close to the 1800 MB limit set for the container. This is consistent with the value of ObjectSpace.count_objects[:TOTAL] not changing for the last 10-15 tests before rake is killed/aborted.
The error message in docker logs at the time of crash/abort/kill is:
PANIC: could not write to log file 00000001000000000000000A at offset 14819328, length 16384: Cannot allocate memory
LOG: unexpected EOF on client connection with an open transaction
LOG: WAL writer process (PID 262) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2018-02-26 06:35:38 UTC
How do I get to the bottom of this and ensure that my tests run in constant memory?

Neo4j 2.1.2 incremental backup fails but full backup succeeds

We recently upgraded our database from 2.0.1 to 2.1.2 (Enterprise) using the explicit upgrade procedure.
When trying to take a backup post-upgrade, full backups succeed, but incremental backups fail.
When running this command the first time, it succeeds:
~/neo4j-enterprise-2.1.2/bin/neo4j-backup -from single://127.0.0.1 -to /mnt/backups/neo4j-test-backup
Running it a second time gives the following error:
Performing backup from '127.0.0.1'
00:18:44.907 [main] INFO o.n.k.InternalAbstractGraphDatabase - No locking implementation specified, defaulting to 'forseti'
Transactions applied
Exception in thread "main" org.neo4j.consistency.ConsistencyCheckingError: Inconsistencies in transaction:
Start[3,xid=GlobalId[NEOKERNL|2772027681176372421|40044|-1], BranchId[ 52 49 52 49 52 49 ],master=-1,me=-1,time=2014-06-23 23:56:53.637+0000/1403567813637,lastCommittedTxWhenTransactionStarted=752027]
1PC[3, txId=752028, 2014-06-23 23:56:53.647+0000/1403567813647]
ConsistencySummaryStatistics{
Number of errors: 2
Number of warnings: 0
Number of inconsistent RELATIONSHIP records: 2
}
at org.neo4j.consistency.checking.incremental.intercept.CheckingTransactionInterceptor.complete(CheckingTransactionInterceptor.java:181)
at org.neo4j.kernel.impl.transaction.xaframework.LogEntryVisitorAdapter.apply(LogEntryVisitorAdapter.java:62)
at org.neo4j.kernel.impl.transaction.xaframework.LogEntryVisitorAdapter.apply(LogEntryVisitorAdapter.java:28)
at org.neo4j.kernel.impl.nioneo.xa.command.LogFilter.endLog(LogFilter.java:87)
at org.neo4j.kernel.impl.transaction.xaframework.XaLogicalLog.applyTransaction(XaLogicalLog.java:1120)
at org.neo4j.kernel.impl.transaction.xaframework.XaResourceManager.applyCommittedTransaction(XaResourceManager.java:856)
at org.neo4j.kernel.impl.transaction.xaframework.XaDataSource.applyCommittedTransaction(XaDataSource.java:246)
at org.neo4j.com.ServerUtil.applyReceivedTransactions(ServerUtil.java:461)
at org.neo4j.backup.BackupService.unpackResponse(BackupService.java:401)
at org.neo4j.backup.BackupService.incrementalWithContext(BackupService.java:315)
at org.neo4j.backup.BackupService.doIncrementalBackup(BackupService.java:257)
at org.neo4j.backup.BackupService.doIncrementalBackup(BackupService.java:210)
at org.neo4j.backup.BackupService.doIncrementalBackupOrFallbackToFull(BackupService.java:231)
at org.neo4j.backup.BackupTool.doBackup(BackupTool.java:240)
at org.neo4j.backup.BackupTool.run(BackupTool.java:168)
at org.neo4j.backup.BackupTool.main(BackupTool.java:71)
Any help/workarounds are appreciated.
Update: The same behavior persists after upgrading to 2.1.3
Could you please check whether the issue is resolved with 2.1.4? I vaguely remember a resolved issue regarding incremental backups.

Rake Task killed probably by out-of-memory issue

I have a rake task, and when I run it in the console it is killed. The task operates on a table of about 40,000 rows, so I guess the problem may be that it runs out of memory.
Also, I believe the query I use is optimized for dealing with large tables:
MyModel.where(:processed => false).pluck(:attribute_for_analysis).find_each(:batch_size => 100) do |a|
# deal with 40000 rows and only attribute `attribute_for_analysis`.
end
This task will not be run on a regular basis in the future, so I want to avoid job monitoring solutions like God etc., but I am considering background jobs, e.g. a Resque job.
I work with Ubuntu, Ruby 2.0 and Rails 3.2.14.
My free memory is as follows:
             total       used       free     shared    buffers     cached
Mem:       3891076    1901532    1989544          0       1240     368128
-/+ buffers/cache:    1532164    2358912
Swap:      4035580     507108    3528472
QUESTIONS:
How to investigate why the rake task is always killed (answered)
How to make this rake task run (not answered; it is still killed)
What is the difference between total-vm, anon-rss, and file-rss (not answered)
UPDATE 1
Can someone explain the difference between the following?
total-vm
anon-rss
file-rss
$ grep "Killed process" /var/log/syslog
Dec 25 13:31:14 Lenovo-G580 kernel: [15692.810010] Killed process 10017 (ruby) total-vm:5605064kB, anon-rss:3126296kB, file-rss:988kB
Dec 25 13:56:44 Lenovo-G580 kernel: [17221.484357] Killed process 10308 (ruby) total-vm:5832176kB, anon-rss:3190528kB, file-rss:1092kB
Dec 25 13:56:44 Lenovo-G580 kernel: [17221.498432] Killed process 10334 (ruby-timer-thr) total-vm:5832176kB, anon-rss:3190536kB, file-rss:1092kB
Dec 25 15:03:50 Lenovo-G580 kernel: [21243.138675] Killed process 11586 (ruby) total-vm:5547856kB, anon-rss:3085052kB, file-rss:1008kB
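For reference, in these kernel messages total-vm is the size of the process's virtual address space, anon-rss is the resident memory not backed by files (heap, stack), and file-rss is the resident memory backed by files. To compare the values with the free output, the kB fields can be converted to GiB. A minimal sketch follows; `parse_oom_line` and `kb_to_gib` are hypothetical helpers, and the regular expression assumes exactly the line format shown in this post.

```python
import re

# Matches the fields of a kernel "Killed process" line as shown in the log above.
OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_line(line: str) -> dict:
    """Return pid, process name, and the three memory fields in kB."""
    m = OOM_RE.search(line)
    if m is None:
        raise ValueError("not a 'Killed process' line")
    fields = m.groupdict()
    return {
        "pid": int(fields["pid"]),
        "name": fields["name"],
        "total_vm_kb": int(fields["total_vm"]),
        "anon_rss_kb": int(fields["anon_rss"]),
        "file_rss_kb": int(fields["file_rss"]),
    }


def kb_to_gib(kb: int) -> float:
    """Convert kB (as printed by the kernel) to GiB."""
    return kb / (1024 * 1024)


line = (
    "Dec 25 13:31:14 Lenovo-G580 kernel: [15692.810010] Killed process 10017 "
    "(ruby) total-vm:5605064kB, anon-rss:3126296kB, file-rss:988kB"
)
info = parse_oom_line(line)
print(
    f"{info['name']} (pid {info['pid']}): "
    f"anon-rss {kb_to_gib(info['anon_rss_kb']):.2f} GiB, "
    f"total-vm {kb_to_gib(info['total_vm_kb']):.2f} GiB"
)
```

For the first log line, anon-rss works out to about 2.98 GiB, while free reports 3891076 kB (about 3.71 GiB) of total RAM, so the ruby process alone was holding most of the physical memory when it was killed.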
UPDATE 2
I modified the query like this, and the rake task is still killed.
MyModel.where(:processed => false).find_in_batches do |group|
  p system("free -k")
  group.each do |row| # process
  end
end

rake jobs:work working fine. problem with script/delayed_job start

I am calling the method with LoadData.send_later(:test).
LoadData is my class and test is my method.
It works fine while I am running rake jobs:work.
But when I run script/delayed_job start or script/delayed_job run, delayed_job.log shows an error like this:
TEastern Daylight Time: *** Starting job worker delayed_job host:KShah pid:5968
TEastern Daylight Time: * [Worker(delayed_job host:KShah pid:5968)] acquired lock on LoadData.load_test_data_with_delayed_job
Could not load object for job: uninitialized constant LoadData
TEastern Daylight Time: * [JOB] delayed_job host:KShah pid:5968 completed after 0.0310
TEastern Daylight Time: 1 jobs processed at 10.6383 j/s, 0 failed ...
Any solution??
Try putting include LoadData in an initializer. I seem to remember DelayedJob including ActiveRecord classes, notifiers, etc., but not custom classes. Personally I'd put the class in your models directory. It's still dealing with data, even if it's not ActiveRecord.
Try doing this:
Delayed::Job.enqueue LoadData.test
Also, a big gotcha that took me a while to realize: if you make changes to the code, restart rake jobs:work!
