Docker + Airflow (some tasks remain on "scheduled" forever) - docker

I'm using Docker to run/test Airflow DAGs locally before I release changes to production. For whatever reason, some DAGs (i.e. their first task) remain on "scheduled" forever, while others work perfectly fine. There is an interesting pattern: smaller DAGs run every now and then, while bigger DAGs with many simultaneous tasks always remain scheduled.
When I click on the "scheduled" task and go to "Instance Details", I can see this message: ('Not scheduling since there are %s open slots in pool %s and require %s pool slots', -1, 'dbrelay', 1)
Does anyone know what's going on here?
Thank you in advance!!!
P.S. This problem has never occurred in production, only locally.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2022, 2, 18),
    "email": ["xxx@xxx.de"],
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": task_fail_slack_alert,  # slack alert on all task failures
    "location": "eu",
    "bigquery_conn_id": BIGQUERY_CONN_ID,
    "provide_context": True,
}
with DAG(
    "dag_name",
    default_args=default_args,
    concurrency=3,  # concurrent tasks
    catchup=False,
    schedule_interval="30 05,17 * * *",
    max_active_runs=1,
    params={
        "PROJECT": PROJECT,
        "TMP_VIEW_DATASET": TMP_VIEW_DATASET,
        "HOURS_TO_SUBTRACT": 4,
        "REPORT_DATE": REPORT_DATE,
    },
) as dag:
    with TaskGroup(group_id="dbrelays") as tg_dbrelays:
        for index, relay in enumerate(dbrelays):
            op_dbrelay = EPBigQueryInsertJobOperator(
                task_id=f"db_relay_{relay}",
                sql="templates/generic/create_or_replace_snapshot.sql",
                pool="dbrelay",
                params={
                    "SOURCE_PROJECT": PROJECT,
                    "DESTINATION_PROJECT": PROJECT,
                    "SOURCE_DATASET": "dbrelay",
                    "DESTINATION_DATASET": TMP_VIEW_DATASET,
                    "TABLE": relay,
                },
            )


Cypress 12 + Angular 15 + input chain failing randomly

I just migrated my application from Angular 12 to Angular 15 (and Material 15).
Cypress was also migrated, from 8.7.0 to 12.3.0.
Since the migration, the existing Cypress tests no longer run consistently.
I have two kinds of issue:
Cannot get an element by id or CSS class. The error is "…is being covered by another element".
Synchronisation is not perfect when chaining commands on an input. For example:
cy.get('#birthPlace_id').clear().type('London').should('have.class', 'ng-valid');
In this test, type() starts while the clear() instruction has not completely finished. This gives a wrong input value: a mix of the previous and newly typed values.
Here is my configuration:
defaultCommandTimeout: 60000,
execTimeout: 60000,
pageLoadTimeout: 60000,
requestTimeout: 60000,
responseTimeout: 60000,
taskTimeout: 60000,
videoUploadOnPasses: true,
screenshotOnRunFailure: false,
videoCompression: false,
numTestsKeptInMemory: 0,
animationDistanceThreshold: 20,
waitForAnimations: false,
env: { 'NO_COLOR': '1' },
retries: {
  runMode: 4,
  openMode: 0
},
fileServerFolder: '.,',
modifyObstructiveCode: false,
video: false,
chromeWebSecurity: true,
component: {
  devServer: {
    framework: 'angular',
    bundler: 'webpack'
  }
}
I already tried to:
Add "force: true"
Add a wait(1000) (or another value)
Use the click() method before
Increase all the timeouts in the config file
But the behaviour is the same: randomly it can work perfectly, but most of the time it does not.
I would expect the calls to clear(), type() and should() to be perfectly synchronised, each one starting only after the previous one has finished.
My question: is there a better way to do chaining? Did something change between Cypress 8 and 12 in how instructions are chained on an element?
In this test, the type() starts while the clear() instruction has not completely finished.
You can guard against this by adding an assertion after .clear(). This retries the test flow until the control has actually been cleared:
cy.get('#birthPlace_id')
.clear()
.should('have.value', '')
.type('London')
.should('have.value', 'London')

Twilio Flex Voicemail Flex Task Assignment Workflow

Hello, I have a workflow (below).
It sends callers to a queue with a single worker.
I want to implement voicemail, and have followed the instructions here: https://support.twilio.com/hc/en-us/articles/360021082934-Implementing-Voicemail-with-Twilio-Flex-TaskRouter-and-Insights
If I put the voicemail filter first in the workflow, calls always go to voicemail; if I put it after, they never do.
How can I make it so that calls that are reserved but not answered go to the voicemail queue?
{
  "task_routing": {
    "filters": [
      {
        "filter_friendly_name": "Voicemail ",
        "expression": "1==1",
        "targets": [
          {
            "queue": "theRealqueue",
            "timeout": 10
          },
          {
            "queue": "voicemail",
            "timeout": 10
          }
        ]
      }
    ],
    "default_filter": {
      "queue": "gfhfghgfhghfghgfhfghgfh"
    }
  }
}
I believe the timeout you have set on each target counts the time the task has spent in the workflow overall, regardless of the target. In the workflow you shared, when a task spends 10 seconds in theRealqueue and then times out, it also times out of the voicemail queue, because that timeout was also 10 seconds.
Try setting the voicemail target to a larger timeout than theRealqueue:
"targets": [
  {
    "queue": "theRealqueue",
    "timeout": 10
  },
  {
    "queue": "voicemail",
    "timeout": 20
  }
]
There were 3 problems.
I added "skip_if": "workers.available == 0" to the first filter. It makes sense now: in the GUI the default is DO NOT SKIP, which to my thinking means exactly what it says.
And wow, it worked. I had earlier set a TASK RESERVATION TIMEOUT of 8 seconds, but when I tried increasing it, the call never got to voicemail / never went to the next step.
I could only get it to work with that 8-second TASK RESERVATION TIMEOUT, not a larger value, so I looked in the Studio Flow: SEND TO FLEX had a timeout of 10 seconds, my bad. Increased it and all good now.
The documentation/tutorial here is terrible: https://support.twilio.com/hc/en-us/articles/360021082934-Implementing-Voicemail-with-Twilio-Flex-TaskRouter-and-Insights
Select the default Assign to Anyone workflow, or the appropriate workflow if you have modified this.
Click Add a Filter.
Name your Filter Voicemail (or something similarly identifiable), and then change the TIMEOUT to 30 seconds. Click Add a Step when finished.
Click the QUEUE value, and then select the Voicemail queue you created in the previous section. Press Save when finished.
That is all not so relevant here; it seems the thing that controls moving to the next step is the TASK RESERVATION TIMEOUT, and no step is skipped unless a "skip_if" is defined.
I would really love to get clarification on all this, but the steps I took provided a solution. I banged my head against walls for a few days here.
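Pulling the skip_if fix together with the earlier target snippet, the workflow JSON would look roughly like this (a sketch: placing skip_if on the first target follows the TaskRouter workflow configuration docs, and the timeout values are illustrative):

```json
"targets": [
  {
    "queue": "theRealqueue",
    "timeout": 10,
    "skip_if": "workers.available == 0"
  },
  {
    "queue": "voicemail",
    "timeout": 20
  }
]
```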

Can ActiveRecord validations leak IMEMO objects?

Background
My team is attempting to track down a memory leak in our Rails 6.1 application. We used the technique described here of taking three consecutive heap dumps and diffing them. We used rbtrace to get the dumps and rbheap to do the diffing. We tried this several times with different intervals between the samples.
Versions:
Rails 6.1.6.1
Ruby 3.0.3
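The dump-and-diff technique described above can be approximated in-process with Ruby's stdlib objspace extension (a sketch of the idea only; rbtrace attaches to a running process instead, and the allocation loop here is a stand-in for the suspected leak):

```ruby
# Sketch: produce the same kind of JSON-lines heap dump that rbtrace
# captures, which tools like rbheap can then diff across samples.
require 'objspace'

# Record allocation-site metadata so each dumped object carries
# "file"/"line"/"method" fields like the examples below.
ObjectSpace.trace_object_allocations_start

# Stand-in for the code suspected of leaking.
suspects = Array.new(100) { { msrp: rand(1_000) } }

dump = ObjectSpace.dump_all(output: :string)  # one JSON object per line
puts "live objects recorded: #{dump.lines.size}"
```

Taking three such dumps at intervals and keeping only the addresses present in all of them is what surfaces candidate leaked objects.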
Results
About 85% of the results in the diff look like the examples shown below. They are related to ActiveRecord's numericality validation, which we use in one of our models. This is the validation's source code. The strange thing is that these allocations are IMEMO objects, which according to this article store information about the compiled code.
Validation in our model
validates :msrp, numericality: { less_than_or_equal_to: MAX_INT }, allow_nil: true
Example IMEMO object allocations
{
  "address": "0x5632f3df7588",
  "type": "IMEMO",
  "class": "0x5632f654de48",
  "imemo_type": "callcache",
  "references": ["0x5632f654dbc8"],
  "file": "/app/vendor/bundle/ruby/3.0.0/gems/activerecord-6.1.6.1/lib/active_record/validations/numericality.rb",
  "line": 10,
  "method": "validate_each",
  "generation": 9233,
  "memsize": 40,
  "flags": {
    "wb_protected": true,
    "old": true,
    "uncollectible": true,
    "marked": true
  }
}
{
  "address": "0x5632f3e0f070",
  "type": "IMEMO",
  "class": "0x5632f7dc23d0",
  "imemo_type": "callinfo",
  "file": "/app/vendor/bundle/ruby/3.0.0/gems/activerecord-6.1.6.1/lib/active_record/validations/numericality.rb",
  "line": 10,
  "method": "validate_each",
  "generation": 6225,
  "memsize": 40,
  "flags": {
    "wb_protected": true,
    "old": true,
    "uncollectible": true,
    "marked": true
  }
}
Questions
Has anyone witnessed similar behavior of memory leaks related to ActiveRecord validations?
Does anyone have a theory as to why so many IMEMO objects are allocated and leaked for the same line of code?
This looks like a red herring. We removed the validation and still saw the same memory growth. A subsequent heap diff showed many more IMEMO objects from other methods.

Prometheus returns nothing after a while

We are using Prometheus and Grafana for our monitoring, and we have a panel for response time. However, I noticed that after a while the metrics go missing and there are lots of gaps in the panel (only in the response-time panel); they come back as soon as I restart the app (redeploying it in OpenShift). The service is written in Go, and the logic for gathering the response time is quite simple.
we declared the metric
var (
    responseTime = promauto.NewSummaryVec(prometheus.SummaryOpts{
        Namespace: "app",
        Subsystem: "rest",
        Name:      "response_time",
    }, []string{
        "path",
        "code",
        "method",
    })
)
and fill it in our handler
func handler(.......) {
    start := time.Now()
    // do stuff
    ....
    code := "200"
    path := r.URL.Path
    method := r.Method
    elapsed := float64(time.Since(start)) / float64(time.Second)
    responseTime.WithLabelValues(path, code, method).Observe(elapsed)
}
and the query in the Grafana panel is:
sum(rate(app_rest_response_time_sum{path='/v4/content'}[5m]) /
rate(app_rest_response_time_count{path='/v4/content'}[5m])) by (path)
but the result has gaps.
Can anyone explain what we are doing wrong or how to fix this issue? Is it possible we are facing some kind of overflow (the average RPS is about 250)? I suspect this because it happens more often on routes with higher RPS and response times.
Normally, Prometheus records metrics continuously, and when you query it, it returns all the metrics it collected for the time range you queried.
If there is no metric when you query, there are typically three reasons:
The metric was not there. This happens when the instance restarts and you have a dynamic set of labels, and there has been no request yet for the label value you queried (in your case, no request for path='/v4/content'). In that case you should still see other metrics of the same job (at least up).
Prometheus had problems storing the metrics (see the Prometheus log files for that timeframe).
Prometheus was down for that timeframe and therefore did not collect any metrics (in that case you should have no metrics at all for that timeframe).
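To tell the first and third cases apart, it may help to graph the scrape-health series and the raw count rate over one of the gaps (query sketches; the job label value is an assumption):

```promql
# 1 while the target is scraped successfully, 0/absent otherwise
up{job="app"}

# does the summary produce any samples at all during the gap?
sum(rate(app_rest_response_time_count[5m]))
```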

Using Sucker Punch with Active Job, is there a way to cancel a queued job?

So I have:
MyJob.perform_in(60, @user)
which will perform my job in 60 seconds.
I want to cancel this job if this line of code is run again, replacing it in the queue.
I have had no luck researching this.
To get the stats related to SuckerPunch jobs
[14] pry(main)> SuckerPunch::Queue.stats
{
  "CreateVVLinkJob" => {
    "workers" => {
      "total" => 1,
      "busy" => 0,
      "idle" => 1
    },
    "jobs" => {
      "processed" => 1,
      "failed" => 0,
      "enqueued" => 0
    }
  }
}
And to clear the previous jobs:
[24] pry(main)> SuckerPunch::Queue.clear
[
[0] #<SuckerPunch::Queue:0x0000000b150da0 #__lock__=#<Mutex:0x0000000b150d50>, #__condition__=#<Thread::ConditionVariable:0x0000000b150d28>, #running=false, #name="CreateVVLinkJob", #pool=#<Concurrent::ThreadPoolExecutor:0x0000000b146ad0 #__lock__=#<Mutex:0x0000000b1469e0>, #__condition__=#<Thread::ConditionVariable:0x0000000b1469b8>, #min_length=2, #max_length=2, #idletime=60, #max_queue=0, #fallback_policy=:abort, #auto_terminate=false, #pool=[], #ready=[], #queue=[], #scheduled_task_count=1, #completed_task_count=1, #largest_length=1, #ruby_pid=22314, #gc_interval=30, #next_gc_time=25973.834404648, #StopEvent=#<Concurrent::Event:0x0000000b1468c8 #__lock__=#<Mutex:0x0000000b146878>, #__condition__=#<Thread::ConditionVariable:0x0000000b146850>, #set=true, #iteration=0>, #StoppedEvent=#<Concurrent::Event:0x0000000b1467d8 #__lock__=#<Mutex:0x0000000b146788>, #__condition__=#<Thread::ConditionVariable:0x0000000b146760>, #set=true, #iteration=0>>>
]
Hope this is useful!
There is no built-in method in the SuckerPunch framework, as far as I can see from the source code, for canceling a single queued job or a job that is currently executing. Clearing the job queue appears to be an all-or-nothing operation.
That said, it should be a trivial matter to add an extension method that queries the underlying concurrent-ruby thread pool and matches your new job up with an already-queued job based on the value of the @user parameter.
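Since no cancel API exists, one hedged workaround is to not cancel at all but to make superseded jobs no-op: keep a version counter per user, stamp each enqueue with the current version, and have the job bail out if a newer version exists by the time it runs. A minimal sketch (the class and method names are hypothetical, and the SuckerPunch enqueue itself is omitted):

```ruby
# Sketch: "cancel by superseding" - each schedule bumps a per-key version;
# a job only does real work if its version is still the latest.
class CancellableJob
  @versions = Hash.new(0)
  @mutex = Mutex.new

  class << self
    # Call this before enqueueing and pass the returned version along
    # as a job argument (e.g. MyJob.perform_in(60, key, version)).
    def schedule(key)
      @mutex.synchronize { @versions[key] += 1 }
    end

    # Inside the job's perform: bail out if a newer schedule happened.
    def perform(key, version)
      current = @mutex.synchronize { @versions[key] }
      return :cancelled unless current == version
      # ... real work goes here ...
      :performed
    end
  end
end
```

Rescheduling then logically replaces the earlier job: SuckerPunch still runs the stale job, but it returns early without doing any work.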
