Airflow task fails with an empty log and doesn't send an email

I have a DAG with 60 tasks (PythonOperators), and in some runs different tasks are marked as failed, but I don't know why. When I go to "View Log" the log is empty, and when I hover over the red square it says Operator: null. What does that mean?
It seems like the task was never executed, but I don't understand why.
The questions are:
Why does Airflow mark the task as failed when no execution is shown in the log?
Why hasn't it sent an error email if the task is marked as failed?
Here is the Python code associated with the DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

DEFAULT_ARGS = {
    'owner': 'blablabla',
    'depends_on_past': False,
    'start_date': datetime(2018, 5, 8),
    'catchup': False,
    'email': ['mail#mail.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'max_active_runs': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG('dag_name',
          default_args=DEFAULT_ARGS,
          schedule_interval='20 0 * * *')

mylist = get_codes_list()

for item in mylist:
    healthcheckerName = 'healthchecker_' + item
    healthchecker = PythonOperator(
        dag=dag,
        task_id=healthcheckerName,
        python_callable=prime_ops.check_last_budget_calculation(item),
        queue=SPECIFIC_QUEUE,
        pool=DEFAULT_PPC_POOL
    )

The worker may have died. I would suggest increasing the memory allocation.
If a worker dies before the log buffer flushes, no logs are emitted. A task failure without logs is an indication that the Airflow workers were restarted because they ran out of memory (OOM).
You can read more here: Task fails without emitting logs.

This happened to me. I found that some worker nodes were out of disk space, so they were failing tasks because they couldn't write the log.
Go into the Docker container for the worker node and search logs/worker.log for "No space left on device".
If this is the case, there are a couple of easy ways to mitigate:
Manually delete log files older than a certain date; or
Kill and restart the affected Docker containers. This is what we ended up doing. You do lose ALL of your worker logs if you do this.
Longer term it might be worthwhile to set up log rotation or some automated cleanup of log files (oldest first).
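For the automated cleanup, a minimal sketch in Python (assuming the default log layout under ~/airflow/logs; the directory and retention period are placeholders to adjust for your setup):

import os
import time

LOG_DIR = os.path.expanduser("~/airflow/logs")  # assumption: default Airflow log directory
MAX_AGE_DAYS = 14                               # placeholder retention period

cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
for dirpath, _, filenames in os.walk(LOG_DIR):
    for name in filenames:
        path = os.path.join(dirpath, name)
        # remove task log files whose last modification is older than the cutoff
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

You could run something like this from cron or from a maintenance DAG, deleting the oldest files first.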

Related

Unable to search/delete a continuously retrying Sidekiq job

One of my Sidekiq worker classes was missing validations on data size, so one of the enqueued jobs pulls in huge data from the database and fails abruptly with the following message, immediately enqueuing another job with the same job_id.
Error performing MyWorkerClass (Job ID: my_job_id) from Sidekiq(my_queue_name) in 962208.79ms: Sidekiq::Shutdown (Sidekiq::Shutdown)
As soon as I get this message, a new job is enqueued.
Performing MyWorkerClass (Job ID: my_job_id) from Sidekiq(my_queue_name) with arguments: 1, {"param1"=>"param1_value", "param2"=>"param2_value", "param3"=>"param3_value"}
I am figuring out a way to fix this problem, but for now I want to stop this particular job from running continuously. I couldn't find it on my Sidekiq UI dashboard.
I also tried to find and delete the job using the following methods, but couldn't find it. All the variables printed below are nil.
a = Sidekiq::Queue.new('my_queue_name').find_job("my_job_id")
b = Sidekiq::ScheduledSet.new.find_job("my_job_id")
c = Sidekiq::RetrySet.new.find_job("my_job_id")
d = Sidekiq::JobSet.new('my_queue_name').find_job("my_job_id")
puts a.inspect
puts b.inspect
puts c.inspect
puts d.inspect
I want help with the following:
How to avoid this abrupt shutdown for long-running jobs in the future.
How to find the long-running job and kill it.
Thank you in advance!

Can I make flex template jobs take less than 10 minutes before they start to process data?

I am using terraform resource google_dataflow_flex_template_job to deploy a Dataflow flex template job.
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
}
}
It's all working fine. However, without my requesting it, the job seems to be using Flexible Resource Scheduling (FlexRS) mode; I say this because the job takes about ten minutes to start, and during that time it has state=QUEUED, which I think only applies to FlexRS jobs.
Using FlexRS mode is fine for production scenarios, but I'm currently still developing my Dataflow job, and FlexRS is massively inconvenient because it takes about 10 minutes to see the effect of any change I make, no matter how small.
In Enabling FlexRS it is stated
To enable a FlexRS job, use the following pipeline option:
--flexRSGoal=COST_OPTIMIZED, where the cost-optimized goal means that the Dataflow service chooses any available discounted resources or
--flexRSGoal=SPEED_OPTIMIZED, where it optimizes for lower execution time.
I then found the following statement:
To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
at Specifying pipeline execution parameters > Setting other Cloud Dataflow pipeline options
I interpret that to mean that flexrs_goal=SPEED_OPTIMIZED will turn off FlexRS mode, so I changed the definition of my google_dataflow_flex_template_job resource to:
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
"flexrs_goal" = "SPEED_OPTIMIZED"
}
}
(Note the addition of "flexrs_goal" = "SPEED_OPTIMIZED".) It doesn't seem to make any difference, though. The Dataflow UI confirms that SPEED_OPTIMIZED is set, but it still takes too long (9 minutes 46 seconds) for the job to start processing data, and it was in state=QUEUED for all that time:
2021-01-17 19:49:19.021 GMT  Starting GCE instance, launcher-2021011711491611239867327455334861, to launch the template.
...
...
2021-01-17 19:59:05.381 GMT  Starting 1 workers in europe-west1-d...
2021-01-17 19:59:12.256 GMT  VM, launcher-2021011711491611239867327455334861, stopped.
I then tried explicitly setting flexrs_goal=COST_OPTIMIZED just to see if it made any difference, but this only caused an error:
"The workflow could not be created. Causes: The workflow could not be
created due to misconfiguration. The experimental feature
flexible_resource_scheduling is not supported for streaming jobs.
Contact Google Cloud Support for further help. "
This makes sense. My job is indeed a streaming job and the documentation does indeed state that flexRS is only for batch jobs.
This page explains how to enable Flexible Resource Scheduling (FlexRS) for autoscaled batch pipelines in Dataflow.
https://cloud.google.com/dataflow/docs/guides/flexrs
This doesn't solve my problem though. As I said above, if I deploy with flexrs_goal=SPEED_OPTIMIZED the job still sits in state=QUEUED for almost ten minutes, yet as far as I know QUEUED only applies to FlexRS jobs:
Therefore, after you submit a FlexRS job, your job displays an ID and a Status of Queued
https://cloud.google.com/dataflow/docs/guides/flexrs#delayed_scheduling
Hence I'm very confused:
Why is my job getting queued even though it is not a flexRS job?
Why does it take nearly ten minutes for my job to start processing any data?
How can I speed up the time it takes for my job to start processing data so that I can get quicker feedback during development/testing?
UPDATE: I dug a bit more into the logs to find out what was going on during those 9 minutes 46 seconds. These two consecutive log messages are 7 minutes 23 seconds apart:
2021-01-17 19:51:03.381 GMT
"INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']"
2021-01-17 19:58:26.459 GMT
"INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi"
Whatever is going on between those two log records is the main contributor to the long time spent in state=QUEUED. Does anyone know what might be the cause?
As mentioned in the existing answer, you need to pull the apache-beam modules out of your requirements.txt and install them in a separate step:
RUN pip install -U apache-beam==<version>
RUN pip install -U -r ./requirements.txt
While developing, I prefer to use DirectRunner for the fastest feedback.
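For what it's worth, here is a minimal sketch of a pipeline run with DirectRunner (the pipeline body is just a placeholder, not the streaming job from the question):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally, so there is no launcher VM or worker start-up wait.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])  # placeholder input instead of Pub/Sub
     | "Print" >> beam.Map(print))

Once the logic looks right locally, you can switch the runner back to Dataflow for an end-to-end test.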

Dask with TLS connection cannot end the program with the to_parquet method

I am using Dask to process 10 files, each about 142MB in size. I built a method with the delayed decorator; the following is an example:
import os

import dask
import pandas as pd

@dask.delayed
def process_one_file(input_file_path, save_path):
    res = []
    for line in open(input_file_path):
        res.append(line)
    df = pd.DataFrame(res)
    df.to_parquet(save_path + os.path.basename(input_file_path))

if __name__ == '__main__':
    client = ClusterClient()  # helper (not shown) that creates the TLS-enabled distributed client
    input_dir = ""
    save_dir = ""
    print("start to process")
    csvs = [process_one_file(input_dir + filename, save_dir) for filename in os.listdir(input_dir)]
    dask.compute(csvs)
However, Dask does not always run successfully. After processing all the files, the program often hangs.
I used the command line to run the program. It often hangs after printing start to process. I know the program runs correctly, since I can see all the output files after a while.
But the program never stops. If I disable TLS, the program runs successfully.
It is so strange that Dask cannot stop the program when I enable the TLS connection. How can I solve it?
I found that if I add the to_parquet call the program cannot stop, while if I remove it, the program runs successfully.
I have found the problem. I set 10GB for each process, that is, memory-limit=10GB. I have 2 workers in total, each with 2 processes, and each process has 2 threads.
Thus, each machine will have 4 processes occupying 40GB in total. However, my machine only has 32GB. If I lower the memory limit, the program runs successfully!
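As a reference, here is a minimal local sketch where the per-worker memory limit is sized to fit the machine (the numbers are illustrative, not the exact cluster from the question):

from dask.distributed import Client, LocalCluster

# 2 worker processes x 2 threads each, 8GB per worker -> 16GB total, which fits within 32GB of RAM
cluster = LocalCluster(n_workers=2, threads_per_worker=2, memory_limit="8GB")
client = Client(cluster)
print(client)

The general rule is the same for a distributed deployment: the sum of the per-process memory limits on a machine should stay below that machine's physical RAM.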

IPython.parallel client is hanging while waiting for result of map_async

I am running 7 worker processes on a single machine with 4 cores. I may have made a poor choice with this loop while waiting for the result of map_async:
while not result.ready():
    time.sleep(10)
    for out in result.stdout:
        print(out)
rec_file_list = result.get()
result.stdout keeps growing with all the printed output from the 7 processes running, and it caused the console that initiated the map to hang. The activity monitor on my MacBook Pro shows the 7 processes are still running, and the terminal running the Controller is still active. What are my options here? Is there any way to acquire the result once the processes have completed?
I found an answer:
Remote introspection of AsyncResult objects is possible from another client, as long as a 'database backend' has been enabled by the controller with:
ipcontroller --dictdb # or --mongodb or --sqlitedb
Then it is possible to create a new client instance and retrieve the results with:
client.get_result(task_id)
where the task_ids can be retrieved with:
client.hub_history()
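Putting those pieces together, a rough sketch of the second client (assuming the controller is reachable with the default connection file and a database backend is enabled):

from IPython.parallel import Client

rc = Client()                    # connect a new client to the same controller
msg_ids = rc.hub_history()       # ids of the tasks the hub knows about
ar = rc.get_result(msg_ids[-1])  # result handle for the most recent task
print(ar.get())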
Also, a simple way to avoid the buffer overflow I encountered is to periodically print just the last few lines from each engine's stdout history and to flush the buffer, like this:
from IPython.display import clear_output
import sys
import time

while not result.ready():
    clear_output()
    for stdout in result.stdout:
        if stdout:
            lines = stdout.split('\n')
            for line in lines[-4:-1]:
                if line:
                    print(line)
    sys.stdout.flush()
    time.sleep(30)

How to use ssh_connection:exec in Erlang?

This is an interesting situation, focused on the behavior of the Erlang ssh modules. I had spent a few hours troubleshooting a problem that turned out to reveal that the Erlang ssh_connection:exec/4 function operates asynchronously.
If you use ssh_connection:exec/4 to run a script that takes several seconds to complete, and then in your Erlang program you close the SSH connection, the script execution will terminate. My expectation was that ssh_connection:exec would be synchronous rather than asynchronous.
Because the time to complete the remote script invoked by ssh_connection:exec is unknown, I chose not to call ssh:close(). I would like to understand the consequences of that:
Will the GC clear it at some point?
Will it stay open for good during the whole lifetime of the node?
Is there a way to make ssh_connection:exec synchronous, as I believe it should be?
Here is an example of the test Erlang program I used to verify this issue. As the script you can run a simple sleep 10 (sleep 10 seconds) to emulate a slow-running program.
-module(testssh).
-export([test/5]).

test(ServerName, Port, Command, User, Password) ->
    crypto:start(),
    ssh:start(),
    {ok, SshConnectionRef} = ssh:connect(ServerName, Port,
        [{user, User}, {password, Password}, {silently_accept_hosts, true}], 60000),
    {ok, SshConnectionChannelRef} = ssh_connection:session_channel(SshConnectionRef, 60000),
    Status = ssh_connection:exec(SshConnectionRef, SshConnectionChannelRef, Command, 60000),
    ssh:close(SshConnectionRef).
Remote script:
#!/bin/sh
sleep 10
I never had to use the ssh application myself, but you may be reading something wrong; the documentation is clear that the result will be delivered as messages to the caller:
[...] the result will be several messages according to the following pattern. Note that the last message will be a channel close message, as the exec request is a one time execution that closes the channel when it is done[...]
See http://www.erlang.org/doc/man/ssh_connection.html#exec-4
So after you call ssh_connection:exec/4, test with a loop like this:
wait_for_response(ConnectionRef) ->
    receive
        {ssh_cm, ConnectionRef, Msg} ->
            case Msg of
                {closed, _ChannelId} ->
                    io:format("Done");
                _ ->
                    io:format("Got: ~p", [Msg]),
                    wait_for_response(ConnectionRef)
            end
    end.
You should receive the command output and other ssh messages, and finally a 'closed' message, which is your signal that the ssh command has properly finished.
