When I run my Airflow DAG, no task is created

I have a problem: my demo DAG is very simple, but after deploying it to Airflow, execution does not have the desired effect. Here's my code:
"""
import pytz
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.operators.python_operator import PythonOperator

tz = pytz.timezone('Asia/Shanghai')
dt = datetime.now(tz)
utc_dt = dt.astimezone(pytz.utc).replace(tzinfo=None)

default_args = {
    'owner': 'syroot',
    "start_date": utc_dt - timedelta(minutes=2),
    "depends_on_past": False,
    'email': ['zhaosw#sunnyoptical.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    "retries": 1,
    "retry_delay": timedelta(seconds=5)
}

dag = DAG(
    "demo1",
    catchup=False,
    default_args=default_args,
    schedule_interval="*/2 * * * *",
)

def print_hello():
    return 'Hello world!'

hello_operator = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag)
"""
But the result is not what I expected: the DAG run succeeds, but no task is created. I cannot find anything in the Task Instances menu, although I can find the DAG run in the DAG Runs menu.

I don't believe your scheduler is able to run the DAG due to your dynamic start date.
Try changing "start_date": utc_dt - timedelta(minutes=2), to a static date like "start_date": datetime(2019,12,9),. That should allow the scheduler to pick it up!
It's generally recommended not to set your start_date dynamically.
Taken from Airflow FAQ:
We recommend against using dynamic values as start_date, especially
datetime.now() as it can be quite confusing. The task is triggered
once the period closes, and in theory an @hourly DAG would never get
to an hour after now as now() moves along.
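The FAQ's point can be illustrated with a toy model in plain Python. This is a simplified sketch, not Airflow's scheduler logic: the function `first_run_due` and the rule it encodes (a run fires once `start_date + interval` has passed) are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def first_run_due(start_date, interval, now):
    """Toy rule (assumption): the first run fires once the first interval has closed."""
    return start_date + interval <= now


interval = timedelta(hours=1)
static_start = datetime(2019, 12, 9)

# Pretend the DAG file is re-parsed every 30 minutes over one day.
parse_times = [datetime(2019, 12, 10) + timedelta(minutes=30 * i) for i in range(48)]

# Dynamic start_date: recomputed as "now" on every parse, so it moves with the clock.
dynamic_due = [first_run_due(now, interval, now) for now in parse_times]

# Static start_date: fixed in the past, so the first interval closed long ago.
static_due = [first_run_due(static_start, interval, now) for now in parse_times]

print(any(dynamic_due))  # False
print(all(static_due))   # True
```

With the moving start date, the end of the first interval is always one hour ahead of the clock, so the toy scheduler never sees it close.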

You have to specify which task to execute. It looks like you are missing:
hello_operator
in your DAG:
def print_hello():
    return 'Hello world!'

hello_operator = PythonOperator(
    task_id='hello_task',
    python_callable=print_hello,
    dag=dag)

hello_operator # <-- add this

Related

Setting timezone in AsyncIOScheduler

I'm in the Pacific timezone and I'm creating a Discord bot to send a message at 8 am Central time.
import os
import discord
from discord.ext import commands
from dotenv import load_dotenv
from rich import print
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger

load_dotenv()
TOKEN = os.getenv('DISCORD_TOKEN')

intents = discord.Intents.default()
intents.members = True
bot = commands.Bot(command_prefix = '!', intents=intents)

# Will become the good morning message
async def gm():
    c = bot.get_channel(channel_id_removed)
    await c.send("This will be the good morning message.")

@bot.event
async def on_ready():
    for guild in bot.guilds:
        print(
            f'{bot.user} is connected to the following guild:\n'
            f'\t{guild.name} (id: {guild.id})'
        )
    #initializing scheduler for time of day sending
    scheduler = AsyncIOScheduler()
    # Attempts to set the timezone
    # scheduler = AsyncIOScheduler(timezone='America/Chicago')
    # scheduler = AsyncIOScheduler({'apscheduler.timezone': 'America/Chicago'})
    # scheduler.configure(timezone='America/Chicago')
    # Set the time for sending
    scheduler.add_job(gm, CronTrigger(hour="6", minute="0", second="0"))
    #starting the scheduler
    scheduler.start()

@bot.event
async def on_member_join(member):
    general_channel = None
    guild_joined = member.guild
    print(guild_joined)
    general_channel = discord.utils.get(guild_joined.channels, name='general')
    print(f'General Channel ID: {general_channel}')
    if general_channel:
        embed=discord.Embed(title="Welcome!",description=f"Welcome to The Dungeon {member.mention}!!")
        await general_channel.send(embed=embed)

bot.run(TOKEN)
Environment:
Windows 10
Python 3.10.4
APScheduler 3.9.1
pytz 2022.1
pytz-deprecation-shim 0.1.0.post0
tzdata 2022.1
tzlocal 4.2
I'm just wondering if I'm doing something wrong? Or if what I'm trying to do simply isn't supported? It works if I use my local time so I know the function is ok.
You are using the asyncio scheduler but you're not running an asyncio event loop, so there is no way this could work. Copy/paste from the provided example:
from datetime import datetime
import asyncio
import os

from apscheduler.schedulers.asyncio import AsyncIOScheduler


def tick():
    print('Tick! The time is: %s' % datetime.now())


if __name__ == '__main__':
    scheduler = AsyncIOScheduler()
    scheduler.add_job(tick, 'interval', seconds=3)
    scheduler.start()
    print('Press Ctrl+{0} to exit'.format('Break' if os.name == 'nt' else 'C'))

    # Execution will block here until Ctrl+C (Ctrl+Break on Windows) is pressed.
    try:
        asyncio.get_event_loop().run_forever()
    except (KeyboardInterrupt, SystemExit):
        pass
It does not work because, while scheduler.start() instantiates an event loop as a side effect, it expects the loop to be run elsewhere so that the scheduler can do its work.
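The same behaviour can be seen with plain asyncio, without APScheduler. The sketch below is only an analogy: it registers a callback on a loop by hand (the names `fired`, `fired_before` and `fired_after` are made up for the example) and shows that nothing fires until something actually runs the loop.

```python
import asyncio

fired = []

loop = asyncio.new_event_loop()
# Register a callback with the loop, much as a scheduler registers its wake-ups.
loop.call_later(0.01, fired.append, "tick")

# Nothing has happened yet: the loop exists but nobody is running it.
fired_before = list(fired)

# Once something drives the loop, the pending callback gets its turn.
loop.run_until_complete(asyncio.sleep(0.1))
fired_after = list(fired)
loop.close()

print(fired_before)  # []
print(fired_after)   # ['tick']
```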

An example of workers with the right resource allocation in Dask distributed

Does anyone have working sample code showing that you can use CPU and GPU workers selectively with the client.submit API that Dask distributed provides?
I am trying to train xgboost with dask-cudf in a distributed manner on GPU machines, but I am not able to make it respect the resource tags I provide for different tasks.
My friend and coworker, @pentschev (GitHub), wanted to point you to this example from here:
https://github.com/dask/distributed/pull/4869#issue-909265778
import asyncio
import threading

import dask
from dask.distributed import Client, Scheduler, Worker
from distributed.threadpoolexecutor import ThreadPoolExecutor


def get_thread_name(prefix):
    return prefix + threading.current_thread().name


async def main():
    async with Scheduler() as s:
        async with Worker(
            s.address,
            nthreads=5,
            executor={
                "GPU": ThreadPoolExecutor(1, thread_name_prefix="Dask-GPU-Threads")
            },
            resources={"GPU": 1, "CPU": 4},
        ) as w:
            async with Client(s.address, asynchronous=True) as c:
                with dask.annotate(resources={"CPU": 1}, executor="default"):
                    print(await c.submit(get_thread_name, "CPU-"))
                with dask.annotate(resources={"GPU": 1}, executor="GPU"):
                    print(await c.submit(get_thread_name, "GPU-"))


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
Output:
CPU-Dask-Default-Threads'-29802-2
GPU-Dask-GPU-Threads-29802-3

airflow tasks in specific batches

I want to run a set of tasks like this:
a >> [b,c,d] >> [e,f,g] >> [h,i,j,k,l,m]
First run task a; when that is done, run b, c, d in parallel; then, when the last of b, c, d is done, start running e, f, g in parallel, and so on.
But I'm getting this error: unsupported operand type(s) for >>: 'list' and 'list'
What is the correct syntax for what I want to do?
The error you are getting occurs because setting dependencies between two lists with the bitshift operator is not supported: [task_a, task_b] >> [task_c, task_d] won't work.
IMHO the easiest and cleanest way to achieve what you are looking for (there are others) is to use TaskGroup and set dependencies between the groups, like this:
from time import sleep
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.task_group import TaskGroup

default_args = {
    'start_date': days_ago(1)
}


def _execute_task(**kwargs):
    print(f"Task_id: {kwargs['ti'].task_id}")
    sleep(10)


def _create_python_task(name):
    return PythonOperator(
        task_id=f'task_{name}',
        python_callable=_execute_task)


with DAG('parallel_tasks_example', schedule_interval='@once',
         default_args=default_args, catchup=False) as dag:

    task_a = DummyOperator(task_id='task_a')

    with TaskGroup('first_group') as first_group:
        for name in list('bcd'):
            task = _create_python_task(name)

    with TaskGroup('second_group') as second_group:
        for name in list('efg'):
            task = _create_python_task(name)

    with TaskGroup('third_group') as third_group:
        for name in list('hijk'):
            task = _create_python_task(name)

    task_a >> first_group >> second_group >> third_group
From TaskGroup class definition:
A collection of tasks. When set_downstream() or set_upstream() are called on
the TaskGroup, it is applied across all tasks within the group if necessary.
You can find an official example here.
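The TypeError from the question can be reproduced without Airflow. The sketch below uses a toy `Task` class (a hypothetical stand-in, not Airflow's BaseOperator) to show why `task >> [list]` and `[list] >> task` work while `[list] >> [list]` cannot: a plain Python list defines neither side of `>>`. The helper `cross_downstream` here is a hand-written illustration; Airflow ships a helper of the same name in `airflow.models.baseoperator`, which is another option besides TaskGroup.

```python
class Task:
    """Toy stand-in for an operator; not Airflow's BaseOperator."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # Handles task >> task and task >> [tasks].
        targets = other if isinstance(other, list) else [other]
        self.downstream.extend(targets)
        return other

    def __rrshift__(self, other):
        # Handles [tasks] >> task: Python falls back to the right operand.
        for task in other:
            task >> self
        return self


def cross_downstream(from_tasks, to_tasks):
    """Make every task in to_tasks depend on every task in from_tasks."""
    for task in from_tasks:
        task >> to_tasks


a, b, c, d, e = (Task(n) for n in "abcde")
a >> [b, c]  # fine: Task.__rshift__ accepts a list

error = None
try:
    [b, c] >> [d, e]  # neither operand is a Task, so Python has nothing to call
except TypeError as exc:
    error = str(exc)

cross_downstream([b, c], [d, e])

print(error)                          # unsupported operand type(s) for >>: 'list' and 'list'
print([t.name for t in b.downstream])  # ['d', 'e']
```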

neo4j query running slowly in Python for Windows

I'm developing locally on a Windows 10 machine using Python 2.7. I'm using Neo4j 3.0.5 with the Bolt driver for Python.
Connection string is as follows:
db = GraphDatabase.driver("bolt://localhost:7687", auth=basic_auth("USERNAME", "PASSWORD"), encrypted=False)
When running queries I'm using the following syntax:
with db.session() as s:
    with s.begin_transaction() as tx:
        results = tx.run(
            "MATCH (a:User{username:{username}}) RETURN a.username",
            {
                "username": username
            })
For some reason there is about a 2 second latency.
I ran the following tests (with slightly different syntax, but exactly the same result) and the slow elements seem to be loading the database driver and then the FIRST run of the session.
from neo4j.v1 import GraphDatabase
import timeit

start_time = 0
elapsed = 0
task = ""

def startTimer(taskIn):
    global task
    global start_time
    task = taskIn
    start_time = timeit.default_timer()

def endTimer():
    global task
    global start_time
    global elapsed
    elapsed = round(timeit.default_timer() - start_time, 3)
    print(task, elapsed)

startTimer("GraphDatabase.driver")
db = GraphDatabase.driver("bolt://localhost:7687", auth=("USERNAME", "PASSWORD"))
endTimer()

startTimer("db.session")
session = db.session()
endTimer()

startTimer("query1")
result = session.run(
    "MATCH (a:User{username:{username}}) RETURN a.username, a.password_hash ",
    {"username": "Pingu"})
endTimer()

startTimer("query2")
result = session.run(
    "MATCH (a:User{username:{username}}) RETURN a.username, a.password_hash ",
    {"username": "Pingu"})
endTimer()

startTimer("db.close")
session.close()
endTimer()
The results were as follows:
('GraphDatabase.driver', 1.308)
('db.session', 0.0)
('query1', 1.017)
('query2', 0.001)
('db.close', 0.009)
The string is the test step and the number is the number of seconds for execution.
I'm developing a Flask API and so I can get past the database driver load time by loading it once and then referencing the loaded instance.
However, I can't seem to get past the query1 issue.
Running the exact same code on an Ubuntu Server VirtualBox machine is lightning fast, so this seems to be related to the Windows implementation.
Any ideas how this can be resolved please?
Thank you very much!
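The "load it once and reference the loaded instance" idea mentioned above could look roughly like the sketch below. It only addresses the driver load time, not the slow first query. `FakeDriver` and `get_driver` are hypothetical names used so the example runs without Neo4j; in the real application the function body would call `GraphDatabase.driver(...)` instead.

```python
from functools import lru_cache

created = []  # records every time the expensive constructor runs


class FakeDriver:
    """Hypothetical stand-in for the object returned by GraphDatabase.driver(...)."""

    def __init__(self, uri):
        created.append(uri)
        self.uri = uri


@lru_cache(maxsize=None)
def get_driver(uri="bolt://localhost:7687"):
    # The body runs only on the first call for a given uri; later calls reuse the result.
    return FakeDriver(uri)


first = get_driver()
second = get_driver()

print(first is second)  # True
print(len(created))     # 1
```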

Grails Quartz Plugin concurrent not working

I'm having trouble getting my Quartz Job in Grails to run concurrently as expected. Here is what my job looks like. I've commented out the lines that use the Executor plugin but when I don't comment them out, my code works as expected.
import java.util.concurrent.Callable
import org.codehaus.groovy.grails.commons.ConfigurationHolder

class PollerJob {
    def concurrent = true
    def myService1
    def myService2
    //def executorService

    static triggers = {
        cron name: 'pollerTrigger', startDelay:0, cronExpression: ConfigurationHolder.config.poller.cronExpression
    }

    def execute() {
        def config = ConfigurationHolder.config
        //Session session = null;
        if (config.runPoller == true) {
            //def result = executorService.submit({
                myService1.doStuff()
                myService2.doOtherStuff()
            //} as Callable)
        }
    }
}
In my case, the myService2.doOtherStuff() is taking a very long time to complete which overlaps the next time this job should trigger. I don't mind if they overlap which is why I explicitly added def concurrent = true but it isn't working.
I have version 0.4.2 of the Quartz plugin and Grails 1.3.7. Am I doing something wrong here? Seems like a pretty straightforward feature to use. I'm not opposed to using the Executor plugin but it seems like I shouldn't have to.
I'm not sure it matters but the cronExpression I'm loading from config in this case is meant to execute this job every minute: "0 * * * * ?"
Apparently, there was a hidden config that I was not aware of that was keeping this from working. In my conf folder there was a file called quartz.properties which contained the following property:
org.quartz.threadPool.threadCount = 1
After increasing this number, my job was triggering even when it had not finished the previous execution.
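The effect of a single-thread pool can be shown with a small Python analogy. This is not Quartz code: `peak_concurrency` is a made-up helper that submits a few sleeping jobs to a `ThreadPoolExecutor` and reports how many were ever running at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(thread_count, jobs=3, duration=0.2):
    """Run `jobs` sleeping jobs on a pool and return the peak number running at once."""
    running = 0
    peak = 0
    lock = threading.Lock()

    def job():
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(duration)  # stands in for the long-running service call
        with lock:
            running -= 1

    # Leaving the with-block waits for all submitted jobs to finish.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        for _ in range(jobs):
            pool.submit(job)
    return peak


print(peak_concurrency(1))  # 1: with one thread, jobs can only run one after another
print(peak_concurrency(3))  # more than 1 once spare threads are available
```

With one worker thread, a job that is allowed to overlap still has to wait for a free thread, which matches the behaviour described above.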

Resources