I've been using Celery for a while now. In production I use RabbitMQ as the broker and Redis for the backend in a K8s cluster, with no problems so far. Locally, I run a docker compose setup with a few services (Flask API, 2 different workers, Beat, Redis, Flower, Hasura), using Redis as both the broker and the backend.
I hadn't experienced problems with this setup for months, but yesterday I started getting erratic behavior while accessing task results.
Tasks are sent to the queue, the worker picks them up and performs them, but when querying for the task state I sometimes get DisabledBackend. It usually happens on the first request and then it works. I couldn't find a pattern of when it works and when it doesn't; it's erratic.
I've read somewhere that Celery didn't work very well with Flask's built-in server, so I switched to uWSGI with pretty much the same setup I have in production:
[uwsgi]
wsgi-file = app/uwsgi.py
callable = application
http = :8080
processes = 4
threads = 2
master = true
chmod-socket = 660
vacuum = true
die-on-term = true
buffer-size = 32768
enable-threads = true
req-logger = python:uwsgi
I've seen a similar question for Django in which the problem seemed to be mod_wsgi with Apache, which is not my case, but the behavior seems similar. Every other question I've seen was related to a misconfigured backend, which is also not my case.
Any ideas on what might be causing this?
Thanks.
So it seems that I need to access AsyncResult only via my Celery app instance, instead of importing it directly from celery.result, or else pass the Celery app instance as an argument.
So, this doesn't work:
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id)
    return task.state
This works:
from app import my_celery  # your own Celery application instance

@app.route('/status/<task_id>')
def get_status(task_id):
    task = my_celery.AsyncResult(task_id)
    return task.state
This also works:
from app import my_celery
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id, app=my_celery)
    return task.state
I'm guessing what happens is that by calling AsyncResult directly from celery.result, it doesn't pick up my Celery app's configuration, so it thinks there's no backend configured to query results from.
But that would only explain complete failure of the function, not the erratic behavior. I'm guessing this happens because of different threads, and situations in which the app instance has already been imported, so Celery finds it; I'm not too sure, though.
I've run a couple of tests and it seems to be working fine again after changing the imported AsyncResult, but I'll keep digging.
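For reference, here's a minimal sketch of the kind of app module the working examples import my_celery from (module layout and URLs are assumptions based on my local compose setup); the key point is that the result backend is configured on this instance, so my_celery.AsyncResult knows where to look:
# app.py (sketch; Redis serves as both broker and backend locally)
from celery import Celery

my_celery = Celery(
    'app',
    broker='redis://redis:6379/0',
    backend='redis://redis:6379/0',
)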
I'm having trouble changing the temporary directory in Dask. When I change temporary-directory in dask.yaml, for some reason Dask still writes to /tmp (which is full). I now want to try to debug this, but when I use client.get_worker_logs() I only get INFO output.
I start my cluster with
from dask.distributed import LocalCluster, Client
cluster = LocalCluster(n_workers=1, threads_per_worker=4, memory_limit='10gb')
client = Client(cluster)
I already tried adding distributed.worker: debug to the distributed.yaml, but this doesn't change the output. I also checked that I am actually changing the configuration by calling dask.config.get('distributed.logging').
What am I doing wrong?
By default LocalCluster silences most logging. Try passing the silence_logs=False keyword
cluster = LocalCluster(..., silence_logs=False)
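For completeness, a minimal sketch with the cluster arguments from the question (how you then inspect the logs is an assumption):
from dask.distributed import LocalCluster, Client

# silence_logs=False stops LocalCluster from suppressing worker/scheduler log output
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=4,
    memory_limit='10gb',
    silence_logs=False,
)
client = Client(cluster)

# get_worker_logs() should now show more than INFO output,
# provided the configured logger level allows it
print(client.get_worker_logs())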
I am trying to use "completed_count()" to track how many tasks are left in a group in Celery.
My "client" runs this:
from celery import group
from proj import do

wordList = []
with open('word.txt') as wordData:
    for line in wordData:
        wordList.append(line)

readAll = group(do.s(i) for i in wordList)
result = readAll.apply_async()

while not result.ready():
    print(result.completed_count())

result.get()
The word.txt file is just a file with one word on each line.
Then I have the celery worker(s) set to run the do task as:
from time import sleep

@app.task(acks_late=True)
def do(word):
    sleep(1)
    return f"I'm doing {word}"
My broker is pyamqp and I use rpc for the backend.
I thought it would print an increasing count of tasks for each loop on the client side but all I get are "0"s.
The problem is not in the completed_count method. You are getting zeros because result.ready() stays False even after all the tasks have completed. This appears to be a bug in the rpc backend; there is an issue about it on GitHub. Consider changing the backend setting to amqp, which works correctly as far as I can see.
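A minimal sketch of what that change might look like (the module name proj comes from the question; the URLs are assumptions):
# proj.py (sketch)
from celery import Celery

app = Celery(
    'proj',
    broker='pyamqp://guest@localhost//',
    backend='amqp://guest@localhost//',  # was backend='rpc://'
)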
I have 3 Ubuntu machines (CPU only). My dask scheduler and client are both on the same machine, while the two dask workers run on the other two machines. When I launch the first task, it gets scheduled on the first worker, but when I launch a second task while the first one is still executing, it does not get scheduled on the second worker. Here is the sample client code that I tried.
### client.py
from dask.distributed import Client
import time, sys, os, random

def my_task(arg):
    print("doing something in my_task")
    time.sleep(2)
    print("inside my task..", arg)
    print("again doing something in my_task")
    time.sleep(2)
    print("return some random value")
    value = random.randint(1, 100)
    print("value::", value)
    return value

client = Client("172.25.49.226:8786")
print("client::", client)
future = client.submit(my_task, "hi")
print("future result::", future.result())
print("closing the client..")
client.close()
I am running "python client.py" twice, almost at the same time, from two different terminals/machines. Both clients seem to execute, but they produce exactly the same output, which they should not, because my_task() returns a random value. I tested this on Ubuntu machines.
However, a month back I was able to run the same tasks in parallel on CentOS machines. If I go back now and run the same two tasks from those CentOS machines, the problem persists: they do not run in parallel. This is strange, and I'm not able to figure out this behavior in Dask. Am I missing some OS-level setting or something else?
Run the below almost at the same time,
python client.py # from one machine/terminal
python client.py # from another machine/terminal
These two tasks should run in parallel, each on a different worker (we have two free workers available), but this is not happening. I can't see any log on the second worker's console or on the scheduler while the first task continues to execute. In the end I noticed that both tasks finish at exactly the same time with exactly the same output.
However, the above client code runs in parallel just fine on Windows, with each task launched from a separate terminal, but I would like to run it on Ubuntu machines.
By default, if you call the same function on the same inputs, Dask will assume that this produces the same value and will only compute it once. You can override this behavior with the pure=False keyword:
future = client.submit(func, *args, pure=False)
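Applied to the client code from the question, that would look something like:
future = client.submit(my_task, "hi", pure=False)  # each submit is computed independently
print("future result::", future.result())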
I'm trying to process a large amount of data using Python while maintaining processing status in MySQL. However, I'm surprised there is no standard connection pool for python-mysql (like HikariCP in Java).
I initially started with PyMySQL, and things were great until the program had been running for a few hours. After that, things started to fail. I was getting a lot of errors like:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 99] Cannot assign requested address)")
Moreover, a lot of ports were stuck in the TIME_WAIT state, because I was opening and closing connections too frequently due to the lack of connection pooling:
/d/p/950 ❯❯❯ netstat -nt | wc -l
84752
Per this and this, I tried to set tcp_fin_timeout and ip_local_port_range, but hardly anything improved.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 15000 65000 > /proc/sys/net/ipv4/ip_local_port_range
Then I found out that MySQL provides mysql.connector, which comes with pooling functionality. After doing all that, performance actually deteriorated: more processes started to fail. I'm using Python's multiprocessing module to run 29 processes simultaneously (multiprocessing.Pool picked this number by default) on a 24-core machine. The code was as follows; of course, I was using .my.cnf to pass the credentials so as to avoid committing them to git:
import mysql.connector
from mysql.connector import pooling

conn_pool = pooling.MySQLConnectionPool(
    pool_name="mypool1",
    pool_size=pooling.CNX_POOL_MAXSIZE,
    option_files=MYSQL_CONFIG,
    option_groups=MYSQL_GROUP_NODE1,
    allow_local_infile=True,
)

conn = conn_pool.get_connection()
Finally, I reverted back to the old code. I'm still using PyMySQL, and though the errors are less frequent, it is still causing a major problem. I looked at SQLAlchemy and couldn't really find much documentation around its pooling.
I'm wondering how everyone else deals with connection pooling for MySQL in Python. I really believe there should be something out there so that I don't have to reinvent the wheel.
Any pointers are much appreciated.
DBUtils implements, for MySQL (and it generally claims to support arbitrary DB-API 2 compliant database interfaces), the user-sized connection pool PooledDB, the thread-mapped pool PersistentDB, and SteadyDB (see the functionality section). The latter should fit your case, where multiprocessing.Pool creates worker processes, each with its own managed persistent database connection. It is described as:
DBUtils.SteadyDB is a module implementing "hardened" connections to a database, based on ordinary connections made by any DB-API 2 database module. A "hardened" connection will transparently reopen upon access when it has been closed or the database connection has been lost or when it is used more often than an optional usage limit.
You can use it with PyMySQL like:
import pymysql
from DBUtils.SteadyDB import connect

db = connect(
    creator=pymysql,  # the remaining keyword arguments are passed to pymysql
    user='guest', password='', database='name',
    autocommit=True, charset='utf8mb4',
    cursorclass=pymysql.cursors.DictCursor)
Also see this related question for more examples.
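As a rough sketch of how this could fit together with multiprocessing.Pool (the table name, query, and .my.cnf usage here are made up for illustration):
import multiprocessing

import pymysql
from DBUtils.SteadyDB import connect

_conn = None  # one "hardened" connection per worker process

def init_worker():
    global _conn
    _conn = connect(
        creator=pymysql,                # remaining kwargs go to pymysql.connect
        read_default_file='~/.my.cnf',  # keep credentials out of the code
        autocommit=True,
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor,
    )

def process_item(item_id):
    # SteadyDB transparently reopens the connection if it has been lost
    cur = _conn.cursor()
    cur.execute("UPDATE processing_status SET state = 'done' WHERE id = %s", (item_id,))
    cur.close()
    return item_id

if __name__ == '__main__':
    with multiprocessing.Pool(initializer=init_worker) as pool:
        print(pool.map(process_item, range(10)))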
I've implemented Quartz.NET in a Windows service to run tasks. Everything works fine on my local workstation, but once it's deployed to a remote Windows Server host, it just hangs after initialization.
ISchedulerFactory schedFact = new StdSchedulerFactory();

// get a scheduler
var _scheduler = schedFact.GetScheduler();

// configuration of triggers and jobs
var trigger = (ICronTrigger)TriggerBuilder.Create()
    .WithIdentity("trigger1", "group1")
    .WithCronSchedule(job.Value)
    .Build();

var jobDetail = JobBuilder.Create(Type.GetType(job.Key))
    .StoreDurably(true)
    .WithIdentity("job1", "group1")
    .Build();

var ft = _scheduler.ScheduleJob(jobDetail, trigger);
Everything seems to be standard. I have a private static reference to the scheduler; logging stops right after the jobs are initialized and added to the scheduler. Nothing else happens after that.
I'd appreciate any advice.
Thanks.
PS:
Found some strange events in Event Viewer, possibly related to Quartz.NET:
Restart Manager - Starting session 2 - 2012-07-09T15:14:15.729569700Z.
Restart Manager - Ending session 2 started 2012-07-09T15:14:15.729569700Z.
Based on your question and the additional info you gave in the comments, I would guess there is something going wrong in the OnStart method of your service.
Here are some things you can do to help figure out and solve the problem:
Place the code in your OnStart method in a try/catch block, and try to install and start the service. Then check the Windows logs to see whether it was installed correctly, started correctly, etc.
The fact that Restart Manager is running leads me to believe that your service may depend on a process which is already in use. Make sure that any dependencies of your service are closed before installing it.
This problem can also be caused by putting data-intensive or long-running operations in your OnStart method. Make sure that you keep that kind of code out of OnStart.
I had a similar problem, and it was caused by having dots/periods in the assembly name, e.g. Project.Update.Service. When I changed it to ProjectUpdateService it worked fine.
Strangely it always worked on the development machine. Just never on the remote machine.
UPDATE: It may have been the length of the service name that caused this issue. By removing the dots I shortened the service name. It looks like the maximum length is 25 characters.