Memory leak issue: ray + docker + fastapi

When I initialize Ray with ray.init() in a Docker container, the memory usage increases over time (the memory usage reported by docker stats keeps growing) and the container dies once it exceeds the memory limit (ray.init() alone is enough to cause this).
Also, many duplicated Ray-related processes spawn on ray.init() (RAY:IDLE, the Ray dashboard, and other Ray processes).
I reproduced this issue with the official Ray release: https://pypi.org/project/ray/#history
P.S.: Our use case is a combination of a Docker container, a FastAPI scheduler and Ray (i.e. we initialize the Ray instance once, and do ray.put / ray.get every pre-defined cycle).
Let me share my test design pattern to reproduce this issue.
import numpy as np
import ray
from fastapi import FastAPI
# repeat_every presumably comes from the fastapi-utils package
from fastapi_utils.tasks import repeat_every

ray.init(num_cpus=4, dashboard_host='0.0.0.0', dashboard_port=8888, configure_logging=False)
app = FastAPI()

@app.on_event("startup")
@repeat_every(seconds=1, raise_exceptions=True)
@app.get("/test")
def test():
    dd = []
    bb = ray.put(dd)
    fut = []
    for i in range(10):
        fut.append(aa.remote(bb))
    ss = ray.get(fut)

@ray.remote
def aa(ss):
    a = np.random.rand(380, 640)
    ss.append(a)
    return ss

This doesn't help solve your issue, but it could be related: https://github.com/tiangolo/fastapi/issues/1624. There appears to be a memory leak issue with FastAPI and Docker.

Related

DisabledBackend: Erratic Behavior with Celery, Redis & Flask

I've been using Celery for a while now. In production I use RabbitMQ as the broker and Redis for the backend in a K8s cluster with no problems so far. Locally, I run a docker compose with a few services (Flask API, 2 different workers, Beat, Redis, Flower, Hasura), using Redis as both the broker and the backend.
I hadn't experienced problems with this setup for the past months, but yesterday I started getting erratic behavior while accessing task results.
Tasks are sent to the queue and the worker recognizes and performs them, but when querying for the task state I sometimes get DisabledBackend. Normally on the first request, and then it works. I couldn't find a pattern for when it works and when it doesn't; it's erratic.
I've read somewhere that Celery doesn't work very well with Flask's built-in server, so I switched to uWSGI with pretty much the same setup I have in production:
[uwsgi]
wsgi-file = app/uwsgi.py
callable = application
http = :8080
processes = 4
threads = 2
master = true
chmod-socket = 660
vacuum = true
die-on-term = true
buffer-size = 32768
enable-threads = true
req-logger = python:uwsgi
I've seen a similar question for Django in which the problem seemed to be with mod_wsgi on Apache, which is not my case, but the behavior seems similar. Every other question I've seen was related to misconfiguration of the backend, which is not my case either.
Any ideas on what might be causing this?
Thanks.
So it seems that I need to access AsyncResult only via my Celery app instance, instead of through Celery, or pass the Celery app instance as an argument.
So, this doesn't work:
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id)
    return task.state
This works:
from app import my_celery  # your own Celery application instance

@app.route('/status/<task_id>')
def get_status(task_id):
    task = my_celery.AsyncResult(task_id)
    return task.state
This also works:
from app import my_celery
from celery.result import AsyncResult

@app.route('/status/<task_id>')
def get_status(task_id):
    task = AsyncResult(task_id, app=my_celery)
    return task.state
I'm guessing that by calling AsyncResult directly from Celery, it doesn't pick up the Celery app's configuration, so it thinks there is no backend configured to query results from.
But that would only explain a complete failure of the function, not the erratic behavior. I'm guessing it's because of different threads, and situations in which the app instance happens to be imported so Celery finds it, but I'm not too sure.
I've run a couple of tests and it seems to be working fine again after changing the imported AsyncResult, but I'll keep digging.
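For reference, here is a minimal sketch (the broker/backend URLs are illustrative, not from the post) of what the app-bound instance imported above as my_celery might look like; the result backend is configured on this object, which is why AsyncResult needs a reference to it:
from celery import Celery

# Illustrative configuration only; the actual URLs depend on your docker compose setup.
my_celery = Celery(
    'app',
    broker='redis://redis:6379/0',
    backend='redis://redis:6379/0',
)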

How can I keep a PBSCluster running?

I have access to a cluster running PBS Pro and would like to keep a PBSCluster instance running on the headnode. My current (obviously broken) script is:
import dask_jobqueue
from paths import get_temp_dir

def main():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    cluster = dask_jobqueue.PBSCluster(cores=24, memory='100GB', processes=1,
                                       scheduler_options=scheduler_options)

if __name__ == '__main__':
    main()
This script is obviously broken because after the cluster is created the main() function exits and the cluster is destroyed.
I imagine I must call some sort of execute_io_loop function, but I can't find anything in the API.
So, how can I keep my PBSCluster alive?
I'm thinking that the Python API (advanced) section of the docs might be a good way to try to solve this issue.
Mind you, this is an example of how to create Schedulers and Workers, but I'm assuming the logic could be used in a similar way for your case.
import asyncio
import dask_jobqueue
from paths import get_temp_dir

async def create_cluster():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    cluster = dask_jobqueue.PBSCluster(cores=24, memory='100GB', processes=1,
                                       scheduler_options=scheduler_options)

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(create_cluster())
You might have to change the code a bit, but it should keep create_cluster running until it finishes.
Let me know if this works for you.
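As written, the coroutine above still returns as soon as the cluster object is created, so the process exits just like the original script. One way to actually keep it alive, as a sketch rather than a definitive answer, is to block the event loop after creating the cluster, for example by waiting on an asyncio.Event that is never set:
import asyncio
import dask_jobqueue
from paths import get_temp_dir

async def create_cluster():
    temp_dir = get_temp_dir()
    scheduler_options = {'scheduler_file': temp_dir / 'scheduler.json'}
    cluster = dask_jobqueue.PBSCluster(cores=24, memory='100GB', processes=1,
                                       scheduler_options=scheduler_options)
    # Block forever so the cluster is not torn down; Ctrl+C (or killing the headnode job) stops it.
    await asyncio.Event().wait()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(create_cluster())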

Dask with TLS connection cannot end the program with the to_parquet method

I am using dask to process 10 files, each of which is about 142 MB. I built a method with the delayed tag; the following is an example:
import os
import dask
import pandas as pd

@dask.delayed
def process_one_file(input_file_path, save_path):
    res = []
    for line in open(input_file_path):
        res.append(line)
    df = pd.DataFrame(res)  # the post has pd.DataFrame(line); res is presumably what was meant
    df.to_parquet(save_path + os.path.basename(input_file_path))

if __name__ == '__main__':
    client = ClusterClient()  # defined elsewhere; presumably returns a dask.distributed Client
    input_dir = ""
    save_dir = ""
    print("start to process")
    csvs = [process_one_file(input_dir + filename, save_dir) for filename in os.listdir(input_dir)]
    dask.compute(csvs)
However, dask does not always run successfully. After processing all the files, the program often hangs.
I run the program from the command line, and it often hangs after printing "start to process". I know the program is running correctly, since I can see all the output files after a while.
But the program never stops. If I disable TLS, the program runs successfully.
It is strange that dask cannot end the program when I enable the TLS connection. How can I solve it?
I found that if I add the to_parquet call, the program cannot stop, while if I remove it, it runs successfully.
I have found the problem. I set 10 GB for each process, i.e. memory-limit=10GB. I have 2 workers in total and each has 2 processes; each process has 2 threads.
Thus, each machine has 4 processes which claim 40 GB in total. However, my machine only has 32 GB. If I lower the memory limit, the program runs successfully!
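A quick back-of-the-envelope check of the numbers above (purely illustrative, the values are taken from the post):
# Memory requested by the original configuration
workers = 2
processes_per_worker = 2
memory_limit_gb = 10                                      # per-process limit
print(workers * processes_per_worker * memory_limit_gb)   # 40 GB requested on a 32 GB machine
# Dropping the per-process limit to, say, 7 GB keeps 2 * 2 * 7 = 28 GB under the 32 GB available.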

Inconsistency between the docker stats command and the Docker REST API memory stats

When looking at a running container with the docker stats command, I can see that the memory usage of the container is 202.3 MiB.
However, when looking at the same container through the REST API with
GET /containers/container_name/stats -> memory_stats -> usage, the usage there shows 242.10 MiB.
There is a big difference between those values.
What might be the reason for the difference? I know that the docker client uses the REST API to get its stats, but what am I missing here?
Solved my problem. Initially, I did not take into account cache memory when calculating memory usage.
Say stats is the JSON returned from GET /containers/container_name/stats; the correct formula is:
memory_usage = stats["memory_stats"]["usage"] - stats["memory_stats"]["stats"]["cache"]
limit = stats["memory_stats"]["limit"]
memory_utilization = memory_usage/limit * 100
Use the rss value, i.e. rss = usage - cache:
"memory_stats": {
    "stats": {
        "cache": 477356032,
        "rss": 345579520,
    },
    "usage": 822935552
}
On Linux, the Docker CLI reports memory usage by subtracting page cache usage from the total memory usage.
The API does not perform such a calculation but rather provides the total memory usage and the amount from the page cache so that clients can use the data as needed. (https://docs.docker.com/engine/reference/commandline/stats/)
The accepted answer is incorrect for recent Docker versions (version greater than 19.03).
The correct way that gets the same number docker stats reports is:
memory = stats['memory_stats']['usage'] - stats["memory_stats"]["stats"]["total_inactive_file"]
memory_limit = stats['memory_stats']['limit']
memory_perc = (memory / memory_limit) * 100
JavaScript code, following the Docker CLI source code:
const memStats = stats.memory_stats
const memoryUsage = memStats.stats.total_inactive_file && memStats.stats.total_inactive_file < memStats.usage
    ? memStats.usage - memStats.stats.total_inactive_file
    : memStats.usage - memStats.stats.inactive_file
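For completeness, here is a minimal Python sketch (not from the answers above) that pulls the same numbers through the Docker SDK for Python, which talks to the same REST API, and applies roughly the CLI's logic; cgroup v1 exposes total_inactive_file while cgroup v2 exposes inactive_file:
import docker

client = docker.from_env()
stats = client.containers.get("container_name").stats(stream=False)  # one-shot stats snapshot

mem = stats["memory_stats"]
inactive = mem["stats"].get("total_inactive_file", mem["stats"].get("inactive_file", 0))
# Subtract the inactive page cache, as docker stats does
memory_usage = mem["usage"] - inactive if inactive < mem["usage"] else mem["usage"]
memory_percent = memory_usage / mem["limit"] * 100
print(f"{memory_usage / 1024**2:.1f} MiB ({memory_percent:.2f}%)")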

Image processing in TensorFlow distributed session

I am testing out Tensorflow Distributed (https://www.tensorflow.org/deploy/distributed) with my local machine (Windows) and an Ubuntu VM.
I have followed this link Distributed tensorflow replicated training example: grpc_tensorflow_server - No such file or directory and set up the TensorFlow server as per below.
import tensorflow as tf
parameter_servers = ["10.0.3.15:2222"]
workers = ["10.0.3.15:2222","10.0.3.15:2223"]
cluster = tf.train.ClusterSpec({"local": parameter_servers, "worker": workers})
server = tf.train.Server(cluster, job_name="local", task_index=0)
server.join()
where "10.0.3.15" is my Ubuntu machine's local IP address.
On the Windows host machine I am doing some simple image preprocessing using OpenCV and extending the graph session to the VM. I have used the following code for that.
import tensorflow as tf
from OpenCVTest import *

with tf.Session("grpc://10.0.3.15:2222") as sess:
    ### OpenCV calling section ###
    img = cv2.imread('Data/ball.jpg')
    grey_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    flat_img_array = img.flatten()
    x = tf.placeholder(tf.float32, shape=(flat_img_array[0], flat_img_array[1]))
    y = tf.multiply(x, x)
    sess.run(y)
I can see that my session is running on my Ubuntu machine. Please see the screenshot below.
[Screenshot: Test_result]
[Note: in the image you would notice that in the Windows console I am calling the session and the Ubuntu terminal is listening to that same session.]
But the strange thing I have observed is that for the OpenCV preprocessing operation (grey_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)) it is using the local OpenCV package. I was under the assumption that when I run a session on another server, it would do all the operations on that server. In my case, since I am running the session on the Ubuntu VM, everything defined under tf.Session("grpc://10.0.3.15:2222") should also run on that Ubuntu VM, using the VM's local packages, but that is not happening.
Is my understanding of distributed sess.run(y) correct? When we run the session in a distributed manner, does it only offload the graph computation to the other machine through gRPC?
I would summarize my question like this: "I am planning to do heavy preprocessing before feeding values to the tensors, and I want to do it in a distributed way. What would be the better approach to follow? My initial understanding was that I could do it with distributed TensorFlow, but with this test I think I may not be able to."
Any thoughts would be of real help.
Thank you.
