MLflow UI can't show artifacts - Docker

I have MLflow running on an Azure VM, connected to Azure Blob Storage as the artifact store.
After uploading artifacts to the storage from the client, I opened the MLflow UI and it showed the uploaded file successfully.
The problem happens when I try to run MLflow with Docker; I get the error:
Unable to list artifacts stored under {artifactUri} for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory
Dockerfile:
FROM python:3.7-slim-buster
# Install python packages
RUN pip install mlflow pymysql
RUN pip install azure-storage-blob
ENV AZURE_STORAGE_ACCESS_KEY="#########"
ENV AZURE_STORAGE_CONNECTION_STRING="#######"
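(Editor's aside, not in the original post: a quick sanity check is to confirm, from inside this image, that the credentials can list the artifact container at all, using the azure-storage-blob client the image already installs:)
$ docker run --rm -it --entrypoint python mlflow_server
>>> import os
>>> from azure.storage.blob import BlobServiceClient
>>> svc = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
>>> [c.name for c in svc.list_containers()]
(If this hangs or raises an authentication error, the UI's listing failure has the same root cause.)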
docker-compose.yml
web:
  restart: always
  build: ./mlflow_server
  image: mlflow_server
  container_name: mlflow_server
  expose:
    - "5000"
  networks:
    - frontend
    - backend
  environment:
    - AZURE_STORAGE_ACCESS_KEY="#####"
    - AZURE_STORAGE_CONNECTION_STRING="#####"
  command: mlflow server --backend-store-uri mysql+pymysql://mlflow_user:123456@db:3306/mlflow --default-artifact-root wasbs://etc..
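(Editor's aside, not part of the original post: with the list form of environment, docker-compose takes everything after the = as the literal value, so the quotes in "#####" become part of the variable. If the real file is written this way, the container sees a key wrapped in quote characters, which Azure will reject. Dropping the quotes avoids this:)
environment:
  - AZURE_STORAGE_ACCESS_KEY=#####
  - AZURE_STORAGE_CONNECTION_STRING=#####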
I tried multiple solutions:
Making sure that boto3 is installed (didn't change anything)
Adding the environment variables in the Dockerfile so the command runs only after they're set
Double-checking the URL of the storage blob
MLflow doesn't show any logs; it just kills the process and restarts again.
Does anyone have any idea what the solution might be, or how I can access the logs?
Here are the Docker logs of the container:
[2022-07-28 12:23:33 +0000] [10] [INFO] Starting gunicorn 20.1.0
[2022-07-28 12:23:33 +0000] [10] [INFO] Listening at: http://0.0.0.0:5000 (10)
[2022-07-28 12:23:33 +0000] [10] [INFO] Using worker: sync
[2022-07-28 12:23:33 +0000] [13] [INFO] Booting worker with pid: 13
[2022-07-28 12:23:33 +0000] [14] [INFO] Booting worker with pid: 14
[2022-07-28 12:23:33 +0000] [15] [INFO] Booting worker with pid: 15
[2022-07-28 12:23:33 +0000] [16] [INFO] Booting worker with pid: 16
[2022-07-28 12:24:24 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:14)
[2022-07-28 12:24:24 +0000] [14] [INFO] Worker exiting (pid: 14)
[2022-07-28 12:24:24 +0000] [21] [INFO] Booting worker with pid: 21
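(Editor's note, not part of the original question: the WORKER TIMEOUT followed by a silent reboot is gunicorn killing a worker that blocked too long, most likely on a call to Blob Storage. mlflow server has a --gunicorn-opts flag that passes options straight through to gunicorn; here is a sketch that raises the timeout and turns on debug logging so the real error surfaces in docker logs. The --host, timeout, and log-level values are illustrative, not a recommendation:)
command: >
  mlflow server
  --host 0.0.0.0
  --backend-store-uri mysql+pymysql://mlflow_user:123456@db:3306/mlflow
  --default-artifact-root wasbs://etc..
  --gunicorn-opts "--timeout 300 --log-level debug"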

Related

Cannot run Airflow on Docker using MacOS - Even with more memory allocated

I am trying to run Airflow on Docker (on macOS 12.6.2, with 8 GB of memory). I have downloaded the docker-compose.yaml file here (apache/airflow:2.4.2), and have set my .env file to this:
AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
AIRFLOW_UID=50000
When I run docker-compose up -d and wait, the webserver container never becomes healthy.
As suggested by numerous people running Docker on macOS, I have upped my Docker memory.
I have tried numerous combinations of Docker memory (all 8 GB, then 7 GB, 6 GB, 5 GB, and 4 GB) as well as different combinations of CPU, swap, and virtual disk limit (I have not tried going higher than 160 GB for the virtual disk limit). I have also seen that it is a bad idea to use all 4 CPUs, so I have not tried that.
Here is the log I get for the webserver container:
[2023-01-12 11:26:30 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:215)
[2023-01-12 11:26:31 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:216)
[2023-01-12 11:26:32 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:217)
[2023-01-12 11:26:33 +0000] [79] [WARNING] Worker with pid 215 was terminated due to signal 9
[2023-01-12 11:26:34 +0000] [262] [INFO] Booting worker with pid: 262
[2023-01-12 11:26:36 +0000] [79] [WARNING] Worker with pid 217 was terminated due to signal 9
[2023-01-12 11:26:36 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:219)
[2023-01-12 11:26:36 +0000] [263] [INFO] Booting worker with pid: 263
[2023-01-12 11:26:37 +0000] [79] [WARNING] Worker with pid 216 was terminated due to signal 9
[2023-01-12 11:26:38 +0000] [265] [INFO] Booting worker with pid: 265
[2023-01-12 11:26:39 +0000] [79] [WARNING] Worker with pid 219 was terminated due to signal 9
[2023-01-12 11:26:40 +0000] [266] [INFO] Booting worker with pid: 266
[2023-01-12 11:28:33 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:262)
[2023-01-12 11:28:36 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:263)
[2023-01-12 11:28:38 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:265)
[2023-01-12 11:28:39 +0000] [79] [CRITICAL] WORKER TIMEOUT (pid:266)
...And the "worker timeout-booting worker-worker timeout" cycle continues forever.
Now, if I comment out (remove) the redis, airflow-worker, and airflow-triggerer parts of the compose file, as suggested by this article under its "Airflow Installation -- Lite Version" section, everything runs fine and everything is healthy. But I know that I'm going to need those containers in the future.
If I've maxed out my macOS resources, what do you suggest I do?
(NOTE: This question on Stack Overflow mentions increasing Docker memory when running on desktop as the solution. However, as described above, I have already tried that and it did not work.)
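(Editor's sketch, not from the question or any answer: the gunicorn settings behind those WORKER TIMEOUT lines are exposed as Airflow configuration, so they can be set as environment variables in the shared x-airflow-common environment block of the official compose file. Raising the timeout and lowering the worker count reduces startup memory pressure; the values below are illustrative:)
# added under the environment: mapping of x-airflow-common
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT: '300'  # gunicorn worker timeout, default 120
AIRFLOW__WEBSERVER__WORKERS: '2'                      # default 4
(If the webserver then turns healthy, the root cause is resource starvation rather than a configuration error.)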

Changing port of Gunicorn Server with FastAPI in Docker context

I use gunicorn to start my webserver in Docker like this:
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "--reload", "-k", "uvicorn.workers.UvicornWorker", "app.main:app"]
I have a CLI tool that should be able to override the port set in the Dockerfile:
@runtime.command()
def run(branch: Optional[str] = "master", port: Optional[int] = 8000):
    subprocess.call(f"docker run -it -p 7000:7000 test", shell=True)
When I run this command, the service is still running on port 8000, not on port 7000.
[2021-11-25 15:40:20 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-11-25 15:40:20 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2021-11-25 15:40:20 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2021-11-25 15:40:20 +0000] [11] [INFO] Booting worker with pid: 11
[2021-11-25 15:40:21 +0000] [11] [INFO] Started server process [11]
[2021-11-25 15:40:21 +0000] [11] [INFO] Waiting for application startup.
Is there a way to override the port which was set in my Dockerfile?
If you want to change the behavior of the Docker container without creating a new image (which could be achieved with a docker commit workflow), you must override the container entrypoint and start it as you wish:
$ docker run --rm -it --entrypoint=gunicorn your_image_name --bind 0.0.0.0:7000 --reload -k uvicorn.workers.UvicornWorker app.main:app
Substitute your_image_name with the name of your image.
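(Editor's aside, not part of the original answer: another common pattern is to parameterize the port with an environment variable. The shell form of CMD runs through /bin/sh, so it expands variables; a sketch:)
CMD gunicorn --bind 0.0.0.0:${PORT:-8000} --reload -k uvicorn.workers.UvicornWorker app.main:app
(Then docker run -e PORT=7000 -p 7000:7000 test serves on port 7000 with no entrypoint override. Note also that the CLI snippet in the question never uses its port argument; the 7000:7000 mapping is hard-coded in the f-string.)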

Running Gunicorn Flask app in Docker [CRITICAL] WORKER TIMEOUT when starting up

I want to run a Flask web services app with gunicorn in Docker. Upon startup, the app loads a large machine learning model.
However, when I run gunicorn within Docker I get the following timeouts, and it just keeps spawning workers.
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1198)
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1204)
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1210)
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1211)
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1222)
[2019-12-12 21:52:42 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1223)
[2019-12-12 21:52:42 +0000] [1264] [INFO] Booting worker with pid: 1264
[2019-12-12 21:52:42 +0000] [1265] [INFO] Booting worker with pid: 1265
[2019-12-12 21:52:42 +0000] [1276] [INFO] Booting worker with pid: 1276
[2019-12-12 21:52:42 +0000] [1277] [INFO] Booting worker with pid: 1277
[2019-12-12 21:52:42 +0000] [1278] [INFO] Booting worker with pid: 1278
[2019-12-12 21:52:42 +0000] [1289] [INFO] Booting worker with pid: 1289
Running it as a Flask app within Docker, or running the Flask app with (or without) gunicorn from the command line, works fine. It also works with gunicorn if I remove the machine learning model.
For example:
$ python app.py
$ gunicorn -b 0.0.0.0:8080 --workers=2 --threads=4 app:app
$ gunicorn app:app
Here is my Dockerfile using the Flask development server; it works fine.
ADD . /app
WORKDIR /app
RUN pip install -r requirements.txt
CMD python app.py
If I run gunicorn as follows it just keeps spawning workers:
CMD gunicorn -b 0.0.0.0:8080 --workers=2 --threads=4 app:app
or
CMD ["gunicorn", "app:app"]
gunicorn has a --timeout parameter, which defaults to 30 seconds; I increased it to 300. This did not appear to have an effect.
Note: I rewrote the app for the Starlette library and received the same results!
Any guidance is appreciated.
Thanks,
Jay
I needed to add the gunicorn --timeout as follows:
CMD gunicorn --timeout 1000 --workers 1 --threads 4 --log-level debug --bind 0.0.0.0:8000 app:app
I also ran into problems deploying on Google Cloud Platform. The log only showed a kill message. Increasing the memory of the compute instance solved that problem.
Try this:
CMD ["gunicorn", "--timeout", "1000", "--workers=1", "-b", "0.0.0.0:8000", "--log-level", "debug", "manage"]
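(Editor's sketch, building on both answers rather than quoting them: gunicorn's --preload flag loads the app once in the master process before forking, so no worker burns its timeout budget importing the large model. Assuming the same app:app module and port as the question:)
CMD ["gunicorn", "--preload", "--timeout", "300", "--workers=2", "--threads=4", "-b", "0.0.0.0:8080", "app:app"]
(One caveat: with --preload, code changes require a full restart, since the app is imported before the workers fork.)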

Gunicorn continually booting workers when run in a Docker image on Kubernetes

I've dockerized a Flask app, using gunicorn to serve it. The last line of my Dockerfile is:
CMD source activate my_env && gunicorn --timeout 333 --bind 0.0.0.0:5000 app:app
When running the app locally – either straight in my console, without Docker, or with
docker run -dit \
  --name my-app \
  --publish 5000:5000 \
  my-app:latest
it boots up fine. I get a log like:
[2018-12-04 19:32:30 +0000] [8] [INFO] Starting gunicorn 19.7.1
[2018-12-04 19:32:30 +0000] [8] [INFO] Listening at: http://0.0.0.0:5000 (8)
[2018-12-04 19:32:30 +0000] [8] [INFO] Using worker: sync
[2018-12-04 19:32:30 +0000] [16] [INFO] Booting worker with pid: 16
<my app's output>
When running the same image in k8s, I get:
[2018-12-10 21:09:42 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2018-12-10 21:09:42 +0000] [5] [INFO] Listening at: http://0.0.0.0:5000 (5)
[2018-12-10 21:09:42 +0000] [5] [INFO] Using worker: sync
[2018-12-10 21:09:42 +0000] [13] [INFO] Booting worker with pid: 13
[2018-12-10 21:10:52 +0000] [16] [INFO] Booting worker with pid: 16
[2018-12-10 21:10:53 +0000] [19] [INFO] Booting worker with pid: 19
[2018-12-10 21:14:40 +0000] [22] [INFO] Booting worker with pid: 22
[2018-12-10 21:16:14 +0000] [25] [INFO] Booting worker with pid: 25
[2018-12-10 21:16:25 +0000] [28] [INFO] Booting worker with pid: 28
<etc>
My k8s deployment YAML looks like:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      imagePullSecrets:
        - name: regcred
      containers:
        - name: my-frontend
          image: my-registry/my-frontend:latest
          ports:
            - containerPort: 80
        - name: my-backend
          image: my-registry/my-backend:latest
          ports:
            - containerPort: 5000
Here, the container in question is my-backend. Any ideas why this is happening?
Update: As I wrote this, the events list that is printed with kubectl describe pods was updated with the following:
Warning FailedMount 9m55s kubelet, minikube MountVolume.SetUp failed for volume "default-token-k2shm" : Get https://localhost:8443/api/v1/namespaces/default/secrets/default-token-k2shm: net/http: TLS handshake timeout
Warning FailedMount 9m53s (x2 over 9m54s) kubelet, minikube MountVolume.SetUp failed for volume "default-token-k2shm" : secrets "default-token-k2shm" is forbidden: User "system:node:minikube" cannot get secrets in the namespace "default": no path found to object
Normal SuccessfulMountVolume 9m50s kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-k2shm"
Not sure if it's relevant to my issue.
I solved this by adding resources under the container; mine needed more memory.
resources:
  requests:
    memory: "512Mi"
    cpu: 0.1
  limits:
    memory: "1024Mi"
    cpu: 1.0
Hope that helps.
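(Editor's aside, not part of the original answer: workers being silently replaced with no gunicorn error is consistent with the kernel OOM killer terminating them. Two hedged ways to confirm on a minikube setup like this one:)
$ kubectl describe pod <pod-name>   # if the whole container died, look for Last State: Terminated, Reason: OOMKilled
$ minikube ssh -- dmesg | grep -i -e oom -e 'killed process'   # if only workers inside the cgroup were killed
(Either finding points to the memory limits fix shown above.)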

Why won't the gunicorn worker boot when running in Docker?

UPDATED SHORT VERSION: If I have a Docker image that fails to run (but leaves very little detail as to why), is there a way to connect to the container running the image to debug it?
(Many thanks to those who pointed out the issue with multiple CMD entries in the Dockerfile. I'm still trying to wrap my head around that)
SHORT VERSION: I can't bring up my web service with docker run because of "Worker failed to boot." How can I find out what is failing?
LONG VERSION:
I can run my Django project with gunicorn in the foreground:
(trans3) chris@chi:~/website$ gunicorn --bind 0.0.0.0:8000 hero.wsgi
[2018-07-02 15:18:36 +0000] [21541] [INFO] Starting gunicorn 19.7.1
[2018-07-02 15:18:36 +0000] [21541] [INFO] Listening at: http://0.0.0.0:8000 (21541)
[2018-07-02 15:18:36 +0000] [21541] [INFO] Using worker: sync
[2018-07-02 15:18:36 +0000] [21546] [INFO] Booting worker with pid: 21546
^C
[2018-07-02 15:18:39 +0000] [21541] [INFO] Handling signal: int
[2018-07-02 15:18:39 +0000] [21541] [INFO] Shutting down: Master
I have a small Dockerfile.web for my service:
(trans3) chris@chi:~/website$ cat Dockerfile.web
# start with our Django app image
FROM dockersite:latest
# collectstatic.
CMD python manage.py collectstatic --noinput
# run the server
CMD gunicorn --bind 0.0.0.0:$PORT hero.wsgi
I build my image
(trans3) chris@chi:~/website$ docker build -t dockersite/web -f Dockerfile.web .
Sending build context to Docker daemon 137.5MB
Step 1/3 : FROM dockersite:latest
---> 56b1488f8e27
Step 2/3 : CMD python manage.py collectstatic --noinput
---> Using cache
---> 59585027568d
Step 3/3 : CMD gunicorn --bind 0.0.0.0:$PORT hero.wsgi
---> Using cache
---> c17429800329
Successfully built c17429800329
Successfully tagged dockersite/web:latest
I try to run my image:
(trans3) chris@chi:~/website$ docker run -e PORT=8000 -p 8000:8000 --env-file=.env dockersite/web:latest
[2018-07-02 19:23:26 +0000] [8] [INFO] Starting gunicorn 19.7.1
[2018-07-02 19:23:26 +0000] [8] [INFO] Listening at: http://0.0.0.0:8000 (8)
[2018-07-02 19:23:26 +0000] [8] [INFO] Using worker: sync
[2018-07-02 19:23:26 +0000] [12] [INFO] Booting worker with pid: 12
MEMCACHEDCLOUD_SERVERS not found, using LocMem cache
[2018-07-02 19:23:27 +0000] [8] [INFO] Shutting down: Master
[2018-07-02 19:23:27 +0000] [8] [INFO] Reason: Worker failed to boot.
How can I get more information out of Gunicorn to tell me why it's failing? Is there a logger in Python that I can adjust? (I tried --spew and got lots of information, none of it useful.)
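(Editor's sketch, offered alongside the answers below: a direct way to see the hidden traceback is to start a shell in the image and import the WSGI module by hand. This assumes the base image has no ENTRYPOINT of its own; if it does, add --entrypoint /bin/sh as well:)
$ docker run --rm -it -e PORT=8000 --env-file=.env dockersite/web:latest /bin/sh
# inside the container, reproduce the boot failure with a full traceback:
# python -c "import hero.wsgi"
(Whatever import-time error is killing the worker prints straight to the terminal.)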
Sometimes Docker, GitHub, and gunicorn fail together, so try switching branches and switching back to see if it makes a difference.
You may have a Python error whose traceback disappears: gunicorn can refuse to show you your Python errors. Add --preload to the gunicorn launch in your Dockerfile so the app is imported in the master process and the error is printed at startup:
CMD gunicorn --preload --bind 0.0.0.0:$PORT hero.wsgi
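(Editor's note on the two CMD lines flagged in the update above: a Dockerfile only honors the last CMD, so the collectstatic line never runs. The usual fix, sketched here assuming collectstatic needs nothing from the runtime environment, is to make it a build-time RUN and keep a single CMD:)
# start with our Django app image
FROM dockersite:latest
# collectstatic at build time; RUN executes during docker build, CMD does not
RUN python manage.py collectstatic --noinput
# run the server with the image's single CMD
CMD gunicorn --bind 0.0.0.0:$PORT hero.wsgi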
