How to view resource allocation of Azure Pipeline agent? - docker

I am running integration tests in Azure Pipelines. I spin up two Docker containers. One container holds my test project and another container has the Postgres database.
When I run the docker compose on my local machine, the tests run successfully and take about 6 minutes.
When I run the same docker containers in the pipeline, the job doesn't finish. The job is canceled because of the 60 min limit.
The job running on agent Hosted Agent ran longer than the maximum time of 60 minutes
I do not see any helpful data in the logs.
What tools/logs can I use to diagnose this issue?
It might have to do with RAM or CPU allocation.
Is there a way to do docker stats to see how many resources are allocated to docker containers?
Also, I have multiple test projects and I'm testing them (in the pipeline) one at a time. There are projects that succeed with this setup, so the approach does work; however, when it fails as described, there isn't a way forward to troubleshoot.
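One way to at least capture this data in the pipeline is to sample docker stats in the background while the compose run is going and publish the log as an artifact. This is only a sketch; the step and artifact names are illustrative:
- script: |
    docker compose build --no-cache
    (while true; do docker stats --no-stream >> $(Build.ArtifactStagingDirectory)/docker-stats.log; sleep 15; done) &
    docker compose up --abort-on-container-exit
  displayName: 'Docker Compose Build & Up (with stats sampling)'
- publish: $(Build.ArtifactStagingDirectory)/docker-stats.log
  artifact: docker-stats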
The pipeline:
pool:
  vmImage: ubuntu-latest
stages:
- stage: Build
  displayName: Docker compose build & up
  jobs:
  - job: Build
    displayName: Build
    steps:
    - script: |
        docker compose build --no-cache
        docker compose up --abort-on-container-exit
      displayName: 'Docker Compose Build & Up'
The docker compose that pipeline calls:
version: "3.8"
services:
test_service:
container_name: test_service
image: test_service_image
build:
context: .
dockerfile: Dockerfile
environment:
ASPNETCORE_ENVIRONMENT: Staging
WAIT_HOSTS: integration_test_db_server:5432
volumes:
- ./TestResults:/var/temp
depends_on:
- integration_test_db_server
deploy:
resources:
limits:
memory: 4gb
integration_test_db_server:
image: postgres
container_name: db_server
restart: always
ports:
- "2345:5432"
environment:
POSTGRES_USER: test
POSTGRES_PASSWORD: test
POSTGRES_DB: db_server
Dockerfile referenced by test_service:
FROM mcr.microsoft.com/dotnet/sdk:6.0
WORKDIR /
COPY . ./
ADD https://github.com/ufoscout/docker-compose-wait/releases/download/2.9.0/wait /wait
#RUN chmod +x /wait
RUN /bin/bash -c 'ls -la /wait; chmod +x /wait; ls -la /wait'
CMD /wait && dotnet test ./src/MyPorject/MyProject.Tests.csproj --logger trx --results-directory /var/temp
UPDATE - Jan 3rd 2023:
I was able to reproduce this on my local machine. Because the MSFT agent is limited to 2 cores, I made that same restriction in the docker-compose file.
This caused a test to run for a very long time (over 8 minutes for one test). At that time, the CPU usage was < 3%.
(Screenshot: output of docker stats while the test was running.)
So restricting the number of CPU cores causes less CPU usage? I am confused as to what's happening here.

So there was an issue with "thread pool starvation". This didn't happen on my machine because I allocated all 4 CPU cores. However, once I limited the docker container to 2 cores, the problem appeared locally and I was able to figure out the underlying cause.
So, lesson learned: try to reproduce the issue locally and set container resources close to the MSFT agent specs, in this case a 2-core CPU and 7 GB of RAM.
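A minimal compose override for that local reproduction might look like this (a sketch only, assuming the test_service from the compose file above):
services:
  test_service:
    deploy:
      resources:
        limits:
          cpus: "2"
          memory: 7gb
This can be kept in a separate override file and merged in with docker compose -f docker-compose.yml -f <override>.yml up.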
Also, if your tests run for a long time and never finish, you can get more information by using the --blame-hang-timeout flag, which sets a time limit on a test.
dotnet test <your project> --blame-hang-timeout 2min
After that time limit, a "hangdump" file will be generated with diagnostic information. That's how I found out about the underlying issue.

Update on 1/4
Microsoft-hosted agents have limited performance when running pipelines, due to their fixed hardware configuration and network service.
Microsoft-hosted agents that run Windows and Linux images are provisioned on Azure general purpose virtual machines with a 2 core CPU, 7 GB of RAM, and 14 GB of SSD disk space. Agents that run macOS images are provisioned on Mac pros with a 3 core CPU, 14 GB of RAM, and 14 GB of SSD disk space.
If your pipeline has jobs with high performance requirements, it's suggested to run the pipeline on a self-hosted agent or a virtual machine scale set (VMSS) agent pool.
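For example, pointing the job at a self-hosted pool is a one-line change in the pipeline YAML (the pool name below is a placeholder):
pool:
  name: MySelfHostedPool   # hypothetical self-hosted agent pool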
================================================================
Original answer
I suppose that your issue could be related to the Build job timeout setting; you can check it in the job settings of the pipeline.
By the way, you could look into the doc about the parallel job time duration limit for more reference.
===============================================================
First update
I suppose that the duration of the pipeline could be affected by multiple factors, like network health, data and file transfer speed, or agent machine performance. If your task transfers a large number of individual files, you could try using an archive task when uploading to the agent workspace and an extract task when building or testing the project.

Related

docker-compose wait on other service before build

There are a few approaches to fix container startup order in docker-compose (a run-time ordering sketch follows below), e.g.
depends_on
docker-compose-wait
Docker Compose wait for container X before starting Y
...
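For the plain run-time ordering case, a minimal sketch using a healthcheck plus a depends_on condition (supported by docker compose v2 / the Compose specification, but not by docker stack deploy; service names are illustrative):
services:
  postgres:
    image: postgres:10
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10
  web:
    build: .
    depends_on:
      postgres:
        condition: service_healthy   # waits for the healthcheck, but only at run time, not build time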
However, if one of the services in a docker-compose file includes a build directive, it seems docker-compose will try to build the image first (ignoring depends_on basically - or interpreting depends_on as start dependency, not build dependency).
Is it possible for a build directive to specify that it needs another service to be up, before starting the build process?
Minimal Example:
version: "3.5"
services:
web:
build: # this will run before postgres is up
context: .
dockerfile: Dockerfile.setup # needs postgres to be up
depends_on:
- postgres
...
postgres:
image: postgres:10
...
Notwithstanding the general advice that programs should be written in a way that handles the unavailability of services (at least for some time) gracefully, are there any ways to allow builds to start only when other containers are up?
Some other related questions:
multi-stage build in docker compose?
Update/Solution: Solved the underlying problem by pushing all the (database) setup required to the CMD directive of a bootstrap container:
FROM undertest-base:latest
...
CMD ./wait && ./bootstrap.sh
where wait waits for postgres and bootstrap.sh contains the code for setting up the postgres database with fixtures, so the overall system becomes fully testable after that script.
With that, setting up an ephemeral test environment with database setup becomes a simple docker-compose up again.
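A hedged compose sketch of that bootstrap approach (the Dockerfile.bootstrap name is hypothetical; wait and bootstrap.sh are as described above):
services:
  postgres:
    image: postgres:10
  bootstrap:
    build:
      context: .
      dockerfile: Dockerfile.bootstrap   # the FROM undertest-base Dockerfile shown above
    depends_on:
      - postgres
    # its CMD runs ./wait && ./bootstrap.sh, then the container exits
  web:
    build: .
    depends_on:
      - postgres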
There is no option for this in Compose, and also it won't really work.
The output of an image build is a self-contained immutable image. You can do things like docker push an image to a registry, and Docker's layer cache will avoid rebuilding an image that it's already built. So in this hypothetical setup, if you could access the database during an image build, but then ran
docker-compose build
docker-compose down -v
docker-compose up -d --build
then the down -v step would remove the storage the database uses. The up --build option will cause the image to be rebuilt, but the build sequence will skip all of the steps (the layer cache sees nothing has changed) and produce the same image as originally, so whatever changes you had made to the database won't have happened.
At a more mechanical layer, the build sequence doesn't use the Compose-provided network, so you also wouldn't be able to connect to the database container.
There are occasional use cases where a dependency in build: would be handy, in particular if you're trying to build a base image that other images in your Compose setup share. But neither the stable Compose file v3 build: block nor the less-widely-supported Compose specification build: supports any notion of an image build depending on anything else.

Updating a docker container from image; leaves old images on server

My process for updating a docker image to production (a docker swarm) is as follows:
On dev environment:
docker-compose build
docker push myrepo/name
Then on the prod server, which is a docker swarm:
docker pull myrepo/name
docker service update --image myrepo/name --with-registry-auth containername
This works perfectly; the swarm is updated with the latest image.
However, it always leaves the old image on the live servers and I'm left with something like this:
docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
myrepo/name latest abcdef 14 minutes ago 1.15GB
myrepo/name <none> bcdefg 4 days ago 1.22GB
myrepo/name <none> cdefgh 6 days ago 1.22GB
Which, over time results in a heap of disk space being unnecessarily used.
I've read that docker system prune is not safe to run on production especially in a swarm.
So, I am having to regularly, manually remove old images e.g.
docker image rm bcdefg cdefgh
Am I missing a step in my update process, or is it 'normal' that old images are left over to be manually removed?
Thanks in advance
Since you are using Docker Swarm, and probably a multi-node setup, you could deploy a global service that does the cleanup for you. We are using Bret Fisher's approach for it:
version: '3.9'
services:
  image-prune:
    image: internal-image-registry.org/proxy-cache/library/docker:20.10
    command: sh -c "while true; do docker image prune -af --filter \"until=4h\"; sleep 14400; done"
    networks:
      - bridge
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global
      labels:
        - "env=devops"
        - "application=cleanup-image-prune"
networks:
  bridge:
    external: true
    name: bridge
When new hosts are added, it gets deployed on them automatically (with our own base docker image) and then does the cleanup job for us.
We haven't yet found the time to look into the newer Docker service types that are scheduled on their own. It would probably be wiser to move the cleanup to the replicated/global job modes provided by Docker instead of an infinite loop in a script; the above just works for us, so we did not make it a high enough priority to swap over. More info is in the Docker docs on replicated jobs.
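As a rough sketch, the swarm-native job variant mentioned above could look something like this from the CLI (Docker 20.10+; the service name and prune filter are illustrative):
docker service create \
  --name image-prune-job \
  --mode global-job \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  docker:20.10 docker image prune -af --filter "until=4h"
A global-job runs the command once on every node and its tasks then complete, so it could be re-run from cron or CI instead of looping inside the container.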

Docker compose, two services using same image: first fails with "no such image", second runs normally

TL;DR: I have two almost identical services in my compose file except for the name of the service and the published ports. When deploying with docker stack deploy..., why does the first service fail with a no such image error, while the second service using the same image runs perfectly fine?
Full: I have a docker-compose file with two Apache Tomcat services pulling the same image from my private git repository. The only difference between the two services in my docker-compose.yml is the name of the service (*_dev vs. *_prod) and the published ports. I deploy this docker-compose file on my swarm using the Gitlab CI with the gitlab-ci.yml. For the deployment of my docker-compose in this gitlab-ci.yml I use two commands:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
...
(I use a docker pull [image] command to have the image on the right node, since my --with-registry-auth is not working properly, but this is not my problem currently).
Now the strange thing is that for the first service, I obtain a No such image: error and the service is stopped, while for the second service everything seems to run perfectly fine. Both services are on the same worker node. This is what I get from docker service ps:
:~$ docker service ps webapp1_tomcat_dev
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx1 webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
xxx2 \_ webapp1_tomcat_dev.1 url/repo:tag worker1 node Shutdown Rejected 10 minutes ago "No such image: url/repo:tag#xxx…"
:~$ docker service ps webapp1_tomcat_prod
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xxx3 webapp1_tomcat_prod.1 url/repo:tag worker1 node Running Running 13 minutes ago
I have used the --no-trunc option to see that the IMAGE used by *_prod and *_dev is identical.
The restart_policy in my docker-compose explains why the first service fails three minutes after the second service started. Here is my docker-compose:
version: '3.2'
services:
  tomcat_dev:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8282:8080"
  tomcat_prod:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 60s
        window: 120s
        max_attempts: 1
    ports:
      - "8283:8080"
Why does the first service fail with a no such image error? Is it for example just not possible to have two services, that use the same image, work on the same worker node?
(I cannot simply scale-up one service, since I need to upload files to the webapp which are different for production and development - e.g. dev vs prod licenses - and hence I need two distinct services)
EDIT: Second service works because it is created first:
$ docker stack deploy -c docker-compose.yml webapp1 --with-registry-auth
Creating service webapp1_tomcat_prod
Creating service webapp1_tomcat_dev
I found a workaround by separating my services over two different docker compose files (docker-compose-prod.yml and docker-compose-dev.yml) and perform the docker stack deploy command in my gitlab-ci.yml twice:
...
script:
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-prod.yml webapp1 --with-registry-auth
  - docker pull $REGISTRY:$TAG
  - docker stack deploy -c docker-compose-dev.yml webapp1 --with-registry-auth
...
My gut says the restart_policy in my docker-compose was too strict as well (it had max_attempts: 1), and maybe because of this the image couldn't be used in time / within one restart (as suggested by Ludo21South). Hence I allowed more attempts, but since I had already separated the services over two files (which worked), I have not checked whether this hypothesis is true.
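For reference, a looser restart_policy along those lines might look like this (the values are illustrative, not tested):
  tomcat_dev:
    image: url/repo:tag
    deploy:
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 5
        window: 120s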

Cypress test failing with "out of memory" error in docker

I have Cypress tests running without any issues locally.
But when I run them in a Docker container, they fail with an "out of memory" error. Logs: https://pastebin.com/0TEYnfqq
I saw a suggestion in this issue (cypress-io/cypress#350) to use --ipc=host, but the problem keeps occurring.
While the tests are running, I see the RAM usage of the Docker container is around 1.6 GB max, but the VM on which Docker is running has around 6 GB free.
I ultimately want to run these tests in AWS Fargate, any idea what is the equivalent of --ipc=host in fargate?
Any help is much appreciated. Thank you.
"Out of memory" error you see because chrome under docker has 64MB restricted memory by default which sometimes is not enough. It has nothing about RAM. And when you run tests locally you dont have this restriction and whats why your tests are running smooth locally.
To increase this restriction run docker with 2 additional params
docker run -it --ipc=host --shm-size=1024M
Or with docker compose:
version: "3"
services:
name:
image: image_name
environment:
- SERVER_URL=http://server:8111
- AGENT_NAME=docker-agent-1
- DOCKER_IN_DOCKER=start
privileged: true
container_name: docker_agent_1
ipc: host
shm_size: 1024M
The above is an example of Docker settings for a TeamCity CI agent.
Using the Cypress configuration option numTestsKeptInMemory could also help reduce memory consumption.
The default is 50 kept test results (which include heavy DOM snapshots); reduce it to 5 and see if that helps.
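For a quick experiment, the option can also be passed on the command line (a sketch; adjust to however you invoke Cypress):
npx cypress run --config numTestsKeptInMemory=5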

Putting file into HDFS using docker-compose

Is there a way to put some file, let's say data.json, into HDFS automatically right from Docker-compose/Dockerfile?
When I start the namenode and datanode, I can enter the containers with
docker exec -it namenode [datanode] bash, and use
hdfs dfs -put data.json hdfs:/ (once safe mode is finished),
and that works, but I need a way to run this automatically. When I try to build the containers from a Dockerfile and put the commands:
FROM bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
WORKDIR /data
ADD hdfs_writer/data.json /data
# ADD python_script.py /data
CMD ["hdfs dfsadmin -safemode wait && hdfs dfs -put ./data.json hdfs:/"]
# CMD ["python python_script.py"]
The namenode container immediately terminates. I also tried the Python script below, which I add to the container and run with CMD.
python_script
import time
import os
os.system("hdfs dfsadmin -safemode wait")
os.system("hdfs dfs -put -f data.json hdfs:/")
while True:
    time.sleep(5)
In that case, the container is running, but if I check the logs and try to list HDFS with hdfs dfs -ls hdfs:/, I get the following error:
safemode: Call From 662aae005e8b/172.20.0.5 to namenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
19/04/18 14:36:36 WARN ipc.Client: Failed to connect to server: namenode/172.20.0.5:8020: try once and fail.
I read the link recommended in the error log, and to be honest, I am not sure I understand what I should do.
Any suggestions or ideas about a possible solution are highly valuable to me, as I am new to this field and don't have much experience.
If you need some more info, I will be happy to provide it.
docker-compose.yml (just part of it)
namenode:
  # docker-compose.yml and Dockerfile are in the same directory
  build: .
  volumes:
    - ./data/namenode:/hadoop/dfs/name
  environment:
    - CLUSTER_NAME=cluster
  env_file:
    - ./hadoop.env
  ports:
    - 50070:50070
datanode:
  image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
  depends_on:
    - namenode
  volumes:
    - ./data/datanode:/hadoop/dfs/data
  env_file:
    - ./hadoop.env
hadoop.env
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_blocksize=1m
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
You can't write to networked services in a Dockerfile. Imagine running docker build, running your combined application, tearing it down, and running it again. You'll reuse the same built image without re-running the Dockerfile steps; only the content in the image itself is kept. In most cases you need some minor amount of setup to communicate between services (Docker Compose can do this for you) but that is not set up during a build sequence. This is the same answer as "you can't run database migrations from a Dockerfile", but it applies equally to Hadoop.
A container only does one thing. Your sample Dockerfile sets a different CMD that waits for the namenode to be running and sets it up. This happens instead of starting the namenode process. A Docker container runs one main command and one main command only; there is not a way to run a main command and also a side support script of some form. The container you show would probably work, but you'd need to run it as a separate container alongside the namenode container.
You don't need to be "in Docker" to access Docker-hosted services. You can use a Docker Compose ports: directive to make services visible to the host, at which point you can use ordinary clients to interact with them. The docker exec path is the equivalent of "I ssh to my server as root, and then...", which isn't how you normally deal with any service at all.
Your server containers should only run servers. In your example you're both trying to launch an HDFS namenode and also populate the server from the same container; you'd be better off having the namenode container only be the namenode and running the setup job from another container or from the host. (See the standard postgres image's entrypoint script for some idea of the gyrations required otherwise.)
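A hedged sketch of such a separate loader container for this compose file, reusing the namenode image only for its HDFS client and hadoop.env for the fs.defaultFS setting (untested; adjust the image and paths to your setup):
hdfs_writer:
  image: bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8   # reused here only for the HDFS client
  depends_on:
    - namenode
  env_file:
    - ./hadoop.env
  volumes:
    - ./hdfs_writer/data.json:/data/data.json
  command: bash -c "hdfs dfsadmin -safemode wait && hdfs dfs -put -f /data/data.json hdfs:/"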
Docker Compose isn't great for one-off jobs. Every time you run docker-compose up it will discover that your setup container isn't running and try to start it again. Other more powerful orchestrators could be a better fit; for example, a Kubernetes Job is a reasonable fit for what you're describing.
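If you do move to Kubernetes, a minimal Job for the same one-off load might look roughly like this (the image and mounted data are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: hdfs-load-fixtures
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: loader
          image: my-hadoop-client:latest   # hypothetical image containing the HDFS client and config
          command: ["sh", "-c", "hdfs dfsadmin -safemode wait && hdfs dfs -put -f /data/data.json hdfs:/"]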

Resources