PAM limits in Docker containers aren't working

I added an entry in /etc/security/limits.conf in a Docker container to limit the maximum number of processes for user1, but when I run bash in the container as user1, ulimit -a doesn't reflect the limits defined in the PAM limits file (/etc/security/limits.conf).
How can I get this to work?
I've also added the line session required pam_limits.so to /etc/pam.d/common-session, so that's not the problem.
I start the docker container with something like sudo docker run --user=user1 --rm=true <container-name> bash
Also, sudo docker run ... --user=user1 ... cmd doesn't apply the PAM limits, but sudo docker run ... --user=root ... su user1 -c 'cmd' does.

/etc/security/limits.conf is only read by the PAM stack (pam_limits.so) when a session is opened, for example at login. Docker starts container processes directly, without going through a PAM session, so none of those login-time initializations apply inside the container. You have to set the process limits yourself, for example with the ulimit shell builtin.
A better way would be to use container-level limits; unfortunately, the current version of Docker doesn't support limiting the number of processes. It looks like that support will arrive in version 1.6 when it comes out, as @thaJeztah has mentioned.
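In the meantime you can lean on the behaviour you already observed: su opens a PAM session, so pam_limits.so gets a chance to read limits.conf. A rough sketch, reusing the question's placeholders:
sudo docker run --user=root --rm=true <container-name> su user1 -c 'ulimit -a'
Or skip PAM entirely and set the limit by hand in a root shell before dropping to user1 (the value 100 is just an example):
sudo docker run --user=root --rm=true <container-name> bash -c 'ulimit -u 100 && su user1 -c cmd'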

Related

how to set container ulimits in Container-Optimized OS

I need to set ulimits on the container. For example, docker run --ulimit memlock="-1:-1" <image>. However, I'm not sure how to do this when deploying a container-optimised VM on Compute Engine as it handles the startup of the container.
I'm able to deploy a VM with options like --privileged, -e for environment variables, and even an overriding CMD. How can I deploy a VM with ulimits set for the container?
I received an official reply:
Unfortunately the Containers on Compute Engine feature does not currently support setting the ulimit options for containers.
A workaround would be to set ulimit inside the container. For example:
gcloud beta compute instances create-with-container INSTANCE --zone=ZONE --container-image=gcr.io/google-containers/busybox --container-privileged --container-command=sh --container-arg=-c --container-arg=ulimit\ -n\ 100000
Unfortunately this method requires running the container as privileged.
Best regards,...
This reply gave me the inspiration to do the following: create a wrapper script that is referenced by your docker image's ENTRYPOINT. Within this wrapper script, set the ulimit(s) prior to starting the process(es) subject to them.
As a quick example:
$HOME/example/wrapper.sh
#! /bin/bash
# set memlock to unlimited
ulimit -l unlimited
# start the elasticsearch node
# (found this from the base images dockerfile on github)
/usr/local/bin/docker-entrypoint.sh eswrapper
$HOME/example/Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch:6.3.2
COPY wrapper.sh .
RUN chmod 777 wrapper.sh
ENTRYPOINT ./wrapper.sh
local image build
docker image build -t gcr.io/{GCLOUD_PROJECT_ID}/example:0.0.0 $HOME/example
deploy to gcr.io
docker push gcr.io/{GCLOUD_PROJECT_ID}/example:0.0.0
create an instance via gcloud
gcloud beta compute instances create-with-container example-instance-1 \
--zone us-central1-a \
--container-image=gcr.io/{GCLOUD_PROJECT_ID}/example:0.0.0 \
--container-privileged \
--service-account={DEFAULT_COMPUTE_ENGINE_SERVICE_ACC_ID}-compute@developer.gserviceaccount.com \
--metadata=startup-script="echo 'vm.max_map_count=262144' > /etc/sysctl.conf; sysctl -p;"
Note the following: the startup script above is only needed for containers of this particular image (it sets vm.max_map_count, which Elasticsearch requires). The service account is needed for pulling from your private Google Container Registry. The --container-privileged argument is essential, as running the container as privileged is required to set ulimits inside it.
verifying ulimits are set for your process(es)
On the VM host, run ps -e and find the PID(s) of the process(es) started by your wrapper script; in this case, find the PID whose command is java. For each PID, run cat /proc/{PID}/limits. In this example only memlock was set, and you can see that it is indeed unlimited.
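For example, on the VM host (1234 stands in for the real PID of the java process):
ps -e | grep java
cat /proc/1234/limits | grep 'locked memory'
Max locked memory         unlimited            unlimited            bytes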
There doesn't seem to be any documentation for setting ulimit when creating a Container-Optimized OS instance, nor in the doc for Configuring Options to Run Container.
Currently, there doesn't seem to be an option for automatically setting the ulimit of containers when deploying a container-optimised VM, as the docs here and here show, and the document on Configuring Options to Run Container doesn't cover it either. You can submit a feature request for it here under 'Compute'.
However, you can run containers yourself on a Container-Optimized OS (COS) instance, and there you can invoke docker with ulimit options set, as shown here.
I have successfully used the following.
From within the VM or from a startup script for the Container-Optimized OS:
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
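You can then confirm the value was applied:
sysctl vm.max_map_count
vm.max_map_count = 262144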

How can I see which user launched a Docker container?

I can view the list of running containers with docker ps or, equivalently, docker container ls (added in Docker 1.13). However, it doesn't display the user who launched each Docker container. How can I see which user launched a Docker container? Ideally I would like the list of running containers along with the user who launched each of them.
You can try this:
docker inspect $(docker ps -q) --format '{{.Config.User}} {{.Name}}'
Edit: Container name added to output
There's no built in way to do this.
You can check the user that the application inside the container is configured to run as by inspecting the container for the .Config.User field, and if it's blank the default is uid 0 (root). But this doesn't tell you who ran the docker command that started the container. User bob with access to docker can run a container as any uid (this is the docker run -u 1234 some-image option to run as uid 1234). Most images that haven't been hardened will default to running as root no matter the user that starts the container.
To understand why, realize that docker is a client/server app, and the server can receive connections in different ways. By default, this server is running as root, and users can submit requests with any configuration. These requests may be over a unix socket, you could sudo to root to connect to that socket, you could expose the API to the network (not recommended), or you may have another layer of tooling on top of docker (e.g. Kubernetes with the docker-shim). The big issue in that list is the difference between the network requests vs a unix socket, because network requests don't tell you who's running on the remote host, and if it did, you'd be trusting that remote client to provide accurate information. And since the API is documented, anyone with a curl command could submit a request claiming to be a different user.
In short, every user with access to the docker API is an anonymized root user on your host.
The closest you can get is to either place something in front of docker that authenticates users and populates something like a label, or to trust users to populate that label honestly (because there's nothing in docker validating these settings).
$ docker run -l "user=$(id -u)" -d --rm --name test-label busybox tail -f /dev/null
...
$ docker container inspect test-label --format '{{ .Config.Labels.user }}'
1000
Beyond that, if you have a deployed container, sometimes you can infer the user by looking through the configuration and finding volume mappings back to that user's home directory. That gives you a strong likelihood, but again, not a guarantee since any user can set any volume.
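For instance, a quick way to list a container's mounts (the container name and the output line are illustrative):
docker container inspect some-container --format '{{ range .Mounts }}{{ .Source }} -> {{ .Destination }}{{ "\n" }}{{ end }}'
/home/alice/data -> /data
A bind mount sourced from /home/alice strongly suggests, but doesn't prove, that alice started the container.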
I found a solution. It is not perfect, but it works for me.
I start all my containers with an environment variable ($CONTAINER_OWNER in my case) which includes the user. Then, I can list the containers with the environment variable.
Start container with environment variable
docker run -e CONTAINER_OWNER=$(whoami) MY_CONTAINER
Start docker compose with environment variable
echo "CONTAINER_OWNER=$(whoami)" > deployment.env # Create env file
docker-compose --env-file deployment.env up
List containers with the environment variable
for container_id in $(docker container ls -q); do
echo $container_id $(docker exec $container_id bash -c 'echo "$CONTAINER_OWNER"')
done
As far as I know, docker inspect will only show the configuration that the container was started with.
Because the entrypoint (or any init script) might switch to a different user, such a change will not be reflected in the docker inspect output.
To work around this, you can overwrite the default entrypoint set by the image with --entrypoint="" and specify a command like whoami or id after it.
You asked specifically to see all the containers running and the launched user, so this solution is only partial and gives you the user in case it doesn't appear with the docker inspect command:
docker run --entrypoint "" <image-name> whoami
Maybe somebody will proceed from this point to a full solution (:
Read more about --entrypoint "" here.
If you are used to the ps command, you can run ps on the Docker host and grep for part of the command your container's process is running. For example, if you have a Tomcat container running, you can run the following command to see which user the container's process is running as on the host.
ps aux | grep tomcat
This is possible because containers are nothing but processes managed by docker. However, this only works on a single host. Docker provides alternatives to get container details, as mentioned in the other answers.
This command will print the uid and gid of the user the container is running as:
docker exec <CONTAINER_ID> id
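The output will look something like this (the exact values depend on the user the container runs as; unhardened images typically show root):
uid=0(root) gid=0(root) groups=0(root)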
ps -aux | less
Find the process's name (the one running inside the container) in the last column, and the first column shows the user who ran it.

Can one docker user hide data from another?

Alice and Bob are both members of the docker group on the same host. Alice wants to run some long-running calculations in a docker container, then copy the results to her home folder. Bob is very nosy, and Alice doesn't want him to be able to read the data that her calculation is using.
Is there anything that the system administrator can do to keep Bob out of Alice's docker containers?
Here's how I think Alice should get data in and out of her container, based on named volumes and the docker cp command, as described in this question and this one.
$ pwd
/home/alice
$ date > input1.txt
$ docker volume create sandbox1
sandbox1
$ docker run --name run1 -v sandbox1:/data alpine echo OK
OK
$ docker cp input1.txt run1:/data/input1.txt
$ docker run --rm -v sandbox1:/data alpine sh -c "cp /data/input1.txt /data/output1.txt && date >> /data/output1.txt"
$ docker cp run1:/data/output1.txt output1.txt
$ cat output1.txt
Thu Oct 5 16:35:30 PDT 2017
Thu Oct 5 23:36:32 UTC 2017
$ docker container rm run1
run1
$ docker volume rm sandbox1
sandbox1
$
I create an input file, input1.txt, and a named volume, sandbox1. Then I start a container named run1 just so I can copy files into the named volume; that container just prints an "OK" message and exits. I copy the input file in, then run the main calculation. In this example, it copies the input to the output and appends a second timestamp.
After the calculation finishes, I copy the output file, then remove the container and the named volume.
Is there any way to stop Bob from loading his own container that mounts the named volume and shows him Alice's data? I've set up Docker to use a user namespace, so Alice and Bob don't have root access to the host, but I can't see how to make Alice and Bob use different user namespaces.
Alice and Bob have been granted virtual root access to the host by being in the docker group.
The docker group grants them access to the Docker API via a socket file. There is no facility in Docker at the moment to differentiate between users of the Docker API. The Docker daemon runs as root and by virtue of what the Docker API allows, Alice and Bob will be able to work around any barriers that you did try to put in place.
User Namespaces
User namespace isolation stops a user inside a container from breaking out of the container as a privileged or different user; in effect, the container process now runs as an unprivileged user on the host.
An example would be:
Alice is given ssh access to container A running in namespace_a.
Bob is given ssh access to container B in namespace_b.
Because the users now exist only inside their containers, they won't be able to modify each other's files on the host. Even if both containers mapped the same host volume, files without world read/write/execute permission would be safe from each other's containers. As they have no control over the daemon, they can't do anything to break out.
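For reference, user namespace remapping is a daemon-wide setting rather than something Alice or Bob choose per container. A minimal sketch of /etc/docker/daemon.json (with "default", Docker creates and uses the dockremap user and its subordinate ID ranges):
{
  "userns-remap": "default"
}
Restart the daemon after changing it. All containers then share that one remapping, which is why it can't, on its own, separate Alice from Bob.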
Docker Daemon
The namespace doesn't secure the Docker daemon and API itself, which is still a privileged process. The first way around a user namespace is to set the host namespace on the command line:
docker run --privileged --userns=host busybox fdisk -l
The docker exec, docker cp and docker export commands will give someone with access to the Docker API the contents of any created containers.
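In the question's own setup, for example, nothing stops Bob from mounting Alice's named volume into a container of his own and reading it:
docker run --rm -v sandbox1:/data alpine cat /data/input1.txt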
Restricting Docker Access
It is possible to restrict access to the API, but then you can't give users with shell access membership of the docker group.
Options include allowing a limited set of docker commands via sudo, or providing sudo access to scripts that hard-code the docker parameters:
#!/bin/sh
docker run --userns=whom image command
For automated systems, access can be provided via an additional shim API with appropriate access controls in front of the Docker API that then passes on the "controlled" request to Docker. dockerode or docker-py can be easily plugged into a REST service and interface with Docker.
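As a sketch of the sudo route, a sudoers entry like the following (the script path is a placeholder) lets alice run only that fixed wrapper as root instead of joining the docker group; the script itself must not be writable by alice, or the restriction is meaningless:
alice ALL=(root) NOPASSWD: /usr/local/bin/docker-sandbox.sh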

Setting absolute limits on CPU for Docker containers

I'm trying to set absolute limits on Docker container CPU usage. The CPU shares concept (docker run -c <shares>) is relative, but I would like to say something like "let this container use at most 20ms of CPU time every 100ms". The closest answer I can find is a hint from the mailing list on using cpu.cfs_quota_us and cpu.cfs_period_us. How does one use these settings with docker run?
I don't have a strict requirement for either LXC-backed Docker (e.g. pre-0.9) or later versions; I just need to see an example of these settings being used. Any links to relevant documentation or helpful blogs are very welcome too. I am currently using Ubuntu 12.04, and under /sys/fs/cgroup/cpu/docker I see these options:
$ ls /sys/fs/cgroup/cpu/docker
cgroup.clone_children cpu.cfs_quota_us cpu.stat
cgroup.event_control cpu.rt_period_us notify_on_release
cgroup.procs cpu.rt_runtime_us tasks
cpu.cfs_period_us cpu.shares
I believe I've gotten this working. I had to restart my Docker daemon with --exec-driver=lxc, as I could not find a way to pass cgroup arguments to libcontainer. This approach worked for me:
# Run with absolute limit
sudo docker run --lxc-conf="lxc.cgroup.cpu.cfs_quota_us=50000" -it ubuntu bash
The necessary CFS docs on bandwidth limiting are here.
I briefly confirmed with sysbench that this does seem to introduce an absolute limit, as shown below:
$ sudo docker run --lxc-conf="lxc.cgroup.cpu.cfs_quota_us=10000" --lxc-conf="lxc.cgroup.cpu.cfs_period_us=50000" -it ubuntu bash
root@302e651c0686:/# sysbench --test=cpu --num-threads=1 run
<snip>
total time: 90.5450s
$ sudo docker run --lxc-conf="lxc.cgroup.cpu.cfs_quota_us=20000" --lxc-conf="lxc.cgroup.cpu.cfs_period_us=50000" -it ubuntu bash
root@302e651c0686:/# sysbench --test=cpu --num-threads=1 run
<snip>
total time: 45.0423s
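For readers on a newer Docker: assuming a version that includes these flags, the same CFS settings are exposed directly by docker run, without the LXC exec driver:
docker run --cpu-period=50000 --cpu-quota=20000 -it ubuntu bash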

How can I set negative niceness of a docker process?

I have a test environment for code in a Docker image which I use by running bash in the container:
me@host$ docker run -ti myimage bash
Inside the container, I launch a program normally by saying
root@docker# ./myprogram
I want the myprogram process to have a negative niceness (there are valid reasons for this). However:
root@docker# nice -n -7 ./myprogram
nice: cannot set niceness: Permission denied
Given that docker is run by the docker daemon, which runs as root, and I am root inside the container, why doesn't this work, and how can I force a negative niceness?
Note: The docker image is running debian/sid and the host is ubuntu/12.04.
Try adding
--privileged=true
to your run command.
[edit] --privileged=true is the old method. It looks like
--cap-add=SYS_NICE
should work as well.
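Putting it together with the question's setup (image name taken from the question), something like this should allow the negative niceness:
me@host$ docker run --cap-add=SYS_NICE -ti myimage bash
root@docker# nice -n -7 ./myprogram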
You could also set the CPU priority of the whole container with -c for CPU shares.
Docker docs: http://docs.docker.com/reference/run/#runtime-constraints-on-cpu-and-memory
CGroups/cpu.shares docs: https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt

Resources