Where did Kubernetes Security context runAsUser 1000 come from? - docker

I learnt that to run a container as rootless, you need to specify either securityContext: runAsUser: 1000 or the USER directive in the Dockerfile.
My question is that there is no UID 1000 on the Kubernetes/Docker host system itself.
I learnt before that Linux user namespacing allows a user to have a different UID outside its original namespace.
Hence, how does UID 1000 exist under the hood? Did the original root (UID 0) create a new user namespace which is represented by UID 1000 in the container?
What happens if we specify UID 2000 instead?

Hope this answer helps you
I learnt that to run a container as rootless, you need to specify
either the SecurityContext:runAsUser 1000 or specify the USER
directive in the DOCKERFILE
You are correct, except that runAsUser is not limited to 1000: you can specify any UID. Just remember that whatever UID you want to use (runAsUser: <UID>) should already exist!
Often, base images will already have a user created and available but leave it up to the development or deployment teams to leverage it. For example, the official Node.js image comes with a user named node at UID 1000 that you can run as, but they do not explicitly set the current user to it in their Dockerfile. We will either need to configure it at runtime with a runAsUser setting or change the current user in the image using a derivative Dockerfile.
runAsUser: 1001 # hardcode user to non-root if not set in Dockerfile
runAsGroup: 1001 # hardcode group to non-root if not set in Dockerfile
runAsNonRoot: true # hardcode to non-root. Redundant to the above if the Dockerfile sets USER 1000
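For the Node.js case above, here is a minimal sketch of the runtime approach (the image tag, container name, and command are placeholders; UID/GID 1000 is the node user and group that the official image creates, so verify the value for your own base image):
apiVersion: v1
kind: Pod
metadata:
  name: node-app
spec:
  securityContext:
    runAsUser: 1000    # the "node" user shipped in the official node image
    runAsGroup: 1000   # the matching "node" group
    runAsNonRoot: true # fail fast if the resolved user would be root
  containers:
    - name: app
      image: node:18-alpine           # illustrative tag; pin the one you actually use
      command: ["node", "server.js"]  # placeholder entrypoint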
Remember that runAsUser and runAsGroup ensure container processes do not run as the root user, but don't rely on runAsUser or runAsGroup alone to guarantee this. Be sure to also set runAsNonRoot: true.
Here is a full example of securityContext:
# generic pod spec that's usable inside a deployment or other higher level k8s spec
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    # basic container details
    - name: my-container-name
      # never use reusable tags like latest or stable
      image: my-image:tag
      # hardcode the listening port if Dockerfile isn't set with EXPOSE
      ports:
        - containerPort: 8080
          protocol: TCP
      readinessProbe: # I always recommend using these, even if your app has no listening ports (this affects any rolling update)
        httpGet: # Lots of timeout values with defaults, be sure they are ideal for your workload
          path: /ready
          port: 8080
      livenessProbe: # only needed if your app tends to go unresponsive or you don't have a readinessProbe, but this is up for debate
        httpGet: # Lots of timeout values with defaults, be sure they are ideal for your workload
          path: /alive
          port: 8080
      resources: # Because if limits = requests then QoS is set to "Guaranteed"
        limits:
          memory: "500Mi" # If container uses over 500MB it is killed (OOM)
          #cpu: "2" # Not normally needed, unless you need to protect other workloads or QoS must be "Guaranteed"
        requests:
          memory: "500Mi" # Scheduler finds a node where 500MB is available
          cpu: "1" # Scheduler finds a node where 1 vCPU is available
      # per-container security context
      # lock down privileges inside the container
      securityContext:
        allowPrivilegeEscalation: false # prevent sudo, etc.
        privileged: false # prevent acting like host root
  terminationGracePeriodSeconds: 600 # default is 30, but you may need more time to gracefully shutdown (HTTP long polling, user uploads, etc)
  # per-pod security context
  # enable seccomp and force non-root user
  securityContext:
    seccompProfile:
      type: RuntimeDefault # enable seccomp and the runtimes default profile
    runAsUser: 1001 # hardcode user to non-root if not set in Dockerfile
    runAsGroup: 1001 # hardcode group to non-root if not set in Dockerfile
    runAsNonRoot: true # hardcode to non-root. Redundant to the above if the Dockerfile sets USER 1000
Sources:
Kubernetes Pod Specification Good Defaults
Configure a Security Context for a Pod or Container
10 Kubernetes Security Context settings you should understand

Something at the container layer calls the setuid(2) system call with that numeric user ID. There's no particular requirement to "create" a user; if you are able to call setuid() at all, you can call it with any numeric uid you want.
You can demonstrate this with plain Docker pretty easily. The docker run -u option takes any numeric uid, and you can docker run -u 2000 and your container will (probably) still run. It's common enough to docker run -u $(id -u) to run a container with the same numeric user ID as the host user even though that uid doesn't exist in the container's /etc/passwd file.
At a Kubernetes layer this is a little less common. A container can't usefully access host files in a clustered environment (...on which host?) so there's no need to have a user ID matching the host's. If the image already sets up a non-root user ID, you should be able to just use it as-is without setting it at the Kubernetes layer.
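To tie that back to the question's UID 2000 case, a minimal sketch (the image is a placeholder): Kubernetes will simply start the container's process with uid 2000, even though no such account exists in the image's /etc/passwd; the kernel doesn't care, although software that tries to look up its own username or write to a home directory may complain.
apiVersion: v1
kind: Pod
metadata:
  name: uid-2000-demo
spec:
  securityContext:
    runAsUser: 2000     # arbitrary numeric UID; no matching user account is required
    runAsNonRoot: true
  containers:
    - name: demo
      image: my-image:tag   # placeholder image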

Related

How to grant a non-root user write permissions on a Kubernetes SMB flexVolume mount?

Please help! I've been struggling with this for a few days now...
I'm trying to write to a mount in a Kubernetes pod as a non-root user and getting access denied.
In the Kubernetes manifest, I am mounting a Windows shared folder like this:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: centos-deployment
spec:
  template:
    spec:
      volumes:
        - name: windows-mount
          flexVolume:
            driver: microsoft.com/smb
            secretRef:
              name: centos-credentials
            options:
              mountOptions: 'cifs,dir_mode=0777,file_mode=0777'
              source: //100.200.300.400/windows-share
      containers:
        - name: centos-pod
          image: 'centos:latest'
          command:
            - sh
            - '-c'
            - sleep 1000000
          volumeMounts:
            - name: windows-mount
              mountPath: /var/windows-share
and in the Dockerfile I'm switching to the application user like so:
# Drop from 'root' user to 'nobody' (user with no privileges).
USER nobody:nobody
But now, the mount is owned by "root". The "root" user can write to the path but the user "nobody" cannot.
I tried an init container that runs chmod -R 775 on the folder, but it looks like even the root user cannot change the permissions or ownership of the mount. (The umask command returned 022.)
If I exec into the pod, I can see the mount is set with 755 permissions instead of 777
"file_mode=0755,dir_mode=0755"
[root@centos-deployment-5d46bd8b89-tzghs /]# mount | grep windows-share
//100.200.300.400/windows-share on /var/windows-share type cifs (rw,relatime,vers=default,cache=strict,username=*******,domain=,uid=0,noforceuid,gid=0,noforcegid,addr=100.200.300.400,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=1048576,wsize=1048576,echo_interval=60,actimeo=1)
Any idea how to mount a Windows share so that it is writable by non-root user?
Thanks! any help will be very appreciated
Full reference: https://linux.die.net/man/8/mount.cifs
Try playing with the mountOptions. For example:
uid=arg - sets the uid that will own all files or directories on the mounted filesystem when the server does not provide ownership information. It may be specified as either a username or a numeric uid. When not specified, the default is uid 0. The mount.cifs helper must be at version 1.10 or higher to support specifying the uid in non-numeric form. See the section on FILE AND DIRECTORY OWNERSHIP AND PERMISSIONS below for more information.
volumes:
  - name: windows-mount
    ...
    options:
      mountOptions: 'cifs,uid=<YOUR_USERNAME>,dir_mode=0777,file_mode=0777'
If this doesn't work, you can also try adding the noperm mount option.
noperm - Client does not do permission checks. This can expose files on this mount to access by other users on the local client system. It is typically only needed when the server supports the CIFS Unix Extensions but the UIDs/GIDs on the client and server system do not match closely enough to allow access by the user doing the mount. Note that this does not affect the normal ACL check on the target machine done by the server software (of the server ACL against the user name provided at mount time).
volumes:
  - name: windows-mount
    ...
    options:
      mountOptions: 'cifs,dir_mode=0777,file_mode=0777,noperm'
About using chmod/chown on CIFS
The core CIFS protocol does not provide unix ownership information or mode for files and directories. Because of this, files and directories will generally appear to be owned by whatever values the uid= or gid= options are set, and will have permissions set to the default file_mode and dir_mode for the mount. Attempting to change these values via chmod/chown will return success but have no effect.
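Putting the two suggestions together, a sketch of the volume block with ownership forced onto the container's nobody user (the numeric uid/gid is an assumption: nobody is 65534 on CentOS 8 and many other images, but 99 on CentOS 7, so check with id nobody inside the container):
volumes:
  - name: windows-mount
    flexVolume:
      driver: microsoft.com/smb
      secretRef:
        name: centos-credentials
      options:
        mountOptions: 'cifs,uid=65534,gid=65534,dir_mode=0777,file_mode=0777,noperm'
        source: //100.200.300.400/windows-share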

Kubernetes - Too many open files

I'm trying to evaluate the performance of one of my Go servers running inside a pod. However, I'm receiving an error saying "too many open files". Is there any way to set the ulimit in Kubernetes?
ubuntu@ip-10-0-1-217:~/ppu$ kubectl exec -it go-ppu-7b4b679bf5-44rf7 -- /bin/sh -c 'ulimit -a'
core file size (blocks) (-c) unlimited
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 15473
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Deployment file.
---
apiVersion: apps/v1
kind: Deployment # Type of Kubernetes resource
metadata:
  name: go-ppu # Name of the Kubernetes resource
spec:
  replicas: 1 # Number of pods to run at any given time
  selector:
    matchLabels:
      app: go-ppu # This deployment applies to any Pods matching the specified label
  template: # This deployment will create a set of pods using the configurations in this template
    metadata:
      labels: # The labels that will be applied to all of the pods in this deployment
        app: go-ppu
    spec: # Spec for the container which will run in the Pod
      containers:
        - name: go-ppu
          image: ppu_test:latest
          imagePullPolicy: Never
          ports:
            - containerPort: 8081 # Should match the port number that the Go application listens on
          livenessProbe: # To check the health of the Pod
            httpGet:
              path: /health
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 35
            periodSeconds: 30
            timeoutSeconds: 20
          readinessProbe: # To check if the Pod is ready to serve traffic or not
            httpGet:
              path: /readiness
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 35
            timeoutSeconds: 20
Pods info:
ubuntu@ip-10-0-1-217:~/ppu$ kubectl get pods
NAME READY STATUS RESTARTS AGE
go-ppu-7b4b679bf5-44rf7 1/1 Running 0 18h
ubuntu@ip-10-0-1-217:~/ppu$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 19h
ppu-service LoadBalancer 100.64.171.12 74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com 8081:32623/TCP 18h
When I used locust to test the performance of the server, I received the following error:
# fails Method Name Type
3472 POST /supplyInkHistory ConnectionError(MaxRetryError("HTTPConnectionPool(host='74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com', port=8081): Max retries exceeded with url: /supplyInkHistory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x....>: Failed to establish a new connection: [Errno 24] Too many open files',))",),)
You may have a look at https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/, but you need to enable a few features to make it work.
securityContext:
  sysctls:
    - name: fs.file-max
      value: "YOUR VALUE HERE"
There were a few cases regarding setting the --ulimit argument; you can find them here or check this article. This resource limit can be set by Docker during container startup. As you added the google-kubernetes-engine tag, this answer relates to the GKE environment; however, it could work similarly on other clouds.
If you would like to set the open-files limit to unlimited, you can modify the configuration file /etc/security/limits.conf. However, please note that it will not persist across reboots.
A second option would be to edit /etc/init/docker.conf and restart the docker service. By default it has a few limits like nofile or nproc; you can add yours there.
Another option could be to use an instance template that includes a start-up script which sets the required limit. After that, you would need to use this new instance template for the instance group in GKE. More information here and here.
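As a sketch of the Docker-level route mentioned above (this is node-level configuration rather than anything Kubernetes manages, and the values are illustrative): on nodes whose Docker daemon reads /etc/docker/daemon.json, you could raise the default nofile ulimit for every container the daemon starts, then restart Docker.
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Soft": 1048576,
      "Hard": 1048576
    }
  }
}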

SCDF 2.1 on K8s: configure security context to run as non-root with no writable filesystem

I need to configure the SCDF 2 Skipper, SCDF, and app pods to run without root and without writing to the pod filesystem.
I made these changes to the config YAMLs:
data:
  application.yaml: |-
    spring:
      cloud:
        skipper:
          server:
            platform:
              kubernetes:
                accounts:
                  default:
                    namespace: default
                    deploymentServiceAccountName: scdf2-server-data-flow
                    securityContext:
                      runAsUser: 2000
                      allowPrivilegeEscalation: false
                    limits:
And SCDF starts and runs as user 2000 (there was a problem with a writable local Maven repo, fixed with an NFS PVC)...
But the app pods always start as the root user, not as user 2000.
I've changed the Skipper config with a securityContext... any clues?
Thanks
What you set as deploymentServiceAccountName is one of the Kubernetes deployer properties that can be used for deploying streaming applications or launching task applications.
It looks like the above configuration is not applied to your SCDF or Skipper server configuration properties, as it should at least get applied when deploying applications.
For the SCDF and Skipper servers themselves, you need to explicitly set serviceAccountName in your server deployment configurations (not deploymentServiceAccountName; as its name suggests, deploymentServiceAccountName is internally converted into the actual serviceAccountName for the respective stream/task apps when they get deployed).
We got it. We use it in the Skipper/SCDF deployment, not in the pods' deployments.
As you requested, the SCDF/Skipper deployment config has:
spec:
  containers:
    - name: {{ template "scdf.fullname" . }}-server
      image: {{ .Values.server.image }}:{{ .Values.server.version }}
      imagePullPolicy: {{ .Values.server.imagePullPolicy }}
      volumeMounts:
      ...
  serviceAccountName: {{ template "scdf.serviceAccountName" . }}
Are you telling me to change the SCDF/Skipper config map for tasks and streams? Or is there another property in (or before) the deployment config?
What is the relation between the serviceAccount and the user running the process inside the pod?
How is the serviceAccount related to the process running as user "2000"?
I can't understand it.
Please help, it is very important to run without root and without using the pod's local filesystem, except for "tmp" files.

How to mount HostPath Volume in Kubernetes with SELinux

I am trying to mount a hostPath volume into a Kubernetes Pod. An example of a hostPath volume specification is shown below, which is taken from the docs. I am deploying to hosts that are running RHEL 7 with SELinux enabled.
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
  volumes:
    - name: test-volume
      hostPath:
        # directory location on host
        path: /data
        # this field is optional
        type: Directory
When my Pod tries to read from a file that has been mounted from the underlying host, I get a "Permission Denied" error. When I run setenforce 0 to turn off SELinux, the error goes away and I can access the file. I get the same error when I bind mount a directory into a Docker container.
The issue is described here and, when using Docker, can be fixed by using the z or Z bind mount flag, described in the Docker docs here.
Whilst I can fix the issue by running
chcon -Rt svirt_sandbox_file_t /path/to/my/host/dir/to/mount
I see this as a nasty hack, as I need to do this on every host in my Kubernetes cluster and also because my deployment of Kubernetes as described in the YAML spec is not a complete description of what it is that needs to be done to get my YAML to run correctly. Turning off SELinux is not an option.
I can see that Kubernetes mentions SELinux security contexts in the docs here, but I haven't been able to successfully mount a hostPath volume into a pod without getting the permission denied error.
What does the YAML need to look like to successfully enable a container to mount a HostPath volume from an underlying host that is running SELinux?
Update:
The file I am accessing is a CA certificate that has these labels:
system_u:object_r:cert_t:s0
When I use the following options:
securityContext:
seLinuxOptions:
level: "s0:c123,c456"
and then check the access control audit errors via ausearch -m avc -ts recent, I can see that there is a permission denied error where the container has a level label of s0:c123,c456, so I can see that the level label works. I have set the label to be s0.
However, if I try to change the type label to be cert_t, the container doesn't even start, there's an error :
container_linux.go:247: starting container process caused "process_linux.go:364: container init caused \"write /proc/self/task/1/attr/exec: invalid argument\""
I don't seem to be able to change the type label of the container.
Expanding on the answer from VAS as it is partially correct:
You can only specify the level portion of an SELinux label when relabeling a path destination pointed to by a hostPath volume. This is automatically done so by the seLinuxOptions.level attribute specified in your securityContext.
However, attributes such as seLinuxOptions.type currently have no effect on volume relabeling. As of this writing, this is still an open issue within Kubernetes.
You can assign SELinux labels using seLinuxOptions:
apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
    - image: k8s.gcr.io/test-webserver
      name: test-container
      volumeMounts:
        - mountPath: /test-pd
          name: test-volume
      securityContext:
        seLinuxOptions: # it may not have the desired effect
          level: "s0:c123,c456"
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"
  volumes:
    - name: test-volume
      hostPath:
        # directory location on host
        path: /data
        # this field is optional
        type: Directory
According to the documentation:
seLinuxOptions: Volumes that support SELinux labeling are relabeled to be accessible by the label specified under seLinuxOptions. Usually you only need to set the level section. This sets the Multi-Category Security (MCS) label given to all Containers in the Pod as well as the Volumes.
Thanks to Phil for pointing that out. It appears to work only in Pod.spec.securityContext, according to the issue comment.
You could try with full permissions:
...
  image: k8s.gcr.io/test-webserver
  securityContext:
    privileged: true
...
Using SELinux can solve this problem. Reference article:
https://zhimin-wen.medium.com/selinux-policy-for-openshift-containers-40baa1c86aa5
In addition, you can refer to the SELinux object classes and permissions to control adding, deleting, and modifying entries in the mount directory:
https://selinuxproject.org/page/ObjectClassesPerms
My setting:
If a directory's SELinux context is unconfined_u:object_r:kubernetes_file_t:s0, you can define an SELinux policy like this:
module myapp 1.0;

require {
    type kubernetes_file_t;
    type container_t;
    class file { create open read unlink write getattr execute setattr link };
    class dir { add_name create read remove_name write };
}

#============= container_t ==============
#!!!! This avc is allowed in the current policy
allow container_t kubernetes_file_t:dir { add_name create read remove_name write };
#!!!! This avc is allowed in the current policy
allow container_t kubernetes_file_t:file { create open read unlink write getattr execute setattr link };
Run these commands on the node:
sudo checkmodule -M -m -o myapp.mod myapp.te
sudo semodule_package -o myapp.pp -m myapp.mod
sudo semodule -i myapp.pp

Configure Docker variables with Ansible

I have a Docker image for an FTP server in a repository. This image will be used on several machines, and I need to deploy the container and change the PORT variable depending on the destination machine.
This is my image (I've deleted the lines for the proftpd installation because they are not relevant to this case):
FROM alpine:3.5
ARG vcs_ref="Unknown"
ARG build_date="Unknown"
ARG build_number="1"
LABEL org.label-schema.vcs-ref=$vcs_ref \
org.label-schema.build-date=$build_date \
org.label-schema.version="alpine-r${build_number}"
ENV PORT=10000
COPY assets/port.conf /usr/local/etc/ports.conf
COPY replace.sh /replace.sh
#It is for a proFTPD Server
CMD ["/replace.sh"]
My port.conf file (I've also deleted information not relevant to this case):
# This is a basic ProFTPD configuration file (rename it to
# 'proftpd.conf' for actual use. It establishes a single server
# and a single anonymous login. It assumes that you have a user/group
# "nobody" and "ftp" for normal operation and anon.
ServerName "ProFTPD Default Installation"
ServerType standalone
DefaultServer on
# Port 21 is the standard FTP port.
Port {{PORT}}
.
.
.
And the replace.sh script is:
#!/bin/bash
set -e
[ -z "${PORT}" ] && echo >&2 "PORT is not set" && exit 1
sed -i "s#{{PORT}}#$PORT#g" /usr/local/etc/ports.conf
/usr/local/sbin/proftpd -n -c /usr/local/etc/proftpd.conf
... Is there any way to avoid using replace.sh and have Ansible be the one that replaces the PORT variable in the /usr/local/etc/proftpd.conf file inside the container?
My current Ansible task for the container is:
- name: (ftpd) Run container
  docker_container:
    name: "myimagename"
    image: "myimage"
    state: present
    pull: true
    restart_policy: always
    env:
      "PORT": "{{ myportUsingAnsible }}"
    networks:
      - name: "{{ network }}"
To sum up, all I need is to use Ansible to replace the configuration variable instead of using a shell script that replaces variables before running the services. Is that possible?
Many thanks
You are using the docker_container module, which needs a pre-built image. The file port.conf is baked inside the image. What you need to do is set a static port inside this file. Inside the container, you always use the static port 21 and, depending on the machine, you map this port onto a different port using Ansible.
Inside port.conf always use port 21
# Port 21 is the standard FTP port.
Port 21
The ansible task would look like:
- name: (ftpd) Run container
  docker_container:
    name: "myimagename"
    image: "myimage"
    state: present
    pull: true
    restart_policy: always
    networks:
      - name: "{{ network }}"
    ports:
      - "{{ myportUsingAnsible }}:21"
Now when you connect to the container, you need to use <hostname>:{{ myportUsingAnsible }}. This is the standard Docker way of doing things: the port inside the image is static, and you change the port mappings based on the available ports that you have.
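Since the port differs per destination machine, the value of myportUsingAnsible can simply come from per-host variables; a minimal sketch, assuming a standard host_vars layout (host names and values are illustrative):
# host_vars/ftp-host-01.yml
myportUsingAnsible: 2121

# host_vars/ftp-host-02.yml
myportUsingAnsible: 2221
Each host then publishes its own external port while the container keeps listening on the static port 21.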
