What does the lxc-container-default-with-nesting AppArmor profile do?

I'm using nested LXC with the lxc-container-default-with-nesting profile, which looks like the following:
profile lxc-container-default-with-nesting flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/lxc/container-base>
  #include <abstractions/lxc/start-container>
  # Uncomment the line below if you are not using cgmanager
  # mount fstype=cgroup -> /sys/fs/cgroup/**,
  deny /dev/.lxc/proc/** rw,
  deny /dev/.lxc/sys/** rw,
  mount fstype=proc -> /var/cache/lxc/**,
  mount fstype=sysfs -> /var/cache/lxc/**,
  mount options=(rw,bind),
}
I have two questions about the following line:
mount fstype=proc -> /var/cache/lxc/**,
Why is it safe to allow the container to mount /proc?
Why does the container need to mount /proc under /var/cache/lxc?

Nested Container Configuration
That AppArmor profile allows you to create nested LXC containers, one inside another. By default, nesting is disabled since it bypasses some of the default cgroup restrictions (more info here).
In general, the profile relaxes the AppArmor rules so that lxc can re-mount certain system resources (with certain restrictions) inside the container.
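For reference, the profile is selected in the parent container's config; a minimal example (the key is lxc.apparmor.profile on newer LXC releases, lxc.aa_profile on older ones):
lxc.apparmor.profile = lxc-container-default-with-nesting
# on older LXC releases:
# lxc.aa_profile = lxc-container-default-with-nesting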
lxc.container.conf
If you look at man lxc.container.conf, the section below explains the settings that control how proc is mounted. I think it uses proc:mixed by default (but I haven't confirmed this!).
lxc.mount.auto
    Specify which standard kernel file systems should be automatically mounted. This may dramatically simplify the configuration. The file systems are:
    · proc:mixed (or proc): mount /proc as read-write, but remount /proc/sys and /proc/sysrq-trigger read-only for security / container isolation purposes.
    · proc:rw: mount /proc as read-write.
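If you want to be explicit rather than rely on the default, you can set it in the container config yourself; a minimal example (the exact set of values your distribution ships as a default may differ):
lxc.mount.auto = proc:mixed sys:mixed cgroup:mixed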
Unprivileged LXC
As an aside, if you're not using unprivileged LXC, you should be. Seriously. It adds an additional layer of protection that restricts what the root user in the container can do (root inside the container is actually mapped to a non-root user outside the container). This also protects /proc in case something slips past the AppArmor rules.
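A minimal sketch of what that mapping looks like in the container config (the key is lxc.idmap on newer LXC, lxc.id_map on older releases; the ranges below are just the conventional example values and must match your /etc/subuid and /etc/subgid entries):
lxc.idmap = u 0 100000 65536
lxc.idmap = g 0 100000 65536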
As far as why it uses /var/cache/lxc, I have no idea. A guess would be that it has to do with not conflicting with cgmanager. Looking at the source might be a good place to start if you're interested in the reasoning.

Related

Iotedge windows container volume access

I have a windows container module that is supposed to write to a simple text file inside the volumes folder on the host machine.
The module is hardcoded to write the same thing to the same file on start up (this is for testing purposes).
Expected behavior
The module is initialized and a volume is created on the host machine and a text file is created in that volume.
Actual Behavior
The module is not allowed to write to its volume and I get the below access permission issue.
[screenshot: Volume Access Permission Issue]
If I add "Users" to the volume folder and give that group permission to modify the volume then everything works.
Question
Is there a way to do this without changing volume access options manually every time? If not, what is the best practice for giving a Windows container access to its volume?
Device Info
Windows 10 Enterprise LTSC
iotedge 1.1.3
Do you have the same behavior in the default path for the Moby engine volumes?
Path: C:\ProgramData\iotedge-moby\volumes
Command to create/set:
docker -H npipe:////./pipe/iotedge_moby_engine volume create testmodule
In this volume I never had a problem (currently we use Edge Runtime 1.1.4 + Windows Server 2019).
If we use a directory outside this "default" volume path, we need to manually grant "Authenticated Users" the Modify, Read, Write, List, and Execute permissions so that the container/Moby engine can read and write there.
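If you do have to grant that access, it can at least be scripted rather than clicked through the UI; something along these lines (the path is only an example for a non-default volume location):
icacls "C:\data\edge-volumes\testmodule" /grant "Authenticated Users":(OI)(CI)M /T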

rbind usage on local volume mounting

I have a directory that is configured to be managed by an automounter (as described here). I need to use this directory (and all directories that are mounted inside) in multiple pods as a Local Persistent Volume.
I am able to trigger the automounter within the containers, but there are some use-cases where this directory is not empty when the container starts up. This makes the sub-directories appear empty and prevents the automounter from being triggered (within the container).
I did some investigation and discovered that when using Local PVs, there is a mount -o bind command between the source directory and some internal directory managed by the kubelet (this is the line in the source code).
What I actually do need is rbind to be used (recursive binding - here is a good explanation).
Using rbind also requires some changes to the part that unmounts the volume (recursive unmounting is needed).
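For illustration, the difference boils down to this (paths are made up):
mount --bind /data/automount /mnt/target     # only /data/automount itself is bound; its submounts show up empty
mount --rbind /data/automount /mnt/target    # submounts under /data/automount are carried over as well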
I don't want to patch the kubelet and recompile it... yet.
So my question is: are there some official methods to provide to Kubernetes some custom mounter/unmounter?
Meanwhile, I did find a solution for this use-case.
Based on Kubernetes docs there is something called Out-Of-Tree Volume Plugins
The Out-of-tree volume plugins include the Container Storage Interface (CSI) and FlexVolume. They enable storage vendors to create custom storage plugins without adding them to the Kubernetes repository
Even though CSI is the encouraged option, I chose FlexVolume to implement my custom driver. Here is a detailed documentation.
This driver is actually a Python script that supports three actions: init/mount/unmount (--rbind is used to mount the directory managed by the automounter, and the unmount is done recursively). It is deployed using a DaemonSet (docs here).
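For the curious, the shape of such a driver is roughly the following. This is a sketch in shell rather than Python; the "source" option name is illustrative, but the calling convention (init / mount <dir> <json> / unmount <dir>, with a JSON status on stdout) is the FlexVolume contract:
#!/bin/sh
# FlexVolume driver sketch: rbind-mount an automounted directory into the kubelet's volume dir.
op="$1"
case "$op" in
  init)
    echo '{"status": "Success", "capabilities": {"attach": false}}'
    ;;
  mount)
    mnt="$2"; opts="$3"
    # pull the (illustrative) "source" option out of the JSON blob
    src=$(echo "$opts" | python3 -c 'import json,sys; print(json.load(sys.stdin)["source"])')
    mkdir -p "$mnt"
    if mount --rbind "$src" "$mnt"; then
      echo '{"status": "Success"}'
    else
      echo '{"status": "Failure", "message": "mount --rbind failed"}'
    fi
    ;;
  unmount)
    mnt="$2"
    if umount -R "$mnt"; then
      echo '{"status": "Success"}'
    else
      echo '{"status": "Failure", "message": "recursive unmount failed"}'
    fi
    ;;
  *)
    echo '{"status": "Not supported"}'
    ;;
esac
exit 0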
And this is it!

How to control file operations made to a volume in docker?

The situation is that I have a user-space file system which provides a bunch of POSIX-like interfaces in user space, like these:
open
read
write
mkdir
...
I want to make a volume on this file system and pass it to a Docker container. My question is: how can I control the way Docker accesses this volume, so that access can be redirected to my POSIX-like interface?
Right now my file system can't be mounted on the host. It is a completely user space file system.
I think fuse can support this, but I don't want to go there unless I have no choice.
You don't need a volume here if you can access your POSIX interfaces from your application running in Docker: just call them and perform read, write, etc. operations directly. If you really need a volume implementation, you need to store the data in another volume and have a watchdog app sync the changes to your userspace file system.
Docker does not implement any file or directory access. It's simply not what docker does, as a matter of design.
All docker does when launching a container is to create a bunch of mounts in such a way that the processes inside the container can issue their regular POSIX calls. When a process inside a container calls write(), the call goes directly to the Linux kernel, without docker's knowledge or intervention.
Now, there's a missing piece in your puzzle that has to be implemented one way or another: the application calls e.g. the POSIX write() function, and your filesystem is not able to intercept that call.
So you have a couple of options:
Option 1: Implement your userspace filesystem in a library:
The library would override the write() function.
You compile the library and put it in some directory e.g. /build/artifacts/filesystem.so.
You use that directory as a volume when running the container, e.g. docker run -v /build/artifacts/filesystem.so:/extralibs/filesystem.so ...
You add this filesystem as a preloaded library: docker run ... --env LD_PRELOAD=/extralibs/filesystem.so ...
This will make all calls in the container use your library, so it should forward all the irrelevant files (e.g. /bin/bash, /etc/passwd etc.) to the real filesystem.
If you have control over the images, then you can set it up such that only particular commands execute with this LD_PRELOAD.
Fair warning: implementing a library which overrides system calls and libc has a lot of pitfalls that you'll need to work around. One example is that if the program uses e.g. fprintf(), then you have to override fprintf() as well, even though fprintf() calls write().
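A rough sketch of the mechanics (file names are hypothetical, and the interposer library itself still has to be written):
# build the interposer library on the host
gcc -shared -fPIC -o /build/artifacts/filesystem.so shim.c -ldl
# run the container with the library bind-mounted and preloaded
docker run \
  -v /build/artifacts/filesystem.so:/extralibs/filesystem.so:ro \
  -e LD_PRELOAD=/extralibs/filesystem.so \
  myimage mycommand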
Option 2: Modify the application to just call your filesystem functions.
This is assuming you can modify the application and the docker image.
If your filesystem is a service, run it in the container and issue the appropriate RPCs.
If it needs to be shared with other containers, then the backing store for your filesystem can be a volume.
Option 3: Make your userspace filesystem available natively within the container.
Meaning any command can issue a write() that goes to the kernel directly, and the kernel redirects it to your filesystem.
This essentially means implementing your filesystem as a FUSE daemon, mounting it on the host (seeing how you can't mount it inside containers), and using it as a Docker volume.
If there's a specific limitation that you're not allowed to mount your filesystem on the host, then you have a lot of work to do to make option 1 work. Otherwise I would advise you to implement your filesystem with fuse and mount it on the host - it has the highest ROI.
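In that case the setup is essentially (the daemon name and paths are hypothetical):
# on the host: mount the FUSE filesystem
myfs-fuse /mnt/myfs
# then hand it to containers as an ordinary volume
docker run -v /mnt/myfs:/data myimage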

Mount network share with nfs with username / password

I am trying to mount a NAS using nfs for an application.
The Storage team has exported it to the host server and I can access it at /nas/data.
I am using a containerized application, and this file system export to the host machine would be a security issue, as any container running on the host would be able to use the share. So this Linux-to-Linux mounting will not work for me.
So the only alternative solution I have is mounting this NAS folder during container startup with a username/password.
The below command works fine on a share supporting Unix/Windows; I can mount it on container startup:
mount -t cifs -osec=ntlmv2,domain=mydomain,username=svc_account,password=password,noserverino //nsnetworkshare.domain.company/share/folder /opt/testnas
I have been told that we should use the nfs option instead of cifs, so I am just trying to find out whether using nfs or cifs will make any difference.
Specifying the nfs option gives the below error:
mount -t nfs -o nfsvers=3,domain=mydomain,username=svc_account,password=password,noserverino //nsnetworkshare.domain.company/share/folder /opt/testnas
mount.nfs: remote share not in 'host:dir' format
The below command doesn't seem to work either:
mount -t nfs -o nfsvers=3,domain=mydomain,username=svc_account,password=password,noserverino nsnetworkshare.domain.company:/share/folder /opt/testnas
mount.nfs: an incorrect mount option was specified
I couldn't find a mount -t nfs example with username/password, so I think we can't use mount -t nfs with credentials.
Please pour in ideas.
Thanks,
Vishnu
CIFS is a file sharing protocol. NFS is a volume sharing protocol. The difference between the two might not initially be obvious.
NFS is essentially a tiny step up from directly sharing /dev/sda1. The client actually receives a naked view of the shared subset of the filesystem, including (at least as of NFSv4) a description of which users can access which files. It is up to the client to actually manage the permissions of which user is allowed to access which files.
CIFS, on the other hand, manages users on the server side, and may provide a per-user view and access of files. In that respect, it is similar to FTP or WebDAV, but with the ability to read/write arbitrary subsets of a file, as well as a couple of other features related to locking.
This may sound like NFS is distinctly inferior to CIFS, but they are actually meant for different purposes. NFS is most useful for external hard drives connected via Ethernet, and for virtual cloud storage. In such cases, the intention is to share the drive itself with a machine, but simply do it over Ethernet instead of SATA. For that use case, NFS offers greater simplicity and speed. A NAS, as you're using, is actually a perfect example of this. It isn't meant to manage access; it's meant not to be exposed to systems that shouldn't access it in the first place.
If you absolutely MUST use NFS, there are a couple of ways to secure it. NFSv4 has an optional security model based on Kerberos. Good luck using that. A better option is to not allow direct connection to the NFS service from the host, and instead require going through some secure tunnel, like SSH port forwarding. Then the security comes down to establishing the tunnel. However, either one of those requires cooperation from the host, which would probably not be possible in the case of your NAS.
Mind you, if you're already using CIFS and it's working well, and it's giving you good access control, there's no good reason to switch (although, you'd have to turn the NFS off for security). However, if you have a docker-styled host, it might be worthwhile to play with iptables (or the firewall of your choice) on the docker-host, to prevent the other containers from having access to the NAS in the first place. Rather than delegating security to the NAS, it should be done at the docker-host level.
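For example, something along these lines on the docker-host would cut off container traffic to the NAS except for one trusted container network (the addresses are placeholders; the DOCKER-USER chain exists on current Docker releases precisely for this kind of rule):
# drop all container traffic to the NAS by default
iptables -I DOCKER-USER -d 192.0.2.10 -j DROP
# then allow the one container network that should reach it (inserted above the DROP)
iptables -I DOCKER-USER -s 172.20.0.0/16 -d 192.0.2.10 -j ACCEPT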
Well, I would say go with CIFS, as NFS is old and a few Linux/Unix distros have even stopped supporting it.
NFS is the “Network File System”, used specifically by Unix and Linux operating systems. It allows files to be shared transparently between servers and end-user machines like desktops and laptops. NFS uses a client-server methodology to allow users to view, read, and write files on a remote system. A user can mount all or a portion of a file system via NFS.
CIFS is an abbreviation for “Common Internet File System”, used by Windows operating systems for file sharing. CIFS also uses the client-server methodology, where a client makes a request to a server program to access a file; the server takes the requested action and returns a response. CIFS is an open-standard version of the Server Message Block (SMB) protocol developed and used by Microsoft, and it uses the TCP/IP protocol.
For Linux <-> Linux I would choose NFS, but for Windows <-> Linux, CIFS would be the best option.

Preventing USER root in Dockerfile (or at least being able to spot it)

We want to provide a base image (base OS + Java JRE) to developers. Since this is for a paranoid organization, we want to make sure that the container will always run as the application id (a.k.a. "app"). Putting a USER app at the end of the Dockerfile for the base image is not enough, since a Dockerfile for a derived image can use USER root. Is there either:
a way to prevent using the USER root in derived images, or, failing this
a watertight way to check an image for this (is searching the history for USER [root] statements enough? Or could this be concealed in some way (multistage-image...)).
If you give users access to the Docker API without additional measures, such as "authorization" ("authz") plugins, those users effectively have root permissions on the host where the Docker daemon runs; https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface.
You can configure the daemon to use user namespaces; https://docs.docker.com/engine/security/userns-remap/. When using user-namespaces, users inside the container are "remapped" to unprivileged users on the host, so root inside a container is a non-privileged user outside of the container.
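Enabling it is a daemon-level setting; a minimal example in /etc/docker/daemon.json ("default" lets Docker create and manage the dockremap user and its subordinate ID ranges for you):
{
  "userns-remap": "default"
}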
There are limitations when running with user-namespaces enabled, for example, when bind mounting directories from the host, the container's process may not be able to access/write to those files (depending on configuration); this is by design, and part of the protection that user-namespaces provide, but may be an issue, depending on your situation.
