Tech Review: Linux Container Review 1 of 4: Execution Drivers

Docker offers easy-to-use containers as a high-performance, low-footprint VM alternative in a variety of settings. Before integrating containers into my workflow, I wanted to understand their components, features, and restrictions. In this blog post I'll investigate the following:

Container means... - Define "container" and list its components
Bottoms Up! - Delve into execution driver implementation features: cgroups and namespaces
container.sh - Use a shell script to demonstrate cgroups and namespaces
Contain Yourself! - Explore some execution drivers: libcontainer, LXC, and systemd nspawn
Right Tool for the Job - Summarize (impatient readers should skip to here)

Container means...

"Lightweight container", or simply "container", is too broad a term. Documentation from the execution drivers disagrees somewhat, but all agree that a "container" runs on the host's shared kernel, which differentiates it from a VM. I'll borrow systemd-nspawn's sub-categorization to distinguish two types of containers: system containers and application sandboxes. "System containers" have a mostly complete OS with its own init system. "Application sandbox containers", on the other hand, might run only one application or a small subset of applications. System containers still share the kernel, so they typically make better use of a server's global resources than virtual machines do.

We can better understand containers by categorizing the component technologies. Partly borrowing Docker terminology, the container components are:

Execution Driver
Storage Backend
Management Layer
Security Features

When migrating existing workloads to Linux containers, I find it easier to move and test individual components using a "bottom up" perspective.

Bottoms Up!

Execution drivers for containers on Linux have been made feasible primarily by two kernel features:

cgroups
namespaces

cgroups, short for Linux Control Groups, constrain a process hierarchy to a given set of resources, e.g. core affinity, maximum memory use, etc. The hierarchy is maintained regardless of whether a process has daemonized itself, e.g. via the "double fork" technique. Classic system resource management tools like taskset and ulimit can provide similar restrictions, but they are not as easily applied to, or enforced on, an entire process hierarchy. cgroups are becoming common knowledge, and they expose a file system API that systems administrators can easily access.
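
As a small illustration of that file system API, the sketch below reads a process's cgroup membership from /proc/<pid>/cgroup, which needs no root access. The function name parse_cgroup is my own, and the "unified" label for an empty controller list is an assumption for hosts that also mount the newer cgroup v2 hierarchy.

```shell
# Print "controllers -> path" for each line of a /proc/<pid>/cgroup file.
# Each line has the form "hierarchy-ID:controller-list:path".
# parse_cgroup is a hypothetical helper name, not a system utility.
parse_cgroup() {
  local id controllers path
  while IFS=: read -r id controllers path; do
    # An empty controller list is labeled "unified" (assumption: cgroup v2)
    printf '%s -> %s\n' "${controllers:-unified}" "$path"
  done
}

# Show the cgroups of the current shell, if the kernel exposes them
if [ -r /proc/self/cgroup ]; then
  parse_cgroup < /proc/self/cgroup
fi
```

A process started through container.sh should list the "mygroup" path next to the cpuset and memory controllers.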

Namespaces, on the other hand, are manipulated only via system calls, and are typically not seen outside of execution drivers and a handful of system utilities such as "unshare" and "ip". "nsenter" is also available if you have util-linux v2.23 or later. A namespace wraps a global system resource for a set of processes so that (1) the resource can be translated, and (2) the global resource state cannot be seen. The translation usually disguises a local resource as the global resource, so that the process(es) appear to have sole access to it.
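
Although namespaces are changed through system calls, the kernel does publish a read-only identifier for each one as a symbolic link under /proc/<pid>/ns, and two processes share a namespace when those identifiers match. The sketch below lists and compares them without root access; ns_inode and same_ns are hypothetical helper names of my own.

```shell
# Extract the inode number from a namespace link target such as
# "net:[4026531956]". ns_inode is a hypothetical helper name.
ns_inode() {
  local target="$1"
  target=${target#*\[}          # drop everything up to and including "["
  printf '%s\n' "${target%\]}"  # drop the trailing "]"
}

# Succeed when two PIDs share the namespace of the given type
# (e.g. net, pid, mnt, uts, ipc). Returns 2 if a link cannot be read.
same_ns() {
  local a b
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# List the namespaces of the current shell, if the kernel exposes them
if [ -d "/proc/$$/ns" ]; then
  for link in /proc/$$/ns/*; do
    printf '%s %s\n' "${link##*/}" "$(ns_inode "$(readlink "$link")")"
  done
fi
```

For example, `same_ns $$ <pid> net` should fail for a process launched via "ip netns exec", since that process lives in a different network namespace than the shell.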

container.sh

First, we'll demonstrate cgroups and namespaces under the control of a shell script. Suppose that we are load testing an application, and we need restrictions such that: (1) a CPU core is reserved for administrative control on the host server, (2) memory use is limited, and (3) the application is allowed to bind to local network interfaces, but is denied external connectivity. The memory restriction confines out-of-memory handling to the application's cgroup, which removes the possibility that the out-of-memory killer will kill a host system service, or that the system will slow to a crawl due to excessive swap paging.

#  Assumes running as root on a 2 core system with over 2G of mem
#  on CentOS 7+ with cgroups enabled.  cgroup memory sub-systems are
#  frequently disabled, but can be enabled with kernel CONFIG flags or
#  kernel command-line parameters: cgroup_enable=memory swapaccount=1
#  The following was tested on a GCE, Google Compute Engine, CentOS 7 instance

#  Install libcgroup to establish the /sys/fs/cgroup file system
yum -y install libcgroup
#  Create the CPU cgroup "mygroup"
mkdir /sys/fs/cgroup/cpuset/mygroup
#  Assign CPU 1 (zero-indexed) to "mygroup",
#  which means that it can NOT use CPU 0.
echo "1" > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus
#  Initialize NUMA "mems" to existing setting.  NUMA is out-of-scope for this example.
#  However, init is required for cgroup cpuset, even if not used.
cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems

#  Create memory cgroup using the same name for consistency
mkdir /sys/fs/cgroup/memory/mygroup
#  Limit "mygroup" to 2 GB (2,000,000,000 bytes) of memory
echo "2000000000" > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

#  Setup a loopback network namespace
ip netns add mynetns
ip netns exec mynetns ip link set dev lo up

#  Script to run the command using these resources.
#  This program will first constrain itself in a cgroup,
#  and then exec into the private ip namespace using the
#  given arguments:
cat >/tmp/container.sh <<'EOF'
#  Contain our process in cgroups. 0 attaches current process
echo 0 > /sys/fs/cgroup/cpuset/mygroup/tasks
echo 0 > /sys/fs/cgroup/memory/mygroup/tasks
#  Execute our process in the private loopback IP namespace
exec ip netns exec mynetns "$@"
EOF

# Example: run a 10 second sleep to verify CPU constraints
bash /tmp/container.sh sleep 10 &
#  Give the background script a moment to attach itself to the cgroups
sleep 1
cat /proc/$!/cpuset
cat /proc/$!/cgroup

#  Consume more memory, nom nom nom...
bash /tmp/container.sh dd if=/dev/zero of=/dev/shm/fill bs=1k count=2048k
#  Check syslog to confirm that an out-of-memory killer ran against "mygroup"
grep mygroup /var/log/messages | tail -n 3
#  Remove file consuming memory
rm -f /dev/shm/fill

#  Install nc if not present
yum -y install nmap-ncat
#  Bind locally within the container via the nc command in background
bash /tmp/container.sh nc -l 8080 &
#  Confirm that its loopback is not available from host
echo "hello" | nc 127.0.0.1 8080
#  Try sending to loopback within the container
echo "hello container" | bash /tmp/container.sh nc 127.0.0.1 8080
#  Hello world?  Nope. Non-loopback connectivity always denied with "Network is unreachable"
bash /tmp/container.sh nc 10.1.1.1 8080
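
To return the host to its prior state, the setup can be undone roughly as follows. This is a sketch under the same assumptions as the walkthrough (root, the same cgroup paths, and the names mygroup and mynetns); the run wrapper and its DRY_RUN variable are hypothetical helpers of mine so that the commands can be previewed safely.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Undo the walkthrough's setup. Needs root when not in dry-run mode.
teardown() {
  # A cgroup directory can only be removed once no tasks remain in it
  run rmdir /sys/fs/cgroup/cpuset/mygroup
  run rmdir /sys/fs/cgroup/memory/mygroup
  run ip netns delete mynetns
  run rm -f /tmp/container.sh
}

# Preview the commands without touching the system
DRY_RUN=1 teardown
```

Calling teardown without DRY_RUN=1 executes the same four commands for real.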

Contain Yourself!

With new technology come bugs and limitations. For example, certain drivers still lack disk quotas, shared kernel logging, and a resolution to security concerns. Consult the current feature and bug lists to ensure that an execution driver meets your requirements.

Let's run through building some Fedora system containers with the same restrictions and assumptions as container.sh:

=== libcontainer

libcontainer became the default execution driver for Docker in version 0.9.

yum -y install docker-io
systemctl start docker
docker run --interactive=true --tty=true --cpuset="1" --memory="2000000000b" fedora /bin/bash
#  exiting shell will also terminate container

=== LXC

LXC, although no longer the default driver for Docker, is still being enhanced for OpenStack with the addition of its partner daemon, LXD. LXD and some recent LXC enhancements appear to be focused on improving container security. Security is a hot, fast-moving topic, which I'm leaving for blog post #4.

#  Assumes running as root with lxc, lxc-templates, and bridge-utils installed, e.g. for GCE CentOS 7 install:
curl http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-2.noarch.rpm > /tmp/epel-release-7-2.noarch.rpm
rpm -ivh /tmp/epel-release-7-2.noarch.rpm
yum -y install lxc lxc-templates bridge-utils
#  Create a container to build a fedora system
lxc-create -n myfedora -t fedora
#  Get the password, e.g.
cat /var/lib/lxc/myfedora/tmp_root_pass
#  Setup the CPU and memory restrictions
echo "lxc.cgroup.cpuset.cpus = 1" >> /var/lib/lxc/myfedora/config
echo "lxc.cgroup.memory.limit_in_bytes = 2000000000" >> /var/lib/lxc/myfedora/config
#  If you are not running libvirt, you'll need to create a virbr0 interface
brctl addbr virbr0
#  Starting container in daemonize mode, then using console command to get around lack of escape code bug upon start-up in recent LXC version
lxc-start --daemon --name myfedora
lxc-console --name myfedora
#  Ctrl+a, q to exit
lxc-stop --name myfedora
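
Because the config lines above are appended with >>, re-running them leaves duplicate keys in the file. A minimal sketch of an idempotent alternative follows; set_option is a hypothetical helper of my own, not part of LXC, and it assumes the simple "key = value" layout used above.

```shell
# Hypothetical helper: set "key = value" in an LXC-style config file,
# replacing any earlier assignment of the same key.
set_option() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp=$(mktemp)
  # Copy every line except previous assignments of this key
  awk -F'=' -v k="$key" \
    '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' \
    "$file" > "$tmp"
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"   # overwrite in place to keep ownership and mode
  rm -f "$tmp"
}

# Demonstrate on a scratch file rather than a real container config
conf=$(mktemp)
set_option "$conf" lxc.cgroup.cpuset.cpus 0
set_option "$conf" lxc.cgroup.cpuset.cpus 1
set_option "$conf" lxc.cgroup.memory.limit_in_bytes 2000000000
cat "$conf"
rm -f "$conf"
```

Pointing the first argument at /var/lib/lxc/myfedora/config would apply the same settings as the two echo lines, with only the latest value per key retained.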

=== systemd-nspawn

Targeted at "building, testing, debugging, and profiling" for system containers only. There are plenty of disclaimers from the developers indicating that it is a development tool only, and not a full-fledged container application. Personal experience confirms that there are a lot of rough edges. The following example does not fully clean up after itself upon termination.

#  Tested on a recent Fedora 20 bare-metal install.
#  Disable auditing on your kernel command line via "audit=0"
#  Create the container
yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
#  Set password
systemd-nspawn --directory /srv/mycontainer passwd
#  Boot container with a private network
systemd-nspawn --boot --private-network --directory /srv/mycontainer
#  Set cgroup limits.
#  Sub-optimal, as they only apply after the container has started
systemctl set-property machine-mycontainer.scope MemoryLimit=2G
#  Or, via cgroup FS
#  echo "2000000000" > /sys/fs/cgroup/memory/machine.slice/machine-mycontainer.scope/memory.limit_in_bytes
#  Hmmm...  Unable to set CPU Affinity via systemd.resource-control
#  Only CPUShares is available
#  Stop from another terminal:
machinectl terminate mycontainer
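
One detail to watch when comparing limits: systemd documents the K, M, G, and T suffixes of MemoryLimit= as base 1024, so 2G is not the same as the 2000000000 bytes written through the cgroup FS. The to_bytes function below is a hypothetical illustration of that arithmetic, not a systemd tool.

```shell
# Convert a size with an optional K/M/G/T suffix to bytes using base 1024.
# to_bytes is a hypothetical helper name; plain numbers pass through as-is.
to_bytes() {
  local size="$1"
  local number="${size%[KMGT]}"
  case "$size" in
    *K) echo $(( number * 1024 )) ;;
    *M) echo $(( number * 1024 * 1024 )) ;;
    *G) echo $(( number * 1024 * 1024 * 1024 )) ;;
    *T) echo $(( number * 1024 * 1024 * 1024 * 1024 )) ;;
    *)  echo "$size" ;;
  esac
}

echo "MemoryLimit=2G        -> $(to_bytes 2G) bytes"
echo "memory.limit_in_bytes -> 2000000000 bytes"
echo "difference            -> $(( $(to_bytes 2G) - 2000000000 )) bytes"
```

To match the other examples exactly, the byte count can be passed to MemoryLimit= without a suffix.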

Right Tool for the Job

My recommendation is to use the right tool for the job. Here are some suggestions:

Trying out containers for streamlining application deployments, or virtual internal system deployments on homogeneous kernels? Use Docker with libcontainer.
Want to securely run container images from 3rd parties or host multi-tenant containers? Look into the LXD project.
Testing the latest Linux distro release candidate on your current kernel with a subset of packages? Try out nspawn.

Numerous articles about Docker have claimed that it will be THE way of deploying software, a.k.a. application sandboxing. Key features are file system snapshots and sub-second launch times. Snapshots are not only useful for recovery from hardware failure and for upgrade rollbacks, but also pave the way for a highly desired feature: live migration. I'll review those capabilities and more in my next blog post, which covers storage backends.

What has your experience been with Docker?

Friday, November 14, 2014
