Friday, December 26, 2014

Linux Container Review 2 of 4: Storage Backends

Continuing the review of Linux containers, we'll explore the file storage formats and tools involved.  The tools around Docker images are in flux, creating incompatibilities between images.  The introduction of Rocket, https://coreos.com/blog/rocket/ , also paves the way towards further image incompatibilities.  Changing storage backends after creating a large deployment may be difficult, so it is important to evaluate them beforehand and pick the one that best suits your needs.  In this blog post I'll investigate storage formats, storage backend drivers, and tooling around container images using a very minimal application deployment.
  • Picture This - Container image and storage concepts
  • What's in the Box? - Explore current storage formats
  • Tar and Beyond - Test tools to create and explore images
  • Efficient Cow - Evaluation of some "copy on write" efficiency
  • Thinking Outside the Box - Summarize (impatient readers should skip to here)

Picture This...

All containers require an image, which provides the visible root file system on which the container will operate.  This includes the directory layouts, file timestamps, Unix modes, and extended attributes from '/' on up.  An image may contain a complete Linux distribution or nothing at all.

To start a container, both an image and a compatible storage driver are required.  The storage driver must be able to provide a layered file system.  The idea is that multiple containers may run concurrently on the same server using the same copy of the data within the image.  Any writes are saved in such a way that only the container that performed them sees the updated version of the file.

The changes made within a container can be persisted via a snapshot.  Some tools use this functionality implicitly, e.g. docker build on a Dockerfile automatically creates a snapshot after each processed statement.  Snapshots can then be used for file system rollback, e.g. reverting to the last known-good state after a failure in a build or installation process.
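As a small illustration (a hypothetical Dockerfile, assuming the busybox-1.23.0 image imported later in this post is available), each build statement shows up as its own layer:

cat > Dockerfile <<'EOF'
FROM busybox-1.23.0
RUN mkdir /data
RUN echo hello > /data/greeting
EOF
sudo docker build --tag=layer-demo .
#  Each statement above appears as a separate snapshot/layer
sudo docker history layer-demo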

Containers can be started with mounts separate from the base image, e.g. Docker "volumes".  External file systems can also be mounted using traditional methods, e.g. via NFS.  This is particularly important if a container relies on a high write volume and/or write performance within a portion of the file system.
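For example, a hedged sketch of a bind-mounted volume that keeps heavy writes out of the image layers (the host path is an assumption, and the busybox-1.23.0 image built later in this post is assumed):

sudo mkdir -p /srv/scratch
sudo docker run --rm=true --volume=/srv/scratch:/scratch busybox-1.23.0 \
    sh -c 'echo "written outside the image layers" > /scratch/out.txt'
cat /srv/scratch/out.txt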

What's in the Box?

Images will be in one of two states: within a storage backend or in a file archive.  Both the Docker and Rocket projects use tar as their archive format for offline storage, https://github.com/appc/spec/blob/master/SPEC.md#app-container-image .  Both specify nested tars, with the outer layer containing metadata files, including a JSON file describing the image format.  The storage backend formats vary greatly.  Some examples of storage backends for Docker are:
  • AUFS
  • LVM
  • BTRFS
  • OverlayFS
These file systems fall into one of two categories:

  • Union file system: Combines base image with changes via union mount, http://en.wikipedia.org/wiki/Union_mount 
  • Copy-on-write file system: Combines base image with changes via snapshot deltas
Either system will allow independent containers to run, re-using files where possible, creating new ones as needed.  Union file systems typically resolve access more quickly, but must create completely new files if any portion of a file is modified within a container.  Snapshot CoW file systems can store deltas at a block level, so that small changes to big files only result in small deltas.

AUFS is a union file system and one of the original storage backends for Docker.  Using AUFS requires compiling a custom kernel, since it was not accepted into the upstream Linux kernel.  It is fast becoming a historical footnote, as other options have been pursued in its place.

LVM, Logical Volume Manager http://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29 , is well known for disk volume management.  The Docker community frequently refers to LVM storage mounts by the device driver name, "devicemapper".  LVM has snapshot capabilities, enabling it to be used as copy-on-write block storage.  The current implementation specifies that an image requires two LVM devices: one for metadata and the other for data.  Structured this way, the LVM snapshot deltas can be seamlessly integrated into the container file system.  Docker has two methods of using LVM for its image storage; a quick check of which one is in use follows the list below.
  • LVM loopback - Two files on the file system are used as virtual devices (via loopback) by the dm-thinp driver.  This is the most widely supported and the simplest method for running images.
  • LVM direct - Two logical volumes are created directly from a volume group.  This requires that an administrator be able to provision volume group storage directly, but provides better performance.
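To check which method is in use, "docker info" reports the backing data and metadata files; with the default loopback setup they are sparse files under /var/lib/docker:

sudo docker info
#  With the loopback default, the sparse backing files live under /var/lib/docker;
#  check actual vs apparent size
ls -lhs /var/lib/docker/devicemapper/devicemapper/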
BTRFS, https://btrfs.wiki.kernel.org/index.php/Main_Page , is a full-stack copy-on-write file system.  It requires its own storage partition, can manage its own RAID, and provides its own file system API.  The project is still under active development, so please note the caveat that you should back up critical data stored on this file system.

OverlayFS is another union file system, similar to AUFS.  However, OverlayFS has been accepted into the upstream Linux kernel.  It is currently available only in kernel 3.18, with more features coming in 3.19+.  Early performance tests indicate that it may be the best option for concurrent use with low write volumes, see http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/ .  The file system is under heavy development, so please note the caveat that you should back up critical data stored on this file system.
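For completeness, a hedged sketch of how it could be selected on Fedora once a 3.18+ kernel and a docker build with the overlay driver are available (the "-s overlay" driver name is an assumption, following the same mechanism used for btrfs later in this post):

#  /etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS="-s overlay"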

Tar and Beyond

Let's put together a basic image of Linux tools to demonstrate image creation and management.  For this example we'll use the busybox project, http://www.busybox.net/about.html, which combines tiny versions of many common UNIX utilities into a single small executable.

Commands to download and statically build busybox, then run it within a container:
# Assumes docker service is running.  Requires sudo access for docker.
# Tested on Fedora 20 with docker 1.3 using devicemapper loopback.
# Requires a static version of glibc, e.g. on Fedora/CentOS/RHEL:
# yum -y install glibc-static

#  Download busybox
mkdir busybox
pushd busybox
curl http://busybox.net/downloads/busybox-1.23.0.tar.bz2 > busybox-1.23.0.tar.bz2
#  Verify md5sum
md5sum busybox-1.23.0.tar.bz2
6dffeb16044c6022476c64744492106a  busybox-1.23.0.tar.bz2
#  Extract and build
tar -xf busybox-1.23.0.tar.bz2
pushd busybox-1.23.0
make defconfig
LDFLAGS="--static" make -j 4 install
#  The build created an _install directory.  Confirm that busybox is statically linked:
ldd _install/bin/busybox
       not a dynamic executable
#  Import into docker
tar -C _install -cvf /tmp/busybox.tar .
tar -C _install -cvf - . | sudo docker import - busybox-1.23.0
#  Import command yields a hash based on files and import time, e.g.
dc4a553fa7b57554157b251d780f87a384705d80a9275f804123507fedef809a
#  Docker command to confirm that the image is loaded:
sudo docker images
#  Now, we can run a container
sudo docker run --name my_busybox busybox-1.23.0 echo foo
foo
#  Show containers.  The "--all" also shows containers which ran, but stopped.
sudo docker ps --all
#  The docker container's "run" status is determined by our run command.
# Since we used echo, it exited immediately.  Docker will not allow us to
# re-use this stopped container for other commands.  However, there is a
# trick: to continue from the container's last state, we can create a new
# image which is a snapshot of the container's last running state:
sudo docker commit my_busybox my_running_busybox
#  Now we can re-use the committed image.  We'll add the "rm" flag to auto-cleanup
# this container after it runs
sudo docker run --rm=true my_running_busybox echo bar
#  We can examine the contents of the full image in an archive format
sudo docker save --output=my_running_busybox.tar my_running_busybox
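#  The saved archive uses the nested-tar layout described earlier: one directory
# per layer plus JSON metadata files
tar -tf my_running_busybox.tar | head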
#  Or, re-run the container and just look at the latest file system
mkdir exported_busybox
sudo docker run --name export_me --detach=true my_running_busybox sleep 10
sudo docker export export_me | tar -xf - -C exported_busybox
diff -r exported_busybox _install
Only in exported_busybox: dev
Only in exported_busybox: .dockerenv
Only in exported_busybox: .dockerinit
Only in exported_busybox: etc
Only in exported_busybox: proc
Only in exported_busybox: sys
#  If you want to play with docker further, here are some sub-command hints
#  The docker "save/load" operations run against docker images, preserving
# snapshot history.
#  "commit" creates a new snapshot explicitly
#  "import" creates an image from a tarball
#  "export" extracts tarball from a container or snapshot
#  When done, here are the commands to cleanup containers and images
sudo docker rm export_me
sudo docker rm my_busybox
sudo docker rmi my_running_busybox
sudo docker rmi busybox-1.23.0
#  Leave build and download directories
popd
popd
#  Remove remaining build and downloads
rm -r busybox

Now that we have a busybox image, we can start multiple containers to demonstrate the file system layering.

Efficient Cow

The goal of the following tests is to understand the performance overhead of using a container to run a command, and the disk utilization needed to store containers and their results.  Given that AUFS is on its way out, and OverlayFS only just arrived with Fedora 21, I'll only demonstrate the copy-on-write backends: LVM loopback, LVM direct, and BTRFS.

The busybox binary is 2.5MB.  We can investigate the disk usage of creating additional containers on these file systems.  The test for each backend follows the same set of steps, timing each along the way:
  1. One shot container: start 10 containers based on the initial busybox image; each writes 64K to a file via dd and then terminates
  2. Commit (take a snapshot of) the state of each container
  3. Run 10 new container instances from the committed snapshots
Code:
#  Check initial disk usage as per "df" and docker info
df -k
sudo docker info
#  loop over initial container creation
CONTAINER_NUM=10
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker run --name my_busybox${CIDX} busybox-1.23.0 dd if=/dev/zero of=/data1 bs=64k count=1
    sudo docker info
    df -k
done > /tmp/docker.run.output
#  loop over container snapshots
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker commit my_busybox${CIDX} my_busybox${CIDX}_run
    sudo docker info
    df -k
done >> /tmp/docker.run.output
# check mem
top -b -n 1
free
#  loop over container sleeps
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker run --name my_busybox${CIDX}_sleep --detach=true my_busybox${CIDX}_run sleep 180
    sudo docker info
    df -k
done >> /tmp/docker.run.output
# check mem
top -b -n 1
free

Test Configuration
These tests require at least a few GB per file system.  9GB for data and 1GB for metadata is sufficient, though the defaults for docker on my system use LVM loopback devicemapper files pre-allocated as 100GB sparse files.  Test system:
Bare Metal: Lenovo Ideapad-Z710
Operating System: Fedora 20 (Heisenbug)
Linux kernel: 3.17.4-200.fc20.x86_64
docker-io-1.3.2
lvm2-2.02.106
btrfs-progs-3.17
On Fedora the docker storage configuration is in:
/etc/sysconfig/docker-storage
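Before each run it is worth confirming which backend is active, e.g.:

sudo docker info | grep 'Storage Driver'
cat /etc/sysconfig/docker-storage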

LVM loopback
#  Docker configuration options
Used the docker defaults
#  One shot container
1.7 seconds to start each container and write the output, with an extra second on the first run.  +1MB of data and +41KB of metadata per run.  "docker info" output agreed with the "df" command output
#  Container commit
1.6s to commit each container.  Each commit required 0.5MB of data and 20KB of metadata
#  Run from snapshot
1.3s to start each container, though 1.5s on the first.  Each container added 1MB of data and 41KB of metadata.
#  Memory
The memory usage jumped around by 2MB per container as the containers were created, but top showed that busybox used the same amount of memory inside the container as it does outside: 4KB of RSS and 2.7MB of VSZ.
#  Notes
You may need to manually clean out /var/lib/docker to reclaim devicemapper space due to a known thinp issue.  I found this to be true after running these tests.  Running the following will wipe out all images and containers, and restore the devicemapper backend to 100GB thin provisioning.  You may need to reboot after disabling docker if you are not allowed to delete stopped containers.  Restoration code, recreating the 100GB "sparse file":

systemctl disable docker
systemctl stop docker
rm -rf /var/lib/docker
mkdir -p /var/lib/docker/devicemapper/devicemapper
dd if=/dev/zero of=/var/lib/docker/devicemapper/devicemapper/data bs=1G count=0 seek=100
systemctl start docker

LVM direct
#  Docker configuration options
 DOCKER_STORAGE_OPTIONS="--storage-opt dm.datadev=/dev/vg1/ldata --storage-opt dm.metadatadev=/dev/vg1/mdata --storage-opt dm.fs=xfs"
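#  For reference, a hypothetical way to create the two volumes referenced above,
# assuming a volume group named vg1 with ~10GB free (sizes follow the guidance above)
lvcreate --name ldata --size 9G vg1
lvcreate --name mdata --size 1G vg1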
#  One shot container
1.7 seconds to start each container and write the output, with an extra second on the first run. +5.5MB data, +8.2Kb metadata per run
#  Container commit
1.6s to commit the containers.  +2.7MB data, +4KB metadata per commit
#  Run from snapshot
1.6s to restart each container.   +5.5MB data, +8.2Kb metadata per run
#  Memory
Same as LVM loopback
#  Notes
Using LVM direct did NOT have the same problem with space reclamation that was seen using LVM loopback.

BTRFS
#  Docker configuration options.  Disable selinux during this test due to btrfs incompatibilities, and switch the storage options.  I created a btrfs partition, mounted it at /mnt/butter, and created a bind mount from /mnt/butter to /var/lib/docker for this test:

systemctl stop docker
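#  (Assumed prep, not part of the original run: format a spare partition as btrfs
# and mount it at /mnt/butter.  The device name is hypothetical; mkfs is destructive.)
# mkfs.btrfs /dev/sdXN
# mkdir -p /mnt/butter
# mount /dev/sdXN /mnt/butter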
mount -o bind /mnt/butter /var/lib/docker
perl -pi -e 's/^OPTIONS=--selinux-enabled/OPTIONS=/' /etc/sysconfig/docker
perl -pi -e 's/^DOCKER_STORAGE_OPTIONS=.*/DOCKER_STORAGE_OPTIONS="-s btrfs"/' /etc/sysconfig/docker-storage
systemctl start docker
#  One shot container
0.92s to start each container and write the output.  The df difference was +256KB per container
#  Container commit
0.17s to commit each container.  After the first commit a one-time use of 672KB of space was seen, and each commit after that added 128KB per container
#  Run from snapshot
0.45s for each container restart to complete.  Again, a one-time jump of 700KB of space was seen, after which each container added 256KB
#  Memory
A spike of 144MB of memory use was seen when restarting 10 containers, but it dropped back down to 8.7MB
#  Notes
The command: "btrfs filesystem df" tended to show half the usage as the "df" command, and almost always showed jumps in the meta data, with the exception of the first commit and restarts.  I may have simply been measuring some delayed disk commit from the prior run.

Many thanks to the following blogger for advanced docker LVM configuration tips:
 http://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/

Thinking Outside the Box

The snapshot and union file systems make it possible to share common, unwritten data among multiple containers on the same host.  Understanding the underlying implementation provides some insight into how containers can best be used.  Writes to container file systems will be slower and generate ever-increasing deltas; however, they also give developers the ability to roll back the file system.

Container images will not be too unfamiliar to people already accustomed to packaging directories with the tar command.  The key detail in managing containers is dealing with the extra snapshot layers, which are sometimes created implicitly, or must be created explicitly for container transfers or backups.

Start times of containers were all shown to be quick.  However, a spike in memory use is typically seen when starting containers.  Also, the disk usage of a running container's delta was almost always about twice that of its committed snapshot, at least with the relatively small binary sizes used in these tests.

Next it is time to look at the ecosystem and tools for managing containers.  How do you automate builds and upgrades of images, how do you share and/or deploy them, how do you link container services to the outside world, and how do you manage large numbers of containers across multiple systems?  Stay tuned as the investigation continues.

What has your experience been with docker storage, and what other tests would you like to see?

Friday, November 14, 2014

Linux Container Review 1 of 4: Execution Drivers

Docker offers easy-to-use containers as a high-performance, low-footprint VM alternative in a variety of settings.  Before integrating containers into my workflow, I wanted to understand the components, features, and restrictions.  In this blog post I'll investigate the following:
  • Container means... - Definitions and the component technologies of a container
  • Bottoms Up! - The kernel features behind execution drivers: cgroups and namespaces
  • container.sh - A hand-rolled container built from cgroups and namespaces
  • Contain Yourself! - Current execution drivers: libcontainer, LXC, and systemd-nspawn
  • Right tool for the job - Recommendations (impatient readers should skip to here)

Container means...

"Lightweight container", or simply "container" is too broad.  Documentation from the execution drivers disagree somewhat, but all agree that a "container" runs on the same shared kernel of the host, differentiating it from a VM.  For clarification I'll borrow from systemd nspawn's sub-categorization to clarify two types of containers: system versus application sandbox.  "system containers" have a mostly complete OS with its own init system.  "application sandbox containers", on the other hand, might run only one or some small subset of applications.  System containers still share the kernel, so they would still typically better utilize server global resources than virtual machines.

We can better understand containers by categorizing the component technologies.  Partly borrowing Docker terminology, the container components are:

  1. Execution Driver
  2. Storage Backend
  3. Management Layer
  4. Security Features

When migrating existing workloads to Linux containers, I find it easier to move and test individual components using a "bottom up" perspective.

Bottoms Up!

Execution drivers for containers on Linux have been made feasible primarily by two kernel features:

  • cgroups
  • namespaces

cgroups, Linux Control Groups, provide the functionality to constrain a process hierarchy to a given set of resources, e.g. core affinity, maximum memory use, etc.  Process hierarchies are maintained regardless of whether a process has daemonized itself, e.g. via the "double fork" technique.  Classic system resource management tools like taskset and ulimit can provide similar restrictions, but are not as easily applied and enforced on an entire process hierarchy.  Common knowledge of cgroups has been increasing, and they have a file system API easily accessed by system administrators.

Namespaces, on the other hand, are accessed only via system calls, and are typically not seen outside of execution drivers and a handful of system utilities, like "unshare" and "ip" ("nsenter" is also available if you have util-linux v2.23 or later).  A namespace wraps a global system resource for a set of processes so that (1) the resource can be translated, and (2) the global resource state cannot be seen.  The translation performed is usually to disguise a local resource as the global resource, making the process(es) appear to have sole access.
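As a quick illustration of namespace isolation using those utilities (a minimal sketch, run as root):

#  Give a shell its own UTS (hostname) namespace and change the hostname inside it
unshare --uts /bin/sh -c 'hostname sandboxed; hostname'
#  The host's hostname is unaffected
hostname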

container.sh

First, we'll demonstrate the functionality of cgroups and namespaces as controlled by a shell script.  Suppose that we are load testing an application, and we need restrictions such that: (1) a CPU core is reserved for administrative control on the host server, (2) memory use is limited, and (3) the application is allowed to bind to local network interfaces, but is prevented from external connectivity.  The memory restriction enables targeted out-of-memory handling, which removes the possibility that the out-of-memory killer will kill a host system service, or that the system will slow to a crawl due to excessive swap paging.

#  Assumes running as root on a 2 core system with over 2G of mem
#  on CentOS 7+ with cgroups enabled.  cgroup memory sub-systems are
#  frequently disabled, but can be enabled with kernel CONFIG flags or
#  kernel command-line parameters: cgroup_enable=memory swapaccount=1
#  The following was tested on a GCE, Google Compute Engine, CentOS 7 instance

#  Install libcgroup to establish the /sys/fs/cgroup file system
yum -y install libcgroup
#  Create the CPU cgroup "mygroup"
mkdir /sys/fs/cgroup/cpuset/mygroup
#  Assign CPU 1, zero indexed in "mygroup",
#  which means that it can NOT use CPU 0.
echo "1" > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus
#  Initialize NUMA "mems" to existing setting.  NUMA is out-of-scope for this example.
#  However, init is required for cgroup cpuset, even if not used.
cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems

#  Create memory cgroup using the same name for consistency
mkdir /sys/fs/cgroup/memory/mygroup
#  Limit "mygroup" to 2GB of memory
echo "2000000000" > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

#  Setup a loopback network namespace
ip netns add mynetns
ip netns exec mynetns ip link set dev lo up

#  Script to run the command using these resources.
#  This program will first constrain itself in a cgroup,
#  and then exec into the private ip namespace using the
#  given arguments:
cat >/tmp/container.sh <<'EOF'
#  Contain our process in cgroups. 0 attaches current process
echo 0 > /sys/fs/cgroup/cpuset/mygroup/tasks
echo 0 > /sys/fs/cgroup/memory/mygroup/tasks
#  Execute our process in the private loopback IP namespace
exec ip netns exec mynetns "$@"
EOF

# Example: run a 10 second sleep to verify CPU constraints
bash /tmp/container.sh sleep 10 &
cat /proc/$!/cpuset
cat /proc/$!/cgroup

#  Consume more memory, nom nom nom...
bash /tmp/container.sh dd if=/dev/zero of=/dev/shm/fill bs=1k count=2048k
#  Check syslog to confirm that an out-of-memory killer ran against "mygroup"
grep mygroup /var/log/messages | tail -n 3
#  Remove file consuming memory
rm -f /dev/shm/fill

#  Install nc if not present
yum -y install nmap-ncat
#  Bind locally within the container via the nc command in background
bash /tmp/container.sh nc -l 8080 &
#  Confirm that its loopback is not available from host
echo "hello" | nc 127.0.0.1 8080
#  Try sending to loopback within the container
echo "hello container" | bash /tmp/container.sh nc 127.0.0.1 8080
#  Hello world?  Nope. Non-loopback connectivity always denied with "Network is unreachable"
bash /tmp/container.sh nc 10.1.1.1 8080

See also the Docker metrics link

Contain Yourself!

With new technology come bugs and limitations.  For example, certain drivers still lack disk quotas, shared kernel logging, and resolution of some security concerns.  Consult the current feature and bug lists to ensure that an execution driver meets your requirements.

Let's run through building some Fedora system containers with the same restrictions and assumptions as container.sh:

=== libcontainer

The execution driver promoted to be the default for Docker from version 0.10 onward.

yum -y install docker-io
systemctl start docker
docker run --interactive=true --tty=true --cpuset="1" --memory="2000000000b" fedora /bin/bash
#  exiting shell will also terminate container
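#  Confirm which execution driver is active ("native" indicates libcontainer):
docker info | grep 'Execution Driver'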

=== LXC

LXC, although no longer the default driver for Docker, is still being enhanced for OpenStack with the addition of its partner daemon, LXD.  LXD and some recent LXC enhancements appear to be focused on improving container security.  Security is a hot, fast-moving topic, which I'm leaving for blog post #4.

#  Assumes running as root with lxc, lxc-templates, and bridge-utils installed, e.g. for GCE CentOS 7 install:
curl http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-2.noarch.rpm > /tmp/epel-release-7-2.noarch.rpm
rpm -ivh /tmp/epel-release-7-2.noarch.rpm
yum -y install lxc lxc-templates bridge-utils
#  Create a container to build a fedora system
lxc-create -n myfedora -t fedora
#  Get the password, e.g.
cat /var/lib/lxc/myfedora/tmp_root_pass
#  Setup the CPU and memory restrictions
echo "lxc.cgroup.cpuset.cpus = 1" >> /var/lib/lxc/myfedora/config
echo "lxc.cgroup.memory.limit_in_bytes = 2000000000" >> /var/lib/lxc/myfedora/config
#  If you are not running libvirt, you'll need to create a virbr0 interface
brctl addbr virbr0
#  Start the container in daemonized mode, then use the console command to work around the lack-of-escape-code bug at start-up in recent LXC versions
lxc-start --daemon --name myfedora
lxc-console --name myfedora
#  Ctrl+a, q to exit
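#  Optional check while the container is still running: read back the applied limits
lxc-cgroup --name myfedora cpuset.cpus
lxc-cgroup --name myfedora memory.limit_in_bytes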
lxc-stop --name myfedora

=== systemd-nspawn

Targeted at "building, testing, debugging, and profiling" for system containers only.  There are plenty of disclaimers from the developers indicated that it was a development tool only, and not a full fledged container app.  Personal experience confirms that there are a lot of rough edges.  The following example does not clean itself fully after termination.
#  Tested from a recent Fedora 20 bare-metal install.
#  Disable audit on your kernel cmdline via "audit=0"
#  Create the container
yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
#  Set password
systemd-nspawn --directory /srv/mycontainer passwd
#  Boot container with a private network
systemd-nspawn --boot --private-network --directory /srv/mycontainer
#  Set cgroups limits.
#  Sub-optimal, as it only applies after the container is started
systemctl set-property machine-mycontainer.scope MemoryLimit=2G
#  Or, via cgroup FS
#  echo "2000000000" > /sys/fs/cgroup/memory/machine.slice/machine-mycontainer.scope/memory.limit_in_bytes
#  Hmmm...  Unable to set CPU Affinity via systemd.resource-control
#  Only CPUShares is available
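#  From another terminal, the running container appears in the machine registry:
machinectl list
machinectl status mycontainer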
#  Stop from another terminal:
machinectl terminate mycontainer

Right tool for the job

My recommendation is to use the right tool for the job.  Here are some suggestions:
  • Trying out containers for streamlining application deployments, or virtual internal system deployments on homogeneous kernels? Use Docker with libcontainer.
  • Want to securely run container images from 3rd parties or host multi-tenant containers?  Look into the LXD project.
  • Testing the latest Linux distro release candidate on your current kernel with a subset of packages?  Try out nspawn.
Numerous articles about Docker have claimed that it will be THE way of deploying software, a.k.a. application sandboxing.  Key features are file system snapshots and sub-second launch times.  Snapshots are not only useful for recovery from hardware failures and upgrade rollbacks, but also lead toward a highly desired feature under development: live migration.  I'll review those capabilities and more in my next blog post, reviewing storage backends.

What has your experience been with docker?