Friday, December 26, 2014

Linux Container Review 2 of 4: Storage Backends

Continuing the review of Linux containers, we'll explore the file storage and tools involved.  The tools around Docker images are in flux, creating incompatibilities between images.  The introduction of Rocket, https://coreos.com/blog/rocket/ , also paves the way toward further image incompatibilities.  Changing storage backends after creating a large deployment may be difficult, so it is important to evaluate them beforehand and pick the one that best suits your needs.  In this blog post I'll investigate the storage formats, storage backend drivers, and tooling around container images of a very minimal application deployment.
  • Picture This - Container image and storage concepts
  • What's in the Box? - Explore current storage formats
  • Tar and Beyond - Test tools to create and explore images
  • Efficient Cow - Evaluation of some "copy on write" efficiency
  • Thinking Outside the Box - Summarize (impatient readers should skip to here)

Picture This...

All containers require an image, which provides the visible root file system on which the container will operate.  This includes the directory layouts, file timestamps, Unix modes, and extended attributes from '/' on up.  An image may contain a complete Linux distribution or nothing at all.

To start a container, both an image and a compatible storage driver are required.  The storage driver must be able to provide a layered file system.  The idea is that multiple containers may run concurrently on the same server using the same copy of the data within the image.  Any write performed is saved in such a way that only the container that made it can see the updated version of the file.

The changes made within a container can be persisted via a snapshot.  Some tools use this functionality implicitly, e.g. a docker build of a Dockerfile will create a snapshot automatically after each processed statement.  Snapshots can then be used for file system rollback.  A common use case for rollback is reverting a failed build or installation process to the point just before the problem appeared.
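
As a minimal sketch of that implicit snapshotting (the Dockerfile and image tag here are illustrative only, assuming a local base image such as the busybox-1.23.0 image built later in this post), each statement produces its own layer, which "docker history" then lists:

cat > Dockerfile <<'EOF'
FROM busybox-1.23.0
RUN mkdir /data
RUN echo "hello" > /data/greeting
EOF
sudo docker build -t layer_demo .
#  Each statement above appears as a separate snapshot/layer
sudo docker history layer_demo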

Containers can be started with mounts separate from the base image, e.g. Docker "volumes".  External file systems can also be mounted using traditional methods, e.g. via NFS.  This is particularly important if a container relies on high write volume and/or write performance within a portion of the file system.
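
A minimal sketch of a volume mount (the host path, NFS export, and image name below are illustrative): the host directory is mounted into the container, bypassing the image's copy-on-write layer for that path.

#  The host path could itself be a traditional NFS mount, e.g.:
# sudo mount -t nfs fileserver:/export/appdata /srv/appdata
sudo docker run --rm=true --volume=/srv/appdata:/data busybox-1.23.0 ls /data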

What's in the Box?

Images will be in one of two states: within a storage backend or in a file archive.  Both the Docker and Rocket projects use tar as their archive format for offline storage, https://github.com/appc/spec/blob/master/SPEC.md#app-container-image .  Both specify nested tars, with the outer layer containing metadata files, including a JSON file describing the image format.  The storage backend formats vary greatly.  Some examples of storage backends for Docker are:
  • AUFS
  • LVM
  • BTRFS
  • OverlayFS
These file systems fall into one of two categories:

  • Union file system: Combines base image with changes via union mount, http://en.wikipedia.org/wiki/Union_mount 
  • Copy-on-write file system: Combines base image with changes via snapshot deltas
Either approach allows independent containers to run, re-using files where possible and creating new ones as needed.  Union file systems typically resolve access more quickly, but must create a completely new copy of a file if any portion of it is modified within a container.  Snapshot CoW file systems can store deltas at the block level, so that small changes to big files only result in small deltas.

AUFS is a union file system and one of the original storage backends for Docker.  Using AUFS required compiling a custom kernel to make it available.  It is fast becoming a historical footnote, as other options were pursued after AUFS was not accepted into the upstream Linux kernel.

LVM, the Logical Volume Manager http://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29 , is well known for disk volume management.  The Docker community frequently refers to LVM storage mounts by the device driver name, "devicemapper."  LVM has snapshot capabilities, enabling it to be used as copy-on-write block storage.  The current implementation specifies that an image requires two LVM mounts: one for metadata, and the other for data.  Structured this way, the LVM snapshot deltas can be seamlessly integrated into the container file system.  Docker has two methods of using LVM for its image storage.
  • LVM loopback - Two files on the file system are used as virtual devices via the dm-thinp loopback drivers.  This is the most widely supported, and simplest, method for running images (see the sketch after this list).
  • LVM direct - Two logical volumes are created directly from a volume group.  This requires that an administrator be able to provision volume group storage directly, but it provides better performance.
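To see the sparse loopback files backing the default devicemapper setup (a minimal sketch; the path below is the Fedora default, and is also used in the cleanup commands later in this post):

sudo ls -lsh /var/lib/docker/devicemapper/devicemapper/
#  The files are sparse; compare the apparent size against the blocks actually used
sudo du -sh --apparent-size /var/lib/docker/devicemapper/devicemapper/*
sudo du -sh /var/lib/docker/devicemapper/devicemapper/*
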
BTRFS, https://btrfs.wiki.kernel.org/index.php/Main_Page , is a full-stack copy-on-write file system.  It requires its own storage partition, can manage its own RAID, and provides its own file system API.  The project is still under active development, so please note the caveat that you should back up critical data stored on this file system.

OverlayFS is another union file system, similar to AUFS.  However, OverlayFS has upstream Linux kernel acceptance.  It is currently available only in kernel 3.18, with more features coming in 3.19+.  Early performance tests indicate that it may be optimal for concurrent use given low write volumes, see http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/ .  The file system is under heavy development, so please note the caveat that you should back up critical data stored on this file system.
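
To illustrate the union mount idea directly, here is a minimal sketch of a raw OverlayFS mount on a 3.18+ kernel (the directories are placeholders; Docker's overlay storage driver builds on the same mechanism):

#  The lower directory acts as the read-only image layer; writes land in the upper directory
mkdir -p /tmp/overlay/{lower,upper,work,merged}
echo "from the image" > /tmp/overlay/lower/file.txt
sudo mount -t overlay overlay -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work /tmp/overlay/merged
echo "container change" | sudo tee /tmp/overlay/merged/new.txt
#  The change exists only in the upper layer; the lower layer is untouched
ls /tmp/overlay/upper
sudo umount /tmp/overlay/merged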

Tar and Beyond

Let's put together a basic image of Linux tools to demonstrate image creation and management.  For this example we'll use the busybox project, http://www.busybox.net/about.html, which combines tiny versions of many common UNIX utilities into a single small executable.

Commands to download busybox, build it statically, and run it within a container:
# Assumes docker service is running.  Requires sudo access for docker.
# Tested on Fedora 20 with docker 1.3 using devicemapper loopback.
# Requires a static version of glibc, e.g. on Fedora/CentOS/RHEL:
# yum -y install glibc-static

#  Download busybox
mkdir busybox
pushd busybox
curl http://busybox.net/downloads/busybox-1.23.0.tar.bz2 > busybox-1.23.0.tar.bz2
#  Verify md5sum
md5sum busybox-1.23.0.tar.bz2
6dffeb16044c6022476c64744492106a  busybox-1.23.0.tar.bz2
#  Extract and build
tar -xf busybox-1.23.0.tar.bz2
pushd busybox-1.23.0
make defconfig
LDFLAGS="--static" make -j 4 install
#  This created an _install directory.  Confirm that busybox is statically linked:
ldd _install/bin/busybox
       not a dynamic executable
#  Import into docker
tar -C _install -cvf /tmp/busybox.tar .
tar -C _install -cvf - . | sudo docker import - busybox-1.23.0
#  Import command yields a hash based on files and import time, e.g.
dc4a553fa7b57554157b251d780f87a384705d80a9275f804123507fedef809a
#  Docker command to confirm that the image is loaded:
sudo docker images
#  Now, we can run a container
sudo docker run --name my_busybox busybox-1.23.0 echo foo
foo
#  Show containers.  The "--all" flag also shows containers which have run but are now stopped.
sudo docker ps --all
#  The docker container "run" status is determined by our run command.
# Since we used echo, it exited immediately.  Docker will not allow us to
# re-use this container for other commands while it is stopped.  However,
# there is a trick: to continue from the last state of the container,
# we can create a new image which is a snapshot of the last running
# state of the container:
sudo docker commit my_busybox my_running_busybox
#  Now we can re-use the committed image.  We'll add the "--rm" flag to auto-cleanup
# this container after it runs
sudo docker run --rm=true my_running_busybox echo bar
#  We can examine the contents of the full image in an archive format
sudo docker save --output=my_running_busybox.tar my_running_busybox
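#  The saved archive uses the nested tar layout described earlier: a top-level
# "repositories" file plus one directory per layer snapshot containing
# VERSION, json, and layer.tar (layout as of docker 1.3; subject to change)
sudo tar -tf my_running_busybox.tar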
#  Or, re-run the container and just look at the latest file system
mkdir exported_busybox
sudo docker run --name export_me --detach=true my_running_busybox sleep 10
sudo docker export export_me | tar -xf - -C exported_busybox
diff -r exported_busybox _install
Only in exported_busybox: dev
Only in exported_busybox: .dockerenv
Only in exported_busybox: .dockerinit
Only in exported_busybox: etc
Only in exported_busybox: proc
Only in exported_busybox: sys
#  If you want to play with docker further, here are some sub-command hints
#  The docker "save/load" operations run against docker images, preserving
# snapshot history.
#  "commit" creates a new snapshot explicitly
#  "import" creates an image from a tarball
#  "export" extracts tarball from a container or snapshot
#  When done, here are the commands to cleanup containers and images
sudo docker rm export_me
sudo docker rm my_busybox
sudo docker rmi busybox-1.23.0
sudo docker rmi my_running_busybox
#  Leave build and download directories
popd
popd
#  Remove remaining build and downloads
rm -r busybox

Now that we have a busybox image, we can start multiple containers to demonstrate the file system layering.

Efficient Cow

The goal of the following tests is to understand the performance overhead of using a container to run a command, and the disk utilization needed to store containers and their results.  Given that AUFS is on its way out, and OverlayFS just arrived with Fedora 21, I'll only demonstrate some copy-on-write backends: LVM loopback, LVM direct, and BTRFS.

The busybox binary is 2.5Mb.  We can investigate the disk usage of creating additional containers on these file systems.  The test for each backend will follow the same set of steps, timing each along the way:
  1. One-shot containers: start 10 containers based on the initial busybox image, each writing 64K to a file via dd and then terminating
  2. Commit (take a snapshot of) the state of each container
  3. Run 10 new container instances from the committed snapshots
Code:
#  Check initial disk usage as per "df" and docker info
df -k
sudo docker info
#  loop over initial container creation
CONTAINER_NUM=10
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker run --name my_busybox${CIDX} busybox-1.23.0 dd if=/dev/zero of=/data1 bs=64k count=1
    sudo docker info
    df -k
done > /tmp/docker.run.output
#  loop over container snapshots
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker commit my_busybox${CIDX} my_busybox${CIDX}_run
    sudo docker info
    df -k
done >> /tmp/docker.run.output
# check mem
top -b -n 1
free
#  loop over container sleeps
time for CIDX in $(seq 1 ${CONTAINER_NUM})
do
    time sudo docker run --name my_busybox${CIDX}_sleep --detach=true my_busybox${CIDX}_run sleep 180
    sudo docker info
    df -k
done >> /tmp/docker.run.output
# check mem
top -b -n 1
free

Test Configuration
These tests require at least a few GB per file system.  9GB for data and 1GB for metadata is sufficient, though the defaults for docker on my system are LVM loopback devicemapper files thin-provisioned at 100GB.  Test system:
Bare Metal: Lenovo Ideapad-Z710
Operating System: Fedora 20 (Heisenbug)
Linux kernel: 3.17.4-200.fc20.x86_64
docker-io-1.3.2
lvm2-2.02.106
btrfs-progs-3.17
On Fedora the docker storage configuration is in:
/etc/sysconfig/docker-storage
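
To confirm which backend is currently active (a minimal sketch; the docker info fields vary somewhat between versions):

cat /etc/sysconfig/docker-storage
sudo docker info | grep -i -A 4 'storage driver'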

LVM loopback
#  Docker configuration options
Took docker defaults
#  One shot container
1.7 seconds to start each container and write the output, with an extra second on the first run.  +1MB data/run, +41K metadata/run.  The "docker info" output agreed with the "df" command output.
#  Container commit
1.6s to commit each container.  Each commit required 0.5Mb of data and 20Kb of metadata
#  Run from snapshot
1.3s to start each container, though 1.5s on the first.  Each container added 1Mb of data and 41K of metadata.
#  Memory
The memory usage jumped around by 2MB per container as the containers were created, but top showed that busybox used the same amount of memory inside the container as it does outside: 4Kb of RSS, and 2.7Mb of VSZ.
#  Notes
You may need to manually clean out your /var/lib/docker devicemapper directory to reclaim space, due to a known thinp issue.  I found this to be true after running these tests.  Running the following will wipe out all images and containers, and restore the devicemapper backend to 100GB of thin provisioning.  You may need to reboot after disabling docker if you are not allowed to delete stopped containers.  The restoration code below recreates the 100GB "sparse file":

systemctl disable docker
systemctl stop docker
rm -rf /var/lib/docker
mkdir -p /var/lib/docker/devicemapper/devicemapper
dd if=/dev/zero of=/var/lib/docker/devicemapper/devicemapper/data bs=1G count=0 seek=100
systemctl start docker

LVM direct
#  Docker configuration options
 DOCKER_STORAGE_OPTIONS="--storage-opt dm.datadev=/dev/vg1/ldata --storage-opt dm.metadatadev=/dev/vg1/mdata --storage-opt dm.fs=xfs"
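For reference, a minimal sketch of provisioning the two logical volumes referenced above, assuming an existing volume group named vg1 and the 9GB data / 1GB metadata sizing mentioned earlier:
sudo lvcreate -L 9G -n ldata vg1
sudo lvcreate -L 1G -n mdata vg1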
#  One shot container
1.7 seconds to start each container and write the output, with an extra second on the first run. +5.5MB data, +8.2Kb metadata per run
#  Container commit
1.6s to commit the containers.  +2.7MB data, +4kb per commit
#  Run from snapshot
1.6s to restart each container.   +5.5MB data, +8.2Kb metadata per run
#  Memory
Same as LVM loopback
#  Notes
Using LVM direct did NOT have the same problems with space reclamation that were seen using LVM loopback.

BTRFS
#  Docker configuration options.  SELinux was disabled during this test due to btrfs incompatibilities, and the storage options were switched.  I created a btrfs partition, mounted it locally at /mnt/butter, and bind-mounted /mnt/butter onto /var/lib/docker for this test:

systemctl stop docker
mount -o bind /mnt/butter /var/lib/docker
perl -pi -e 's/^OPTIONS=--selinux-enabled/OPTIONS=/' /etc/sysconfig/docker
perl -pi -e 's/^DOCKER_STORAGE_OPTIONS=.*/DOCKER_STORAGE_OPTIONS="-s btrfs"/' /etc/sysconfig/docker-storage
systemctl start docker
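
For reference, a sketch of creating and mounting the btrfs partition mentioned above (the device name is a placeholder for whichever partition you dedicate to it):
sudo mkfs.btrfs /dev/sdXN
sudo mkdir -p /mnt/butter
sudo mount /dev/sdXN /mnt/butter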
#  One shot container
0.92s to start each container and write the output.  The df difference was +256Kb per container.
#  Container commit
0.17s to commit each container.  After the first commit a one-time use of 672Kb of space was seen, and each commit after that added 128Kb per container.
#  Run from snapshot
0.45s for each container restart to complete.  Again, another jump of roughly 700Kb of space was seen at first, and after that it was +256Kb per container.
#  Memory
A spike of 144Mb of memory use was seen when restarting the 10 containers, but it dropped back down to 8.7Mb.
#  Notes
The "btrfs filesystem df" command tended to show half the usage of the "df" command, and almost always showed jumps in the metadata, with the exception of the first commit and restarts.  I may simply have been measuring some delayed disk commit from the prior run.

Many thanks to the following blogger for advanced docker LVM configuration tips:
 http://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/

Thinking Outside the Box

The snapshot and union file systems make it possible to share common, unwritten data among multiple containers on the same host.  Understanding the underlying implementation provides some insight into how it can best be used.  Writes to container file systems will be slower and generate ever-increasing deltas; however, they do give developers the ability to roll back the file system.

Container images will not be too unfamiliar to people already accustomed to packaging directories with the tar command.  The key detail in managing containers is dealing with the extra snapshot layers, which are sometimes created implicitly and sometimes must be created explicitly for container transfers or backups.

Start times of containers were all shown to be quick.  However, a spike in memory use is typically seen when starting containers.  Also, the disk usage was almost always 2:1, at least with the relatively small binary sizes used in these test examples.
Next it is time to look at the ecosystem and tools for managing containers: how do you automate builds and upgrades of images, how do you share and/or deploy them, how do you link container services to the outside world, and how do you manage large numbers of containers across multiple systems?  Stay tuned as the investigation continues.

What has your experience been with docker storage, and what other tests would you like to see?