- Container means... - Define "container" and list its components
- Bottoms Up! - Delve into execution driver implementation features: cgroups and namespaces
- container.sh - Use a shell script to demonstrate cgroups and namespaces
- Contain Yourself! - Explore some execution drivers: libcontainer, LXC, and systemd nspawn
- Right Tool for the Job - Summarize (impatient readers should skip to here)
Container means...
"Lightweight container", or simply "container" is too broad. Documentation from the execution drivers disagree somewhat, but all agree that a "container" runs on the same shared kernel of the host, differentiating it from a VM. For clarification I'll borrow from systemd nspawn's sub-categorization to clarify two types of containers: system versus application sandbox. "system containers" have a mostly complete OS with its own init system. "application sandbox containers", on the other hand, might run only one or some small subset of applications. System containers still share the kernel, so they would still typically better utilize server global resources than virtual machines.We can better understand containers by categorizing the component technologies. Partly borrowing Docker terminology, the container components are:
- Execution Driver
- Storage Backend
- Management Layer
- Security Features
When migrating existing workloads to Linux containers, I find it easier to move and test individual components using a "bottom up" perspective.
Bottoms Up!
The execution driver for containers on Linux has been made feasible primarily by two kernel features:
- cgroups
- namespaces
cgroups (Linux Control Groups) provide the functionality to constrain a process hierarchy to a given set of resources, e.g. core affinity, maximum memory use, etc. The process hierarchy is maintained regardless of whether a process has daemonized itself, e.g. via the "double fork" technique. Classic resource management tools like taskset and ulimit can provide similar restrictions, but they are not as easily applied and enforced on an entire process hierarchy. Familiarity with cgroups is growing, and their file system API is easily accessed by system administrators.
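For example, the file system API can be browsed with nothing more than ls and cat (a minimal sketch, assuming a cgroup v1 hierarchy mounted under /sys/fs/cgroup, as on CentOS 7):

# List the cgroup controllers this kernel supports
cat /proc/cgroups
# Each controller appears as a directory under /sys/fs/cgroup
ls /sys/fs/cgroup
# Tunables are plain files, e.g. the CPUs available to the root cpuset
cat /sys/fs/cgroup/cpuset/cpuset.cpus
# Processes currently attached to the root cpuset cgroup
cat /sys/fs/cgroup/cpuset/tasks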
Namespaces, on the other hand, are accessed only via system calls and are typically not seen outside of execution drivers and a handful of system utilities such as "unshare" and "ip" ("nsenter" is also available if you have util-linux v2.23 or later). A namespace wraps a global system resource for a set of processes so that (1) the resource can be translated, and (2) the global resource state cannot be seen. The translation usually disguises a local resource as the global resource, making the process(es) appear to have sole access.
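As a quick illustration (a minimal sketch, assuming util-linux's unshare is installed and you are running as root), the unshare utility can drop a shell into fresh namespaces:

# Start a shell in new UTS and mount namespaces
unshare --uts --mount /bin/bash
# Inside the new UTS namespace, changing the hostname does not
# affect the host's global hostname
hostname container-test
hostname
# Exit the shell to discard the namespaces
exit
# Back on the host, the original hostname is untouched
hostname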
container.sh
First, we'll demonstrate the functionality of cgroups and namespaces as controlled by a shell script. Suppose that we are load testing an application and need restrictions such that: (1) a CPU core is reserved for administrative control on the host server, (2) memory use is limited, and (3) the application may bind to local network interfaces but is prevented from making external connections. The memory restriction enables targeted out-of-memory handling, which removes the possibility that the out-of-memory killer will kill a host system service, or that the system will slow to a crawl due to excessive swap paging.

# Assumes running as root on a 2 core system with over 2G of mem
# on CentOS 7+ with cgroups enabled. cgroup memory sub-systems are
# frequently disabled, but can be enabled with kernel CONFIG flags or
# kernel command-line parameters: cgroup_enable=memory swapaccount=1
# The following was tested on a GCE, Google Compute Engine, CentOS 7 instance

# Install libcgroup to establish the /sys/fs/cgroup file system
yum -y install libcgroup

# Create the CPU cgroup "mygroup"
mkdir /sys/fs/cgroup/cpuset/mygroup

# Assign CPU 1, zero indexed, to "mygroup",
# which means that it can NOT use CPU 0.
echo "1" > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus

# Initialize NUMA "mems" to the existing setting. NUMA is out-of-scope for this
# example, but the cpuset cgroup requires "mems" to be initialized even if not used.
cat /sys/fs/cgroup/cpuset/cpuset.mems > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems

# Create the memory cgroup using the same name for consistency
mkdir /sys/fs/cgroup/memory/mygroup

# Set a 2GB memory limit on "mygroup"
echo "2000000000" > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes

# Set up a loopback-only network namespace
ip netns add mynetns
ip netns exec mynetns ip link set dev lo up

# Script to run a command using these resources.
# It first constrains itself in the cgroups, and then
# execs into the private IP namespace using the given arguments:
cat >/tmp/container.sh <<'EOF'
# Contain our process in the cgroups. Writing 0 attaches the current process
echo 0 > /sys/fs/cgroup/cpuset/mygroup/tasks
echo 0 > /sys/fs/cgroup/memory/mygroup/tasks
# Execute our process in the private loopback IP namespace
exec ip netns exec mynetns "$@"
EOF

# Example: run a 10 second sleep to verify CPU constraints
bash /tmp/container.sh sleep 10 &
cat /proc/$!/cpuset
cat /proc/$!/cgroup

# Consume more memory, nom nom nom...
bash /tmp/container.sh dd if=/dev/zero of=/dev/shm/fill bs=1k count=2048k

# Check syslog to confirm that the out-of-memory killer ran against "mygroup"
grep mygroup /var/log/messages | tail -n 3

# Remove the file consuming memory
rm -f /dev/shm/fill

# Install nc if not present
yum -y install nmap-ncat

# Bind locally within the container via the nc command in the background
bash /tmp/container.sh nc -l 8080 &

# Confirm that its loopback is not available from the host
echo "hello" | nc 127.0.0.1 8080

# Try sending to loopback within the container
echo "hello container" | bash /tmp/container.sh nc 127.0.0.1 8080

# Hello world? Nope. Non-loopback connectivity is always denied with "Network is unreachable"
bash /tmp/container.sh nc 10.1.1.1 8080
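When the experiment is over, the same file system and ip commands can be used to tidy up. This is a sketch assuming nothing else was attached to "mygroup" or "mynetns" and no unrelated process matches the pkill pattern:

# Stop any leftover background listener from the examples above
pkill -f "nc -l 8080"
# Remove the private network namespace
ip netns delete mynetns
# cgroup directories can only be removed once no tasks remain in them
rmdir /sys/fs/cgroup/cpuset/mygroup
rmdir /sys/fs/cgroup/memory/mygroup
rm -f /tmp/container.sh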
See also the Docker metrics link
Contain Yourself!
With new technology come bugs and limitations. For example, certain drivers still lack disk quotas, shared kernel logging, and resolution of various security concerns. Consult the current feature and bug lists to ensure that an execution driver meets your requirements.

Let's run through building some Fedora system containers with the same restrictions and assumptions as container.sh:
=== libcontainer
An execution driver promoted to be the default driver for Docker from version 0.10 onward.

yum -y install docker-io
systemctl start docker
docker run --interactive=true --tty=true --cpuset="1" --memory="2000000000b" fedora /bin/bash
# exiting the shell will also terminate the container
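To confirm that the flags translated into cgroup settings, you can inspect the running container from another terminal. This is a sketch; the cgroup path layout varies with the Docker version and cgroup driver, and CID is just a helper variable for the example:

# Grab the full ID of the running container
CID=$(docker ps --no-trunc -q | head -n 1)
# Docker records the requested limits in its own metadata
docker inspect $CID | grep -iE 'cpuset|memory'
# The same values show up in the cgroup file system,
# e.g. under a "docker" directory on many setups
cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes
cat /sys/fs/cgroup/cpuset/docker/$CID/cpuset.cpus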
=== LXC
LXC, although no longer the default driver for Docker, is still being enhanced for OpenStack with the addition of the partner daemon LXD. LXD and some recent LXC enhancements appear to be focused on improving container security. Security is a hot, fast-moving topic, which I'm leaving for blog post #4.

# Assumes running as root with lxc, lxc-templates, and bridge-utils installed,
# e.g. for a GCE CentOS 7 install:
curl http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-2.noarch.rpm > /tmp/epel-release-7-2.noarch.rpm
rpm -ivh /tmp/epel-release-7-2.noarch.rpm
yum -y install lxc lxc-templates bridge-utils

# Create a container to build a fedora system
lxc-create -n myfedora -t fedora
# Get the password, e.g.
cat /var/lib/lxc/myfedora/tmp_root_pass

# Set up the CPU and memory restrictions
echo "lxc.cgroup.cpuset.cpus = 1" >> /var/lib/lxc/myfedora/config
echo "lxc.cgroup.memory.limit_in_bytes = 2000000000" >> /var/lib/lxc/myfedora/config

# If you are not running libvirt, you'll need to create a virbr0 interface
brctl addbr virbr0

# Start the container in daemonized mode, then use the console command to work
# around the lack of an escape code on start-up in recent LXC versions
lxc-start --daemon --name myfedora
lxc-console --name myfedora
# Ctrl+a, q to exit
lxc-stop --name myfedora
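While the container is running, its state and cgroup values can be checked from the host with the stock LXC tools (a sketch, assuming the LXC 1.0 command set):

# Show container state, PID, and IP information
lxc-info --name myfedora
# Read cgroup values for a running container
lxc-cgroup --name myfedora memory.limit_in_bytes
lxc-cgroup --name myfedora cpuset.cpus
# List all containers and whether they are running
lxc-ls --fancy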
=== systemd-nspawn
Targeted at "building, testing, debugging, and profiling" for system containers only. There are plenty of disclaimers from the developers indicated that it was a development tool only, and not a full fledged container app. Personal experience confirms that there are a lot of rough edges. The following example does not clean itself fully after termination.# Tested from a recent Fedora 20 bare-metal install. # Disable audits in on your kernel cmdline via "audit=0" # Create the container yum -y --releasever=20 --nogpg --installroot=/srv/mycontainer --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal # Set password systemd-nspawn --directory /srv/mycontainer passwd # Boot container with a private network systemd-nspawn --boot --private-network --directory /srv/mycontainer # Set cgroups limits. # Sub-optimal: as it only applies after the server is started systemctl set-property machine-mycontainer.scope MemoryLimit=2G # Or, via cgroup FS # echo "2000000000" > /sys/fs/cgroup/memory/machine.slice/machine-mycontainer.scope/memory.limit_in_bytes # Hmmm... Unable to set CPU Affinity via systemd.resource-control # Only CPUShares is available # Stop from another terminal: machinectl terminate mycontainer
Right Tool for the Job
My recommendation is to use the right tool for the job. Here are some suggestions:
- Trying out containers for streamlining application deployments, or virtual internal system deployments on homogeneous kernels? Use Docker with libcontainer.
- Want to securely run container images from 3rd parties or host multi-tenant containers? Look into the LXD project.
- Testing the latest Linux distro release candidate on your current kernel with a subset of packages? Try out nspawn.
What has your experience been with Docker?