Containers share the host kernel — the only barriers between a container process and full host compromise are Linux security primitives: namespaces, cgroups, capabilities, seccomp, and mandatory access control (AppArmor/SELinux). Understanding how these controls work, and how attackers bypass them, is fundamental to securing any container workload.
This tutorial walks through practical Docker security scenarios:
- --privileged, hostPID, and nsenter give attackers full host access
- --read-only and tmpfs
- -v /:/host is equivalent to giving away root
Each example is designed to be run locally with Docker; no Kubernetes cluster is required.
mkdir -p ~/docker-security-lab && cd ~/docker-security-lab
Linux capabilities split the monolithic root privilege into discrete units. Instead of granting a process full root power, you grant only the specific capabilities it needs. Docker containers start with a reduced capability set, but it's still more than most workloads require.
# Run a container and list its capabilities
docker run --rm -it alpine:latest /bin/sh -c "apk add -q libcap && capsh --print"
The output shows the capability bounding set — these are the maximum privileges available to processes inside the container.
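The same information is available without installing libcap: the CapEff and CapBnd fields of /proc/<pid>/status hold the capability sets as hexadecimal bitmasks. The sketch below decodes such a mask into names. The function name decode_caps is illustrative, and a80425fb is assumed here as the mask commonly reported for Docker's default capability set; the actual value can vary with Docker version and configuration.

```shell
# Decode a hexadecimal capability mask (e.g. CapEff from /proc/self/status)
# into capability names. Names are listed in kernel bit order, starting at bit 0.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in chown dac_override dac_read_search fowner fsetid kill \
        setgid setuid setpcap linux_immutable net_bind_service \
        net_broadcast net_admin net_raw ipc_lock ipc_owner sys_module \
        sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin sys_boot \
        sys_nice sys_resource sys_time sys_tty_config mknod lease \
        audit_write audit_control setfcap mac_override mac_admin syslog \
        wake_alarm block_suspend audit_read perfmon bpf checkpoint_restore
    do
        # Print the name when the corresponding bit is set
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
}

# Assumed example value: the mask typically seen in a default Docker container
decode_caps a80425fb
```

Inside a container, the mask can be read with awk '/^CapEff/ {print $2}' /proc/self/status and passed to the function. A container started with --cap-drop ALL should report a mask of all zeros.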
Dropping all capabilities is the most restrictive baseline. Most applications will still run — they just can't perform privileged operations like binding to low ports, changing file ownership, or sending raw network packets.
# Drop ALL capabilities — ping fails because it needs CAP_NET_RAW
docker run --rm --cap-drop ALL alpine:latest /bin/sh -c \
"ping -c1 -W2 127.0.0.1"
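Expected output: ping: permission denied (are you running as root?). On Docker Engine 20.10 and later the ping may still succeed, because containers are given unprivileged ICMP sockets by default (the net.ipv4.ping_group_range sysctl).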
# Add CAP_NET_RAW back — ping works again
docker run --rm --cap-drop ALL --cap-add NET_RAW alpine:latest /bin/sh -c \
"ping -c1 -W2 127.0.0.1"
Expected output: 1 packets transmitted, 1 packets received, 0% packet loss
| Capability | What it allows | Risk level |
|---|---|---|
| NET_RAW | Raw sockets (ping, packet crafting) | Medium: enables ARP spoofing |
| NET_ADMIN | Network configuration changes | High: can sniff traffic, modify routes |
| SYS_ADMIN | Mount filesystems, manage namespaces | Critical: near-root access |
| SYS_PTRACE | Trace and debug processes | High: can read memory of other processes |
| DAC_OVERRIDE | Bypass file permission checks | High: read/write any file |
| SETUID / SETGID | Change process UID/GID | High: escalate to any user |
Best practice: Always use --cap-drop ALL as the baseline and add back only the specific capabilities your application requires. Document why each capability is needed.
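# Example: the stock nginx image binds port 80 as root, chowns its cache
# directories, and drops to the nginx user, so it typically needs four capabilities
docker run --rm --cap-drop ALL \
--cap-add NET_BIND_SERVICE --cap-add CHOWN --cap-add SETUID --cap-add SETGID \
-p 8080:80 nginx:alpine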
Seccomp (Secure Computing Mode) filters which system calls a container process can execute. Docker applies a default seccomp profile that blocks approximately 44 of the 300+ Linux syscalls, including dangerous ones like mount, reboot, and kexec_load.
Disabling seccomp removes all syscall filtering — the container process can invoke any syscall the kernel supports.
# Without seccomp: unshare succeeds — the process can create new namespaces
docker run --rm -it --security-opt seccomp=unconfined alpine:latest \
unshare --map-root-user --user /bin/sh -c "whoami && id"
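This works only because seccomp is disabled. CLONE_NEWUSER is a flag passed to the unshare syscall, not a syscall itself, and the default seccomp profile blocks that call for containers without CAP_SYS_ADMIN.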
# With default seccomp: unshare is blocked
docker run --rm -it alpine:latest \
unshare --map-root-user --user /bin/sh -c "whoami && id"
Expected: unshare: unshare(0x10000000): Operation not permitted
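To confirm from inside a container whether a syscall filter is active, the kernel exposes a Seccomp field in /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter). The following is a minimal sketch; the helper name seccomp_mode_name is illustrative, not part of any tool.

```shell
# Translate the numeric Seccomp field of /proc/<pid>/status into a label.
seccomp_mode_name() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;
    esac
}

# Report the mode of the current process when the field is readable (Linux only).
if [ -r /proc/self/status ]; then
    mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status)
    echo "seccomp: $(seccomp_mode_name "$mode")"
fi
```

Run inside a container with the default profile, this should report filter; with --security-opt seccomp=unconfined it should report disabled.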
Download Docker's default profile and customize it:
# Download the default seccomp profile
curl -sO https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json
# Count the rule groups ("names" arrays) in the default profile
python3 -m json.tool default.json | grep -c '"names"'
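Create a stricter allowlist profile: the default action denies every syscall, and the list leaves out chmod and chown so both are blocked. The list is a minimal illustration; depending on the runtime, libc, and architecture, a container may need further calls (for example arch_prctl on x86_64) before it starts cleanly.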
cat > strict-seccomp.json << 'SECCOMP'
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_AARCH64"
],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "arch_prctl", "bind", "brk",
"capget", "capset", "chdir", "clone", "close", "connect", "dup",
"dup2", "dup3", "epoll_create", "epoll_create1",
"epoll_ctl", "epoll_wait", "epoll_pwait", "execve",
"exit", "exit_group", "faccessat", "fchdir", "fcntl",
"fstat", "fstatfs", "futex", "getcwd", "getdents64",
"getegid", "geteuid", "getgid", "getpeername",
"getpid", "getppid", "getrandom", "getsockname",
"getsockopt", "getuid", "ioctl", "listen", "lseek",
"madvise", "mmap", "mprotect", "munmap", "nanosleep",
"newfstatat", "open", "openat", "pipe", "pipe2",
"poll", "ppoll", "prctl", "pread64", "prlimit64",
"pwrite64", "read", "readlink", "readlinkat",
"recvfrom", "recvmsg", "rename", "rt_sigaction",
"rt_sigprocmask", "rt_sigreturn", "select",
"sendmsg", "sendto", "set_robust_list",
"set_tid_address", "setgid", "setgroups", "setsockopt",
"setuid", "shutdown", "sigaltstack", "socket", "stat",
"statfs", "sysinfo", "tgkill", "uname", "unlink",
"utimensat", "wait4", "write", "writev"
],
"action": "SCMP_ACT_ALLOW"
}
]
}
SECCOMP
# Run with the strict profile — chmod is now blocked
docker run --rm -it --security-opt seccomp=./strict-seccomp.json \
alpine:latest /bin/sh -c "touch /tmp/test && chmod 777 /tmp/test"
Expected: chmod: /tmp/test: Operation not permitted
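Before starting a container, a quick text check can confirm that an allowlist really omits the calls you intend to block. This sketch assumes an allowlist-only profile like strict-seccomp.json, where every quoted syscall name sits under SCMP_ACT_ALLOW; the function name syscall_allowed and the miniature demo profile are hypothetical.

```shell
# Return success if the syscall name appears (as a quoted string) in the profile.
# Assumes an allowlist-only profile where every listed name is allowed.
syscall_allowed() {
    profile=$1
    name=$2
    grep -q "\"$name\"" "$profile"
}

# Hypothetical miniature profile, used here only to demonstrate the check
demo_profile=$(mktemp)
cat > "$demo_profile" << 'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "names": ["read", "write", "openat"], "action": "SCMP_ACT_ALLOW" }
  ]
}
EOF

for call in read chmod chown; do
    if syscall_allowed "$demo_profile" "$call"; then
        echo "$call: allowed"
    else
        echo "$call: blocked"
    fi
done
rm -f "$demo_profile"
```

Matching on the quoted name keeps chmod from matching fchmod or fchmodat. Profiles that mix allow and deny rules, such as Docker's default.json, need a real JSON parser instead.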
AppArmor provides mandatory access control at the filesystem and network level. Docker's default AppArmor profile (docker-default) restricts containers from writing to /proc and /sys, mounting filesystems, and accessing certain devices.
# Verify AppArmor is loaded (Linux hosts only)
sudo aa-status 2>/dev/null | head -20
# Run with the default profile explicitly
docker run --rm -it --security-opt apparmor=docker-default \
alpine:latest /bin/sh -c "cat /proc/sysrq-trigger"
Write a restrictive profile that prevents a container from writing to anything except /tmp:
sudo tee /etc/apparmor.d/docker-restricted << 'EOF'
#include <tunables/global>
profile docker-restricted flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
# Allow reading, memory-mapping, and executing files (executed programs inherit this profile)
/ r,
/** mrix,
# Only allow writes to /tmp
/tmp/** rw,
# Deny writes everywhere else
deny /etc/** w,
deny /usr/** w,
deny /var/** w,
deny /home/** w,
deny /root/** w,
# Deny raw network access
deny network raw,
# Deny mount operations
deny mount,
# Allow necessary capabilities
capability net_bind_service,
capability setuid,
capability setgid,
}
EOF
# Load the profile
sudo apparmor_parser -r /etc/apparmor.d/docker-restricted
# Test: writing to /etc fails
docker run --rm -it --security-opt apparmor=docker-restricted \
alpine:latest /bin/sh -c "echo test > /etc/test.txt"
# Test: writing to /tmp succeeds
docker run --rm -it --security-opt apparmor=docker-restricted \
alpine:latest /bin/sh -c "echo test > /tmp/test.txt && cat /tmp/test.txt"
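To verify which profile a process is actually running under, the kernel exposes the current AppArmor label in /proc/self/attr/current, in the form "profile-name (mode)" or just "unconfined". The helper below, with the illustrative name parse_aa_label, splits that label; treat it as a sketch, since hosts using SELinux report a different kind of label in the same file.

```shell
# Split an AppArmor label such as "docker-default (enforce)" into
# profile name and mode. An unconfined process reports just "unconfined".
parse_aa_label() {
    label=$1
    case "$label" in
        *" ("*")")
            profile=${label%" ("*}
            mode=${label##*" ("}
            mode=${mode%")"}
            echo "profile=$profile mode=$mode"
            ;;
        *)
            echo "profile=$label mode=none"
            ;;
    esac
}

# Read the label of the current process when the kernel exposes it
if [ -r /proc/self/attr/current ]; then
    current=$(tr -d '\0' < /proc/self/attr/current 2>/dev/null)
    if [ -n "$current" ]; then
        parse_aa_label "$current"
    fi
fi
```

Inside a container started with --security-opt apparmor=docker-restricted, the expected result is profile=docker-restricted mode=enforce.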
Understanding how attackers escalate privileges from inside a container helps you defend against these patterns. Each technique below exploits a specific Docker misconfiguration.
--privileged Flag
The --privileged flag disables most container isolation: all capabilities are granted, seccomp and AppArmor confinement are lifted, and the device cgroup allows access to every host device. The container still runs in its own namespaces, but a root process inside it has effectively the same power as a root process on the host.
# Privileged container can see ALL host devices
docker run --rm -it --privileged alpine:latest /bin/sh -c \
"fdisk -l 2>/dev/null | head -20"
The container can see and interact with host block devices, mount host filesystems, load kernel modules, and access every hardware device.
nsenter with Host PID Namespace
When a container shares the host's PID namespace (--pid=host), it can see all host processes. Combined with --privileged, nsenter lets you enter the host's mount, UTS, network, and IPC namespaces, effectively escaping the container entirely.
# Full host escape: nsenter into PID 1 (the host's init process)
docker run --rm -it --privileged --pid=host alpine:latest \
nsenter -t 1 -m -u -n -i /bin/sh -c "hostname && whoami && cat /etc/hostname"
What each nsenter flag does:
| Flag | Namespace | Effect |
|---|---|---|
| -t 1 | Target | Attach to PID 1 (host init) |
| -m | Mount | See the host filesystem |
| -u | UTS | See the host hostname |
| -n | Network | See the host network stack |
| -i | IPC | See the host IPC resources |
This command runs as root inside the host's namespaces. Replace the -c "..." argument with a plain /bin/sh to get an interactive root shell on the host. From there, you can read /etc/shadow, install packages, modify systemd services, or pivot to other machines on the network.
Mounting the host root filesystem into a container provides direct read/write access to everything on the host:
# Mount the host root filesystem at /host
docker run --rm -it -v /:/host alpine:latest /bin/sh -c \
"chroot /host /bin/sh -c 'cat /etc/shadow | head -5'"
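This works without --privileged: anyone who can run docker commands can bind-mount /, which is why access to the Docker daemon is equivalent to root on the host. The chroot then switches the process's root directory to the host filesystem, so it reads and writes host files as root even though it still runs inside the container's other namespaces.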
Mounting the Docker socket gives the container control over the Docker daemon — it can create new privileged containers, access volumes, and effectively control the host:
# Mount the Docker socket — the container can now manage Docker
docker run --rm -it -v /var/run/docker.sock:/var/run/docker.sock \
docker:latest docker ps
From here, an attacker can launch a privileged container with the host filesystem mounted — achieving full host compromise in two steps.
Defense: Never mount the Docker socket into application containers. Use rootless Docker or Podman for environments where containers need to build other containers.
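The escape techniques in this section all start from a recognizable docker run option, so a first-pass check of launch commands is easy to script. The sketch below uses a hypothetical helper, audit_run_flags, that scans an argument list for those options. It only matches the spellings shown in this tutorial; forms such as --pid host or --mount type=bind,source=/ are not covered.

```shell
# Print a warning for each risky option found in a docker run argument list.
# The patterns mirror the escape techniques above and are not exhaustive.
audit_run_flags() {
    for arg in "$@"; do
        case "$arg" in
            --privileged|--privileged=true)
                echo "WARN: --privileged disables most isolation" ;;
            --pid=host)
                echo "WARN: --pid=host exposes host processes" ;;
            /:/*|--volume=/:/*)
                echo "WARN: host root filesystem is bind-mounted ($arg)" ;;
            *docker.sock*)
                echo "WARN: Docker socket is mounted ($arg)" ;;
        esac
    done
}

audit_run_flags run --rm -it --privileged --pid=host alpine:latest
audit_run_flags run --rm -v /:/host alpine:latest
```

A check like this fits in a CI step or a wrapper script; it complements, rather than replaces, daemon-side controls such as authorization plugins or rootless mode.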
Running containers with --read-only prevents any writes to the container's root filesystem. This blocks attackers from downloading tools, modifying configs, or writing persistence mechanisms.
# Read-only root filesystem — writes fail
docker run --rm -it --read-only alpine:latest /bin/sh -c \
"echo 'malware' > /tmp/payload"
Expected: can't create /tmp/payload: Read-only file system
Most applications need to write to a few specific paths (temp files, PID files, logs). Use tmpfs mounts for these — they exist only in memory and are never written to disk.
# Read-only with tmpfs for /tmp — application writes work, nothing persists
docker run --rm -it --read-only --tmpfs /tmp:rw,noexec,nosuid \
alpine:latest /bin/sh -c \
"echo 'tempdata' > /tmp/test && cat /tmp/test && ls -la /tmp/"
Key tmpfs mount options:
| Option | Effect |
|---|---|
| rw | Allow read/write (default for tmpfs) |
| noexec | Prevent executing binaries from tmpfs; blocks attackers from running downloaded payloads |
| nosuid | Ignore SUID/SGID bits; prevents privilege escalation via setuid binaries |
| size=64m | Limit tmpfs size; prevents memory exhaustion attacks |
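Whether these options actually took effect can be checked from inside the container by reading /proc/mounts. The sketch below parses a file in that format; mount_has_option is an illustrative name, and the sample line (including its exact option string) is an assumption standing in for a real /proc/mounts entry.

```shell
# Check whether a mount point carries a given option, reading a file in
# /proc/mounts format (fields: device, mount point, type, options, ...).
mount_has_option() {
    mounts_file=$1
    mount_point=$2
    option=$3
    awk -v mp="$mount_point" -v opt="$option" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
}

# Sample line standing in for /proc/mounts (option string is an assumption)
sample=$(mktemp)
echo "tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k 0 0" > "$sample"
for opt in noexec nosuid ro; do
    if mount_has_option "$sample" /tmp "$opt"; then
        echo "/tmp has $opt"
    else
        echo "/tmp lacks $opt"
    fi
done
rm -f "$sample"
```

Inside a running container, pass /proc/mounts as the first argument instead of the sample file.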
Combine --read-only, --tmpfs, non-root user, and dropped capabilities for a hardened container:
# Create a test image with a non-root user
cat > /tmp/Dockerfile.secure << 'EOF'
FROM alpine:latest
RUN adduser -D -u 1000 appuser
USER appuser
WORKDIR /home/appuser
CMD ["sh"]
EOF
docker build -t secure-test -f /tmp/Dockerfile.secure /tmp
# Run with full hardening
docker run --rm -it \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--cap-drop ALL \
--security-opt no-new-privileges \
-u 1000:1000 \
secure-test /bin/sh -c "whoami && id && touch /tmp/ok && echo 'write works in /tmp'"
The --security-opt no-new-privileges flag prevents the process from gaining additional privileges through SUID binaries or capability inheritance — even if an attacker finds a setuid binary inside the container, they cannot exploit it.
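The combined effect of these flags is visible in /proc/self/status: the Uid line shows the effective user, CapEff shows the effective capability mask, and NoNewPrivs reports whether the no-new-privileges bit is set. The following sketch prints those three fields; hardening_summary is a hypothetical helper, and the sample file contents are made up to resemble what the hardened container above would be expected to show.

```shell
# Summarize hardening-related fields from a file in /proc/<pid>/status format.
# Defaults to the current process; pass a path to inspect a saved copy.
hardening_summary() {
    status_file=${1:-/proc/self/status}
    awk '
        /^Uid:/        { print "effective_uid=" $3 }
        /^CapEff:/     { print "cap_eff=" $2 }
        /^NoNewPrivs:/ { print "no_new_privs=" $2 }
    ' "$status_file"
}

# Made-up sample resembling a non-root process with no capabilities
sample=$(mktemp)
printf 'Name:\tsh\nUid:\t1000\t1000\t1000\t1000\nCapEff:\t0000000000000000\nNoNewPrivs:\t1\n' > "$sample"
hardening_summary "$sample"
rm -f "$sample"
```

Calling hardening_summary with no argument inside a container reports the values for the shell itself, which makes it a convenient smoke test in an entrypoint or health check.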
Using --network=host removes network namespace isolation — the container shares the host's network stack, can bind to any port, and can sniff all host traffic.
# Host network: nginx binds directly to the host's port 80
docker run --rm -d --network=host --name host-nginx nginx:alpine
# Verify it's listening on the host's network interface
curl -s -o /dev/null -w "%{http_code}" http://localhost:80
# Cleanup
docker stop host-nginx
Why this is dangerous:
- The container can reach services bound to 127.0.0.1 on the host (databases, admin interfaces)
Best practice: Use Docker's bridge network (default) or custom networks. Only use --network=host when absolutely required for performance-critical network applications, and combine it with other security controls.
Before running any image — especially from public registries — inspect its contents and scan for known vulnerabilities.
# Pull an image
docker pull ubuntu/squid:latest
# View the full build history — shows every Dockerfile instruction
docker history --no-trunc ubuntu/squid:latest
Look for suspicious patterns in the history:
- curl or wget fetching unknown URLs
- RUN bash -c "..." commands
- netcat, nmap, or socat
# Scan for HIGH and CRITICAL vulnerabilities
trivy image --severity HIGH,CRITICAL ubuntu/squid:latest
# Scan and output as JSON for pipeline integration
trivy image --severity HIGH,CRITICAL -f json -o scan-results.json \
ubuntu/squid:latest
# Scan a local Dockerfile for misconfigurations
trivy config /tmp/Dockerfile.secure
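The manual history review can be given a first-pass filter that highlights lines worth reading closely. This sketch greps history text for the tool and command names mentioned earlier; flag_suspicious_history is an illustrative name, the pattern list is deliberately small, and the sample history lines are invented. A match is a lead to investigate, not evidence of compromise.

```shell
# Print lines (with line numbers) from image-history text that match
# patterns worth a closer look. Reads stdin, so it can be fed by:
#   docker history --no-trunc IMAGE | flag_suspicious_history
flag_suspicious_history() {
    grep -nE 'curl |wget |bash -c|netcat|nmap|socat' || true
}

# Invented history excerpt for demonstration
printf '%s\n' \
    'RUN apt-get update && apt-get install -y squid' \
    'RUN curl -s http://example.invalid/install.sh | sh' \
    'CMD ["squid", "-N"]' | flag_suspicious_history
```

The trailing || true keeps the function from returning a failure status when nothing matches, which matters in scripts that run with set -e.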
# Start a container to inspect
docker run -d --name inspect-target alpine:latest sleep 3600
# Export the filesystem as a tar and examine
docker export inspect-target | tar -tf - | head -50
# Look for suspicious files in writable locations
docker export inspect-target | tar -tf - | grep -E "(tmp|dev/shm|var/tmp)/"
# Check for SUID binaries
docker exec inspect-target find / -perm -4000 -type f 2>/dev/null
# Cleanup
docker stop inspect-target && docker rm inspect-target
For images signed with Sigstore cosign:
# Verify an image signature
cosign verify --key cosign.pub <registry>/<image>:<tag>
# Verify with keyless signing (Sigstore transparency log)
cosign verify \
--certificate-identity <signer-email> \
--certificate-oidc-issuer https://accounts.google.com \
<registry>/<image>:<tag>
Run containers with the maximum practical restrictions:
# Production-hardened container run command
docker run -d \
--name my-app \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
--security-opt no-new-privileges \
--security-opt seccomp=./strict-seccomp.json \
--security-opt apparmor=docker-restricted \
--memory 512m \
--cpus 1 \
--pids-limit 100 \
-u 1000:1000 \
--network my-app-net \
-p 8080:8080 \
my-app:latest
What each flag does:
| Flag | Security control |
|---|---|
| --read-only | Immutable root filesystem |
| --tmpfs /tmp:rw,noexec,nosuid,size=64m | Writable temp that blocks binary execution |
| --cap-drop ALL | Remove all Linux capabilities |
| --cap-add NET_BIND_SERVICE | Add back only what's needed |
| --security-opt no-new-privileges | Block SUID/capability escalation |
| --security-opt seccomp=./strict-seccomp.json | Custom syscall filter |
| --security-opt apparmor=docker-restricted | Mandatory access control profile |
| --memory 512m | Prevent memory exhaustion |
| --cpus 1 | Prevent CPU exhaustion |
| --pids-limit 100 | Prevent fork bombs |
| -u 1000:1000 | Run as non-root user |
| --network my-app-net | Isolated network (not host) |
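A launch command can be checked against this list before it reaches production. The sketch below reports which of a few core hardening options are absent from an argument list. The helper name missing_hardening and the choice of five required options are assumptions for illustration, and only the long option spellings are recognized (for example, -m 512m would be reported as missing).

```shell
# List core hardening options that are absent from a docker run argument list.
# Accepts both "--flag value" and "--flag=value" spellings of long options.
missing_hardening() {
    # Normalize "=" to a space so both spellings compare the same way
    args=" $(printf '%s' "$*" | tr '=' ' ') "
    for required in "--read-only" "--cap-drop ALL" \
        "--security-opt no-new-privileges" "--pids-limit" "--memory"
    do
        case "$args" in
            *" $required "*|*" $required:"*) ;;
            *) echo "missing: $required" ;;
        esac
    done
}

missing_hardening run -d --read-only --cap-drop ALL --memory 512m my-app:latest
```

For the example call, the function reports the no-new-privileges and --pids-limit options as missing. Extending the list in the for loop is enough to enforce additional flags.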
# Remove any running lab containers
docker rm -f falco-test secure-test inspect-target host-nginx 2>/dev/null
# Remove lab images
docker rmi secure-test 2>/dev/null
# Remove generated files
rm -f strict-seccomp.json default.json scan-results.json
rm -f /tmp/Dockerfile.secure
# Remove custom AppArmor profile (if created)
sudo apparmor_parser -R /etc/apparmor.d/docker-restricted 2>/dev/null
sudo rm -f /etc/apparmor.d/docker-restricted
# Remove lab directory
rm -rf ~/docker-security-lab/