Most container deployments run with Docker's default seccomp profile, which blocks roughly 44 syscalls out of 300+. That leaves around 260 syscalls reachable from inside the container. Syscalls are exploitation primitives: ptrace for process injection, mount for filesystem escape, clone with CLONE_NEWUSER for privilege escalation. The default profile blocks the most notorious of these, but every syscall it still allows remains available to an attacker who gains code execution in your application.
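This tutorial builds a defense-in-depth hardening pipeline for a containerized application:
- Harden the container runtime in docker-compose.yml (read-only root filesystem, dropped capabilities, no-new-privileges)
- Trace the syscalls the application actually makes with eBPF (bpftrace)
- Generate a deny-by-default seccomp profile from the trace
- Run the container with the generated profile applied
- Restrict outbound traffic to a single domain with iptables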
Each step maps to a single make target so you can follow along one command at a time.
You need a Linux host (Ubuntu 22.04+ recommended). eBPF tracing and iptables egress filtering do not work on macOS or Windows — you need direct access to the kernel.
# Docker and Compose
sudo apt-get update
sudo apt-get install -y docker.io docker-compose-plugin
# eBPF tracing and JSON processing
sudo apt-get install -y bpftrace jq
# Verify
bpftrace --version
jq --version
docker compose version
Add your user to the docker group to avoid sudo for Docker commands:
sudo usermod -aG docker $USER
newgrp docker
ebpf-seccomp-cli-example/
├── Makefile # One-command-per-step workflow
├── app.py # Demo CLI application
├── Dockerfile # Minimal container image
├── docker-compose.yml # Hardened compose config
├── docker-compose.seccomp.yml # Compose overlay for seccomp
├── scripts/
│ ├── trace_syscalls.sh # eBPF tracing with bpftrace
│ ├── build_seccomp_profile.sh # Seccomp JSON generator
│ └── egress_policy.sh # Per-container iptables egress
└── seccomp-profile.json # Generated (after make seccomp)
| Target | What it does |
|---|---|
| make setup | Install bpftrace/jq, create host mount directories |
| make build | Build the container image |
| make run | Baseline run (no custom seccomp) |
| make trace | Trace syscalls with eBPF while the container runs |
| make seccomp | Generate seccomp profile from the trace |
| make run-locked | Run with the generated seccomp profile applied |
| make egress-apply | Lock container egress to a single domain |
| make egress-remove | Remove egress restrictions |
| make clean | Stop containers, remove generated files |
| make all | Full pipeline: build → run → trace → seccomp → run-locked |
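The demo is a Python CLI that makes a single HTTPS request and writes the response to disk. It is intentionally simple, but it demonstrates three security patterns that matter in production:
- An outbound URL allowlist: the app only fetches https://kurtisvelarde.com
- Output and log paths confined to fixed directories
- Structured JSON logs for every security-relevant event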
#!/usr/bin/env python3
import argparse
import json
import logging
import os
import pathlib
import socket
import sys
import urllib.error
import urllib.parse
import urllib.request
from datetime import datetime, timezone

ALLOWED_URL = "https://kurtisvelarde.com"
ALLOWED_HOST = "kurtisvelarde.com"
ALLOWED_SCHEME = "https"
ALLOWED_OUTPUT_DIR = pathlib.Path("/hostmount/output_data").resolve()
ALLOWED_LOG_DIR = pathlib.Path("/hostmount/log_data").resolve()


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.msg,
            "details": getattr(record, "details", {}),
        }
        return json.dumps(payload, separators=(",", ":"))


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=(
            "Fetch HTTPS content from kurtisvelarde.com and write to "
            "/hostmount/output_data with logs in /hostmount/log_data."
        )
    )
    parser.add_argument(
        "--url", default=ALLOWED_URL,
        help="Target URL. Must use https://kurtisvelarde.com",
    )
    parser.add_argument(
        "--output", default="/hostmount/output_data/response.html",
        help="Output file path under /hostmount/output_data",
    )
    parser.add_argument(
        "--log-file", default="/hostmount/log_data/fetch.log",
        help="Log file path under /hostmount/log_data",
    )
    parser.add_argument(
        "--timeout", type=float, default=10.0,
        help="HTTP timeout in seconds",
    )
    return parser.parse_args()


def assert_allowed_url(url: str) -> None:
    parsed = urllib.parse.urlparse(url)
    if parsed.scheme != ALLOWED_SCHEME:
        raise ValueError(f"URL must use {ALLOWED_SCHEME}")
    if parsed.hostname != ALLOWED_HOST:
        raise ValueError(f"URL host must be {ALLOWED_HOST}")
    if parsed.port not in (None, 443):
        raise ValueError("URL port must be 443")


def assert_within_dir(
    path_value: str, allowed_dir: pathlib.Path, label: str
) -> pathlib.Path:
    candidate = pathlib.Path(path_value)
    resolved = candidate.resolve()
    if not str(resolved).startswith(str(allowed_dir) + os.sep):
        raise ValueError(f"{label} must be under {allowed_dir}")
    return resolved


def setup_logger(log_path: pathlib.Path) -> logging.Logger:
    log_path.parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("fetch_cli")
    logger.setLevel(logging.INFO)
    logger.handlers = []
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def resolve_host_ips(hostname: str) -> list[str]:
    infos = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
    ips = sorted({info[4][0] for info in infos})
    return ips


def fetch_bytes(url: str, timeout: float) -> tuple[bytes, int]:
    request = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(), response.status


def write_output(output_path: pathlib.Path, body: bytes) -> int:
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as f:
        written = f.write(body)
    return written


def main() -> int:
    args = parse_args()
    try:
        assert_allowed_url(args.url)
        output_path = assert_within_dir(
            args.output, ALLOWED_OUTPUT_DIR, "Output path"
        )
        log_path = assert_within_dir(
            args.log_file, ALLOWED_LOG_DIR, "Log path"
        )
    except Exception as exc:
        print(f"Validation error: {exc}", file=sys.stderr)
        return 2

    logger = setup_logger(log_path)

    # DNS resolution
    try:
        resolved_ips = resolve_host_ips(ALLOWED_HOST)
        logger.info(
            "dns_lookup",
            extra={"details": {"hostname": ALLOWED_HOST, "ips": resolved_ips}},
        )
    except Exception as exc:
        logger.error(
            "dns_lookup_failure",
            extra={"details": {"hostname": ALLOWED_HOST, "error": str(exc)}},
        )
        print(f"DNS lookup failed: {exc}", file=sys.stderr)
        return 3

    # HTTPS fetch
    try:
        body, status = fetch_bytes(args.url, args.timeout)
        logger.info(
            "outbound_connection_success",
            extra={"details": {
                "url": args.url, "status": status,
                "bytes_received": len(body),
            }},
        )
    except urllib.error.URLError as exc:
        logger.error(
            "outbound_connection_failure",
            extra={"details": {"url": args.url, "error": str(exc)}},
        )
        print(f"Request failed: {exc}", file=sys.stderr)
        return 4

    # Write output
    try:
        bytes_written = write_output(output_path, body)
        logger.info(
            "file_write_success",
            extra={"details": {
                "path": str(output_path),
                "bytes_written": bytes_written,
            }},
        )
    except OSError as exc:
        logger.error(
            "file_write_failure",
            extra={"details": {"path": str(output_path), "error": str(exc)}},
        )
        print(f"File write failed: {exc}", file=sys.stderr)
        return 5

    print(f"Fetched {args.url} -> {output_path} ({bytes_written} bytes)")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
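To see what the URL checks in the listing above accept and reject, the same three conditions can be wrapped in a boolean helper and run against a few probe URLs. This is a minimal sketch: is_allowed_url and the probe list are hypothetical names for illustration and are not part of app.py.

```python
import urllib.parse

ALLOWED_SCHEME = "https"
ALLOWED_HOST = "kurtisvelarde.com"


def is_allowed_url(url: str) -> bool:
    """Boolean variant of the scheme, host, and port checks (hypothetical helper)."""
    parsed = urllib.parse.urlparse(url)
    try:
        port = parsed.port  # raises ValueError when the port is not a valid number
    except ValueError:
        return False
    return (
        parsed.scheme == ALLOWED_SCHEME
        and parsed.hostname == ALLOWED_HOST
        and port in (None, 443)
    )


# Hypothetical probe URLs, one per rule.
PROBES = [
    "https://kurtisvelarde.com",           # allowed
    "https://kurtisvelarde.com:443/path",  # allowed, explicit default port
    "http://kurtisvelarde.com",            # wrong scheme
    "https://kurtisvelarde.com:8443",      # wrong port
    "https://kurtisvelarde.com.evil.com",  # lookalike suffix
    "https://kurtisvelarde.com@evil.com",  # userinfo trick: the real host is evil.com
]

for url in PROBES:
    verdict = "ALLOW" if is_allowed_url(url) else "DENY"
    print(f"{verdict:<5} {url}")
```

Comparing the parsed hostname, rather than searching the raw string for the domain, is what defeats the last two probes.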
Key security design choices:
- assert_allowed_url() blocks requests to any host except the hardcoded one. This is defense at the app layer: even if someone changes the CLI arguments, the app rejects the URL before making a connection.
- assert_within_dir() resolves symlinks and checks that the canonical path is under the allowed directory. This prevents path traversal attacks like --output /hostmount/output_data/../../etc/passwd.
- JsonFormatter produces structured, machine-parseable logs. Every security-relevant event (DNS lookup, connection success/failure, file write success/failure) gets its own log entry with a timestamp.

The image is intentionally minimal: a slim base with no compilers and no package manager cache, just the Python runtime and the application. It still includes /bin/sh, which a later verification step relies on.
FROM python:3.12-slim
WORKDIR /app
COPY app.py /app/app.py
ENTRYPOINT ["python3", "/app/app.py"]
Using python:3.12-slim over python:3.12 drops the image size significantly and removes compilers, headers, and other tools an attacker could leverage post-exploitation.
Every step of the hardening workflow is a single make command:
COMPOSE := docker compose
SERVICE := ebpf_seccomp_demo
SECCOMP_OVERLAY := -f docker-compose.yml -f docker-compose.seccomp.yml
DOMAIN := kurtisvelarde.com
RAW_FILE := /tmp/syscalls.raw
SYSCALL_LIST := /tmp/syscalls.txt
PROFILE := ./seccomp-profile.json

.PHONY: help setup build run trace seccomp run-locked \
	egress-apply egress-remove clean all

help: ## Show available targets
	@grep -E '^[a-zA-Z_-]+:.*##' $(MAKEFILE_LIST) \
		| awk 'BEGIN {FS = ":.*## "}; {printf " %-16s %s\n", $$1, $$2}'

setup: ## Install host dependencies and create mount dirs
	sudo apt-get update -qq && sudo apt-get install -y bpftrace jq
	sudo mkdir -p /hostmount/output_data /hostmount/log_data
	sudo chown -R $$(id -u):$$(id -g) /hostmount/output_data \
		/hostmount/log_data

build: ## Build the container image
	$(COMPOSE) build

run: ## Run baseline (no custom seccomp profile)
	$(COMPOSE) run --rm $(SERVICE)

trace: ## Trace syscalls with eBPF
	@echo "--- Starting container in background ---"
	$(COMPOSE) up -d
	@CID=$$($(COMPOSE) ps -q $(SERVICE)); \
	PID=$$(docker inspect -f '{{.State.Pid}}' "$$CID"); \
	echo "Container PID: $$PID"; \
	echo "--- Tracing (press Ctrl+C when done) ---"; \
	./scripts/trace_syscalls.sh "$$PID" $(RAW_FILE) || true
	$(COMPOSE) down
	@wc -l < $(RAW_FILE) \
		| xargs -I{} echo "Captured {} syscall events"

seccomp: ## Build seccomp profile from traced syscalls
	./scripts/build_seccomp_profile.sh \
		$(RAW_FILE) $(SYSCALL_LIST) $(PROFILE)
	@echo "--- Allowed syscalls ---"
	@cat $(SYSCALL_LIST)

run-locked: ## Run with generated seccomp profile applied
	$(COMPOSE) $(SECCOMP_OVERLAY) run --rm $(SERVICE)

egress-apply: ## Lock egress to DOMAIN only (requires sudo)
	@CID=$$($(COMPOSE) ps -q $(SERVICE)); \
	sudo ./scripts/egress_policy.sh apply "$$CID" $(DOMAIN)

egress-remove: ## Remove egress restrictions (requires sudo)
	@CID=$$($(COMPOSE) ps -q $(SERVICE)); \
	sudo ./scripts/egress_policy.sh remove "$$CID"

clean: ## Remove generated files and stop containers
	$(COMPOSE) down --remove-orphans 2>/dev/null || true
	rm -f $(RAW_FILE) $(SYSCALL_LIST) $(PROFILE)

all: build run trace seccomp run-locked ## Full pipeline
First, prepare your host and build the image:
make setup
make build
make setup installs bpftrace and jq and creates the host mount directories under /hostmount/. make build runs docker compose build which builds the image from the Dockerfile.
The docker-compose.yml applies several hardening measures before we even get to seccomp:
services:
  ebpf_seccomp_demo:
    build:
      context: .
    image: ebpf-seccomp-demo:latest
    read_only: true
    tmpfs:
      - /tmp:rw,noexec,nosuid,size=16m
    cap_drop:
      - ALL
    security_opt:
      - no-new-privileges:true
    volumes:
      - /hostmount/output_data:/hostmount/output_data:rw
      - /hostmount/log_data:/hostmount/log_data:rw
    command:
      - --url
      - https://kurtisvelarde.com
      - --output
      - /hostmount/output_data/response.html
      - --log-file
      - /hostmount/log_data/fetch.log
Line by line:
| Setting | What it prevents |
|---|---|
| read_only: true | Attacker cannot write to the container filesystem (no malware drops, no crontab modification) |
| tmpfs: /tmp:rw,noexec,nosuid,size=16m | /tmp is writable but binaries there cannot execute and SUID bits are ignored |
| cap_drop: [ALL] | Drops every Linux capability: no CAP_NET_RAW (no raw sockets), no CAP_SYS_ADMIN (no mount), no CAP_DAC_OVERRIDE (no permission bypass) |
| no-new-privileges:true | Prevents SUID binaries from elevating privileges, blocks execve privilege escalation |
| Scoped volumes | Only two specific host directories are mounted writable, so the app cannot reach /etc, /root, or any other host path |
This is already significantly harder to exploit than a default docker run. But the container still has access to ~260 syscalls via Docker's default seccomp profile. We can do better.
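The settings in the table above are easy to lose in a later edit of the compose file, so it can help to check for them mechanically. The sketch below assumes the service definition has already been loaded into a plain dict (for example by a YAML parser, which is not shown); hardening_gaps is a hypothetical helper, not part of the tutorial's repository.

```python
def hardening_gaps(service: dict) -> list[str]:
    """List the hardening settings from the table that a service definition lacks."""
    gaps = []
    if service.get("read_only") is not True:
        gaps.append("root filesystem is writable (read_only is not true)")
    if "ALL" not in service.get("cap_drop", []):
        gaps.append("capabilities are not fully dropped (cap_drop lacks ALL)")
    if "no-new-privileges:true" not in service.get("security_opt", []):
        gaps.append("no-new-privileges is not set")
    tmpfs_entries = [str(entry) for entry in service.get("tmpfs", [])]
    if not any(e.startswith("/tmp:") and "noexec" in e for e in tmpfs_entries):
        gaps.append("/tmp is not a noexec tmpfs")
    return gaps


# The relevant keys of the tutorial's service, written as a parser would return them.
hardened = {
    "read_only": True,
    "tmpfs": ["/tmp:rw,noexec,nosuid,size=16m"],
    "cap_drop": ["ALL"],
    "security_opt": ["no-new-privileges:true"],
}

print(hardening_gaps(hardened))
for gap in hardening_gaps({"image": "ebpf-seccomp-demo:latest"}):
    print("missing:", gap)
```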
Run the app to verify it works before we start restricting it:
make run
Expected output:
Fetched https://kurtisvelarde.com -> /hostmount/output_data/response.html (45231 bytes)
Check the structured log:
jq -c . /hostmount/log_data/fetch.log
You should see three JSON events:
{"timestamp":"2026-03-13T...","level":"INFO","event":"dns_lookup","details":{"hostname":"kurtisvelarde.com","ips":["..."]}}
{"timestamp":"2026-03-13T...","level":"INFO","event":"outbound_connection_success","details":{"url":"https://kurtisvelarde.com","status":200,"bytes_received":45231}}
{"timestamp":"2026-03-13T...","level":"INFO","event":"file_write_success","details":{"path":"/hostmount/output_data/response.html","bytes_written":45231}}
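Because each log line is a standalone JSON object, the same check can be automated. The following is a sketch under assumptions: check_run is a hypothetical helper, it expects the log lines of a single run, and the sample values (timestamps, IP, byte counts) are made up to match the shape JsonFormatter emits.

```python
import json

EXPECTED_EVENTS = ["dns_lookup", "outbound_connection_success", "file_write_success"]


def check_run(log_text: str) -> list[str]:
    """Return a list of problems found in one run's JSON log lines."""
    records = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    events = [record["event"] for record in records]
    problems = []
    if events != EXPECTED_EVENTS:
        problems.append(f"unexpected event sequence: {events}")
    for record in records:
        if record["level"] != "INFO":
            problems.append(f"{record['event']} logged at level {record['level']}")
    return problems


# Hypothetical log lines for a healthy run and for a failed DNS lookup.
SAMPLE_LOG = "\n".join([
    '{"timestamp":"t0","level":"INFO","event":"dns_lookup","details":{"ips":["203.0.113.10"]}}',
    '{"timestamp":"t1","level":"INFO","event":"outbound_connection_success","details":{"status":200}}',
    '{"timestamp":"t2","level":"INFO","event":"file_write_success","details":{"bytes_written":10}}',
])
FAILED_LOG = '{"timestamp":"t0","level":"ERROR","event":"dns_lookup_failure","details":{}}'

print(check_run(SAMPLE_LOG) or "run looks healthy")
print(check_run(FAILED_LOG))
```

Note that the app opens the log file in append mode, so a real fetch.log accumulates one group of events per run.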
The app works. Now let's find out exactly which syscalls it used.
eBPF (extended Berkeley Packet Filter) lets you attach programs to kernel events without modifying the kernel or loading kernel modules. bpftrace is a high-level tracing language that compiles to eBPF bytecode.
We use it to hook the tracepoint:syscalls:sys_enter_* family — every time the container's process enters a syscall, bpftrace records which one.
#!/usr/bin/env bash
set -euo pipefail
PID="$1"
OUTFILE="${2:-/tmp/syscalls.raw}"
# bpftrace's "pid" builtin is the kernel's thread group ID (tgid),
# so this filter matches every thread of the target process.
sudo bpftrace -e \
  "tracepoint:syscalls:sys_enter_* /pid == ${PID}/ \
  { printf(\"%s\\n\", probe); }" > "$OUTFILE"
This one-liner:
- Hooks every sys_enter_* tracepoint (all ~300 syscalls)
- Filters on bpftrace's pid builtin, which holds the kernel thread group ID (tgid) and therefore matches all threads of the container's main process
- Prints the probe name (e.g., tracepoint:syscalls:sys_enter_read) to the output file
make trace
This starts the container in the background, finds its PID, and starts tracing. Press Ctrl+C after the container finishes its work (you'll see the fetch output in the Docker logs).
Inspect the raw trace:
head -20 /tmp/syscalls.raw
tracepoint:syscalls:sys_enter_read
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_close
tracepoint:syscalls:sys_enter_fstat
tracepoint:syscalls:sys_enter_mmap
tracepoint:syscalls:sys_enter_mprotect
...
Each line is one syscall invocation. You'll see some syscalls appear thousands of times (read, write, futex) and others just once (socket, connect). The exact count doesn't matter — we only care about the unique set.
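To get a feel for the trace before reducing it to a unique set, the raw file can be summarized with a few lines of Python. This is an illustrative sketch: summarize_trace and the sample lines are hypothetical, and the filter on the probe prefix is an assumption that anything else in the file (such as a tracer startup message) should be ignored.

```python
from collections import Counter

PREFIX = "tracepoint:syscalls:sys_enter_"


def summarize_trace(lines) -> Counter:
    """Count invocations per syscall, ignoring lines that are not probe names."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith(PREFIX):
            counts[line[len(PREFIX):]] += 1
    return counts


# Hypothetical excerpt of a raw trace file.
sample = [
    "Attaching 3 probes...",
    "tracepoint:syscalls:sys_enter_read",
    "tracepoint:syscalls:sys_enter_read",
    "tracepoint:syscalls:sys_enter_write",
    "tracepoint:syscalls:sys_enter_connect",
    "tracepoint:syscalls:sys_enter_read",
    "tracepoint:syscalls:sys_enter_write",
]

counts = summarize_trace(sample)
for name, n in counts.most_common():
    print(f"{n:>6}  {name}")
print("unique syscalls:", sorted(counts))
```

With a real trace, pass an open file object instead of the sample list, for example summarize_trace(open("/tmp/syscalls.raw")).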
Seccomp (Secure Computing Mode) uses BPF filters to restrict which syscalls a process can make. When a process attempts a blocked syscall, the kernel either kills the process or returns an error, and the syscall never executes.
Docker's default seccomp profile blocks ~44 syscalls that are almost never needed (like reboot, kexec_load, mount). But it allows ~260 others. A custom profile generated from a real trace flips this: deny by default, allow only what was observed.
#!/usr/bin/env bash
set -euo pipefail
RAW_INPUT="${1:-/tmp/syscalls.raw}"
SYSCALL_LIST="${2:-/tmp/syscalls.txt}"
PROFILE_OUT="${3:-./seccomp-profile.json}"

# Keep only probe lines, strip the tracepoint prefix, and deduplicate.
# Printing only matching lines drops anything else in the raw file
# (such as a bpftrace startup banner) before it can reach the allowlist.
sed -n 's/^tracepoint:syscalls:sys_enter_//p' "$RAW_INPUT" \
  | sort -u > "$SYSCALL_LIST"
echo "Wrote syscall list to: $SYSCALL_LIST"

# Build the JSON profile
jq -Rn '
  [inputs | select(length>0)] as $names
  | {
      defaultAction: "SCMP_ACT_ERRNO",
      architectures: [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ],
      syscalls: [
        { names: $names, action: "SCMP_ACT_ALLOW" }
      ]
    }
' "$SYSCALL_LIST" > "$PROFILE_OUT"
echo "Wrote seccomp profile to: $PROFILE_OUT"
The critical setting is defaultAction: "SCMP_ACT_ERRNO" — any syscall not in the allowlist returns EPERM to the caller instead of executing.
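For readers more comfortable with Python than jq, the profile structure can be sketched as a plain dict. This mirrors the shape produced by the script above and is not a replacement for it; build_profile is a hypothetical name, and the three-syscall input is only a placeholder.

```python
import json


def build_profile(syscall_names) -> dict:
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    names = sorted({name.strip() for name in syscall_names if name.strip()})
    return {
        # Anything not listed below fails with an error instead of executing.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


# Placeholder input: duplicates and blank lines are dropped.
profile = build_profile(["read", "write", "", "read", "close\n"])
print(json.dumps(profile, indent=2))
```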
make seccomp
Output:
Wrote syscall list to: /tmp/syscalls.txt
Wrote seccomp profile to: ./seccomp-profile.json
--- Allowed syscalls ---
access
bind
brk
clone3
close
connect
...
write
Inspect the generated profile:
cat seccomp-profile.json | jq .
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": [
        "access",
        "bind",
        "brk",
        "clone3",
        "close",
        "connect",
        "..."
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
A typical Python HTTPS client uses around 30-40 unique syscalls. That cuts the allowed set from ~260 under the default profile to ~35, roughly an 86% reduction in syscall attack surface.
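The size of the reduction depends on how many syscalls your own trace contains. As a quick worked example, using the approximate baseline of 260 allowed syscalls quoted in this tutorial (surface_reduction is a hypothetical helper):

```python
def surface_reduction(baseline_allowed: int, profile_allowed: int) -> float:
    """Percentage of previously allowed syscalls that the custom profile removes."""
    return 100.0 * (baseline_allowed - profile_allowed) / baseline_allowed


BASELINE = 260  # approximate count allowed by Docker's default profile, per the text

for allowed in (30, 35, 40):
    print(f"{allowed} allowed -> {surface_reduction(BASELINE, allowed):.1f}% fewer syscalls")
# 30 allowed -> 88.5% fewer syscalls
# 35 allowed -> 86.5% fewer syscalls
# 40 allowed -> 84.6% fewer syscalls
```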
The compose overlay applies the generated profile:
# docker-compose.seccomp.yml
services:
  ebpf_seccomp_demo:
    security_opt:
      - no-new-privileges:true
      - seccomp:./seccomp-profile.json
Run with the profile:
make run-locked
If the output matches the baseline run, your profile is correct. The app works with only the syscalls it actually needs.
To confirm the profile is enforcing, try running a shell command that uses a syscall outside the allowlist. For example, ptrace is used by debuggers and is almost certainly not in your trace:
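docker compose -f docker-compose.yml -f docker-compose.seccomp.yml \
  run --rm --entrypoint /bin/sh ebpf_seccomp_demo -c \
  "python3 -c 'import ctypes, os; libc = ctypes.CDLL(None, use_errno=True); rc = libc.ptrace(0, 0, 0, 0); print(rc, os.strerror(ctypes.get_errno()))'"
ctypes does not raise when a libc call fails, so the snippet prints the return value and the errno message itself. With the profile applied, ptrace should return -1 with "Operation not permitted" because it is not in the allowlist.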
Seccomp filters on syscall numbers and scalar argument values. It cannot dereference pointers, so the address structure passed to connect() is opaque to it, and it knows nothing about hostnames or IP addresses. It can block connect() entirely, but it cannot say "allow connect() only to kurtisvelarde.com:443." That requires a different layer.
The egress_policy.sh script creates per-container iptables rules in Docker's DOCKER-USER chain:
#!/usr/bin/env bash
set -euo pipefail
ACTION="$1"            # "apply" or "remove"; only the apply path is shown here
CONTAINER_REF="$2"
DOMAIN="${3:-kurtisvelarde.com}"

# Short container ID (first 12 characters), used to name the per-container chain
SHORT_ID=$(docker inspect -f '{{.Id}}' "$CONTAINER_REF" | cut -c1-12)

# Get container's IP on the Docker bridge network
SOURCE_IP=$(docker inspect -f \
  '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
  "$CONTAINER_REF")

# Resolve the allowed domain to IPv4 addresses
mapfile -t ALLOWED_IPS < <(getent ahostsv4 "$DOMAIN" | awk '{print $1}' | sort -u)

# Create a per-container chain
CHAIN="CODX_EGRESS_${SHORT_ID}"
iptables -N "$CHAIN"

# Allow established connections (responses to our requests)
iptables -A "$CHAIN" -m conntrack \
  --ctstate ESTABLISHED,RELATED -j ACCEPT

# Allow HTTPS to resolved IPs only
for ip in "${ALLOWED_IPS[@]}"; do
  iptables -A "$CHAIN" -p tcp -d "$ip" --dport 443 -j ACCEPT
done

# Reject everything else from this container
iptables -A "$CHAIN" -j REJECT

# Hook into DOCKER-USER
iptables -I DOCKER-USER 1 -s "${SOURCE_IP}/32" -j "$CHAIN"
Start the container, then apply egress restrictions:
# Start the container
docker compose up -d
# Apply egress lock
make egress-apply
Output:
Applied egress policy for container: ebpf_seccomp_demo
- Source IP: 172.18.0.2
- Chain: CODX_EGRESS_a1b2c3d4e5f6
- Allowed destination domain: kurtisvelarde.com
- Allowed destination IPv4: 104.21.x.x 172.67.x.x
Now the container can only make HTTPS connections to the resolved IPs of kurtisvelarde.com. Any attempt to reach another host will be rejected by iptables before it leaves the Docker bridge.
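Rule order matters here: iptables evaluates a chain top to bottom and stops at the first terminating match, so the ACCEPT rules must come before the final REJECT. The sketch below rebuilds the rule sequence as data so the ordering can be inspected without touching the host firewall. build_egress_rules is a hypothetical helper, and the chain name and IP addresses are placeholder values.

```python
import shlex


def build_egress_rules(chain: str, source_ip: str, allowed_ips: list[str]) -> list[list[str]]:
    """Return iptables argument lists in the order the egress script applies them."""
    rules = [
        ["-N", chain],
        ["-A", chain, "-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    ]
    for ip in allowed_ips:
        rules.append(["-A", chain, "-p", "tcp", "-d", ip, "--dport", "443", "-j", "ACCEPT"])
    # Catch-all: anything not accepted above is rejected.
    rules.append(["-A", chain, "-j", "REJECT"])
    # Send only this container's traffic through the chain.
    rules.append(["-I", "DOCKER-USER", "1", "-s", f"{source_ip}/32", "-j", chain])
    return rules


# Placeholder values from the documentation address ranges.
for args in build_egress_rules("CODX_EGRESS_example", "172.18.0.2", ["203.0.113.10", "198.51.100.7"]):
    print("iptables", shlex.join(args))
```

The allowed IPs are resolved once, when the policy is applied. If the domain's DNS records change afterwards, the policy has to be removed and applied again.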
make egress-remove
This flushes and removes the per-container chain cleanly.
The entire pipeline from zero to fully hardened:
# One-time host setup
make setup
# Full pipeline
make all
Or step by step:
make build # Build container image
make run # Verify baseline works
make trace # Trace syscalls (Ctrl+C when done)
make seccomp # Generate seccomp profile
make run-locked # Run with seccomp enforced
Add egress filtering on top:
docker compose up -d
make egress-apply # Lock network to allowed domain
make egress-remove # Clean up when done
When you update the base image (python:3.12-slim → python:3.13-slim) or modify app.py, the set of required syscalls may change. Always re-trace:
cp /tmp/syscalls.txt /tmp/syscalls.old.txt   # keep the previous list; make clean deletes it
make clean
make build
make trace
make seccomp
Compare the old and new syscall lists to see what changed:
diff /tmp/syscalls.old.txt /tmp/syscalls.txt
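Add a CI job that:
- Builds the image and runs the app under the tracer
- Regenerates the seccomp profile from the fresh trace
- Fails if the result differs from the committed copy of seccomp-profile.json

# CI script snippet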
make build
make trace
make seccomp
if ! diff seccomp-profile.json seccomp-profile.json.committed; then
  echo "FAIL: Syscall set changed. Review and commit the new profile."
  exit 1
fi
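A textual diff also fails on harmless differences such as key order or whitespace. If that becomes noisy, one option is to compare the allowed syscall sets instead. This is a sketch, not part of the tutorial's pipeline: allowed_syscalls and profile_drift are hypothetical helpers, and the two profiles are inline placeholders standing in for the committed and freshly generated files.

```python
def allowed_syscalls(profile: dict) -> set[str]:
    """Collect every syscall name that the profile explicitly allows."""
    return {
        name
        for rule in profile.get("syscalls", [])
        if rule.get("action") == "SCMP_ACT_ALLOW"
        for name in rule.get("names", [])
    }


def profile_drift(committed: dict, generated: dict) -> tuple[list[str], list[str]]:
    """Return (newly required, no longer observed) syscalls, each sorted."""
    old, new = allowed_syscalls(committed), allowed_syscalls(generated)
    return sorted(new - old), sorted(old - new)


# Placeholder profiles; in CI these would be loaded with json.load().
committed = {"syscalls": [{"names": ["read", "write", "bind"], "action": "SCMP_ACT_ALLOW"}]}
generated = {"syscalls": [{"names": ["write", "read", "sendmmsg"], "action": "SCMP_ACT_ALLOW"}]}

added, removed = profile_drift(committed, generated)
print("newly required:", added)
print("no longer observed:", removed)
```

A newly required syscall deserves review before it is committed, while one that is no longer observed may simply reflect a code path the trace did not exercise.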
Test that your hardening actually blocks things:
# Wrong URL — app should reject before connecting
docker compose run --rm ebpf_seccomp_demo \
--url https://evil.com
# Wrong output path — app should reject path traversal
docker compose run --rm ebpf_seccomp_demo \
--output /etc/passwd
# Blocked egress — iptables should reject the connection
# (apply egress for a different domain, then try the real one)
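The path check can also be exercised directly, without a container. The sketch below reuses the comparison from assert_within_dir in a boolean helper and runs it against a temporary directory tree. The names is_within and run_checks are hypothetical, the directory layout is invented for the test, and the symlink case assumes a Linux host.

```python
import os
import pathlib
import tempfile


def is_within(path_value: str, allowed_dir: pathlib.Path) -> bool:
    """Same comparison as assert_within_dir, returning a bool instead of raising."""
    resolved = pathlib.Path(path_value).resolve()
    return str(resolved).startswith(str(allowed_dir) + os.sep)


def run_checks() -> dict[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp).resolve()
        allowed = root / "output_data"
        allowed.mkdir()
        (root / "output_data_evil").mkdir()
        # A symlink inside the allowed directory that points outside it.
        (allowed / "link").symlink_to(root / "output_data_evil")
        cases = {
            "plain file": allowed / "response.html",
            "dot-dot traversal": allowed / ".." / "escape.txt",
            "sibling with shared prefix": root / "output_data_evil" / "x.txt",
            "symlink out of the directory": allowed / "link" / "x.txt",
        }
        return {label: is_within(str(path), allowed) for label, path in cases.items()}


for label, ok in run_checks().items():
    print(f"{'ALLOW' if ok else 'DENY':<5} {label}")
```

The sibling case is why the check appends os.sep to the allowed directory before comparing: without it, output_data_evil would pass a plain prefix test.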
Seccomp is a powerful layer, but it has clear boundaries:
| What you need | Seccomp can do it? | Use instead |
|---|---|---|
| Block ptrace, mount, reboot | Yes | n/a |
| Allow read but only on /app/data | No | AppArmor or SELinux (MAC) |
| Allow connect but only to 10.0.0.5:443 | No | iptables, Kubernetes NetworkPolicy |
| Prevent writes to /etc/shadow | No | AppArmor, SELinux, read-only rootfs |
| Limit memory or CPU | No | cgroups (--memory, --cpus) |
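The defense-in-depth model in this tutorial combines:
- Application-level validation (URL allowlist, path confinement)
- Container runtime hardening (read-only root filesystem, noexec tmpfs, dropped capabilities, no-new-privileges)
- A deny-by-default seccomp profile generated from an eBPF trace
- Per-container iptables egress filtering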
Each layer covers the gaps of the others. An attacker who bypasses your app validation still hits the read-only filesystem. If they find a writable tmpfs, they can't execute binaries (noexec). If they find an allowed syscall to open a socket, iptables blocks the connection. No single layer is sufficient — the combination is what makes container breakout impractical.
make clean
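This stops all containers and removes the generated files (seccomp-profile.json and the syscall traces under /tmp). It does not touch the host mount directories, so delete /hostmount/output_data and /hostmount/log_data yourself if you no longer need them.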