
[Repost] Docker in Production: Security - Seccomp Security Profiles for Docker

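Secure computing mode (seccomp) is a Linux kernel feature. You can use it to restrict the actions available inside a container. The seccomp() system call operates on the seccomp state of the calling process. You can use this feature to restrict your application's access.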

This feature is available only if Docker has been built with seccomp and the kernel is configured with CONFIG_SECCOMP enabled. To check whether your kernel supports seccomp:

$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
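
The kernel config only says whether seccomp was compiled in. To see whether a filter is actually applied to a given process, the kernel also exposes a `Seccomp:` field in `/proc/<pid>/status` (0 = disabled, 1 = strict, 2 = filter). The helper below is a minimal sketch: the name `seccomp_mode` is hypothetical, and it assumes a kernel recent enough to report that field.

```shell
# seccomp_mode [STATUS_FILE]
# Print the seccomp mode recorded in a /proc/<pid>/status file.
# Defaults to the status file of the process that reads it.
# 0 = disabled, 1 = strict, 2 = filter
seccomp_mode() {
    awk '/^Seccomp:/ { print $2 }' "${1:-/proc/self/status}"
}
```

Calling `seccomp_mode` with no argument inspects the current shell's environment; passing `/proc/<pid>/status` inspects another process.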

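Note: seccomp profiles require seccomp 2.2.1, which is not available on Ubuntu 14.04, Debian Wheezy, or Debian Jessie. To use seccomp on these distributions, you must download the latest static Linux binaries (rather than the packages).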

1. Pass a profile for a container

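The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of more than 300. It is moderately protective while providing wide application compatibility. The default Docker profile can be found here.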

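In effect, the profile is a whitelist: it denies access to all system calls by default and then whitelists specific ones. The profile works by defining a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. The effect of SCMP_ACT_ERRNO is to cause a Permission Denied error. Next, the profile defines a specific list of system calls that are fully allowed, by overriding their action to SCMP_ACT_ALLOW. Finally, some specific rules apply to individual system calls such as personality, socket, and socketcall, to allow variants of those system calls with specific arguments.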
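
The three layers described above (a deny-by-default action, an allow list, and an argument-specific rule) can be sketched as a very small profile. This is an illustration, not Docker's default profile: the helper name `write_minimal_profile` is hypothetical, the syscall list is far too short to start a real container, and the `names` array form is assumed to match the profile format of your Docker release (older releases used a single `name` key per entry).

```shell
# write_minimal_profile PATH
# Write a deliberately tiny whitelist profile to PATH.
# Illustration only: far too restrictive to run a real workload.
write_minimal_profile() {
    cat > "$1" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        { "index": 0, "value": 0, "op": "SCMP_CMP_EQ" }
      ]
    }
  ]
}
EOF
}
```

The first entry fully allows three calls. The second allows personality only when its first argument (index 0) equals 0, which is the "variant with specific arguments" case.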

seccomp is instrumental for running Docker containers with least privilege. Changing the default seccomp profile is not recommended.

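When you run a container, it uses the default profile unless you override it with the --security-opt option. For example, the following explicitly specifies a policy: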

$ docker run --rm \
             -it \
             --security-opt seccomp=/path/to/seccomp/profile.json \
             hello-world
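
To confirm that a profile passed this way is really in force, one option is to read the `Seccomp:` field of `/proc/self/status` from inside the container. The sketch below assumes an image that ships `awk` (`alpine` is used here as an assumption, since `hello-world` contains no shell or tools); the helper name is hypothetical.

```shell
# seccomp_mode_in_container PROFILE [IMAGE]
# Start a throwaway container with the given seccomp setting and print
# the Seccomp mode its process sees (2 = filter active, 0 = no filter).
seccomp_mode_in_container() {
    docker run --rm --security-opt "seccomp=$1" "${2:-alpine}" \
        awk '/^Seccomp:/ { print $2 }' /proc/self/status
}
```

For example, `seccomp_mode_in_container /path/to/seccomp/profile.json` checks a custom profile, and `seccomp_mode_in_container unconfined` shows what the same field looks like with no profile at all.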

1.1 Significant system calls blocked by the default profile

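Docker's default seccomp profile is a whitelist that specifies the calls that are allowed. The table below lists the significant (but not all) system calls that are effectively blocked because they are not on the whitelist. The table includes the reason each system call is blocked.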

Syscall Description
acct Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT.
add_key Prevent containers from using the kernel keyring, which is not namespaced.
adjtimex Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME.
bpf Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN.
clock_adjtime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clock_settime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
clone Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
create_module Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE.
delete_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
finit_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
get_kernel_syms Deny retrieval of exported kernel and module symbols. Obsolete.
get_mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
init_module Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE.
ioperm Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
iopl Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO.
kcmp Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
kexec_file_load Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT.
kexec_load Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT.
keyctl Prevent containers from using the kernel keyring, which is not namespaced.
lookup_dcookie Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN.
mbind Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
mount Deny mounting, already gated by CAP_SYS_ADMIN.
move_pages Syscall that modifies kernel memory and NUMA settings.
name_to_handle_at Sister syscall to open_by_handle_at. Already gated by CAP_DAC_READ_SEARCH.
nfsservctl Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1.
open_by_handle_at Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH.
perf_event_open Tracing/profiling syscall, which could leak a lot of information on the host.
personality Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns.
pivot_root Deny pivot_root, should be privileged operation.
process_vm_readv Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
process_vm_writev Restrict process inspection capabilities, already blocked by dropping CAP_SYS_PTRACE.
ptrace Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_SYS_PTRACE.
query_module Deny manipulation and functions on kernel modules. Obsolete.
quotactl Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN.
reboot Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT.
request_key Prevent containers from using the kernel keyring, which is not namespaced.
set_mempolicy Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE.
setns Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN.
settimeofday Time/date is not namespaced. Also gated by CAP_SYS_TIME.
socket, socketcall Used to send or receive packets and for other socket operations. All socket and socketcall calls are blocked except communication domains AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, and AF_PACKET.
stime Time/date is not namespaced. Also gated by CAP_SYS_TIME.
swapon Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
swapoff Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN.
sysfs Obsolete syscall.
_sysctl Obsolete, replaced by /proc/sys.
umount Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
umount2 Should be a privileged operation. Also gated by CAP_SYS_ADMIN.
unshare Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare --user.
uselib Older syscall related to shared libraries, unused for a long time.
userfaultfd Userspace page fault handling, largely needed for process migration.
ustat Obsolete syscall.
vm86 In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.
vm86old In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN.

2. Run without the default seccomp profile

You can pass unconfined to run a container without the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami
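
To see what the default profile changes, the same command can be run with and without extra `docker run` flags. The wrapper below is a minimal sketch with a hypothetical name. Whether the unconfined run succeeds also depends on host settings outside seccomp (for example, whether unprivileged user namespaces are enabled), so no output is shown here.

```shell
# try_unshare [DOCKER_RUN_FLAGS...]
# Run the unshare test from the example above, forwarding any extra
# flags (such as --security-opt seccomp=unconfined) to docker run.
try_unshare() {
    docker run --rm "$@" debian:jessie \
        unshare --map-root-user --user sh -c whoami
}
```

Running `try_unshare` uses the default profile, where unshare is not on the whitelist, while `try_unshare --security-opt seccomp=unconfined` repeats the call with the profile removed.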