What is runC?
2022-2-22 17:59:0 Author: mp.weixin.qq.com


This article is a featured post from the Kanxue (看雪) forum.
Kanxue forum author ID: 时钟

The OCI standard

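A container runtime is the tool that manages and runs containers. Many container tools exist today, such as docker and rkt, but if every tool shipped its own incompatible runtime, the container ecosystem would fragment. For this reason a group of container vendors jointly defined standards for the container image format and the container runtime, under the Open Container Initiative (OCI).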

OCI bundle

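An OCI bundle is a set of files that conforms to the OCI standard and contains all the data needed to run a container. The files live in a single directory that contains the following two items:
1. config.json: the configuration data for running the container
2. the container's root filesystem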
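As a rough illustration of this layout, the Go sketch below loads a bundle directory, parses a few fields of config.json, and checks that the root filesystem directory exists. The struct covers only a small subset of the file (ociVersion, process.args, root.path), and the names bundleConfig, loadBundle and writeDemoBundle are hypothetical helpers made up for this example; they are not part of runC.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// bundleConfig mirrors a small subset of config.json.
// Hypothetical type for illustration; it is not a runC type.
type bundleConfig struct {
	OCIVersion string `json:"ociVersion"`
	Process    struct {
		Args []string `json:"args"`
	} `json:"process"`
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// loadBundle parses <dir>/config.json and checks that the root filesystem
// directory it points to exists.
func loadBundle(dir string) (*bundleConfig, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "config.json"))
	if err != nil {
		return nil, fmt.Errorf("read config.json: %w", err)
	}
	var cfg bundleConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, fmt.Errorf("parse config.json: %w", err)
	}
	// A relative root path is resolved against the bundle directory.
	rootfs := cfg.Root.Path
	if !filepath.IsAbs(rootfs) {
		rootfs = filepath.Join(dir, rootfs)
	}
	info, err := os.Stat(rootfs)
	if err != nil {
		return nil, fmt.Errorf("stat rootfs: %w", err)
	}
	if !info.IsDir() {
		return nil, fmt.Errorf("rootfs %q is not a directory", rootfs)
	}
	return &cfg, nil
}

// writeDemoBundle creates a throwaway bundle in a temp directory so the
// sketch can run on its own. The caller should remove the directory.
func writeDemoBundle() (string, error) {
	dir, err := os.MkdirTemp("", "bundle-demo")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(dir, "rootfs"), 0o755); err != nil {
		return "", err
	}
	cfg := `{"ociVersion":"1.0.2","process":{"args":["sh"]},"root":{"path":"rootfs"}}`
	if err := os.WriteFile(filepath.Join(dir, "config.json"), []byte(cfg), 0o644); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := writeDemoBundle()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	cfg, err := loadBundle(dir)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.OCIVersion, cfg.Process.Args, cfg.Root.Path)
}
```

A real config.json carries many more sections (mounts, namespaces, hooks and so on); the sketch ignores them on purpose.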

runC architecture

runC's main code logic is built on libcontainer, which was one of the foundations of early Docker and has since been re-wrapped to fit the OCI format.
 
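Taking runc create as an example, the main operation behind it is:
startContainer: reads config.json and converts its contents into the in-memory data structures defined by the OCI standard, tries to create the container, and then performs a different operation depending on the arguments, such as run, start, or restore.
Some of the data structures behind a container are shown below. An interface is defined here that covers all the operations a container needs: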
type BaseContainer interface {
    // Returns the ID of the container
    ID() string

    // Returns the current status of the container.
    Status() (Status, error)

    // State returns the current container's state information.
    State() (*State, error)

    // OCIState returns the current container's state information.
    OCIState() (*specs.State, error)

    // Returns the current config of the container.
    Config() configs.Config

    // Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
    //
    // Some of the returned PIDs may no longer refer to processes in the Container, unless
    // the Container state is PAUSED in which case every PID in the slice is valid.
    Processes() ([]int, error)

    // Returns statistics for the container.
    Stats() (*Stats, error)

    // Set resources of container as configured
    //
    // We can use this to change resources when containers are running.
    //
    Set(config configs.Config) error

    // Start a process inside the container. Returns error if process fails to
    // start. You can track process lifecycle with passed Process structure.
    Start(process *Process) (err error)

    // Run immediately starts the process inside the container. Returns error if process
    // fails to start. It does not block waiting for the exec fifo after start returns but
    // opens the fifo after start returns.
    Run(process *Process) (err error)

    // Destroys the container, if its in a valid state, after killing any
    // remaining running processes.
    //
    // Any event registrations are removed before the container is destroyed.
    // No error is returned if the container is already destroyed.
    //
    // Running containers must first be stopped using Signal(..).
    // Paused containers must first be resumed using Resume(..).
    Destroy() error

    // Signal sends the provided signal code to the container's initial process.
    //
    // If all is specified the signal is sent to all processes in the container
    // including the initial process.
    Signal(s os.Signal, all bool) error

    // Exec signals the container to exec the users process at the end of the init.
    Exec() error
}
On Linux, this interface is wrapped to produce a Linux-specific interface:
// Container is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found.
type Container interface {
    BaseContainer

    // Methods below here are platform specific

    // Checkpoint checkpoints the running container's state to disk using the criu(8) utility.
    Checkpoint(criuOpts *CriuOpts) error

    // Restore restores the checkpointed container to a running state using the criu(8) utility.
    Restore(process *Process, criuOpts *CriuOpts) error

    // If the Container state is RUNNING or CREATED, sets the Container state to PAUSING and pauses
    // the execution of any user processes. Asynchronously, when the container finished being paused the
    // state is changed to PAUSED.
    // If the Container state is PAUSED, do nothing.
    Pause() error

    // If the Container state is PAUSED, resumes the execution of any user processes in the
    // Container before setting the Container state to RUNNING.
    // If the Container state is RUNNING, do nothing.
    Resume() error

    // NotifyOOM returns a read-only channel signaling when the container receives an OOM notification.
    NotifyOOM() (<-chan struct{}, error)

    // NotifyMemoryPressure returns a read-only channel signaling when the container reaches a given pressure level
    NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}
Another important interface is Factory:
type Factory interface {
    // Creates a new container with the given id and starts the initial process inside it.
    // id must be a string containing only letters, digits and underscores and must contain
    // between 1 and 1024 characters, inclusive.
    //
    // The id must not already be in use by an existing container. Containers created using
    // a factory with the same path (and filesystem) must have distinct ids.
    //
    // Returns the new container with a running process.
    //
    // On error, any partially created container parts are cleaned up (the operation is atomic).
    Create(id string, config *configs.Config) (Container, error)

    // Load takes an ID for an existing container and returns the container information
    // from the state. This presents a read only view of the container.
    Load(id string) (Container, error)

    // StartInitialization is an internal API to libcontainer used during the reexec of the
    // container.
    StartInitialization() error

    // Type returns info string about factory type (e.g. lxc, libcontainer...)
    Type() string
}
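To make the contract in the Create and Load comments concrete, here is a minimal in-memory sketch of a factory that enforces the stated ID rule (letters, digits and underscores, 1 to 1024 characters) and rejects duplicate IDs. The types toyFactory and toyContainer are hypothetical and only mimic the shape of the interface; the validation runC actually performs may differ from the rule quoted in the comment.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// idPattern encodes the rule quoted in the Factory comment:
// letters, digits and underscores only.
var idPattern = regexp.MustCompile(`^[A-Za-z0-9_]+$`)

// toyContainer is a stand-in for a real container object (hypothetical).
type toyContainer struct {
	id string
}

// toyFactory keeps containers in memory instead of a state directory.
type toyFactory struct {
	containers map[string]*toyContainer
}

func newToyFactory() *toyFactory {
	return &toyFactory{containers: make(map[string]*toyContainer)}
}

// Create validates the id, refuses duplicates, and registers a new container.
func (f *toyFactory) Create(id string) (*toyContainer, error) {
	if len(id) < 1 || len(id) > 1024 || !idPattern.MatchString(id) {
		return nil, fmt.Errorf("invalid container id %q", id)
	}
	if _, exists := f.containers[id]; exists {
		return nil, fmt.Errorf("container id %q is already in use", id)
	}
	c := &toyContainer{id: id}
	f.containers[id] = c
	return c, nil
}

// Load returns a previously created container by id.
func (f *toyFactory) Load(id string) (*toyContainer, error) {
	c, ok := f.containers[id]
	if !ok {
		return nil, errors.New("container does not exist: " + id)
	}
	return c, nil
}

func main() {
	f := newToyFactory()
	if _, err := f.Create("demo_1"); err != nil {
		fmt.Println("unexpected error:", err)
	}
	_, err := f.Create("demo_1")
	fmt.Println("second create failed:", err != nil)
	_, err = f.Create("bad/id")
	fmt.Println("invalid id rejected:", err != nil)
}
```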
It also has a corresponding Linux implementation:
// LinuxFactory implements the default factory interface for linux based systems.
type LinuxFactory struct {
    // Root directory for the factory to store state.
    Root string

    // InitPath is the path for calling the init responsibilities for spawning
    // a container.
    InitPath string

    // InitArgs are arguments for calling the init responsibilities for spawning
    // a container.
    InitArgs []string

    // CriuPath is the path to the criu binary used for checkpoint and restore of
    // containers.
    CriuPath string

    // New{u,g}idmapPath is the path to the binaries used for mapping with
    // rootless containers.
    NewuidmapPath string
    NewgidmapPath string

    // Validator provides validation to container configurations.
    Validator validate.Validator

    // NewIntelRdtManager returns an initialized Intel RDT manager for a single container.
    NewIntelRdtManager func(config *configs.Config, id string, path string) intelrdt.Manager
}
The concrete implementation of Create in LinuxFactory essentially builds a linuxContainer (which corresponds to the Linux Container interface described above):
type linuxContainer struct {
    id                   string
    root                 string
    config               *configs.Config
    cgroupManager        cgroups.Manager
    intelRdtManager      intelrdt.Manager
    initPath             string
    initArgs             []string
    initProcess          parentProcess
    initProcessStartTime uint64
    criuPath             string
    newuidmapPath        string
    newgidmapPath        string
    m                    sync.Mutex
    criuVersion          int
    state                containerState
    created              time.Time
    fifo                 *os.File
}

func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
    rootlessCg, err := shouldUseRootlessCgroupManager(context)
    if err != nil {
        return nil, err
    }
    config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
        CgroupName:       id,
        UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
        NoPivotRoot:      context.Bool("no-pivot"),
        NoNewKeyring:     context.Bool("no-new-keyring"),
        Spec:             spec,
        RootlessEUID:     os.Geteuid() != 0,
        RootlessCgroups:  rootlessCg,
    })
    if err != nil {
        return nil, err
    }

    factory, err := loadFactory(context)
    if err != nil {
        return nil, err
    }
    return factory.Create(id, config)
}
As shown above, the configuration (config) is loaded first, loadFactory then creates the LinuxFactory, and finally factory.Create(id, config) is called, which returns a linuxContainer.
loadFactory is critical here: at its end it calls libcontainer.New(), which returns the LinuxFactory. Inside New, InitPath is set (InitPath is very important):
// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
    if root != "" {
        if err := os.MkdirAll(root, 0o700); err != nil {
            return nil, err
        }
    }
    l := &LinuxFactory{
        Root:      root,
        InitPath:  "/proc/self/exe",
        InitArgs:  []string{os.Args[0], "init"},
        Validator: validate.New(),
        CriuPath:  "criu",
    }

    for _, opt := range options {
        if opt == nil {
            continue
        }
        if err := opt(l); err != nil {
            return nil, err
        }
    }
    return l, nil
}
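New follows what is commonly called the functional options pattern: defaults are filled in first, then each option function gets a chance to override them. The sketch below reproduces only that pattern with a stripped-down struct. The names toyFactoryConfig, option, withCriuPath and newToyFactoryConfig are hypothetical, and the literal "runc" used for InitArgs[0] is an assumption standing in for os.Args[0].

```go
package main

import (
	"errors"
	"fmt"
)

// toyFactoryConfig is a reduced stand-in for LinuxFactory (hypothetical).
type toyFactoryConfig struct {
	Root     string
	InitPath string
	InitArgs []string
	CriuPath string
}

// option mutates the config and may reject invalid input.
type option func(*toyFactoryConfig) error

// withCriuPath overrides the default criu binary path.
func withCriuPath(path string) option {
	return func(c *toyFactoryConfig) error {
		if path == "" {
			return errors.New("criu path must not be empty")
		}
		c.CriuPath = path
		return nil
	}
}

// newToyFactoryConfig sets defaults first, then applies options in order.
// Nil options are skipped, mirroring the loop in New.
func newToyFactoryConfig(root string, opts ...option) (*toyFactoryConfig, error) {
	c := &toyFactoryConfig{
		Root:     root,
		InitPath: "/proc/self/exe",
		InitArgs: []string{"runc", "init"}, // assumption: stands in for os.Args[0]
		CriuPath: "criu",
	}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		if err := opt(c); err != nil {
			return nil, err
		}
	}
	return c, nil
}

func main() {
	c, err := newToyFactoryConfig("/run/demo", withCriuPath("/usr/sbin/criu"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(c.InitPath, c.InitArgs, c.CriuPath)
}
```

The benefit of this design is that callers who are happy with the defaults pass nothing, while the set of tunable fields can grow without changing the constructor's signature.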
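During LinuxFactory's Create, InitPath and InitArgs are passed on to the linuxContainer. Now that we know how a linuxContainer is created, let us return to startContainer. At its end this function builds a runner struct and calls its run method with spec.Process as the argument; spec.Process is simply the process information from config.json.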
 
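The run method does two things. First, it uses newProcess to create a libcontainer.Process struct from the process section of config.json; process-related settings such as rlimits and capabilities are filled in at this point. Second, it performs one of three operations depending on the action: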
switch r.action {
case CT_ACT_CREATE:
    err = r.container.Start(process)
case CT_ACT_RESTORE:
    err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
    err = r.container.Run(process)
default:
    panic("Unknown action")
}
The Process struct is shown below; most of its contents come from config.json:
// Process specifies the configuration and IO for a process inside
// a container.
type Process struct {
    // The command to be run followed by any arguments.
    Args []string

    // Env specifies the environment variables for the process.
    Env []string

    // User will set the uid and gid of the executing process running inside the container
    // local to the container's user and group configuration.
    User string

    // AdditionalGroups specifies the gids that should be added to supplementary groups
    // in addition to those that the user belongs to.
    AdditionalGroups []string

    // Cwd will change the processes current working directory inside the container's rootfs.
    Cwd string

    // Stdin is a pointer to a reader which provides the standard input stream.
    Stdin io.Reader

    // Stdout is a pointer to a writer which receives the standard output stream.
    Stdout io.Writer

    // Stderr is a pointer to a writer which receives the standard error stream.
    Stderr io.Writer

    // ExtraFiles specifies additional open files to be inherited by the container
    ExtraFiles []*os.File

    // Initial sizings for the console
    ConsoleWidth  uint16
    ConsoleHeight uint16

    // Capabilities specify the capabilities to keep when executing the process inside the container
    // All capabilities not specified will be dropped from the processes capability mask
    Capabilities *configs.Capabilities

    // AppArmorProfile specifies the profile to apply to the process and is
    // changed at the time the process is execed
    AppArmorProfile string

    // Label specifies the label to apply to the process. It is commonly used by selinux
    Label string

    // NoNewPrivileges controls whether processes can gain additional privileges.
    NoNewPrivileges *bool

    // Rlimits specifies the resource limits, such as max open files, to set in the container
    // If Rlimits are not set, the container will inherit rlimits from the parent process
    Rlimits []configs.Rlimit

    // ConsoleSocket provides the masterfd console.
    ConsoleSocket *os.File

    // Init specifies whether the process is the first process in the container.
    Init bool

    ops processOperations

    LogLevel string

    // SubCgroupPaths specifies sub-cgroups to run the process in.
    // Map keys are controller names, map values are paths (relative to
    // container's top-level cgroup).
    //
    // If empty, the default top-level container's cgroup is used.
    //
    // For cgroup v2, the only key allowed is "".
    SubCgroupPaths map[string]string
}
The Start method:
func (c *linuxContainer) Start(process *Process) error {
    c.m.Lock()
    defer c.m.Unlock()
    if c.config.Cgroups.Resources.SkipDevices {
        return errors.New("can't start container with SkipDevices set")
    }
    if process.Init {
        if err := c.createExecFifo(); err != nil {
            return err
        }
    }
    if err := c.start(process); err != nil {
        if process.Init {
            c.deleteExecFifo()
        }
        return err
    }
    return nil
}
As shown above, Start mainly creates a FIFO (used for blocking; it comes up again later) and then calls the lowercase start method.
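Why a FIFO is useful for blocking can be shown in isolation. Opening a FIFO for writing blocks until another party opens it for reading, so one side can be parked until the other side decides to release it. The sketch below plays both roles inside one program, under the assumption that it runs on a Unix-like system where syscall.Mkfifo is available. The function name fifoHandshake, the file name and the payload "0" are illustrative choices for this example and are not taken from the listing above.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// fifoHandshake creates a FIFO, parks a goroutine on opening it for writing,
// then releases that goroutine by opening the FIFO for reading.
// It returns whatever the writer sent.
func fifoHandshake() (string, error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "exec.fifo")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		return "", err
	}

	writerErr := make(chan error, 1)
	go func() {
		// This open blocks until some reader opens the same FIFO.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			writerErr <- err
			return
		}
		defer w.Close()
		_, err = w.Write([]byte("0"))
		writerErr <- err
	}()

	// Opening the read end is what lets the blocked writer continue.
	r, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return "", err
	}
	defer r.Close()

	// ReadAll returns once the writer has closed its end.
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	if err := <-writerErr; err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	got, err := fifoHandshake()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("writer was released and sent %q\n", got)
}
```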
func (c *linuxContainer) start(process *Process) (retErr error) {
    parent, err := c.newParentProcess(process)
    if err != nil {
        return fmt.Errorf("unable to create new parent process: %w", err)
    }

    logsDone := parent.forwardChildLogs()
    if logsDone != nil {
        defer func() {
            // Wait for log forwarder to finish. This depends on
            // runc init closing the _LIBCONTAINER_LOGPIPE log fd.
            err := <-logsDone
            if err != nil && retErr == nil {
                retErr = fmt.Errorf("unable to forward init logs: %w", err)
            }
        }()
    }

    if err := parent.start(); err != nil {
        return fmt.Errorf("unable to start container process: %w", err)
    }

    if process.Init {
        c.fifo.Close()
        if c.config.Hooks != nil {
            s, err := c.currentOCIState()
            if err != nil {
                return err
            }

            if err := c.config.Hooks[configs.Poststart].RunHooks(s); err != nil {
                if err := ignoreTerminateErrors(parent.terminate()); err != nil {
                    logrus.Warn(fmt.Errorf("error running poststart hook: %w", err))
                }
                return err
            }
        }
    }
    return nil
}
The first step of this method returns an initProcess struct, which implements the parentProcess interface and is created by linuxContainer's newInitProcess function.
type initProcess struct {
    cmd             *exec.Cmd
    messageSockPair filePair
    logFilePair     filePair
    config          *initConfig
    manager         cgroups.Manager
    intelRdtManager intelrdt.Manager
    container       *linuxContainer
    fds             []string
    process         *Process
    bootstrapData   io.Reader
    sharePidns      bool
}
The interface is as follows:
type parentProcess interface {
    // pid returns the pid for the running process.
    pid() int

    // start starts the process execution.
    start() error

    // send a SIGKILL to the process and wait for the exit.
    terminate() error

    // wait waits on the process returning the process state.
    wait() (*os.ProcessState, error)

    // startTime returns the process start time.
    startTime() (uint64, error)

    signal(os.Signal) error

    externalDescriptors() []string

    setExternalDescriptors(fds []string)

    forwardChildLogs() chan error
}
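Throughout newParentProcess, a socket pair and a pipe pair are created first. The child ends of the socket pair and of the pipe (childsock and childpipe) are then used to build a cmd template. The command in this template is exactly the path set earlier in InitPath, "/proc/self/exe", with the argument "init", which means runC executes itself with init as the argument. The socket and pipe provide data communication between the cmd and its parent process: they are added to cmd.ExtraFiles, and the corresponding file descriptor numbers are placed in environment variables.
Next comes a check on whether the process is the init process. If it is not, newSetnsProcess is called and returns a setnsProcess struct, which also implements the parentProcess interface; newSetnsProcess is mainly used to create a new process inside an existing container.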
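The mechanics of handing descriptors to a re-executed copy of the same binary can be sketched with the standard os/exec package alone. In Go, entry i of cmd.ExtraFiles becomes file descriptor 3+i in the child, so the parent can compute each number in advance and publish it through an environment variable. The helper names addExtraFile, buildInitCmd and demoCmd and the variable names _DEMO_PIPE_0 and _DEMO_PIPE_1 are hypothetical; the command is only prepared here, never started.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// stdioFdCount is the number of descriptors (stdin, stdout, stderr) that
// precede ExtraFiles in the child process.
const stdioFdCount = 3

// addExtraFile appends f to cmd.ExtraFiles and records, in envName, the
// descriptor number under which the child will see it.
func addExtraFile(cmd *exec.Cmd, f *os.File, envName string) int {
	cmd.ExtraFiles = append(cmd.ExtraFiles, f)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(cmd.Env, envName+"="+strconv.Itoa(fd))
	return fd
}

// buildInitCmd prepares (but does not start) a re-exec of the current binary
// with the single argument "init", passing the given files to the child.
func buildInitCmd(files ...*os.File) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Args[0] = os.Args[0] // keep the original program name as argv[0]
	for i, f := range files {
		addExtraFile(cmd, f, "_DEMO_PIPE_"+strconv.Itoa(i))
	}
	return cmd
}

// demoCmd creates two pipes and passes their write ends to the prepared
// command. The returned cleanup function closes every pipe end.
func demoCmd() (*exec.Cmd, func(), error) {
	r1, w1, err := os.Pipe()
	if err != nil {
		return nil, nil, err
	}
	r2, w2, err := os.Pipe()
	if err != nil {
		r1.Close()
		w1.Close()
		return nil, nil, err
	}
	cleanup := func() {
		for _, f := range []*os.File{r1, w1, r2, w2} {
			f.Close()
		}
	}
	return buildInitCmd(w1, w2), cleanup, nil
}

func main() {
	cmd, cleanup, err := demoCmd()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer cleanup()
	fmt.Println(cmd.Path, cmd.Args[1:], cmd.Env)
}
```

In the child, each variable would be parsed back into an integer and wrapped with os.NewFile to recover a usable file object.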
 
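Next, includeExecFifo() is called. It opens the exec.fifo file created earlier and stores it in cmd.ExtraFiles and in an environment variable. Finally the key function newInitProcess is called to create the initProcess struct: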
func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
    cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
    nsMaps := make(map[configs.NamespaceType]string)
    for _, ns := range c.config.Namespaces {
        if ns.Path != "" {
            nsMaps[ns.Type] = ns.Path
        }
    }
    _, sharePidns := nsMaps[configs.NEWPID]
    data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps, initStandard)
    if err != nil {
        return nil, err
    }

    if c.shouldSendMountSources() {
        // Elements on this slice will be paired with mounts (see StartInitialization() and
        // prepareRootfs()). This slice MUST have the same size as c.config.Mounts.
        mountFds := make([]int, len(c.config.Mounts))
        for i, m := range c.config.Mounts {
            if !m.IsBind() {
                // Non bind-mounts do not use an fd.
                mountFds[i] = -1
                continue
            }

            // The fd passed here will not be used: nsexec.c will overwrite it with dup3(). We just need
            // to allocate a fd so that we know the number to pass in the environment variable. The fd
            // must not be closed before cmd.Start(), so we reuse messageSockPair.child because the
            // lifecycle of that fd is already taken care of.
            cmd.ExtraFiles = append(cmd.ExtraFiles, messageSockPair.child)
            mountFds[i] = stdioFdCount + len(cmd.ExtraFiles) - 1
        }

        mountFdsJson, err := json.Marshal(mountFds)
        if err != nil {
            return nil, fmt.Errorf("Error creating _LIBCONTAINER_MOUNT_FDS: %w", err)
        }

        cmd.Env = append(cmd.Env,
            "_LIBCONTAINER_MOUNT_FDS="+string(mountFdsJson),
        )
    }

    init := &initProcess{
        cmd:             cmd,
        messageSockPair: messageSockPair,
        logFilePair:     logFilePair,
        manager:         c.cgroupManager,
        intelRdtManager: c.intelRdtManager,
        config:          c.newInitConfig(p),
        container:       c,
        process:         p,
        bootstrapData:   data,
        sharePidns:      sharePidns,
    }
    c.initProcess = init
    return init, nil
}
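This function first sets the environment variable that marks the init type as standard, then reads from the configuration derived from config.json which namespaces have to be created or joined and stores that information, and finally creates the initProcess struct. The shouldSendMountSources branch in the middle needs no special attention; it exists to support mounting certain directories. At this point the parentProcess struct is essentially set up.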
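The namespace bookkeeping in newInitProcess can be sketched separately: a namespace entry that carries a path refers to an existing namespace to join, while an entry without a path asks for a new namespace and contributes a CLONE_NEW* bit. The types and the splitNamespaces helper below are hypothetical simplifications, not libcontainer's own; the flag values are the Linux constants, written out as literals so the sketch builds on any platform.

```go
package main

import "fmt"

// nsType names a namespace kind (hypothetical simplified type).
type nsType string

const (
	nsMount nsType = "mount"
	nsUTS   nsType = "uts"
	nsIPC   nsType = "ipc"
	nsUser  nsType = "user"
	nsPID   nsType = "pid"
	nsNet   nsType = "net"
)

// cloneFlagFor maps each namespace kind to its Linux CLONE_NEW* value.
var cloneFlagFor = map[nsType]uintptr{
	nsMount: 0x00020000, // CLONE_NEWNS
	nsUTS:   0x04000000, // CLONE_NEWUTS
	nsIPC:   0x08000000, // CLONE_NEWIPC
	nsUser:  0x10000000, // CLONE_NEWUSER
	nsPID:   0x20000000, // CLONE_NEWPID
	nsNet:   0x40000000, // CLONE_NEWNET
}

// namespace describes one entry of the namespace configuration.
// An empty Path means "create a new namespace of this type".
type namespace struct {
	Type nsType
	Path string
}

// splitNamespaces returns the clone flags for namespaces that must be
// created and a type-to-path map for namespaces that must be joined.
func splitNamespaces(list []namespace) (uintptr, map[nsType]string) {
	var flags uintptr
	joinPaths := make(map[nsType]string)
	for _, ns := range list {
		if ns.Path != "" {
			joinPaths[ns.Type] = ns.Path
			continue
		}
		flags |= cloneFlagFor[ns.Type]
	}
	return flags, joinPaths
}

func main() {
	flags, joinPaths := splitNamespaces([]namespace{
		{Type: nsPID},
		{Type: nsMount},
		{Type: nsNet, Path: "/proc/1234/ns/net"},
	})
	// A PID namespace with a path would mean the PID namespace is shared.
	_, sharePidns := joinPaths[nsPID]
	fmt.Printf("clone flags: %#x, join: %v, sharePidns: %v\n", flags, joinPaths, sharePidns)
}
```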
 
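Back in the start method, parentProcess's start() function is called next, which here is the start function implemented by initProcess. It launches the /proc/self/exe process set up earlier with the argument init, places the child process into its cgroup, and then sends data to the child over the socket. The key point is that a runC init child process is started, because the container being created may have new namespaces.
Running runC init as a child makes it easy to switch namespaces with setns(). However, setns() is not allowed to switch certain namespaces (the user namespace, for example) in a multithreaded process, and the Go runtime is multithreaded. The namespaces therefore have to be set up before the Go runtime starts, which is done with C code invoked through cgo.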
 
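In the cgo code, the pipe is first obtained from the environment variable (set by the parent process earlier), and the configuration sent by the parent is read in netlink message format. A socket pair is then created again so that this process can communicate with the grandchild process. Next, in a state-machine style, clone is used to create a process in the namespaces specified by config.json, after which the original child process terminates with exit(0).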
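For readers unfamiliar with the netlink wire format, the attribute part can be sketched as a simple type-length-value encoding: a 16-bit length (header plus payload), a 16-bit type, the payload, and padding up to a 4-byte boundary. The sketch below covers only that attribute layout. It omits the netlink message header, assumes a little-endian host (netlink uses host byte order), and uses made-up attribute type numbers 1 and 2; none of the helper names come from runC.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// attrHeaderLen is the size of the attribute header: uint16 length + uint16 type.
const attrHeaderLen = 4

// align4 rounds n up to the next multiple of 4.
func align4(n int) int { return (n + 3) &^ 3 }

// encodeBytesAttr encodes one attribute. The length field covers header and
// payload but not the trailing padding. Payloads must fit in a uint16 length.
func encodeBytesAttr(attrType uint16, payload []byte) []byte {
	n := attrHeaderLen + len(payload)
	buf := make([]byte, align4(n))
	binary.LittleEndian.PutUint16(buf[0:2], uint16(n))
	binary.LittleEndian.PutUint16(buf[2:4], attrType)
	copy(buf[attrHeaderLen:], payload)
	return buf
}

// encodeUint32Attr encodes a 32-bit value as an attribute payload.
func encodeUint32Attr(attrType uint16, value uint32) []byte {
	payload := make([]byte, 4)
	binary.LittleEndian.PutUint32(payload, value)
	return encodeBytesAttr(attrType, payload)
}

// attrUint32 decodes a 32-bit payload, returning 0 if it is too short.
func attrUint32(payload []byte) uint32 {
	if len(payload) < 4 {
		return 0
	}
	return binary.LittleEndian.Uint32(payload)
}

// decodeAttrs walks a buffer of concatenated attributes and returns the
// payload of each attribute keyed by its type.
func decodeAttrs(buf []byte) (map[uint16][]byte, error) {
	out := make(map[uint16][]byte)
	for len(buf) > 0 {
		if len(buf) < attrHeaderLen {
			return nil, errors.New("truncated attribute header")
		}
		n := int(binary.LittleEndian.Uint16(buf[0:2]))
		attrType := binary.LittleEndian.Uint16(buf[2:4])
		if n < attrHeaderLen || n > len(buf) {
			return nil, errors.New("attribute length out of range")
		}
		out[attrType] = buf[attrHeaderLen:n]
		step := align4(n)
		if step > len(buf) {
			step = len(buf)
		}
		buf = buf[step:]
	}
	return out, nil
}

func main() {
	// Type numbers 1 and 2 are arbitrary values chosen for this example.
	msg := append(encodeUint32Attr(1, 0x20000000), encodeBytesAttr(2, []byte("/proc/1/ns/net"))...)
	attrs, err := decodeAttrs(msg)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("flags=%#x path=%s\n", attrUint32(attrs[1]), attrs[2])
}
```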
 
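Returning to create: after the init process is launched, cgroup limits are applied to it, which also helps prevent the child process from escaping via cgroups in the following steps. The parent then sends the bootstrapData to the init process, after which create obtains the pid of the child process created by init.
Next, the fds opened by the child process are obtained through the pipe and saved. After a series of further settings, sendConfig sends the information about the process to execute from config.json. What follows is container initialization and execution of the process configured in config.json; for details, see the Init function of linuxStandardInit in standard_init_linux.go. That completes the rough walkthrough of how a container is started.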

References:

https://segmentfault.com/a/1190000017576314#item-1
https://github.com/opencontainers/runc

 

Kanxue ID: 时钟

https://bbs.pediy.com/user-home-831025.htm

*This article was originally written by 时钟 on the Kanxue forum. When reposting, please credit the Kanxue community.

Source: http://mp.weixin.qq.com/s?__biz=MjM5NTc2MDYxMw==&mid=2458429449&idx=1&sn=6d14a8ccc3647a456624380bc290b02f&chksm=b18f988386f81195f03610c86b945e39cd3cde808d80dec2cf074f2d8ee91190e6fcd97e3802#rd