• Need advice about fixing PROC mount failures in a DIY Linux container

    From Lew Pitcher@21:1/5 to All on Sat Jan 7 01:27:28 2023
    XPost: alt.os.linux.slackware, comp.os.linux.misc, comp.unix.programmer

    Hi, all

    I've come late to the party, and have just started learning
    about the ins and outs of Linux containers. To get a better
    understanding of the subject, I decided to learn about the
    underlying technologies by building my own container software.

    I've modelled my DIY container on Brian Swetland's mkbox
    container[1], and have a demonstration program that works
    on my development system (a 64bit AMD Ryzen 5 3400G with
    Radeon Vega Graphics, running Slackware Linux 14.2 with
    the 4.4.301 kernel and all available patches applied).
    [1] https://github.com/swetland/mkbox


    However, when I run either Brian's mkbox or my demo program
    on my "production" system (another 64bit AMD Ryzen 5 3400G
    with Radeon Vega Graphics, running Slackware Linux 14.2 with
    the 4.4.301 kernel and all available patches applied), the
    container breaks while trying to mount the proc filesystem
    to the new (isolated) root fs.

    Specifically, I get an "Operation not permitted" error when
    I try to
    mount("proc","proc","proc",MS_REC,NULL)
    /but/ ONLY ON THIS ONE SYSTEM.

    This failure affects both my DIY container and Brian's mkbox
    container.

    With my DIY container, I've checked the capabilities given
    to the container process, and they are identical and complete
    on both systems. On both systems, I run the container process
    (mine and Brian's) from the same unprivileged UID/GID.
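
    One way to make that comparison, sketched here as an illustration
    rather than the exact check used above, is to read the capability
    bitmasks from /proc/self/status inside the container process and
    compare the values on the two systems. The helper names
    parse_cap_line() and read_own_caps() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
** If line is a /proc/<pid>/status entry for the named capability set
** ("CapInh", "CapPrm", "CapEff", ...), store its bitmask in *mask
** and return 1; otherwise return 0.
*/
int parse_cap_line(const char *line, const char *name, unsigned long long *mask)
{
    size_t n = strlen(name);

    if (strncmp(line, name, n) != 0 || line[n] != ':') return 0;
    return sscanf(line + n + 1, "%llx", mask) == 1;
}

/* Read the named capability set of the calling process; 0 on success, -1 on failure. */
int read_own_caps(const char *name, unsigned long long *mask)
{
    char line[256];
    int found = 0;
    FILE *status = fopen("/proc/self/status", "r");

    if (status == NULL) return -1;
    while (!found && fgets(line, sizeof line, status) != NULL)
        found = parse_cap_line(line, name, mask);
    fclose(status);
    return found ? 0 : -1;
}
```

    Calling read_own_caps("CapEff", &mask) after unshare() and printing
    mask with "%016llx" on each system gives values that can be compared
    directly.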

    I have to conclude that there's a difference in the two
    environments that causes this problem, but I don't know what
    that difference is. Both systems use the same type of CPU, the
    same amount of memory, the same 64-bit addressing mode,
    the same kernel, and the same distribution (with the same
    essential utilities).

    There /are/ differences in the two systems:
    On the development system, my user is a member of a
    number of groups that it is not a member of on the
    "production" system. I run a root pulseaudio (I have my
    reasons) on the development system that I do not on
    the "production" system. Et cetera.

    Can anyone suggest an environmental factor or set of
    factors that might cause this behaviour?

    For reference, I include a copy of a minimal implementation
    of my DIY container that illustrates the problem, along with
    captures of both a successful run on my development system
    and an unsuccessful run on my production system.

    ========== demo.c ==========
    /*
    ** demonstrate selective problem with Slackware Linux 14.2
    ** user namespace creation (Kernel 4.4.301)
    */

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <sys/mount.h>
    #include <sched.h>
    #include <string.h>
    #include <errno.h>

    /* pivot_root() prototype not supplied by headers */
    extern int pivot_root(const char *new_root, const char *put_old);

    void Die(int line); /* generate error message and exit process */
    #define DIE() Die(__LINE__)

    int main(void)
    {
        char *fauxRoot = "./.fauxroot", /* will be our new root filesystem */
             *oldRoot  = ".oldroot",    /* where pivot_root puts old root fs */
             *oldProc  = ".oldproc",    /* where we temp relocate /proc to */
             *newProc  = "proc";        /* where we mount /proc to */
        pid_t init_pid;

        umask(0);

        rmdir(fauxRoot); if (mkdir(fauxRoot,0777)) DIE();

        if (unshare(CLONE_NEWUSER|CLONE_NEWNS|CLONE_NEWPID)) DIE();

        if (mount("none","/",NULL,MS_REC|MS_PRIVATE,NULL)) DIE();
        if (mount(fauxRoot,fauxRoot,NULL,MS_BIND|MS_NOSUID,NULL)) DIE();
        if (chdir(fauxRoot)) DIE();

        rmdir(oldRoot); if (mkdir(oldRoot,0751)) DIE();
        rmdir(oldProc); if (mkdir(oldProc,0755)) DIE();
        rmdir(newProc); if (mkdir(newProc,0755)) DIE();

        if (mount("/proc",oldProc,NULL,MS_BIND|MS_REC,NULL)) DIE();

        /* set new uid, gid */
        {
            FILE *map;

            if ((map = fopen("/proc/self/uid_map","w")) == NULL) DIE();
            fprintf(map,"0 %lu 1\n",(unsigned long)getuid());
            fclose(map);

            if ((map = fopen("/proc/self/setgroups","w")) == NULL) DIE();
            fwrite("deny",4,1,map);
            fclose(map);

            if ((map = fopen("/proc/self/gid_map","w")) == NULL) DIE();
            fprintf(map,"0 %lu 1\n",(unsigned long)getgid());
            fclose(map);
        }

        if (pivot_root(".",oldRoot)) DIE();
        if (umount2(oldRoot,MNT_DETACH)) DIE();
        if (rmdir(oldRoot)) DIE();

        switch (init_pid = fork())
        {
            case -1:
                DIE();
                break;

            case 0:
                if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
                if (umount2(oldProc,MNT_DETACH)) DIE();
                if (rmdir(oldProc)) DIE();
                printf("INIT: my pid is %lu\n",(unsigned long)getpid());
                break;

            default:
                printf("PARENT: INIT pid is %lu\n",(unsigned long)init_pid);
                wait(NULL);
                break;
        }

        return EXIT_SUCCESS;
    }

    void Die(int line)
    {
        fprintf(stderr,"Error encountered at line %d: %s\n",line,strerror(errno));
        exit(EXIT_FAILURE);
    }

    ========== successful execution on development system ==========
    Script started on Fri 06 Jan 2023 08:20:12 PM EST
    20:20 $ uname -a
    Linux wordsworth 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
    20:20 $ cat /etc/slackware-version
    Slackware 14.2
    20:20 $ rm demo
    20:20 $ rm -rf .fauxroot
    20:20 $ cc -o demo demo.c
    20:20 $ ./demo
    PARENT: INIT pid is 558
    INIT: my pid is 1
    20:20 $ ls -laR .fauxroot
    .fauxroot:
    total 12
    drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 .
    drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:20 ..
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 proc

    .fauxroot/proc:
    total 8
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 .
    drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 ..
    20:21 $ exit
    exit

    Script done on Fri 06 Jan 2023 08:21:02 PM EST


    ========== unsuccessful execution on production system ==========
    Script started on Fri Jan 6 20:21:11 2023
    ~/code/namespaces $ uname -a
    Linux merlin 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
    ~/code/namespaces $ cat /etc/slackware-version
    Slackware 14.2
    ~/code/namespaces $ rm demo
    ~/code/namespaces $ rm -rf .fauxroot
    ~/code/namespaces $ cc -o demo demo.c
    ~/code/namespaces $ ./demo
    PARENT: INIT pid is 1651
    Error encountered at line 77: Operation not permitted
    ~/code/namespaces $ nl -ba demo.c | grep ' 77'
    77 if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
    ~/code/namespaces $ ls -laR .fauxroot
    .fauxroot:
    total 16
    drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 .
    drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:21 ..
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .oldproc
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 proc

    .fauxroot/.oldproc:
    total 8
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
    drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..

    .fauxroot/proc:
    total 8
    drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
    drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..
    ~/code/namespaces $ exit
    exit

    Script done on Fri Jan 6 20:22:50 2023




    --
    Lew Pitcher
    "In Skills, We Trust"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lew Pitcher@21:1/5 to Lew Pitcher on Sat Jan 7 02:12:43 2023
    XPost: alt.os.linux.slackware, comp.os.linux.misc, comp.unix.programmer

    On Sat, 07 Jan 2023 01:27:28 +0000, Lew Pitcher wrote:

    Hi, all

    I've come late to the party, and have just started learning
    about the ins and outs of Linux containers. To get a better
    understanding of the subject, I decided to learn about the
    underlying technologies by building my own container software.

    I've modelled my DIY container on Brian Swetland's mkbox
    container[1], and have a demonstration program that works
    on my development system (a 64bit AMD Ryzen 5 3400G with
    Radeon Vega Graphics, running Slackware Linux 14.2 with
    the 4.4.301 kernel and all available patches applied).
    [1] https://github.com/swetland/mkbox


    However, when I run either Brian's mkbox or my demo program
    on my "production" system (another 64bit AMD Ryzen 5 3400G
    with Radeon Vega Graphics, running Slackware Linux 14.2 with
    the 4.4.301 kernel and all available patches applied), the
    container breaks while trying to mount the proc filesystem
    to the new (isolated) root fs.

    Specifically, I get an "Operation not permitted" error when
    I try to
    mount("proc","proc","proc",MS_REC,NULL)
    /but/ ONLY ON THIS ONE SYSTEM.

    This failure affects both my DIY container and Brian's mkbox
    container.

    With my DIY container, I've checked the capabilities given
    to the container process, and they are identical and complete
    on both systems. On both systems, I run the container process
    (mine and Brian's) from the same unprivileged UID/GID.

    I have to conclude that there's a difference in the two
    environments that causes this problem, but I don't know what
    that difference is. Both systems use the same type of CPU, the
    same amount of memory, the same 64-bit addressing mode,
    the same kernel, and the same distribution (with the same
    essential utilities).

    There /are/ differences in the two systems:
    On the development system, my user is a member of a
    number of groups that it is not a member of on the
    "production" system. I run a root pulseaudio (I have my
    reasons) on the development system that I do not on
    the "production" system. Et cetera.

    Can anyone suggest an environmental factor or set of
    factors that might cause this behaviour?

    [snip]


    Well, I can answer my own question, now. But the answer
    leads to more questions.

    The reason I get "Operation not permitted" on the
    container /proc mount on my "production" system is that
    I also run an nfs server on my "production" system (and
    do not run one on my development system), and this nfs
    server maintains two mountpoints within the /proc
    filesystem.

    Apparently, the attempt to mount /proc within my container
    was blocked by the existence of these two mount points
    (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
    rpc and nfs servers, and umounted these two mounts, I could
    successfully run my demo container.

    /Now/ the question is: how do I get my container /proc mount
    to ignore or bypass these two nfsd mounts?
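
    A small diagnostic along these lines can show in advance whether
    anything is mounted below /proc. It is only a sketch: the function
    names are made up for illustration, and it assumes the field layout
    of /proc/self/mountinfo described in proc(5) (mount point in the
    fifth field, filesystem type right after the " - " separator).

```c
#include <stdio.h>
#include <string.h>

/*
** Return 1 if path lies strictly below dir, e.g. "/proc/fs/nfsd"
** below "/proc". dir must not end in a slash.
*/
int is_below(const char *path, const char *dir)
{
    size_t n = strlen(dir);

    return strncmp(path, dir, n) == 0 && path[n] == '/' && path[n + 1] != '\0';
}

/*
** Extract the mount point and the filesystem type from one
** mountinfo line. Returns 1 on success, 0 if the line does not parse.
*/
int parse_mountinfo(const char *line, char *mnt, size_t mntsz,
                    char *fstype, size_t fssz)
{
    char point[4096], type[256];
    const char *sep = strstr(line, " - ");

    if (sep == NULL) return 0;
    if (sscanf(line, "%*d %*d %*s %*s %4095s", point) != 1) return 0;
    if (sscanf(sep + 3, "%255s", type) != 1) return 0;
    snprintf(mnt, mntsz, "%s", point);
    snprintf(fstype, fssz, "%s", type);
    return 1;
}

/* Print every mount below dir found in the mountinfo stream; return the count. */
int list_submounts(FILE *in, const char *dir, FILE *out)
{
    char line[8192], mnt[4096], fstype[256];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (parse_mountinfo(line, mnt, sizeof mnt, fstype, sizeof fstype)
            && is_below(mnt, dir)) {
            fprintf(out, "%s (%s)\n", mnt, fstype);
            count++;
        }
    }
    return count;
}
```

    Passing it fopen("/proc/self/mountinfo","r") and "/proc" before the
    unshare() call lists the candidates; on the "production" system, with
    the nfs server running, I would expect it to report /proc/fs/nfs and
    /proc/fs/nfsd.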


    --
    Lew Pitcher
    "In Skills, We Trust"

  • From Jasen Betts@21:1/5 to Lew Pitcher on Sat Jan 7 07:06:37 2023
    XPost: alt.os.linux.slackware, comp.os.linux.misc, comp.unix.programmer

    On 2023-01-07, Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    On Sat, 07 Jan 2023 01:27:28 +0000, Lew Pitcher wrote:

    I try to
    mount("proc","proc","proc",MS_REC,NULL)
    /but/ ONLY ON THIS ONE SYSTEM.

    Well, I can answer my own question, now. But the answer
    leads to more questions.

    The reason I get "Operation not permitted" on the
    container /proc mount on my "production" system is that
    I also run an nfs server on my "production" system (and
    do not run one on my development system), and this nfs
    server maintains two mountpoints within the /proc
    filesystem.

    Apparently, the attempt to mount /proc within my container
    was blocked by the existence of these two mount points
    (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
    rpc and nfs servers, and umounted these two mounts, I could
    successfully run my demo container.

    /Now/ the question is: how do I get my container /proc mount
    to ignore or bypass these two nfsd mounts?

    What's the difference between mount() and /bin/mount?

    --
    Jasen.
    pǝsɹǝʌǝɹ sʇɥƃᴉɹ ll∀

  • From John-Paul Stewart@21:1/5 to Lew Pitcher on Sat Jan 7 11:41:34 2023
    XPost: alt.os.linux.slackware, comp.os.linux.misc, comp.unix.programmer

    [Followups set to comp.os.linux.misc since I don't read any of the other groups]

    On 1/6/23 21:12, Lew Pitcher wrote:

    The reason I get "Operation not permitted" on the
    container /proc mount on my "production" system is that
    I also run an nfs server on my "production" system (and
    do not run one on my development system), and this nfs
    server maintains two mountpoints within the /proc
    filesystem.

    Apparently, the attempt to mount /proc within my container
    was blocked by the existence of these two mount points
    (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
    rpc and nfs servers, and umounted these two mounts, I could
    successfully run my demo container.

    /Now/ the question is: how do I get my container /proc mount
    to ignore or bypass these two nfsd mounts?

    In your OP you showed that you've got MS_REC in the mountflags field,
    which will cause a recursive mount; i.e., you've explicitly asked for
    the inclusion of the NFS-related subtrees. Have you tried without that
    flag? MS_BIND would seem a more appropriate choice instead, IMHO, since
    it doesn't do the recursion. Then, by default, the subtrees will be
    excluded.

    See also the section on "Changing the propagation type of an existing
    mount" in the mount(2) man page for other ways to prevent the NFS
    subtrees from being processed recursively. That might be relevant if
    you want to recurse into other parts of the /proc tree, just not the two
    directories you've named.

  • From Rainer Weikusat@21:1/5 to Lew Pitcher on Mon Jan 9 19:27:13 2023
    XPost: alt.os.linux.slackware, comp.os.linux.misc, comp.unix.programmer

    Lew Pitcher <lew.pitcher@digitalfreehold.ca> writes:

    [...]

    Well, I can answer my own question, now. But the answer
    leads to more questions.

    The reason I get "Operation not permitted" on the
    container /proc mount on my "production" system is that
    I also run an nfs server on my "production" system (and
    do not run one on my development system), and this nfs
    server maintains two mountpoints within the /proc
    filesystem.

    Apparently, the attempt to mount /proc within my container
    was blocked by the existence of these two mount points
    (/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
    rpc and nfs servers, and umounted these two mounts, I could
    successfully run my demo container.

    /Now/ the question is: how do I get my container /proc mount
    to ignore or bypass these two nfsd mounts?

    Instead of doing a bind mount of a proc filesystem already mounted
    somewhere, you could mount a new instance of it. The command for this
    would be

    mount -t proc proc <mount point>

    You'll generally also want to mount sysfs, BTW.
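
    From C, the counterpart of that command might look like the sketch
    below. mount_new_instance() is a hypothetical helper name; the mount
    flags are left to the caller, and MS_REC is not passed, since the
    command above does not ask for recursion.

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mount.h>

/*
** Mount a new instance of a virtual filesystem ("proc" or "sysfs")
** on target, as "mount -t <fstype> <fstype> <target>" would. The
** type name is reused as the source argument, matching the command.
** Returns 0 on success, -1 with errno set on failure.
*/
int mount_new_instance(const char *fstype, const char *target, unsigned long flags)
{
    if (mount(fstype, target, fstype, flags, NULL) != 0) {
        int saved = errno; /* fprintf() may overwrite errno */

        fprintf(stderr, "mount -t %s %s %s: %s\n",
                fstype, fstype, target, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

    In a program like the demo, the child would call
    mount_new_instance("proc", "proc", 0) after pivot_root(), and could
    do the same with "sysfs" on a "sys" directory created beforehand.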
