2.6.35: unshare(NEWNS) does not work inside a container anymore?

Thu Sep 2 02:20:45 PDT 2010

31.08.2010 15:02, Michael Tokarev wrote:
> I just noticed a regression - immediately after updating
> kernel from 2.6.32 to 2.6.35 (I skipped .33 and .34).
> Namely, unshare(CLONE_NEWNS) stopped working from within
> a container, like this:
> 
> unshare(CLONE_NEWNS)              = -1 EINVAL (Invalid argument)
> 
> There's no other fancy stuff going on around, just plain
> unshare and exec a new shell.
> 
> What's wrong with 2.6.35 in this context?
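
For anyone who wants to reproduce this without strace, a minimal
sketch of the failing call could look like the following.  The
helper names try_unshare_newns() and report_unshare_newns() are made
up for illustration; only unshare(2) and CLONE_NEWNS come from the
report above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Try to detach into a private mount namespace.
 * Returns 0 on success or a negative errno value on failure. */
static int try_unshare_newns(void)
{
    if (unshare(CLONE_NEWNS) == 0)
        return 0;
    assert(errno > 0);  /* a failed syscall sets a positive errno */
    return -errno;
}

/* Print the outcome in roughly the form strace shows it
 * (hypothetical helper, output format is an assumption). */
static int report_unshare_newns(void)
{
    int rc = try_unshare_newns();

    if (rc == 0)
        printf("unshare(CLONE_NEWNS) = 0\n");
    else
        printf("unshare(CLONE_NEWNS) = -1 (%s)\n", strerror(-rc));
    return rc;
}
```

Calling report_unshare_newns() from a small main() as root inside
the container should show the EINVAL quoted above on an affected
kernel.  Without CAP_SYS_ADMIN the call fails with EPERM regardless
of this problem, so that result says nothing either way.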

So, after discussing this on IRC and doing some digging,
it turned out to be caused by a cgroup subsystem that is new
in 2.6.35 -- the block I/O controller (CONFIG_BLK_CGROUP).
It does not allow more than one level of nesting, so, for
example, it is impossible to create a subdirectory inside
another cgroup directory in cgroupfs:

 mkdir /dev/cgroup/foo  -- this one succeeds, but
 mkdir /dev/cgroup/foo/bar -- this one fails as long
as the blkio mount option is enabled.  Once it is disabled,
it works again.
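
The same two-step check can be scripted.  Below is a rough sketch,
assuming the cgroup hierarchy is mounted at some base directory
(/dev/cgroup here) and that "foo" does not exist there yet; the
function name probe_nested_mkdir() is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create <base>/foo and then <base>/foo/bar, mirroring the two
 * mkdir calls above.  Returns 0 if both succeed, or the negative
 * errno of the first mkdir that fails.  Directories created here
 * are removed again before returning. */
static int probe_nested_mkdir(const char *base)
{
    char outer[4096], inner[4096];
    int rc = 0;

    assert(base != NULL);
    if (snprintf(outer, sizeof(outer), "%s/foo", base) >= (int)sizeof(outer) ||
        snprintf(inner, sizeof(inner), "%s/foo/bar", base) >= (int)sizeof(inner))
        return -ENAMETOOLONG;

    if (mkdir(outer, 0755) != 0)
        return -errno;          /* even the first level failed */

    if (mkdir(inner, 0755) != 0)
        rc = -errno;            /* second level was refused */
    else
        rmdir(inner);

    rmdir(outer);
    return rc;
}
```

With blkio enabled on 2.6.35 I would expect
probe_nested_mkdir("/dev/cgroup") to come back with -EINVAL from
the second mkdir, and with 0 once blkio is left out of the mount
options.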

In 2.6.35, blkiocg_create() in block/blk-cgroup.c has the
following code:

  /* Currently we do not support hierarchy deeper than two level (0,1) */
  if (parent != cgroup->top_cgroup)
          return ERR_PTR(-EINVAL);

In what will become 2.6.36 it was changed to

          return ERR_PTR(-EPERM);

but the issue remains anyway.  What is problematic here
is that blkio differs from all other cgroup subsystems in
this very respect (not allowing nesting), but there's
no indication of this fact anywhere.  At the very least, the
place quoted above warrants a WARN() or WARN_ONCE()
to tell the user what's going on - otherwise it's very
difficult to debug.
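
To make the suggestion concrete, here is a user-space model of the
check with a one-time message added.  This is not kernel code and
not a patch: struct cgroup_model, model_blkio_create() and the
message text are invented for illustration, and the fprintf() only
stands in for what WARN_ONCE() would do in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's struct cgroup: only the
 * two pointers the depth check looks at. */
struct cgroup_model {
    struct cgroup_model *parent;      /* NULL for the root cgroup */
    struct cgroup_model *top_cgroup;  /* root of this hierarchy */
};

static int nesting_warnings;  /* how many times the message was printed */

/* Model of the depth check quoted above, plus a one-time message.
 * Returns 0 if the cgroup may be created, -EINVAL otherwise. */
static int model_blkio_create(const struct cgroup_model *cgroup)
{
    static bool warned;

    assert(cgroup != NULL);
    if (cgroup->parent == NULL)
        return 0;  /* the root cgroup itself (assumed always allowed) */

    if (cgroup->parent != cgroup->top_cgroup) {
        if (!warned) {  /* report only the first refusal */
            warned = true;
            nesting_warnings++;
            fprintf(stderr, "blkio: nested cgroups are not supported\n");
        }
        return -EINVAL;
    }
    return 0;
}
```

In this model a cgroup directly under the root is accepted, while
anything one level deeper is refused with -EINVAL and the message
shows up exactly once, which would already have saved quite some
debugging time.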

As for a real solution, it looks like disallowing
nesting should be done in a different way.  Maybe
allow creation of a subcontainer but reset the limits
in there and catch attempts to set them - I dunno.
Or don't clone the whole cgroup hierarchy when only
CLONE_NEWNS is requested.

The current situation is too restrictive IMHO - the blkio
controller is useful for a container system like LXC, but
currently using it means that one can't even create a
new filesystem namespace within the container.

Thanks.

/mjt