• [RFC PATCH 3/3] fs: detect that the i_rwsem has already been taken

    From Mimi Zohar@21:1/5 to Dave Chinner on Mon Oct 2 14:20:02 2017
    On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote:
    On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
    On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
    On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
    On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:

    Right, re-introducing the iint->mutex and a new i_generation field in the iint struct with a separate set of locks should work. It will be reset if the file metadata changes (eg. setxattr, chown, chmod).

    Note that the "inner lock" could possibly be omitted if the invalidation can be just a single atomic instruction.

    So particularly if invalidation could be just an atomic_inc() on the generation count, there might not need to be any inner lock at all.

    You'd have to serialize the actual measurement with the "read generation count", but that should be as simple as just doing a smp_rmb() between the "read generation count" and "do measurement on file contents".
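    The lock-free invalidation described above can be sketched in user-space C11. This is only an illustration under stated assumptions: struct gen_state and the helper names are hypothetical, atomic_thread_fence() stands in for the kernel's smp_rmb(), and the re-check after the measurement is an added step that the message does not spell out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-inode integrity state; not the kernel's iint. */
struct gen_state {
    atomic_uint generation;   /* bumped on every invalidating change */
};

void gen_init(struct gen_state *s)
{
    assert(s != NULL);
    atomic_init(&s->generation, 0u);
}

/* Invalidation is one atomic increment, so no inner lock is required. */
void gen_invalidate(struct gen_state *s)
{
    assert(s != NULL);
    atomic_fetch_add_explicit(&s->generation, 1u, memory_order_release);
}

/* Read the generation count, then fence so the later reads of the file
 * contents cannot be reordered before this load (the smp_rmb() role). */
unsigned gen_sample(struct gen_state *s)
{
    assert(s != NULL);
    unsigned g = atomic_load_explicit(&s->generation, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return g;
}

/* Assumption: after measuring, compare against the sample to detect an
 * invalidation that raced with the measurement. */
bool gen_unchanged(struct gen_state *s, unsigned sampled)
{
    assert(s != NULL);
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&s->generation, memory_order_relaxed) == sampled;
}
```

    A caller would take gen_sample(), hash the file contents, and trust the hash only if gen_unchanged() still returns true for that sample.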

    We already have a change counter on the inode, which is modified on
    any data or metadata write (i_version) under filesystem locks. The
    i_version counter has well defined semantics - it's required by
    NFSv4 to increment on any metadata or data change - so we should be
    able to rely on its behaviour to implement IMA as well. Filesystems
    that support i_version are marked with [SB|MS]_I_VERSION in the
    superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
    can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
    ATM).

    Recently I received a patch to replace i_version with mtime/atime.

    mtime is not guaranteed to change on data writes - the resolution of
    the filesystem timestamps may mean mtime only changes once a second regardless of the number of writes performed to that file. That's
    why NFS can't use it as a change attribute, and hence we have
    i_version....
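    The timestamp-resolution problem can be shown with a toy model. This is a sketch with hypothetical names (struct toy_inode, toy_write()), assuming a filesystem whose mtime has one-second resolution; it is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy inode: a change counter plus a coarse, one-second timestamp. */
struct toy_inode {
    uint64_t version;    /* increments on every write, like i_version */
    int64_t  mtime_sec;  /* only as precise as the clock second */
};

/* Every write bumps the counter; mtime only records the current second. */
void toy_write(struct toy_inode *ino, int64_t now_sec)
{
    assert(ino != NULL);
    ino->version++;
    ino->mtime_sec = now_sec;
}

bool changed_by_mtime(const struct toy_inode *ino, int64_t sampled_mtime)
{
    assert(ino != NULL);
    return ino->mtime_sec != sampled_mtime;
}

bool changed_by_version(const struct toy_inode *ino, uint64_t sampled_version)
{
    assert(ino != NULL);
    return ino->version != sampled_version;
}
```

    If a second write lands in the same clock second as the sample, changed_by_mtime() reports no change while changed_by_version() still catches it.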

     Now, even more recently, I received a patch that claims that
    i_version is just a performance improvement.

    Did you ask them to explain/quantify the performance improvement?

    Using i_version is a performance improvement as opposed to always
    calculating the file hash and writing the xattr.  The patch is
    intended for filesystems that don't support i_version (eg. ubifs).
     
    e.g. Using i_version on XFS slows down performance on small
    writes by 2-3% because all data writes log a version change
    rather than only logging a change when mtime updates. We take
    that penalty because NFS requires specific change attribute
    behaviour, otherwise we wouldn't have implemented it at all in
    XFS...

     For file systems that
    don't support i_version, assume that the file has changed.

    For file systems that don't support i_version, instead of assuming
    that the file has changed, we can at least use i_generation.

    I'm not sure what you mean here - the struct inode already has an
    i_generation variable. It's a lifecycle indicator used to
    discriminate between alloc/free cycles on the same inode number.
    i.e. It only changes at inode allocation time, not whenever the data
    in the inode changes...

    Sigh, my error.


    With Linus' suggested changes, I think this will work nicely.

    The IMA code should be able to sample that at measurement time and
    either fail or be retried if i_version changes during measurement.
    We can then simply make the IMA xattr write conditional on the
    i_version value being unchanged from the sample the IMA code passes
    into the filesystem once the filesystem holds all the locks it needs
    to write the xattr...

    I note that IMA already grabs the i_version in
    ima_collect_measurement(), so this shouldn't be too hard to do.
    Perhaps we don't need any new locks or counters at all, maybe just
    the ability to feed a version cookie to the set_xattr method?
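    The "version cookie" idea can be sketched as a compare-then-write. Everything here is hypothetical: struct toy_file and toy_setxattr_if_unchanged() are illustrative names, no such interface exists in the kernel's xattr API, and in a real filesystem the check would run with the filesystem's own locks held. The sketch also does not model the version bump that the xattr write itself would cause.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy file: a change counter and a slot for the security.ima value. */
struct toy_file {
    uint64_t version;        /* stands in for i_version */
    char     ima_xattr[32];  /* stands in for the stored hash */
};

/*
 * Write the xattr only if the file's version still equals the cookie
 * sampled before the hash was calculated. Returns 0 on success,
 * -EAGAIN if the file changed since the sample (the caller re-measures),
 * or -ERANGE if the value does not fit.
 */
int toy_setxattr_if_unchanged(struct toy_file *f, uint64_t cookie,
                              const char *value)
{
    assert(f != NULL && value != NULL);
    if (f->version != cookie)
        return -EAGAIN;
    size_t len = strlen(value);
    if (len >= sizeof(f->ima_xattr))
        return -ERANGE;
    memcpy(f->ima_xattr, value, len + 1);  /* copy including the NUL */
    return 0;
}
```

    A stale cookie leaves the stored value untouched, so a hash computed over old contents never overwrites the xattr.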

    The security.ima xattr is normally written out in
    ima_check_last_writer(), not in ima_collect_measurement().

    Which, if IIUC, does this to measure and update the xattr:

    ima_check_last_writer
    -> ima_update_xattr
    -> ima_collect_measurement
    -> ima_fix_xattr

    ima_collect_measurement() calculates the file hash for storing in
    the measurement list (IMA-measurement), verifying the hash/signature
    (IMA-appraisal) already stored in the xattr, and auditing (IMA-audit).

    Yup, and it samples the i_version before it calculates the hash and
    stores it in the iint, which then gets passed to ima_fix_xattr().
    Looks like all that is needed is to pass the i_version back to the
    filesystem through the xattr call....

    IOWs, sample the i_version early while we hold the inode lock and
    check the writer count, then if it is the last writer drop the inode
    lock and call ima_update_xattr(). The sampled i_version then tells
    us if the file has changed before we write the updated xattr...

    The only time that ima_collect_measurement() writes the file xattr is
    in "fix" mode.  Writing the xattr will need to be deferred until after
    the iint->mutex is released.

    ima_collect_measurement() doesn't write an xattr at all - it just
    reads the file data and calculates the hash.

    There's another call to ima_fix_xattr() from ima_appraise_measurement().

    There should be no open writers in ima_check_last_writer(), so the
    file shouldn't be changing.

    If that code is not holding the inode i_rwsem across
    ima_update_xattr(), then the writer check is racy as hell. We're
    trying to get rid of the need for this code to hold the inode lock
    to stabilise the writer count for the entire operation, and it looks
    to me like everything is there to use the i_version to ensure the
    IMA code doesn't need to hold the inode lock across
    ima_collect_measurement() and ima_fix_xattr()...

    Ok

    Mimi

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mimi Zohar@21:1/5 to Eric W. Biederman on Mon Oct 2 14:30:01 2017
    On Sun, 2017-10-01 at 22:25 -0500, Eric W. Biederman wrote:
    Mimi Zohar <zohar@linux.vnet.ibm.com> writes:

    There should be no open writers in ima_check_last_writer(), so the
    file shouldn't be changing.

    This is slightly tangential but I think important to consider.
    What do you do about distributed filesystems (fuse, nfs, etc.) that
    can change the data behind the kernel's back?

    Exactly!

    Do you not support such systems or do you have a sufficient way to
    detect changes?

    Currently, only the initial file access in policy is measured,
    verified, audited.  Even if there was a way of detecting the change,
    since we can't trust these file systems, the performance would be
    awful, but we should probably not be caching the
    measurement/verification results.

    Mimi

  • From Jeff Layton@21:1/5 to Mimi Zohar on Mon Oct 2 14:50:03 2017
    On Mon, 2017-10-02 at 08:09 -0400, Mimi Zohar wrote:
    On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote:
    On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
    On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
    On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
    On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:

    Right, re-introducing the iint->mutex and a new i_generation field
    in the iint struct with a separate set of locks should work. It will
    be reset if the file metadata changes (eg. setxattr, chown, chmod).

    Note that the "inner lock" could possibly be omitted if the
    invalidation can be just a single atomic instruction.

    So particularly if invalidation could be just an atomic_inc() on the
    generation count, there might not need to be any inner lock at all.

    You'd have to serialize the actual measurement with the "read
    generation count", but that should be as simple as just doing a
    smp_rmb() between the "read generation count" and "do measurement on
    file contents".

    We already have a change counter on the inode, which is modified on
    any data or metadata write (i_version) under filesystem locks. The
    i_version counter has well defined semantics - it's required by
    NFSv4 to increment on any metadata or data change - so we should be
    able to rely on its behaviour to implement IMA as well. Filesystems
    that support i_version are marked with [SB|MS]_I_VERSION in the
    superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
    can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
    ATM).

    Recently I received a patch to replace i_version with mtime/atime.


    I assume you're talking here about the patch I sent a few months ago.

    I specifically do _not_ want to replace i_version with the mtime/atime.
    The point there was to stop trying to use i_version on filesystems that
    don't properly implement it (which is most of them).

    The next best approximation on those filesystems is the mtime. It's not perfect, but it's better than nothing (which is what you have now on filesystems that never increment i_version on writes). IOW, it just
    added a fallback for when you can't count on the i_version changing.

    (BTW: atime is worthless here -- who cares if the thing was accessed?
    IIUC, we only care if something changed.)
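    The fallback described above, i_version where the filesystem maintains it and mtime otherwise, might look roughly like this. It is a sketch under assumptions: struct toy_inode, its has_iversion flag (standing in for an IS_I_VERSION() check) and the helper names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy inode; has_iversion models whether the filesystem really
 * increments the change counter on writes. */
struct toy_inode {
    bool     has_iversion;
    uint64_t version;
    int64_t  mtime_sec;
};

/* Both values are recorded at measurement time. */
struct change_sample {
    uint64_t version;
    int64_t  mtime_sec;
};

struct change_sample take_sample(const struct toy_inode *ino)
{
    assert(ino != NULL);
    struct change_sample s = { ino->version, ino->mtime_sec };
    return s;
}

/* Prefer the change counter; fall back to mtime as the next best
 * approximation. atime is ignored: an access is not a change. */
bool maybe_changed(const struct toy_inode *ino, const struct change_sample *s)
{
    assert(ino != NULL && s != NULL);
    if (ino->has_iversion)
        return ino->version != s->version;
    return ino->mtime_sec != s->mtime_sec;
}
```

    On the fallback path the answer is only an approximation, since writes within the same timestamp tick go unnoticed.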

    Ideally, all filesystems would implement i_version properly. In
    practice, that's a tall order as that may require on-disk changes for
    some of them. That's not always possible where cross-OS compatibility
    is necessary (e.g. FAT or NTFS).

    mtime is not guaranteed to change on data writes - the resolution of
    the filesystem timestamps may mean mtime only changes once a second
    regardless of the number of writes performed to that file. That's
    why NFS can't use it as a change attribute, and hence we have
    i_version....

    Now, even more recently, I received a patch that claims that
    i_version is just a performance improvement.

    Did you ask them to explain/quantify the performance improvement?

    Using i_version is a performance improvement as opposed to always
    calculating the file hash and writing the xattr. The patch is
    intended for filesystems that don't support i_version (eg. ubifs).

    e.g. Using i_version on XFS slows down performance on small
    writes by 2-3% because all data writes log a version change
    rather than only logging a change when mtime updates. We take
    that penalty because NFS requires specific change attribute
    behaviour, otherwise we wouldn't have implemented it at all in
    XFS...

    For file systems that
    don't support i_version, assume that the file has changed.

    For file systems that don't support i_version, instead of assuming
    that the file has changed, we can at least use i_generation.

    I'm not sure what you mean here - the struct inode already has an
    i_generation variable. It's a lifecycle indicator used to
    discriminate between alloc/free cycles on the same inode number.
    i.e. It only changes at inode allocation time, not whenever the data
    in the inode changes...

    Sigh, my error.


    With Linus' suggested changes, I think this will work nicely.

    The IMA code should be able to sample that at measurement time and
    either fail or be retried if i_version changes during measurement.
    We can then simply make the IMA xattr write conditional on the
    i_version value being unchanged from the sample the IMA code passes
    into the filesystem once the filesystem holds all the locks it needs
    to write the xattr...

    I note that IMA already grabs the i_version in
    ima_collect_measurement(), so this shouldn't be too hard to do.
    Perhaps we don't need any new locks or counters at all, maybe just
    the ability to feed a version cookie to the set_xattr method?

    The security.ima xattr is normally written out in ima_check_last_writer(), not in ima_collect_measurement().

    Which, if IIUC, does this to measure and update the xattr:

    ima_check_last_writer
    -> ima_update_xattr
    -> ima_collect_measurement
    -> ima_fix_xattr

    ima_collect_measurement() calculates the file hash for storing in
    the measurement list (IMA-measurement), verifying the hash/signature
    (IMA-appraisal) already stored in the xattr, and auditing (IMA-audit).

    Yup, and it samples the i_version before it calculates the hash and
    stores it in the iint, which then gets passed to ima_fix_xattr().
    Looks like all that is needed is to pass the i_version back to the filesystem through the xattr call....

    IOWs, sample the i_version early while we hold the inode lock and
    check the writer count, then if it is the last writer drop the inode
    lock and call ima_update_xattr(). The sampled i_version then tells
    us if the file has changed before we write the updated xattr...

    The only time that ima_collect_measurement() writes the file xattr is
    in "fix" mode. Writing the xattr will need to be deferred until after
    the iint->mutex is released.

    ima_collect_measurement() doesn't write an xattr at all - it just
    reads the file data and calculates the hash.

    There's another call to ima_fix_xattr() from ima_appraise_measurement().

    There should be no open writers in ima_check_last_writer(), so the
    file shouldn't be changing.

    If that code is not holding the inode i_rwsem across
    ima_update_xattr(), then the writer check is racy as hell. We're
    trying to get rid of the need for this code to hold the inode lock
    to stabilise the writer count for the entire operation, and it looks
    to me like everything is there to use the i_version to ensure the
    IMA code doesn't need to hold the inode lock across
    ima_collect_measurement() and ima_fix_xattr()...

    Ok

    Mimi

    --
    Jeff Layton <jlayton@redhat.com>
