• Who/what is abusing my fileserver

    From feenberg@gmail.com@21:1/5 to All on Sat May 6 07:53:44 2017
    Usually our TrueNAS fileservers (really just FreeBSD with a GUI) perform well with

    iostat -x

    showing hundreds of megabytes per second read or written with %b (%busy,
    or utilization) at only a few percent for each disk. But every few months
    performance goes to hell: total throughput drops to only 1 or 2 MB/s, %b
    for a group of disks sits at 99% or 100%, and qlen grows from 0 or 1 to a
    dozen or 20 on some disks. CPU utilization stays very low. While this is
    happening a simple ls command can take 5 minutes. Eventually the problem
    solves itself.
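
    That seek-bound pattern (disks pegged near 100% busy while moving almost
    no data) can be spotted automatically. A minimal sketch, assuming a
    FreeBSD-style iostat -x layout with device, kr/s, kw/s, qlen, and %b
    columns (headers vary by version, so check yours first):

```python
# Sketch: flag disks that are nearly 100% busy while moving little data,
# the seek-bound signature described above. Assumes a FreeBSD-style
# "iostat -x" header row naming the columns; verify against your output.

SAMPLE = """\
                        extended device statistics
device       r/s     w/s     kr/s     kw/s qlen   %b
ada0         210       5   105000      400    1    6
ada1          95       3      900      120   18   99
"""

def parse_iostat_x(text):
    """Yield one dict per device row, mapping column name -> value."""
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "device":          # header row
            header = fields
        elif header and len(fields) == len(header):
            row = {"device": fields[0]}
            for name, val in zip(header[1:], fields[1:]):
                try:
                    row[name] = float(val)
                except ValueError:
                    row[name] = val
            yield row

def seek_bound_disks(text, busy_pct=90.0, kbps_floor=5000.0):
    """Devices with %b >= busy_pct but total throughput below kbps_floor."""
    return [row["device"] for row in parse_iostat_x(text)
            if row.get("%b", 0.0) >= busy_pct
            and row.get("kr/s", 0.0) + row.get("kw/s", 0.0) < kbps_floor]

if __name__ == "__main__":
    print(seek_bound_disks(SAMPLE))
```

    Feeding it successive iostat -x snapshots from a cron job would give a
    timestamped record of when the stalls begin and end.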

    We believe this is because a client is doing a lot of random I/O that
    keeps the heads moving for very little data transfer, and that with all
    that seeking none of the other clients get much attention. How do we
    locate that job among the many jobs from many users on many NFS clients?
    On the client computers we can find out how many bytes are transferred by
    each process, but that number is small for all jobs - the one doing random
    I/O doesn't get more bytes than the jobs doing sequential I/O, it just exercises the heads more. We need more information to contact the user
    doing random I/O and work with them to do something else.
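
    One way to find the noisiest client from the server side is to count NFS
    requests per source address rather than bytes: random I/O shows up as many
    small requests, which is exactly what per-process byte counters miss. A
    rough sketch, assuming the default text output of "tcpdump -n port 2049"
    (the exact line format varies by tcpdump version and options):

```python
# Sketch: tally NFS (port 2049) packets per client address from the text
# output of "tcpdump -n port 2049" fed on stdin, e.g.:
#   tcpdump -n -c 100000 port 2049 | python3 nfs_top_talkers.py
# Assumes lines like:
#   12:00:01.000001 IP 10.0.0.5.885 > 10.0.0.1.2049: Flags [P.], ...
# i.e. the source appears as the third whitespace-separated field.
import sys
from collections import Counter

def top_talkers(lines, server_port="2049"):
    """Count packets per client address sent to the NFS server port."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[1] != "IP" or fields[3] != ">":
            continue
        src, dst = fields[2], fields[4].rstrip(":")
        if dst.rsplit(".", 1)[-1] == server_port:
            counts[src.rsplit(".", 1)[0]] += 1   # strip the source port
    return counts

if __name__ == "__main__":
    for client, n in top_talkers(sys.stdin).most_common(10):
        print(f"{n:10d}  {client}")
```

    A client near the top of this list with a high request count but modest
    byte totals in the client-side accounting is a good candidate for the
    random-I/O job.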

    Alternatively, is there some adjustment of the server that will downgrade
    the priority of random access? That user might self-identify if his jobs
    took forever to complete.

    Daniel Feenberg
    NBER

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mark F@21:1/5 to feenberg@gmail.com on Sun May 7 11:11:35 2017
    On Sat, 6 May 2017 07:53:44 -0700 (PDT), feenberg@gmail.com wrote:

    > Usually our TrueNAS fileservers (really just FreeBSD with a GUI) perform
    > well with
    >
    > iostat -x
    >
    > showing hundreds of megabytes per second read or written with %b (%busy,
    > or utilization) at only a few percent for each disk. But every few months
    > performance goes to hell: total throughput drops to only 1 or 2 MB/s, %b
    > for a group of disks sits at 99% or 100%, and qlen grows from 0 or 1 to
    > a dozen or 20 on some disks. CPU utilization stays very low. While this
    > is happening a simple ls command can take 5 minutes. Eventually the
    > problem solves itself.

    > We believe this is because a client is doing a lot of random I/O that
    > keeps the heads moving for very little data transfer, and that with all

    It could also be error recovery on a couple of bad blocks. I don't know
    about TrueNAS, but many filesystems/operating systems don't fix ECC
    problems until there is a complete failure, and the disks themselves try
    to avoid actually rewriting the data, possibly with relocation, to fix
    the problems.

    You could scan the disks and see whether any performance problems arise.
    Save the SMART data before and after the scan to see if there is any
    evidence of excessive error correction taking place, though not all disks
    (or SSDs) report such information. You might see counts for on-the-fly
    error recovery (which will seldom be zero even when there are no real
    problems), and perhaps second- or even third-level recovery counts, even
    if the drive never goes into a full retry cycle (which can take several
    minutes).

    The SMART data may even include a count of sectors known to be bad
    but not being fixed.
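
    That before/after comparison can be automated by diffing the raw values
    from two saved "smartctl -A" dumps. A sketch, assuming the usual ATA
    attribute table layout where the raw value is the last column (attribute
    names and columns differ by drive and smartmontools version):

```python
# Sketch: report SMART attributes whose raw value grew between two saved
# "smartctl -A /dev/ada0" dumps taken before and after a full-disk scan.
# Assumes the usual ATA attribute table (ID# NAME ... RAW_VALUE last);
# attribute names and columns differ by drive and smartmontools version.

def parse_smart_attrs(text):
    """Map attribute name -> raw value (int) from smartctl -A output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():   # attribute rows
            try:
                attrs[fields[1]] = int(fields[-1])
            except ValueError:
                pass   # some drives append units or vendor-specific text
    return attrs

def grown_attrs(before_text, after_text):
    """Attributes whose raw value increased, as name -> (before, after)."""
    before = parse_smart_attrs(before_text)
    after = parse_smart_attrs(after_text)
    return {name: (before[name], val) for name, val in after.items()
            if name in before and val > before[name]}
```

    A jump in a reallocation or pending-sector count after the scan would
    point at the error-recovery explanation rather than at a noisy client.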

    > that seeking none of the other clients get much attention. How do we
    > locate that job among the many jobs from many users on many NFS clients?
    > On the client computers we can find out how many bytes are transferred
    > by each process, but that number is small for all jobs - the one doing
    > random I/O doesn't get more bytes than the jobs doing sequential I/O, it
    > just exercises the heads more. We need more information to contact the
    > user doing random I/O and work with them to do something else.

    > Alternatively, is there some adjustment of the server that will downgrade
    > the priority of random access? That user might self-identify if his jobs
    > took forever to complete.

    > Daniel Feenberg
    > NBER

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)