• Multithreaded disk access

    From Don Y@21:1/5 to All on Fri Oct 15 07:08:23 2021
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth? 90%? 110%? etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer. And, filesystem code also throws
    a wrench in the works. Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

  • From Richard Damon@21:1/5 to Don Y on Fri Oct 15 11:38:58 2021
    On 10/15/21 10:08 AM, Don Y wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth?  90%?  110?  etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer.  And, filesystem code also throws
    a wrench in the works.  Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    You know that you can't actually get data off the media faster than the fundamental data rate of the media.

    As you mention, cache can give an apparent rate faster than the media,
    but you seem to be willing to assume that caching doesn't affect your
    rate, and each chunk will only be returned once.

    Pathological access patterns can reduce this rate dramatically, and the
    worst case can result in rates of only a few percent of it, if you force
    significant seeks between each sector read (and overload the buffering so
    it can't hold larger reads for a given stream).

    Non-Pathological access can often result in near 100% of the access rate.

    The best test of whether you are I/O bound: if the I/O system is constantly
    in use, and every I/O request has another pending when it finishes, then
    you are totally I/O bound.
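
    As a minimal sketch of that test -- assuming a Linux host purely for
    illustration ("sda" and the /proc/diskstats field layout are assumptions,
    not anything established in this thread) -- you can watch how many
    milliseconds per second the device spends with at least one request in
    flight:

        /* Sample /proc/diskstats; the 10th stat field after the device name is
         * the cumulative ms the device spent with I/O in flight.  If it advances
         * by ~1000 ms per wall-clock second, the disk always had work pending. */
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        static long io_ticks_ms(const char *dev)
        {
            FILE *f = fopen("/proc/diskstats", "r");
            char name[64];
            long s[10];
            int maj, min;

            if (!f) return -1;
            while (fscanf(f, "%d %d %63s %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld",
                          &maj, &min, name, &s[0], &s[1], &s[2], &s[3], &s[4],
                          &s[5], &s[6], &s[7], &s[8], &s[9]) == 13) {
                if (strcmp(name, dev) == 0) { fclose(f); return s[9]; }
                fscanf(f, "%*[^\n]");            /* skip any trailing fields */
            }
            fclose(f);
            return -1;
        }

        int main(void)
        {
            const char *dev = "sda";             /* adjust for the drive under test */
            for (;;) {
                long before = io_ticks_ms(dev);
                sleep(1);
                long after = io_ticks_ms(dev);
                if (before < 0 || after < 0) return 1;
                printf("%s busy: %ld%%\n", dev, (after - before) / 10);
            }
        }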

  • From Don Y@21:1/5 to Richard Damon on Fri Oct 15 09:00:11 2021
    On 10/15/2021 8:38 AM, Richard Damon wrote:
    You know that you can't actually get data off the media faster than the fundamental data rate of the media.

    Yes, but you don't know that rate *and* that rate varies based on
    "where" you're accesses land on the physical medium (e.g., ZDR,
    shingled drives, etc.)

    As you mention, cache can give an apparent rate faster than the media, but you
    seem to be willing to assume that caching doesn't affect your rate, and each chunk will only be returned once.

    Cache in the filesystem code will be counterproductive. Cache in
    the drive may be a win for some accesses and a loss for others
    (e.g., if the drive read ahead thinking the next read was going to
    be sequential with the last -- and that proves to be wrong -- the
    drive may have missed an opportunity to respond more quickly to the
    ACTUAL access that follows).

    [I'm avoiding talking about reads AND writes just to keep the
    discussion complexity manageable -- to avoid having to introduce
    caveats with every statement]

    Pathological access patterns can reduce this rate dramatically, and worse case
    can result in rates of only a few percent of this factor if you force significant seeks between each sector read (and overload the buffering so it can't hold larger reads for a given stream).

    Exactly. But, you don't necessarily know where your next access will
    take you. This variation in throughput is what makes defining
    "I/O bound" tricky; if the access patterns at some instant (instant
    being the period over which you base your decision) make the drive
    look slow, then you would opt NOT to spawn a new thread -- even though
    there may actually be excess throughput to exploit. Similarly, if the
    drive "looks" serendipitously fast, you may spawn another thread, and its
    accesses will eventually conflict with those of the first thread,
    lowering overall throughput.

    Non-Pathological access can often result in near 100% of the access rate.

    The best test of if you are I/O bound is if the I/O system is constantly in use, and every I/O request has another pending when it finishes, then you are totally I/O bound.

    But, if you make that assessment when the access pattern is "unfortunate",
    you erroneously conclude the disk is at its capacity. And, vice versa.

    Without control over the access patterns, it seems like there is no
    reliable strategy for determining when another thread can be
    advantageous (?)

  • From Dimiter_Popoff@21:1/5 to Don Y on Fri Oct 15 18:46:53 2021
    On 10/15/2021 17:08, Don Y wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth?  90%?  110?  etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer.  And, filesystem code also throws
    a wrench in the works.  Assume caching in the system
    is disabled/ineffective.]

    If caching is disabled things can get really bad quite quickly:
    think of updating directory entries to reflect modification/access
    dates, file sizes, scattering, etc.; think also of allocation
    table accesses. E.g., in dps on a larger disk partition
    (say >100 gigabytes) the first CAT (cluster allocation table)
    access after boot takes some noticeable time, a second maybe;
    then it stops being noticeable at all, as the CAT is updated
    rarely and on a modified-area basis only (this on a processor
    capable of 20 Mbytes/second; dps needs the entire CAT to allocate
    new space in order to do its (enhanced) worst-fit scheme).
    IOW, if you torture the disk with constant seeks and scattered
    accesses you can slow it down anywhere from somewhat to a lot; it
    depends on way too many factors to be worth wondering about.


    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Just try it out for some time and make your pick. Recently
    I did the dfs (distributed file system, over TCP) for dps
    and had to watch much of this going on; at some point you
    reach something between 50 and 100% of the hardware limit,
    depending on the file sizes you copy and whatever other
    overhead you can think of.

    Dimiter

    ======================================================
    Dimiter Popoff, TGI http://www.tgi-sci.com
    ======================================================
    http://www.flickr.com/photos/didi_tgi/

  • From Don Y@21:1/5 to All on Fri Oct 15 09:19:25 2021
    On 10/15/2021 8:46 AM, Dimiter_Popoff wrote:
    On 10/15/2021 17:08, Don Y wrote:

    If caching is disabled things can get really bad quite quickly,
    think on updating directory entries to reflect modification/access
    dates, file sizes, scattering etc., think also allocation
    table accesses etc.

    My point re: filesystem cache (not the on-board disk cache) was that
    the user objects accessed will only be visited once. So, no value
    to caching *them* in the filesystem's buffers.

    E.g. in dps on a larger disk partition
    (say >100 gigabytes) the first CAT (cluster allocation table)
    access after boot takes some noticeable time, a second maybe;
    then it stops being noticeable at all as the CAT is updated
    rarely and on a modified area basis only (this on a a processor
    capable of 20 Mbytes/second) (dps needs the entire CAT to allocate
    new space in order to do its (enhanced) worst fit scheme).
    IOW if you torture the disk with constant seeks and scattered
    accesses you can slow it down from somewhat to a lot, depends
    on way too many factors to be worth wondering about.

    I'm trying NOT to be aware of any particulars of the specific
    filesystem *type* (FAT*, NTFS, *BSD, etc.) and make decisions
    just from high level observations of disk performance.

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Just try it out for some time and make your pick. Recently
    I did that dfs (distributed file system, over tcp) for dps
    and had to watch much of this going on, at some point you
    reach something between 50 and 100% of the hardware limit,
    depending on file sizes you copy and who knows what else
    overhead you can think of.

    I think the cost of any extra complexity in the algorithm
    (to dynamically try to optimize number of threads) is
    hard to justify -- given no control over the actual
    media. I.e., it seems like it's best to just aim for
    "simple" and live with whatever throughput you get...

  • From Dimiter_Popoff@21:1/5 to Don Y on Fri Oct 15 19:28:49 2021
    On 10/15/2021 19:19, Don Y wrote:
    On 10/15/2021 8:46 AM, Dimiter_Popoff wrote:
    ....
    Just try it out for some time and make your pick. Recently
    I did that dfs (distributed file system, over tcp) for dps
    and had to watch much of this going on, at some point you
    reach something between 50 and 100% of the hardware limit,
    depending on file sizes you copy and who knows what else
    overhead you can think of.

    I think the cost of any extra complexity in the algorithm
    (to dynamically try to optimize number of threads) is
    hard to justify -- given no control over the actual
    media.  I.e., it seems like it's best to just aim for
    "simple" and live with whatever throughput you get...

    I meant going the simplest way, not adding algorithms.
    Just leave it for now and have a few systems running,
    look at what is going on and pick some sane figure,
    perhaps try it out either way before you settle.

  • From Richard Damon@21:1/5 to Don Y on Fri Oct 15 12:48:48 2021
    On 10/15/21 12:00 PM, Don Y wrote:
    On 10/15/2021 8:38 AM, Richard Damon wrote:
    You know that you can't actually get data off the media faster than
    the fundamental data rate of the media.

    Yes, but you don't know that rate *and* that rate varies based on
    "where" you're accesses land on the physical medium (e.g., ZDR,
    shingled drives, etc.)

    But all of these still have a 'maximum' rate, so you can still define a
    maximum. It does mean that the 'expected' rate you can get is more
    variable.


    As you mention, cache can give an apparent rate faster than the media,
    but you seem to be willing to assume that caching doesn't affect your
    rate, and each chunk will only be returned once.

    Cache in the filesystem code will be counterproductive.  Cache in
    the drive may be a win for some accesses and a loss for others
    (e.g., if the drive read ahead thinking the next read was going to
    be sequential with the last -- and that proves to be wrong -- the
    drive may have missed an opportunity to respond more quickly to the
    ACTUAL access that follows).

    [I'm avoiding talking about reads AND writes just to keep the
    discussion complexity manageable -- to avoid having to introduce
    caveats with every statement]


    Yes, the drive might try to read ahead and hurt itself, or it might not.
    That is mostly out of your control.

    Pathological access patterns can reduce this rate dramatically, and
    worse case can result in rates of only a few percent of this factor if
    you force significant seeks between each sector read (and overload the
    buffering so it can't hold larger reads for a given stream).

    Exactly.  But, you don't necessarily know where your next access will
    take you.  This variation in throughput is what makes defining
    "i/o bound" tricky;  if the access patterns at some instant (instant
    being a period over which you base your decision) make the drive
    look slow, then you would opt NOT to spawn a new thread to take
    advantage of excess throughput.  Similarly, if the drive "looks" serendipitously fast, you may spawn another thread and its
    accesses will eventually conflict with those of the first thread
    to lower overall throughput.




    Non-Pathological access can often result in near 100% of the access rate.

    The best test of if you are I/O bound is if the I/O system is
    constantly in use, and every I/O request has another pending when it
    finishes, then you are totally I/O bound.

    But, if you make that assessment when the access pattern is "unfortunate", you erroneously conclude the disk is at its capacity.  And, vice versa.

    Without control over the access patterns, it seems like there is no
    reliable strategy for determining when another thread can be
    advantageous (?)

    Yes, adding more threads might change the access pattern. It will TEND
    to make the pattern less sequential, and thus more towards that
    pathological case (and thus more threads actually decrease the rate at
    which you can do I/O, and thus slow down your I/O-bound rate). It is
    possible that it just happens to be fortunate and makes things more
    sequential: if the system can see that one thread wants sector N and
    another wants sector N+1, something can schedule the reads together and
    drop a seek.

    Predicting that sort of behavior can't be done 'in the abstract'. You
    need to think about the details of the system.

    As a general principle, if the I/O system is saturated, the job is I/O
    bound. Adding more threads will only help if you have the resources to
    queue up more requests and can optimize the order of servicing them to
    be more efficient with I/O. Predicting that means you need to know and
    have some control over the access pattern.

    Note, part of this is being able to trade memory to improve I/O speed.
    If you know that EVENTUALLY you will want the next sector after the one
    you are reading, reading that now and caching it will be a win, but only
    if you will be able to use that data before you need to claim that
    memory for other uses. This sort of improvement really does require
    knowing details that you are trying to assume you don't know, so you
    are limiting your ability to make accurate decisions.
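
    As a toy illustration of the "optimize the order of servicing them"
    point (nothing here is from a real scheduler; the request sizes and
    positions are made up), sorting whatever happens to be queued into one
    elevator sweep is the kind of reordering that lets extra queued requests
    pay for themselves:

        /* Sort pending requests by starting block so the head sweeps across
         * the disk once instead of seeking back and forth between streams.
         * Real OS schedulers and the drive's own queueing are far more
         * elaborate; this only shows the idea. */
        #include <stdio.h>
        #include <stdlib.h>

        struct req {
            unsigned long lba;      /* starting block of the request */
            unsigned long nblocks;  /* length of the request         */
            int stream;             /* which thread queued it        */
        };

        static int by_lba(const void *a, const void *b)
        {
            const struct req *x = a, *y = b;
            return (x->lba > y->lba) - (x->lba < y->lba);
        }

        int main(void)
        {
            struct req q[] = {      /* three interleaved streams */
                { 900000, 128, 1 }, {   1024, 128, 2 }, { 450000, 128, 3 },
                { 901024, 128, 1 }, {   2048, 128, 2 }, { 451024, 128, 3 },
            };
            size_t n = sizeof q / sizeof q[0];

            qsort(q, n, sizeof q[0], by_lba);   /* one elevator sweep */

            for (size_t i = 0; i < n; i++)
                printf("issue LBA %8lu (%lu blocks) for stream %d\n",
                       q[i].lba, q[i].nblocks, q[i].stream);
            return 0;
        }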

  • From Don Y@21:1/5 to Richard Damon on Fri Oct 15 11:40:25 2021
    On 10/15/2021 9:48 AM, Richard Damon wrote:
    On 10/15/21 12:00 PM, Don Y wrote:
    On 10/15/2021 8:38 AM, Richard Damon wrote:
    You know that you can't actually get data off the media faster than the
    fundamental data rate of the media.

    Yes, but you don't know that rate *and* that rate varies based on
    "where" you're accesses land on the physical medium (e.g., ZDR,
    shingled drives, etc.)

    But all of these still have a 'maximum' rate, so you can still define a maximum. It does say that the 'expected' rate you can get gets more variable.

    But only if you have control over the hardware.

    How long will a "backup" take on your PC? Today?
    Tomorrow? Last week?

    If you removed the disk and put it in another PC,
    how would those figures change?

    If you restore (using file access and not sector access),
    and then backup again, how will the numbers change?

    As you mention, cache can give an apparent rate faster than the media, but
    you seem to be willing to assume that caching doesn't affect your rate, and
    each chunk will only be returned once.

    Cache in the filesystem code will be counterproductive. Cache in
    the drive may be a win for some accesses and a loss for others
    (e.g., if the drive read ahead thinking the next read was going to
    be sequential with the last -- and that proves to be wrong -- the
    drive may have missed an opportunity to respond more quickly to the
    ACTUAL access that follows).

    [I'm avoiding talking about reads AND writes just to keep the
    discussion complexity manageable -- to avoid having to introduce
    caveats with every statement]

    Yes, the drive might try to read ahead and hurt itself, or it might not. That is mostly out of your control.

    Exactly. So, I can't do anything other than OBSERVE the performance
    I am getting.

    Non-Pathological access can often result in near 100% of the access rate.

    The best test of if you are I/O bound is if the I/O system is constantly in
    use, and every I/O request has another pending when it finishes, then you
    are totally I/O bound.

    But, if you make that assessment when the access pattern is "unfortunate",
    you erroneously conclude the disk is at its capacity. And, vice versa.

    Without control over the access patterns, it seems like there is no
    reliable strategy for determining when another thread can be
    advantageous (?)

    Yes, adding more threads might change the access pattern. It will TEND to make
    the pattern less sequential, and thus more towards that pathological case (and
    thus more threads actually decrease the rate at which you can do I/O, and thus
    slow down your I/O-bound rate). It is possible that it just happens to be
    fortunate and makes things more sequential: if the system can see that one
    thread wants sector N and another wants sector N+1, something can schedule the
    reads together and drop a seek.

    The point of additional threads is that another thread can schedule
    the next access while the processor is busy processing the previous
    one. So, the I/O is always kept busy instead of letting it idle
    between accesses.

    Predicting that sort of behavior can't be done 'in the abstract'. You need to think about the details of the system.

    As a general principle, if the I/O system is saturated, the job is I/O bound.

    The goal is to *ensure* the I/O system is completely saturated.

    Adding more threads will only help if you have the resources to queue up more requests and can optimize the order of servicing them to be more efficient with

    Ordering them is an optimization that requires knowledge of how they
    will interact *in* the drive. However, simply having ANOTHER request
    ready as soon as the previous one is completed (neglecting the
    potential for the drive to queue requests internally) is an
    enhancement to throughput.

    I/O. Predicting that means you need to know and have some control over the access pattern.

    Note, part of this is being able to trade memory to improve I/O speed. If you know that EVENTUALLY you will want the next sector after the one you are reading, reading that now and caching it will be a win, but only if you will be
    able to use that data before you need to claim that memory for other uses. This
    sort of improvement really does require knowing details you want to try to assume you don't want to know, so you are limiting your ability to make accurate decisions.

    Moving the code to another platform (something that the user can do
    in a heartbeat) will invalidate any assumptions I have made about the performance on my original platform. Hence the desire to have the
    *code* sort out what it *can* do to increase performance by observing
    its actual performance on TODAY'S actual hardware.
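
    As a minimal sketch of the "another request always ready" idea above
    (the file name, chunk size, and crunch step are placeholders; two
    workers is just an example, not a recommendation): several identical
    workers pull the next chunk index from a shared counter, so while one
    is busy chewing on its data, the other's read is already queued at the
    drive.

        /* Each worker grabs the next chunk number atomically, reads it, and
         * processes it; the disk therefore sees a new request as soon as it
         * finishes the previous one, without any explicit coordination. */
        #include <fcntl.h>
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdlib.h>
        #include <unistd.h>

        #define CHUNK   (1024 * 1024)   /* 1 MiB per request (illustrative) */
        #define NCHUNKS 1024            /* how much of the file to walk     */

        static int fd;
        static atomic_long next_chunk;

        static void crunch(const char *buf, size_t len) { (void)buf; (void)len; }

        static void *worker(void *arg)
        {
            char *buf = malloc(CHUNK);
            long  i;

            (void)arg;
            if (!buf) return NULL;
            while ((i = atomic_fetch_add(&next_chunk, 1)) < NCHUNKS) {
                ssize_t got = pread(fd, buf, CHUNK, (off_t)i * CHUNK);
                if (got <= 0) break;      /* EOF or error: this worker is done   */
                crunch(buf, (size_t)got); /* meanwhile the disk serves the other */
            }
            free(buf);
            return NULL;
        }

        int main(void)
        {
            pthread_t t[2];               /* two identical workers, as an example */

            fd = open("testfile.dat", O_RDONLY);   /* placeholder path */
            if (fd < 0) return 1;
            for (int k = 0; k < 2; k++) pthread_create(&t[k], NULL, worker, NULL);
            for (int k = 0; k < 2; k++) pthread_join(t[k], NULL);
            close(fd);
            return 0;
        }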

  • From Brett@21:1/5 to Don Y on Sun Oct 17 20:27:17 2021
    Don Y <blockedofcourse@foo.invalid> wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth? 90%? 110? etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer. And, filesystem code also throws
    a wrench in the works. Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125
    revolutions a second; a seek takes half a revolution, and the next file
    is another half a revolution away on average, which gets you roughly 125
    files a second, depending on the performance of the drive, if my numbers
    are not too far off.
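
    A quick back-of-envelope check of that model (treating a "seek" as half a
    revolution and ignoring head movement and transfer time entirely; these
    are the numbers above, not a measurement of any particular drive):

        #include <stdio.h>

        int main(void)
        {
            double rpm         = 7500.0;
            double rev_per_s   = rpm / 60.0;        /* 125 revolutions/second     */
            double seek_s      = 0.5 / rev_per_s;   /* "seek" = half a revolution */
            double rot_wait_s  = 0.5 / rev_per_s;   /* half a revolution to the   */
                                                    /* start of the next file     */
            double files_per_s = 1.0 / (seek_s + rot_wait_s);

            printf("%.0f rev/s -> roughly %.0f small files/second\n",
                   rev_per_s, files_per_s);         /* prints 125 */
            return 0;
        }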

    This is plenty to support a dozen Windows VMs on average, if it were not
    for Windows updates that saturate the disks with hundreds of little file
    updates at once, causing Microsoft SQL timeouts for the VMs.

  • From Don Y@21:1/5 to Brett on Sun Oct 17 14:49:52 2021
    On 10/17/2021 1:27 PM, Brett wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth? 90%? 110? etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer. And, filesystem code also throws
    a wrench in the works. Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Roughly speaking a drive spinning at 7500 rpm divided by 60 Is 125 revolutions a second and a seek takes half a revolution and the next file
    is another half a revolution away on average, which gets you 125 files a second roughly speaking depending on the performance of the drive if my numbers are not too far off.

    You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".

    You're also assuming moving to another track (seek time) is instantaneous
    (or, within the half-cylinder rotational delay).

    For a 7200 rpm (some are as slow as 5400, some as fast as 15K) drive,
    AVERAGE rotational delay is 8.3+ ms/2 = ~4ms.

    But, seek time can be 10, 15, + ms. (On my enterprise drives, it's 4; but,
    average rotational delay is 2.) And, if the desired sector lies on a
    "distant" cylinder, you can scale that almost linearly.

    I.e., looking at the disk's specs is largely useless unless you know how
    the data on it is laid out. The only way to know that is to *look* at it.
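
    To put some (illustrative, not measured) numbers on how wide that spread
    can be -- charging each access the average rotational delay above plus a
    few representative seek times:

        #include <stdio.h>

        int main(void)
        {
            double rot_wait_ms = 8.33 / 2.0;               /* 7200 rpm: ~4.2 ms avg */
            double seek_ms[]   = { 0.0, 4.0, 10.0, 15.0 }; /* none .. long seek     */

            for (int i = 0; i < 4; i++) {
                double per_access_ms = seek_ms[i] + rot_wait_ms;
                printf("seek %5.1f ms -> %6.1f random accesses/second\n",
                       seek_ms[i], 1000.0 / per_access_ms);
            }
            return 0;      /* roughly 240, 120, 70, 50 accesses/second */
        }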

    But, looking at part of the data doesn't mean you can extrapolate to ALL
    of the data. So, I'm back to my assumption that you can't really alter
    your approach -- with any degree of predictable success -- beforehand.
    E.g., I can keep spawning threads until I find them queuing (more than
    one deep) on the disk driver. But, even then, a moment from now, the
    backlog can clear. Or, it can get worse (which means I've wasted
    the resources that the threads consume AND added complexity to the
    algorithm with no direct benefit)
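
    A minimal sketch of that probe (only the bookkeeping is shown; the 5 ms
    stand-in for the access time and the sampling window are invented for
    illustration): each worker bumps a counter around its disk request, and a
    monitor watches how deep the backlog ever gets before deciding whether one
    more thread could help.

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <unistd.h>

        static atomic_int  waiting;     /* requests outstanding at the driver */
        static atomic_bool done;

        static void timed_read(void)    /* wrap every disk request like this */
        {
            atomic_fetch_add(&waiting, 1);
            /* ... issue the real read here ... */
            usleep(5000);               /* stand-in for the access time */
            atomic_fetch_sub(&waiting, 1);
        }

        static void *worker(void *arg)
        {
            (void)arg;
            while (!atomic_load(&done))
                timed_read();
            return NULL;
        }

        /* If the backlog was ever more than one deep during the window, the disk
         * already has work stacked up and another thread is just clutter --
         * remembering the caveat that a moment later the verdict may be wrong. */
        static bool another_thread_useful(int window_ms)
        {
            int deepest = 0;
            for (int t = 0; t < window_ms; t++) {
                int now = atomic_load(&waiting);
                if (now > deepest) deepest = now;
                usleep(1000);
            }
            printf("deepest backlog seen: %d\n", deepest);
            return deepest <= 1;
        }

        int main(void)
        {
            pthread_t t[2];
            for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, NULL);

            bool spawn_more = another_thread_useful(200);
            printf("spawn another worker? %s\n", spawn_more ? "maybe" : "no");

            atomic_store(&done, true);
            for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
            return 0;
        }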

    Below, you're also assuming Windows.

    And, for writes, shingled drives throw all of that down the toilet.

    This is plenty to support a dozen Windows VM’s on average if it were not for Windows updates that saturate the disks with hundreds of little file updates at once, causing Microsoft SQL timeouts for the VM’s.

  • From Dimiter_Popoff@21:1/5 to Don Y on Mon Oct 18 01:09:48 2021
    On 10/18/2021 0:49, Don Y wrote:
    On 10/17/2021 1:27 PM, Brett wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth?  90%?  110?  etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer.  And, filesystem code also throws
    a wrench in the works.  Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Roughly speaking a drive spinning at 7500 rpm divided by 60 Is 125
    revolutions a second and a seek takes half a revolution and the next file
    is another half a revolution away on average, which gets you 125 files a
    second roughly speaking depending on the performance of the drive if my
    numbers are not too far off.

    You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".

    This is the typical case anyway, most files are contiguously allocated.
    Even on popular filesystems which have long forgotten how to do worst
    fit allocation and have to defragment their disks not so infrequently.
    But I think they have to access at least 3 locations to get to a file;
    the directory entry, some kind of FAT-like thing, then the file.
    Unlike dps, where 2 accesses are enough. And of course dps does
    worst fit allocation, so defragmenting is just unnecessary.

    ======================================================
    Dimiter Popoff, TGI http://www.tgi-sci.com
    ======================================================
    http://www.flickr.com/photos/didi_tgi/

  • From Don Y@21:1/5 to Don Y on Sun Oct 17 15:01:08 2021
    On 10/17/2021 2:49 PM, Don Y wrote:
    But, seek time can be 10, 15, + ms. (on my enterprise drives, its 4; but, average rotational delay is 2.) And, if the desired sector lies on a "distant" cylinder, you can scale that almost linearly.

    Sorry, that should be .4 (I should discipline myself to add leading zeroes!)

  • From Clifford Heath@21:1/5 to Don Y on Mon Oct 18 11:58:49 2021
    On 18/10/21 8:49 am, Don Y wrote:
    On 10/17/2021 1:27 PM, Brett wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    As a *rough* figure, what would you expect the bandwidth of
    a disk drive (spinning rust) to do as a function of number of
    discrete files being accessed, concurrently?

    E.g., if you can monitor the rough throughput of each
    stream and sum them, will they sum to 100% of the drive's
    bandwidth?  90%?  110?  etc.

    [Note that drives have read-ahead and write caches so
    the speed of the media might not bleed through to the
    application layer.  And, filesystem code also throws
    a wrench in the works.  Assume caching in the system
    is disabled/ineffective.]

    Said another way, what's a reasonably reliable way of
    determining when you are I/O bound by the hardware
    and when more threads won't result in more performance?

    Roughly speaking a drive spinning at 7500 rpm divided by 60 Is 125
    revolutions a second and a seek takes half a revolution and the next file
    is another half a revolution away on average

    For a 7200 rpm (some are as slow as 5400, some as fast as 15K) drive,
    AVERAGE rotational delay is 8.3+ ms/2 = ~4ms.
    I.e., looking at the disk's specs is largely useless unless you know how
    the data on it is laid out.

    And, for writes, shingled drives throw all of that down the toilet.

    Modern drives do so much caching and bad-sector reassignment that your
    physical intuition isn't likely to be of much help in any case.
    Typically there will be many full-track caches and a full RTOS running
    I/O scheduling using head kinematics. It is incredibly sophisticated and unlikely to yield to any trivial rotational analysis.

    Clifford Heath

  • From Don Y@21:1/5 to All on Mon Oct 18 08:05:28 2021
    On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
    You're assuming files are laid out contiguously -- that no seeks are needed
    "between sectors".

    This is the typical case anyway, most files are contiguously allocated.

    I'm not sure that is the case for files that have been modified on a medium. Write once, read multiple MAY support that sort of approach (assuming there
    is enough contiguous space for that first write). But, if you are appending
    to a file or overwriting it, I don't know what guarantees you can expect;
    there are a LOT of different file systems out there!

    I would assume CD9660 would be the friendliest, in this regard. And, it
    is typically immutable so once the tracks are laid, they can persist.

    [IIRC, CD9660 also has a summary VToC, of sorts, so you don't have to
    seek to individual subdirectories to find things from the medium's root]

    Even on popular filesystems which have long forgotten how to do worst
    fit allocation and have to defragment their disks not so infrequently.
    But I think they have to access at least 3 locations to get to a file;
    the directory entry, some kind of FAT-like thing, then the file.
    Unlike dps, where 2 accesses are enough. And of course dps does
    worst fit allocation so defragmentating is just unnecessary.

    I think directories are cached. And, possibly entire drive structures (depending on how much physical RAM you have available).

    I still can't see an easy threading strategy that can be applied
    without more details of the target hardware/OS, filesystem layout,
    specifics of the drives/volumes, etc.

    E.g., my disk sanitizer times each (fixed size) access to profile the
    drive's performance as well as looking for trouble spots on the media.
    But, things like recal cycles or remapping bad sectors introduce
    completely unpredictable blips in the throughput. So much so that
    I've had to implement a fair bit of logic to identify whether a
    "delay" was part of normal operation *or* a sign of an exceptional
    event.
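
    The skeleton of that kind of profiling is simple enough (the device path,
    chunk size, and the 5x-the-running-average threshold below are illustrative
    choices, not the sanitizer's actual values):

        /* Time each fixed-size access and flag the ones that take far longer
         * than the running average -- candidate recal/remap "blips". */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <unistd.h>

        #define CHUNK (256 * 1024)

        static double now_ms(void)
        {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
        }

        int main(void)
        {
            int    fd  = open("/dev/sdX", O_RDONLY);   /* placeholder device */
            char  *buf = malloc(CHUNK);
            double avg_ms = 0.0;
            long   n = 0;

            if (fd < 0 || !buf) return 1;
            for (off_t off = 0; ; off += CHUNK) {
                double  t0  = now_ms();
                ssize_t got = pread(fd, buf, CHUNK, off);
                double  dt  = now_ms() - t0;
                if (got <= 0) break;                   /* end of device, or error */

                if (n > 32 && dt > 5.0 * avg_ms)       /* much slower than usual */
                    printf("suspect delay at offset %lld: %.1f ms (avg %.1f)\n",
                           (long long)off, dt, avg_ms);
                avg_ms += (dt - avg_ms) / ++n;         /* running mean */
            }
            free(buf);
            close(fd);
            return 0;
        }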

    [But, the sanitizer has a very predictable access pattern, so
    there are no filesystem/content-specific issues involved; just
    process sectors as fast as possible. (also, there is no
    need to have multiple threads per spindle; just a thread *per*
    spindle -- plus some overhead threads)

    And, the sanitizer isn't as concerned with throughput as the
    human operator is the bottleneck (I can crank out a 500+GB drive
    every few minutes).]

    I'll mock up some synthetic loads and try various thread-spawning
    strategies to see the sorts of performance I *might* be able
    to get -- with different "preexisting" media (to minimize my
    impact on that).

    I'm sure I can round up a dozen or more platforms to try -- just
    from stuff I have lying around here! :>

  • From Dimiter_Popoff@21:1/5 to Don Y on Mon Oct 18 19:25:05 2021
    On 10/18/2021 18:05, Don Y wrote:
    On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
    You're assuming files are laid out contiguously -- that no seeks are
    needed
    "between sectors".

    This is the typical case anyway, most files are contiguously allocated.

    I'm not sure that is the case for files that have been modified on a
    medium.

    It is not the case for files which are appended to -- say logs, etc.
    But files like that do not make up such a high percentage. I looked at a log
    file which logs some activities several times a day; it has grown to
    52 megabytes (ASCII text) over something like 15 months (first entry
    is June 1, 2020). It is spread over 32 segments, as one would expect.

    Then I looked at the directory where I archive emails, 311 files
    (one of them being Don.txt :-). Just one or two were in two segments;
    the rest were all contiguous.

    The devil is not that black (literally translating a Bulgarian saying)
    as you see. Worst fit allocation is of course crucial to get to
    such figures; the mainstream OSes don't do it, and things there
    must be much worse.

    ....
    Even on popular filesystems which have long forgotten how to do worst
    fit allocation and have to defragment their disks not so infrequently.
    But I think they have to access at least 3 locations to get to a file;
    the directory entry, some kind of FAT-like thing, then the file.
    Unlike dps, where 2 accesses are enough. And of course dps does
    worst fit allocation so defragmentating is just unnecessary.

    I think directories are cached.  And, possibly entire drive structures (depending on how much physical RAM you have available).

    Well of course they must be caching them, especially since there are
    gigabytes of RAM available. I know what dps does: it caches longnamed
    directories, which coexist with the old 8.4 ones in the same filesystem
    and work faster than the 8.4 ones, which typically don't get cached
    (these were done to work well even on floppies; a directory entry
    update writes back only the sector(s), if crossing, it occupies,
    etc.). Then in dps the CAT (cluster allocation tables) are cached all
    the time (do that for a 500G partition and enjoy reading all the
    4 megabytes each time the CAT is needed to allocate new space...
    it can be done; in fact the caches are enabled upon boot explicitly
    on a per LUN/partition basis).

    ...

    E.g., my disk sanitizer times each (fixed size) access to profile the
    drive's performance as well as looking for trouble spots on the media.
    But, things like recal cycles or remapping bad sectors introduce
    completely unpredictable blips in the throughput.  So much so that
    I've had to implement a fair bit of logic to identify whether a
    "delay" was part of normal operation *or* a sign of an exceptional
    event.

    [But, the sanitizer has a very predictable access pattern so
    there's no filesystem/content -specific issues involved; just
    process sectors as fast as possible.  (also, there is no
    need to have multiple threads per spindle; just a thread *per*
    spindle -- plus some overhead threads)

    And, the sanitizer isn't as concerned with throughput as the
    human operator is the bottleneck (I can crank out a 500+GB drive
    every few minutes).]

    I did something similar many years ago, when the largest drive
    a nukeman had was 200 (230 IIRC) megabytes, i.e. before
    magnetoresistive heads came to the world. It did develop
    bad sectors and did not do much internally about it (1993).
    So I wrote the "lockout" command (still available; I see I have
    recompiled it for power, last change 2016 - can't remember if it
    did anything useful, nor why I did that). It accessed, sector by
    sector, the LUN it was told to and built the lockout CAT on
    its filesystem (the LCAT being ORed to the CAT prior to the LUN being
    usable for the OS). Took quite some time on that drive but
    did the job back then.


    I'll mock up some synthetic loads and try various thread-spawning
    strategies to see the sorts of performance I *might* be able
    to get -- with different "preexisting" media (to minimize my
    impact on that).

    I'm sure I can round up a dozen or more platforms to try -- just
    from stuff I have lying around here!  :>

    I think this will give you plenty of an idea how to go about it.
    Once you know the limit you can run at some reasonable figure
    below it and be happy. Getting more precise figures about all
    that is neither easy nor will it buy you anything.


    ======================================================
    Dimiter Popoff, TGI http://www.tgi-sci.com
    ======================================================
    http://www.flickr.com/photos/didi_tgi/

  • From Don Y@21:1/5 to All on Mon Oct 18 13:46:59 2021
    On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
    The devil is not that black (literally translating a Bulgarian saying)
    as you see. Worst fit allocation is of course crucial to get to
    such figures, the mainstream OS-s don't do it and things there
    must be much worse.

    I think a lot depends on the amount of "churn" the filesystem
    experiences, in normal operation. E.g., the "system" disk on
    the workstation I'm using today has about 800G in use of 1T total.
    But, the vast majority of it is immutable -- binaries, libraries,
    etc. So, there's very low fragmentation (because I "build" the
    disk in one shot, instead of incrementally revising and
    "updating" its contents)

    By contrast, the other disks in the machine all see a fair bit of
    turnover as things get created, revised and deleted.

    Even on popular filesystems which have long forgotten how to do worst
    fit allocation and have to defragment their disks not so infrequently.
    But I think they have to access at least 3 locations to get to a file;
    the directory entry, some kind of FAT-like thing, then the file.
    Unlike dps, where 2 accesses are enough. And of course dps does
    worst fit allocation so defragmentating is just unnecessary.

    I think directories are cached. And, possibly entire drive structures
    (depending on how much physical RAM you have available).

    Well of course they must be caching them, especially since there are gigabytes of RAM available.

    Yes, but *which*? And how many (how "much")? I would assume memory
    set aside for caching files and file system structures is dynamically
    managed -- if a process looks at lots of directories but few files
    vs. a process that looks at few directories but many files...

    I know what dps does: it caches longnamed
    directories which coexist with the old 8.4 ones in the same filesystem
    and work faster than the 8.4 ones which typically don't get cached
    (these were done to work well even on floppies, a directory entry
    update writes back only the sector(s , if crossing) it occupies
    etc. Then in dps the CAT (cluster allocation tables) are cached all
    the time (do that for a 500G partition and enjoy reading all the
    4 megabytes each time the CAT is needed to allocate new space...
    it can be done, in fact the caches are enabled upon boot explicitly
    on a per LUN/partition basis).

    E.g., my disk sanitizer times each (fixed size) access to profile the
    drive's performance as well as looking for trouble spots on the media.
    But, things like recal cycles or remapping bad sectors introduce
    completely unpredictable blips in the throughput. So much so that
    I've had to implement a fair bit of logic to identify whether a
    "delay" was part of normal operation *or* a sign of an exceptional
    event.

    [But, the sanitizer has a very predictable access pattern so
    there's no filesystem/content -specific issues involved; just
    process sectors as fast as possible. (also, there is no
    need to have multiple threads per spindle; just a thread *per*
    spindle -- plus some overhead threads)

    And, the sanitizer isn't as concerned with throughput as the
    human operator is the bottleneck (I can crank out a 500+GB drive
    every few minutes).]

    I did something similar many years ago, wneh the largest drive
    a nukeman had was 200 (230 IIRC) megabytes, i.e. in prior to
    magnetoresistive heads came to the world. It did develop
    bad sectors and did not do much internally about it (1993).
    So I wrote the "lockout" command (still available, I see I have
    recompiled it for power, last change 2016 - can't remember if it
    did anything useful, nor why I did that). It accessed sector by
    sector the LUN it was told to and built the lockout CAT on
    its filesystem (LCAT being ORed to CAT prior to the LUN being
    usable for the OS). Took quite some time on that drive but
    did the job back then.

    Yes, support for "bad block management" can be done outside the
    drive. In the sanitizer's case, it has to report on whether or not
    it was able to successfully "scrub" every PHYSICAL sector that
    might contain user data (for some "fussy" users).

    So, if it appears that a sector may have been remapped (visible
    by a drop in instantaneous access rate), I query the drive's
    bad sector statistics to see if I should just abort the process
    now and mark the drive to be (physically) shredded -- *if*
    it belongs to a "fussy" user. Regardless (for fussy users),
    I will query those stats at the end of the operation to see
    if they have changed during the process.

    But, again, that's a very specific application with different
    prospects for optimization. E.g., there's no file system
    support required as the disk is just a bunch of data blocks
    (sectors) having no particular structure nor meaning. (So,
    no need for filesystem code at all! One can scrub a SPARC
    disk just as easily as a Mac!)

    I'll mock up some synthetic loads and try various thread-spawning
    strategies to see the sorts of performance I *might* be able
    to get -- with different "preexisting" media (to minimize my
    impact on that).

    I'm sure I can round up a dozen or more platforms to try -- just
    from stuff I have lying around here! :>

    I think this will give you plenty of an idea how to go about it.
    Once you know the limit you can run at some reasonable figure
    below it and be happy. Getting more precise figures about all
    that is neither easy nor will it buy you anything.

    I suspect "1" is going to end up as the "best compromise". So,
    I'm treating this as an exercise in *validating* that assumption.
    I'll see if I can salvage some of the performance monitoring code
    from the sanitizer to give me details from which I might be able
    to ferret out "opportunities". If I start by restricting my
    observations to non-destructive synthetic loads, then I can
    pull a drive and see how it fares in a different host while
    running the same code, etc.

  • From Dimiter_Popoff@21:1/5 to Don Y on Tue Oct 19 00:46:15 2021
    On 10/18/2021 23:46, Don Y wrote:
    On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
    The devil is not that black (literally translating a Bulgarian saying)
    as you see. Worst fit allocation is of course crucial to get to
    such figures, the mainstream OS-s don't do it and things there
    must be much worse.

    I think a lot depends on the amount of "churn" the filesystem
    experiences, in normal operation.  E.g., the "system" disk on
    the workstation I'm using today has about 800G in use of 1T total.
    But, the vast majority of it is immutable -- binaries, libraries,
    etc.  So, there's very low fragmentation (because I "build" the
    disk in one shot, instead of incrementally revising and
    "updating" its contents)

    Typically so on most systems, I expect.


    By contrast, the other disks in the machine all see a fair bit of
    turnover as things get created, revised and deleted.

    Now this is where the worst fit allocation strategy becomes the game
    changer. A newly created file is almost certainly contiguously
    allocated; fragmentation occurs when it is appended to (and the
    space past its last block has been already allocated in the meantime).
    I think I saw somewhere (never really got interested) that mainstream
    operating systems of today do just first fit -- which means once you
    delete a file, no matter how small, its space will be allocated as
    part of the next request, etc. No idea why they would do it (if they
    do so; my memory on that is not very certain) in such a primitive
    manner, but here they are.

    ======================================================
    Dimiter Popoff, TGI http://www.tgi-sci.com
    ======================================================
    http://www.flickr.com/photos/didi_tgi/

  • From Don Y@21:1/5 to All on Mon Oct 18 16:17:13 2021
    On 10/18/2021 2:46 PM, Dimiter_Popoff wrote:
    By contrast, the other disks in the machine all see a fair bit of
    turnover as things get created, revised and deleted.

    Now this is where the worst fit allocation strategy becomes the game
    changer. A newly created file is almost certainly contiguously
    allocated; fragmentation occurs when it is appended to (and the
    space past its last block has been already allocated in the meantime).
    I think I saw somewhere (never really got interested) that mainstream operating systems of today do just first fit

    Dunno. There are *lots* of different "file systems". And, as most
    systems use the filesystem as their global namespace, the term is
    sadly overloaded (in ways that have nothing to do with storage media).

    I don't use large secondary stores in my designs. Where "big storage"
    is required, it is accessed remotely (network) so the problem of
    dealing with it is not mine. E.g., in my current project, there is
    no file system exposed; any persistent storage is done via "tables"
    that are created, as needed (by the system and its apps) in a remote
    database server.

    [This lets me delegate the issue of data retention to a specific box.
    And, also lets me impart structure to the data that is stored. So,
    for example, I can find the most recent firmware image for a particular hardware module by just issuing a query and waiting on the result
    instead of having to parse a bunch of names, somewhere in a filesystem/namespace, looking for the one that has the highest
    rev level IN its name: firmware_1.0.1, firmware_1.0.2, firmware_5.3,
    firmware 6.1 (typo! the '_' was omitted -- will the parse algorithm
    choke on that omission??)]

    - which means once you
    delete a file, no matter how small, its space will be allocated as
    part of the next request etc., no idea why they do it (if they do so,
    my memory on that is not very certain) in such a primitive manner
    but here they are.

  • From Don Y@21:1/5 to Don Y on Mon Oct 25 04:19:34 2021
    On 10/18/2021 1:46 PM, Don Y wrote:
    I think this will give you plenty of an idea how to go about it.
    Once you know the limit you can run at some reasonable figure
    below it and be happy. Getting more precise figures about all
    that is neither easy nor will it buy you anything.

    I suspect "1" is going to end up as the "best compromise". So,
    I'm treating this as an exercise in *validating* that assumption.
    I'll see if I can salvage some of the performance monitoring code
    from the sanitizer to give me details from which I might be able
    to ferret out "opportunities". If I start by restricting my
    observations to non-destructive synthetic loads, then I can
    pull a drive and see how it fares in a different host while
    running the same code, etc.

    Actually, '2' turns out to be marginally better than '1'.
    Beyond that, it's hard to generalize without controlling some
    of the other variables.

    '2' wins because there is always the potential to make the
    disk busy, again, just after it satisfies the access for
    the 1st thread (which is now busy using the data, etc.)

    But, if the first thread finishes up before the second thread's
    request has been satisfied, then the presence of a THIRD thread
    would just be clutter. (i.e., the work performed correlates
    with the number of threads that can have value)

  • From Richard Damon@21:1/5 to Don Y on Mon Oct 25 20:23:25 2021
    On 10/25/21 7:19 AM, Don Y wrote:
    On 10/18/2021 1:46 PM, Don Y wrote:
    I think this will give you plenty of an idea how to go about it.
    Once you know the limit you can run at some reasonable figure
    below it and be happy. Getting more precise figures about all
    that is neither easy nor will it buy you anything.

    I suspect "1" is going to end up as the "best compromise".  So,
    I'm treating this as an exercise in *validating* that assumption.
    I'll see if I can salvage some of the performance monitoring code
    from the sanitizer to give me details from which I might be able
    to ferret out "opportunities".  If I start by restricting my
    observations to non-destructive synthetic loads, then I can
    pull a drive and see how it fares in a different host while
    running the same code, etc.

    Actually, '2' turns out to be marginally better than '1'.
    Beyond that, its hard to generalize without controlling some
    of the other variables.

    '2' wins because there is always the potential to make the
    disk busy, again, just after it satisfies the access for
    the 1st thread (which is now busy using the data, etc.)

    But, if the first thread finishes up before the second thread's
    request has been satisfied, then the presence of a THIRD thread
    would just be clutter.  (i.e., the work performed correlates
    with the number of threads that can have value)

    Actually, 2 might be slower than 1, because the new request from the
    second thread is apt to need a seek, while a single thread making all
    the calls is more apt to sequentially read much more of the disk.

    The controller, if not given a new chained command, might choose to
    automatically start reading the next sector of the cylinder, which is
    likely to be the next one asked for.

    The real optimum is likely a single process doing asynchronous requests,
    queuing up a series of requests and then distributing the data, as it
    comes in, to processing threads to do whatever crunching needs to be done.

    These threads then send the data to a single thread that does
    asynchronous writes of the results.

    You can easily tell if the input or output processes are I/O bound or
    not and use that to adjust the number of crunching threads in the middle.
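
    The shape of that pipeline, as a hedged sketch (the queue depth, the
    number of crunchers, and the stand-in "data" are all invented; the real
    asynchronous reads and writes are elided in favor of showing the two
    bounded queues, whose fullness tells you which stage is the bottleneck):

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define QDEPTH  8
        #define NCRUNCH 3       /* tune by watching which queue runs full/empty */
        #define NCHUNKS 64

        struct queue {          /* tiny bounded FIFO of pointers */
            void *item[QDEPTH];
            int head, tail, count;
            pthread_mutex_t mu;
            pthread_cond_t not_empty, not_full;
        };

        #define QUEUE_INIT { {0}, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER, \
                             PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER }
        static struct queue raw    = QUEUE_INIT;   /* reader    -> crunchers */
        static struct queue cooked = QUEUE_INIT;   /* crunchers -> writer    */

        static void q_put(struct queue *q, void *p)
        {
            pthread_mutex_lock(&q->mu);
            while (q->count == QDEPTH) pthread_cond_wait(&q->not_full, &q->mu);
            q->item[q->tail] = p; q->tail = (q->tail + 1) % QDEPTH; q->count++;
            pthread_cond_signal(&q->not_empty);
            pthread_mutex_unlock(&q->mu);
        }

        static void *q_get(struct queue *q)
        {
            pthread_mutex_lock(&q->mu);
            while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->mu);
            void *p = q->item[q->head]; q->head = (q->head + 1) % QDEPTH; q->count--;
            pthread_cond_signal(&q->not_full);
            pthread_mutex_unlock(&q->mu);
            return p;
        }

        static void *reader(void *arg)     /* stand-in for the async read loop */
        {
            (void)arg;
            for (long i = 0; i < NCHUNKS; i++) {
                long *chunk = malloc(sizeof *chunk);
                *chunk = i;                          /* pretend: a block of data */
                q_put(&raw, chunk);
            }
            for (int i = 0; i < NCRUNCH; i++)
                q_put(&raw, NULL);                   /* tell each cruncher to stop */
            return NULL;
        }

        static void *cruncher(void *arg)
        {
            (void)arg;
            for (long *chunk; (chunk = q_get(&raw)) != NULL; ) {
                *chunk *= 2;                         /* stand-in for the crunching */
                q_put(&cooked, chunk);
            }
            return NULL;
        }

        static void *writer(void *arg)     /* stand-in for the async write loop */
        {
            (void)arg;
            for (int i = 0; i < NCHUNKS; i++) {
                long *chunk = q_get(&cooked);
                printf("result %ld\n", *chunk);
                free(chunk);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t r, w, c[NCRUNCH];
            pthread_create(&r, NULL, reader, NULL);
            for (int i = 0; i < NCRUNCH; i++) pthread_create(&c[i], NULL, cruncher, NULL);
            pthread_create(&w, NULL, writer, NULL);
            pthread_join(r, NULL);
            for (int i = 0; i < NCRUNCH; i++) pthread_join(c[i], NULL);
            pthread_join(w, NULL);
            return 0;
        }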

  • From Don Y@21:1/5 to Richard Damon on Mon Oct 25 22:29:58 2021
    On 10/25/2021 5:23 PM, Richard Damon wrote:
    On 10/25/21 7:19 AM, Don Y wrote:
    On 10/18/2021 1:46 PM, Don Y wrote:
    I think this will give you plenty of an idea how to go about it.
    Once you know the limit you can run at some reasonable figure
    below it and be happy. Getting more precise figures about all
    that is neither easy nor will it buy you anything.

    I suspect "1" is going to end up as the "best compromise". So,
    I'm treating this as an exercise in *validating* that assumption.
    I'll see if I can salvage some of the performance monitoring code
    from the sanitizer to give me details from which I might be able
    to ferret out "opportunities". If I start by restricting my
    observations to non-destructive synthetic loads, then I can
    pull a drive and see how it fares in a different host while
    running the same code, etc.

    Actually, '2' turns out to be marginally better than '1'.
    Beyond that, its hard to generalize without controlling some
    of the other variables.

    '2' wins because there is always the potential to make the
    disk busy, again, just after it satisfies the access for
    the 1st thread (which is now busy using the data, etc.)

    But, if the first thread finishes up before the second thread's
    request has been satisfied, then the presence of a THIRD thread
    would just be clutter. (i.e., the work performed correlates
    with the number of threads that can have value)

    Actually, 2 might be slower than 1, because the new request from the second thread is apt to need a seek, while a single thread making all the calls is more apt to sequentially read much more of the disk.

    Second thread is another instance of first thread; same code, same data.
    So, it is just "looking ahead" -- i.e., it WILL do what the first thread
    WOULD do if the second thread hadn't gotten to it, first! The strategy
    is predicated on tightly coupled actors so they each can see what
    the other has done/is doing.

    If the layout of the object that thread 1 is processing calls for
    a seek, then thread 2 will perform that seek as if thread 1 were to
    do it "when done chewing on the past data".

    If the disk caches data, then thread 2 will reap the benefits of
    that cache AS IF it was thread 1 acting later.

    Etc.

    A second thread just hides the cost of the data processing.

    The controller, if not given a new chained command, might choose to automatically start reading the next sector of the cylinder, which could be likely the next one asked for.

    The real optimum is likely a single process doing asynchronous requests queuing
    up a series of requests, and then distributing the data as it comes in to processing threads to do what ever crunching needs to be done.

    These threads then send the data to a single thread that does asynchronous writes of the results.

    You can easily tell if the input or output processes are I/O bound or not and use that to adjust the number of crunching threads in the middle.


  • From Dimiter_Popoff@21:1/5 to antispam@math.uni.wroc.pl on Sun Oct 31 21:55:25 2021
    On 10/31/2021 21:21, antispam@math.uni.wroc.pl wrote:
    Dimiter_Popoff <dp@tgi-sci.com> wrote:

    Now this is where the worst fit allocation strategy becomes the game
    changer. A newly created file is almost certainly contiguously
    allocated; fragmentation occurs when it is appended to (and the
    space past its last block has been already allocated in the meantime).
    I think I saw somewhere (never really got interested) that mainstream
    operating systems of today do just first fit - which means once you
    delete a file, no matter how small, its space will be allocated as
    part of the next request etc., no idea why they do it (if they do so,
    my memory on that is not very certain) in such a primitive manner
    but here they are.

    There was some research and first fit turned out to be pretty good.
    But it is implemented slightly differently: there is moving pointer,
    search advances it. So, after deallocation you wait till moving
    pointer arrives to the hole. Way better than immediately filling
    hole: there is reasonable chance that holes will coalece into
    bigger free area. Another point is that you refuse allocation
    when disc is too full (say at 95% utilization). The two things
    together mean that normally fragmentation is not a problem.


    This might be somewhat better than plain first fit but it is no
    match for worst fit. They probably don't do worst fit because they
    would have huge amounts of memory to dig through, due to poor
    filesystem design (they have to go through 32 bits or more for
    each cluster).
    In DPS the CAT is one bit per cluster, which makes worst fit doable
    without any problems like that. I have enhanced it a little
    (many years ago), noticing that the partition could come to a
    point where it has two almost equally sized empty parts; now you append
    to a file, it gets some clusters from one of these; next time you
    append to it the *other* one is the largest and it takes from it... :).
    So now dps first checks if there are empty clusters to allocate
    immediately following those already allocated to the file; if none,
    then it does worst fit.
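
    A hedged sketch of that allocation policy over a one-bit-per-cluster
    table (the bitmap size and the interface are invented for illustration;
    this is not dps code): try the cluster right after the file's current
    tail first, and fall back to taking the first cluster of the *largest*
    free run.

        #include <stdbool.h>
        #include <stdio.h>

        #define NCLUSTERS 64
        static bool used[NCLUSTERS];        /* 1 bit per cluster, shown as bools */

        static int alloc_cluster(int file_tail)
        {
            /* Prefer the cluster right after the file's current tail... */
            if (file_tail >= 0 && file_tail + 1 < NCLUSTERS && !used[file_tail + 1]) {
                used[file_tail + 1] = true;            /* file stays contiguous */
                return file_tail + 1;
            }

            /* ...otherwise worst fit: take the start of the LARGEST free run,
             * leaving the biggest possible hole for future contiguous growth. */
            int best_start = -1, best_len = 0;
            for (int i = 0; i < NCLUSTERS; ) {
                if (used[i]) { i++; continue; }
                int start = i;
                while (i < NCLUSTERS && !used[i]) i++;
                if (i - start > best_len) { best_len = i - start; best_start = start; }
            }
            if (best_start < 0) return -1;             /* volume full */
            used[best_start] = true;
            return best_start;
        }

        int main(void)
        {
            for (int i = 10; i < 20; i++) used[i] = true;  /* pretend: existing files */
            for (int i = 40; i < 44; i++) used[i] = true;

            int tail = -1;                   /* grow one file, cluster by cluster */
            for (int k = 0; k < 5; k++) {
                tail = alloc_cluster(tail);
                printf("allocated cluster %d\n", tail);   /* 20, 21, 22, 23, 24 */
            }
            return 0;
        }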

  • From antispam@math.uni.wroc.pl@21:1/5 to dp@tgi-sci.com on Sun Oct 31 19:21:14 2021
    Dimiter_Popoff <dp@tgi-sci.com> wrote:

    Now this is where the worst fit allocation strategy becomes the game
    changer. A newly created file is almost certainly contiguously
    allocated; fragmentation occurs when it is appended to (and the
    space past its last block has been already allocated in the meantime).
    I think I saw somewhere (never really got interested) that mainstream operating systems of today do just first fit - which means once you
    delete a file, no matter how small, its space will be allocated as
    part of the next request etc., no idea why they do it (if they do so,
    my memory on that is not very certain) in such a primitive manner
    but here they are.

    There was some research and first fit turned out to be pretty good.
    But it is implemented slightly differently: there is a moving pointer,
    and the search advances it. So, after deallocation you wait until the
    moving pointer arrives at the hole. Way better than immediately filling
    the hole: there is a reasonable chance that holes will coalesce into a
    bigger free area. Another point is that you refuse allocation
    when the disc is too full (say at 95% utilization). The two things
    together mean that normally fragmentation is not a problem.
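
    A minimal sketch of that scheme (the block count, the placement of the 95%
    check, and the interface are invented for illustration): first fit driven
    by a roving pointer, i.e. "next fit", plus refusing allocation when the
    volume is nearly full.

        #include <stdbool.h>
        #include <stdio.h>

        #define NBLOCKS 64
        static bool used[NBLOCKS];
        static int  used_count;
        static int  rover;     /* the moving pointer: where the last search ended */

        static int alloc_block(void)
        {
            if (used_count * 100 >= NBLOCKS * 95)
                return -1;     /* refuse allocation when nearly full, so freed
                                  holes get a chance to coalesce before reuse */

            for (int scanned = 0; scanned < NBLOCKS; scanned++) {
                int i = (rover + scanned) % NBLOCKS;
                if (!used[i]) {
                    used[i] = true;
                    used_count++;
                    rover = (i + 1) % NBLOCKS;  /* keep moving; don't refill fresh holes */
                    return i;
                }
            }
            return -1;
        }

        static void free_block(int i)
        {
            used[i] = false;
            used_count--;      /* rover is NOT reset, so this hole is skipped
                                  until the pointer sweeps back around to it */
        }

        int main(void)
        {
            int a = alloc_block(), b = alloc_block(), c = alloc_block();
            free_block(b);          /* a hole opens up behind the rover...         */
            int d = alloc_block();  /* ...but the next allocation goes ahead of it */
            printf("a=%d b=%d c=%d d=%d\n", a, b, c, d);   /* a=0 b=1 c=2 d=3 */
            return 0;
        }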

    --
    Waldek Hebisch
