• Exception vs. error code

    From Don Y@21:1/5 to All on Mon May 9 20:13:02 2022
    [Apologies if you see this in several public/private forums]

    I'm trying to settle an argument among folks using my current
    codebase wrt how (atypical) error conditions are handled:
    returning an error code vs. raising an exception.

    The error code camp advocates for that approach as it allows
    the error to be handled where it arises (is detected).

    The exception camp claims these are truly "exceptional" conditions
    (that is debatable, especially if you assume "exceptional" means
    "infrequent") and should be handled out of the normal flow
    of execution.

    The error code camp claims processing the exception is more tedious
    (depends on language binding) AND easily overlooked/not-considered
    (like folks failing to test malloc() for NULL).

    The exception camp claims the OS can install "default" exception handlers
    to address those cases where the developer was lazy or ignorant. And,
    that no such remedy is possible if error codes are processed inline.

    In a perfect world, a developer would deal with ALL of the possibilities
    laid out by the API.

    But, we have to live with imperfect people... <frown>

    I can implement either case with similar effort (even a compile/run-time switch) but would prefer to just "take a stand" and be done with it...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to All on Tue May 10 17:49:02 2022
    Hi Don,


    On Mon, 9 May 2022 20:13:02 -0700, Don Y <blockedofcourse@foo.invalid>
    wrote:

    [Apologies if you see this in several public/private forums]

    I'm trying to settle an argument among folks using my current
    codebase wrt how (atypical) error conditions are handled:
    returning an error code vs. raising an exception.

    The error code camp advocates for that approach as it allows
    the error to be handled where it arises (is detected).

    The exception camp claims these are truly "exceptional" conditions
    (that is debatable, especially if you assume "exceptional" means
    "infrequent") and should be handled out of the normal flow
    of execution.

    I don't think the manner in which the error gets reported is as
    important as /separating/ error returns from data returns. IMNSHO it
    is a bad mistake for a single value sometimes to represent good data
    and sometimes to represent an error.
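
    To make that concrete, a minimal sketch in C++ of a "separated" return
    (the names and types are illustrative only, not anyone's actual API):

        // Status and payload travel in separate fields, so an error code
        // can never be mistaken for data.
        enum class Status { Ok, NotFound, PermissionDenied, HostUnavailable };

        struct LookupResult {
            Status status;   // outcome of the invocation
            int    value;    // meaningful only when status == Status::Ok
        };

        // Hypothetical operation, stubbed out for the sketch.
        LookupResult lookup(const char *name) {
            if (name == nullptr) return { Status::NotFound, 0 };
            return { Status::Ok, 42 };
        }

        int caller() {
            LookupResult r = lookup("fan_speed");
            if (r.status != Status::Ok) {
                return -1;        // handle (or dispatch on) the status
            }
            return r.value;       // only now is the value trustworthy
        }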



    The locality argument I think too often is misunderstood: it's
    definitely not a "source" problem - you can place CATCH code as close
    to or as far away from the TRY code as you wish.

    The real problem with exceptions is that most compilers turn the CATCH
    block into a separate function. Then because exceptions are expected
    to be /low probability/ events, the CATCH function by default will be
    treated as a 'cold code' path. In the best case it will be in the
    same load module but separated from the TRY code. In the worst case
    it could end up in a completely different load module.

    Typically the TRY block also is turned into a separate function - at
    least initially - and then (hopefully) it will be recognized as being
    a unique function with a single call point and be inlined again at
    some later compilation stage [at least if it satisfies the inlining
    metrics].



    With a return value, conditional error code - even if never used - is
    likely to have at least been prefetched into cache [at least with
    /inline/ error code]. Exception handling code, being Out-Of-Band by
    design, will NOT be anywhere in cache unless it was used recently.

    This being comp.realtime, it matters also how quickly exceptions can
    be recognized and dispatched. Dispatch implementation can vary by how
    many handlers are active in the call chain, and also by distances
    between TRY and CATCH blocks ... this tends not to be documented very
    well (if at all) so you need to experiment and see what your compiler
    does with various structuring.



    [Certain popular compilers have pragmas to mark code as 'hot' or
    'cold' with the intent to control prefetch. You can mark CATCH blocks
    'hot' to keep them near(er) their TRY code, or conversely, you can
    mark conditional inline error handling as 'cold' to avoid having it
    prefetched because you expect it to be rarely used.

    (I have never seen any code deliberately marked 'cold'. 8-)]
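
    For what it's worth, with gcc/clang the markings look roughly like the
    sketch below ('handle_rare_failure' and 'do_work' are made-up names;
    check your own compiler's documentation for what the optimizer actually
    does with the hints):

        // The declaration carries the 'cold' attribute; the branch hint
        // uses __builtin_expect.  gcc/clang extensions, illustrative only.
        static void handle_rare_failure(int code) __attribute__((cold));

        static void handle_rare_failure(int code) {
            (void)code;           // rarely executed; kept out of the hot path
        }

        int do_work(int status) {
            if (__builtin_expect(status != 0, 0)) {  // hint: branch is unlikely
                handle_rare_failure(status);
                return -1;
            }
            return 0;                                // the hot, expected path
        }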



    The error code camp claims processing the exception is more tedious
    (depends on language binding) AND easily overlooked/not-considered
    (like folks failing to test malloc() for NULL).

    The exception camp claims the OS can install "default" exception handlers
    to address those cases where the developer was lazy or ignorant. And,
    that no such remedy is possible if error codes are processed inline.

    Depends on the language. Some offer exceptions that /must/ be handled
    or the code won't compile.

    And some, like Scala, expect that exception handling will be used for
    general flow of control (like IF/THEN), and so they endeavor to make
    the cost of exception handling as low as possible.


    In a perfect world, a developer would deal with ALL of the possibilities
    laid out by the API.

    But, we have to live with imperfect people... <frown>

    I can implement either case with similar effort (even a compile/run-time
    switch) but would prefer to just "take a stand" and be done with it...


    The answer depends on what languages you are using and what your
    compiler(s) can do.

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Wed May 11 00:24:58 2022
    Hi George,

    Hope you are well and surviving this "global inconvenience"? :>

    On 5/10/2022 2:49 PM, George Neuner wrote:
    [Apologies if you see this in several public/private forums]

    I'm trying to settle an argument among folks using my current
    codebase wrt how (atypical) error conditions are handled:
    returning an error code vs. raising an exception.

    The error code camp advocates for that approach as it allows
    the error to be handled where it arises (is detected).

    The exception camp claims these are truly "exceptional" conditions
    (that is debatable, especially if you assume "exceptional" means
    "infrequent") and should be handled out of the normal flow
    of execution.

    I don't think the manner in which the error gets reported is as
    important as /separating/ error returns from data returns. IMNSHO it
    is a bad mistake for a single value sometimes to represent good data
    and sometimes to represent an error.

    Agreed. I return tuples/structs so the "intended result" can be
    separated from the "status of the invocation". This (hopefully)
    reinforces the *need* (?) for the developer to explicitly check
    the "status" before assuming the "result" makes sense (or that
    the operation actually completed).

    However, there is nothing that ensures the developer will actually
    *examine* the status returned. Or, that he will handle ALL of the
    potential status codes possible!

    [It would be nice if an editor template inserted a list of all
    possible status codes and waited for the developer to specify how each
    should be handled:
        (status, result) := function(arguments)
        case status {
            STATUS_1 => /* handle STATUS_1 situation */
            STATUS_2 => /* handle STATUS_2 situation */
            ...
            STATUS_n => /* handle STATUS_n situation */
        }
    but that just ensures the code is lengthy!]

    But, it precludes use of the return value directly in an expression
    (which clutters up the code). [exceptions really go a long way towards
    cleaning this up as you can "catch" the SET of exceptions at the end
    of a block instead of having to explicitly test each "status" returned.]

    OTOH, the notion of "status" is portable to all (?) language bindings
    (whereas exceptions require language support or "other mechanisms"
    to implement consistently)

    The locality argument I think too often is misunderstood: it's
    definitely not a "source" problem - you can place CATCH code as close
    to or as far away from the TRY code as you wish.

    Likewise with "status codes". But, you tend not to want to write
    much code without KNOWING that the code you've hoped to have executed
    actually DID execute (as intended). I.e., you wouldn't want to invoke
    several such functions (potentially relying on the results of earlier
    function calls) before sorting out if you've already stubbed your toe!

    There are just too many places where you can get an "unexpected" status returned, not just an "operation failed".

    And, with status codes, you need to examine EACH status code individually instead of just looking for, e.g., *any* INSUFFICIENT_PERMISSION exception
    or HOST_NOT_AVAILABLE, etc. thrown by any of the functions/methods in the
    TRY block.

    [You also know that the exception abends execution before any subsequent functions/methods are invoked; no need to *conditionally* execute function2 AFTER function1 has returned "uneventfully"]

    I.e., the code just ends up looking a lot "cleaner". More functionality in
    a given amount of paper.

    The real problem with exceptions is that most compilers turn the CATCH
    block into a separate function. Then because exceptions are expected
    to be /low probability/ events, the CATCH function by default will be
    treated as a 'cold code' path. In the best case it will be in the
    same load module but separated from the TRY code. In the worst case
    it could end up in a completely different load module.

    I think part of the problem is that there is an expectation that
    exceptions will be rare events. This leads to a certain laziness
    on the developer's part (akin to NOT expecting malloc() to fail
    "very often"). This is especially true if synthesizing those events
    (in test scaffolding) is difficult or not repeatable.

    But, there are cases where exceptions can be as common as "normal
    execution". Part of the problem is defining WHAT is an "exception"
    and what is a "failure" (bad choices of words) return.

    [Is "Name not found" an error? Or, an exception? A lot depends
    on the mindset of the developer -- if he is planning on only
    searching for names that he knows/assumes to exist, then the "failure"
    suggests something beyond his original conception has happened.]

    Typically the TRY block also is turned into a separate function - at
    least initially - and then (hopefully) it will be recognized as being
    a unique function with a single call point and be inlined again at
    some later compilation stage [at least if it satisfies the inlining
    metrics].

    With a return value, conditional error code - even if never used - is
    likely to have at least been prefetched into cache [at least with
    /inline/ error code]. Exception handling code, being Out-Of-Band by
    design, will NOT be anywhere in cache unless it was used recently.

    This being comp.realtime, it matters also how quickly exceptions can
    be recognized and dispatched. Dispatch implementation can vary by how
    many handlers are active in the call chain, and also by distances
    between TRY and CATCH blocks ... this tends not to be documented very
    well (if at all) so you need to experiment and see what your compiler
    does with various structuring.

    Exactly. And, much of the machinery isn't *intuitively* "visible" to the developer. So, he's unlikely to gauge the cost of encountering an exception; by contrast, he knows what it costs to explicitly examine a status code
    and execute HIS code to handle it.

    [Certain popular compilers have pragmas to mark code as 'hot' or
    'cold' with the intent to control prefetch. You can mark CATCH blocks
    'hot' to keep them near(er) their TRY code, or conversely, you can
    mark conditional inline error handling as 'cold' to avoid having it prefetched because you expect it to be rarely used.

    (I have never seen any code deliberately marked 'cold'. 8-)]

    Of course. And, all real-time is ALWAYS "hard"! And *my* task
    is always of the highest priority! :-/

    The error code camp claims processing the exception is more tedious
    (depends on language binding) AND easily overlooked/not-considered
    (like folks failing to test malloc() for NULL).

    The exception camp claims the OS can install "default" exception handlers
    to address those cases where the developer was lazy or ignorant. And,
    that no such remedy is possible if error codes are processed inline.

    Depends on the language. Some offer exceptions that /must/ be handled
    or the code won't compile.

    And some, like Scala, expect that exception handling will be used for
    general flow of control (like IF/THEN), and so they endeavor to make
    the cost of exception handling as low as possible.

    In a perfect world, a developer would deal with ALL of the possibilities
    laid out by the API.

    But, we have to live with imperfect people... <frown>

    I can implement either case with similar effort (even a compile/run-time
    switch) but would prefer to just "take a stand" and be done with it...

    The answer depends on what languages you are using and what your
    compiler(s) can do.

    I think the bigger problem is the mindset(s) of the developers.
    Few have any significant experience programming using sockets
    (which seems to be the essence of the problems my folks are having).
    So, are unaccustomed to the fact that the *mechanism* has a part
    to play in the function invocation. It's not just a simple BAL/CALL
    that you can count on the CPU being able to execute.

    Add to this the true parallelism in place AND the distributed
    nature (of my project) and it's just hard to imagine that an
    object that you successfully accessed on line 45 of your code
    might no longer be accessible on line 46; "Through no fault
    (action) of your own". And, that the "fault" might lie in
    the "invocation mechanism" *or* the actions of some other agency
    in the system.

    For example,
    if (SUCCESS != function()) ...
    doesn't mean function() "failed"; it could also mean function() was
    never actually executed! Perhaps the object on which it operates
    is no longer accessible to you, or you are attempting to use a
    method that you don't have permission to use, or the server that
    was backing it is now offline, or... (presumably, you will invoke
    different remedies in each case)

    So, either make these "other cases" (status possibilities) more
    visible OR put in some default handling that ensures a lazy developer
    shoots himself in the foot in a very obvious way.

    [For user applets, the idea of a default handler makes lots of sense;
    force their code to crash if any of these "unexpected" situations arise]
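
    In a C++ binding, such a "crash loudly" default is cheap to provide.
    A minimal sketch (standard C++ only; the actual OS hook in my system is
    a different mechanism):

        #include <exception>
        #include <cstdio>
        #include <cstdlib>

        // Last-resort handler: make the lazy developer's oversight obvious.
        [[noreturn]] static void loud_default_handler() {
            std::fputs("UNHANDLED EXCEPTION -- applet terminated\n", stderr);
            std::abort();
        }

        int main() {
            std::set_terminate(loud_default_handler);
            // ... applet code; any uncaught exception now lands above ...
        }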

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Thu May 12 13:26:52 2022
    On Wed, 11 May 2022 00:24:58 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Hi George,

    Hope you are well and surviving this "global inconvenience"? :>

    I am well. The "inconvenience" is inconvenient.
    Hope you are the same. 8-)


    On 5/10/2022 2:49 PM, George Neuner wrote:

    I don't think the manner in which the error gets reported is as
    important as /separating/ error returns from data returns. IMNSHO it
    is a bad mistake for a single value sometimes to represent good data
    and sometimes to represent an error.

    Agreed. I return tuples/structs so the "intended result" can be
    separated from the "status of the invocation". This (hopefully)
    reinforces the *need* (?) for the developer to explicitly check
    the "status" before assuming the "result" makes sense (or that
    the operation actually completed).

    However, there is nothing that ensures the developer will actually
    *examine* the status returned. Or, that he will handle ALL of the
    potential status codes possible!

    Which is the attraction of exceptions - at least in those languages
    which force the programmer to declare what exceptions may be thrown
    and explicitly handle them in calling code.

    Of course, they all allow installing a generic "whatever" handler at
    the top level, so the notion of 'required' is dubious at best ... but
    most languages having exceptions don't require anything and simply
    abort the program if an unhandled exception occurs.

    :
    But, it precludes use of the return value directly in an expression
    (which clutters up the code). [exceptions really go a long way towards
    cleaning this up as you can "catch" the SET of exceptions at the end
    of a block instead of having to explicitly test each "status" returned.]

    If the value might be data or might be error then the only expression
    you /could/ return it to would have to be some kind of conditional.

    Non-FP languages generally classify conditionals as 'statements'
    rather than 'expressions'. Statements can introduce side-effects but
    they don't directly produce return values.

    Score another point for exceptions ... returned data can be assumed
    'good' [for some definition] and directly can be used by surrounding
    code.
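
    A small sketch of that point (made-up names; the "throwing" variant is
    an assumption, not anyone's real API):

        #include <stdexcept>

        // Returns data or throws -- never a value that secretly encodes
        // an error.  Stubbed for illustration.
        double read_sensor(int id) {
            if (id < 0) throw std::runtime_error("sensor not available");
            return 42.0;
        }

        // The result can feed an expression directly: if control reaches
        // the addition, the value is known to be 'good'.
        double compute_setpoint(int id, double offset) {
            return read_sensor(id) + offset;
        }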


    OTOH, the notion of "status" is portable to all (?) language bindings
    (whereas exceptions require language support or "other mechanisms"
    to implement consistently)

    Which is why Microsoft introduced SEH - putting exception support
    directly into the OS. Of course, very few developers outside MS ever
    used it ... but that's a different issue.

    [I am not particularly a fan of MS, but I at least try to acknowledge
    when, IMO, good things are done (even if not intentionally).]


    The locality argument I think too often is misunderstood: it's
    definitely not a "source" problem - you can place CATCH code as close
    to or as far away from the TRY code as you wish.

    Likewise with "status codes". But, you tend not to want to write
    much code without KNOWING that the code you've hoped to have executed
    actually DID execute (as intended). I.e., you wouldn't want to invoke
    several such functions (potentially relying on the results of earlier
    function calls) before sorting out if you've already stubbed your toe!

    There are just too many places where you can get an "unexpected" status
    returned, not just an "operation failed".

    And, with status codes, you need to examine EACH status code individually
    instead of just looking for, e.g., *any* INSUFFICIENT_PERMISSION exception
    or HOST_NOT_AVAILABLE, etc. thrown by any of the functions/methods in the
    TRY block.

    [You also know that the exception abends execution before any subsequent
    functions/methods are invoked; no need to *conditionally* execute function2
    AFTER function1 has returned "uneventfully"]

    I.e., the code just ends up looking a lot "cleaner". More functionality in
    a given amount of paper.

    Yes, exception code generally gives better (source visual) separation
    between the 'good' path and the 'error' path.


    The question really is "how far errors can be allowed to propagate?"

    If an error /must/ be handled close to where it occurs, there is
    little point to using exceptions.

    Exceptions really start to make sense from the developer POV when a
    significant [for some metric] amount of code can be made (mostly) free
    from inline error handling, OR when there is a large set of possible
    errors that can be grouped meaningfully: e.g., instead of parsing
    bitfields or writing an N-way switch on the error value, you can write
    one or a few exception handlers each of which deals just with some
    subset of possible errors.
    [And 'yes', this is a code 'cleanliness' issue: what is likely to be
    easier to understand when you have to look at it again a year later.]
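
    A sketch of that "meaningful grouping", using a hypothetical exception
    hierarchy (none of these types come from Don's system):

        #include <stdexcept>

        struct access_error      : std::runtime_error { using std::runtime_error::runtime_error; };
        struct permission_denied : access_error       { using access_error::access_error; };
        struct host_unavailable  : access_error       { using access_error::access_error; };

        // Stub standing in for an RMI-style call that can fail.
        void do_rmi() { throw host_unavailable("server shed its load"); }

        void client() {
            try {
                do_rmi();
            } catch (const access_error&) {
                // one handler covers the whole family of access failures;
                // a status-code version would switch on each value separately
            }
        }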



    The real problem with exceptions is that most compilers turn the CATCH
    block into a separate function. Then because exceptions are expected
    to be /low probability/ events, the CATCH function by default will be
    treated as a 'cold code' path. In the best case it will be in the
    same load module but separated from the TRY code. In the worst case
    it could end up in a completely different load module.

    I think part of the problem is that there is an expectation that
    exceptions will be rare events. This leads to a certain laziness
    on the developer's part (akin to NOT expecting malloc() to fail
    "very often"). This is especially true if synthesizing those events
    (in test scaffolding) is difficult or not repeatable.

    The belief that exceptions are rare certainly is not universal: a
    significant percentage of application developers believe that
    exceptions are - more or less - interchangeable with conditional
    branching code.

    But the idea that exceptions ARE rare events is all too prevalent
    among compiler/runtime developers. Among system programmers there is
    the notion that "errors are not exceptional". Unfortunately, if you
    continue along that line of thinking, you might come to the conclusion
    that exceptions are not meant for error handling.
    [But if not error handling, then what ARE they good for?]


    But, there are cases where exceptions can be as common as "normal
    execution". Part of the problem is defining WHAT is an "exception"
    and what is a "failure" (bad choices of words) return.

    Terminology always has been a problem: errors are not necessarily 'exceptional', and exceptional conditions are not necessarily
    'errors'. And there is no agreement among developers on the best way
    to handle either situation.

    There always has been some amount of confusion and debate among
    developers about what place exceptions can occupy in various
    programming models.


    [Is "Name not found" an error? Or, an exception? A lot depends
    on the mindset of the developer -- if he is planning on only
    searching for names that he knows/assumes to exist, then the "failure"
    suggests something beyond his original conception has happened.]

    Exactly.


    I can implement either case with similar effort (even a compile/run-time
    switch) but would prefer to just "take a stand" and be done with it...

    The answer depends on what languages you are using and what your
    compiler(s) can do.

    I think the bigger problem is the mindset(s) of the developers.
    Few have any significant experience programming using sockets
    (which seems to be the essence of the problems my folks are having).
    So, are unaccustomed to the fact that the *mechanism* has a part
    to play in the function invocation. It's not just a simple BAL/CALL
    that you can count on the CPU being able to execute.

    Sockets are a royal PITA - many possible return 'conditions', and many
    which are (or should be) non-fatal 'just retry it' issues from the POV
    of the application.

    If you try (or are required by spec) to enumerate and handle all of
    the possibilities, you end up with many pages of handler code.
    BTDTGTTS.


    Add to this the true parallelism in place AND the distributed
    nature (of my project) and its just hard to imagine that an
    object that you successfully accessed on line 45 of your code
    might no longer be accessible on line 46; "Through no fault
    (action) of your own". And, that the "fault" might lie in
    the "invocation mechanism" *or* the actions of some other agency
    in the system.

    For example,
    if (SUCCESS != function()) ...
    doesn't mean function() "failed"; it could also mean function() was
    never actually executed! Perhaps the object on which it operates
    is no longer accessible to you, or you are attempting to use a
    method that you don't have permission to use, or the server that
    was backing it is now offline, or... (presumably, you will invoke
    different remedies in each case)

    So everything has to be written as asynchronous event code. I'd say
    "so what", but there is too much evidence that a large percentage of programmers have a lot of trouble writing asynch code.

    I find it curious because Javascript is said to have more programmers
    than ALL OTHER popular languages combined. Between browsers and
    'node.js', it's available on almost any platform. And in Javascript
    all I/O is asynchronous.

    You'd think there would be lots of programmers able to deal with
    asynch code (modulo learning a new syntax) ... but it isn't true.


    The parallel issues just compound the problem. Most developers have
    problems with thread or task parallelism - never mind distributed.

    Protecting less-capable programmers from themselves is one of the
    major design goals of the virtual machine 'managed' runtimes - JVM,
    CLR, etc. - and of the languages that target them. They often
    sacrifice (sometimes significant) performance for correct operation.


    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.


    So, either make these "other cases" (status possibilities) more
    visible OR put in some default handling that ensures a lazy developer
    shoots himself in the foot in a very obvious way.

    [For user applets, the idea of a default handler makes lots of sense;
    force their code to crash if any of these "unexpected" situations arise]

    It's easy enough to have a default exception handler that kills the application, just as the default signal handler does in Unix. But
    that doesn't help developers very much.


    Java, at least, offers 'checked' exceptions that must be handled -
    either by catching or /deliberate/ propagation ... or else the
    program build will fail. If you call a function that throws checked
    exceptions and don't catch or propagate, the code won't compile. You
    can propagate all the way to top level - right out of the compilation
    module - but if a checked exception isn't handled somewhere, the
    program will fail to link.

    The result, of course, is that relatively few Java developers use
    checked exceptions extensively.


    There are some other languages that work even harder to guarantee that exceptions are handled, but they aren't very popular.


    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Fri May 13 01:00:54 2022
    On 5/12/2022 10:26 AM, George Neuner wrote:
    Hope you are well and surviving this "global inconvenience"? :>

    I am well. The "inconvenience" is inconvenient.
    Hope you are the same. 8-)

    Running ragged. She's found all sorts of things that need to be
    done and, of course, I'm the prime (sole?) candidate for them! <frown>
    Backing up her "backup", tonight. "Is there a reason you've changed
    the file hierarchy from what I'd previously backed up???" (sigh)

    However, there is nothing that ensures the developer will actually
    *examine* the status returned. Or, that he will handle ALL of the
    potential status codes possible!

    Which is the attraction of exceptions - at least in those languages
    which force the programmer to declare what exceptions may be thrown
    and explicitly handle them in calling code.

    Of course, they all allow installing a generic "whatever" handler at
    the top level, so the notion of 'required' is dubious at best ... but
    most languages having exceptions don't require anything and simply
    abort the program if an unhandled exception occurs.

    Which is good in the sense that it makes the developer's shortsightedness visible. But, bad in that it likely isn't what the developer *would* have wanted, had he thought about <whatever> he'd neglected!

    And, as you don't see the thrown exception until the shit has hit
    the fan (i.e., after release), it's not much help DURING development.

    :
    But, it precludes use of the return value directly in an expression
    (which clutters up the code). [exceptions really go a long way towards
    cleaning this up as you can "catch" the SET of exceptions at the end
    of a block instead of having to explicitly test each "status" returned.

    If the value might be data or might be error then the only expression
    you /could/ return it to would have to be some kind of conditional.

    Yes, but with exceptions, the conditional execution (of subsequent statement) is implicitly handled. So, you can string together a bunch of function invocations and KNOW that ftn2 won't be invoked if ftn1 shits the bed.

    Given that any function may be used for its side-effects, it adds an extra dimension to the development if you have to consider WHAT ftn1 will "return"
    if the status is !SUCCESS. And, how that return value might be interpreted
    by ftn2 -- or any other statements subsequent to ftn1's invocation.

    I.e., you end up having to write:

        (status, result) := ftn1(args)
        if (SUCCESS != status) {
            // handle error
        } else {
            (status, result) := ftn2(args)
            if (SUCCESS != status) {
                // handle error
            } else {
                (status, result) := ftn3(args)
                if (SUCCESS != status) {
                    // handle error
                } else {
                    ...

    There are other more appealing representations of the control structure
    but the point is that you have to stop and check the status of each
    invocation before proceeding. You *might* be able to do something like:

        (status1, result1) := ftn1(args1)
        (status2, result2) := ftn2(args2)
        (status3, result3) := ftn3(args3)
        ...
        if ( (SUCCESS != status1)
          || (SUCCESS != status2)
          || (SUCCESS != status3)
          ... ) {
            // handle error

    But, that's not guaranteed. And, if argsN relies on any resultM (M<N),
    then you have to ensure ftnN is well-behaved wrt those argsN. AND REMAINS WELL-BEHAVED as ftnN evolves!
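
    For comparison, the exception-styled counterpart of the ladder above,
    as a C++ sketch (ftn1/ftn2/ftn3 are stand-ins, stubbed here, for calls
    that are assumed to throw instead of returning a status):

        #include <stdexcept>
        #include <cstdio>

        static int ftn1(int a) { return a + 1; }
        static int ftn2(int a) { if (a < 0) throw std::runtime_error("ftn2 failed"); return a * 2; }
        static int ftn3(int a) { return a - 3; }

        void sequence(int arg) {
            try {
                int r1 = ftn1(arg);
                int r2 = ftn2(r1);       // never runs if ftn1 threw
                int r3 = ftn3(r2);       // never runs if ftn2 threw
                std::printf("result %d\n", r3);
            } catch (const std::exception& e) {
                std::printf("sequence abandoned: %s\n", e.what());
            }
        }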

    I.e., the code just ends up looking a lot "cleaner". More functionality in
    a given amount of paper.

    Yes, exception code generally gives better (source visual) separation
    between the 'good' path and the 'error' path.

    The question really is "how far errors can be allowed to propragate?"

    If an error /must/ be handled close to where it occurs, there is
    little point to using exceptions.

    Unless there are whole classes of exceptions that can pertain to
    most functions where the handler need not be concerned with WHERE
    (or "who/what") the exception was thrown but, rather, just address
    the fact that an exception of that particular type was thrown.

    E.g., you likely don't care where a "Not Found" exception was
    thrown. The code was likely written KNOWING that there would
    be no such exceptions encountered (normally). The fact that one
    was encountered indicates that something "exceptional" has happened
    (which can be perfectly "legal") and it should deal with that
    "anomaly".

    For example, if the RDBMS is unavailable (for whatever reason),
    then a task can't "look up" some aspect of its desired future
    behavior. But, that doesn't (necessarily) mean that it should
    abend.

    Imagine an HVAC system not being able to find the next scheduled
    "setting" -- because the RDBMS isn't responding to the SELECT.
    You'd likely *still* want to maintain the current HVAC settings,
    even if not "ideal" -- rather than killing off the HVAC process.

    Or not being able to contact the time server -- do you suddenly
    *forget* what time it is?

    I.e., the developer is likely expecting each of these ftn invocations
    to *work* and is only dealing with the *results* they may provide.
    The fact that the ftn may not actually be invoked (despite what
    his sources say) hasn't been considered.
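
    The HVAC case, as a sketch (the exception type and the lookup are
    hypothetical): a missing schedule means "hold the current setting",
    not "die".

        #include <stdexcept>

        struct rdbms_unavailable : std::runtime_error { using std::runtime_error::runtime_error; };

        // Stub standing in for the real SELECT against the RDBMS.
        double next_scheduled_setpoint() {
            throw rdbms_unavailable("no reply to SELECT");
        }

        double choose_setpoint(double current) {
            try {
                return next_scheduled_setpoint();
            } catch (const rdbms_unavailable&) {
                return current;   // keep the present setting until the DB returns
            }
        }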

    Exceptions really start to make sense from the developer POV when a significant [for some metric] amount of code can be made (mostly) free
    from inline error handling, OR when there is a large set of possible
    errors that can be grouped meaningfully: e.g., instead of parsing
    bitfields or writing an N-way switch on the error value, you can write
    one or a few exception handlers each of which deals just with some
    subset of possible errors.

    Exactly.

    But, there are cases where exceptions can be as common as "normal
    execution". Part of the problem is defining WHAT is an "exception"
    and what is a "failure" (bad choices of words) return.

    Terminology always has been a problem: errors are not necessarily 'exceptional', and exceptional conditions are not necessarily
    'errors'. And there is no agreement among developers on the best way
    to handle either situation.

    There always has been some amount of confusion and debate among
    developers about what place exceptions can occupy in various
    programming models.

    In my world, as the system is open (in a dynamic sense), you can't
    really count on anything being static. A service that you used "three statements earlier" may be shutdown before your latest invocation.
    Sure, a notification is "in the mail" -- but, you may not have received
    (or processed) it, yet.

    Do "you" end up faulting as a result? What are *you* serving??

    I can implement either case with similar effort (even a compile/run-time
    switch) but would prefer to just "take a stand" and be done with it...

    The answer depends on what languages you are using and what your
    compiler(s) can do.

    I think the bigger problem is the mindset(s) of the developers.
    Few have any significant experience programming using sockets
    (which seems to be the essence of the problems my folks are having).
    So, are unaccustomed to the fact that the *mechanism* has a part
    to play in the function invocation. It's not just a simple BAL/CALL
    that you can count on the CPU being able to execute.

    Sockets are a royal PITA - many possible return 'conditions', and many
    which are (or should be) non-fatal 'just retry it' issues from the POV
    of the application.

    If you try (or are required by spec) to enumerate and handle all of
    the possibilities, you end up with many pages of handler code.
    BTDTGTTS.

    What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
    It's just sloppy engineering. You *know* what the system design puts
    in place. You KNOW what guarantees you have -- and DON'T have. Why
    code in ignorance of those realities? Then, be surprised when the
    events that the architecture was designed to tolerate come along and
    bite you in the ass?!

    Add to this the true parallelism in place AND the distributed
    nature (of my project) and its just hard to imagine that an
    object that you successfully accessed on line 45 of your code
    might no longer be accessible on line 46; "Through no fault
    (action) of your own". And, that the "fault" might lie in
    the "invocation mechanism" *or* the actions of some other agency
    in the system.

    For example,
    if (SUCCESS != function()) ...
    doesn't mean function() "failed"; it could also mean function() was
    never actually executed! Perhaps the object on which it operates
    is no longer accessible to you, or you are attempting to use a
    method that you don't have permission to use, or the server that
    was backing it is now offline, or... (presumably, you will invoke
    different remedies in each case)

    So everything has to be written as asynchronous event code. I'd say
    "so what", but there is too much evidence that a large percentage of programmers have a lot of trouble writing asynch code.

    The function calls are still synchronous. But, the possibility that the *mechanism* may fault has to be addressed. It's no longer a bi-valued
    status result: SUCCESS vs. FAILURE. Instead, it's SUCCESS vs. FAILURE
    (as returned by the ftn), INVALID_OBJECT, INSUFFICIENT_PERMISSION, RESOURCE_SHORTAGE, etc.

    But, the developer isn't thinking about any possibility other than SUCCESS/FAILURE.

    [And a fool who tests for != FAILURE -- thinking that implies SUCCESS -- will get royally bitten!]

    I find it curious because Javascript is said to have more programmers
    than ALL OTHER popular languages combined. Between browsers and
    'node.js', it's available on almost any platform. And in Javascript
    all I/O is asynchronous.

    You'd think there would be lots of programmers able to deal with
    asynch code (modulo learning a new syntax) ... but it isn't true.

    The parallel issues just compound the problem. Most developers have
    problems with thread or task parallelism - never mind distributed.

    I've the worst of all worlds -- multitasking, multicore, multiprocessor, distributed, "open", RT w/ asynchronous notification channels.

    At the highest (most abstract) application level, all of this detail
    disappears -- cuz the applications are too complex to do much if the
    services on which they rely fail.

    At the *lowest* level, the scope of "your" problem is usually focused
    enough that you can manage the complexity.

    It's the domain in the middle where things get unruly. There, you're
    usually juggling lots of issues/services and can get distracted from
    the minutiae that can eat your lunch.

    Protecting less-capable programmers from themselves is one of the
    major design goals of the virtual machine 'managed' runtimes - JVM,
    CLR, etc. - and of the languages that target them. They often
    sacrifice (sometimes significant) performance for correct operation.

    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.

    I don't sweat the performance issue; I've got capacity up the wazoo!

    But, I don't think the tools are yet available to deal with "multicomponent" systems, like this. People are still accustomed to dealing with specific devices for specific purposes.

    I refuse to believe future systems will consist of oodles of "dumb devices" talking to a "big" controller that does all of the real thinking. It just won't scale.

    OTOH, there may be a push towards overkill in terms of individual "appliances"
    and "wasting resources" when those devices are otherwise idle. It's possible
    as things are getting incredibly inexpensive! (but I can't imagine folks
    won't want to find some use for those wasted resources. e.g., SETI-ish)

    So, either make these "other cases" (status possibilities) more
    visible OR put in some default handling that ensures a lazy developer
    shoots himself in the foot in a very obvious way.

    [For user applets, the idea of a default handler makes lots of sense;
    force their code to crash if any of these "unexpected" situations arise]

    It's easy enough to have a default exception handler that kills the application, just as the default signal handler does in Unix. But
    that doesn't help developers very much.

    Esp if they don't encounter it in their testing. But users suffer
    its consequences ("Gee, it never did that in the lab...")

    Java, at least, offers 'checked' exceptions that must be handled -
    either by catching or /deliberate/ propagation ... or else the
    program build will fail. If you call a function that throws checked exceptions and don't catch or propagate, the code won't compile. You
    can propagate all the way to top level - right out of the compilation
    module - but if a checked exception isn't handled somewhere, the
    program will fail to link.

    The result, of course, is that relatively few Java developers use
    checked exceptions extensively.

    Amusing how readily folks prefer to avoid the "training wheels"...
    at the expense of FALLING, often!

    There are some other languages that work even harder to guarantee that exceptions are handled, but they aren't very popular.

    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for most
    to use reliably -- but is a great "shortcut" for *my* work!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Sun May 15 17:11:54 2022
    Hi Don,

    Sorry for the delay ... lotsa stuff going on.


    On Fri, 13 May 2022 01:00:54 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/12/2022 10:26 AM, George Neuner wrote:


    In my world, as the system is open (in a dynamic sense), you can't
    really count on anything being static. A service that you used "three
    statements earlier" may be shutdown before your latest invocation.
    Sure, a notification is "in the mail" -- but, you may not have received
    (or processed) it, yet.

    Do "you" end up faulting as a result? What are *you* serving??

    :

    I think the bigger problem is the mindset(s) of the developers.
    Few have any significant experience programming using sockets
    (which seems to be the essence of the problems my folks are having).
    So, are unaccustomed to the fact that the *mechanism* has a part
    to play in the function invocation. It's not just a simple BAL/CALL
    that you can count on the CPU being able to execute.

    Sockets are a royal PITA - many possible return 'conditions', and many
    which are (or should be) non-fatal 'just retry it' issues from the POV
    of the application.

    If you try (or are required by spec) to enumerate and handle all of
    the possibilities, you end up with many pages of handler code.
    BTDTGTTS.

    What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
    It's just sloppy engineering. You *know* what the system design puts
    in place. You KNOW what guarantees you have -- and DON'T have. Why
    code in ignorance of those realities? Then, be surprised when the
    events that the architecture was designed to tolerate come along and
    bite you in the ass?!

    On modern Linux, malloc (almost) never does return NULL.

    By default, Linux allocates logical address space, not physical space
    (ie. RAM or SWAP pages) ... physical space isn't reserved until you
    actually try to use the corresponding addresses. In a default
    configuration, nothing prevents you from malloc'ing more space than you
    actually have. Unless the request exceeds the total possible address
    space, it won't fail.

    Then too, Linux has this nifty (or nasty, depending) OutOfMemory
    service which activates when the real physical space is about to be overcommitted. The OOM service randomly terminates running programs
    in an attempt to free space and 'fix' the problem. Of course, it is
    NOT guaranteed to terminate the application whose page reservation
    caused the overcommit.

    If you actually want to know whether malloc failed - and keep innocent
    programs running - you need to disable the OOM service and change the
    system allocation policy so that it provides *physically backed*
    address space rather than simply logical address space.
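
    A sketch of the point (illustrative only; exact behaviour depends on
    the vm.overcommit_memory sysctl and on how much RAM/swap the box
    actually has): checking malloc()'s return is necessary but, under the
    default overcommit policy, not sufficient -- trouble can surface later,
    when the pages are first touched.

        #include <cstdlib>
        #include <cstring>
        #include <cstdio>

        int main() {
            const std::size_t big = 512UL * 1024 * 1024;   // pick something sizeable
            char *p = static_cast<char*>(std::malloc(big));
            if (p == nullptr) {
                // Reported up front -- guaranteed only with strict accounting
                // (vm.overcommit_memory=2).
                std::fprintf(stderr, "allocation refused\n");
                return 1;
            }
            // With the default policy, physical pages are committed here,
            // as they are touched; an overcommitted system may invoke the
            // OOM killer at this point rather than fail the malloc().
            std::memset(p, 0, big);
            std::free(p);
            return 0;
        }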


    Add to this the true parallelism in place AND the distributed
    nature (of my project) and its just hard to imagine that an
    object that you successfully accessed on line 45 of your code
    might no longer be accessible on line 46; "Through no fault
    (action) of your own". And, that the "fault" might lie in
    the "invocation mechanism" *or* the actions of some other agency
    in the system.

    For example,
    if (SUCCESS != function()) ...
    doesn't mean function() "failed"; it could also mean function() was
    never actually executed! Perhaps the object on which it operates
    is no longer accessible to you, or you are attempting to use a
    method that you don't have permission to use, or the server that
    was backing it is now offline, or... (presumably, you will invoke
    different remedies in each case)

    So everything has to be written as asynchronous event code. I'd say
    "so what", but there is too much evidence that a large percentage of
    programmers have a lot of trouble writing asynch code.

    The function calls are still synchronous. ...

    At least on the surface.


    ... But, the possibility that the
    *mechanism* may fault has to be addressed. It's no longer a bi-valued
    status result: SUCCESS vs. FAILURE. Instead, it's SUCCESS vs. FAILURE
    (as returned by the ftn), INVALID_OBJECT, INSUFFICIENT_PERMISSION,
    RESOURCE_SHORTAGE, etc.

    But, the developer isn't thinking about any possibility other than
    SUCCESS/FAILURE.

    [And a fool who tests for != FAILURE -- thinking that implies SUCCESS -- will
    get royally bitten!]

    The funny thing is that many common APIs follow the original C custom
    of returning zero for success, or (not necessarily negative numbers,
    but) non-zero codes for a failure. Though not as prevalent, some APIs
    also feature multiple notions of 'success'.
    [Of course, some do use a different value for 'success', but zero is
    the most common.]

    In a lot of cases you can consider [in C terms] FAILURE == (!SUCCESS).
    At least as a 1st approximation.

    Anyone much beyond 'newbie' /should/ realize this and should study the
    API to find out what is (and is not) realistic.
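
    In sketch form (hypothetical names), the convention and the trap look
    like this:

        // 0 is success; each distinct nonzero value is a different failure.
        enum Status { SUCCESS = 0, FAILURE = 1, INVALID_OBJECT = 2,
                      INSUFFICIENT_PERMISSION = 3, RESOURCE_SHORTAGE = 4 };

        // Stub standing in for some operation.
        Status do_op() { return INVALID_OBJECT; }

        int caller() {
            Status s = do_op();
            if (s != SUCCESS) {       // correct: anything nonzero is some failure
                return -1;            // (ideally: dispatch on s)
            }
            // if (s != FAILURE) ...  // WRONG: INVALID_OBJECT et al. slip through
            return 0;
        }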


    Protecting less-capable programmers from themselves is one of the
    major design goals of the virtual machine 'managed' runtimes - JVM,
    CLR, etc. - and of the languages that target them. They often
    sacrifice (sometimes significant) performance for correct operation.

    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.

    I don't sweat the performance issue; I've got capacity up the wazoo!

    But, I don't think the tools are yet available to deal with "multicomponent"
    systems, like this. People are still accustomed to dealing with specific
    devices for specific purposes.

    I refuse to believe future systems will consist of oodles of "dumb devices"
    talking to a "big" controller that does all of the real thinking. It just
    won't scale.

    There has been a fair amount of research into so-called 'coordination'
    languages ... particularly for distributed systems. The problem, in
    general, is that it means yet-another (typically declarative) language
    for developers to learn, and yet-another toolchain to master.

    One unsolved problem is whether it is better to embed coordination or
    to impose it. The 'embed' approach is typified by MPI and various
    (Linda-esque) 'tuple-space' systems.

    The 'impose' approach typically involves using a declarative language
    to specify how some group of processes will interact. The spec then
    is used to generate frameworks for the participating processes.
    Typically these systems are designed to create actively monitored
    groups, and processes are distinguished as being 'compute' nodes or 'coordinate' nodes.
    [But there are some systems that can create dynamic 'self-organizing'
    groups.]

    Research has shown that programmers usually find embedded coordination
    easier to work with ... when the requirements are fluid it's often
    difficult to statically enumerate the needed interactions and get
    their semantics correct - which typically results in longer
    development times for imposed methods. But it's also easier to F_ up
    using embedded methods because of the ad hoc nature of their growth.


    OTOH, there may be a push towards overkill in terms of individual "appliances"
    and "wasting resources" when those devices are otherwise idle. It's possible
    as things are getting incredibly inexpensive! (but I can't imagine folks
    won't want to find some use for those wasted resources. e.g., SETI-ish)

    Not 'wasting' per se, but certainly there is a strong trend toward
    spending of some resources to make programmers' lives easier. Problem
    comes when it starts to make programming /possible/ rather than simply 'easier'.


    So, either make these "other cases" (status possibilities) more
    visible OR put in some default handling that ensures a lazy developer
    shoots himself in the foot in a very obvious way.

    The problem with forcing exceptions to always be handled, is that the
    code can become cluttered even if the exception simply is propagated
    up the call chain. Most languages default to propagating rather than
    deliberately requiring a declaration of intent, and so handling of
    propagated exceptions can be forgotten.

    In general there is no practical way to force a result code to be
    examined. It is relatively easy for a compiler to require that a
    function's return value always be caught - but without a lot of extra
    syntax it is impossible to distinguish a 'return value' from a 'result
    code', and impossible to guarantee that every possible return code is enumerated and dealt with.
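
    C++17's [[nodiscard]] is about as far as mainstream compilers go, and it
    illustrates exactly that limit (a sketch with made-up names; note it
    produces a warning, not a compile error):

        enum class Status { Ok, NotFound, PermissionDenied };

        [[nodiscard]] Status do_op() { return Status::Ok; }   // hypothetical

        void caller() {
            do_op();               // compilers typically warn: result ignored
            Status s = do_op();    // "caught" ... but nothing forces an
            (void)s;               //   exhaustive check of every possible code
        }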

    And when essentially any data value can be thrown as an exception you
    invite messes like:

        try {
            :
            throw( some_integer );
            :
        } catch (int v) {
            :
        }

    which - even with an enum of possible codes - really is no more useful
    than a return code.


    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for most
    to use reliably -- but is a great "shortcut" for *my* work!

    Not sure how far you can go without a lot of effort ... C++ is very
    permissive wrt exception handling: an unhandled exception will
    terminate the process, but there's no requirement to enumerate what
    exceptions code will throw, and you always can

        try {
            :
        } catch (...) {
        }

    to ignore exceptions entirely.


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Mon May 16 03:42:33 2022
    On 5/15/2022 2:11 PM, George Neuner wrote:
    Sorry for the delay ... lotsa stuff going on.

    Yeah, I've got to make some ice cream for a neighbor and *then* I
    can get back to *my* chores ("work" being pretty far down that list!)

    [But, first, I have a pair of disk shelves to build (30 drives).
    Always stressful as a screwup means losing LOTS of data!]

    What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
    It's just sloppy engineering. You *know* what the system design puts
    in place. You KNOW what guarantees you have -- and DON'T have. Why
    code in ignorance of those realities? Then, be surprised when the
    events that the architecture was designed to tolerate come along and
    bite you in the ass?!

    On modern Linux, malloc (almost) never does return NULL.

    By default, Linux allocates logical address space, not physical space
    (ie. RAM or SWAP pages) ... physical space isn't reserved until you
    actually try to use the corresponding addresses. In a default

    Ditto here. You can force the pages to be mapped, wire them down *or*
    let them be mapped-as-referenced.

    But, each task has strict resource limits so, eventually, attempting
    to allocate additional memory will exceed your quota and the OS will
    throw an exception. (if you can allocate to your heart's content, then
    what's to stop you from faulting in all of those pages?)

    configuration, nothing prevents you from malloc'ing more space than you
    actually have. Unless the request exceeds the total possible address
    space, it won't fail.

    Then too, Linux has this nifty (or nasty, depending) OutOfMemory
    service which activates when the real physical space is about to be overcommitted. The OOM service randomly terminates running programs
    in an attempt to free space and 'fix' the problem. Of course, it is
    NOT guaranteed to terminate the application whose page reservation
    caused the overcommit.

    A similar problem exists when my scheduler decides to kill off a task
    (if I know you're not going to meet your deadline, then why should I
    let you keep burning resources?) or the workload manager decides to shed
    load (e.g., so it can power down some nodes to conserve backup power).

    These are exactly the "exceptional situations" that are causing my
    colleagues grief. They *expect* to be able to access an object
    backed by a particular server and are surprised when that server
    (and all of the objects that it backs) "suddenly" is not available!

    [Of course, every task having an outstanding reference to an object
    on that server is notified of this event. But, the notifications
    are *asynchronous* (of course) so you may not have handled it before
    you try to access said object. The "synchronous" notification comes
    in the form of a method failure with some status other than "FAILURE"
    (e.g., INVALID_OBJECT, PERMISSION_DENIED, etc.)]

    If you actually want to know whether malloc failed - and keep innocent programs running - you need to disable the OOM service and change the
    system allocation policy so that it provides *physically backed*
    address space rather than simply logical address space.

    I suspect most (many?) Linux boxen also have secondary storage to
    fall back on, in a pinch.

    In my case, there's no spinning rust... and any memory objects used
    to extend physical memory (onto other nodes) is subject to being
    killed off just like any other process.

    [It's a really interesting programming environment when you can't
    count on anything remaining available! Someone/something can always
    be more important than *you*!]

    ... But, the possibility that the
    *mechanism* may fault has to be addressed. It's no longer a bi-valued
    status result: SUCCESS vs. FAILURE. Instead, it's SUCCESS vs. FAILURE
    (as returned by the ftn), INVALID_OBJECT, INSUFFICIENT_PERMISSION,
    RESOURCE_SHORTAGE, etc.

    But, the developer isn't thinking about any possibility other than
    SUCCESS/FAILURE.

    [And a fool who tests for != FAILURE -- thinking that implies SUCCESS -- will
    get royally bitten!]

    The funny thing is that many common APIs follow the original C custom
    of returning zero for success, or (not necessarily negative numbers,
    but) non-zero codes for a failure. Though not as prevalent, some APIs
    also feature multiple notions of 'success'.
    [Of course, some do use a different value for 'success', but zero is
    the most common.]

    Yeah, annoying to have to say "if (!ftn(args)...)" instead of
    "if (ftn(args)...". <shrug> Folks get used to idioms

    In a lot of cases you can consider [in C terms] FAILURE == (!SUCCESS).
    At least as a 1st approximation.

    Anyone much beyond 'newbie' /should/ realize this and should study the
    API to find out what is (and is not) realistic.

    "In an ideal world..."

    The problem with many of these "conditions" is that they are hard
    to simulate and can botch *any* RMI. "Let's assume the first one succeeds
    (or fails) and now assume the second encounters an invalid object exception..."

    Protecting less-capable programmers from themselves is one of the
    major design goals of the virtual machine 'managed' runtimes - JVM,
    CLR, etc. - and of the languages that target them. They often
    sacrifice (sometimes significant) performance for correct operation.

    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.

    I don't sweat the performance issue; I've got capacity up the wazoo!

    But, I don't think the tools are yet available to deal with "multicomponent"
    systems, like this. People are still accustomed to dealing with specific
    devices for specific purposes.

    I refuse to believe future systems will consist of oodles of "dumb devices"
    talking to a "big" controller that does all of the real thinking. It just
    won't scale.

    There has been a fair amount of research into so-called 'coordination' languages ... particularly for distributed systems. The problem, in
    general, is that it means yet-another (typically declarative) language
    for developers to learn, and yet-another toolchain to master.

    Yep. Along with the "sell".

    One unsolved problem is whether it is better to embed coordination or
    to impose it. The 'embed' approach is typified by MPI and various (Linda-esque) 'tuple-space' systems.

    The 'impose' approach typically involves using a declarative language
    to specify how some group of processes will interact. The spec then
    is used to generate frameworks for the participating processes.
    Typically these systems are designed to create actively monitored
    groups, and processes are distinguished as being 'compute' nodes or 'coordinate' nodes.
    [But there are some systems that can create dynamic 'self-organizing' groups.]

    The problem is expecting "after-the-sale" add-ons to accurately adopt
    such methodologies; you don't have control over those offerings.

    If an after-market product implements (offers) a service -- but doesn't adequately address these "exceptional" conditions, then anything that
    (later) relies on that service "inherits" its flaws, even if those
    clients are well-behaved.

    Research has shown that programmers usually find embedded coordination
    easier to work with ... when the requirements are fluid it's often
    difficult to statically enumerate the needed interactions and get
    their semantics correct - which typically results in longer
    development times for imposed methods. But it's also easier to F_ up
    using embedded methods because of the ad hoc nature of their growth.

    OTOH, there may be a push towards overkill in terms of individual "appliances"
    and "wasting resources" when those devices are otherwise idle. Its possible >> as things are getting incredibly inexpensive! (but I can't imagine folks
    won't want to find some use for those wasted resources. e.g., SETI-ish)

    Not 'wasting' per se, but certainly there is a strong trend toward
    spending of some resources to make programmers' lives easier. Problem
    comes when it starts to make programming /possible/ rather than simply 'easier'.

    Yes, but often this *still* leaves resources on the table.

    What is your smart thermostat doing 98% of the time? Waiting
    for the temperature to climb-above/fall-below the current setpoint??
    Why can't it work on transcoding some videos recorded by your TV
    last night? Or, retraining speech models? Or...

    (instead, you over-specify the hardware needed for those *other*
    products to tackle all of these tasks "on their own dime" and
    leave other, *excess* capacity unused!)

    So, either make these "other cases" (status possibilities) more
    visible OR put in some default handling that ensures a lazy developer
    shoots himself in the foot in a very obvious way.

    The problem with forcing exceptions to always be handled is that the
    code can become cluttered even if the exception simply is propagated
    up the call chain. Most languages default to propagating rather than
    deliberately requiring a declaration of intent, and so handling of
    propagated exceptions can be forgotten.

    And, many developers can't/won't bother with the "out-of-the-ordinary" conditions.

    If the RDBMS is not available to capture your updated speech models,
    will you bother trying to maintain them locally -- for the next
    time they are required (instead of fetching them from the RDBMS at
    that time)? Or, will you just drop them and worry about whether or
    not the RDBMS is available when *next* you need it to source the data?

    OTOH, if your algorithm inherently relies on the fact that those
    updates will be passed forward (via the RDBMS), then your future
    behavior will suffer (or fail!) when you've not adequately handled
    that !SUCCESS.

    In general there is no practical way to force a result code to be
    examined. It is relatively easy for a compiler to require that a
    function's return value always be caught - but without a lot of extra
    syntax it is impossible to distinguish a 'return value' from a 'result
    code', and impossible to guarantee that every possible return code is enumerated and dealt with.
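
    C++17's [[nodiscard]] is about as far as the compiler can take it: it will
    complain if the result is dropped on the floor, but nothing forces every
    possible code to be enumerated (sketch, hypothetical names):

    enum class Status { Success, Failure, InvalidObject, ResourceShortage };

    [[nodiscard]] Status do_rmi();       // hypothetical remote call

    void f()
    {
        do_rmi();                        // warning: return value ignored

        if (do_rmi() == Status::Success) {
            // compiles silently -- the other codes were never dealt with
        }
    }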

    Regardless, it gets "messy". And, messy tends to mean "contains
    undiscovered bugs"!

    And when essentially any data value can be thrown as an exception you
    invite messes like:

    try {
        :
        throw( some_integer );
        :
    } catch (int v) {
        :
    }

    which - even with an enum of possible codes - really is no more useful
    than a return code.

    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for most
    to use reliably -- but is a great "shortcut" for *my* work!

    Not sure how far you can go without a lot of effort ... C++ is very permissive wrt exception handling: an unhandled exception will
    terminate the process, but there's no requirement to enumerate what exceptions code will throw, and you always can

    try {
        :
    } catch (...) {
    }

    to ignore exceptions entirely.
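
    About the most a friendlier binding can do by default is make the swallow
    visible instead of silent -- e.g. (sketch, hypothetical logger and work):

    try {
        do_work();                       // hypothetical
    } catch (...) {
        log("unexpected exception; abandoning this request");
        throw;                           // rethrow: let the process-level handler decide
    }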

    The hope was to provide a friendlier environment for dealing
    with the exceptions (my C exception framework is brittle).

    But, I think the whole OOPS approach is flawed for what I'm
    doing. It's too easy/common for something that the compiler
    never sees (is completely unaware of) to alter the current
    environment in ways that will confuse the code that it
    puts in place.

    E.g., the object that you're about to reference on the next line
    no longer exists. Please remember to invoke its destructor
    (or, never try to reference it, again; or, expect exceptions
    to be thrown for every reference to it; or...)

    [consider the case of an object that has such an object embedded
    within it]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Thu May 19 18:42:54 2022
    On Mon, 16 May 2022 03:42:33 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/15/2022 2:11 PM, George Neuner wrote:

    What I'm seeing is folks who "think malloc never (rarely?) returns NULL". It's just sloppy engineering. You *know* what the system design puts
    in place. You KNOW what guarantees you have -- and DON'T have. Why
    code in ignorance of those realities? Then, be surprised when the
    events that the architecture was designed to tolerate come along and
    bite you in the ass?!

    On modern Linux, malloc (almost) never does return NULL.

    By default, Linux allocates logical address space, not physical space
    (ie. RAM or SWAP pages) ... physical space isn't reserved until you
    actually try to use the corresponding addresses. In a default

    Ditto here. You can force the pages to be mapped, wire them down *or*
    let them be mapped-as-referenced.

    Well yes, but not what I meant. You can go to the lower level mmap(2)
    and do whatever you want ... but the standard library malloc/calloc
    behaves according to a system-wide allocation policy which can only be
    changed by an administrator.
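
    For completeness, the lower-level route looks roughly like this
    (Linux-specific sketch; MAP_POPULATE pre-faults the pages, mlock() pins them):

    #include <sys/mman.h>
    #include <stddef.h>

    /* Ask for 'len' bytes of anonymous memory that is actually faulted in
       (and optionally pinned), rather than mere logical address space. */
    void *backed_alloc(size_t len, int pin)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        if (pin && mlock(p, len) != 0) { /* pinning may exceed RLIMIT_MEMLOCK */
            munmap(p, len);
            return NULL;
        }
        return p;
    }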

    But, each task has strict resource limits so, eventually, attempting
    to allocate additional memory will exceed your quota and the OS will
    throw an exception. (if you can allocate to your heart's content, then what's to stop you from faulting in all of those pages?)

    Well, you can impose certain limits via ulimit(3) or the shell command
    of the same name. But ulimit is only for physical space.


    If you actually want to know whether malloc failed - and keep innocent
    programs running - you need to disable the OOM service and change the
    system allocation policy so that it provides *physically backed*
    address space rather than simply logical address space.

    I suspect most (many?) Linux boxen also have secondary storage to
    fall back on, in a pinch.

    Actually no. The overwhelming majority of Linux instances are hosted
    VMs - personal workstations and on-metal servers all together account
    for just a tiny fraction. And while the majority of VMs certainly DO
    have underlying writable storage, typically they are configured (at
    least initially) without any swap device.

    Of course, an administrator can create a swap /file/ if necessary. But
    that's the easy part.

    The larger issue is that, while usually not providing a swap device,
    most cloud providers also forget to change the default swap and memory overcommit behavior. By default, the system allows unbacked address
    allocation and also tries to keep a significant portion (roughly 30%)
    of physical memory free.

    There are dozens of system parameters that affect the allocation and out-of-memory behavior ... unless you are willing to substantially
    over provision memory, the system has to be carefully configured to
    run well with no swap. Among the most common questions you'll see in
    Linux support forums are

    - what the *&^%$ is 'OOM' and why is it killing my tasks?
    - how do I change the OOM policy?
    - how do I change the swap policy?
    - how do I change the memory overcommit policy?
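
    FWIW, the overcommit policy itself is just a sysctl: vm.overcommit_memory
    is 0 (heuristic, the default), 1 (always overcommit) or 2 (don't
    overcommit, bounded by vm.overcommit_ratio). A quick way to check it from
    code (sketch):

    #include <stdio.h>

    /* current vm.overcommit_memory setting, or -1 on error */
    int overcommit_mode(void)
    {
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        int mode = -1;
        if (f) {
            if (fscanf(f, "%d", &mode) != 1)
                mode = -1;
            fclose(f);
        }
        return mode;
    }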


    In my case, there's no spinning rust... and any memory objects used
    to extend physical memory (onto other nodes) is subject to being
    killed off just like any other process.

    Nowadays a large proportion of storage is SSD ... no spinning rust ...
    but I understand. <grin>


    [It's a really interesting programming environment when you can't
    count on anything remaining available! Someone/something can always
    be more important than *you*!]



    The problem with many of these "conditions" is that they are hard
    to simulate and can botch *any* RMI. "Let's assume the first one succeeds (or fails) and now assume the second encounters an invalid object exception..."

    Which is why a (hierarchical) transactional model is so appealing. The
    problem is providing /comprehensive/ system-wide support [taking into
    account the multiple different failure modes], and then teaching
    programmers to work with it.

    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.

    I don't sweat the performance issue; I've got capacity up the wazoo!

    But, I don't think the tools are yet available to deal with "multicomponent"
    systems, like this. People are still accustomed to dealing with specific devices for specific purposes.

    And specific services. The problem comes when you are dealing
    /simultaneously/ with multiple things, any of which can fail or
    disappear at any moment.


    I refuse to believe future systems will consist of oodles of "dumb devices" talking to a "big" controller that does all of the real thinking. It just won't scale.

    And, pretty much, nobody else believes it either.

    So-called 'edge' computing largely is based on distributed tuple-space
    models specifically /because/ they are (or can be) self-organizing and
    are temporally decoupled: individual devices can come and go at will,
    but the state of ongoing computations is maintained in the fabric.


    The problem is expecting "after-the-sale" add-ons to accurately adopt
    such methodologies; you don't have control over those offerings.

    There's no practical way to guarantee 3rd party add-ons will be
    compatible with an existing system. Comprehensive documentation,
    adherence to external standards, and reliance on vendor-provided tools
    only can go so far.


    And, many developers can't/won't bother with the "out-of-the-ordinary" conditions.

    If the RDBMS is not available to capture your updated speech models,
    will you bother trying to maintain them locally -- for the next
    time they are required (instead of fetching them from the RDBMS at
    that time)? Or, will you just drop them and worry about whether or
    not the RDBMS is available when *next* you need it to source the data?

    OTOH, if your algorithm inherently relies on the fact that those
    updates will be passed forward (via the RDBMS), then your future
    behavior will suffer (or fail!) when you've not adequately handled
    that !SUCCESS.

    Now you're conflating 'policy' with 'mechanism' ... which is where
    most [otherwise reasonable] ideas go off the rails.

    If you're worried that a particular service will be a point of failure
    you always can massively over-design and over-provision it such that
    failure becomes statistically however unlikely you desire.

    The end result of such thinking is that every service becomes
    distributed, self-organizing, and self-repairing, and every node
    features multiply redundant hardware (in case there's a fire in the
    'fire-proof vault'), and node software must be able to handle any and
    every contingency for any possible program in case the (multiply
    redundant) fabric fails.


    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for
    most to use reliably -- but is a great "shortcut" for *my* work!

    :

    The hope was to provide a friendlier environment for dealing
    with the exceptions (my C exception framework is brittle).

    But, I think the whole OOPS approach is flawed for what I'm
    doing. It's too easy/common for something that the compiler
    never sees (is completely unaware of) to alter the current
    environment in ways that will confuse the code that it
    puts in place.

    Well, hypothetically, use of objects could include compiler generated fill-in-the-blank type error handling. But that severely restricts
    what languages [certainly what compilers] could be used and places a
    high burden on developers to provide relevant error templates.


    E.g., the object that you're about to reference on the next line
    no longer exists. Please remember to invoke its destructor
    (or, never try to reference it, again; or, expect exceptions
    to be thrown for every reference to it; or...)

    [consider the case of an object that has such an object embedded
    within it]

    There's no particularly good answer to that: the problem really is not
    OO - the same issues need to be addressed no matter what paradigm is
    used. Rather the problem is that certain languages lack the notion of 'try-finally' to guarantee cleanup of local state (in whatever form
    that state exists).

    Another useful syntactic form is one that both opens AND closes the
    object [for some definition of 'open' and 'close'] so the object can
    be made static/global.

    Neither C nor C++ supports 'try-finally'. As a technical matter, all
    the modern compilers actually DO offer extensions with more-or-less
    equivalent functionality - but use of the extensions is non-portable.


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Thu May 19 19:43:09 2022
    On 5/19/2022 3:42 PM, George Neuner wrote:
    On Mon, 16 May 2022 03:42:33 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/15/2022 2:11 PM, George Neuner wrote:

    What I'm seeing is folks who "think malloc never (rarely?) returns NULL". It's just sloppy engineering. You *know* what the system design puts
    code in ignorance of those realities? Then, be surprised when the
    events that the architecture was designed to tolerate come along and
    bite you in the ass?!

    On modern Linux, malloc (almost) never does return NULL.

    By default, Linux allocates logical address space, not physical space
    (ie. RAM or SWAP pages) ... physical space isn't reserved until you
    actually try to use the corresponding addresses. In a default

    Ditto here. You can force the pages to be mapped, wire them down *or*
    let them be mapped-as-referenced.

    Well yes, but not what I meant. You can go to the lower level mmap(2)
    and do whatever you want ... but the standard library malloc/calloc
    behaves according to a system-wide allocation policy which can only be changed by an administrator.

    I treat every request as potentially malevolent (or incompetent).
    So, if you allocate memory (on the heap), I assume that you WILL
    reference it (which will cause the underlying pages to be mapped).

    Rather than wait for you to access memory exceeding your
    allocation limit, I take this action as an "early warning" that
    you are likely going to violate that constraint (if not, why
    did you do it? "Just in case"? Just-in-case-WHAT? You'll
    not be able to use it, ever! Barring the case of deciding to use
    allocation B and not A or C, etc.)

    But, each task has strict resource limits so, eventually, attempting
    to allocate additional memory will exceed your quota and the OS will
    throw an exception. (if you can allocate to your heart's content, then
    what's to stop you from faulting in all of those pages?)

    Well, you can impose certain limits via ulimit(3) or the shell command
    of the same name. But ulimit is only for physical space.

    Yes, or set defaults (stack size, etc.) for each process.
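
    (The programmatic face of ulimit is getrlimit(2)/setrlimit(2); whatever
    spawns the process can tighten the ceilings before handing over control --
    a sketch:)

    #include <sys/resource.h>

    /* cap the stack at 1 MiB and the data segment at 64 MiB for this
       process and anything it subsequently forks */
    int cap_self(void)
    {
        struct rlimit st = {  1UL * 1024 * 1024,  1UL * 1024 * 1024 };
        struct rlimit da = { 64UL * 1024 * 1024, 64UL * 1024 * 1024 };

        if (setrlimit(RLIMIT_STACK, &st) != 0) return -1;
        if (setrlimit(RLIMIT_DATA,  &da) != 0) return -1;
        return 0;
    }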

    In my case, before a piece of code is admitted to the system, it must
    declare its resource requirements, dependencies, etc. and the user
    (installer) must accept those.

    The flip side of that is the application ALSO must accept them as
    constraints! (you can't claim to use X and, in actuality, use 10X
    once you're in the door!)

    If you actually want to know whether malloc failed - and keep innocent
    programs running - you need to disable the OOM service and change the
    system allocation policy so that it provides *physically backed*
    address space rather than simply logical address space.

    I suspect most (many?) Linux boxen also have secondary storage to
    fall back on, in a pinch.

    Actually no. The overwhelming majority of Linux instances are hosted
    VMs - personal workstations and on-metal servers all together account
    for just a tiny fraction. And while the majority of VMs certainly DO
    have underlying writable storage, typically they are configured (at
    least initially) without any swap device.

    Of course, an administrator can create a swap /file/ if necessary. But
    that's the easy part.

    The larger issue is that, while usually not providing a swap device,
    most cloud providers also forget to change the default swap and memory overcommit behavior. By default, the system allows unbacked address allocation and also tries to keep a significant portion (roughly 30%)
    of physical memory free.

    There are dozens of system parameters that affect the allocation and out-of-memory behavior ... unless you are willing to substantially
    over provision memory, the system has to be carefully configured to
    run well with no swap. Among the most common questions you'll see in
    Linux support forums are

    - what the *&^%$ is 'OOM' and why is it killing my tasks?
    - how do I change the OOM policy?
    - how do I change the swap policy?
    - how do I change the memory overcommit policy?

    That would be the equivalent of me letting applications "take what
    they want" and dealing with the consequences at run time.

    I want the user/consumer to know what he's getting into. E.g.,
    if you choose to install an app that is a resource hog, then
    you KNOW it, up front. If the system later decides to kill
    off that application (as precious as you might think it to be),
    then you understand the reasoning (in the circumstances at the
    time).

    The personal workstation notion of just letting the entire
    system degrade isn't a viable option because some of the services
    might be critical -- delaying them so some "silly" app can continue
    to execute (poorly) isn't an option.

    In my case, there's no spinning rust... and any memory objects used
    to extend physical memory (onto other nodes) is subject to being
    killed off just like any other process.

    Nowadays a large proportion of storage is SSD ... no spinning rust ...
    but I understand. <grin>

    For me, the SSD is (one or more) "memory object" possibly PHYSICALLY
    residing on different nodes. In a pinch, I could build "tables"
    to back those on the RDBMS.

    The goal is for you to make known your needs and for the system to arrange for them to be available to you. Or not.

    [It's a really interesting programming environment when you can't
    count on anything remaining available! Someone/something can always
    be more important than *you*!]

    The problem with many of these "conditions" is that they are hard
    to simulate and can botch *any* RMI. "Let's assume the first one succeeds (or fails) and now assume the second encounters an invalid object exception..."

    Which is why a (hierarchical) transactional model is so appealing. The problem is providing /comprehensive/ system-wide support [taking into
    account the multiple different failure modes], and then teaching
    programmers to work with it.

    I think that's a long way off -- at least in iOT terms. Especially if
    you're counting on that processing being "local".

    Think about how "embedded" devices have evolved, over the years.

    Initially, the I/Os (field) was (essentially), on-board -- or, tethered
    by a short pigtail harness.

    As the need to increase the distance between field devices (sensors, actuators) grew, new connection mechanisms were developed:
    current loop, differential, etc. But, the interface (to the processor)
    was still "local".

    In such "closed/contained" devices/systems, you would know at boot time (BIST/POST) that something was wrong and throw an error message to
    get it sorted. If someone unplugged something or something failed
    while running, you either shit the bed *or* caught the error and
    flagged it. But, that was the exception, not the normal way of operation.

    Eventually, digital comms allowed for the remoting of some of these
    devices -- again with a "local" interface (for the "main" CPU) and a
    virtual interface (for the protocol). Responsible system designers
    would include some sort of verification in POST to ensure the
    outboard devices were present, powered and functional. And, likely
    include some sort of runtime detection (comms failure, timeout,
    etc.) to "notice" when they were unplugged or failed.

    "Processors" primarily went outboard with net/web services. But, the interactions with them were fairly well constrained and easily identified: "*this* is where we try to talk to the server..."

    A developer could *expect* the service to be unavailable and address
    that need in *that* spot (in the code).

    Now, we're looking at dynamically distributed applications where
    the code runs "wherever". And, (in my case) can move or "disappear"
    from one moment to the next.

    I think folks just aren't used to considering "every" function (method) invocation as a potential source for "mechanism failure". They don't
    recognize a ftn invocation as potentially suspect.

    The real problem is, nobody ever could make transactional operating
    systems [remember IBM's 'Quicksilver'?] performant enough to be
    competitive even for normal use.

    I don't sweat the performance issue; I've got capacity up the wazoo!

    But, I don't think the tools are yet available to deal with "multicomponent"
    systems, like this. People are still accustomed to dealing with specific >>>> devices for specific purposes.

    And specific services. The problem comes when you are dealing /simultaneously/ with multiple things, any of which can fail or
    disappear at any moment.

    Yes. And, the more ubiquitous the service... the more desirable...
    the more it pervades a design.

    I.e., you can avoid all of these problems -- by implementing everything yourself, in your own process space. Then, the prospect of "something" disappearing is a non-issue -- YOU "disappear" (and thus never know
    you're gone!)

    I refuse to believe future systems will consist of oodles of "dumb devices"
    talking to a "big" controller that does all of the real thinking. It just >>>> won't scale.

    And, pretty much, nobody else believes it either.

    So-called 'edge' computing largely is based on distributed tuple-space
    models specifically /because/ they are (or can be) self-organizing and
    are temporally decoupled: individual devices can come and go at will,
    but the state of ongoing computations is maintained in the fabric.

    But (in the embedded/RT system world) they are still "devices" with
    specific functionalities. We're not (yet) accustomed to treating
    "processing" as a resource that can be dispatched as needed. There
    are no mechanisms where you can *request* more processing (beyond
    creating another *process* and hoping <something> recognizes that
    it can co-execute elsewhere)

    The problem is expecting "after-the-sale" add-ons to accurately adopt
    such methodologies; you don't have control over those offerings.

    There's no practical way to guarantee 3rd party add-ons will be
    compatible with an existing system. Comprehensive documentation,
    adherence to external standards, and reliance on vendor-provided tools
    only can go so far.

    Exactly. I try to make it considerably easier for you to "do things
    my way" instead of rolling your own. E.g., nothing to stop you from implementing your own kludge file system *in* the RDBMS (with
    generic tables acting as "sector stores"). But, the fact that the
    RDBMS will *organize* your data so you aren't wasting effort parsing
    the contents of "files" that you've stashed is, hopefully, a big
    incentive to "do it my way".

    And, many developers can't/won't bother with the "out-of-the-ordinary"
    conditions.

    If the RDBMS is not available to capture your updated speech models,
    will you bother trying to maintain them locally -- for the next
    time they are required (instead of fetching them from the RDBMS at
    that time)? Or, will you just drop them and worry about whether or
    not the RDBMS is available when *next* you need it to source the data?

    OTOH, if your algorithm inherently relies on the fact that those
    updates will be passed forward (via the RDBMS), then your future
    behavior will suffer (or fail!) when you've not adequately handled
    that !SUCCESS.

    Now you're conflating 'policy' with 'mechanism' ... which is where
    most [otherwise reasonable] ideas go off the rails.

    If you're worried that a particular service will be a point of failure
    you always can massively over-design and over-provision it such that
    failure becomes statistically however unlikely you desire.

    But I can only do that for the services that *I* design. I can't
    coerce a 3rd party to design with that goal in mind.

    I have provisions for apps to request redundancy where the RTOS will automatically checkpoint the app/process and redispatch a process
    after a fault. But, that comes at a cost (to the app and the system)
    and could quickly become a crutch for sloppy developers. Just
    because your process is redundantly backed, doesn't mean it is free
    of design flaws.

    And, doesn't mean there will be resources to guarantee you the performance you'd like!

    The end result of such thinking is that every service becomes
    distributed, self-organizing, and self-repairing, and every node
    features multiply redundant hardware (in case there's a fire in the 'fire-proof vault'), and node software must be able to handle any and
    every contingency for any possible program in case the (multiply
    redundant) fabric fails.

    I don't think many systems need that sort of "robustness".

    If the air handler for the coating pan shits the bed, the coating pan
    is effectively off-line until you can replace that mechanism. How
    is that any different than the processor in the AHU failing? (Sure
    a helluvalot easier to replace a processor module than a *mechanism*!)

    If there's not enough power to run the facial recognition software to
    announce visitors at your front door, how is that different than the
    doorbell not having power?

    If a user overloads his system with too many apps/tasks, how is that
    any worse than having too many apps running on a "centralized processor"
    and performance suffering across the board?

    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for
    most to use reliably -- but is a great "shortcut" for *my* work!

    :

    The hope was to provide a friendlier environment for dealing
    with the exceptions (my C exception framework is brittle).

    But, I think the whole OOPS approach is flawed for what I'm
    doing. It's too easy/common for something that the compiler
    never sees (is completely unaware of) to alter the current
    environment in ways that will confuse the code that it
    puts in place.

    Well, hypothetically, use of objects could include compiler generated fill-in-the-blank type error handling. But that severely restricts
    what languages [certainly what compilers] could be used and places a
    high burden on developers to provide relevant error templates.

    I thought of doing that with the stub generator. But, then I would
    need a different binding for each client of a particular service.

    And, it would mean that a particular service would ALWAYS behave
    a certain way for a particular task; you couldn't expect one type of
    behavior on line 24 and another on 25.

    E.g., the object that you're about to reference on the next line
    no longer exists. Please remember to invoke its destructor
    (or, never try to reference it, again; or, expect exceptions
    to be thrown for every reference to it; or...)

    [consider the case of an object that has such an object embedded
    within it]

    There's no particularly good answer to that: the problem really is not
    OO - the same issues need to be addressed no matter what paradigm is
    used. Rather the problem is that certain languages lack the notion of 'try-finally' to guarantee cleanup of local state (in whatever form
    that state exists).

    How you expose the "exceptional behaviors" is the issue. There's
    a discrete boundary between objects, apps and RTOS so anything that
    relies on blurring that boundary runs the risk of restricting future
    solutions, language bindings, etc.

    Another useful syntactic form is one that both opens AND closes the
    object [for some definition of 'open' and 'close'] so the object can
    be made static/global.

    An object is implicitly "open"ed when a reference to it is created.
    From that point forward, it can be referenced and operated upon by its methods. When the last reference to an object disappears, the object essentially is closed/destroyed.

    At any time in the process, it can be migrated or destroyed by some other
    actor (or, by the actions/demise of its server).

    You saved an address to your Thunderbird address book. You *hope* it will
    be there, later. But, you can only reasonably expect that if you have sole control over who can access that address book (directly or indirectly). Likewise for emails (stored on server); just because you haven't
    deleted one doesn't GUARANTEE that it will still be there! (your entire account could go away at the whim of the provider)

    Neither C nor C++ supports 'try-finally'. As a technical matter, all
    the modern compilers actually DO offer extensions with more-or-less equivalent functionality - but use of the extensions is non-portable.

    The problem is still in the mindsets of the developers. Until they actively embrace the possibility of these RMI failing, they won't even begin to consider how to address that possibility.

    [That's why I suggested the "automatically insert a template after each invocation" -- to remind them of each of the possible outcomes in a
    very hard to ignore way!]
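
    I.e., something the tooling drops in after every stub call, which the
    developer then has to fill in (or consciously delete) arm by arm -- a
    sketch with hypothetical status names:

    status = object_method(handle, args);       /* generated RMI stub */
    switch (status) {
    case SUCCESS:           /* TODO: the normal path */            break;
    case FAILURE:           /* TODO: the method itself said no */  break;
    case INVALID_OBJECT:    /* TODO: object vanished */            break;
    case PERMISSION_DENIED: /* TODO: rights were revoked */        break;
    case RESOURCE_SHORTAGE: /* TODO: back off / retry? */          break;
    default:                /* TODO: mechanism failure */          break;
    }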

    I work with a lot of REALLY talented people so I am surprised that this
    is such an issue. They understand the mechanism. They have to understand
    that wires can break, devices can become disconnected, etc. AND that this
    can happen AT ANY TIME (not just "prior to POST"). So, why no grok?

    [OTOH, C complains that I fail to see the crumbs I leave on the counter
    each time I prepare her biscotti: "How can you not SEE them??" <shrug>]

    I had a similar problem trying to get them used to having two different
    notions of "time": system time and wall time. And, the fact that they
    are entirely different schemes with different rules governing their
    behavior. I.e., if you want to do something "in an hour", then say
    "in an hour" and use the contiguous system time scheme. OTOH, if you
    want to do something at 9PM, then use the wall time. Don't expect any correlation between the two!
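
    In C++ terms the split maps onto std::chrono::steady_clock ("in an hour":
    monotonic, never steps) versus std::chrono::system_clock (wall time, which
    can be corrected underneath you) -- a sketch:

    #include <chrono>
    using namespace std::chrono;

    void schedule_examples()
    {
        // "in an hour": a relative offset on the monotonic clock, immune
        // to NTP steps, DST changes and manual corrections
        auto in_an_hour = steady_clock::now() + hours(1);

        // "at 9 PM": an absolute wall-clock instant (getting from "9 PM
        // local" to a time_point is a calendar/timezone problem, elided here)
        system_clock::time_point at_nine_pm = system_clock::now();

        // The two use different clocks; the compiler will (rightly) refuse
        // to compare or subtract them directly.
        (void)in_an_hour; (void)at_nine_pm;
    }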

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Fri May 20 21:59:54 2022
    On Thu, 19 May 2022 19:43:09 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/19/2022 3:42 PM, George Neuner wrote:

    :

    Now, we're looking at dynamically distributed applications where
    the code runs "wherever". And, (in my case) can move or "disappear"
    from one moment to the next.

    I think folks just aren't used to considering "every" function (method) invocation as a potential source for "mechanism failure". They don't recognize a ftn invocation as potentially suspect.

    Multiuser databases have been that way since ~1960s: you have to code
    with the expectation that /anything/ you try to do in the database may
    fail ... not necessarily due to error, but simply because another
    concurrent transaction is holding some resource you need. Your
    transaction may be arbitrarily delayed, or even terminated: in the
    case of deadlock, participating transactions are killed one by one
    until /some/ one of them [not necessarily yours] is able to complete.

    You have to be ready at all times to retry operations or do something
    else reasonable with their failure.
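
    The resulting client code tends to take a shape like this (sketch,
    hypothetical API):

    struct Txn { /* ... */ };                // hypothetical transaction handle
    Txn  begin();
    bool do_work(Txn &);
    bool commit(Txn &);
    void rollback(Txn &);
    void backoff(int attempt);               // sleep a bit, maybe exponential

    // Retry a bounded number of times; anything still failing after that
    // is surfaced to the caller to do "something reasonable" with.
    bool run_with_retry(int attempts)
    {
        for (int i = 0; i < attempts; ++i) {
            Txn t = begin();
            if (do_work(t) && commit(t))
                return true;
            rollback(t);                     // e.g. killed as a deadlock victim
            backoff(i);
        }
        return false;
    }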


    I.e., you can avoid all of these problems -- by implementing everything yourself, in your own process space. Then, the prospect of "something" disappearing is a non-issue -- YOU "disappear" (and thus never know
    you're gone!)

    Which is how things were done in ye old days. <grin>


    So-called 'edge' computing largely is based on distributed tuple-space
    models specifically /because/ they are (or can be) self-organizing and
    are temporally decoupled: individual devices can come and go at will,
    but the state of ongoing computations is maintained in the fabric.

    But (in the embedded/RT system world) they are still "devices" with
    specific functionalities. We're not (yet) accustomed to treating "processing" as a resource that can be dispatched as needed. There
    are no mechanisms where you can *request* more processing (beyond
    creating another *process* and hoping <something> recognizes that
    it can co-execute elsewhere)

    The idea doesn't preclude having specialized nodes ... the idea is
    simply that if a node crashes, the task state [for some approximation]
    is preserved "in the cloud" and so can be restored if the same node
    returns, or the task can be assumed by another node (if possible).

    It often requires moving code as well as data, and programs need to be
    written specifically to regularly checkpoint / save state to the
    cloud, and to be able to resume from a given checkpoint.
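
    A sketch of what "written to checkpoint" usually amounts to (all names
    hypothetical):

    // The resumable state lives in one serializable struct; work is broken
    // into steps and the checkpoint is written after each one.
    struct State { int next_step = 0; /* ... application data ... */ };

    bool save(const State &);                // hypothetical: push to the fabric
    bool load(State &);                      // hypothetical: pull last checkpoint
    void do_step(State &, int step);         // hypothetical unit of work

    void run(int total_steps)
    {
        State s;
        load(s);                             // resume wherever we left off
        for (int i = s.next_step; i < total_steps; ++i) {
            do_step(s, i);
            s.next_step = i + 1;
            save(s);                         // small, incremental, frequent
        }
    }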

    The "tuple-space" aspect specifically is to coordinate efforts by
    multiple nodes without imposing any particular structure or
    communication pattern on participating nodes ... with appropriate TS
    support many different communication patterns can be accommodated simultaneously.


    I have provisions for apps to request redundancy where the RTOS will automatically checkpoint the app/process and redispatch a process
    after a fault. But, that comes at a cost (to the app and the system)
    and could quickly become a crutch for sloppy developers. Just
    because your process is redundantly backed, doesn't mean it is free
    of design flaws.

    "Image" snapshots are useful in many situations, but they largely are impractical to communicate through a wide-area distributed system.
    [I know your system is LAN based - I'm just making a point.]

    For many programs, checkpoint data will be much more compact than a
    snapshot of the running process, so it makes more sense to design
    programs to be resumed - particularly if you can arrange that reset of
    a faulting node doesn't eliminate the program, so code doesn't have to
    be downloaded as often (or at all).

    Even if the checkpoint data set is enormous, it often can be saved incrementally. You then have to weigh the cost of resuming, which
    requires the whole data set be downloaded.


    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for
    most to use reliably -- but is a great "shortcut" for *my* work!
    :
    The hope was to provide a friendlier environment for dealing
    with the exceptions (my C exception framework is brittle).

    But, I think the whole OOPS approach is flawed for what I'm
    doing. It's too easy/common for something that the compiler
    never sees (is completely unaware of) to alter the current
    environment in ways that will confuse the code that it
    puts in place.

    Well, hypothetically, use of objects could include compiler generated
    fill-in-the-blank type error handling. But that severely restricts
    what languages [certainly what compilers] could be used and places a
    high burden on developers to provide relevant error templates.

    I thought of doing that with the stub generator. But, then I would
    need a different binding for each client of a particular service.

    And, it would mean that a particular service would ALWAYS behave
    a certain way for a particular task; you couldn't expect one type of
    behavior on line 24 and another on 25.

    I wasn't thinking of building the error handling into the interface
    object, but rather a "wizard" type code skeleton inserted at the point
    of use. You couldn't do it with the IDL compiler ... unless it also
    generated skeleton for clients as well as for servers - which is not
    typical.

    I suppose you /could/ have the IDL generate different variants of the
    same interface ... perhaps in response to a checklist of errors to be
    handled provided by the programmer when the stub is created.

    But then to allow for different behaviors in the same program, you
    might need to generate multiple variants of the same interface object
    and make sure to use the right one in the right place. Too much
    potential for F_up there.
    [Also, IIRC, you are based on CORBA? So potentially a resource drain
    given that just the interface object in the client can initiate a
    'session' with the server.]



    How you expose the "exceptional behaviors" is the issue. There's
    a discrete boundary between objects, apps and RTOS so anything that
    relies on blurring that boundary runs the risk of restricting future >solutions, language bindings, etc.

    The problem really is that RPC tries to make all functions appear as
    if they are local ... 'local' meaning "in the same process".

    At least /some/ of the "blurring" you speak of goes away in languages
    like Scala where every out-of-process call - e.g., I/O, invoking
    functions from a shared library, messaging another process, etc. - all
    are treated AS IF RPC to a /physically/ remote server, regardless of
    whether that actually is true. If an unhandled error occurs for any out-of-process call, the process is terminated.
    [Scala is single threaded, but its 'processes' are very lightweight,
    more like 'threads' in other systems.]


    The problem is still in the mindsets of the developers. Until they actively embrace the possibility of these RMI failing, they won't even begin to consider
    how to address that possibility.

    [That's why I suggested the "automatically insert a template after each invocation" -- to remind them of each of the possible outcomes in a
    very hard to ignore way!]

    I work with a lot of REALLY talented people so I am surprised that this
    is such an issue. They understand the mechanism. They have to understand
    that wires can break, devices can become disconnected, etc. AND that this
    can happen AT ANY TIME (not just "prior to POST"). So, why no grok?

    I didn't understand it until I started working seriously with DBMS.

    A lot of the code I write now looks and feels transactional,
    regardless of what it's actually doing. I try to make sure there are no
    side effects [other than failure] and I routinely (ab)use exceptions
    and Duff's device to back out of complex situations, release
    resources, undo (where necessary) changes to data structures, etc.
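
    One reading of that fallthrough trick, for staged acquire/undo (sketch;
    the resources and their names are hypothetical):

    int  get_a(void), get_b(void), get_c(void);  /* nonzero on success */
    void put_a(void), put_b(void), put_c(void);

    void staged(void)
    {
        int acquired = 0;
        if (get_a())                  acquired = 1;
        if (acquired == 1 && get_b()) acquired = 2;
        if (acquired == 2 && get_c()) acquired = 3;

        if (acquired == 3) {
            /* ... the real work ... */
        }

        switch (acquired) {           /* unwind in reverse order */
        case 3: put_c();              /* fall through */
        case 2: put_b();              /* fall through */
        case 1: put_a();              /* fall through */
        case 0: break;
        }
    }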

    I won't hesitate to wrap a raw service API, and create different
    versions of it that handle things differently. Of course this results
    in some (usually moderate) code growth - which is less a problem in
    more powerful systems. I have written for small devices in the past,
    but I don't do that anymore.


    [OTOH, C complains that I fail to see the crumbs I leave on the counter
    each time I prepare her biscotti: "How can you not SEE them??" <shrug>]

    When the puppy pees on the carpet, everyone develops indoor blindness.
    <grin>


    I had a similar problem trying to get them used to having two different notions of "time": system time and wall time. And, the fact that they
    are entirely different schemes with different rules governing their
    behavior. I.e., if you want to do something "in an hour", then say
    "in an hour" and use the contiguous system time scheme. OTOH, if you
    want to do something at 9PM, then use the wall time. Don't expect any correlation between the two!

    Once I wrote a small(ish) parser, in Scheme, for AT-like time specs.
    You could say things like "(AT now + 40 minutes)" or "(AT 9pm
    tomorrow)", etc. and it would figure out what that meant wrt the
    system clock. The result was the requested epoch expressed in seconds.
    That epoch could be used directly to set an absolute timer, or to poll expiration with a simple 'now() >= epoch'.

    It was implemented as a compile time macro that generated and spliced
    in code at the call site to compute the desired answer at runtime.

    The parser understood 12 and 24 hour clocks, names of days and months, expressions like 'next tuesday', etc. There were a few small
    auxiliary functions required to be linked into the executable to
    figure out, e.g., what day it was, on-the-fly, but many typical uses
    just reduced to something like '(+ (current_time) <computed_offset>)'.
    The parser itself was never in the executable.

    Simplified a lot of complicated clock handling.

    YMMV,
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Sat May 21 05:03:57 2022
    On 5/20/2022 6:59 PM, George Neuner wrote:
    Now, we're looking at dynamically distributed applications where
    the code runs "wherever". And, (in my case) can move or "disappear"
    from one moment to the next.

    I think folks just aren't used to considering "every" function (method)
    invocation as a potential source for "mechanism failure". They don't
    recognize a ftn invocation as potentially suspect.

    Multiuser databases have been that way since ~1960s: you have to code
    with the expectation that /anything/ you try to do in the database may
    fail ... not necessarily due to error, but simply because another
    concurrent transaction is holding some resource you need. Your
    transaction may be arbitrarily delayed, or even terminated: in the
    case of deadlock, participating transactions are killed one by one
    until /some/ one of them [not necessarily yours] is able to complete.

    Yes. The "code" (SQL) isn't really executing on the client. What
    you have, in effect, is an RPC; the client is telling the DBMS what to do
    and then waiting on its results. Inherent in that is the fact that
    there is a disconnect between the request and execution -- a *mechanism*
    that can fail.

    But, in my (limited) experience, DB apps tend to be relatively short
    and easily decomposed into transactions. You "feel" like you've accomplished some portion of your goal after each interaction with the DB.

    By contrast, "procedural" apps tend to have finer-grained actions; you
    don't get the feeling that you've made "definable" progress until you've largely met your goal.

    E.g., you'd have to sum N values and then divide the sum by N to get
    an average. If any of those "adds" was interrupted, you'd not feel
    like you'd got anything done. Likewise, if you did the division
    to get the average, you'd likely still be looking to do something MORE
    with that figure.

    The DBMS doesn't have an "out" when it has summed the first M (M<N) values;
    it sums them and forms the average BEFORE it can abend. Or, shits the bed completely. There's no "half done" state.

    You have to be ready at all times to retry operations or do something
    else reasonable with their failure.

    Yes. And, that recovery can be complicated, esp if the operations up
    to this point have had side effects, etc. How do you "unring the bell"?

    When *I* write code for the system, I leverage lots of separate threads,
    cuz threads are cheap. I do this even if thread B is intended to wait
    until A is complete. This helps me compartmentalize the actions that I'm taking. And, lets the exception handler (for the task) simply examine the states of the threads and decide what has dropped dead and how to unwind
    the operation.

    Or, restart THAT portion handled by the thread that choked.

    E.g., (silly)
    thread1 get N values
    thread2 sum values
    thread3 form average
    If any thread fails to complete, I can restart it (at some loss of
    already completed work). I don't need to keep track of WHERE a
    thread has shit the bed to be able to restart it from that point.
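
    A minimal sketch of that restart-the-stage idea (the stages themselves are
    hypothetical and just report success/failure):

    #include <functional>
    #include <thread>

    // Run a stage in its own thread; if it reports failure, run it again.
    // (Partial work inside the stage is simply lost, as noted above.)
    void run_stage(const std::function<bool()> &stage)
    {
        bool ok = false;
        while (!ok) {
            std::thread t([&] { ok = stage(); });
            t.join();
        }
    }

    // usage, mirroring the silly example:
    //   run_stage(get_values); run_stage(sum_values); run_stage(form_average);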

    I.e., you can avoid all of these problems -- by implementing everything
    yourself, in your own process space. Then, the prospect of "something"
    disappearing is a non-issue -- YOU "disappear" (and thus never know
    you're gone!)

    Which is how things were done in ye old days. <grin>

    While it means you have to do everything, it also means you
    KNOW how everything is being done!

    But, that means you're forever reinventing the wheel. It's why
    app #1 says a file is 1023KB and another says 1024KB and a third says
    1MB.

    So-called 'edge' computing largely is based on distributed tuple-space
    models specifically /because/ they are (or can be) self-organizing and
    are temporally decoupled: individual devices can come and go at will,
    but the state of ongoing computations is maintained in the fabric.

    But (in the embedded/RT system world) they are still "devices" with
    specific functionalities. We're not (yet) accustomed to treating
    "processing" as a resource that can be dispatched as needed. There
    are no mechanisms where you can *request* more processing (beyond
    creating another *process* and hoping <something> recognizes that
    it can co-execute elsewhere)

    The idea doesn't preclude having specialized nodes ... the idea is

    I'm arguing for the case of treating each node as "specialized + generic"
    and making the generic portion available for other uses that aren't
    applicable to the "specialized" nature of the node (hardware).

    Your doorbell sits in an idiot loop waiting to "do something" -- instead
    of spending that "idle time" working on something *else* so the "device"
    that would traditionally be charged with doing that something else
    can get by with less resources on-board.

    [I use cameras galore. Imagining feeding all that video to a single
    "PC" would require me to keep looking for bigger and faster PCs!]

    simply that if a node crashes, the task state [for some approximation]
    is preserved "in the cloud" and so can be restored if the same node
    returns, or the task can be assumed by another node (if possible).

    It often requires moving code as well as data, and programs need to be written specifically to regularly checkpoint / save state to the
    cloud, and to be able to resume from a given checkpoint.

    Yes. For me, all memory is wrapped in "memory objects". Each has particular attributes (and policies/behaviors), depending on its intended use.

    E.g., the TEXT resides in an R/O object -- attempts to corrupt it
    will result in a SIGSEGV, etc. Whether it is all wired down, faulted
    in (and out!) as required or some combination is defined by the nature
    of the app -- and how "obliging" it wants to be to the system wrt its
    resource allocation. (if you are considerate, it's less likely you will
    be axed during a resource shortage!)

    The DATA resides in an R/W object. Ditto for the BSS and stack.
    The policies applied to each vary based on their usage. E.g., the
    stack has an upper limit and an "auto-allocate" policy and traps to
    ensure it doesn't overflow or underflow. By contrast, the DATA is
    a fixed size which can optionally be mapped (entirely) at task
    creation or "on-demand".

    I leverage my ability to "migrate" a task (task is resource
    container) to *pause* the task and capture a snapshot of each
    memory object (some may not need to be captured if they are
    copies of identical objects elsewhere in the system) AS IF it
    was going to be migrated.

    But, instead of migrating the task, I simply let it resume, in place.

    The problem with this is watching for side-effects that happen
    between snapshots. I can hook all of the "handles" out of the
    task -- but, no way I can know what each of those "external
    objects" might be doing.

    OTOH, if I know that no external references have taken place since the
    last "snapshot", then I can safely restart the task from the last
    snapshot.

    It is great for applications that are well suited to checkpointing,
    WITHOUT requiring the application to explicitly checkpoint itself.

    The "tuple-space" aspect specifically is to coordinate efforts by
    multiple nodes without imposing any particular structure or
    communication pattern on participating nodes ... with appropriate TS
    support many different communication patterns can be accommodated simultaneously.

    I have provisions for apps to request redundancy where the RTOS will
    automatically checkpoint the app/process and redispatch a process
    after a fault. But, that comes at a cost (to the app and the system)
    and could quickly become a crutch for sloppy developers. Just
    because your process is redundantly backed, doesn't mean it is free
    of design flaws.

    "Image" snapshots are useful in many situations, but they largely are impractical to communicate through a wide-area distributed system.
    [I know your system is LAN based - I'm just making a point.]

    Yup. For me, I can "move" a process in a few milliseconds -- because
    most are lightweight, small, etc. Having lots of hooks *out* of a process
    into other processes helps reduce the state and amount of TEXT (someone
    ELSE is handling those aspects of the problem) that are "in" that process.

    For many programs, checkpoint data will be much more compact than a
    snapshot of the running process, so it makes more sense to design
    programs to be resumed - particularly if you can arrange that reset of
    a faulting node doesn't eliminate the program, so code doesn't have to
    be downloaded as often (or at all).

    Yes, but that requires more skill on the part of the developer.
    And, makes it more challenging for him to test ("What if your
    app dies *here*? Have you checkpointed the RIGHT things to
    be able to recover? And, what about *here*??")

    I'm particularly focused on user-level apps (scripts) where I can
    build hooks into the primitives that the user employs to effectively
    keep track of what they've previously been asked to do -- keeping in
    mind that these will tend to be very high-levels of abstraction
    (from the user's perspective).

    E.g.,
    At 5:30PM record localnews
    At 6:00PM record nationalnews
    remove_commercials(localnews)
    remove_commercials(nationalnews)
    when restarted, each primitive can look at the current time -- and state
    of the "record" processes -- to sort out where they are in the sequence.
    And, the presence/absence of the "commercial-removed" results. (obviously
    you can't record a broadcast that has already ended so why even try!)

    Note that the above can be a KB of "code" + "state" -- because
    all of the heavy lifting is (was?) done in other processes.

    Even if the checkpoint data set is enormous, it often can be saved incrementally. You then have to weigh the cost of resuming, which
    requires the whole data set be downloaded.

    I'm starting on a C++ binding. That's C-ish enough that it won't
    be a tough sell. My C-binding exception handler is too brittle for most to use reliably -- but is a great "shortcut" for *my* work!
    :
    The hope was to provide a friendlier environment for dealing
    with the exceptions (my C exception framework is brittle).

    But, I think the whole OOPS approach is flawed for what I'm
    doing. It's too easy/common for something that the compiler
    never sees (is completely unaware of) to alter the current
    environment in ways that will confuse the code that it
    puts in place.

    Well, hypothetically, use of objects could include compiler generated
    fill-in-the-blank type error handling. But that severely restricts
    what languages [certainly what compilers] could be used and places a
    high burden on developers to provide relevant error templates.

    I thought of doing that with the stub generator. But, then I would
    need a different binding for each client of a particular service.

    And, it would mean that a particular service would ALWAYS behave
    a certain way for a particular task; you couldn't expect one type of
    behavior on line 24 and another on 25.

    I wasn't thinking of building the error handling into the interface
    object, but rather a "wizard" type code skeleton inserted at the point
    of use. You couldn't do it with the IDL compiler ... unless it also generated skeleton for clients as well as for servers - which is not
    typical.

    It generates client- and server-side stubs. But, those are *stubs*.
    So, their internals are not editable as part of the invoking client
    code.

    I suppose you /could/ have the IDL generate different variants of the
    same interface ... perhaps in response to a checklist of errors to be
    handled provided by the programmer when the stub is created.

    But then to allow for different behaviors in the same program, you
    might need to generate multiple variants of the same interface object
    and make sure to use the right one in the right place. Too much
    potential for F_up there.

    Exactly. What flavor of vanilla do I want, *here*?

    [Also, IIRC, you are based on CORBA? So potentially a resource drain
    given that just the interface object in the client can initiate a
    'session' with the server.]

    How you expose the "exceptional behaviors" is the issue. There's
    a discrete boundary between objects, apps and RTOS so anything that
    relies on blurring that boundary runs the risk of restricting future
    solutions, language bindings, etc.

    The problem really is that RPC tries to make all functions appear as
    if they are local ... 'local' meaning "in the same process".

    Of course! That's the beauty -- and curse!

    No dicking around with sockets, marshalling/packing arguments
    into specific message formats (defined by the called party),
    checking parameters, converting data formats/endian-ness, etc.

    But, by hiding all of that detail, it is too easy for folks to
    forget that it is still *there*!
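
    As a reminder of what the hidden plumbing actually does, here is a
    minimal C sketch of the sort of conversion a generated stub performs on
    the caller's behalf (the buffer layout is invented for illustration):

        #include <arpa/inet.h>   /* htonl() */
        #include <stdint.h>
        #include <string.h>

        /* Marshal one 32-bit argument into network byte order at buf+off;
           a real stub does this for every parameter, then ships the buffer. */
        static size_t put_u32(uint8_t *buf, size_t off, uint32_t value)
        {
            uint32_t wire = htonl(value);
            memcpy(buf + off, &wire, sizeof wire);
            return off + sizeof wire;
        }

    Every such step is a place where the *mechanism* -- not the callee --
    can fail (truncated buffer, dead transport, ...), which is exactly what
    gets forgotten.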

    It's sort of like KNOWING that you can have arithmetic overflow
    in any operation -- yet never DELIBERATELY taking the steps to
    verify that it "can't happen" in each particular instance. You
    sort of assume it's not going to be a problem unless you're
    pushing the limits of a data type.

    Then, you're *surprised* when it is!
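
    For instance, the check that almost nobody bothers to write -- plain
    C99, no project-specific assumptions:

        #include <limits.h>

        /* Returns 0 and stores a+b in *sum, or -1 if the addition would
           overflow -- the case we all "know" can happen and rarely test. */
        int add_checked(int a, int b, int *sum)
        {
            if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
                return -1;
            *sum = a + b;
            return 0;
        }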

    At least /some/ of the "blurring" you speak of goes away in languages
    like Scala where every out-of-process call - e.g., I/O, invoking
    functions from a shared library, messaging another process, etc. - all
    are treated AS IF RPC to a /physically/ remote server, regardless of
    whether that actually is true. If an unhandled error occurs for any out-of-process call, the process is terminated.
    [Scala is single threaded, but its 'processes' are very lightweight,
    more like 'threads' in other systems.]

    The problem is still in the mindsets of the developers. Until they actively embrace the possibility of these RMIs failing, they won't even begin to consider
    how to address that possibility.

    [That's why I suggested the "automatically insert a template after each
    invocation" -- to remind them of each of the possible outcomes in a
    very hard to ignore way!]

    I work with a lot of REALLY talented people so I am surprised that this
    is such an issue. They understand the mechanism. They have to understand that wires can break, devices can become disconnected, etc. AND that this
    can happen AT ANY TIME (not just "prior to POST"). So, why no grok?

    I didn't understand it until I started working seriously with DBMS.

    A lot of the code I write now looks and feels transactional,
    regardless of what it's actually doing. I try to make sure there are no
    side effects [other than failure] and I routinely (ab)use exceptions
    and Duff's device to back out of complex situations, release
    resources, undo (where necessary) changes to data structures, etc.
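
    In C (no exceptions) the same "back out in reverse order" idea can be
    sketched with a deliberate switch fall-through -- the (ab)use mentioned
    above; the resources here are just placeholders:

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        enum progress { NOTHING, LOCKED, BUFFERED, OPENED };

        /* Undo whatever has been done so far; the deliberate fall-through
           runs the teardown steps in reverse order of acquisition. */
        static void unwind(enum progress got, FILE *f, void *buf,
                           pthread_mutex_t *m)
        {
            switch (got) {
            case OPENED:   fclose(f);               /* FALLTHROUGH */
            case BUFFERED: free(buf);               /* FALLTHROUGH */
            case LOCKED:   pthread_mutex_unlock(m); /* FALLTHROUGH */
            case NOTHING:  break;
            }
        }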

    My "archive" database code is like that. But, as the basic
    algorithms are relatively simple and obvious, even handling a lot of
    potential "error/fault" cases isn't an intimidating task.

    And, by design, it can be restarted and deduce its most recent
    state MONTHS later with no real impact on the results because the
    "state" of the archive resides *in* the archive -- where it can
    be reexamined when the code next runs.

    I won't hesitate to wrap a raw service API, and create different
    versions of it that handle things differently. Of course this results
    in some (usually moderate) code growth - which is less a problem in
    more powerful systems. I have written for small devices in the past,
    but I don't do that anymore.

    I do that in the server-side stubs as a particular handle may
    only support some subset of the methods available on the referenced
    object. E.g., one reference might allow "write" while another
    only allows "read", etc. Attempting to invoke a method that isn't
    supported for your particular handle instance has to throw an
    error (or exception).

    As all of the permissions are bound *in* the server, it makes sense for
    it to enforce them (because *it* created the handle, originally -- on
    the request of something with "authority")
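
    A minimal sketch of that check as it might appear in a server-side
    stub; the handle layout, permission bits and do_write() are
    hypothetical, not the actual generated code:

        #include <errno.h>
        #include <stddef.h>
        #include <stdint.h>

        #define PERM_READ   0x1u
        #define PERM_WRITE  0x2u

        struct handle {            /* created by the server, on the request */
            uint32_t perms;        /* of something with "authority"         */
            void    *object;
        };

        int do_write(void *obj, const void *data, size_t len);  /* hypothetical */

        /* Dispatch guard: refuse methods this particular handle was
           never granted. */
        int stub_write(struct handle *h, const void *data, size_t len)
        {
            if (!(h->perms & PERM_WRITE))
                return -EPERM;     /* or raise the exception, per policy */
            return do_write(h->object, data, len);
        }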

    My devices, now, are still largely small -- in complexity. But, the
    apps that I layer onto them are getting exceedingly complex. <shrug>
    MIPS are cheap. E.g., I "check for (snail)mail" by scanning the video
    feed from the camera that "sees" the mailbox in the portion of the
    day when the mail is likely delivered and watch for "something big"
    to be stopped adjacent to the mailbox for a period of several seconds.

    There's no "mail detected" hardware, per se; it's just part of the
    overall surveillance hardware. Ditto "someone is walking up the front
    walkway (likely to come to the front door and 'ring' the bell!)".

    [OTOH, C complains that I fail to see the crumbs I leave on the counter
    each time I prepare her biscotti: "How can you not SEE them??" <shrug>]

    When the puppy pees on the carpet, everyone develops indoor blindness.
    <grin>

    I'm not trying to be "lazy" or avoid some task. E.g., I will have already spent a few hours baking. Plus a good deal of time washing all of the utensils, pots, pans, etc. Even the cooling racks. "Cookies" wrapped and stored away.

    But, wiping down the counters is just a blind spot, for me.

    [I rationalize this by saying "it's the LEAST she can do..." :>
    And, she is far too grateful for the baked goods to make much
    hay of this oversight!]

    [[I am starting another batch as soon as I finish this email :< ]]

    I had a similar problem trying to get them used to having two different
    notions of "time": system time and wall time. And, the fact that they
    are entirely different schemes with different rules governing their
    behavior. I.e., if you want to do something "in an hour", then say
    "in an hour" and use the contiguous system time scheme. OTOH, if you
    want to do something at 9PM, then use the wall time. Don't expect any
    correlation between the two!

    Once I wrote a small(ish) parser, in Scheme, for AT-like time specs.
    You could say things like "(AT now + 40 minutes)" or "(AT 9pm
    tomorrow)", etc. and it would figure out what that meant wrt the
    system clock. The result was the requested epoch expressed in seconds.
    That epoch could be used directly to set an absolute timer, or to poll expiration with a simple 'now() >= epoch'.

    It was implemented as a compile time macro that generated and spliced
    in code at the call site to compute the desired answer at runtime.

    The parser understood 12- and 24-hour clocks, names of days and months, expressions like 'next tuesday', etc. There were a few small
    auxiliary functions required to be linked into the executable to
    figure out, e.g., what day it was, on-the-fly, but many typical uses
    just reduced to something like '(+ (current_time) <computed_offset>)'.
    The parser itself was never in the executable.

    Simplified a lot of complicated clock handling.
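
    The same resolve-to-an-epoch idea, sketched in C with the standard
    library instead of a Scheme macro (no parsing here -- the spec is
    assumed to already be broken down into fields):

        #include <time.h>

        /* Turn "9pm tomorrow" into an absolute epoch, given today's date.
           The caller can then poll 'time(NULL) >= when' or arm a timer. */
        time_t nine_pm_tomorrow(void)
        {
            time_t    now = time(NULL);
            struct tm tm  = *localtime(&now);

            tm.tm_mday += 1;        /* tomorrow (mktime normalizes)   */
            tm.tm_hour  = 21;       /* 9pm, 24-hour clock             */
            tm.tm_min   = 0;
            tm.tm_sec   = 0;
            tm.tm_isdst = -1;       /* let the library decide on DST  */

            return mktime(&tm);
        }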

    This is a more fundamental "misunderstanding".

    I have two time services. The "system time" is very precise and
    synchronized to a fine degree between nodes. So, a process can
    "ask" a process on another node "what time is it" and get an answer
    that (within transport latencies) is identical to the answer
    he got when he asked the system time service on the current node.
    I.e., you can measure the transport/processing delay.

    So, if a process on node A says it did something at Ta and a process
    on node B says it did something at Tb, I can order those events even
    for very small differences between Ta and Tb.

    But, it bears no relationship to "wall time".

    However, the RATE at which system time progresses (monotonically increasing) mimics the rate at which "the earth rotates on its axis" -- which SHOULD be similar to wall time.

    So, if you want to do something "in 5 minutes", then you set a timer based on the system time service. When 300.000000000 seconds have elapsed, that event will be signaled.

    OTOH, if you want to do something at 9:05 -- assuming it is 9:00 now -- you
    set THAT timer based on the wall time. The guarantee it gives is that
    it will trigger at or after "9:05"... regardless of how many seconds elapse between now and then!

    So, if something changes the current wall time, the "in 5 minutes" timer
    will not be affected by that change; it will still wait the full 300 seconds. OTOH, the timer set for 9:05 will expire *at* 9:05. If the *current* notion
    of wall time claims that it is now 7:15, then you've got a long wait ahead!

    On smaller systems, the two ideas of time are often closely intertwined;
    the system tick (jiffy) effectively drives the time-of-day clock. And,
    "event times" might be bound at time of syscall *or* resolved late.
    So, if the system time changes (can actually go backwards in some poorly designed systems!), your notion of "the present time" -- and, with it,
    your expectations of FUTURE times -- changes.

    Again, it should be a simple distinction to get straight in your
    head. When you're dealing with times that the rest of the world
    uses, use the wall time. When you're dealing with relative times,
    use the system time. And, be prepared for there to be discontinuities
    between the two!
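
    For comparison, on a POSIX-ish box the same split shows up as the two
    standard clock IDs -- a small sketch of the analogous calls (not the
    time services described here):

        #include <time.h>

        /* Relative interval: the monotonic clock, immune to wall-clock
           changes -- the "system time" role. */
        double elapsed_since(const struct timespec *t0)
        {
            struct timespec t1;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) / 1e9;
        }

        /* Time-of-day event: the realtime clock, which can be stepped
           underneath you -- the "wall time" role. */
        int is_due(time_t scheduled_epoch)
        {
            struct timespec wall;
            clock_gettime(CLOCK_REALTIME, &wall);
            return wall.tv_sec >= scheduled_epoch;
        }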

    E.g., if you "notice" a 9:00 broadcast is just starting, you might decide
    to skip over the first two minutes (known to be commercials) by setting
    an event for 9:02. A smarter approach (as you KNOW that the broadcast
    is just starting -- regardless of what the wall time claims) is to
    set an event for "2 minutes from now".

    It may seem trivial but if you are allowing something to interfere with
    your notion of "now", then you have to be prepared when that changes
    outside of your control.

    [I have an "atomic" clock that was off by 14 hours. WTF??? When
    your day-night schedule is as freewheeling as mine, it makes a
    difference if the clock tells you a time that suggests the sun is
    *rising* when, in fact, it is SETTING! <frown>]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Sun May 22 06:34:06 2022
    On Sat, 21 May 2022 05:03:57 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/20/2022 6:59 PM, George Neuner wrote:
    Now, we're looking at dynamically distributed applications where
    the code runs "wherever". And, (in my case) can move or "disappear" >>>from one moment to the next.

    I think folks just aren't used to considering "every" function (method)
    invocation as a potential source for "mechanism failure". They don't
    recognize a ftn invocation as potentially suspect.

    Multiuser databases have been that way since ~1960s: you have to code
    with the expectation that /anything/ you try to do in the database may
    fail ... not necessarily due to error, but simply because another
    concurrent transaction is holding some resource you need. Your
    transaction may be arbitrarily delayed, or even terminated: in the
    case of deadlock, participating transactions are killed one by one
    until /some/ one of them [not necessarily yours] is able to complete.
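
    In practice that means client code wears a retry loop around every
    transaction -- roughly as below, with begin()/commit()/rollback()/
    backoff() and MAX_RETRIES standing in for whatever the real DB API and
    policy provide:

        #define MAX_RETRIES 5
        void begin(void);      int  commit(void);       /* hypothetical DB API */
        void rollback(void);   void backoff(int attempt);

        /* The transaction may be killed as a deadlock victim at any point,
           so be prepared to redo the whole thing. */
        int run_transaction(int (*work)(void))
        {
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                begin();
                if (work() == 0 && commit() == 0)
                    return 0;           /* made it through               */
                rollback();             /* victim/conflict: undo, retry  */
                backoff(attempt);       /* e.g., sleep with jitter       */
            }
            return -1;                  /* give up; report upstream      */
        }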

    Yes. The "code" (SQL) isn't really executing on the client. What
    you have, in effect, is an RPC; the client is telling the DBMS what to do
    and then waiting on its results. Inherent in that is the fact that
    there is a disconnect between the request and execution -- a *mechanism*
    that can fail.

    But, in my (limited) experience, DB apps tend to be relatively short
    and easily decomposed into transactions. You "feel" like you've accomplished some portion of your goal after each interaction with the DB.

    <grin> You'll feel different when a single SQL "request" is 50+ lines
    long, contains multiple CTEs and sub-transactions, and takes minutes
    to process.

    But you're correct ... the majority of applications that use databases
    really need only structured storage and don't use much of the power of
    the DBMS.


    By contrast, "procedural" apps tend to have finer-grained actions; you
    don't get the feeling that you've made "definable" progress until you've largely met your goal.

    E.g., you'd have to sum N values and then divide the sum by N to get
    an average. If any of those "adds" was interrupted, you'd not feel
    like you'd got anything done. Likewise, if you did the division
    to get the average, you'd likely still be looking to do something MORE
    with that figure.

    The DBMS doesn't have an "out" when it has summed the first M (M<N) values; it sums them and forms the average BEFORE it can abend. Or, shits the bed completely. There's no "half done" state.

    Not exactly.

    Your particular example isn't possible, but other things are -
    including having values seem to appear or disappear when they are
    examined at different points within your transaction.

    There absolutely IS the notion of partial completion when you use
    inner (ie. sub-) transactions, which can succeed and fail
    independently of each other and of the outer transaction(s) in which
    they are nested. Differences in isolation can permit side effects from
    other ongoing transactions to be visible.


    Yes. And, that recovery can be complicated, esp if the operations up
    to this point have had side effects, etc. How do you "unring the bell"?

    Use white noise to generate an "anti-ring".

    Seriously, though you are correct - DBMS only manages storage, and by
    design only changes storage bits at well defined times. That doesn't
    apply to objects in other realms, such as the bit that launches the
    ICBM ...


    So-called 'edge' computing largely is based on distributed tuple-space models specifically /because/ they are (or can be) self-organizing and are temporally decoupled: individual devices can come and go at will,
    but the state of ongoing computations is maintained in the fabric.

    But (in the embedded/RT system world) they are still "devices" with
    specific functionalities. We're not (yet) accustomed to treating
    "processing" as a resource that can be dispatched as needed. There
    are no mechanisms where you can *request* more processing (beyond
    creating another *process* and hoping <something> recognizes that
    it can co-execute elsewhere)

    The idea doesn't preclude having specialized nodes ... the idea is

    I'm arguing for the case of treating each node as "specialized + generic"
    and making the generic portion available for other uses that aren't applicable to the "specialized" nature of the node (hardware).

    Your doorbell sits in an idiot loop waiting to "do something" -- instead
    of spending that "idle time" working on something *else* so the "device"
    that would traditionally be charged with doing that something else
    can get by with less resources on-board.

    [I use cameras galore. Imagining feeding all that video to a single
    "PC" would require me to keep looking for bigger and faster PCs!]

    And if frames from the camera are uploaded into a cloud queue, any
    device able to process them could look there for new work. And store
    its results into a different cloud queue for the next step(s). Faster
    and/or more often 'idle' CPUs will do more work.

    Pipelines can be 'logical' as well as 'physical': opportunistically
    processed data queues qualify as 'pipeline' stages.


    You often seem to get hung up on specific examples and fail to see how
    the idea(s) can be applied more generally.


    simply that if a node crashes, the task state [for some approximation]
    is preserved "in the cloud" and so can be restored if the same node
    returns, or the task can be assumed by another node (if possible).

    It often requires moving code as well as data, and programs need to be
    written specifically to regularly checkpoint / save state to the
    cloud, and to be able to resume from a given checkpoint.

    TS models produce an implicit 'sequence' checkpoint with every datum
    tuple uploaded into the cloud. In many cases that sequencing is all
    that's needed by external processes to accomplish the goal.

    Explicit checkpoint is required to resume only when processing is so
    time consuming that you /expect/ the node may fail (or be reassigned)
    before completing work on its current 'morsel' of input. To avoid
    REDOing lots of work - e.g., by starting over - it makes more sense to periodically checkpoint your progress.
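
    i.e., something as simple as the loop below, where resume_index(),
    save_checkpoint() and process() stand in for whatever the fabric and
    the application actually provide:

        #include <stddef.h>

        #define CHECKPOINT_EVERY 100
        struct item;                           /* opaque work unit           */
        size_t resume_index(void);             /* hypothetical: fabric / app */
        void   process(struct item *);
        void   save_checkpoint(size_t next);

        /* Long-running work on one 'morsel': checkpoint every so often so
           a restart redoes at most CHECKPOINT_EVERY items, not the job. */
        void process_morsel(struct item **items, size_t n)
        {
            for (size_t i = resume_index(); i < n; i++) {
                process(items[i]);
                if ((i + 1) % CHECKPOINT_EVERY == 0)
                    save_checkpoint(i + 1);    /* "resume at i+1" */
            }
            save_checkpoint(n);                /* morsel complete */
        }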

    Different meta levels.


    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.



    Yes. For me, all memory is wrapped in "memory objects". Each has particular attributes (and policies/behaviors), depending on its intended use.

    E.g., the TEXT resides in an R/O object ...
    The DATA resides in an R/W object ...

    I leverage my ability to "migrate" a task (task is resource
    container) to *pause* the task and capture a snapshot of each
    memory object (some may not need to be captured if they are
    copies of identical objects elsewhere in the system) AS IF it
    was going to be migrated.

    But, instead of migrating the task, I simply let it resume, in place.

    The problem with this is watching for side-effects that happen
    between snapshots. I can hook all of the "handles" out of the
    task -- but, no way I can know what each of those "external
    objects" might be doing.

    Or necessarily be able to reconnect the plumbing.


    OTOH, if I know that no external references have taken place since the
    last "snapshot", then I can safely restart the task from the last
    snapshot.

    It is great for applications that are well suited to checkpointing,
    WITHOUT requiring the application to explicitly checkpoint itself.

    The point I was making above is that TS models implicitly checkpoint
    when they upload data into the cloud. If that data contains explicit sequencing, then it can be an explicit checkpoint as well.

    Obviously this depends on how you write the program and the
    granularity of the data. A program like your AI that needs to
    save/restore a whole web of inferences is very different from one that
    when idle grabs a few frames of video and transcodes them to MPG.


    The "tuple-space" aspect specifically is to coordinate efforts by
    multiple nodes without imposing any particular structure or
    communication pattern on participating nodes ... with appropriate TS
    support many different communication patterns can be accommodated
    simultaneously.

    :

    For many programs, checkpoint data will be much more compact than a
    snapshot of the running process, so it makes more sense to design
    programs to be resumed - particularly if you can arrange that reset of
    a faulting node doesn't eliminate the program, so code doesn't have to
    be downloaded as often (or at all).

    Yes, but that requires more skill on the part of the developer.
    And, makes it more challenging for him to test ("What if your
    app dies *here*? Have you checkpointed the RIGHT things to
    be able to recover? And, what about *here*??")

    Only for resuming non-sequenced work internal to the node. Whether
    you need to do this depends on the complexity of the program.

    Like I said, if the work is simple enough to just do over, the
    implicit checkpoint of a datum being in the 'input' queue may be
    sufficient.


    There are TS models that actively support the notion of 'checking out'
    work, tracking who is doing what, timing-out unfinished work,
    restoring 'checked-out' (removed) data, ignoring results from
    timed-out workers (should the result show up eventually), etc.

    The TS server is more complicated, but the clients don't have to be.
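
    A sketch of what that client side can reduce to, assuming a
    hypothetical ts_take()/ts_put()/ts_ack() API in which the server owns
    the leases, timeouts and re-queuing of abandoned work:

        #define LEASE_SECONDS 30
        struct tuple;                                             /* opaque */
        struct tuple *ts_take(const char *queue, int lease_secs); /* hypothetical API */
        void          ts_put(const char *queue, struct tuple *t);
        void          ts_ack(struct tuple *t);
        struct tuple *do_work(struct tuple *t);                   /* hypothetical */

        /* Worker loop against a lease-based tuple space. If this node
           dies mid-task, the server's lease timer re-offers the work. */
        void worker_loop(void)
        {
            struct tuple *t;

            while ((t = ts_take("work-queue", LEASE_SECONDS)) != NULL) {
                struct tuple *result = do_work(t);
                ts_put("results", result);           /* publish outcome  */
                ts_ack(t);                           /* lease satisfied  */
            }
        }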


    I'm particularly focused on user-level apps (scripts) where I can
    build hooks into the primitives that the user employs to effectively
    keep track of what they've previously been asked to do -- keeping in
    mind that these will tend to be very high-levels of abstraction
    (from the user's perspective).

    E.g.,
    At 5:30PM record localnews
    At 6:00PM record nationalnews
    remove_commercials(localnews)
    remove_commercials(nationalnews)
    when restarted, each primitive can look at the current time -- and state
    of the "record" processes -- to sort out where they are in the sequence.
    And, the presence/absence of the "commercial-removed" results. (obviously you can't record a broadcast that has already ended so why even try!)

    Note that the above can be a KB of "code" + "state" -- because
    all of the heavy lifting is (was?) done in other processes.

    Even if the checkpoint data set is enormous, it often can be saved
    incrementally. You then have to weigh the cost of resuming, which
    requires the whole data set be downloaded.

    Well, recording is a sequential, single node process. Obviously
    different nodes can record different things simultaneously.

    But - depending on how you identify content vs junk - removing the
    commercials could be done in parallel by a gang, each of which needs
    only to look at a few video frames at a time.



    OTOH, if you want to do something at 9:05 -- assuming it is 9:00 now -- you set THAT timer based on the wall time. The guarantee it gives is that
    it will trigger at or after "9:05"... regardless of how many seconds elapse between now and then!

    So, if something changes the current wall time, the "in 5 minutes" timer
    will not be affected by that change; it will still wait the full 300 seconds. OTOH, the timer set for 9:05 will expire *at* 9:05. If the *current* notion of wall time claims that it is now 7:15, then you've got a long wait ahead!

    On smaller systems, the two ideas of time are often closely intertwined;

    In large systems too!

    If time goes backwards, all bets are off. Most systems are designed
    so that can't happen unless a privileged user intervenes. System time generally is kept in UTC and 'display' time is computed wrt system
    time when necessary.


    But your notion of 'wall time' seems unusual: typically it refers to
    a notion of time INDEPENDENT of the computer - ie. the clock on the
    wall, the watch on my wrist, etc. - not to whatever the computer may
    /display/ as the time.

    Ie. if you turn back the wall clock, the computer doesn't notice. If
    you turn back the computer's system clock, then you are an
    administrator and you get what you deserve.


    There are a number of monotonic time conventions, but mostly you just
    work in UTC if you want to ignore local time conventions like
    'daylight saving' that might result in time moving backwards. Network
    time-set protocols never move the local clock backwards: they adjust
    the length of the local clock tick such that going forward the local
    time converges with the external time at some (hopefully near) point
    in the future.
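
    On Linux/BSD the user-visible knob for that is adjtime(), which slews
    the clock toward the correction instead of stepping it -- a minimal
    sketch (privilege and error handling glossed over):

        #include <sys/time.h>

        /* Absorb a 250 ms correction gradually: the kernel runs the clock
           slightly fast (or slow) until the offset is gone, so time never
           jumps and never runs backwards. */
        int slew_quarter_second(void)
        {
            struct timeval delta = { .tv_sec = 0, .tv_usec = 250000 };
            return adjtime(&delta, NULL);
        }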

    You still might encounter leap-seconds every so often, but [so far]
    they have only gone forward so as yet they haven't caused problems
    with computed delays. Not guaranteed though.


    My AT parser produced results that depended on current system time to calculate, but the results were fixed points in UTC time wrt the 1970
    Unix epoch. The computed point might be 300 seconds or might be
    3,000,000 seconds from time of the parse - but it didn't matter so
    long as nobody F_d with the system clock.


    the system tick (jiffy) effectively drives the time-of-day clock. And, "event times" might be bound at time of syscall *or* resolved late.
    So, if the system time changes (can actually go backwards in some poorly designed systems!), your notion of "the present time" -- and, with it,
    your expectations of FUTURE times -- changes.

    Again, it should be a simple distinction to get straight in your
    head. When you're dealing with times that the rest of the world
    uses, use the wall time. When you're dealing with relative times,
    use the system time. And, be prepared for there to be discontinuities between the two!

    Yes, but if you have to guard against (for lack of a better term)
    'timebase' changes, then your only recourse is to use absolute
    countdown.

    The problem is, the current state of a countdown has to be maintained continuously and it can't easily be used in a 'now() >= epoch' polling
    software timer. That makes it very inconvenient for some uses.


    And there are times when you really do want the delay to reflect the
    new clock setting: ie. the evening news comes on at 6pm regardless of
    daylight saving, so the showtime moves (in opposition) with the change
    in the clock.

    Either countdown or fixed epoch can handle this if computed
    appropriately (i.e. daily with reference to calendar) AND the computer
    remains online to maintain the countdown for the duration. If the
    computer may be offline during the delay period, then only fixed epoch
    will work.


    It may seem trivial but if you are allowing something to interfere with
    your notion of "now", then you have to be prepared when that changes
    outside of your control.

    [I have an "atomic" clock that was off by 14 hours. WTF??? When
    your day-night schedule is as freewheeling as mine, it makes a
    difference if the clock tells you a time that suggests the sun is
    *rising* when, in fact, it is SETTING! <frown>]

    WTF indeed. The broadcast is in UTC or GMT (depending), so if your
    clock was off it had to be because its offset was wrong.

    I say 'offset' rather than 'timezone' because some "atomic" clocks
    have no setup other than what is the local time. Internally, the
    mechanism just notes the difference between local and broadcast time
    during setup, and if the differential becomes wrong it adjusts the
    local display to fix it.

    I have an analog electro-mechanical (hands on dial) "atomic" clock
    that does this. Position the hands so they reflect the current local
    time and push a button on the mechanism. From that point the time
    broadcast keeps it correct [at least until the batteries die].

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to All on Sun May 22 14:12:39 2022
    [Some aggressive eliding as we're getting pretty far afield of
    "exception vs. error code"]

    Your particular example isn't possible, but other things are -
    including having values seem to appear or disappear when they are
    examined at different points within your transaction.

    But the point of the transaction is to lock these changes
    (or recognize their occurrence) so this "ambiguity" can't
    manifest. (?)

    The "client" either sees the result of entire transaction or none of it.

    There absolutely IS the notion of partial completion when you use
    inner (ie. sub-) transactions, which can succeed and fail
    independently of each other and of the outer transaction(s) in which
    they are nested. Differences in isolation can permit side effects from
    other ongoing transactions to be visible.

    But you don't expose those partial results (?) How would the client
    know that he's seeing partial results?

    So-called 'edge' computing largely is based on distributed tuple-space models specifically /because/ they are (or can be) self-organizing and are temporally decoupled: individual devices can come and go at will, but the state of ongoing computations is maintained in the fabric.

    But (in the embedded/RT system world) they are still "devices" with
    specific functionalities. We're not (yet) accustomed to treating
    "processing" as a resource that can be dispatched as needed. There
    are no mechanisms where you can *request* more processing (beyond
    creating another *process* and hoping <something> recognizes that
    it can co-execute elsewhere)

    The idea doesn't preclude having specialized nodes ... the idea is

    I'm arguing for the case of treating each node as "specialized + generic"
    and making the generic portion available for other uses that aren't
    applicable to the "specialized" nature of the node (hardware).

    Your doorbell sits in an idiot loop waiting to "do something" -- instead
    of spending that "idle time" working on something *else* so the "device"
    that would traditionally be charged with doing that something else
    can get by with less resources on-board.

    [I use cameras galore. Imagining feeding all that video to a single
    "PC" would require me to keep looking for bigger and faster PCs!]

    And if frames from the camera are uploaded into a cloud queue, any
    device able to process them could look there for new work. And store
    its results into a different cloud queue for the next step(s). Faster
    and/or more often 'idle' CPUs will do more work.

    That means every CPU must know how to recognize that sort of "work"
    and be able to handle it. Each of those nodes then bears a cost
    even if it doesn't actually end up contributing to the result.

    It also makes the "cloud" a shared resource akin to the "main computer".
    What do you do when it isn't available?

    If the current resource set is insufficient for the current workload,
    then (by definition) something has to be shed. My "workload manager"
    handles that -- deciding that there *is* a resource shortage (by looking
    at how many deadlines are being missed/aborted) as well as sorting out
    what the likeliest candidates to "off-migrate" would be.

    Similarly, deciding when there is an abundance of resources that
    could be offered to other nodes.

    So, if a node is powered up *solely* for its compute resources
    (or, its unique hardware-related tasks have been satisfied) AND
    it discovers another node(s) has enough resources to address
    its needs, it can push its workload off to that/those node(s) and
    then power itself down.

    Each node effectively implements part of a *distributed* cloud
    "service" by holding onto resources as they are being used and
    facilitating their distribution when there are "greener pastures"
    available.

    But, unlike a "physical" cloud service, they accommodate the
    possibility of "no better space" by keeping the resources
    (and loads) that already reside on themselves until such a place
    can be found -- or created (i.e., bring more compute resources
    on-line, on-demand). They don't have the option of "parking"
    resources elsewhere, even as a transient measure.

    When a "cloud service" is unavailable, you have to have a backup
    policy in place as to how you'll deal with these overloads.

    Pipelines can be 'logical' as well as 'physical': opportunistically
    processed data queues qualify as 'pipeline' stages.

    You often seem to get hung up on specific examples and fail to see how
    the idea(s) can be applied more generally.

    I don't have a "general" system. :> And, suspect future (distributed) embedded systems will shy away from the notion of any centralized "controller node" for the obvious dependencies that that imposes on the solution.

    Sooner or later, that node will suffer from scale. Or, reliability.
    (one of the initial constraints I put on the system was NOT to rely on
    any "outside" service; why not use a DBMS "in the cloud"? :> )

    It's only a matter of time before we discover some egregious data
    breach or system unavailability related to cloud services. You're
    reliant on that service keeping itself available for YOUR operation
    AND the fabric to access it being operational. Two big dependencies
    that you have no control over (beyond paying your usage fees)

    simply that if a node crashes, the task state [for some approximation]
    is preserved "in the cloud" and so can be restored if the same node
    returns, or the task can be assumed by another node (if possible).

    It often requires moving code as well as data, and programs need to be
    written specifically to regularly checkpoint / save state to the
    cloud, and to be able to resume from a given checkpoint.

    TS models produce an implicit 'sequence' checkpoint with every datum
    tuple uploaded into the cloud. In many cases that sequencing is all
    that's needed by external processes to accomplish the goal.

    I can get that result just by letting a process migrate itself to its
    current node -- *if* it wants to "remember" that it can resume cleanly
    from that point (but not any point beyond that unless side-effects
    are eliminated). The first step in the migration effectively creates
    the process snapshot.

    There is overhead to taking that snapshot -- or pushing those
    "intermediate results" to the cloud. You have to have designed your *application* with that in mind.

    Just like an application can choose to push "temporary data" into
    the DBMS, in my world. And, incur those costs at run-time.

    The more interesting problem is seeing what you can do "from the
    outside" without the involvement of the application.

    E.g., if an application had to take special measures in order to
    be migrate-able, then I suspect most applications wouldn't be!
    And, as a result, the system wouldn't have that flexibility.

    OTOH, if the rules laid out for the environment allow me to wedge
    that type of service *under* the applications, then there's no
    cost-adder for the developers.

    Explicit checkpoint is required to resume only when processing is so
    time consuming that you /expect/ the node may fail (or be reassigned)
    before completing work on its current 'morsel' of input. To avoid
    REDOing lots of work - e.g., by starting over - it makes more sense to periodically checkpoint your progress.

    Different meta levels.

    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.

    But that means the post processing has to happen WHILE the video is
    being captured. I.e., you need "record" and "record-and-commercial-detect" primitives. Or, to expose the internals of the "record" operation.

    Similarly, you could retrain the speech models WHILE you are listening
    to a phone call. But, that means you need the horsepower to do so
    AT THAT TIME, instead of just capturing the audio ("record") and
    doing the retraining "when convenient" ("retrain").

    I've settled on simpler primitives that can be applied in more varied situations. E.g., you will want to "record" the video when someone
    wanders onto your property. But, there won't be any "commercials"
    to detect in that stream.

    Trying to make "primitives" that handle each possible combination of
    actions seems like a recipe for disaster; you discover some "issue"
    and handle it in one implementation and imperfectly (or not at all)
    handle it in the other. "Why does it work if I do 'A then B' but
    'B while A' chokes?"

    Yes. For me, all memory is wrapped in "memory objects". Each has particular
    attributes (and policies/behaviors), depending on its intended use.

    E.g., the TEXT resides in an R/O object ...
    The DATA resides in an R/W object ...

    I leverage my ability to "migrate" a task (task is resource
    container) to *pause* the task and capture a snapshot of each
    memory object (some may not need to be captured if they are
    copies of identical objects elsewhere in the system) AS IF it
    was going to be migrated.

    But, instead of migrating the task, I simply let it resume, in place.

    The problem with this is watching for side-effects that happen
    between snapshots. I can hook all of the "handles" out of the
    task -- but, no way I can know what each of those "external
    objects" might be doing.

    Or necessarily be able to reconnect the plumbing.

    If the endpoint objects still exist, the plumbing remains intact
    (even if the endpoints have moved "physically").

    If an endpoint is gone, then the (all!) reference is notified and
    has to do its own cleanup. But, that's the case even in "normal
    operation" -- the "exceptions" we've been talking about.

    OTOH, if I know that no external references have taken place since the
    last "snapshot", then I can safely restart the task from the last
    snapshot.

    It is great for applications that are well suited to checkpointing,
    WITHOUT requiring the application to explicitly checkpoint itself.

    The point I was making above is that TS models implicitly checkpoint
    when they upload data into the cloud. If that data contains explicit sequencing, then it can be an explicit checkpoint as well.

    Obviously this depends on how you write the program and the
    granularity of the data. A program like your AI that needs to
    save/restore a whole web of inferences is very different from one that
    when idle grabs a few frames of video and transcodes them to MPG.

    Remember, we're (I'm) trying to address something as "simple" as
    "exceptions vs error codes", here. Expecting a developer to write
    code with the notion of partial recovery in mind goes far beyond
    that!

    He can *choose* to structure his application/object/service in such
    a way that makes that happen. Or not.

    E.g., the archive DB treats each "file processed/examined" as a
    single event. Kill off the process before a file is completely
    processed and it will look like NO work was done for that file.
    Kill off the DB before it can be updated to REMEMBER the work
    that was done and the same is true.

    So, I can SIGKILL the process and restart it at any time, knowing
    that it will eventually sort out where it was when it died
    (it may have a different workload to process *now* but that's
    just a consequence of calendar time elapsing (e.g., "list files
    that haven't been verified in the past N hours"))

    I think it's hard to *generally* design solutions that can be
    interrupted and partially restored. You have to make a deliberate
    effort to remember what you've done and what you were doing.
    We seem to have developed the habit/practice of not "formalizing"
    intermediate results as we expect them to be transitory.

    [E.g., if I do an RMI to a node that uses a different endianness,
    the application doesn't address that issue by representing the
    data in some endian-neutral manner. Instead, the client-/server-
    -side stubs handle that without the caller knowing it is happening.]

    *Then*, you need some assurance that you *will* be restarted; otherwise,
    the progress that you've already made may no longer be useful.

    I don't, for example, universally use my checkpoint-via-OS hack
    because it will cause more grief than it will save. *But*, if
    a developer knows it is available (as a service) and the constraints
    of how it works, he can offer a hint (at install time) to suggest
    his app/service be installed with that feature enabled *instead* of
    having to explicitly code for resumption.

    Again, my goal is always to make it more enticing for you to do things
    "my way" than to try to invent your own mechanism -- yet not
    forcing you to comply!

    The "tuple-space" aspect specifically is to coordinate efforts by
    multiple nodes without imposing any particular structure or
    communication pattern on participating nodes ... with appropriate TS
    support many different communication patterns can be accommodated
    simultaneously.

    :

    For many programs, checkpoint data will be much more compact than a
    snapshot of the running process, so it makes more sense to design
    programs to be resumed - particularly if you can arrange that reset of
    a faulting node doesn't eliminate the program, so code doesn't have to
    be downloaded as often (or at all).

    Yes, but that requires more skill on the part of the developer.
    And, makes it more challenging for him to test ("What if your
    app dies *here*? Have you checkpointed the RIGHT things to
    be able to recover? And, what about *here*??")

    Only for resuming non-sequenced work internal to the node. Whether
    you need to do this depends on the complexity of the program.

    Of course. But, you have to be aware of "what won't have been done"
    when you are restored from a checkpoint and ensure that you haven't
    done something that is difficult to "undo" or unsafe to redo.

    Like I said, if the work is simple enough to just do over, the
    implicit checkpoint of a datum being in the 'input' queue may be
    sufficient.

    There are TS models that actively support the notion of 'checking out'
    work, tracking who is doing what, timing-out unfinished work,
    restoring 'checked-out' (removed) data, ignoring results from
    timed-out workers (should the result show up eventually), etc.

    I limit the complexity to just tracking local load and local resources.
    If I push a job off to another node, I have no further knowledge of
    it; it may have subsequently been killed off, etc. (e.g., maybe it failed
    to meet its deadline and was aborted)

    No "one" is watching the system to coordinate actions. I'm not worried
    about "OPTIMAL load distribution" as the load is dynamic and, by the time
    the various agents recognize that there may be a better "reshuffling"
    of tasks, the task set will likely have changed.

    What I want to design against is the need to over-specify resources
    *just* for some "job" that may be infrequent or transitory. That
    leads to nodes costing more than they have to. Or, doing less than
    they *could*!

    If someTHING can notice imbalances in resources/demands and dynamically
    adjust them, then one node can act as if it has more "capability" than
    its own hardware would suggest.

    [I'm transcoding some videos for SWMBO in the other room. And, having to
    wait for that *one* workstation to finish the job. Why can't *it* ask
    for help from any of the 4 other machines presently running in the house?
    Why do *I* have to distribute the workload if I want to finish sooner?]

    The TS server is more complicated, but the clients don't have to be.

    I'm particularly focused on user-level apps (scripts) where I can
    build hooks into the primitives that the user employs to effectively
    keep track of what they've previously been asked to do -- keeping in
    mind that these will tend to be very high-levels of abstraction
    (from the user's perspective).

    E.g.,
    At 5:30PM record localnews
    At 6:00PM record nationalnews
    remove_commercials(localnews)
    remove_commercials(nationalnews)
    when restarted, each primitive can look at the current time -- and state
    of the "record" processes -- to sort out where they are in the sequence.
    And, the presence/absence of the "commercial-removed" results. (obviously you can't record a broadcast that has already ended so why even try!)

    Note that the above can be a KB of "code" + "state" -- because
    all of the heavy lifting is (was?) done in other processes.

    Even if the checkpoint data set is enormous, it often can be saved
    incrementally. You then have to weigh the cost of resuming, which
    requires the whole data set be downloaded.

    Well, recording is a sequential, single node process. Obviously
    different nodes can record different things simultaneously.

    I push frames into an object ("recorder"). It's possible that a new recorder could distribute those frames to a set of cooperating nodes. But, the intent is for the "recorder" to act as an elastic store (the "real" store may not
    have sufficient bandwidth to handle all of the instantaneous demands placed
    on it so let the recorder buffer things locally) as it moves frames onto
    the "storage medium" (another object).

    I can, conceivably, arrange for the "store object" to be a "commercial detector" but that requires the thing that interprets the script to recognize this possibility for parallelism instead of just processing the script
    as a "sequencer".

    But, I want to ensure the policy decisions aren't embedded in the implementation. E.g., if I want to preserve a "raw" version of the video
    (to guard against the case where something may have been elided that
    was NOT a commercial), then I should be able to do so.

    Or, if I want to represent the "commercial detected" version as a *script*
    that can be fed to the video player ("When you get to timestamp X, skip
    forward to timestamp Y").

    But - depending on how you identify content vs junk - removing the commercials could be done in parallel by a gang, each of which needs
    only to look at a few video frames at a time.

    OTOH, if you want to do something at 9:05 -- assuming it is 9:00 now -- you set THAT timer based on the wall time. The guarantee it gives is that
    it will trigger at or after "9:05"... regardless of how many seconds elapse between now and then!

    So, if something changes the current wall time, the "in 5 minutes" timer
    will not be affected by that change; it will still wait the full 300 seconds.
    OTOH, the timer set for 9:05 will expire *at* 9:05. If the *current* notion of wall time claims that it is now 7:15, then you've got a long wait ahead!
    On smaller systems, the two ideas of time are often closely intertwined;

    In large systems too!

    If time goes backwards, all bets are off. Most systems are designed
    so that can't happen unless a privileged user intervenes. System time generally is kept in UTC and 'display' time is computed wrt system
    time when necessary.

    But your notion of 'wall time' seems unusual: typically it refers to
    a notion of time INDEPENDENT of the computer - ie. the clock on the
    wall, the watch on my wrist, etc. - not to whatever the computer may /display/ as the time.

    The "computer" (system) has no need for the wall time, except as a
    convenient reference frame for activities that interact with the user(s).

    OTOH, it *does* need some notion of "time" ("system time") in order to
    make scheduling and resource decisions.

    E.g., I can get a performance metric from the video transcoder and
    use that to predict *when* the transcoding task will be complete.
    With this, I can decide whether or not some other task(s) will be
    able to meet its deadline(s) IF THE TRANSCODER IS CONSUMING RESOURCES
    for that interval. And, decide whether I should kill off the transcoder
    to facilitate those other tasks meeting their deadlines *or* kill off
    (not admit) the other task(s) as I know they won't meet *their*
    deadlines while the transcoder is running.

    None of those decisions require knowledge of the angular position of
    the earth on its axis.

    Ie. if you turn back the wall clock, the computer doesn't notice. If

    That's not true in all (embedded) systems. Often, the system time
    and wall time are linear functions of each other. In effect, when
    you say "do something at 9:05", the current wall time is used to
    determine how far in the future (past?) that will be. And, from this,
    compute the associated *system* time -- which is then used as the "alarm
    time". You've implicitly converted an absolute time into a relative
    time offset -- by assuming the current wall time is immutable.
    "Wall time" is an ephemeral concept as far as the system is concerned.

    So, change the wall time to 9:04 and be surprised when the event DOESN'T
    happen in 60 seconds!

    This was my point wrt having two notions of time in a system and
    two ways of referencing "it" (them?)

    In my system, if you schedule an event for "9:05", then the current
    notion of wall time is used to determine if that event should
    activate. If you perpetually kept resetting the wall clock to
    7:00, then the event would NEVER occur.

    By contrast, an event scheduled for "300 seconds from now" WILL
    happen in 300 seconds.

    you turn back the computer's system clock, then you are an
    administrator and you get what you deserve.

    But you can't alter the *system* time in my system. It marches steadily forward. The "wall time", OTOH, is a convenience that can be redefined
    at will. Anything *tied* to those references would be at the mercy
    of such a redefinition.

    So, using system time, I can tell you what the average transfer rate for
    an FTP transfer was. If I had examined the *wall* time at the start
    and end of the transfer, there's no guarantee that the resulting
    computation would be correct (cuz the wall time might have been changed
    in that period).

    There are a number of monotonic time conventions, but mostly you just
    work in UTC if you want to ignore local time conventions like
    'daylight saving' that might result in time moving backwards. Network time-set protocols never move the local clock backwards: they adjust
    the length of the local clock tick such that going forward the local
    time converges with the external time at some (hopefully near) point
    in the future.

    Even guaranteeing that time never goes backwards doesn't leave time
    as a useful metric. If you slow my "wall clock" by 10% to allow
    "real" time to catch up to it, then any measurements made with that
    "timebase" are off by 10%.

    I keep system time pretty tightly synchronized between nodes.
    So, time moves at a consistent rate across the system.

    *Wall* time, OTOH, is subject to the quality of the references
    that I have available. If I am reliant on the user to manually tell
    me the current time, then the possibility of large discontinuities
    is a real issue. If the user sets the time to 10:00 and you presently
    think it to be *12:00*, you can't slowly absorb the difference!
    The user would wonder why you were still "indicating" 12:00 despite
    his recent (re)setting of the time. The idea that you don't want to
    "move backwards" is anathema to him; of COURSE he wants you to move
    backwards because IT IS 10:00, NOT 12:00 (in his mind).

    [Time is a *huge* project because of all the related issues. You
    still need some "reference" for the timebases -- wall and system.
    And, a way to ensure they track in some reasonably consistent
    manner: 11:00 + 60*wait(60 sec) should bring you to 12:00
    even though you're using times from two different domains!]

    You still might encounter leap-seconds every so often, but [so far]
    they have only gone forward so as yet they haven't caused problems
    with computed delays. Not guaranteed though.

    Keeping all of that separate from "system time" makes system time
    a much more useful facility. A leap second doesn't turn a 5 second
    delay into a *6* second delay (if the leap second manifested within
    that window).

    And, if the wall time was ignorant of the leap second's existence,
    the only consequence would be that the "external" notion of
    "current time of day" would be off by a second. If you've set
    a task to record a broadcast at 9:00, it will actually be recorded
    at 8:59:59 (presumably, the broadcaster has accounted for the
    leap second in his notion of "now", even if you haven't). The *user*
    might complain but then the user could do something about it
    (including filing a bug report).

    My AT parser produced results that depended on current system time to calculate, but the results were fixed points in UTC time wrt the 1970
    Unix epoch. The computed point might be 300 seconds or might be
    3,000,000 seconds from time of the parse - but it didn't matter so
    long as nobody F_d with the system clock.

    I have no "epoch". Timestamps reflect the system time at which the
    event occurred. If the events had some relation to "wall time"
    ("human time"), then any discontinuities in that time frame are
    the problem of the human. Time "starts" at the factory. Your
    system's "system time" need bear no relationship to mine.

    E.g., it's 9:00. Someone comes to the door and drops off a package.
    Some time later, you (or some agent) change the wall time to
    reflect an earlier time. Someone is seen picking up at 8:52 (!).
    How do you present these events to the user? He sees his package
    "stolen" before it was delivered! (if you treat wall time as
    significant).

    But, the "code" knows that the dropoff occurred at system time X
    and the theft at X+n so it knows the proper ordering of the events,
    even if the (wall) time being displayed on the imagery is "confused".
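
    In other words, events carry both stamps and only one of them is
    trusted for ordering -- a tiny sketch of the idea:

        #include <stdint.h>
        #include <time.h>

        /* Every logged event gets both stamps: the monotonic one decides
           "which came first"; the wall one is only shown to humans. */
        struct event {
            uint64_t sys_ns;    /* system time at capture (never rewinds) */
            time_t   wall;      /* whatever the wall clock claimed then   */
        };

        int happened_before(const struct event *a, const struct event *b)
        {
            return a->sys_ns < b->sys_ns;   /* ignore the wall stamps */
        }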

    the system tick (jiffy) effectively drives the time-of-day clock. And,
    "event times" might be bound at time of syscall *or* resolved late.
    So, if the system time changes (can actually go backwards in some poorly
    designed systems!), your notion of "the present time" -- and, with it,
    your expectations of FUTURE times -- changes.

    Again, it should be a simple distinction to get straight in your
    head. When you're dealing with times that the rest of the world
    uses, use the wall time. When you're dealing with relative times,
    use the system time. And, be prepared for there to be discontinuities
    between the two!

    Yes, but if you have to guard against (for lack of a better term)
    'timebase' changes, then your only recourse is to use absolute
    countdown.

    The problem is, the current state of a countdown has to be maintained continuously and it can't easily be used in a 'now() >= epoch' polling software timer. That makes it very inconvenient for some uses.

    If you want to deal with a relative time -- either SPECIFYING one or
    measuring one -- you use the system time. Wanna know how much time
    has elapsed since a point in time?

    reference := get_system_time(...)
    ...
    elapsed_time := get_system_time(...) - reference

    Want to know how long until some "future" (human time) event?

    wait := scheduled_time - get_wall_time(...)

    Note that "wait" can be negative, even if you had THOUGHT it was a
    "future" event! And, if something has dicked with the "wall time",
    the magnitude is essentially boundless.

    OTOH, "elapsed_time" is *always* non-negative. Regardless of what
    "time" the clock on the wall claims it to be! And, always reflects
    the actual rotation of the Earth on its axis.

    And there are times when you really do want the delay to reflect the
    new clock setting: ie. the evening news comes on at 6pm regardless of daylight saving, so the showtime moves (in opposition) with the change
    in the clock.

    Then you say "record at 6:00PM" -- using the wall clock time.
    If "something" ensures that the wall time is *accurate* AND
    reflects savings/standard time changes, the recording will take
    place exactly as intended.

    Because the time in question is an EXTERNALLY IMPOSED notion of time,
    not one inherent to the system.

    [When the system boots, it has no idea "what time it is" until it can
    get a time fix from some external agency. That *can* be an RTC -- but,
    RTC batteries can die, etc. It can be manually specified -- but, that
    can be in error. <shrug> The system doesn't care. The *user* might
    care (if his shows didn't get recorded at the right times)...]

    Either countdown or fixed epoch can handle this if computed
    appropriately (i.e. daily with reference to calendar) AND the computer remains online to maintain the countdown for the duration. If the
    computer may be offline during the delay period, then only fixed epoch
    will work.

    You still require some sort of "current time" indicator/reference
    (in either timing system).

    For me, time doesn't exist when EVERYTHING is off. Anything that
    was supposed to happen during that interval obviously can't happen.
    And, nothing that has happened (in the environment) can be "noticed"
    so there's no way of ordering those observations!

    If I want some "event" to be remembered beyond an outage, then
    the time of the event has to be intentionally stored in persistent
    storage (i.e., the DBMS) and retrieved from it (and rescheduled)
    once the system restarts.

    These tend to be "human time" events (broadcast schedules, HVAC
    events, etc.). Most "system time" events are short-term and don't
    make sense spanning an outage *or* aren't particularly concerned
    with accuracy (e.g., vacuum the DB every 4 hours).

    I have a freerunning timer that is intended to track the passage of
    time during an outage (like an RTC would). It can never be set
    (reset) so, in theory, tells me what the system timer WOULD have
    been, had the system not suffered an outage.

    [The passage of time via this mechanism isn't guaranteed to be
    identical to the rate time passes on the "real" system timer.
    But, observations of it while the system is running give
    me an idea as to how fast/slow it may be so I can compute a
    one-time offset between its reported value and the deduced
    system time and use that to initialize the system time
    (knowing that it will always be greater than the system time
    at which the outage occurred)]

    [[I use a similar scheme in my digital clocks, using the AC mains
    frequency to "discipline" the XTAL oscillator so it tracks,
    long term]]

    I have several potential sources for "wall time":
    - the user (boo hiss!)
    - WWVB
    - OTA DTV
    - GPS
    each of which may/mayn't be available and has characteristics
    that I've previously observed (e.g., DTV time is sometimes
    off by an hour here, as we don't observe DST).

    I pick the best of these, as available, and use that to
    initialize the wall time. Depending on my confidence in
    the source, I may revise the *system* time FORWARD by some
    amount (but never backwards). Because the system time
    has to be well-behaved.

    It may seem trivial but if you are allowing something to interfere with
    your notion of "now", then you have to be prepared when that changes
    outside of your control.

    [I have an "atomic" clock that was off by 14 hours. WTF??? When
    your day-night schedule is as freewheeling as mine, it makes a
    difference if the clock tells you a time that suggests the sun is
    *rising* when, in fact, it is SETTING! <frown>]

    WTF indeed. The broadcast is in UTC or GMT (depending), so if your
    clock was off it had to be because its offset was wrong.

    There are only 4 offsets: Pacific, Mountain, Central and Eastern
    timezones. It's wrong half of the year due to our lack of DST
    (so, I tell it we're in California for that half year!)

    Something internally must have gotten wedged and, because it is battery
    backed, never got unwedged (I pulled the batteries out when I noticed
    this problem and let it "reset" itself).

    I say 'offset' rather than 'timezone' because some "atomic" clocks
    have no setup other than what is the local time. Internally, the
    mechanism just notes the difference between local and broadcast time
    during setup, and if the differential becomes wrong it adjusts the

    [continued in next message]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Wed May 25 02:11:54 2022
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    [Some aggressive eliding as we're getting pretty far afield of
    "exception vs. error code"]

    Best discussions always wander. <grin>


    Your particular example isn't possible, but other things are -
    including having values seem to appear or disappear when they are
    examined at different points within your transaction.

    But the point of the transaction is to lock these changes
    (or recognize their occurrence) so this "ambiguity" can't
    manifest. (?)

    Yes ... and no.

    The "client" either sees the result of entire transaction or none of it.

    Remember SQL permits 4 isolation levels. In increasing order of
    isolation:

    - read uncommitted (RU)
    - read committed (RC)
    - repeatable read (RR)
    - serializable (S)

    Think about "repeatable read" and what could happen if you don't have,
    at least, that level of isolation.

    Taking your summation example, consider:


    TA starts
    TA sums table.V to A1

    TB starts
    TB updates values in table.V
    TA sums table.V to A2
    TB rolls back
    TA sums table.V to A3

    TC starts
    TC updates values in table.V
    TA sums table.V to A4
    TC commits
    TA sums table.V to A5

    TD starts
    TD adds rows to table
    TD commits
    TA sums table.V to A6
    TE starts
    TE deletes rows from table
    TE commits
    TA sums table.V to A7

    :


    Are all the results A_ equal? Maybe. Maybe not. The answer depends
    on the isolation level ... not just of TA, but of all the other
    transactions as well.

    If TA is "read uncommitted", then A1 and A3 will be equal, A4 and A5
    will be equal (but different from A1,A3), and all the other A_ values
    will be different.

    If TA is "read committed", then A1, A2, A3 and A4 will be equal, but the
    other A_ values will be different. ["read committed" prevents TA from
    seeing uncommitted side effects from TB and TC.]

    If TA is "repeatable read" then all A_ are equal.

    If TA is "serializable" then all A_ are equal.



    Now you'll ask "what is the difference between RR and serializable?"
    The answer is that serializable is RR+ ... the '+' being beyond the
    scope of this discussion. Isolation levels are about controlling
    leakage of side effects from other concurrent processing INTO your
    transaction. You can't prevent leakage OUT in any real sense except
    by security (owner/access) partitioning of the underlying schema.

    High isolation levels often result in measurably lower performance - "repeatable read" requires that when any underlying table is first
    touched, the selected rows be locked, or be copied for local use. RR
    also changes how subsequent uses of those rows are evaluated (see
    below). Locking limits concurrency, copying uses resources (which
    also may limit concurrency).

    Often you don't need the highest isolation levels - in fact, a large
    percentage of use cases will do just fine with "read committed".
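
    [As a concrete sketch of pinning TA's isolation level -- assuming a
    PostgreSQL backend reached through libpq; the connection string and the
    table/column names ("t", "v") are placeholders, not anything from the
    system under discussion:]

        // Run the summation transaction at REPEATABLE READ so repeated SUMs
        // agree regardless of what other transactions commit meanwhile.
        #include <libpq-fe.h>
        #include <cstdio>

        int main() {
            PGconn *conn = PQconnectdb("dbname=test");
            if (PQstatus(conn) != CONNECTION_OK) {
                std::fprintf(stderr, "%s", PQerrorMessage(conn));
                return 1;
            }

            PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL REPEATABLE READ"));

            for (int i = 0; i < 3; ++i) {            // A1, A2, A3 ... in the trace
                PGresult *res = PQexec(conn, "SELECT sum(v) FROM t");
                if (PQresultStatus(res) == PGRES_TUPLES_OK)
                    std::printf("sum #%d = %s\n", i + 1, PQgetvalue(res, 0, 0));
                PQclear(res);
            }

            PQclear(PQexec(conn, "COMMIT"));
            PQfinish(conn);
            return 0;
        }

    Dropping the isolation clause (i.e., defaulting to "read committed") is
    exactly what lets the later sums drift apart as concurrent commits land.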



    Admittedly the example above is contrived: a programmer will not
    likely write code that recomputes a result many times. But under
    certain circumstances it happens anyway (see below).

    Rereading results, however, is more common: holding a cursor open on
    any table (even a computed temporary result), at low isolation levels
    may allow the client to see dynamic changes to the table as the cursor
    is moved through it. This absolutely IS the case when using ODBC or
    similar client-side DB libraries.

    for more on isolation, see: https://www.postgresql.org/docs/14/transaction-iso.html



    There absolutely IS the notion of partial completion when you use
    inner (ie. sub-) transactions, which can succeed and fail
    independently of each other and of the outer transaction(s) in which
    they are nested. Differences in isolation can permit side effects from
    other ongoing transactions to be visible.

    But you don't expose those partial results (?)

    Expounding on the above:

    The transaction's isolation level comes into play with access to any
    instance of a table within the transaction. The result of the query
    may be a table, and as noted above, that result table (and the
    transaction that produced it) may be held open indefinitely by a
    cursor held by a client.

    However, complex queries which do a lot of server-side computation
    typically involve table-valued temporaries (TVTs). These essentially
    are variables within the transaction which are the results of
    evaluating subexpressions of the query. They may be implicitly created by
    the query compiler, or explicitly created by the programmer using
    named CTEs (common table expressions).

    Programmers normally expect that local variables and compiler produced temporaries in a function will keep their values unless the programmer
    does something to change them. But this is not [necessarily] true in
    SQL.

    TVTs in SQL act like 'volatile' variables in C ... you have to expect
    that the value(s) may be different every time you look at it. The
    expression that produced the TVT may be re-evaluated prior to any use
    [i.e., any mention of it in the code]. The way to prevent the value(s) changing is to
    choose a high isolation level for the controlling transaction.


    How would the client know that he's seeing partial results?

    The client can't perceive that TVTs within the query produced partial
    or dynamic results other than by repeating the query and noting
    differences in what is returned.

    Whether the client can perceive partial or dynamic results OF the
    query depends on how the client interacts with the server.

    The result of any query is a TVT - a compiler generated variable. It
    could be an actual generated/computed rowset, or it could be just a
    reference to some already existing table in the underlying schema.

    If the client can accept/copy the entire result rowset all at once,
    then it won't perceive any changes to that result: once the client has
    received the result, the transaction is complete, the server cleans up
    and the result TVT is gone.

    However, most client-side DB libraries do NOT accept whole tables as
    results. Instead they open a cursor on the result and cache some
    (typically small) number of rows surrounding the one currently
    referenced by the cursor. Moving the cursor fetches new rows to
    maintain the illusion that the whole table is available. Meanwhile
    the transaction that produced the result is kept alive (modulo
    timeout) because the result TVT is still needed until the client
    closes the cursor.

    It doesn't matter whether the query just selects rows from a declared
    base table, or does some complex computation that produces a new table
    result: if the client gets a cursor rather than a copy of the result,
    the client may be able to perceive changes made while it is still
    examining the data.


    [Ways around the problem of cursor based client libraries may include increasing the size of the row cache so that it can hold the entire
    expected result table, or (if you can't do that) to make a local copy
    of the result as quickly as possible. Remembering that the row cache
    is per-cursor, per-connection, increasing the size of the cache(s) may
    force restructuring your application to limit the number of open
    connections.]
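
    [A sketch of the "copy it as quickly as possible" workaround, again
    assuming libpq and a placeholder table "t": a plain PQexec() materializes
    the entire rowset client-side before returning, so nothing can change
    underneath you while you walk it -- unlike a DECLARE CURSOR / FETCH loop
    that keeps the producing transaction (and its TVT) alive:]

        #include <libpq-fe.h>
        #include <cstdio>

        int main() {
            PGconn *conn = PQconnectdb("dbname=test");
            if (PQstatus(conn) != CONNECTION_OK) return 1;

            // Whole result in one shot; the server-side transaction is done
            // by the time we start looking at the rows.
            PGresult *res = PQexec(conn, "SELECT * FROM t");
            if (PQresultStatus(res) == PGRES_TUPLES_OK) {
                int rows = PQntuples(res), cols = PQnfields(res);
                for (int r = 0; r < rows; ++r)          // safe: this is a local copy
                    for (int c = 0; c < cols; ++c)
                        std::printf("%s%c", PQgetvalue(res, r, c),
                                    c + 1 == cols ? '\n' : '\t');
            }
            PQclear(res);
            PQfinish(conn);
            return 0;
        }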


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Wed May 25 02:12:00 2022
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    [Some aggressive eliding as we're getting pretty far afield of
    "exception vs. error code"]


    I'm arguing for the case of treating each node as "specialized + generic" and making the generic portion available for other uses that aren't
    applicable to the "specialized" nature of the node (hardware).

    Your doorbell sits in an idiot loop waiting to "do something" -- instead of spending that "idle time" working on something *else* so the "device" that would traditionally be charged with doing that something else
    can get by with less resources on-board.

    [I use cameras galore. Imagining feeding all that video to a single
    "PC" would require me to keep looking for bigger and faster PCs!]

    ... if frames from the camera are uploaded into a cloud queue, any
    device able to process them could look there for new work. And store
    its results into a different cloud queue for the next step(s). Faster
    and/or more often 'idle' CPUs will do more work.

    That means every CPU must know how to recognize that sort of "work"
    and be able to handle it. Each of those nodes then bears a cost
    even if it doesn't actually end up contributing to the result.

    Every node must know how to look for work, but if suitable code can be downloaded on demand, then the nodes do NOT have to know how to do any particular kind of work.

    Ie. looking for new work yields a program. Running the program looks
    for data to process. You can put limits into it such as to terminate
    the program if no new data is seen for a while.

    Similarly on the cloud side, the server(s) might be aware of
    program/data associations and only serve up programs that have data
    queued to process. You can tweak this with priorities.


    it also makes the "cloud" a shared resource akin to the "main computer".
    What do you do when it isn't available?

    The cloud is distributed so that [to some meaningful statistical
    value] it always is available to any node having a working network
    connection.

    You have to design cloud services with expectation that network
    partitioning will occur and the cloud servers themselves will lose
    contact with one another. They need to be self-organizing (for when
    the network is restored) and databases they maintain should be
    redundant, self-repairing, and modeled on BASE rather than ACID.


    If the current resource set is insufficient for the current workload,
    then (by definition) something has to be shed. My "workload manager"
    handles that -- deciding that there *is* a resource shortage (by looking
    at how many deadlines are being missed/aborted) as well as sorting out
    what the likeliest candidates to "off-migrate" would be.

    Similarly, deciding when there is an abundance of resources that
    could be offered to other nodes.

    So, if a node is powered up *solely* for its compute resources
    (or, its unique hardware-related tasks have been satisfied) AND
    it discovers another node(s) has enough resources to address
    its needs, it can push its workload off to that/those node(s) and
    then power itself down.

    Each node effectively implements part of a *distributed* cloud
    "service" by holding onto resources as they are being used and
    facilitating their distribution when there are "greener pastures"
    available.

    But, unlike a "physical" cloud service, they accommodate the
    possibility of "no better space" by keeping the resources
    (and loads) that already reside on themselves until such a place
    can be found -- or created (i.e., bring more compute resources
    on-line, on-demand). They don't have the option of "parking"
    resources elsewhere, even as a transient measure.

    When a "cloud service" is unavailable, you have to have a backup
    policy in place as to how you'll deal with these overloads.

    Right, but you're still thinking (mostly) in terms of self-contained
    programs that do some significant end-to-end processing.

    And that is fine, but it is not the best way to employ a lot of
    small(ish) idle CPUs.


    Think instead of a pipeline of small, special/single purpose programs
    that can be strung together to accomplish the processing in well
    defined stages.
    Like a command-line gang in Unix: e.g., "find | grep | sort ..."

    Then imagine that they are connected not directly by pipes or sockets,
    but indirectly through shared 'queues' maintained by an external
    service.

    Then imagine that each pipeline stage is or can be on a different
    node, and that if the node crashes, the program it was executing can
    be reassigned to a new node. If no node is available, the pipeline
    halts ... with its state and partial results preserved in the cloud
    ... until a new node can take over.


    Obviously, not every process can be decomposed in this way - but with
    a bit of thought, surprisingly many processing tasks CAN BE adapted to
    this model.
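
    [A sketch of one such stage, in C++ for concreteness. The q_take()/q_put()
    names are hypothetical -- stand-ins for whatever shared-queue/tuple-space
    binding is actually used -- and are stubbed here with a process-local map
    just so the sketch compiles and runs:]

        #include <deque>
        #include <iostream>
        #include <map>
        #include <optional>
        #include <string>

        struct Item { std::string payload; };

        // Stub "queue service": in the real arrangement these would talk to the
        // external/shared queues, block on empty, honor an idle timeout, etc.
        static std::map<std::string, std::deque<Item>> queues;
        std::optional<Item> q_take(const std::string& q) {
            auto& d = queues[q];
            if (d.empty()) return std::nullopt;
            Item it = d.front(); d.pop_front();
            return it;
        }
        void q_put(const std::string& q, const Item& it) { queues[q].push_back(it); }

        Item transform(const Item& in) { return Item{"processed:" + in.payload}; }

        int main() {
            q_put("stage2.in", Item{"frame-0001"});
            q_put("stage2.in", Item{"frame-0002"});

            // One pipeline stage: take, transform, put. All state lives in the
            // queues, so the loop can be killed and restarted on another node,
            // losing at most the single item in flight.
            while (auto work = q_take("stage2.in"))
                q_put("stage2.out", transform(*work));

            for (const auto& it : queues["stage2.out"]) std::cout << it.payload << "\n";
            return 0;
        }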



    I don't have a "general" system. :> And, suspect future (distributed) embedded systems will shy away from the notion of any centralized "controller node" for the obvious dependencies that that imposes on the solution.

    "central control" is not a problem if the group can self-organize and
    (s)elect a new controller.


    Sooner or later, that node will suffer from scale. Or, reliability.
    (one of the initial constraints I put on the system was NOT to rely on
    any "outside" service; why not use a DBMS "in the cloud"? :> )

    "Cloud" is not any particular implementation - it's the notion of
    ubiquitous, high availability service. It also does not necessarily
    imply "wide-area" - a cloud can serve a building or a campus.

    [I'm really tired of the notion that words have only the meaning that
    marketing departments and the last N months of public consciousness
    have bestowed on them. It's true that the actual term 'cloud
    computing' originated in the 1990s, but the concepts embodied by the
    term date from the 1950s.]


    I can [checkpoint] just by letting a process migrate itself to its
    current node -- *if* it wants to "remember" that it can resume cleanly
    from that point (but not any point beyond that unless side-effects
    are eliminated). The first step in the migration effectively creates
    the process snapshot.

    There is overhead to taking that snapshot -- or pushing those
    "intermediate results" to the cloud. You have to have designed your *application* with that in mind.

    Just like an application can choose to push "temporary data" into
    the DBMS, in my world. And, incur those costs at run-time.

    The more interesting problem is seeing what you can do "from the
    outside" without the involvement of the application.

    E.g., if an application had to take special measures in order to
    be migrate-able, then I suspect most applications wouldn't be!
    And, as a result, the system wouldn't have that flexibility.

    OTOH, if the rules laid out for the environment allow me to wedge
    that type of service *under* the applications, then there's no
    cost-adder for the developers.

    Understood. The problem is that notion of 'process' still is too
    heavyweight. You can easily checkpoint in place, but migrating the
    process creates (at least) issues with reconnecting all the network
    plumbing.

    Ie. if the program wasn't written to notice that important connections
    were lost and handle those situations ... everywhere ... then it can't
    survive a migration.

    The TS model makes the plumbing /stateless/ and can - with a bit of
    care - make the process more elastic and more resilient in the face of
    various failures.



    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.

    But that means the post processing has to happen WHILE the video is
    being captured. I.e., you need "record" and "record-and-commercial-detect" primitives. Or, to expose the internals of the "record" operation.

    Not at all. What it means is that recording does not produce an
    integrated video stream, but rather a sequence of frames. The frame
    sequence then can be accessed by 'commercial-detect' which consumes[*]
    the input sequence and produces a new output sequence lacking those
    frames which represent commercial content. Finally, some other little
    program could take that commercial-less sequence and produce the
    desired video stream.

    [*] consumes or copies - the original data sequence could be left
    intact for some other unrelated processing.

    All of this can be done elastically, in the background, and
    potentially in parallel by a gang of (otherwise idle) nodes.


    Similarly, you could retrain the speech models WHILE you are listening
    to a phone call. But, that means you need the horsepower to do so
    AT THAT TIME, instead of just capturing the audio ("record") and
    doing the retraining "when convenient" ("retrain").

    Recording produces a sequence of clips to be analyzed by 'retrain'.


    I've settled on simpler primitives that can be applied in more varied situations. E.g., you will want to "record" the video when someone
    wanders onto your property. But, there won't be any "commercials"
    to detect in that stream.

    Trying to make "primitives" that handle each possible combination of
    actions seems like a recipe for disaster; you discover some "issue"
    and handle it in one implementation and imperfectly (or not at all)
    handle it in the other. "Why does it work if I do 'A then B' but
    'B while A' chokes?"

    You're simultaneously thinking too small AND too big.

    Breaking the operation into pipeline(able) stages is the right idea,
    but you need to think harder about what kinds of pipelines make sense
    and what is the /minimum/maximum/average amount of processing that
    makes sense for a pipeline stage.


    Remember, we're (I'm) trying to address something as "simple" as
    "exceptions vs error codes", here. Expecting a developer to write
    code with the notion of partial recovery in mind goes far beyond
    that!

    He can *choose* to structure his application/object/service in such
    a way that makes that happen. Or not.

    :

    I think it's hard to *generally* design solutions that can be
    interrupted and partially restored. You have to make a deliberate
    effort to remember what you've done and what you were doing.
    We seem to have developed the habit/practice of not "formalizing" intermediate results as we expect them to be transitory.

    Right. But generally it is the case that only certain intermediate
    points are even worthwhile to checkpoint.

    But again, this line of thinking assumes that the processing both is
    complex and time consuming - enough so that it is /expected/ to fail
    before completion.


    *Then*, you need some assurance that you *will* be restarted; otherwise,
    the progress that you've already made may no longer be useful.

    Any time a process is descheduled (suspended), for whatever reason,
    there is no guarantee that it will wake up again. But it has to
    behave AS IF it will.


    I don't, for example, universally use my checkpoint-via-OS hack
    because it will cause more grief than it will save. *But*, if
    a developer knows it is available (as a service) and the constraints
    of how it works, he can offer a hint (at install time) to suggest
    his app/service be installed with that feature enabled *instead* of
    having to explicitly code for resumption.

    Snapshots of a single process /may/ be useful for testing or debugging
    [though I have doubts about how much]. I'm not sure what purpose they
    really can serve in a production environment. After all, you don't
    (usually) write programs /intending/ for them to crash.

    For comparison: VM hypervisors offer snapshots also, but they capture
    the state of the whole system. You not only save the state of your
    process, but also of any system services it was using, any (local)
    peer processes it was communicating with, etc. This seems far more
    useful from a developer POV.

    Obviously MMV and there may be uses I have not considered.


    What I want to design against is the need to over-specify resources
    *just* for some "job" that may be infrequent or transitory. That
    leads to nodes costing more than they have to. Or, doing less than
    they *could*!

    If someTHING can notice imbalances in resources/demands and dynamically adjust them, then one node can act as if it has more "capability" than
    its own hardware would suggest.



    [Time is a *huge* project because of all the related issues. You
    still need some "reference" for the timebases -- wall and system.
    And, a way to ensure they track in some reasonably consistent
    manner: 11:00 + 60*wait(60 sec) should bring you to 12:00
    even though you're using times from two different domains!]

    Time always is a problem for any application that cares. Separating
    the notion of 'system' time from human notions of time is necessary
    but is not sufficient for all cases.



    I have no "epoch". Timestamps reflect the system time at which the
    event occurred. If the events had some relation to "wall time"
    ("human time"), then any discontinuities in that time frame are
    the problem of the human. Time "starts" at the factory. Your
    system's "system time" need bear no relationship to mine.

    How is system time preserved through a system-wide crash? [consider a
    power outage that depletes any/all UPSs.]

    What happens if/when your (shared) precision clock source dies?


    [When the system boots, it has no idea "what time it is" until it can
    get a time fix from some external agency. That *can* be an RTC -- but,
    RTC batteries can die, etc. It can be manually specified -- but, that
    can be in error. <shrug> The system doesn't care. The *user* might
    care (if his shows didn't get recorded at the right times)...]

    Answer ... system time is NOT preserved in all cases.


    For me, time doesn't exist when EVERYTHING is off. Anything that
    was supposed to happen during that interval obviously can't happen.
    And, nothing that has happened (in the environment) can be "noticed"
    so there's no way of ordering those observations!

    If I want some "event" to be remembered beyond an outage, then
    the time of the event has to be intentionally stored in persistent
    storage (i.e., the DBMS) and retrieved from it (and rescheduled)
    once the system restarts.

    Scheduled relative to the (new?, updated?) system time at the moment
    of scheduling. But how do you store that desired event time? Ie., a
    countdown won't survive if up-to-moment residue can't be persisted
    through a shutdown (or crash) and/or the system time at restart does
    not reflect the outage period.

    But if [as you said above] there's no starting 'epoch' to your
    timebase - ie. no zero point corresponding to a point in human time -
    then there also is no way to specify an absolute point in human time
    for an event in the future.


    These tend to be "human time" events (broadcast schedules, HVAC
    events, etc.). Most "system time" events are short-term and don't
    make sense spanning an outage *or* aren't particularly concerned
    with accuracy (e.g., vacuum the DB every 4 hours).

    I have a freerunning timer that is intended to track the passage of
    time during an outage (like an RTC would). It can never be set
    (reset) so, in theory, tells me what the system timer WOULD have
    been, had the system not suffered an outage.

    Assuming it does not roll-over during the outage.


    ----

    Now, back to my error/exception problem. I have to see if there are
    any downsides to offering a dual API to address each developer's
    "style"...


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Wed May 25 02:34:12 2022
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Now, back to my error/exception problem. I have to see if there are
    any downsides to offering a dual API to address each developer's
    "style"...

    I don't see any downside to a dual API other than naming conventions,
    which might be avoided if it's possible to link a 'return value'
    library vs an exception throwing library ...


    But I think it would be difficult to do both all the way down in
    parallel. Simpler to pick one as the basis, implement everything in
    your chosen model, and then provide a set of wrappers that convert to
    the other model.

    The question then is: which do you choose as the basis?

    Return value doesn't involve compiler 'magic', so you can code in any
    language - including ones that don't offer exceptions. However, code
    may be more complicated by the need to propagate errors.

    Exceptions lead to cleaner code and (generally) more convenient
    writing, but they do involve compiler 'magic' and so limit the
    languages that may be used. Code that depends on exceptions also may
    be slower to handle errors.
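
    [A sketch of that "error-code basis plus throwing wrapper" arrangement, in
    C++; every name here (status, dev_read, DevError, dev_read_or_throw) is
    invented for illustration, and dev_read is stubbed so the sketch stands
    alone:]

        #include <cstddef>
        #include <cstdio>
        #include <stdexcept>

        enum class status { ok, timeout, no_device, bad_arg };

        // Base library: reports problems through its return value only.
        status dev_read(int /*handle*/, void* /*buf*/, std::size_t len, std::size_t* actual) {
            *actual = len;                       // stub: pretend the read succeeded
            return status::ok;
        }

        // Wrapper layer: same operation, failures surface as an exception.
        struct DevError : std::runtime_error {
            status code;
            explicit DevError(status s) : std::runtime_error("device error"), code(s) {}
        };

        std::size_t dev_read_or_throw(int handle, void* buf, std::size_t len) {
            std::size_t actual = 0;
            status s = dev_read(handle, buf, len, &actual);
            if (s != status::ok) throw DevError(s);   // code -> exception, in ONE place
            return actual;
        }

        int main() {
            char buf[16];
            std::size_t n = 0;

            if (dev_read(3, buf, sizeof buf, &n) != status::ok)      // error-code style
                std::fprintf(stderr, "read failed\n");

            try { n = dev_read_or_throw(3, buf, sizeof buf); }       // exception style
            catch (const DevError&) { std::fprintf(stderr, "read failed\n"); }

            std::printf("%zu bytes\n", n);
            return 0;
        }

    Whichever direction the wrappers run, the conversion lives in one place
    instead of at every call site.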


    13 of one. Baker's dozen of the other.
    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Thu May 26 14:31:29 2022
    On 5/24/2022 11:34 PM, George Neuner wrote:
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Now, back to my error/exception problem. I have to see if there are
    any downsides to offering a dual API to address each developer's
    "style"...

    I don't see any downside to a dual API other than naming conventions,
    which might be avoided if it's possible to link a 'return value'
    library vs an exception throwing library ...

    At first, I thought you would link one OR the other into your app.
    So, two different libraries with the same ftn names inside. And,
    (potentially) different prototypes for each.

    But, then realized there may be cases where you'd want *both*...
    handling some errors with return codes and others with an exception
    mechanism.

    I have to think harder on this.

    But I think it would be difficult to do both all the way down in
    parallel. Simpler to pick one as the basis, implement everything in
    your chosen model, and then provide a set of wrappers that convert to
    the other model.

    The question then is: which do you choose as the basis?

    The "error code". It is portable to bindings that won't have
    hooks for the exceptions from the "exception" model.

    Return value doesn't involve compiler 'magic', so you can code in any
    language - including ones that don't offer exceptions. However, code
    may be more complicated by the need to propagate errors.

    Exceptions lead to cleaner code and (generally) more convenient
    writing, but they do involve compiler 'magic' and so limit the
    languages that may be used. Code that depends on exceptions also may
    be slower to handle errors.

    So far, I've NOT focused on performance in most of the design decisions.
    MIPS are cheap -- and getting cheaper every day! I periodically
    redesign my "processor core" to fit into the current price constraints;
    the amount of horsepower available is... disturbing! (thinking back to
    my i4004 days)

    AFAICT, the biggest selling point of the exception approach is the
    "code cleanliness" -- you can just let the exceptions propagate
    upwards instead of handling them where thrown.

    OTOH, with suitable choices for error codes (i.e., a common set across
    all ftn invocations), you could just early-exit and return the "error"
    that you encountered "locally" (after suitable cleanup)
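
    [A sketch of that early-exit style with a shared error-code set; the err
    values and the helper functions are invented for illustration:]

        #include <cstdio>
        #include <cstdlib>

        enum err { E_OK = 0, E_NOMEM, E_IO, E_TIMEOUT };   // common across ALL calls

        static err read_config(char** out) {
            *out = static_cast<char*>(std::malloc(64));
            return *out ? E_OK : E_NOMEM;
        }
        static err apply_config(const char*) { return E_OK; }   // stub: pretend it worked

        err configure(void) {
            char* cfg = nullptr;

            err e = read_config(&cfg);
            if (e != E_OK) return e;          // propagate the callee's code verbatim

            e = apply_config(cfg);
            std::free(cfg);                   // local cleanup, then early exit
            if (e != E_OK) return e;

            return E_OK;
        }

        int main() {
            std::printf("configure() -> %d\n", configure());
            return 0;
        }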

    [I still keep hoping for an insight that makes all this excess
    mechanism more transparent]

    13 of one. Baker's dozen of the other.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Thu May 26 17:39:26 2022
    On 5/24/2022 11:12 PM, George Neuner wrote:
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    I'm arguing for the case of treating each node as "specialized + generic" and making the generic portion available for other uses that aren't
    applicable to the "specialized" nature of the node (hardware).

    Your doorbell sits in an idiot loop waiting to "do something" -- instead of spending that "idle time" working on something *else* so the "device" that would traditionally be charged with doing that something else
    can get by with less resources on-board.

    [I use cameras galore. Imagining feeding all that video to a single
    "PC" would require me to keep looking for bigger and faster PCs!]

    ... if frames from the camera are uploaded into a cloud queue, any
    device able to process them could look there for new work. And store
    its results into a different cloud queue for the next step(s). Faster
    and/or more often 'idle' CPUs will do more work.

    That means every CPU must know how to recognize that sort of "work"
    and be able to handle it. Each of those nodes then bears a cost
    even if it doesn't actually end up contributing to the result.

    Every node must know how to look for work, but if suitable code can be downloaded on demand, then the nodes do NOT have to know how to do any particular kind of work.

    In my world, that's a pure *policy* decision.

    There are mechanisms that allow:
    - a node to look for work and "pull" it onto itself
    - a node to look for workers and "push" work onto them
    - a node to look at loads and capacities and MOVE work around
    - all of the above but with the nodes contributing "suggestions" to
    some other agent(s) that make the actual changes

    ANY node with a capability (handle) onto another node can create
    a task on that other node. And, specify how to initialize the
    ENTIRE state -- including copying a RUNNING state from another node.

    You can't really come up with an *ideal* partitioning
    (I suspect NP-complete under timeliness constraints?)
    but the goal is just to improve utilization and "headroom"
    in the short term (because the system is dynamic and
    you don't really know what the situation will be like
    in the near/far future)

    And, decide when (hardware) resources should be added/removed
    from the current run set.

    Ie. looking for new work yields a program. Running the program looks
    for data to process. You can put limits into it such as to terminate
    the program if no new data is seen for a while.

    In my approach, looking for work is just "looking for a load
    that you can *assume* -- from another! You already know there is
    data FOR THAT TASK to process. And, the TEXT+DATA to handle
    that is sitting there, waiting to be "plucked" (no need to
    download from the persistent store -- which might have been
    powered down to conserve power).

    The goal isn't to "help" a process that is running elsewhere
    but, rather, to find a better environment for that process.

    Similarly on the cloud side, the server(s) might be aware of
    program/data associations and only serve up programs that have data
    queued to process. You can tweak this with priorities.

    it also makes the "cloud" a shared resource akin to the "main computer".
    What do you do when it isn't available?

    The cloud is distributed so that [to some meaningful statistical
    value] it always is available to any node having a working network connection.

    I can create a redundant cloud service *locally*. But, if the cloud is *remote*, then I'm dependent on the link to the outside world to
    gain access to that (likely redundant) service.

    E.g., here, I rely on the microwave modem, a clear line of sight to
    the ISP's tower, the ISP's "far end" equipment, the ISP's link to the
    rest of The Internet *and* any cloud provider's kit. I have no control
    over any of those (other than the *power* to the microwave modem).

    If you've integrated that "cloud service" into your design, then
    you really can't afford to lose that service. It would be like
    having the OS reside somewhere beyond my control.

    I can tolerate reliance on LOCAL nodes -- because I can exercise
    some control over them. E.g., if the RDBMS's reliability was
    called into question, I could implement a redundant service.
    If the garage door controller shits the bed, someone (here) can
    replace it, OR, disconnect the door from the opener and go
    back to the 70's. :>

    So, using a local cloud would mean providing that service on
    local nodes in a way that is reliable and efficient.

    Letting workload managers on each node decide how to shuffle
    around the "work" gives me the storage (in place) that the
    cloud affords (the nodes are currently DOING the work!)

    You have to design cloud services with expectation that network
    partitioning will occur and the cloud servers themselves will lose
    contact with one another. They need to be self-organizing (for when
    the network is restored) and databases they maintain should be
    redundant, self-repairing, and modeled on BASE rather than ACID.

    Exactly. But they (The Service) also have to remain accessible
    at all times if you rely on them as a key part of your processing
    strategy.

    If the current resource set is insufficient for the current workload,
    then (by definition) something has to be shed. My "workload manager"
    handles that -- deciding that there *is* a resource shortage (by looking
    at how many deadlines are being missed/aborted) as well as sorting out
    what the likeliest candidates to "off-migrate" would be.

    Similarly, deciding when there is an abundance of resources that
    could be offered to other nodes.

    So, if a node is powered up *solely* for its compute resources
    (or, its unique hardware-related tasks have been satisfied) AND
    it discovers another node(s) has enough resources to address
    its needs, it can push its workload off to that/those node(s) and
    then power itself down.

    Each node effectively implements part of a *distributed* cloud
    "service" by holding onto resources as they are being used and
    facilitating their distribution when there are "greener pastures"
    available.

    But, unlike a "physical" cloud service, they accommodate the
    possibility of "no better space" by keeping the resources
    (and loads) that already reside on themselves until such a place
    can be found -- or created (i.e., bring more compute resources
    on-line, on-demand). They don't have the option of "parking"
    resources elsewhere, even as a transient measure.

    When a "cloud service" is unavailable, you have to have a backup
    policy in place as to how you'll deal with these overloads.

    Right, but you're still thinking (mostly) in terms of self-contained
    programs that do some significant end-to-end processing.

    And that is fine, but it is not the best way to employ a lot of
    small(ish) idle CPUs.

    Think instead of a pipeline of small, special/single purpose programs
    that can be strung together to accomplish the processing in well
    defined stages.
    Like a command-line gang in Unix: e.g., "find | grep | sort ..."

    Then imagine that they are connected not directly by pipes or sockets,
    but indirectly through shared 'queues' maintained by an external
    service.

    That's how things are handled presently. The "queues" are the message
    queues into each object's server. When I "read" data from a camera,
    it comes into process that has a suitable capability for that access.
    When that process wants to "store" the data that it has "read",
    it passes it to a handle into another object (a recorder). If the
    recorder wants to make a persistent copy of the data, it connects
    to the DBMS (via another handle) and pushes the data, there.

    But, at the user (applet) level, the user doesn't think in terms
    of pipes and concurrent operations (I think that would be a
    stretch for most people). Instead, he writes scripts that serialize
    these actions. In effect:

    find ... > temp1
    grep ... temp1 > temp2
    sort ... temp2 > temp3
    display temp3

    Then imagine that each pipeline stage is or can be on a different
    node, and that if the node crashes, the program it was executing can
    be reassigned to a new node. If no node is available, the pipeline
    halts ... with its state and partial results preserved in the cloud
    ... until a new node can take over.

    Understood.

    But, the nodes don't tend to "crash" (I handle that with a "redundancy" option).

    Instead, the nodes are (most commonly) KILLED off. Deliberate actions
    to satisfy some observed condition(s).

    E.g., if taskX misses its deadline, then there's no point allowing
    it to continue running. It's BROKE (in this particular instance).
    Free up its resources so other tasks on that node (and on nodes that
    might be competing with that task) aren't burdened (and likely to
    suffer a similar fate!)

    Similarly, if I *know* that taskX won't be able to meet its deadline
    (because I can see what the current scheduling set happens to be),
    then why let it waste resources until it misses its deadline? Kill
    it off, now.

    If I know that the system load will be shifting to account for
    the system's desire to shed *hardware* loads (to conserve power),
    then that is just a precipitator of the above. Kill those tasks
    off so the remaining hardware resources can meet the needs of the
    "essential" services/tasks.

    The *less* common case is "someone has pulled the plug" to a particular
    node and unceremoniously (and irrecoverably) killed off the entire
    task set hosted by that node.

    But, every bit of code knows that there are no guarantees that they
    will be ALLOWED to run -- let alone run-to-completion. As the system
    is open, something has to establish an admission policy and enforce it.

    [On your PC, you can run far too many applications to make any meaningful progress on any of them. But, there aren't likely any "critical" apps
    on your PC so <shrug> If, OTOH, it had some critical responsibilities
    that it must meet, then you'd have to design a mechanism that effectively wouldn't allow interference with those. E.g., processor reserves,
    elevated priorities, etc. All effectively controlling what jobs the
    PC would be allowed to perform]

    A "good" application will know how to be restarted -- from the
    beginning -- without needlessly duplicating work. E.g., like my
    "archive" application. I *could* have done a bunch of work
    and kept the status of my process in local RAM, writing it back
    to the DB at some point. But, then I would risk having to repeat
    that work if I was killed off before I managed to update the DB
    (or, if the DB crashed/went offline). By incrementally
    updating the DB with the status after each nominal operation,
    I can be killed off and only lose the "work" that I was doing
    at the time.
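
    [A sketch of that incremental-checkpoint pattern; db_is_done()/db_mark_done()
    are invented stand-ins for the real DBMS calls, backed by a std::set here
    just so the sketch runs:]

        #include <iostream>
        #include <set>
        #include <string>
        #include <vector>

        static std::set<std::string> fake_db;                        // stand-in for the DBMS
        bool db_is_done(const std::string& id)   { return fake_db.count(id) != 0; }
        void db_mark_done(const std::string& id) { fake_db.insert(id); }

        void archive_one(const std::string& id) { std::cout << "archiving " << id << "\n"; }

        void run_archive_pass(const std::vector<std::string>& items) {
            for (const auto& id : items) {
                if (db_is_done(id)) continue;   // restart: skip work already recorded
                archive_one(id);
                db_mark_done(id);               // checkpoint AFTER each nominal operation
            }
        }

        int main() {
            run_archive_pass({"obj-1", "obj-2", "obj-3"});
            // killed and restarted here? the second pass repeats nothing:
            run_archive_pass({"obj-1", "obj-2", "obj-3"});
            return 0;
        }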

    The level of service a user gets from "non-good" applications is
    dubious (and nothing I can do to improve that)

    Obviously, not every process can be decomposed in this way - but with
    a bit of thought, surprisingly many processing tasks CAN BE adapted to
    this model.

    I don't have a "general" system. :> And, suspect future (distributed)
    embedded systems will shy away from the notion of any centralized "controller
    node" for the obvious dependencies that that imposes on the solution.

    "central control" is not a problem if the group can self-organize and (s)elect a new controller.

    Or, if the controller can be decentralized. This has special appeal
    as the system grows in complexity (imagine 10,000 nodes instead of 250)

    Sooner or later, that node will suffer from scale. Or, reliability.
    (one of the initial constraints I put on the system was NOT to rely on
    any "outside" service; why not use a DBMS "in the cloud"? :> )

    "Cloud" is not any particular implementation - it's the notion of
    ubiquitous, high availability service. It also does not necessarily
    imply "wide-area" - a cloud can serve a building or a campus.

    Yes, but it must be high availability!

    If any of my "single node" services goes down, then the services *uniquely* hosted by that node are unavailable. E.g., I can't recognize a person standing at the front door ("doorbell") if the node that provides the hooks to the
    front door camera is powered down/isolated. But, the *application* that
    does that recognition need not be tied to that node -- if some other source
    of a video feed is available.

    [This is a direct consequence of my being able to move processes around]

    In my case, the worst dependency lies in the RDBMS. But, it's loss
    can be tolerated if you don't have to access data that is ONLY
    available on the DB.

    [The switch is, of course, the extreme example of a single point failure]

    E.g., if you have the TEXT image for a "camera module" residing
    on node 23, then you can use that to initialize the TEXT segment
    of the camera on node 47. If you want to store something on the
    DBMS, it can be cached locally until the DBMS is available. etc.

    [I'm really tired of the notion that words have only the meaning that marketing departments and the last N months of public consciousness
    have bestowed on them. It's true that the actual term 'cloud
    computing' originated in the 1990s, but the concepts embodied by the
    term date from the 1950s.]

    I can [checkpoint] just by letting a process migrate itself to its
    current node -- *if* it wants to "remember" that it can resume cleanly from that point (but not any point beyond that unless side-effects
    are eliminated). The first step in the migration effectively creates
    the process snapshot.

    There is overhead to taking that snapshot -- or pushing those
    "intermediate results" to the cloud. You have to have designed your
    *application* with that in mind.

    Just like an application can choose to push "temporary data" into
    the DBMS, in my world. And, incur those costs at run-time.

    The more interesting problem is seeing what you can do "from the
    outside" without the involvement of the application.

    E.g., if an application had to take special measures in order to
    be migrate-able, then I suspect most applications wouldn't be!
    And, as a result, the system wouldn't have that flexibility.

    OTOH, if the rules laid out for the environment allow me to wedge
    that type of service *under* the applications, then there's no
    cost-adder for the developers.

    Understood. The problem is that notion of 'process' still is too heavyweight. You can easily checkpoint in place, but migrating the
    process creates (at least) issues with reconnecting all the network
    plumbing.

    If the process is still "live", then the connections remain intact.
    The kernels on the leaving and arriving nodes conspire to map the
    process-local handles to the correct services. (cuz each kernel
    knows how to get to object X and can tell the "arriving" kernel
    how this is done, even if the arriving kernel has to rename IT'S
    notion of object X's "home")

    If a process is SIGKILLed, then the connections are severed.
    The servers for each of the client handles are notified that
    the client task(s) have died and unwind those connections,
    freeing those resources. (if the last connection is severed
    to a server, then the server dies)

    Ie. if the program wasn't written to notice that important connections
    were lost and handle those situations ... everywhere ... then it can't survive a migration.

    The TS model makes the plumbing /stateless/ and can - with a bit of
    care - make the process more elastic and more resilient in the face of various failures.

    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.

    But that means the post processing has to happen WHILE the video is
    being captured. I.e., you need "record" and "record-and-commercial-detect" primitives. Or, to expose the internals of the "record" operation.

    Not at all. What it means is that recording does not produce an
    integrated video stream, but rather a sequence of frames. The frame
    sequence then can be accessed by 'commercial-detect' which consumes[*]
    the input sequence and produces a new output sequence lacking those
    frames which represent commercial content. Finally, some other little program could take that commercial-less sequence and produce the
    desired video stream.

    But there's no advantage to this if the "commercial detect" is going
    to be done AFTER the recording. I.e., you're storing the recording,
    frame by frame, in the cloud. I'm storing it on a "record object"
    (memory or DB record). And, in a form that is convenient for
    a "record" operation (without concern for the "commercial detect"
    which may not be used in this case).

    Exploiting the frame-by-frame nature only makes sense if you're
    going to start nibbling on those frames AS they are generated.

    I'm designing a bridge to the on-board displays in the cars so
    the driver can view the video streams from the (stationary)
    cameras in the garage, as he pulls in. I just "connect" the
    camera's output to the radio transmitter through a suitable
    CODEC. The result is effectively "live". Because the CODEC
    just passes video through to <whatever> device is wired
    to its output.

    [*] consumes or copies - the original data sequence could be left
    intact for some other unrelated processing.

    All of this can be done elastically, in the background, and
    potentially in parallel by a gang of (otherwise idle) nodes.

    Similarly, you could retrain the speech models WHILE you are listening
    to a phone call. But, that means you need the horsepower to do so
    AT THAT TIME, instead of just capturing the audio ("record") and
    doing the retraining "when convenient" ("retrain").

    Recording produces a sequence of clips to be analyzed by 'retrain'.

    Same issue as above. If retrain doesn't get invoked until the "record"
    step (in the script) is complete, then you're storing "clips" where
    I'm storing the entire audio stream. There's no (easy) way to exploit
    the concurrency (that a user will understand).

    People are serial thinkers. Consider how challenging it is for many
    to prepare a *complete* meal -- and have everything served "hot"
    and at the same time. It's not a tough problem to solve. But, requires considerable planning lest the mashed potatoes be cold or the meat
    overcooked, vegetables like mush, etc.

    I've settled on simpler primitives that can be applied in more varied
    situations. E.g., you will want to "record" the video when someone
    wanders onto your property. But, there won't be any "commercials"
    to detect in that stream.

    Trying to make "primitives" that handle each possible combination of
    actions seems like a recipe for disaster; you discover some "issue"
    and handle it in one implementation and imperfectly (or not at all)
    handle it in the other. "Why does it work if I do 'A then B' but
    'B while A' chokes?"

    You're simultaneously thinking too small AND too big.

    Breaking the operation into pipeline(able) stages is the right idea,
    but you need to think harder about what kinds of pipelines make sense
    and what is the /minimum/maximum/average amount of processing that
    makes sense for a pipeline stage.

    Remember, we're (I'm) trying to address something as "simple" as
    "exceptions vs error codes", here. Expecting a developer to write
    code with the notion of partial recovery in mind goes far beyond
    that!

    He can *choose* to structure his application/object/service in such
    a way that makes that happen. Or not.

    :

    I think it's hard to *generally* design solutions that can be
    interrupted and partially restored. You have to make a deliberate
    effort to remember what you've done and what you were doing.
    We seem to have developed the habit/practice of not "formalizing"
    intermediate results as we expect them to be transitory.

    Right. But generally it is the case that only certain intermediate
    points are even worthwhile to checkpoint.

    Of course. But, you have to think hard about those. It's not just
    "how much REwork will this save if I checkpoint HERE" but, also,
    "is there anything that I've caused to happen (side effect) that
    might NOT happen when I resume at this checkpoint".

    I have a graphic FTP client that was causing me all sorts of
    problems. Sometimes, files would transfer but would be corrupted
    in the process.

    It took me a long time to track down the problem because it was
    (largely) unrelated to the CONTENT being transferred, size, etc.

    I finally realized that some transfers were apparently being treated as ASCII
    despite my having configured the application for BINARY transfers,
    exclusively. Eventually, I realized that this was happening whenever the
    FTP server had timed out a connection (the GUI client would dutifully
    hide this notification from me -- because *it* would reinitiate the
    connection if left idle for "too long"). The client code would dutifully reestablish the connection (without any action on my part -- or even
    a notification that it had to do this!). BUT, would fail to send
    the BINARY command to the server!

    So, if the server defaulted to ASCII transfers, the next transfer
    would be ASCII -- and corrupted -- even though the previous transfer
    executed in that GUI session was done in BINARY. To the client, the
    connection had "failed". But, it neglected to cross all the T's
    when reestablishing it.

    Simple bug. Disastrous results! (because the protocol doesn't
    verify that the stored image agreed with the original image!)

    But again, this line of thinking assumes that the processing both is
    complex and time consuming - enough so that it is /expected/ to fail
    before completion.

    Again, I'm not looking to guard against "failures" (see previous).

    Most user applets are tiny -- but run continuously. The equivalent
    of daemons.

    when sun comes in west windows
    close blinds on west side of house

    when sun sets
    ensure garage door is closed

    etc. I can move these in a fraction of a millisecond. But, likely
    wouldn't do so unless they were executing on a node that I wanted to
    power down.

    OTOH, if the node *failed*, they are problematic to restart unless
    expressed in terms of *states* (the first looks at the STATE of the
    sunshine while the second looks for an EVENT). How do I tell the
    Joe Average User the difference?

    *Then*, you need some assurance that you *will* be restarted; otherwise,
    the progress that you've already made may no longer be useful.

    Any time a process is descheduled (suspended), for whatever reason,
    there is no guarantee that it will wake up again. But it has to
    behave AS IF it will.

    What if it missed its deadline? There's no point in preserving state
    or trying to restart it. I.e., it can be "deferred" before its deadline
    but the system might not be able to support its restoration until
    AFTER its deadline has passed -- and then no longer of interest.

    If the sunset applet gets restarted, tomorrow, how will it know
    that any state saved YESTERDAY should not be pertinent?

    I don't, for example, universally use my checkpoint-via-OS hack
    because it will cause more grief than it will save. *But*, if
    a developer knows it is available (as a service) and the constraints
    of how it works, he can offer a hint (at install time) to suggest
    his app/service be installed with that feature enabled *instead* of
    having to explicitly code for resumption.

    Snapshots of a single process /may/ be useful for testing or debugging [though I have doubts about how much]. I'm not sure what purpose they
    really can serve in a production environment. After all, you don't
    (usually) write programs /intending/ for them to crash.

    Stop thinking about crashes. That implies something is "broken".

    Any application that is *inherently* resumable (e.g., my archive
    service) can benefit from an externally imposed snapshot that
    is later restored. The developer is the one who is qualified to
    make this assessment, not the system (it lacks the heuristics
    though, arguably, could "watch" an application to see what handles
    it invokes).

    Similarly, the system can't know the nature of a task's deadlines.
    But, can provide mechanisms to make use of that deadline data
    FROM the developer. If you don't provide it, then I will
    assume your deadline is at T=infinity... and your code will
    likely never be scheduled! :> (if you try to game it, your
    code also may never be scheduled: "That deadline is too soon
    for me to meet it so lets not bother trying!")

    For comparison: VM hypervisors offer snapshots also, but they capture
    the state of the whole system. You not only save the state of your
    process, but also of any system services it was using, any (local)
    peer processes it was communicating with, etc. This seems far more
    useful from a developer POV.

    Obviously MMV and there may be uses I have not considered.

    What I want to design against is the need to over-specify resources
    *just* for some "job" that may be infrequent or transitory. That
    leads to nodes costing more than they have to. Or, doing less than
    they *could*!

    If someTHING can notice imbalances in resources/demands and dynamically
    adjust them, then one node can act as if it has more "capability" than
    its own hardware would suggest.

    [Time is a *huge* project because of all the related issues. You
    still need some "reference" for the timebases -- wall and system.
    And, a way to ensure they track in some reasonably consistent
    manner: 11:00 + 60*wait(60 sec) should bring you to 12:00
    even though you're using times from two different domains!]

    Time always is a problem for any application that cares. Separating
    the notion of 'system' time from human notions of time is necessary
    but is not sufficient for all cases.

    I have no "epoch". Timestamps reflect the system time at which the
    event occurred. If the events had some relation to "wall time"
    ("human time"), then any discontinuities in that time frame are
    the problem of the human. Time "starts" at the factory. Your
    system's "system time" need bear no relationship to mine.

    How is system time preserved through a system-wide crash? [consider a
    power outage that depletes any/all UPSs.]

    There is a free-running battery backed counter that *loosely*
    tracks the normal rate of time progression. I.e., its notion
    of a second may be 1.24 seconds. Or, 0.56 seconds. I don't care
    about the *rate* that it thinks time is passing as long as it is
    reasonably consistent (over the sorts of intervals addressed)

    The system has learned this (from observations against external
    references) and can notice how much the counter has changed since
    last observation, then convert that into "real" seconds. The system
    time daemon periodically takes a peek at this counter and stores a
    tuple (system-time, counter-value). It need not capture this at the
    instant of an outage (because outages are asynchronous events).

    When the daemon restarts (after the outage), it again examines the
    counter, notes the difference from last stored tuple and determines
    delta-counter. Using the calibration factor observed previously for
    the counter's "tick rate", it can deduce a system-time that
    corresponds with that observed counter value -- round UP.
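
    The arithmetic amounts to something like this (illustrative only;
    the real daemon also has to worry about counter wrap, etc.):

        #include <stdint.h>

        /* Last tuple the time daemon stored before the outage.            */
        struct time_fix {
            uint64_t system_time;    /* system time, in "real" seconds     */
            uint64_t counter;        /* raw battery-backed counter value   */
        };

        /* seconds_per_tick is the learned calibration: how many "real"
           seconds one raw counter tick represents (e.g., 1.24 or 0.56).   */
        uint64_t restore_system_time(const struct time_fix *last,
                                     uint64_t counter_now,
                                     double seconds_per_tick)
        {
            uint64_t delta_ticks = counter_now - last->counter;
            double elapsed_secs = delta_ticks * seconds_per_tick;

            /* round UP, per the above: better to claim a little too much
               time has passed than too little                             */
            uint64_t elapsed = (uint64_t)elapsed_secs;
            if ((double)elapsed < elapsed_secs)
                elapsed++;

            return last->system_time + elapsed;
        }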

    Another system suffering from the same outage will experience a
    different "counter delta" (because its tick may happen at a different
    rate) but the same "time difference". Its deduced system-time will
    depend on its prior observations of *its* counter's tick rate.

    The adjusted *system* time will be different (because they came off
    the assembly line at different times and had their clocks started
    at different "actual times")

    What happens if/when your (shared) precision clock source dies?

    The system is broken. Same thing if the network switch fails
    or the power distributed through it. Or, lightning fries all of
    the interconnect wiring.

    What if someone pulls the memory out of your PC? Or, steals your power
    cord? Or, drops a rock on it?

    [When the system boots, it has no idea "what time it is" until it can
    get a time fix from some external agency. That *can* be an RTC -- but,
    RTC batteries can die, etc. It can be manually specified -- but, that
    can be in error. <shrug> The system doesn't care. The *user* might
    care (if his shows didn't get recorded at the right times)...]

    Answer ... system time is NOT preserved in all cases.

    System time is preserved as long as batteries are maintained. If you
    don't want to take on that PM task, then your system will
    (deliberately) forget everything that it was doing each time there
    is an outage that the UPS can't span.

    If you don't ever provide a wall time reference, then it will never
    know when your favorite radio program airs. Or, when to make your
    morning coffee. Or, when cooling season begins.

    What if you've not kept a backup of your disk? Was that an unspoken
    responsibility of the PC manufacturer?

    For me, time doesn't exist when EVERYTHING is off. Anything that
    was supposed to happen during that interval obviously can't happen.
    And, nothing that has happened (in the environment) can be "noticed"
    so there's no way of ordering those observations!

    If I want some "event" to be remembered beyond an outage, then
    the time of the event has to be intentionally stored in persistent
    storage (i.e., the DBMS) and retrieved from it (and rescheduled)
    once the system restarts.

    Scheduled relative to the (new?, updated?) system time at the moment
    of scheduling. But how do you store that desired event time? I.e.,
    a countdown won't survive if up-to-the-moment residue can't be
    persisted through a shutdown (or crash) and/or the system time at
    restart does not reflect the outage period.

    The system time after an outage is restored from the battery-backed
    counter, compensated by knowledge of how its nominal count frequency
    differs from the ideal count frequency. Events that are related to
    system time (i.e., observations of when events occurred; times at
    which actions should be initiated; deadlines, etc.) are stored in
    units of system time relative to T=0... FOR THIS SYSTEM. There is
    no relation to Jan 1, 1970/80 or any other "epoch".

    If I have a button wired to two different colocated systems and press it,
    one system may record it as occurring at Ts=23456 and the other might see
    it as occurring at Ts=1111111111. It is *released* at Ts=23466 on
    the first system and Ts=1111111121 on the second.

    The first may think the button was pressed at Tw=12:05 while the second
    thinks it happened at Tw=4:47. Who is to say which -- if any -- is correct?
    But, each will see the button closure for ~10 system-time units (within
    the uncertainty of their clock events).

    The user(s) doesn't care about the differences in Ts values. And,
    may or may not care about the differences in Tw values. But, likely
    is concerned with WHEN (in the cosmic sense) the button was
    pressed/released and whether it was held activated "long enough"
    (for some particular criteria).

    If that button press started a job that was intended to blink a light
    at 0.5 Hz, the lights on the two systems would retain a fixed phase
    and frequency relationship to each other.

    But if [as you said above] there's no starting 'epoch' to your
    timebase - ie. no zero point corresponding to a point in human time -
    then there also is no way to specify an absolute point in human time
    for an event in the future.

    Human (wall) time relies on the quality of the time reference that the
    USER has chosen (directly or indirectly).

    If your VCR displays a flashing 12:00, then clearly you don't care what

    [continued in next message]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From George Neuner@21:1/5 to blockedofcourse@foo.invalid on Sun May 29 01:14:47 2022
    On Thu, 26 May 2022 17:39:26 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 5/24/2022 11:12 PM, George Neuner wrote:
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    There are mechanisms that allow:
    - a node to look for work and "pull" it onto itself
    - a node to look for workers and "push" work onto them
    - a node to look at loads and capacities and MOVE work around
    - all of the above but with the nodes contributing "suggestions" to
    some other agent(s) that make the actual changes

    ANY node with a capability (handle) onto another node can create
    a task on that other node. And, specify how to initialize the
    ENTIRE state -- including copying a RUNNING state from another node.

    You can't really come up with an *ideal* partitioning
    (I suspect NP-complete under timeliness constraints?)

    NP anyway - it's a bin packing problem even without time constraints.


    In my approach, looking for work is just "looking for a load
    that you can *assume* -- from another! You already know there is
    data FOR THAT TASK to process. And, the TEXT+DATA to handle
    that is sitting there, waiting to be "plucked" (no need to
    download from the persistent store -- which might have been
    powered down to conserve power).

    The goal isn't to "help" a process that is running elsewhere
    but, rather, to find a better environment for that process.

    That's a form of "work stealing". But usually the idea is to migrate
    the task to a CPU which has [for some definition] a lighter load:
    "lighter" because it's faster, or because it has more memory, etc.
    Typically this leaves out the small fry.

    The TS model is both more competitive and less ruthless - idle (or
    lightly loaded) nodes don't steal from one another, but rather nodes
    that ARE doing the same kind of processing compete for the input data
    sequence. More capable nodes get more done - but not /necessarily/ at
    the expense of less capable nodes [that depends on the semantics of
    the tuple services].

    Very different.


    I can create a redundant cloud service *locally*. But, if the cloud
    is *remote*, then I'm dependent on the link to the outside world to
    gain access to that (likely redundant) service.

    Yes. But you wouldn't choose to do that. If someone else chooses to
    do it, that is beyond your control.

    If you've integrated that "cloud service" into your design, then
    you really can't afford to lose that service. It would be like
    having the OS reside somewhere beyond my control.

    I can tolerate reliance on LOCAL nodes -- because I can exercise
    some control over them. ...

    So, using a local cloud would mean providing that service on
    local nodes in a way that is reliable and efficient.

    Letting workload managers on each node decide how to shuffle
    around the "work" gives me the storage (in place) that the
    cloud affords (the nodes are currently DOING the work!)

    But it means that even really /tiny/ nodes need some variant of your
    "workload manager". It's simpler just to look for something to do
    when idle.

    [Of course, "tiny" is relative: it's likely even the smallest nodes
    will have a 32-bit CPU. It's memory that likely will be the issue.]


    If the current resource set is insufficient for the current workload,
    then (by definition) something has to be shed. My "workload manager"
    handles that -- deciding that there *is* a resource shortage (by
    looking at how many deadlines are being missed/aborted) as well as
    sorting out what the likeliest candidates to "off-migrate" would be.

    Similarly, deciding when there is an abundance of resources that
    could be offered to other nodes.

    So, if a node is powered up *solely* for its compute resources
    (or, its unique hardware-related tasks have been satisfied) AND
    it discovers another node(s) has enough resources to address
    its needs, it can push its workload off to that/those node(s) and
    then power itself down.

    Each node effectively implements part of a *distributed* cloud
    "service" by holding onto resources as they are being used and
    facilitating their distribution when there are "greener pastures"
    available.

    But, unlike a "physical" cloud service, they accommodate the
    possibility of "no better space" by keeping the resources
    (and loads) that already reside on themselves until such a place
    can be found -- or created (i.e., bring more compute resources
    on-line, on-demand). They don't have the option of "parking"
    resources elsewhere, even as a transient measure.



    In my case, the worst dependency lies in the RDBMS. But, its loss
    can be tolerated if you don't have to access data that is ONLY
    available on the DB.

    [The switch is, of course, the extreme example of a single point failure]

    E.g., if you have the TEXT image for a "camera module" residing
    on node 23, then you can use that to initialize the TEXT segment
    of the camera on node 47. If you want to store something on the
    DBMS, it can be cached locally until the DBMS is available. etc.

    Yes. Or [assuming a network connection] the DBMS could be made
    distributed so it always is available.



    The TS model makes the plumbing /stateless/ and can - with a bit of
    care - make the process more elastic and more resilient in the face of
    various failures.

    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.

    But that means the post processing has to happen WHILE the video is
    being captured. I.e., you need "record" and
    "record-and-commercial-detect" primitives. Or, to expose the
    internals of the "record" operation.

    Not at all. What it means is that recording does not produce an
    integrated video stream, but rather a sequence of frames. The frame
    sequence then can be accessed by 'commercial-detect' which consumes[*]
    the input sequence and produces a new output sequence lacking those
    frames which represent commercial content. Finally, some other little
    program could take that commercial-less sequence and produce the
    desired video stream.

    But there's no advantage to this if the "commercial detect" is going
    to be done AFTER the recording. I.e., you're storing the recording,
    frame by frame, in the cloud. I'm storing it on a "record object"
    (memory or DB record). And, in a form that is convenient for
    a "record" operation (without concern for the "commercial detect"
    which may not be used in this case).

    Exploiting the frame-by-frame nature only makes sense if you're
    going to start nibbling on those frames AS they are generated.

    No, frame by frame makes sense regardless. The video, AS a video,
    does not need to exist until a human wants to watch. Until that time
    [and even beyond] it can just as well exist as a frame sequence. The
    only penalty for this is a bit more storage, and storage is [well,
    before supply chains fell apart and the current administration
    undertook to further ruin the economy, it was] decently cheap.
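
    Purely to illustrate the shape of it (a toy, in-memory sketch: fake
    "detector", no codecs, no tuple server -- just the consume/produce
    structure):

        #include <stdbool.h>
        #include <stdio.h>

        struct frame { int seq; bool is_commercial; };

        /* "record" just emits a sequence of frames ...                    */
        static size_t record(struct frame *out, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                out[i].seq = (int)i;
                out[i].is_commercial = (i % 5 == 4);   /* stand-in flag    */
            }
            return n;
        }

        /* ... "commercial-detect" consumes that sequence and produces a
           new sequence lacking the commercial frames ...                  */
        static size_t commercial_detect(const struct frame *in, size_t n,
                                        struct frame *out)
        {
            size_t kept = 0;
            for (size_t i = 0; i < n; i++)
                if (!in[i].is_commercial)
                    out[kept++] = in[i];
            return kept;
        }

        /* ... and only at the very end does something assemble "a video". */
        int main(void)
        {
            struct frame raw[20], clean[20];
            size_t n = record(raw, 20);
            size_t m = commercial_detect(raw, n, clean);
            printf("kept %zu of %zu frames\n", m, n);
            return 0;
        }

    Each stage is restartable at any frame boundary, which is the point.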


    Snapshots of a single process /may/ be useful for testing or debugging
    [though I have doubts about how much]. I'm not sure what purpose they
    really can serve in a production environment. After all, you don't
    (usually) write programs /intending/ for them to crash.

    Stop thinking about crashes. That implies something is "broken".

    If an RT task (hard or soft) misses its deadline, something IS
    "broken" [for some definition]. Resetting the task automagically to
    some prior - hopefully "non-broken" - state is not helpful if it masks
    the problem.


    Any application that is *inherently* resumable (e.g., my archive
    service) can benefit from an externally imposed snapshot that
    is later restored. The developer is the one who is qualified to
    make this assessment, not the system (it lacks the heuristics
    though, arguably, could "watch" an application to see what handles
    it invokes).

    I guess the question here is: can the developer say "snapshot now!" or
    is it something that happens periodically, or even just best effort?


    Similarly, the system can't know the nature of a task's deadlines.
    But, can provide mechanisms to make use of that deadline data
    FROM the developer. If you don't provide it, then I will
    assume your deadline is at T=infinity... and your code will
    likely never be scheduled! :> (if you try to game it, your
    code also may never be scheduled: "That deadline is too soon
    for me to meet it so lets not bother trying!")

    We've had this conversation before: the problem is not that deadlines
    can't be scheduled for ... it's that the critical deadlines can't
    all necessarily be enumerated.

    :
    to get D by T3, I need C by T2
    to get C by T2, I need B by T1
    to get B by T1, I need A by T0
    :

    ad regressus, ad nauseam.

    And yes, a single explicit deadline can proxy for some number of
    implicit deadlines. That's not the point and you know it <grin>.


    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Sun May 29 22:23:52 2022
    On 5/24/2022 11:11 PM, George Neuner wrote:
    On Sun, 22 May 2022 14:12:39 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    [Some aggressive eliding as we're getting pretty far afield of
    "exception vs. error code"]

    Best discussions always wander. <grin>

    Subject to time constraints... <frown> Reshuffling computers over
    the past few days :-/

    [much elided, throughout, as we're on the same page as to *theory*...]

    Your particular example isn't possible, but other things are -
    including having values seem to appear or disappear when they are
    examined at different points within your transaction.

    But the point of the transaction is to lock these changes
    (or recognize their occurrence) so this "ambiguity" can't
    manifest. (?)

    Yes ... and no.

    The "client" either sees the result of the entire transaction or
    none of it.

    ...

    High isolation levels often result in measurably lower performance -
    "repeatable read" requires that when any underlying table is first
    touched, the selected rows be locked, or be copied for local use. RR
    also changes how subsequent uses of those rows are evaluated (see
    below). Locking limits concurrency, copying uses resources (which
    also may limit concurrency).

    ...

    However, most client-side DB libraries do NOT accept whole tables as
    results. Instead they open a cursor on the result and cache some
    (typically small) number of rows surrounding the one currently
    referenced by the cursor. Moving the cursor fetches new rows to
    maintain the illusion that the whole table is available. Meanwhile
    the transaction that produced the result is kept alive (modulo
    timeout) because the result TVT is still needed until the client
    closes the cursor.

    [Ways around the problem of cursor-based client libraries may include
    increasing the size of the row cache so that it can hold the entire
    expected result table, or (if you can't do that) to make a local copy
    of the result as quickly as possible. Remembering that the row cache
    is per-cursor, per-connection, increasing the size of the cache(s) may
    force restructuring your application to limit the number of open
    connections.]

    I don't expose the DBMS to the applications. Remember, everything is
    an object -- backed by object servers.

    So, an application (service) developer decides that he needs some
    persistent object(s) to service client requests. He develops the
    object server(s)... and then the *persistent* object server(s), as
    appropriate. His server will interact with the "persistent store"
    (which I am currently implementing as an RDBMS) and hide all of the
    details of that interaction from his clients.

    A client may want to store a "video recording" or a "programming event" or
    a name/address. In order to do so, he would need a handle to the server
    for said object(s) and the capability that allows the ".store()" method to
    be invoked. Or, ".retrieve()" if he wanted to initialize the object from
    its persistent state (assuming he has permission to do so).

    This eliminates the need to expose ACLs in the database to clients and
    gives finer-grained control over access. E.g., I can create a ".save()"
    method that only lets you store the state ONCE (because I know you
    should only *need* to store it once and if you mistakenly try to store
    it twice, you've effectively violated an invariant).

    To make the server developers' lives easier, I assume that all
    "persistent store" interactions will be atomic in nature (hence the
    interpretation of "transaction" in this light). Each "interaction"
    with the store effectively looks like you waited on a global lock,
    then did all of your work, then released the lock AND DIDN'T LOOK
    BACK (until the method was done).

    I get that behavior currently because I only have a single session with
    the DBMS; client requests are effectively serialized, regardless of
    how complex they are. They can only finish or abort; no partial results.
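
    A skeleton of what a server-side method ends up looking like
    (invented names; the real thing sits behind the proxy, capabilities,
    etc., and rollback on failure is elided for brevity):

        #include <errno.h>
        #include <stdbool.h>

        /* Invented stand-ins for the persistent-store plumbing.           */
        typedef struct store_txn store_txn;
        extern store_txn *store_begin(void);           /* serialized       */
        extern int store_put(store_txn *, const void *blob, unsigned len);
        extern int store_commit(store_txn *);          /* all-or-nothing   */

        struct persistent_obj {
            bool saved;          /* enforces the "store it ONCE" invariant */
            void *state;
            unsigned len;
        };

        /* .save() -- permitted exactly once; a second attempt is treated
           as a violated invariant, not silently tolerated.                */
        int obj_save(struct persistent_obj *o)
        {
            if (o->saved)
                return -EPERM;        /* you shouldn't have asked twice    */

            store_txn *t = store_begin();  /* looks like one global lock   */
            if (t == NULL)
                return -EAGAIN;
            if (store_put(t, o->state, o->len) != 0)
                return -EIO;          /* nothing partial becomes visible   */
            if (store_commit(t) != 0) /* "lock" released only when done    */
                return -EIO;

            o->saved = true;
            return 0;
        }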

    [When I decide how to implement the "store", I will reexamine how
    the "proxy" is implemented]

    Because I can "accept" and "return" objects of variable size (e.g.,
    megabytes, if necessary), there's no need to try to trim the
    interface to something more efficient. E.g., if you ask for an
    "AddressBook" instead of an "Address", then you may end up with
    a block of thousands of "Addresses" that require a boatload of memory.
    <shrug> If you didn't want this, then don't ASK for it!

    [Note that the fine-grained access controls mean I can be very selective
    as to who I let do what. It's likely that "you" don't need access to
    entire address books.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to George Neuner on Mon May 30 23:08:22 2022
    On 5/28/2022 10:14 PM, George Neuner wrote:
    On Thu, 26 May 2022 17:39:26 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    There are mechanisms that allow:
    - a node to look for work and "pull" it onto itself
    - a node to look for workers and "push" work onto them
    - a node to look at loads and capacities and MOVE work around
    - all of the above but with the nodes contributing "suggestions" to
    some other agent(s) that make the actual changes

    [for later]

    In my approach, looking for work is just "looking for a load
    that you can *assume* -- from another! You already know there is
    data FOR THAT TASK to process. And, the TEXT+DATA to handle
    that is sitting there, waiting to be "plucked" (no need to
    download from the persistent store -- which might have been
    powered down to conserve power).

    The goal isn't to "help" a process that is running elsewhere
    but, rather, to find a better environment for that process.

    That's a form of "work stealing". But usually the idea is to migrate
    the task to a CPU which has [for some definition] a lighter load:
    "lighter" because it's faster, or because it has more memory, etc.
    Typically this leaves out the small fry.

    The role of the (local) workload manager is to ensure the (local)
    node can meet the requirements of the loads THAT IT CHOOSES TO HOST.
    If the node finds itself unable to service the current working set,
    then it has to find a way to shed some of that load.

    [Note that I could, alternately, have adopted a policy that the
    workload manager should try to ACQUIRE as many loads as possible;
    or, that the workload manager should "mind everyone ELSE'S business",
    etc.]

    Ideally, it can find another workload manager (another node) that
    has "noticed" that *its* load is undersized for the resource set
    that *it* has. I.e., *it* has AVAILABLE processing capability. It
    cares nothing about the relative capabilities of other nodes,
    just "spare capacity" (i.e., a "small" node could potentially have
    more unused capacity than a "large" node)

    Think of it as a design problem where you "suddenly" are given
    a particular set of applications that you must deploy on the
    least amount of hardware.

    E.g., you wouldn't (?) develop a thermostat for heating and
    another one for cooling and sell them as separate devices.
    (and yet another for evaporative cooling -- though the latter
    is usually -- foolishly -- handled by a separate device!)
    And, yet another "weather module" that lets the thermostat
    "system" adapt to current weather conditions. And, another
    "forecast module" that attempts to predict future trends...

    Instead, you would try to combine the software *and* hardware
    to offer more (obvious!) value to the user.

    But, that may not be possible/practical -- you can't predict
    what the needs of those "future" modules will be at the time
    you deploy the "initial offering".

    The workload managers conspire to move the task from the node
    that is overburdened to the underburdened one. The task in
    question just sees a delay in its "running" (time slice) just
    like any execution discontinuity caused by a task switch -- who
    knows how long you'll be sleeping while the scheduler arranges
    for some other task to execute! ("Gee, I woke up on a different
    node! Imagine that! *BUT*, at least I woke up! No one's
    heard a peep from BackYardMonitor in *days*!!!")

    If the workload manager can't find a way to "cooperatively" shed
    some of its load, then it will have to kill off some local tasks
    until the remaining task set is runnable.

    Or, sheddable. (isn't it fun mutilating language?)

    [It's not quite this simple as there are also system-wide
    optimizations in place so key services can survive, even if
    they happen to be colocated on an overloaded node]

    Of course, it is possible that it will shed "enough" to leave
    some excess capacity after the killed task(s) is gone. In which
    case, it can potentially offer those resources to other workload
    managers to address *their* needs.
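
    In (very) rough outline, each instance just iterates something like
    this (invented names; none of the real policy):

        #include <stdbool.h>

        /* Invented hooks -- stand-ins for the real mechanisms.            */
        extern unsigned missed_deadlines_recently(void);
        extern bool have_spare_capacity(void);
        extern int  pick_task_to_shed(void);        /* -1 if none suitable */
        extern bool find_taker(int task);           /* another manager     */
        extern void migrate(int task);              /* push it there       */
        extern void kill_task(int task);            /* last resort         */
        extern void advertise_spare_capacity(void);

        void workload_manager_step(void)
        {
            if (missed_deadlines_recently() > 0) {   /* resource shortage  */
                int victim = pick_task_to_shed();
                if (victim >= 0) {
                    if (find_taker(victim))
                        migrate(victim);             /* cooperative shed   */
                    else
                        kill_task(victim);           /* uncooperative shed */
                }
            } else if (have_spare_capacity()) {
                advertise_spare_capacity();          /* let others push    */
            }
        }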

    It's not a static problem -- just like design suffers continual
    revision ("We added some memory for XYZ -- so, now we might also
    be able to add feature ABC that we had previously ruled out...")

    The TS model is both more competitive and less ruthless - idle (or
    lightly loaded) nodes don't steal from one another, but rather nodes
    that ARE doing the same kind of processing compete for the input data
    sequence. More capable nodes get more done - but not /necessarily/ at
    the expense of less capable nodes [that depends on the semantics of
    the tuple services].

    Very different.

    There's no "theft" involved. If node X can handle its workload,
    then node X retains that workload. It's not trying to *balance*
    loads. It's not eager to take on node Y's responsibilities.

    If node Y finds itself with nothing to do, then it can power itself
    down (unless it is offering hardware services to <something> -- in
    which case, it doesn't have "nothing" to do.)

    The *policy* over how/when loads move is set in the algorithms,
    not in the mechanism. As I stated above (preserved from previous post),
    how a task gets from A to B is up to the current state and implementation
    of the workload manager(s). Originally, it was a single process that
    snooped on everyone and "dictated" what should "go" where. But, this
    has obvious faults despite the possibility of making "better" decisions.

    Node A can notice that it is overloaded and request node B (whom it has
    noticed as being underloaded) to assume a particular task (that node A
    chose).

    Or, node A can *push* the task onto node B.

    Or, node C can act as "executor".

    etc. This depends on the "capabilities" that were granted to each
    workload manager wrt the other "actors" in the mix.

    Pathological behaviors are possible; it's easy to get into a situation
    where tasks are just shuffled back and forth instead of actually
    being *worked*. Or, tasks A & B *both* realizing task C's excess
    capacity could benefit themselves -- who gets to make use of it?

    The point of local workload managers is to have a better feel for the
    needs of the *locally* running process (i.e., it is easier to ask the
    local kernel for details about a local process -- and the handles it
    has open, time spent waiting, etc. -- than it is to ask a remote kernel
    about details of some remote process). As well as monitoring the
    overall performance of the local node.

    It minimizes the need for "global" data that workload managers
    on other nodes would otherwise need to access to make global
    decisions.

    I can create a redundant cloud service *locally*. But, if the cloud is
    *remote*, then I'm dependent on the link to the outside world to
    gain access to that (likely redundant) service.

    Yes. But you wouldn't choose to do that. If someone else chooses to
    do it, that is beyond your control.

    In effect, having tasks (and their data) sitting on particular nodes
    acts as a distributed local cloud. The memory that they reside in is
    entirely local instead of "off node". But, each has an opportunity to
    run "where it is" instead of just "sitting", parked (waiting), in a
    cloud.

    Remember, a node doesn't know what the tasks residing on it will do
    in the next microsecond. So, all decisions have to be made under
    present conditions.

    [I can "give advice" about what has historically happened with a
    particular task/service but can't make guarantees. This is likely
    much better than the *developer* could specify at install time!]

    If you've integrated that "cloud service" into your design, then
    you really can't afford to lose that service. It would be like
    having the OS reside somewhere beyond my control.

    I can tolerate reliance on LOCAL nodes -- because I can exercise
    some control over them. ...

    So, using a local cloud would mean providing that service on
    local nodes in a way that is reliable and efficient.

    Letting workload managers on each node decide how to shuffle
    around the "work" gives me the storage (in place) that the
    cloud affords (the nodes are currently DOING the work!)

    But it means that even really /tiny/ nodes need some variant of your
    "workload manager". It's simpler just to look for something to do
    when idle.

    No. Any node can be "empowered" (given suitable capabilities) to
    gather the data that it needs and make the decisions as to where
    loads should be pulled/pushed. So, another node can make
    decisions for a node that can't make decisions for itself.
    Likewise, a node can make "personal observations" about what
    it would like to see happen -- yet leave the decision (and
    activation) to someone else.

    But, the possibility of moving the load relies on the node
    having support for the full RTOS, not some skimpy, watered down
    framework that just "reads a thermistor and pushes an ASCII string
    down a serial comm line".

    My design precludes such "tiny" nodes on the assumption that supporting
    an encrypted network connection already sets a pretty high bar for
    resources and horsepower. If you want to read a thermistor with a
    PIC and send data serially to <something>, then build a proxy *in* a
    "big" node that hides all of the details of that interface. Treat
    the PIC as "on-board hardware" and bind that functionality to the
    hosting node.

    [After all, no one cares if the thermistor interface is a local A/DC
    or a serial port or a current loop...]
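
    The proxy itself is trivial -- on the order of this sketch (framing,
    retries and the object/capability wrapper all glossed over; the
    "publish" hook is invented):

        #include <stdio.h>
        #include <unistd.h>

        /* The PIC squirts lines like "T=23.4\n" up the serial link; the
           proxy turns that into the same temperature reading any
           on-board A/DC would have produced.                              */
        extern void publish_temperature(double celsius);  /* invented hook */

        void thermistor_proxy(int serial_fd)
        {
            char line[64];
            ssize_t n = read(serial_fd, line, sizeof line - 1);
            if (n <= 0)
                return;               /* link down: "hardware" unavailable */
            line[n] = '\0';

            double celsius;
            if (sscanf(line, "T=%lf", &celsius) == 1)
                publish_temperature(celsius);   /* nobody downstream knows
                                                   (or cares) that it came
                                                   over a serial line      */
        }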

    This is how I "hide" the RDBMS, presently. I have no desire to port
    all of that code to run on my RTOS. But, I can build a little node
    that has an in-facing network port to talk to the rest of the system
    AND an out-facing network port to talk to a *PC* running the RDBMS.
    The "glue layers" run in that node while the DB runs in the PC.
    No one cares as long as it gives the appearance of an integrated node.

    In my case, the worst dependency lies in the RDBMS. But, its loss
    can be tolerated if you don't have to access data that is ONLY
    available on the DB.

    [The switch is, of course, the extreme example of a single point failure]

    E.g., if you have the TEXT image for a "camera module" residing
    on node 23, then you can use that to initialize the TEXT segment
    of the camera on node 47. If you want to store something on the
    DBMS, it can be cached locally until the DBMS is available. etc.

    Yes. Or [assuming a network connection] the DBMS could be made
    distributed so it always is available.

    Exactly.

    I can provide a redundant camera watching the front door to
    handle the case where the INTENDED camera is "unavailable".
    But, that is a never-ending set of what-ifs so it's easier
    just to draw the line and say "when THIS breaks, then THAT
    service is unavailable".

    Hopefully, you take enough steps to ensure this isn't a widespread
    event (but, nothing to stop a guy with a baseball bat from
    taking a whack at one of the exterior cameras! Or, walking off with
    the weather station hardware. Or, turning off the municipal water
    feed at the meter. Or, ...)

    We're not trying to go to the moon...

    [Consider the legacy alternative to this; a bunch of little
    "automation islands" that possibly can't even share data!
    What happens if someone takes a baseball bat to your garage
    door opener? Or, your thermostat? Or...]

    Processing of your camera video above is implicitly checkpointed with
    every frame that's completed (at whatever stage). It's a perfect
    situation for distributed TS.

    But that means the post processing has to happen WHILE the video is
    being captured. I.e., you need "record" and "record-and-commercial-detect"
    primitives. Or, to expose the internals of the "record" operation.

    Not at all. What it means is that recording does not produce an
    integrated video stream, but rather a sequence of frames. The frame
    sequence then can be accessed by 'commercial-detect' which consumes[*]
    the input sequence and produces a new output sequence lacking those
    frames which represent commercial content. Finally, some other little
    program could take that commercial-less sequence and produce the
    desired video stream.

    But there's no advantage to this if the "commercial detect" is going
    to be done AFTER the recording. I.e., you're storing the recording,
    frame by frame, in the cloud. I'm storing it on a "record object"
    (memory or DB record). And, in a form that is convenient for
    a "record" operation (without concern for the "commercial detect"
    which may not be used in this case).

    Exploiting the frame-by-frame nature only makes sense if you're
    going to start nibbling on those frames AS they are generated.

    No, frame by frame makes sense regardless. The video, AS a video,
    does not need to exist until a human wants to watch. Until that time
    [and even beyond] it can just as well exist as a frame sequence. The
    only penalty for this is a bit more storage, and storage is [well,
    before supply chains fell apart and the current administration
    undertook to further ruin the economy, it was] decently cheap.

    The CODEC being used (and the nature of the video stream)
    coupled with the available resources determines what "makes sense".
    I "process" video until a buffer is full ("memory object") and
    then pass that to the "recorder" wired to the output of the
    process. The recorder accumulates as many memory objects as
    it considers necessary (based on how it is encoding the video)
    and processes them when there's enough for the coder. These
    (memory objects) then get passed on to the storage device
    (which, if it was a disk, would deal with sectors of data,
    regardless of the "bounds" on the memory objects coming into it.)

    Remember, these can reside on different nodes or be colocated on
    the same node. So, the resource requirements have to be adaptable
    based on the resources that are available, at the time.

    In this way, the resources used by each "agent" can be varied
    as the resources available dictate. E.g., if there's gobs of
    (unclaimed) RAM on a node, then fill it before passing it
    on to the next stage. If the available memory is curtailed,
    then pass along smaller chunks. (more comms overhead per
    unit of memory)

    [Remember, that "passing" the memory may take time -- if it is
    going off-node. So, you need to still have "working memory"
    around to continue your processing while waiting for the
    RMI to complete (which will free the physical memory that
    was "sent"). Call by value semantics.]
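
    Roughly (invented names again -- the point is just the sizing
    decision and the need for a working buffer while the "sent" one is
    still in flight):

        #include <stddef.h>

        /* Invented plumbing for the sketch.                               */
        extern size_t unclaimed_ram(void);              /* bytes free here */
        extern void *alloc_memory_object(size_t len);
        extern void rmi_pass(void *obj, size_t len);    /* call-by-value:
                                                           local copy freed
                                                           once delivered  */

        #define MIN_CHUNK (64u * 1024u)
        #define MAX_CHUNK (8u * 1024u * 1024u)

        /* Gobs of free RAM -> big chunks (less comms overhead per byte);
           tight on RAM -> smaller chunks (more overhead, less pressure).  */
        static size_t choose_chunk(void)
        {
            size_t budget = unclaimed_ram() / 4;  /* leave room for others */
            if (budget < MIN_CHUNK) return MIN_CHUNK;
            if (budget > MAX_CHUNK) return MAX_CHUNK;
            return budget;
        }

        void stage_step(void (*fill)(void *buf, size_t len))
        {
            size_t len = choose_chunk();
            void *chunk = alloc_memory_object(len);
            if (chunk == NULL)
                return;                  /* throttle: try again later      */

            fill(chunk, len);        /* do the work in the new buffer ...  */
            rmi_pass(chunk, len);    /* ... then hand it downstream; its
                                        physical memory is reclaimed only
                                        when the RMI completes             */
        }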

    Snapshots of a single process /may/ be useful for testing or debugging
    [though I have doubts about how much]. I'm not sure what purpose they
    really can serve in a production environment. After all, you don't
    (usually) write programs /intending/ for them to crash.

    Stop thinking about crashes. That implies something is "broken".

    If an RT task (hard or soft) misses its deadline, something IS
    "broken" [for some definition]. Resetting the task automagically to
    some prior - hopefully "non-broken" - state is not helpful if it masks
    the problem.

    It's an open system so there are no schedulability guarantees.
    Nor assurances of "adequate resources".

    You install a first person shooter on your PC. The hardware meets
    the "minimum requirements" for the game.

    But, you're also rendering 3D CAD models WHILE you want to play the
    game.

    What are your "expectations"? Why? <grin> Is the 3D CAD app crappy?
    Or, the 1P shooter? Ans: the PC is the problem and *you* for not
    realizing the limitations it imposes on your use of it. But, how
    "savvy" should I expect you to be when making these decisions?

    This is what makes it an interesting application domain; it's
    relatively easy to design a CLOSED RT system but all bets are off if
    you let something else "in". Will *your* portions of the system
    continue to work properly while the addition(s) suffer? Or, will
    yours suffer to benefit the addition(s)? Or, will EVERYTHING shit
    the bed?

    Consider, if I (authoring one of those "additions") claim that my
    application is of "supreme" priority (expressed in whatever scheduling
    criteria you choose), then where does that leave *you*? And, what about
    Ben who comes along after me?? What's to stop HIM from trying to one-up
    *my* "requirements specification" (to ensure his app runs well!).

    ["Let's make everything louder than everything else"]

    You can either let things gag-on-demand (you may not be able to predict
    when these conflicts will occur, /a priori/) *or* let the system exercise
    some control over the load(s) and resource(s) in an attempt to BELIEVE
    EVERY CLAIM. (I.e., if they are *all* of supreme importance, then perhaps
    I should bring on as much excess capacity as possible to ensure they
    can all be satisfied?!)

    Any application that is *inherently* resumable (e.g., my archive
    service) can benefit from an externally imposed snapshot that
    is later restored. The developer is the one who is qualified to
    make this assessment, not the system (it lacks the heuristics
    though, arguably, could "watch" an application to see what handles
    it invokes).

    I guess the question here is: can the developer say "snapshot now!" or
    is it something that happens periodically, or even just best effort?

    If the developer wants to checkpoint his code, then he would know
    what he *needs* to preserve and could take explicit action to do
    so. And, take deliberate action when restarted to go check those
    checkpoints to sort out where to resume.

    My goal is to avoid that for the class of applications that are
    (largely) inherently resumable.

    [Again, I keep coming back to the archive software; inherent in
    its design is the knowledge that it may not be allowed to run
    to completion -- because that can be MANY many hours (how long does
    it take to compute hashes of a few TB of files?)! So, it takes
    little bites at the problem and stores the results AS it computes
    them (instead of caching lots of results and updating the
    archive's state, in one fell swoop, "later")]
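
    The shape of it, stripped of the real bookkeeping (the hash and the
    "store" calls here are stand-ins):

        #include <stddef.h>

        /* Invented stand-ins for the real pieces.                         */
        extern int  next_unhashed_file(char *path, size_t len); /* 0=done  */
        extern void hash_file(const char *path, unsigned char digest[32]);
        extern void store_result(const char *path,
                                 const unsigned char digest[32]);
        extern int  should_yield(void);      /* time to wind down?         */

        void archive_pass(void)
        {
            char path[4096];
            unsigned char digest[32];

            /* Little bites: each file's result is persisted as soon as it
               is computed, so being suspended/killed at ANY point loses at
               most the file currently in progress.                        */
            while (next_unhashed_file(path, sizeof path)) {
                hash_file(path, digest);
                store_result(path, digest);    /* persists NOW, not later  */

                if (should_yield())
                    return;        /* fine -- everything so far is durable */
            }
        }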

    Similarly, the system can't know the nature of a task's deadlines.
    But, can provide mechanisms to make use of that deadline data
    FROM the developer. If you don't provide it, then I will
    assume your deadline is at T=infinity... and your code will
    likely never be scheduled! :> (if you try to game it, your
    code also may never be scheduled: "That deadline is too soon
    for me to meet it so lets not bother trying!")

    We've had this conversation before: the problem is not that deadlines
    can't be scheduled for ... it's that the critical deadlines can't
    all necessarily be enumerated.

    :
    to get D by T3, I need C by T2
    to get C by T2, I need B by T1
    to get B by T1, I need A by T0
    :

    ad regressus, ad nauseam.

    Every "task" (in the "job" sense of the word) has to have a deadline,
    if it is RT. Scheduling decisions are based on those deadlines.
    The current "run set" also depends on them (a deadline that has
    passed means it is silly to keep that task in the run set! A
    deadline that is sufficiently far in the future need not consume
    resources, now)
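
    Mechanically, the run-set test isn't much more than this (sketch;
    "HORIZON" and the time units are placeholders for whatever the
    scheduler actually uses):

        #include <stdbool.h>
        #include <stdint.h>

        struct task {
            uint64_t deadline;   /* in system-time units; effectively
                                    "infinity" if the developer never
                                    declared one                           */
            bool runnable;
        };

        #define HORIZON 1000u    /* how far ahead we bother scheduling     */

        /* Does this task belong in the current run set?                   */
        bool in_run_set(const struct task *t, uint64_t now)
        {
            if (t->deadline <= now)            /* already missed: silly to */
                return false;                  /* keep it in the run set   */
            if (t->deadline > now + HORIZON)   /* far future / "infinity": */
                return false;                  /* needn't consume resources */
            return t->runnable;
        }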

    Specifying deadlines can be ... difficult. Esp when you want
    actual "times" on those events.

    But, increasingly, anything that interacts with a user effectively
    becomes an RT task. Users don't have infinite patience. I may not
    care if you water the roses now or later. But, next *month* is
    unacceptable! Likewise, if I'm backing into the driveway, the
    garage door needs to be open -- completely -- before I drive
    into it! (if you want to open it a few seconds before it MUST
    be open, that's fine with me; but a few seconds AFTER means a
    fair bit of cost in car/door repairs and HOURS before necessary
    has other downsides) I'd like the commercials removed from the
    OTA broadcast before I watch it -- but, exactly WHEN will that be?

    Folks "learned" (?? "had no choice!") to be patient with PC apps.
    If unhappy with the "performance" of their PC, they could buy a
    newer/faster/bigger one -- and *defer* their DISsatisfaction
    (but, dissatisfaction is inevitable).

    Smart phones have put limits on that patience as folks are often
    not in "convenient waiting places" when using them. They expect near
    instant gratification -- regardless of the effort involved by the app.

    OTOH, they have even less patience with *appliances* (you expect
    your TV to turn on when you press the ON/OFF button, not 30 seconds
    later!). How long would you wait for your stove to "boot"?

    [C's vehicle spends 15+ seconds "booting" during which time, many of
    the controls are inoperable -- despite giving the APPEARANCE that they
    are ready. E.g., you can HOLD a button pressed and it simply won't see
    it. Amusing as there's no reason the system can't be "ready" to run
    instead of having to reinitialize itself each time! (I've noticed a
    distinct "switching transient" in the backup camera -- as one often
    puts the car in reverse VERY SOON after starting it. As NOT having the
    camera available would be folly -- oops! -- it appears that they hard-wire
    the video feed to the display while the processor is getting its shit
    together. Then, when the processor is ready to be an active intermediary
    in that process, the video switches over to video processed by the CPU.)]

    So, the adage that *everything* is RT -- at some point -- appears to
    have come to pass. The trick, then, is trying to sort out the
    user's tolerance for particular delays (by noticing his reactions
    to them) and deadline misses.

    AIs.

    And yes, a single explicit deadline can proxy for some number of
    implicit deadlines. That's not the point and you know it <grin>.


    George


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)