[Apologies if you see this in several public/private forums]
I'm trying to settle an argument among folks using my current
codebase wrt how (atypical) error conditions are handled:
returning an error code vs. raising an exception.
The error code camp advocates for that approach as it allows
the error to be handled where it arises (is detected).
The exception camp claims these are truly "exceptional" conditions
(that is debatable, especially if you assume "exceptional" means
"infrequent") and should be handled out of the normal flow
of execution.
The error code camp claims processing the exception is more tedious
(depends on language binding) AND easily overlooked/not-considered
(like folks failing to test malloc() for NULL).
The exception camp claims the OS can install "default" exception handlers
to address those cases where the developer was lazy or ignorant. And,
that no such remedy is possible if error codes are processed inline.
In a perfect world, a developer would deal with ALL of the possibilities
laid out by the API.
But, we have to live with imperfect people... <frown>
I can implement either case with similar effort (even a compile/run-time
switch) but would prefer to just "take a stand" and be done with it...
[Apologies if you see this in several public/private forums]
I'm trying to settle an argument among folks using my current
codebase wrt how (atypical) error conditions are handled:
returning an error code vs. raising an exception.
The error code camp advocates for that approach as it allows
the error to be handled where it arises (is detected).
The exception camp claims these are truly "exceptional" conditions
(that is debatable, especially if you assume "exceptional" means
"infrequent") and should be handled out of the normal flow
of execution.
I don't think the manner in which the error gets reported is as
important as /separating/ error returns from data returns. IMNSHO it
is a bad mistake for a single value sometimes to represent good data
and sometimes to represent an error.
The locality argument I think too often is misunderstood: it's
definitely not a "source" problem - you can place CATCH code as close
to or as far away from the TRY code as you wish.
The real problem with exceptions is that most compilers turn the CATCH
block into a separate function. Then because exceptions are expected
to be /low probability/ events, the CATCH function by default will be
treated as a 'cold code' path. In the best case it will be in the
same load module but separated from the TRY code. In the worst case
it could end up in a completely different load module.
Typically the TRY block also is turned into a separate function - at
least initially - and then (hopefully) it will be recognized as being
a unique function with a single call point and be inlined again at
some later compilation stage [at least if it satisfies the inlining
metrics].
With a return value, conditional error code - even if never used - is
likely to have at least been prefetched into cache [at least with
/inline/ error code]. Exception handling code, being Out-Of-Band by
design, will NOT be anywhere in cache unless it was used recently.
This being comp.realtime, it matters also how quickly exceptions can
be recognized and dispatched. Dispatch implementation can vary by how
many handlers are active in the call chain, and also by distances
between TRY and CATCH blocks ... this tends not to be documented very
well (if at all) so you need to experiment and see what your compiler
does with various structuring.
[Certain popular compilers have pragmas to mark code as 'hot' or
'cold' with the intent to control prefetch. You can mark CATCH blocks
'hot' to keep them near(er) their TRY code, or conversely, you can
mark conditional inline error handling as 'cold' to avoid having it prefetched because you expect it to be rarely used.
(I have never seen any code deliberately marked 'cold'. 8-)]
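With GCC and Clang, for instance, that marking is done with function
attributes and branch hints. A rough sketch (the attributes and builtin
are real GCC/Clang extensions; try_operation()/log_error() are just
placeholders):

extern int  try_operation(void);
extern void log_error(int err);

__attribute__((cold))                      /* compiler may move this out of line,  */
static void handle_rare_error(int err)     /* e.g. into a .text.unlikely section   */
{
    log_error(err);
}

int do_work(void)
{
    int err = try_operation();
    if (__builtin_expect(err != 0, 0)) {   /* hint: the error branch is unlikely */
        handle_rare_error(err);
        return err;
    }
    return 0;
}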
The error code camp claims processing the exception is more tedious
(depends on language binding) AND easily overlooked/not-considered
(like folks failing to test malloc() for NULL).
The exception camp claims the OS can install "default" exception handlers
to address those cases where the developer was lazy or ignorant. And,
that no such remedy is possible if error codes are processed inline.
Depends on the language. Some offer exceptions that /must/ be handled
or the code won't compile.
And some, like Scala, expect that exception handling will be used for
general flow of control (like IF/THEN), and so they endeavor to make
the cost of exception handing as low as possible.
In a perfect world, a developer would deal with ALL of the possibilities
laid out by the API.
But, we have to live with imperfect people... <frown>
I can implement either case with similar effort (even a compile/run-time
switch) but would prefer to just "take a stand" and be done with it...
The answer depends on what languages you are using and what your
compiler(s) can do.
Hi George,
Hope you are well and surviving this "global inconvenience"? :>
On 5/10/2022 2:49 PM, George Neuner wrote:
I don't think the manner in which the error gets reported is as
important as /separating/ error returns from data returns. IMNSHO it
is a bad mistake for a single value sometimes to represent good data
and sometimes to represent an error.
Agreed. I return tuples/structs so the "intended result" can be
separated from the "status of the invocation". This (hopefully)
reinforces the *need* (?) for the developer to explicitly check
the "status" before assuming the "result" makes sense (or that
the operation actually completed).
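Roughly like this (a minimal sketch, not the actual API -- the names are
hypothetical): the status travels alongside the result instead of being
encoded *in* it.

enum status_t { SUCCESS, FAILURE };

struct lookup_result {
    status_t status;    // outcome of the invocation itself
    int      value;     // meaningful only when status == SUCCESS
};

lookup_result lookup(int key)       // stand-in for a real service call
{
    if (key < 0) return { FAILURE, 0 };
    return { SUCCESS, key * 2 };
}

int caller()
{
    lookup_result r = lookup(21);
    if (r.status != SUCCESS)        // nothing *forces* this check...
        return -1;
    return r.value;                 // ...but data can no longer masquerade as a status
}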
However, there is nothing that ensures the developer will actually
*examine* the status returned. Or, that he will handle ALL of the
potential status codes possible!
:
But, it precludes use of the return value directly in an expression
(which clutters up the code). [exceptions really go a long way towards
cleaning this up as you can "catch" the SET of exceptions at the end
of a block instead of having to explicitly test each "status" returned.]
OTOH, the notion of "status" is portable to all (?) language bindings
(whereas exceptions require language support or "other mechanisms"
to implement consistently)
The locality argument I think too often is misunderstood: it's
definitely not a "source" problem - you can place CATCH code as close
to or as far away from the TRY code as you wish.
Likewise with "status codes". But, you tend not to want to write
much code without KNOWING that the code you've hoped to have executed
actually DID execute (as intended). I.e., you wouldn't want to invoke
several such functions (potentially relying on the results of earlier
function calls) before sorting out if you've already stubbed your toe!
There are just too many places where you can get an "unexpected" status
returned, not just an "operation failed".
And, with status codes, you need to examine EACH status code individually
instead of just looking for, e.g., *any* INSUFFICIENT_PERMISSION exception
or HOST_NOT_AVAILABLE, etc. thrown by any of the functions/methods in the
TRY block.
[You also know that the exception abends execution before any subsequent
functions/methods are invoked; no need to *conditionally* execute function2
AFTER function1 has returned "uneventfully"]
I.e., the code just ends up looking a lot "cleaner". More functionality in
a given amount of paper.
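To make the contrast concrete, a sketch (the step functions and exception
types are hypothetical; the point is the shape of the two styles, not the
names):

#include <stdexcept>

enum status_t { SUCCESS, FAILURE /* , ... */ };
status_t step1(); status_t step2(); status_t step3();      // status-returning flavor

struct insufficient_permission : std::runtime_error { using std::runtime_error::runtime_error; };
struct host_not_available      : std::runtime_error { using std::runtime_error::runtime_error; };
void step1t(); void step2t(); void step3t();                // throwing flavor

// Status style: every call needs its own check before going on.
status_t do_job_with_status()
{
    status_t s;
    if ((s = step1()) != SUCCESS) return s;
    if ((s = step2()) != SUCCESS) return s;
    if ((s = step3()) != SUCCESS) return s;
    return SUCCESS;
}

// Exception style: one handler per *class* of problem covers the whole
// sequence, and a throw in step1t() means step2t()/step3t() never run.
void do_job_with_exceptions()
{
    try {
        step1t();
        step2t();
        step3t();
    } catch (const insufficient_permission&) {
        // remedy for any permission problem in the block
    } catch (const host_not_available&) {
        // remedy for any connectivity problem in the block
    }
}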
The real problem with exceptions is that most compilers turn the CATCH
block into a separate function. Then because exceptions are expected
to be /low probability/ events, the CATCH function by default will be
treated as a 'cold code' path. In the best case it will be in the
same load module but separated from the TRY code. In the worst case
it could end up in a completely different load module.
I think part of the problem is that there is an expectation that
exceptions will be rare events. This leads to a certain laziness
on the developer's part (akin to NOT expecting malloc() to fail
"very often"). This is especially true if synthesizing those events
(in test scaffolding) is difficult or not repeatable.
But, there are cases where exceptions can be as common as "normal
execution". Part of the problem is defining WHAT is an "exception"
and what is a "failure" (bad choices of words) return.
[Is "Name not found" an error? Or, an exception? A lot depends
on the mindset of the developer -- if he is planning on only
searching for names that he knows/assumes to exist, then the "failure"
suggests something beyond his original conception has happened.]
I can implement either case with similar effort (even a compile/run-time
switch) but would prefer to just "take a stand" and be done with it...
The answer depends on what languages you are using and what your
compiler(s) can do.
I think the bigger problem is the mindset(s) of the developers.
Few have any significant experience programming using sockets
(which seems to be the essence of the problems my folks are having).
So, are unaccustomed to the fact that the *mechanism* has a part
to play in the function invocation. It's not just a simple BAL/CALL
that you can count on the CPU being able to execute.
Add to this the true parallelism in place AND the distributed
nature (of my project) and it's just hard to imagine that an
object that you successfully accessed on line 45 of your code
might no longer be accessible on line 46; "Through no fault
(action) of your own". And, that the "fault" might lie in
the "invocation mechanism" *or* the actions of some other agency
in the system.
For example,
if (SUCCESS != function()) ...
doesn't mean function() "failed"; it could also mean function() was
never actually executed! Perhaps the object on which it operates
is no longer accessible to you, or you are attempting to use a
method that you don't have permission to use, or the server that
was backing it is now offline, or... (presumably, you will invoke
different remedies in each case)
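I.e., something along these lines (a sketch only; the status names mirror
the ones mentioned above, and function() stands in for any remote
invocation):

enum status_t { SUCCESS, FAILURE, INVALID_OBJECT,
                INSUFFICIENT_PERMISSION, SERVER_OFFLINE };

status_t function();

void caller()
{
    switch (function()) {
    case SUCCESS:
        break;                      // result is usable
    case FAILURE:
        break;                      // the function ran and reported failure
    case INVALID_OBJECT:
        break;                      // rebind/re-resolve the object, maybe retry
    case INSUFFICIENT_PERMISSION:
        break;                      // no point retrying; report it
    case SERVER_OFFLINE:
        break;                      // back off, retry later, or degrade
    }
}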
So, either make these "other cases" (status possibilities) more
visible OR put in some default handling that ensures a lazy developer
shoots himself in the foot in a very obvious way.
[For user applets, the idea of a default handler makes lots of sense;
force their code to crash if any of these "unexpected" situations arise]
Hope you are well and surviving this "global inconvenience"? :>
I am well. The "inconvenience" is inconvenient.
Hope you are the same. 8-)
However, there is nothing that ensures the developer will actually
*examine* the status returned. Or, that he will handle ALL of the
potential status codes possible!
Which is the attraction of exceptions - at least in those languages
which force the programmer to declare what exceptions may be thrown
and explicitly handle them in calling code.
Of course, they all allow installing a generic "whatever" handler at
the top level, so the notion of 'required' is dubious at best ... but
most languages having exceptions don't require anything and simply
abort the program if an unhandled exception occurs.
:
But, it precludes use of the return value directly in an expression
(which clutters up the code). [exceptions really go a long way towards
cleaning this up as you can "catch" the SET of exceptions at the end
of a block instead of having to explicitly test each "status" returned.
If the value might be data or might be error then the only expression
you /could/ return it to would be have to some kind of conditional.
I.e., the code just ends up looking a lot "cleaner". More functionality in >> a given amount of paper.
Yes, exception code generally gives better (source visual) separation
between the 'good' path and the 'error' path.
The question really is "how far can errors be allowed to propagate?"
If an error /must/ be handled close to where it occurs, there is
little point to using exceptions.
Exceptions really start to make sense from the developer POV when a significant [for some metric] amount of code can be made (mostly) free
from inline error handling, OR when there is a large set of possible
errors that can be grouped meaningfully: e.g., instead of parsing
bitfields or writing an N-way switch on the error value, you can write
one or a few exception handlers each of which deals just with some
subset of possible errors.
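For instance, with a small exception hierarchy (hypothetical names), one
handler can cover a whole family of errors instead of an N-way switch on
codes:

#include <stdexcept>

struct comms_error      : std::runtime_error { using std::runtime_error::runtime_error; };
struct host_unreachable : comms_error        { using comms_error::comms_error; };
struct link_reset       : comms_error        { using comms_error::comms_error; };
struct permission_error : std::runtime_error { using std::runtime_error::runtime_error; };

void transfer();   // may throw any of the above

void run()
{
    try {
        transfer();
    } catch (const comms_error&) {
        // one handler for *any* communications problem
    } catch (const permission_error&) {
        // a different remedy for authorization problems
    }
}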
But, there are cases where exceptions can be as common as "normal
execution". Part of the problem is defining WHAT is an "exception"
and what is a "failure" (bad choices of words) return.
Terminology always has been a problem: errors are not necessarily 'exceptional', and exceptional conditions are not necessarily
'errors'. And there is no agreement among developers on the best way
to handle either situation.
There always has been some amount of confusion and debate among
developers about what place exceptions can occupy in various
programming models.
I can implement either case with similar effort (even a compile/run-time
switch) but would prefer to just "take a stand" and be done with it...
The answer depends on what languages you are using and what your
compiler(s) can do.
I think the bigger problem is the mindset(s) of the developers.
Few have any significant experience programming using sockets
(which seems to be the essence of the problems my folks are having).
So, are unaccustomed to the fact that the *mechanism* has a part
to play in the function invocation. It's not just a simple BAL/CALL
that you can count on the CPU being able to execute.
Sockets are a royal PITA - many possible return 'conditions', and many
which are (or should be) non-fatal 'just retry it' issues from the POV
of the application.
If you try (or are required by spec) to enumerate and handle all of
the possibilities, you end up with many pages of handler code.
BTDTGTTS.
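The classic shape of that code, for POSIX send() (the errno grouping here
is illustrative, not an exhaustive enumeration):

#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>

ssize_t send_all(int fd, const char* buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n >= 0) { sent += static_cast<size_t>(n); continue; }
        if (errno == EINTR)
            continue;                   // interrupted: harmless, just retry
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            continue;                   // not ready: retry (better: poll/select first)
        return -1;                      // ECONNRESET, EPIPE, ...: give up / report
    }
    return static_cast<ssize_t>(sent);
}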
Add to this the true parallelism in place AND the distributed
nature (of my project) and it's just hard to imagine that an
object that you successfully accessed on line 45 of your code
might no longer be accessible on line 46; "Through no fault
(action) of your own". And, that the "fault" might lie in
the "invocation mechanism" *or* the actions of some other agency
in the system.
For example,
if (SUCCESS != function()) ...
doesn't mean function() "failed"; it could also mean function() was
never actually executed! Perhaps the object on which it operates
is no longer accessible to you, or you are attempting to use a
method that you don't have permission to use, or the server that
was backing it is now offline, or... (presumably, you will invoke
different remedies in each case)
So everything has to be written as asynchronous event code. I'd say
"so what", but there is too much evidence that a large percentage of programmers have a lot of trouble writing asynch code.
I find it curious because Javascript is said to have more programmers
than ALL OTHER popular languages combined. Between browsers and
'node.js', it's available on almost any platform. And in Javascript
all I/O is asynchronous.
You'd think there would be lots of programmers able to deal with
asynch code (modulo learning a new syntax) ... but it isn't true.
The parallel issues just compound the problem. Most developers have
problems with thread or task parallelism - never mind distributed.
Protecting less-capable programmers from themselves is one of the
major design goals of the virtual machine 'managed' runtimes - JVM,
CLR, etc. - and of the languages that target them. They often
sacrifice (sometimes significant) performance for correct operation.
The real problem is, nobody ever could make transactional operating
systems [remember IBM's 'Quicksilver'?] performant enough to be
competitive even for normal use.
So, either make these "other cases" (status possibilities) more
visible OR put in some default handling that ensures a lazy developer
shoots himself in the foot in a very obvious way.
[For user applets, the idea of a default handler makes lots of sense;
force their code to crash if any of these "unexpected" situations arise]
It's easy enough to have a default exception handler that kills the application, just as the default signal handler does in Unix. But
that doesn't help developers very much.
Java, at least, offers 'checked' exceptions that must be handled -
either by catching or /deliberate/ propagation ... or else the
program build will fail. If you call a function that throws checked exceptions and don't catch or propagate, the code won't compile. You
can propagate all the way to top level - right out of the compilation
module - but if a checked exception isn't handled somewhere, the
program will fail to link.
The result, of course, is that relatively few Java developers use
checked exceptions extensively.
There are some other languages that work even harder to guarantee that exceptions are handled, but they aren't very popular.
On 5/12/2022 10:26 AM, George Neuner wrote:
In my world, as the system is open (in a dynamic sense), you can't
really count on anything being static. A service that you used "three
statements earlier" may be shut down before your latest invocation.
Sure, a notification is "in the mail" -- but, you may not have received
(or processed) it, yet.
Do "you" end up faulting as a result? What are *you* serving??
I think the bigger problem is the mindset(s) of the developers.
Few have any significant experience programming using sockets
(which seems to be the essence of the problems my folks are having).
So, are unaccustomed to the fact that the *mechanism* has a part
to play in the function invocation. It's not just a simple BAL/CALL
that you can count on the CPU being able to execute.
Sockets are a royal PITA - many possible return 'conditions', and many
which are (or should be) non-fatal 'just retry it' issues from the POV
of the application.
If you try (or are required by spec) to enumerate and handle all of
the possibilities, you end up with many pages of handler code.
BTDTGTTS.
What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
It's just sloppy engineering. You *know* what the system design puts
in place. You KNOW what guarantees you have -- and DON'T have. Why
code in ignorance of those realities? Then, be surprised when the
events that the architecture was designed to tolerate come along and
bite you in the ass?!
Add to this the true parallelism in place AND the distributed
nature (of my project) and it's just hard to imagine that an
object that you successfully accessed on line 45 of your code
might no longer be accessible on line 46; "Through no fault
(action) of your own". And, that the "fault" might lie in
the "invocation mechanism" *or* the actions of some other agency
in the system.
For example,
if (SUCCESS != function()) ...
doesn't mean function() "failed"; it could also mean function() was
never actually executed! Perhaps the object on which it operates
is no longer accessible to you, or you are attempting to use a
method that you don't have permission to use, or the server that
was backing it is now offline, or... (presumably, you will invoke
different remedies in each case)
So everything has to be written as asynchronous event code. I'd say
"so what", but there is too much evidence that a large percentage of
programmers have a lot of trouble writing asynch code.
The function calls are still synchronous. ...
... But, the possibility that the
*mechanism* may fault has to be addressed. It's no longer a bi-valued
status result: SUCCESS vs. FAILURE. Instead, it's SUCCESS vs. FAILURE
(as returned by the ftn), INVALID_OBJECT, INSUFFICIENT_PERMISSION,
RESOURCE_SHORTAGE, etc.
But, the developer isn't thinking about any possibility other than
SUCCESS/FAILURE.
[And a fool who tests for != FAILURE -- thinking that implies SUCCESS -- will
get royally bitten!]
Protecting less-capable programmers from themselves is one of the
major design goals of the virtual machine 'managed' runtimes - JVM,
CLR, etc. - and of the languages that target them. They often
sacrifice (sometimes significant) performance for correct operation.
The real problem is, nobody ever could make transactional operating
systems [remember IBM's 'Quicksilver'?] performant enough to be
competitive even for normal use.
I don't sweat the performance issue; I've got capacity up the wazoo!
But, I don't think the tools are yet available to deal with "multicomponent"
systems, like this. People are still accustomed to dealing with specific
devices for specific purposes.
I refuse to believe future systems will consist of oodles of "dumb devices"
talking to a "big" controller that does all of the real thinking. It just
won't scale.
OTOH, there may be a push towards overkill in terms of individual "appliances"
and "wasting resources" when those devices are otherwise idle. It's possible
as things are getting incredibly inexpensive! (but I can't imagine folks
won't want to find some use for those wasted resources. e.g., SETI-ish)
So, either make these "other cases" (status possibilities) more
visible OR put in some default handling that ensures a lazy developer
shoots himself in the foot in a very obvious way.
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for most
to use reliably -- but is a great "shortcut" for *my* work!
Sorry for the delay ... lotsa stuff going on.
What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
It's just sloppy engineering. You *know* what the system design puts
in place. You KNOW what guarantees you have -- and DON'T have. Why
code in ignorance of those realities? Then, be surprised when the
events that the architecture was designed to tolerate come along and
bite you in the ass?!
On modern Linux, malloc (almost) never does return NULL.
By default, Linux allocates logical address space, not physical space
(ie. RAM or SWAP pages) ... physical space isn't reserved until you
actually try to use the corresponding addresses. In a default
configuration, nothing prevents you from malloc'ing more space than you
actually have. Unless the request exceeds the total possible address
space, it won't fail.
Then too, Linux has this nifty (or nasty, depending) OutOfMemory
service which activates when the real physical space is about to be overcommitted. The OOM service randomly terminates running programs
in an attempt to free space and 'fix' the problem. Of course, it is
NOT guaranteed to terminate the application whose page reservation
caused the overcommit.
If you actually want to know whether malloc failed - and keep innocent programs running - you need to disable the OOM service and change the
system allocation policy so that it provides *physically backed*
address space rather than simply logical address space.
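A quick way to see this (hedged: the exact behavior depends on the
vm.overcommit_memory sysctl and the machine; the 64 GiB figure below is
arbitrary):

#include <cstdio>
#include <cstdlib>

// On a typical default Linux configuration this allocation is likely to
// "succeed" even if 64 GiB exceeds RAM + swap, because only address space
// is handed out; touching the pages later is what can invoke the OOM
// killer. With strict accounting (sysctl vm.overcommit_memory=2) the
// malloc itself is what fails instead.
int main()
{
    size_t huge = 64ULL << 30;                       // 64 GiB
    char* p = static_cast<char*>(std::malloc(huge));
    if (!p) {
        std::puts("malloc failed (strict accounting, 32-bit build, ...)");
        return 1;
    }
    std::puts("malloc 'succeeded' -- address space only, not yet backed");
    // std::memset(p, 0, huge);   // faulting it all in is where trouble starts
    std::free(p);
    return 0;
}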
... But, the possibility that the
*mechanism* may fault has to be addressed. It's no longer a bi-valued
status result: SUCCESS vs. FAILURE. Instead, it's SUCCESS vs. FAILURE
(as returned by the ftn), INVALID_OBJECT, INSUFFICIENT_PERMISSION,
RESOURCE_SHORTAGE, etc.
But, the developer isn't thinking about any possibility other than
SUCCESS/FAILURE.
[And a fool who tests for != FAILURE -- thinking that implies SUCCESS -- will
get royally bitten!]
The funny thing is that many common APIs follow the original C custom
of returning zero for success, or (not necessarily negative numbers,
but) non-zero codes for a failure. Though not as prevalent, some APIs
also feature multiple notions of 'success'.
[Of course, some do use a different value for 'success', but zero is
the most common.]
In a lot of cases you can consider [in C terms] FAILURE == (!SUCCESS).
At least as a 1st approximation.
Anyone much beyond 'newbie' /should/ realize this and should study the
API to find out what is (and is not) realistic.
Protecting less-capable programmers from themselves is one of the
major design goals of the virtual machine 'managed' runtimes - JVM,
CLR, etc. - and of the languages that target them. They often
sacrifice (sometimes significant) performance for correct operation.
The real problem is, nobody ever could make transactional operating
systems [remember IBM's 'Quicksilver'?] performant enough to be
competitive even for normal use.
I don't sweat the performance issue; I've got capacity up the wazoo!
But, I don't think the tools are yet available to deal with "multicomponent"
systems, like this. People are still accustomed to dealing with specific
devices for specific purposes.
I refuse to believe future systems will consist of oodles of "dumb devices"
talking to a "big" controller that does all of the real thinking. It just
won't scale.
There has been a fair amount of research into so-called 'coordination' languages ... particularly for distributed systems. The problem, in
general, is that it means yet another (typically declarative) language
for developers to learn, and yet another toolchain to master.
One unsolved problem is whether it is better to embed coordination or
to impose it. The 'embed' approach is typified by MPI and various (Linda-esque) 'tuple-space' systems.
The 'impose' approach typically involves using a declarative language
to specify how some group of processes will interact. The spec then
is used to generate frameworks for the participating processes.
Typically these systems are designed to create actively monitored
groups, and processes are distinguished as being 'compute' nodes or 'coordinate' nodes.
[But there are some systems that can create dynamic 'self-organizing' groups.]
Research has shown that programmers usually find embedded coordination
easier to work with ... when the requirements are fluid it's often
difficult to statically enumerate the needed interactions and get
their semantics correct - which typically results in longer
development times for imposed methods. But it's also easier to F_ up
using embedded methods because of the ad hoc nature of their growth.
OTOH, there may be a push towards overkill in terms of individual "appliances"
and "wasting resources" when those devices are otherwise idle. Its possible >> as things are getting incredibly inexpensive! (but I can't imagine folks
won't want to find some use for those wasted resources. e.g., SETI-ish)
Not 'wasting' per se, but certainly there is a strong trend toward
spending of some resources to make programmers' lives easier. Problem
comes when it starts to make programming /possible/ rather than simply 'easier'.
So, either make these "other cases" (status possibilities) more
visible OR put in some default handling that ensures a lazy developer
shoots himself in the foot in a very obvious way.
The problem with forcing exceptions to always be handled is that the
code can become cluttered even if the exception simply is propagated
up the call chain. Most languages default to propagating rather than
deliberately requiring a declaration of intent, and so handling of
propagated exceptions can be forgotten.
In general there is no practical way to force a result code to be
examined. It is relatively easy for a compiler to require that a
function's return value always be caught - but without a lot of extra
syntax it is impossible to distinguish a 'return value' from a 'result
code', and impossible to guarantee that every possible return code is enumerated and dealt with.
And when essentially any data value can be thrown as an exception you
invite messes like:
try {
:
throw( some_integer );
:
} catch (int v) {
:
}
which - even with an enum of possible codes - really is no more useful
than a return code.
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for most
to use reliably -- but is a great "shortcut" for *my* work!
Not sure how far you can go without a lot of effort ... C++ is very permissive wrt exception handling: an unhandled exception will
terminate the process, but there's no requirement to enumerate what exceptions code will throw, and you always can
try {
:
} catch (...) {
}
to ignore exceptions entirely.
On 5/15/2022 2:11 PM, George Neuner wrote:
What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
It's just sloppy engineering. You *know* what the system design puts
in place. You KNOW what guarantees you have -- and DON'T have. Why
code in ignorance of those realities? Then, be surprised when the
events that the architecture was designed to tolerate come along and
bite you in the ass?!
On modern Linux, malloc (almost) never does return NULL.
By default, Linux allocates logical address space, not physical space
(ie. RAM or SWAP pages) ... physical space isn't reserved until you
actually try to use the corresponding addresses. In a default
Ditto here. You can force the pages to be mapped, wire them down *or*
let them be mapped-as-referenced.
But, each task has strict resource limits so, eventually, attempting
to allocate additional memory will exceed your quota and the OS will
throw an exception. (if you can allocate to your heart's content, then
what's to stop you from faulting in all of those pages?)
If you actually want to know whether malloc failed - and keep innocent
programs running - you need to disable the OOM service and change the
system allocation policy so that it provides *physically backed*
address space rather than simply logical address space.
I suspect most (many?) Linux boxen also have secondary storage to
fall back on, in a pinch.
In my case, there's no spinning rust... and any memory objects used
to extend physical memory (onto other nodes) is subject to being
killed off just like any other process.
[It's a really interesting programming environment when you can't
count on anything remaining available! Someone/something can always
be more important than *you*!]
The problem with many of these "conditions" is that they are hard
to simulate and can botch *any* RMI. "Let's assume the first one succeeds
(or fails) and now assume the second encounters an invalid object exception..."
The real problem is, nobody ever could make transactional operating
systems [remember IBM's 'Quicksilver'?] performant enough to be
competitive even for normal use.
I don't sweat the performance issue; I've got capacity up the wazoo!
But, I don't think the tools are yet available to deal with "multicomponent"
systems, like this. People are still accustomed to dealing with specific
devices for specific purposes.
I refuse to believe future systems will consist of oodles of "dumb devices"
talking to a "big" controller that does all of the real thinking. It just
won't scale.
The problem is expecting "after-the-sale" add-ons to accurately adopt
such methodologies; you don't have control over those offerings.
And, many developers can't/won't bother with the "out-of-the-ordinary"
conditions.
If the RDBMS is not available to capture your updated speech models,
will you bother trying to maintain them locally -- for the next
time they are required (instead of fetching them from the RDBMS at
that time)? Or, will you just drop them and worry about whether or
not the RDBMS is available when *next* you need it to source the data?
OTOH, if your algorithm inherently relies on the fact that those
updates will be passed forward (via the RDBMS), then your future
behavior will suffer (or fail!) when you've not adequately handled
that !SUCCESS.
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for
most to use reliably -- but is a great "shortcut" for *my* work!
:
The hope was to provide a friendlier environment for dealing
with the exceptions (my C exception framework is brittle).
But, I think the whole OOPS approach is flawed for what I'm
doing. It's too easy/common for something that the compiler
never sees (is completely unaware of) to alter the current
environment in ways that will confuse the code that it
puts in place.
E.g., the object that you're about to reference on the next line
no longer exists. Please remember to invoke its destructor
(or, never try to reference it, again; or, expect exceptions
to be thrown for every reference to it; or...)
[consider the case of an object that has such an object embedded
within it]
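In plain C++ terms (not the framework's actual API), the closest idiom is
a weak reference that is re-validated at every use; a minimal sketch:

#include <memory>

struct Service { void invoke() {} };      // hypothetical local proxy/object

void use(std::weak_ptr<Service> handle)
{
    if (auto s = handle.lock()) {         // still alive *right now*
        s->invoke();
    } else {
        // the backing object went away; pick a remedy
    }
    // note: this only helps within one address space -- it says nothing
    // about a *remote* object disappearing mid-call
}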
On Mon, 16 May 2022 03:42:33 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
On 5/15/2022 2:11 PM, George Neuner wrote:
What I'm seeing is folks who "think malloc never (rarely?) returns NULL".
It's just sloppy engineering. You *know* what the system design puts
in place. You KNOW what guarantees you have -- and DON'T have. Why
code in ignorance of those realities? Then, be surprised when the
events that the architecture was designed to tolerate come along and
bite you in the ass?!
On modern Linux, malloc (almost) never does return NULL.
By default, Linux allocates logical address space, not physical space
(ie. RAM or SWAP pages) ... physical space isn't reserved until you
actually try to use the corresponding addresses. In a default
Ditto here. You can force the pages to be mapped, wire them down *or*
let them be mapped-as-referenced.
Well yes, but not what I meant. You can go to the lower level mmap(2)
and do whatever you want ... but the standard library malloc/calloc
behaves according to a system-wide allocation policy which can only be changed by an administrator.
But, each task has strict resource limits so, eventually, attempting
to allocate additional memory will exceed your quota and the OS will
throw an exception. (if you can allocate to your heart's content, then
what's to stop you from faulting in all of those pages?)
Well, you can impose certain limits via ulimit(3) or the shell command
of the same name. But ulimit is only for physical space.
If you actually want to know whether malloc failed - and keep innocent
programs running - you need to disable the OOM service and change the
system allocation policy so that it provides *physically backed*
address space rather than simply logical address space.
I suspect most (many?) Linux boxen also have secondary storage to
fall back on, in a pinch.
Actually no. The overwhelming majority of Linux instances are hosted
VMs - personal workstations and on-metal servers all together account
for just a tiny fraction. And while the majority of VMs certainly DO
have underlying writable storage, typically they are configured (at
least initially) without any swap device.
Of course, an administrator can create a swap /file/ if necessary. But
that's the easy part.
The larger issue is that, while usually not providing a swap device,
most cloud providers also forget to change the default swap and memory overcommit behavior. By default, the system allows unbacked address allocation and also tries to keep a significant portion (about ~30%)
of physical memory free.
There are dozens of system parameters that affect the allocation and out-of-memory behavior ... unless you are willing to substantially
over provision memory, the system has to be carefully configured to
run well with no swap. Among the most common questions you'll see in
Linux support forums are
- what the *&^%$ is 'OOM' and why is it killing my tasks?
- how do I change the OOM policy?
- how do I change the swap policy?
- how do I change the memory overcommit policy?
In my case, there's no spinning rust... and any memory objects used
to extend physical memory (onto other nodes) is subject to being
killed off just like any other process.
Nowadays a large proportion of storage is SSD ... no spinning rust ...
but I understand. <grin>
[It's a really interesting programming environment when you can't
count on anything remaining available! Someone/something can always
be more important than *you*!]
The problem with many of these "conditions" is that they are hard
to simulate and can botch *any* RMI. "Let's assume the first one succeeds
(or fails) and now assume the second encounters an invalid object exception..."
Which is why a (hierarchical) transactional model is so appealing. The problem is providing /comprehensive/ system-wide support [taking into
account the multiple different failure modes], and then teaching
programmers to work with it.
The real problem is, nobody ever could make transactional operating
systems [remember IBM's 'Quicksilver'?] performant enough to be
competitive even for normal use.
I don't sweat the performance issue; I've got capacity up the wazoo!
But, I don't think the tools are yet available to deal with "multicomponent"
systems, like this. People are still accustomed to dealing with specific >>>> devices for specific purposes.
And specific services. The problem comes when you are dealing /simultaneously/ with multiple things, any of which can fail or
disappear at any moment.
I refuse to believe future systems will consist of oodles of "dumb devices"
talking to a "big" controller that does all of the real thinking. It just >>>> won't scale.
And, pretty much, nobody else believes it either.
So-called 'edge' computing largely is based on distributed tuple-space
models specifically /because/ they are (or can be) self-organizing and
are temporally decoupled: individual devices can come and go at will,
but the state of ongoing computations is maintained in the fabric.
The problem is expecting "after-the-sale" add-ons to accurately adopt
such methodologies; you don't have control over those offerings.
There's no practical way to guarantee 3rd party add-ons will be
compatible with an existing system. Comprehensive documentation,
adherence to external standards, and reliance on vendor-provided tools
only can go so far.
And, many developers can't/won't bother with the "out-of-the-ordinary"
conditions.
If the RDBMS is not available to capture your updated speech models,
will you bother trying to maintain them locally -- for the next
time they are required (instead of fetching them from the RDBMS at
that time)? Or, will you just drop them and worry about whether or
not the RDBMS is available when *next* you need it to source the data?
OTOH, if your algorithm inherently relies on the fact that those
updates will be passed forward (via the RDBMS), then your future
behavior will suffer (or fail!) when you've not adequately handled
that !SUCCESS.
Now you're conflating 'policy' with 'mechanism' ... which is where
most [otherwise reasonable] ideas go off the rails.
If you're worried that a particular service will be a point of failure
you always can massively over-design and over-provision it such that
failure becomes statistically however unlikely you desire.
The end result of such thinking is that every service becomes
distributed, self-organizing, and self-repairing, and every node
features multiply redundant hardware (in case there's a fire in the
'fire-proof' vault), and node software must be able to handle any and
every contingency for any possible program in case the (multiply
redundant) fabric fails.
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for
most to use reliably -- but is a great "shortcut" for *my* work!
:
The hope was to provide a friendlier environment for dealing
with the exceptions (my C exception framework is brittle).
But, I think the whole OOPS approach is flawed for what I'm
doing. It's too easy/common for something that the compiler
never sees (is completely unaware of) to alter the current
environment in ways that will confuse the code that it
puts in place.
Well, hypothetically, use of objects could include compiler generated fill-in-the-blank type error handling. But that severely restricts
what languages [certainly what compilers] could be used and places a
high burden on developers to provide relevant error templates.
E.g., the object that you're about to reference on the next line
no longer exists. Please remember to invoke its destructor
(or, never try to reference it, again; or, expect exceptions
to be thrown for every reference to it; or...)
[consider the case of an object that has such an object embedded
within it]
There's no particularly good answer to that: the problem really is not
OO - the same issues need to be addressed no matter what paradigm is
used. Rather the problem is that certain languages lack the notion of 'try-finally' to guarantee cleanup of local state (in whatever form
that state exists).
Another useful syntactic form is one that both opens AND closes the
object [for some definition of 'open' and 'close'] so the object can
be made static/global.
Neither C nor C++ supports 'try-finally'. As a technical matter, all
the modern compilers actually DO offer extensions with more-or-less equivalent functionality - but use of the extensions is non-portable.
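The portable C++ substitute is RAII: a destructor runs on every exit path
(normal return, early return, or exception), which buys the same cleanup
guarantee as try-finally. A minimal scope-guard sketch (the resource
functions are hypothetical):

#include <utility>

void* acquire_resource();
void  release_resource(void*);
void  do_risky_work(void*);          // may throw

template <typename F>
class scope_guard {
    F cleanup_;
public:
    explicit scope_guard(F f) : cleanup_(std::move(f)) {}
    ~scope_guard() { cleanup_(); }
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
};

void demo()
{
    void* res = acquire_resource();
    auto release = [&]{ release_resource(res); };
    scope_guard<decltype(release)> guard(release);

    do_risky_work(res);              // throw or return -- the release still happens
}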
On 5/19/2022 3:42 PM, George Neuner wrote:
:
Now, we're looking at dynamically distributed applications where
the code runs "wherever". And, (in my case) can move or "disappear"
from one moment to the next.
I think folks just aren't used to considering "every" function (method)
invocation as a potential source for "mechanism failure". They don't
recognize a ftn invocation as potentially suspect.
I.e., you can avoid all of these problems -- by implementing everything
yourself, in your own process space. Then, the prospect of "something"
disappearing is a non-issue -- YOU "disappear" (and thus never know
you're gone!)
So-called 'edge' computing largely is based on distributed tuple-space
models specifically /because/ they are (or can be) self-organizing and
are temporally decoupled: individual devices can come and go at will,
but the state of ongoing computations is maintained in the fabric.
But (in the embedded/RT system world) they are still "devices" with
specific functionalities. We're not (yet) accustomed to treating
"processing" as a resource that can be dispatched as needed. There
are no mechanisms where you can *request* more processing (beyond
creating another *process* and hoping <something> recognizes that
it can co-execute elsewhere)
I have provisions for apps to request redundancy where the RTOS will
automatically checkpoint the app/process and redispatch a process
after a fault. But, that comes at a cost (to the app and the system)
and could quickly become a crutch for sloppy developers. Just
because your process is redundantly backed, doesn't mean it is free
of design flaws.
:
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for
most to use reliably -- but is a great "shortcut" for *my* work!
The hope was to provide a friendlier environment for dealing
with the exceptions (my C exception framework is brittle).
But, I think the whole OOPS approach is flawed for what I'm
doing. It's too easy/common for something that the compiler
never sees (is completely unaware of) to alter the current
environment in ways that will confuse the code that it
puts in place.
Well, hypothetically, use of objects could include compiler generated
fill-in-the-blank type error handling. But that severely restricts
what languages [certainly what compilers] could be used and places a
high burden on developers to provide relevant error templates.
I thought of doing that with the stub generator. But, then I would
need a different binding for each client of a particular service.
And, it would mean that a particular service would ALWAYS behave
a certain way for a particular task; you couldn't expect one type of
behavior on line 24 and another on 25.
How you expose the "exceptional behaviors" is the issue. There's
a discrete boundary between objects, apps and RTOS so anything that
relies on blurring that boundary runs the risk of restricting future
solutions, language bindings, etc.
The problem is still in the mindsets of the developers. Until they actively
embrace the possibility of these RMI failing, they won't even begin to consider
how to address that possibility.
[That's why I suggested the "automatically insert a template after each
invocation" -- to remind them of each of the possible outcomes in a
very hard to ignore way!]
I work with a lot of REALLY talented people so I am surprised that this
is such an issue. They understand the mechanism. They have to understand
that wires can break, devices can become disconnected, etc. AND that this
can happen AT ANY TIME (not just "prior to POST"). So, why no grok?
[OTOH, C complains that I fail to see the crumbs I leave on the counter
each time I prepare her biscotti: "How can you not SEE them??" <shrug>]
I had a similar problem trying to get them used to having two different
notions of "time": system time and wall time. And, the fact that they
are entirely different schemes with different rules governing their
behavior. I.e., if you want to do something "in an hour", then say
"in an hour" and use the contiguous system time scheme. OTOH, if you
want to do something at 9PM, then use the wall time. Don't expect any
correlation between the two!
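For what it's worth, the same split exists in std::chrono terms:
steady_clock behaves like the contiguous "system time" (never set back,
never jumps), system_clock like "wall time" (can be stepped by NTP, the
user, DST rules...). A small sketch:

#include <chrono>

void examples()
{
    using namespace std::chrono;

    // "in an hour": an elapsed interval -- steady clock
    auto deadline = steady_clock::now() + hours(1);
    (void)deadline;          // later: if (steady_clock::now() >= deadline) ...

    // "at 9 PM": a calendar instant -- wall clock
    auto now_wall = system_clock::now();
    (void)now_wall;          // turning this into "today, 21:00" needs
                             // localtime()/C++20 calendars; elided here
}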
Now, we're looking at dynamically distributed applications where
the code runs "wherever". And, (in my case) can move or "disappear"
from one moment to the next.
I think folks just aren't used to considering "every" function (method)
invocation as a potential source for "mechanism failure". They don't
recognize a ftn invocation as potentially suspect.
Multiuser databases have been that way since ~1960s: you have to code
with the expectation that /anything/ you try to do in the database may
fail ... not necessarily due to error, but simply because another
concurrent transaction is holding some resource you need. Your
transaction may be arbitrarily delayed, or even terminated: in the
case of deadlock, participating transactions are killed one by one
until /some/ one of them [not necessarily yours] is able to complete.
You have to be ready at all times to retry operations or do something
else reasonable with their failure.
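The usual shape of that code is a retry loop around the whole transaction;
a sketch (the exception type and the transaction body are hypothetical
stand-ins for whatever the DB API actually reports):

#include <functional>
#include <stdexcept>

struct transaction_aborted : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// body() is expected to begin / do work / commit, and to throw
// transaction_aborted if the DB kills or times out the transaction.
bool with_retry(const std::function<void()>& body, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            body();
            return true;                 // committed
        } catch (const transaction_aborted&) {
            // lost a lock fight or was the deadlock victim: everything was
            // rolled back, so it is safe to simply try the whole thing again
        }
    }
    return false;                        // give up; do "something else reasonable"
}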
I.e., you can avoid all of these problems -- by implementing everything
yourself, in your own process space. Then, the prospect of "something"
disappearing is a non-issue -- YOU "disappear" (and thus never know
you're gone!)
Which is how things were done in ye old days. <grin>
So-called 'edge' computing largely is based on distributed tuple-space
models specifically /because/ they are (or can be) self-organizing and
are temporally decoupled: individual devices can come and go at will,
but the state of ongoing computations is maintained in the fabric.
But (in the embedded/RT system world) they are still "devices" with
specific functionalities. We're not (yet) accustomed to treating
"processing" as a resource that can be dispatched as needed. There
are no mechanisms where you can *request* more processing (beyond
creating another *process* and hoping <something> recognizes that
it can co-execute elsewhere)
The idea doesn't preclude having specialized nodes ... the idea is
simply that if a node crashes, the task state [for some approximation]
is preserved "in the cloud" and so can be restored if the same node
returns, or the task can be assumed by another node (if possible).
It often requires moving code as well as data, and programs need to be written specifically to regularly checkpoint / save state to the
cloud, and to be able to resume from a given checkpoint.
The "tuple-space" aspect specifically is to coordinate efforts by
multiple nodes without imposing any particular structure or
communication pattern on participating nodes ... with appropriate TS
support many different communication patterns can be accommodated simultaneously.
I have provisions for apps to request redundancy where the RTOS will
automatically checkpoint the app/process and redispatch a process
after a fault. But, that comes at a cost (to the app and the system)
and could quickly become a crutch for sloppy developers. Just
because your process is redundantly backed, doesn't mean it is free
of design flaws.
"Image" snapshots are useful in many situations, but they largely are impractical to communicate through a wide-area distributed system.
[I know your system is LAN based - I'm just making a point.]
For many programs, checkpoint data will be much more compact than a
snapshot of the running process, so it makes more sense to design
programs to be resumed - particularly if you can arrange that reset of
a faulting node doesn't eliminate the program, so code doesn't have to
be downloaded as often (or at all).
Even if the checkpoint data set is enormous, it often can be saved incrementally. You then have to weigh the cost of resuming, which
requires the whole data set be downloaded.
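A toy illustration of how small checkpoint state can be compared to a
process image (the file name, format, and the "work" itself are made up):

#include <cstdio>
#include <cstdint>

struct checkpoint {
    uint64_t next_item;     // first item not yet processed
    uint64_t partial_sum;   // result accumulated so far
};

static bool save(const checkpoint& c)
{
    std::FILE* f = std::fopen("job.ckpt", "wb");
    if (!f) return false;
    bool ok = std::fwrite(&c, sizeof c, 1, f) == 1;
    std::fclose(f);
    return ok;
}

static bool load(checkpoint& c)
{
    std::FILE* f = std::fopen("job.ckpt", "rb");
    if (!f) return false;               // no checkpoint: start from scratch
    bool ok = std::fread(&c, sizeof c, 1, f) == 1;
    std::fclose(f);
    return ok;
}

int main()
{
    checkpoint c{0, 0};
    load(c);                            // resume if a previous run left state
    while (c.next_item < 1000) {
        c.partial_sum += c.next_item;   // the "work" for this item
        ++c.next_item;                  // advance *before* saving
        if (c.next_item % 100 == 0) save(c);
    }
    save(c);
    std::printf("sum = %llu\n", (unsigned long long)c.partial_sum);
    return 0;
}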
:
I'm starting on a C++ binding. That's C-ish enough that it won't
be a tough sell. My C-binding exception handler is too brittle for
most to use reliably -- but is a great "shortcut" for *my* work!
The hope was to provide a friendlier environment for dealing
with the exceptions (my C exception framework is brittle).
But, I think the whole OOPS approach is flawed for what I'm
doing. It's too easy/common for something that the compiler
never sees (is completely unaware of) to alter the current
environment in ways that will confuse the code that it
puts in place.
Well, hypothetically, use of objects could include compiler generated
fill-in-the-blank type error handling. But that severely restricts
what languages [certainly what compilers] could be used and places a
high burden on developers to provide relevant error templates.
I thought of doing that with the stub generator. But, then I would
need a different binding for each client of a particular service.
And, it would mean that a particular service would ALWAYS behave
a certain way for a particular task; you couldn't expect one type of
behavior on line 24 and another on 25.
I wasn't thinking of building the error handling into the interface
object, but rather a "wizard" type code skeleton inserted at the point
of use. You couldn't do it with the IDL compiler ... unless it also generated skeleton for clients as well as for servers - which is not
typical.
I suppose you /could/ have the IDL generate different variants of the
same interface ... perhaps in response to a checklist of errors to be
handled provided by the programmer when the stub is created.
But then to allow for different behaviors in the same program, you
might need to generate multiple variants of the same interface object
and make sure to use the right one in the right place. Too much
potential for F_up there.
[Also, IIRC, you are based on CORBA? So potentially a resource drain
given that just the interface object in the client can initiate a
'session' with the server.]
How you expose the "exceptional behaviors" is the issue. There's
a discrete boundary between objects, apps and RTOS so anything that
relies on blurring that boundary runs the risk of restricting future
solutions, language bindings, etc.
The problem really is that RPC tries to make all functions appear as
if they are local ... 'local' meaning "in the same process".
At least /some/ of the "blurring" you speak of goes away in languages
like Scala where every out-of-process call - e.g., I/O, invoking
functions from a shared library, messaging another process, etc. - all
are treated AS IF RPC to a /physically/ remote server, regardless of
whether that actually is true. If an unhandled error occurs for any out-of-process call, the process is terminated.
[Scala is single threaded, but its 'processes' are very lightweight,
more like 'threads' in other systems.]
The problem is still in the mindsets of the developers. Until they actively
embrace the possibility of these RMI failing, they won't even begin to consider
how to address that possibility.
[That's why I suggested the "automatically insert a template after each
invocation" -- to remind them of each of the possible outcomes in a
very hard to ignore way!]
I work with a lot of REALLY talented people so I am surprised that this
is such an issue. They understand the mechanism. They have to understand
that wires can break, devices can become disconnected, etc. AND that this
can happen AT ANY TIME (not just "prior to POST"). So, why no grok?
I didn't understand it until I started working seriously with DBMS.
A lot of the code I write now looks and feels transactional,
regardless of what it's actually doing. I try to make sure there are no
side effects [other than failure] and I routinely (ab)use exceptions
and Duff's device to back out of complex situations, release
resources, undo (where necessary) changes to data structures, etc.
I won't hesitate to wrap a raw service API, and create different
versions of it that handle things differently. Of course this results
in some (usually moderate) code growth - which is less a problem in
more powerful systems. I have written for small devices in the past,
but I don't do that anymore.
[OTOH, C complains that I fail to see the crumbs I leave on the counter
each time I prepare her biscotti: "How can you not SEE them??" <shrug>]
When the puppy pees on the carpet, everyone develops indoor blindness.
<grin>
I had a similar problem trying to get them used to having two different
notions of "time": system time and wall time. And, the fact that they
are entirely different schemes with different rules governing their
behavior. I.e., if you want to do something "in an hour", then say
"in an hour" and use the contiguous system time scheme. OTOH, if you
want to do something at 9PM, then use the wall time. Don't expect any
correlation between the two!
Once I wrote a small(ish) parser, in Scheme, for AT-like time specs.
You could say things like "(AT now + 40 minutes)" or "(AT 9pm
tomorrow)", etc. and it would figure out what that meant wrt the
system clock. The result was the requested epoch expressed in seconds.
That epoch could be used directly to set an absolute timer, or to poll expiration with a simple 'now() >= epoch'.
It was implemented as a compile time macro that generated and spliced
in code at the call site to compute the desired answer at runtime.
The parser understood 12 and 24 hour clocks, names of days and months,
expressions like 'next tuesday', etc. There were a few small
auxiliary functions required to be linked into the executable to
figure out, e.g., what day it was, on-the-fly, but many typical uses
just reduced to something like '(+ (current_time) <computed_offset>)'.
The parser itself was never in the executable.
Simplified a lot of complicated clock handling.
On 5/20/2022 6:59 PM, George Neuner wrote:
Now, we're looking at dynamically distributed applications where
the code runs "wherever". And, (in my case) can move or "disappear" >>>from one moment to the next.
I think folks just aren't used to considering "every" function (method)
invocation as a potential source for "mechanism failure". They don't
recognize a ftn invocation as potentially suspect.
Multiuser databases have been that way since ~1960s: you have to code
with the expectation that /anything/ you try to do in the database may
fail ... not necessarily due to error, but simply because another
concurrent transaction is holding some resource you need. Your
transaction may be arbitrarily delayed, or even terminated: in the
case of deadlock, participating transactions are killed one by one
until /some/ one of them [not necessarily yours] is able to complete.
Yes. The "code" (SQL) isn't really executing on the client. What
you have, in effect, is an RPC; the client is telling the DBMS what to do
and then waiting on its results. Inherent in that is the fact that
there is a disconnect between the request and execution -- a *mechanism*
that can fail.
But, in my (limited) experience, DB apps tend to be relatively short
and easily decomposed into transactions. You "feel" like you've accomplished
some portion of your goal after each interaction with the DB.
By contrast, "procedural" apps tend to have finer-grained actions; you
don't get the feeling that you've made "definable" progress until you've
largely met your goal.
E.g., you'd have to sum N values and then divide the sum by N to get
an average. If any of those "adds" was interrupted, you'd not feel
like you'd got anything done. Likewise, if you did the division
to get the average, you'd likely still be looking to do something MORE
with that figure.
The DBMS doesn't have an "out" when it has summed the first M (M<N) values;
it sums them and forms the average BEFORE it can abend. Or, shits the bed
completely. There's no "half done" state.
Yes. And, that recovery can be complicated, esp if the operations up
to this point have had side effects, etc. How do you "unring the bell"?
So-called 'edge' computing largely is based on distributed tuple-space
models specifically /because/ they are (or can be) self-organizing and
are temporally decoupled: individual devices can come and go at will,
but the state of ongoing computations is maintained in the fabric.
But (in the embedded/RT system world) they are still "devices" with
specific functionalities. We're not (yet) accustomed to treating
"processing" as a resource that can be dispatched as needed. There
are no mechanisms where you can *request* more processing (beyond
creating another *process* and hoping <something> recognizes that
it can co-execute elsewhere)
The idea doesn't preclude having specialized nodes ... the idea is
I'm arguing for the case of treating each node as "specialized + generic"
and making the generic portion available for other uses that aren't
applicable to the "specialized" nature of the node (hardware).
Your doorbell sits in an idiot loop waiting to "do something" -- instead
of spending that "idle time" working on something *else* so the "device"
that would traditionally be charged with doing that something else
can get by with less resources on-board.
[I use cameras galore. Imagining feeding all that video to a single
"PC" would require me to keep looking for bigger and faster PCs!]
simply that if a node crashes, the task state [for some approximation]
is preserved "in the cloud" and so can be restored if the same node
returns, or the task can be assumed by another node (if possible).
It often requires moving code as well as data, and programs need to be
written specifically to regularly checkpoint / save state to the
cloud, and to be able to resume from a given checkpoint.
Yes. For me, all memory is wrapped in "memory objects". Each has particular
attributes (and policies/behaviors), depending on its intended use.
E.g., the TEXT resides in an R/O object ...
The DATA resides in an R/W object ...
I leverage my ability to "migrate" a task (task is resource
container) to *pause* the task and capture a snapshot of each
memory object (some may not need to be captured if they are
copies of identical objects elsewhere in the system) AS IF it
was going to be migrated.
But, instead of migrating the task, I simply let it resume, in place.
The problem with this is watching for side-effects that happen
between snapshots. I can hook all of the "handles" out of the
task -- but, no way I can know what each of those "external
objects" might be doing.
OTOH, if I know that no external references have taken place since the
last "snapshot", then I can safely restart the task from the last
snapshot.
It is great for applications that are well suited to checkpointing,
WITHOUT requiring the application to explicitly checkpoint itself.
The "tuple-space" aspect specifically is to coordinate efforts by
multiple nodes without imposing any particular structure or
communication pattern on participating nodes ... with appropriate TS
support many different communication patterns can be accommodated
simultaneously.
For many programs, checkpoint data will be much more compact than a
snapshot of the running process, so it makes more sense to design
programs to be resumed - particularly if you can arrange that reset of
a faulting node doesn't eliminate the program, so code doesn't have to
be downloaded as often (or at all).
Yes, but that requires more skill on the part of the developer.
And, makes it more challenging for him to test ("What if your
app dies *here*? Have you checkpointed the RIGHT things to
be able to recover? And, what about *here*??")
I'm particularly focused on user-level apps (scripts) where I can
build hooks into the primitives that the user employs to effectively
keep track of what they've previously been asked to do -- keeping in
mind that these will tend to be very high-levels of abstraction
(from the user's perspective).
E.g.,
At 5:30PM record localnews
At 6:00PM record nationalnews
remove_commercials(localnews)
remove_commercials(nationalnews)
when restarted, each primitive can look at the current time -- and state
of the "record" processes -- to sort out where they are in the sequence.
And, the presence/absence of the "commercial-removed" results. (obviously
you can't record a broadcast that has already ended so why even try!)
Note that the above can be a KB of "code" + "state" -- because
all of the heavy lifting is (was?) done in other processes.
Even if the checkpoint data set is enormous, it often can be saved
incrementally. You then have to weigh the cost of resuming, which
requires the whole data set be downloaded.
OTOH, if you want to do something at 9:05 -- assuming it is 9:00 now -- you
set THAT timer based on the wall time. The guarantee it gives is that
it will trigger at or after "9:05"... regardless of how many seconds elapse
between now and then!
So, if something changes the current wall time, the "in 5 minutes" timer
will not be affected by that change; it will still wait the full 300 seconds.
OTOH, the timer set for 9:05 will expire *at* 9:05. If the *current* notion
of wall time claims that it is now 7:15, then you've got a long wait ahead!
On smaller systems, the two ideas of time are often closely intertwined;
the system tick (jiffy) effectively drives the time-of-day clock. And,
"event times" might be bound at time of syscall *or* resolved late.
So, if the system time changes (can actually go backwards in some poorly
designed systems!), your notion of "the present time" -- and, with it,
your expectations of FUTURE times -- changes.
Again, it should be a simple distinction to get straight in your
head. When you're dealing with times that the rest of the world
uses, use the wall time. When you're dealing with relative times,
use the system time. And, be prepared for there to be discontinuities
between the two!
It may seem trivial but if you are allowing something to interfere with
your notion of "now", then you have to be prepared when that changes
outside of your control.
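A rough C++ illustration of the distinction, assuming a typical
implementation where duration-based sleeps run off a monotonic clock
and a system_clock-based sleep_until targets the wall clock (exact
behavior under clock changes is implementation-specific):

    #include <chrono>
    #include <thread>

    // "In 5 minutes": relative delay, measured against a steady
    // (monotonic) clock, so resetting the wall clock doesn't shorten
    // or lengthen the wait.
    void wait_five_minutes() {
        std::this_thread::sleep_for(std::chrono::minutes(5));
    }

    // "At 9:05": absolute wall-clock time; the intent is to fire at
    // that wall time, wherever the wall clock ends up.
    void wait_until_wall(std::chrono::system_clock::time_point when) {
        std::this_thread::sleep_until(when);
    }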
[I have an "atomic" clock that was off by 14 hours. WTF??? When
your day-night schedule is as freewheeling as mine, it makes a
difference if the clock tells you a time that suggests the sun is
*rising* when, in fact, it is SETTING! <frown>]
Your particular example isn't possible, but other things are -
including having values seem to appear or disappear when they are
examined at different points within your transaction.
There absolutely IS the notion of partial completion when you use
inner (ie. sub-) transactions, which can succeed and fail
independently of each other and of the outer transaction(s) in which
they are nested. Differences in isolation can permit side effects from
other ongoing transactions to be visible.
So-called 'edge' computing largely is based on distributed tuple-space
models specifically /because/ they are (or can be) self-organizing and
are temporally decoupled: individual devices can come and go at will,
but the state of ongoing computations is maintained in the fabric.
But (in the embedded/RT system world) they are still "devices" with
specific functionalities. We're not (yet) accustomed to treating
"processing" as a resource that can be dispatched as needed. There
are no mechanisms where you can *request* more processing (beyond
creating another *process* and hoping <something> recognizes that
it can co-execute elsewhere)
The idea doesn't preclude having specialized nodes ... the idea is
I'm arguing for the case of treating each node as "specialized + generic"
and making the generic portion available for other uses that aren't
applicable to the "specialized" nature of the node (hardware).
Your doorbell sits in an idiot loop waiting to "do something" -- instead
of spending that "idle time" working on something *else* so the "device"
that would traditionally be charged with doing that something else
can get by with less resources on-board.
[I use cameras galore. Imagining feeding all that video to a single
"PC" would require me to keep looking for bigger and faster PCs!]
And if frames from the camera are uploaded into a cloud queue, any
device able to process them could look there for new work. And store
its results into a different cloud queue for the next step(s). Faster
and/or more often 'idle' CPUs will do more work.
Pipelines can be 'logical' as well as 'physical': opportunistically
processed data queues qualify as 'pipeline' stages.
You often seem to get hung up on specific examples and fail to see how
the idea(s) can be applied more generally.
simply that if a node crashes, the task state [for some approximation]
is preserved "in the cloud" and so can be restored if the same node
returns, or the task can be assumed by another node (if possible).
It often requires moving code as well as data, and programs need to be
written specifically to regularly checkpoint / save state to the
cloud, and to be able to resume from a given checkpoint.
TS models produce an implicit 'sequence' checkpoint with every datum
tuple uploaded into the cloud. In many cases that sequencing is all
that's needed by external processes to accomplish the goal.
Explicit checkpoint is required to resume only when processing is so
time consuming that you /expect/ the node may fail (or be reassigned)
before completing work on its current 'morsel' of input. To avoid
REDOing lots of work - e.g., by starting over - it makes more sense to periodically checkpoint your progress.
Different meta levels.
Processing of your camera video above is implicitly checkpointed with
every frame that's completed (at whatever stage). It's a perfect
situation for distributed TS.
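A minimal sketch of that per-frame, implicitly-checkpointed style,
assuming a hypothetical tuple-space client (the take/put interface and
the queue names are invented for illustration):

    struct Frame { int seq; /* pixel data ... */ };

    struct TupleSpace {                                   // hypothetical interface
        virtual bool take(const char *queue, Frame &out) = 0;   // claim one datum
        virtual void put(const char *queue, const Frame &f) = 0;
        virtual ~TupleSpace() = default;
    };

    void transcode_worker(TupleSpace &ts) {
        Frame f;
        while (ts.take("raw-frames", f)) {      // claim one morsel of work
            Frame done = f;                     // ...transcode here...
            ts.put("mpg-frames", done);         // completing a frame IS the checkpoint
        }
    }                                           // crash mid-frame: at most one frame is redone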
Yes. For me, all memory is wrapped in "memory objects". Each has particular
attributes (and policies/behaviors), depending on its intended use.
E.g., the TEXT resides in an R/O object ...
The DATA resides in an R/W object ...
I leverage my ability to "migrate" a task (task is resource
container) to *pause* the task and capture a snapshot of each
memory object (some may not need to be captured if they are
copies of identical objects elsewhere in the system) AS IF it
was going to be migrated.
But, instead of migrating the task, I simply let it resume, in place.
The problem with this is watching for side-effects that happen
between snapshots. I can hook all of the "handles" out of the
task -- but, no way I can know what each of those "external
objects" might be doing.
Or necessarily be able to reconnect the plumbing.
OTOH, if I know that no external references have taken place since the
last "snapshot", then I can safely restart the task from the last
snapshot.
It is great for applications that are well suited to checkpointing,
WITHOUT requiring the application to explicitly checkpoint itself.
The point I was making above is that TS models implicitly checkpoint
when they upload data into the cloud. If that data contains explicit sequencing, then it can be an explicit checkpoint as well.
Obviously this depends on how you write the program and the
granularity of the data. A program like your AI that needs to
save/restore a whole web of inferences is very different from one that
when idle grabs a few frames of video and transcodes them to MPG.
The "tuple-space" aspect specifically is to coordinate efforts by
multiple nodes without imposing any particular structure or
communication pattern on participating nodes ... with appropriate TS
support many different communication patterns can be accommodated
simultaneously.
:
For many programs, checkpoint data will be much more compact than a
snapshot of the running process, so it makes more sense to design
programs to be resumed - particularly if you can arrange that reset of
a faulting node doesn't eliminate the program, so code doesn't have to
be downloaded as often (or at all).
Yes, but that requires more skill on the part of the developer.
And, makes it more challenging for him to test ("What if your
app dies *here*? Have you checkpointed the RIGHT things to
be able to recover? And, what about *here*??")
Only for resuming non-sequenced work internal to the node. Whether
you need to do this depends on the complexity of the program.
Like I said, if the work is simple enough to just do over, the
implicit checkpoint of a datum being in the 'input' queue may be
sufficient.
There are TS models that actively support the notion of 'checking out'
work, tracking who is doing what, timing-out unfinished work,
restoring 'checked-out' (removed) data, ignoring results from
timed-out workers (should the result show up eventually), etc.
The TS server is more complicated, but the clients don't have to be.
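A sketch of the 'checked-out work with a lease' acceptance rule just
described; the representation and names are hypothetical:

    struct WorkItem {
        int  id;
        long lease_expires;   // absolute deadline for the worker holding it
        int  owner;           // which worker checked it out
    };

    // Server side: accept a result only if the lease is still valid and
    // it came from the worker holding the check-out; otherwise the item
    // has been (or will be) re-queued and the stale result is ignored.
    bool accept_result(const WorkItem &w, long now, int worker) {
        return now <= w.lease_expires && worker == w.owner;
    }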
I'm particularly focused on user-level apps (scripts) where I can
build hooks into the primitives that the user employs to effectively
keep track of what they've previously been asked to do -- keeping in
mind that these will tend to be very high-levels of abstraction
(from the user's perspective).
E.g.,
At 5:30PM record localnews
At 6:00PM record nationalnews
remove_commercials(localnews)
remove_commercials(nationalnews)
when restarted, each primitive can look at the current time -- and state
of the "record" processes -- to sort out where they are in the sequence.
And, the presence/absence of the "commercial-removed" results. (obviously
you can't record a broadcast that has already ended so why even try!)
Note that the above can be a KB of "code" + "state" -- because
all of the heavy lifting is (was?) done in other processes.
Even if the checkpoint data set is enormous, it often can be saved
incrementally. You then have to weigh the cost of resuming, which
requires the whole data set be downloaded.
Well, recording is a sequential, single node process. Obviously
different nodes can record different things simultaneously.
But - depending on how you identify content vs junk - removing the commercials could be done in parallel by a gang, each of which needs
only to look at a few video frames at a time.
OTOH, if you want to do something at 9:05 -- assuming it is 9:00 now -- you
set THAT timer based on the wall time. The guarantee it gives is that
it will trigger at or after "9:05"... regardless of how many seconds elapse
between now and then!
So, if something changes the current wall time, the "in 5 minutes" timer
will not be affected by that change; it will still wait the full 300 seconds.
OTOH, the timer set for 9:05 will expire *at* 9:05. If the *current* notion
of wall time claims that it is now 7:15, then you've got a long wait ahead!
On smaller systems, the two ideas of time are often closely intertwined;
In large systems too!
If time goes backwards, all bets are off. Most systems are designed
so that can't happen unless a privileged user intervenes. System time
generally is kept in UTC and 'display' time is computed wrt system
time when necessary.
But your notion of 'wall time' seems unusual: typically it refers to
a notion of time INDEPENDENT of the computer - ie. the clock on the
wall, the watch on my wrist, etc. - not to whatever the computer may /display/ as the time.
Ie. if you turn back the wall clock, the computer doesn't notice. If
you turn back the computer's system clock, then you are an
administrator and you get what you deserve.
There are a number of monotonic time conventions, but mostly you just
work in UTC if you want to ignore local time conventions like
'daylight saving' that might result in time moving backwards. Network time-set protocols never move the local clock backwards: they adjust
the length of the local clock tick such that going forward the local
time converges with the external time at some (hopefully near) point
in the future.
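A toy illustration of slewing rather than stepping (the sign convention
and the simple linear spread are my assumptions; real NTP/adjtime()
implementations are considerably more careful):

    // offset_s = reference_time - local_time (positive means local lags).
    // Spread the correction over 'horizon_s' seconds of real time by
    // crediting slightly more (or less) per tick; the clock never steps
    // backwards, it just converges.
    double slewed_tick(double nominal_tick_s, double offset_s, double horizon_s) {
        return nominal_tick_s * (1.0 + offset_s / horizon_s);
    }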
You still might encounter leap-seconds every so often, but [so far]
they have only gone forward so as yet they haven't caused problems
with computed delays. Not guaranteed though.
My AT parser produced results that depended on current system time to calculate, but the results were fixed points in UTC time wrt the 1970
Unix epoch. The computed point might be 300 seconds or might be
3,000,000 seconds from time of the parse - but it didn't matter so
long as nobody F_d with the system clock.
the system tick (jiffy) effectively drives the time-of-day clock. And,
"event times" might be bound at time of syscall *or* resolved late.
So, if the system time changes (can actually go backwards in some poorly
designed systems!), your notion of "the present time" -- and, with it,
your expectations of FUTURE times -- changes.
Again, it should be a simple distinction to get straight in your
head. When you're dealing with times that the rest of the world
uses, use the wall time. When you're dealing with relative times,
use the system time. And, be prepared for there to be discontinuities
between the two!
Yes, but if you have to guard against (for lack of a better term)
'timebase' changes, then your only recourse is to use absolute
countdown.
The problem is, the current state of a countdown has to be maintained continuously and it can't easily be used in a 'now() >= epoch' polling software timer. That makes it very inconvenient for some uses.
And there are times when you really do want the delay to reflect the
new clock setting: ie. the evening news comes on at 6pm regardless of daylight saving, so the showtime moves (in opposition) with the change
in the clock.
Either countdown or fixed epoch can handle this if computed
appropriately (i.e. daily with reference to calendar) AND the computer remains online to maintain the countdown for the duration. If the
computer may be offline during the delay period, then only fixed epoch
will work.
It may seem trivial but if you are allowing something to interfere with
your notion of "now", then you have to be prepared when that changes
outside of your control.
[I have an "atomic" clock that was off by 14 hours. WTF??? When
your day-night schedule is as freewheeling as mine, it makes a
difference if the clock tells you a time that suggests the sun is
*rising* when, in fact, it is SETTING! <frown>]
WTF indeed. The broadcast is in UTC or GMT (depending), so if your
clock was off it had to be because its offset was wrong.
I say 'offset' rather than 'timezone' because some "atomic" clocks
have no setup other than what is the local time. Internally, the
mechanism just notes the difference between local and broadcast time
during setup, and if the differential becomes wrong it adjusts the
[Some aggressive eliding as we're getting pretty far afield of
"exception vs. error code"]
Your particular example isn't possible, but other things are -
including having values seem to appear or disappear when they are
examined at different points within your transaction.
But the point of the transaction is to lock these changes
(or recognize their occurrence) so this "ambiguity" can't
manifest. (?)
The "client" either sees the result of entire transaction or none of it.
There absolutely IS the notion of partial completion when you use
inner (ie. sub-) transactions, which can succeed and fail
independently of each other and of the outer transaction(s) in which
they are nested. Differences in isolation can permit side effects from
other ongoing transactions to be visible.
But you don't expose those partial results (?)
How would the client know that he's seeing partial results?
[Some aggressive eliding as we're getting pretty far afield of
"exception vs. error code"]
I'm arguing for the case of treating each node as "specialized + generic"
and making the generic portion available for other uses that aren't
applicable to the "specialized" nature of the node (hardware).
Your doorbell sits in an idiot loop waiting to "do something" -- instead
of spending that "idle time" working on something *else* so the "device"
that would traditionally be charged with doing that something else
can get by with less resources on-board.
[I use cameras galore. Imagining feeding all that video to a single
"PC" would require me to keep looking for bigger and faster PCs!]
... if frames from the camera are uploaded into a cloud queue, any
device able to process them could look there for new work. And store
its results into a different cloud queue for the next step(s). Faster
and/or more often 'idle' CPUs will do more work.
That means every CPU must know how to recognize that sort of "work"
and be able to handle it. Each of those nodes then bears a cost
even if it doesn't actually end up contributing to the result.
it also makes the "cloud" a shared resource akin to the "main computer".
What do you do when it isn't available?
If the current resource set is insufficient for the current workload,
then (by definition) something has to be shed. My "workload manager"
handles that -- deciding that there *is* a resource shortage (by looking
at how many deadlines are being missed/aborted) as well as sorting out
what the likeliest candidates to "off-migrate" would be.
Similarly, deciding when there is an abundance of resources that
could be offered to other nodes.
So, if a node is powered up *solely* for its compute resources
(or, its unique hardware-related tasks have been satisfied) AND
it discovers another node(s) has enough resources to address
its needs, it can push its workload off to that/those node(s) and
then power itself down.
Each node effectively implements part of a *distributed* cloud
"service" by holding onto resources as they are being used and
facilitating their distribution when there are "greener pastures"
available.
But, unlike a "physical" cloud service, they accommodate the
possibility of "no better space" by keeping the resources
(and loads) that already reside on themselves until such a place
can be found -- or created (i.e., bring more compute resources
on-line, on-demand). They don't have the option of "parking"
resources elsewhere, even as a transient measure.
When a "cloud service" is unavailable, you have to have a backup
policy in place as to how you'll deal with these overloads.
I don't have a "general" system. :> And, suspect future (distributed)
embedded systems will shy away from the notion of any centralized "controller
node" for the obvious dependencies that that imposes on the solution.
Sooner or later, that node will suffer from scale. Or, reliability.
(one of the initial constraints I put on the system was NOT to rely on
any "outside" service; why not use a DBMS "in the cloud"? :> )
I can [checkpoint] just by letting a process migrate itself to its
current node -- *if* it wants to "remember" that it can resume cleanly
from that point (but not any point beyond that unless side-effects
are eliminated). The first step in the migration effectively creates
the process snapshot.
There is overhead to taking that snapshot -- or pushing those
"intermediate results" to the cloud. You have to have designed your >*application* with that in mind.
Just like an application can choose to push "temporary data" into
the DBMS, in my world. And, incur those costs at run-time.
The more interesting problem is seeing what you can do "from the
outside" without the involvement of the application.
E.g., if an application had to take special measures in order to
be migrate-able, then I suspect most applications wouldn't be!
And, as a result, the system wouldn't have that flexibility.
OTOH, if the rules laid out for the environment allow me to wedge
that type of service *under* the applications, then there's no
cost-adder for the developers.
Processing of your camera video above is implicitly checkpointed with
every frame that's completed (at whatever stage). It's a perfect
situation for distributed TS.
But that means the post processing has to happen WHILE the video is
being captured. I.e., you need "record" and "record-and-commercial-detect"
primitives. Or, to expose the internals of the "record" operation.
Similarly, you could retrain the speech models WHILE you are listening
to a phone call. But, that means you need the horsepower to do so
AT THAT TIME, instead of just capturing the audio ("record") and
doing the retraining "when convenient" ("retrain").
I've settled on simpler primitives that can be applied in more varied
situations. E.g., you will want to "record" the video when someone
wanders onto your property. But, there won't be any "commercials"
to detect in that stream.
Trying to make "primitives" that handle each possible combination of
actions seems like a recipe for disaster; you discover some "issue"
and handle it in one implementation and imperfectly (or not at all)
handle it in the other. "Why does it work if I do 'A then B' but
'B while A' chokes?"
Remember, we're (I'm) trying to address something as "simple" as
"exceptions vs error codes", here. Expecting a developer to write
code with the notion of partial recovery in mind goes far beyond
that!
He can *choose* to structure his application/object/service in such
a way that makes that happen. Or not.
:
I think it's hard to *generally* design solutions that can be
interrupted and partially restored. You have to make a deliberate
effort to remember what you've done and what you were doing.
We seem to have developed the habit/practice of not "formalizing"
intermediate results as we expect them to be transitory.
*Then*, you need some assurance that you *will* be restarted; otherwise,
the progress that you've already made may no longer be useful.
I don't, for example, universally use my checkpoint-via-OS hack
because it will cause more grief than it will save. *But*, if
a developer knows it is available (as a service) and the constraints
of how it works, he can offer a hint (at install time) to suggest
his app/service be installed with that feature enabled *instead* of
having to explicitly code for resumption.
What I want to design against is the need to over-specify resources
*just* for some "job" that may be infrequent or transitory. That
leads to nodes costing more than they have to. Or, doing less than
they *could*!
If someTHING can notice imbalances in resources/demands and dynamically
adjust them, then one node can act as if it has more "capability" than
its own hardware would suggest.
[Time is a *huge* project because of all the related issues. You
still need some "reference" for the timebases -- wall and system.
And, a way to ensure they track in some reasonably consistent
manner: 11:00 + 60*wait(60 sec) should bring you to 12:00
even though you're using times from two different domains!]
I have no "epoch". Timestamps reflect the system time at which the
event occurred. If the events had some relation to "wall time"
("human time"), then any discontinuities in that time frame are
the problem of the human. Time "starts" at the factory. Your
system's "system time" need bear no relationship to mine.
[When the system boots, it has no idea "what time it is" until it can
get a time fix from some external agency. That *can* be an RTC -- but,
RTC batteries can die, etc. It can be manually specified -- but, that
can be in error. <shrug> The system doesn't care. The *user* might
care (if his shows didn't get recorded at the right times)...]
For me, time doesn't exist when EVERYTHING is off. Anything that
was supposed to happen during that interval obviously can't happen.
And, nothing that has happened (in the environment) can be "noticed"
so there's no way of ordering those observations!
If I want some "event" to be remembered beyond an outage, then
the time of the event has to be intentionally stored in persistent
storage (i.e., the DBMS) and retrieved from it (and rescheduled)
once the system restarts.
These tend to be "human time" events (broadcast schedules, HVAC
events, etc.). Most "system time" events are short-term and don't
make sense spanning an outage *or* aren't particularly concerned
with accuracy (e.g., vacuum the DB every 4 hours).
I have a freerunning timer that is intended to track the passage of
time during an outage (like an RTC would). It can never be set
(reset) so, in theory, tells me what the system timer WOULD have
been, had the system not suffered an outage.
----
Now, back to my error/exception problem. I have to see if there are
any downsides to offering a dual API to address each developer's
"style"...
On Sun, 22 May 2022 14:12:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Now, back to my error/exception problem. I have to see if there are
any downsides to offering a dual API to address each developer's
"style"...
I don't see any downside to a dual API other than naming conventions,
which might be avoided if it's possible to link a 'return value'
library vs an exception throwing library ...
But I think it would be difficult to do both all the way down in
parallel. Simpler to pick one as the basis, implement everything in
your chosen model, and then provide a set of wrappers that convert to
the other model.
The question then is: which do you choose as the basis?
Return value doesn't involve compiler 'magic', so you can code in any
language - including ones that don't offer exceptions. However, code
may be more complicated by the need to propagate errors.
Exceptions lead to cleaner code and (generally) more convenient
writing, but they do involve compiler 'magic' and so limit the
languages that may be used. Code that depends on exceptions also may
be slower to handle errors.
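For what it's worth, a minimal sketch of the "pick one basis, wrap for
the other camp" idea above -- an error-code API underneath with a
throwing wrapper on top; the function, the enum, and the stub body are
invented for illustration:

    #include <stdexcept>

    enum class Err { OK, Timeout, NoResource };

    // Base flavor: status code plus out-parameter (trivial stub body).
    Err read_sensor(int id, double &value) {
        if (id < 0) return Err::NoResource;
        value = 42.0;                        // stand-in for real I/O
        return Err::OK;
    }

    // Wrapper flavor for the exception camp.
    double read_sensor_or_throw(int id) {
        double v;
        if (read_sensor(id, v) != Err::OK)
            throw std::runtime_error("read_sensor failed");
        return v;
    }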
13 of one. Baker's dozen of the other.
On Sun, 22 May 2022 14:12:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
I'm arguing for the case of treating each node as "specialized + generic"
and making the generic portion available for other uses that aren't
applicable to the "specialized" nature of the node (hardware).
Your doorbell sits in an idiot loop waiting to "do something" -- instead
of spending that "idle time" working on something *else* so the "device"
that would traditionally be charged with doing that something else
can get by with less resources on-board.
[I use cameras galore. Imagining feeding all that video to a single
"PC" would require me to keep looking for bigger and faster PCs!]
... if frames from the camera are uploaded into a cloud queue, any
device able to process them could look there for new work. And store
its results into a different cloud queue for the next step(s). Faster
and/or more often 'idle' CPUs will do more work.
That means every CPU must know how to recognize that sort of "work"
and be able to handle it. Each of those nodes then bears a cost
even if it doesn't actually end up contributing to the result.
Every node must know how to look for work, but if suitable code can be downloaded on demand, then the nodes do NOT have to know how to do any particular kind of work.
Ie. looking for new work yields a program. Running the program looks
for data to process. You can put limits into it such as to terminate
the program if no new data is seen for a while.
Similarly on the cloud side, the server(s) might be aware of
program/data associations and only serve up programs that have data
queued to process. You can tweak this with priorities.
it also makes the "cloud" a shared resource akin to the "main computer".
What do you do when it isn't available?
The cloud is distributed so that [to some meaningful statistical
value] it always is available to any node having a working network connection.
You have to design cloud services with expectation that network
partitioning will occur and the cloud servers themselves will lose
contact with one another. They need to be self-organizing (for when
the network is restored) and databases they maintain should be
redundant, self-repairing, and modeled on BASE rather than ACID.
If the current resource set is insufficient for the current workload,
then (by definition) something has to be shed. My "workload manager"
handles that -- deciding that there *is* a resource shortage (by looking
at how many deadlines are being missed/aborted) as well as sorting out
what the likeliest candidates to "off-migrate" would be.
Similarly, deciding when there is an abundance of resources that
could be offered to other nodes.
So, if a node is powered up *solely* for its compute resources
(or, its unique hardware-related tasks have been satisfied) AND
it discovers another node(s) has enough resources to address
its needs, it can push its workload off to that/those node(s) and
then power itself down.
Each node effectively implements part of a *distributed* cloud
"service" by holding onto resources as they are being used and
facilitating their distribution when there are "greener pastures"
available.
But, unlike a "physical" cloud service, they accommodate the
possibility of "no better space" by keeping the resources
(and loads) that already reside on themselves until such a place
can be found -- or created (i.e., bring more compute resources
on-line, on-demand). They don't have the option of "parking"
resources elsewhere, even as a transient measure.
When a "cloud service" is unavailable, you have to have a backup
policy in place as to how you'll deal with these overloads.
Right, but you're still thinking (mostly) in terms of self-contained
programs that do some significant end-to-end processing.
And that is fine, but it is not the best way to employ a lot of
small(ish) idle CPUs.
Think instead of a pipeline of small, special/single purpose programs
that can be strung together to accomplish the processing in well
defined stages.
Like a command-line gang in Unix: e.g., "find | grep | sort ..."
Then imagine that they are connected not directly by pipes or sockets,
but indirectly through shared 'queues' maintained by an external
service.
Then imagine that each pipeline stage is or can be on a different
node, and that if the node crashes, the program it was executing can
be reassigned to a new node. If no node is available, the pipeline
halts ... with its state and partial results preserved in the cloud
... until a new node can take over.
Obviously, not every process can be decomposed in this way - but with
a bit of thought, surprisingly many processing tasks CAN BE adapted to
this model.
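A minimal sketch of such a "logical pipeline" stage, again assuming a
hypothetical cloud-queue client; which node runs the stage (and whether
it later moves to another node) is invisible to the stage itself:

    #include <functional>
    #include <string>

    struct Datum { int seq; std::string payload; };

    struct Queues {                                    // hypothetical cloud-queue client
        virtual bool take(const std::string &q, Datum &d) = 0;
        virtual void put(const std::string &q, const Datum &d) = 0;
        virtual ~Queues() = default;
    };

    // One stage of a "find | grep | sort"-style gang: pull from in_q,
    // transform, push to out_q. If no node (or no work) is available,
    // the pipeline simply halts with its state safe in the cloud.
    void run_stage(Queues &qs, const std::string &in_q, const std::string &out_q,
                   const std::function<Datum(const Datum &)> &transform) {
        Datum d;
        while (qs.take(in_q, d))
            qs.put(out_q, transform(d));
    }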
I don't have a "general" system. :> And, suspect future (distributed)
embedded systems will shy away from the notion of any centralized "controller
node" for the obvious dependencies that that imposes on the solution.
"central control" is not a problem if the group can self-organize and (s)elect a new controller.
Sooner or later, that node will suffer from scale. Or, reliability.
(one of the initial constraints I put on the system was NOT to rely on
any "outside" service; why not use a DBMS "in the cloud"? :> )
"Cloud" is not any particular implementation - it's the notion of
ubiquitous, high availability service. It also does not necessarily
imply "wide-area" - a cloud can serve a building or a campus.
[I'm really tired of the notion that words have only the meaning that marketing departments and the last N months of public consciousness
have bestowed on them. It's true that the actual term 'cloud
computing' originated in the 1990s, but the concepts embodied by the
term date from the 1950s.]
I can [checkpoint] just by letting a process migrate itself to its
current node -- *if* it wants to "remember" that it can resume cleanly
from that point (but not any point beyond that unless side-effects
are eliminated). The first step in the migration effectively creates
the process snapshot.
There is overhead to taking that snapshot -- or pushing those
"intermediate results" to the cloud. You have to have designed your
*application* with that in mind.
Just like an application can choose to push "temporary data" into
the DBMS, in my world. And, incur those costs at run-time.
The more interesting problem is seeing what you can do "from the
outside" without the involvement of the application.
E.g., if an application had to take special measures in order to
be migrate-able, then I suspect most applications wouldn't be!
And, as a result, the system wouldn't have that flexibility.
OTOH, if the rules laid out for the environment allow me to wedge
that type of service *under* the applications, then there's no
cost-adder for the developers.
Understood. The problem is that notion of 'process' still is too heavyweight. You can easily checkpoint in place, but migrating the
process creates (at least) issues with reconnecting all the network
plumbing.
Ie. if the program wasn't written to notice that important connections
were lost and handle those situations ... everywhere ... then it can't survive a migration.
The TS model makes the plumbing /stateless/ and can - with a bit of
care - make the process more elastic and more resilient in the face of various failures.
Processing of your camera video above is implicitly checkpointed with
every frame that's completed (at whatever stage). It's a perfect
situation for distributed TS.
But that means the post processing has to happen WHILE the video is
being captured. I.e., you need "record" and "record-and-commercial-detect"
primitives. Or, to expose the internals of the "record" operation.
Not at all. What it means is that recording does not produce an
integrated video stream, but rather a sequence of frames. The frame
sequence then can be accessed by 'commercial-detect' which consumes[*]
the input sequence and produces a new output sequence lacking those
frames which represent commercial content. Finally, some other little program could take that commercial-less sequence and produce the
desired video stream.
[*] consumes or copies - the original data sequence could be left
intact for some other unrelated processing.
All of this can be done elastically, in the background, and
potentially in parallel by a gang of (otherwise idle) nodes.
Similarly, you could retrain the speech models WHILE you are listening
to a phone call. But, that means you need the horsepower to do so
AT THAT TIME, instead of just capturing the audio ("record") and
doing the retraining "when convenient" ("retrain").
Recording produces a sequence of clips to be analyzed by 'retrain'.
I've settled on simpler primitives that can be applied in more varied
situations. E.g., you will want to "record" the video when someone
wanders onto your property. But, there won't be any "commercials"
to detect in that stream.
Trying to make "primitives" that handle each possible combination of
actions seems like a recipe for disaster; you discover some "issue"
and handle it in one implementation and imperfectly (or not at all)
handle it in the other. "Why does it work if I do 'A then B' but
'B while A' chokes?"
You're simultaneously thinking too small AND too big.
Breaking the operation into pipeline(able) stages is the right idea,
but you need to think harder about what kinds of pipelines make sense
and what is the /minimum/maximum/average amount of processing that
makes sense for a pipeline stage.
Remember, we're (I'm) trying to address something as "simple" as
"exceptions vs error codes", here. Expecting a developer to write
code with the notion of partial recovery in mind goes far beyond
that!
He can *choose* to structure his application/object/service in such
a way that makes that happen. Or not.
:
I think it's hard to *generally* design solutions that can be
interrupted and partially restored. You have to make a deliberate
effort to remember what you've done and what you were doing.
We seem to have developed the habit/practice of not "formalizing"
intermediate results as we expect them to be transitory.
Right. But generally it is the case that only certain intermediate
points are even worthwhile to checkpoint.
But again, this line of thinking assumes that the processing both is
complex and time consuming - enough so that it is /expected/ to fail
before completion.
*Then*, you need some assurance that you *will* be restarted; otherwise,
the progress that you've already made may no longer be useful.
Any time a process is descheduled (suspended), for whatever reason,
there is no guarantee that it will wake up again. But it has to
behave AS IF it will.
I don't, for example, universally use my checkpoint-via-OS hack
because it will cause more grief than it will save. *But*, if
a developer knows it is available (as a service) and the constraints
of how it works, he can offer a hint (at install time) to suggest
his app/service be installed with that feature enabled *instead* of
having to explicitly code for resumption.
Snapshots of a single process /may/ be useful for testing or debugging [though I have doubts about how much]. I'm not sure what purpose they
really can serve in a production environment. After all, you don't
(usually) write programs /intending/ for them to crash.
For comparison: VM hypervisors offer snapshots also, but they capture
the state of the whole system. You not only save the state of your
process, but also of any system services it was using, any (local)
peer processes it was communicating with, etc. This seems far more
useful from a developer POV.
Obviously YMMV and there may be uses I have not considered.
What I want to design against is the need to over-specify resources
*just* for some "job" that may be infrequent or transitory. That
leads to nodes costing more than they have to. Or, doing less than
they *could*!
If someTHING can notice imbalances in resources/demands and dynamically
adjust them, then one node can act as if it has more "capability" than
its own hardware would suggest.
[Time is a *huge* project because of all the related issues. You
still need some "reference" for the timebases -- wall and system.
And, a way to ensure they track in some reasonably consistent
manner: 11:00 + 60*wait(60 sec) should bring you to 12:00
even though you're using times from two different domains!]
Time always is a problem for any application that cares. Separating
the notion of 'system' time from human notions of time is necessary
but is not sufficient for all cases.
I have no "epoch". Timestamps reflect the system time at which the
event occurred. If the events had some relation to "wall time"
("human time"), then any discontinuities in that time frame are
the problem of the human. Time "starts" at the factory. Your
system's "system time" need bear no relationship to mine.
How is system time preserved through a system-wide crash? [consider a
power outage that depletes any/all UPSs.]
What happens if/when your (shared) precision clock source dies?
[When the system boots, it has no idea "what time it is" until it can
get a time fix from some external agency. That *can* be an RTC -- but,
RTC batteries can die, etc. It can be manually specified -- but, that
can be in error. <shrug> The system doesn't care. The *user* might
care (if his shows didn't get recorded at the right times)...]
Answer ... system time is NOT preserved in all cases.
For me, time doesn't exist when EVERYTHING is off. Anything that
was supposed to happen during that interval obviously can't happen.
And, nothing that has happened (in the environment) can be "noticed"
so there's no way of ordering those observations!
If I want some "event" to be remembered beyond an outage, then
the time of the event has to be intentionally stored in persistent
storage (i.e., the DBMS) and retrieved from it (and rescheduled)
once the system restarts.
Scheduled relative to the (new?, updated?) system time at the moment
of scheduling. But how do you store that desired event time? Ie., a countdown won't survive if up-to-moment residue can't be persisted
through a shutdown (or crash) and/or the system time at restart does
not reflect the outage period.
But if [as you said above] there's no starting 'epoch' to your
timebase - ie. no zero point corresponding to a point in human time -
then there also is no way to specify an absolute point in human time
for an event in the future.
On 5/24/2022 11:12 PM, George Neuner wrote:
On Sun, 22 May 2022 14:12:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
There are mechanisms that allow:
- a node to look for work and "pull" it onto itself
- a node to look for workers and "push" work onto them
- a node to look at loads and capacities and MOVE work around
- all of the above but with the nodes contributing "suggestions" to
some other agent(s) that make the actual changes
ANY node with a capability (handle) onto another node can create
a task on that other node. And, specify how to initialize the
ENTIRE state -- including copying a RUNNING state from another node.
You can't really come up with an *ideal* partitioning
(I suspect NP-complete under timeliness constraints?)
In my approach, looking for work is just "looking for a load
that you can *assume* -- from another! You already know there is
data FOR THAT TASK to process. And, the TEXT+DATA to handle
that is sitting there, waiting to be "plucked" (no need to
download from the persistent store -- which might have been
powered down to conserve power).
The goal isn't to "help" a process that is running elsewhere
but, rather, to find a better environment for that process.
I can create a redundant cloud service *locally*. But, if the cloud is
*remote*, then I'm dependent on the link to the outside world to
gain access to that (likely redundant) service.
If you've integrated that "cloud service" into your design, then
you really can't afford to lose that service. It would be like
having the OS reside somewhere beyond my control.
I can tolerate reliance on LOCAL nodes -- because I can exercise
some control over them. ...
So, using a local cloud would mean providing that service on
local nodes in a way that is reliable and efficient.
Letting workload managers on each node decide how to shuffle
around the "work" gives me the storage (in place) that the
cloud affords (the nodes are currently DOING the work!)
If the current resource set is insufficient for the current workload,
then (by definition) something has to be shed. My "workload manager"
handles that -- deciding that there *is* a resource shortage (by looking
at how many deadlines are being missed/aborted) as well as sorting out
what the likeliest candidates to "off-migrate" would be.
Similarly, deciding when there is an abundance of resources that
could be offered to other nodes.
So, if a node is powered up *solely* for its compute resources
(or, its unique hardware-related tasks have been satisfied) AND
it discovers another node(s) has enough resources to address
its needs, it can push its workload off to that/those node(s) and
then power itself down.
Each node effectively implements part of a *distributed* cloud
"service" by holding onto resources as they are being used and
facilitating their distribution when there are "greener pastures"
available.
But, unlike a "physical" cloud service, they accommodate the
possibility of "no better space" by keeping the resources
(and loads) that already reside on themselves until such a place
can be found -- or created (i.e., bring more compute resources
on-line, on-demand). They don't have the option of "parking"
resources elsewhere, even as a transient measure.
In my case, the worst dependency lies in the RDBMS. But, its loss
can be tolerated if you don't have to access data that is ONLY
available on the DB.
[The switch is, of course, the extreme example of a single point failure]
E.g., if you have the TEXT image for a "camera module" residing
on node 23, then you can use that to initialize the TEXT segment
of the camera on node 47. If you want to store something on the
DBMS, it can be cached locally until the DBMS is available. etc.
The TS model makes the plumbing /stateless/ and can - with a bit of
care - make the process more elastic and more resilient in the face of
various failures.
Processing of your camera video above is implicitly checkpointed with
every frame that's completed (at whatever stage). It's a perfect
situation for distributed TS.
But that means the post processing has to happen WHILE the video is
being captured. I.e., you need "record" and "record-and-commercial-detect"
primitives. Or, to expose the internals of the "record" operation.
Not at all. What it means is that recording does not produce an
integrated video stream, but rather a sequence of frames. The frame
sequence then can be accessed by 'commercial-detect' which consumes[*]
the input sequence and produces a new output sequence lacking those
frames which represent commercial content. Finally, some other little
program could take that commercial-less sequence and produce the
desired video stream.
But there's no advantage to this if the "commercial detect" is going
to be done AFTER the recording. I.e., you're storing the recording,
frame by frame, in the cloud. I'm storing it on a "record object"
(memory or DB record). And, in a form that is convenient for
a "record" operation (without concern for the "commercial detect"
which may not be used in this case).
Exploiting the frame-by-frame nature only makes sense if you're
going to start nibbling on those frames AS they are generated.
Snapshots of a single process /may/ be useful for testing or debugging
[though I have doubts about how much]. I'm not sure what purpose they
really can serve in a production environment. After all, you don't
(usually) write programs /intending/ for them to crash.
Stop thinking about crashes. That implies something is "broken".
Any application that is *inherently* resumable (e.g., my archive
service) can benefit from an externally imposed snapshot that
is later restored. The developer is the one who is qualified to
make this assessment, not the system (it lacks the heuristics
though, arguably, could "watch" an application to see what handles
it invokes).
Similarly, the system can't know the nature of a task's deadlines.
But, can provide mechanisms to make use of that deadline data
FROM the developer. If you don't provide it, then I will
assume your deadline is at T=infinity... and your code will
likely never be scheduled! :> (if you try to game it, your
code also may never be scheduled: "That deadline is too soon
for me to meet it so let's not bother trying!")
On Sun, 22 May 2022 14:12:39 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
[Some aggressive eliding as we're getting pretty far afield of
"exception vs. error code"]
Best discussions always wander. <grin>
Your particular example isn't possible, but other things are -
including having values seem to appear or disappear when they are
examined at different points within your transaction.
But the point of the transaction is to lock these changes
(or recognize their occurrence) so this "ambiguity" can't
manifest. (?)
Yes ... and no.
The "client" either sees the result of entire transaction or none of it.
High isolation levels often result in measurably lower performance - "repeatable read" requires that when any underlying table is first
touched, the selected rows be locked, or be copied for local use. RR
also changes how subsequent uses of those rows are evaluated (see
below). Locking limits concurrency, copying uses resources (which
also may limit concurrency).
However, most client-side DB libraries do NOT accept whole tables as
results. Instead they open a cursor on the result and cache some
(typically small) number of rows surrounding the one currently
referenced by the cursor. Moving the cursor fetches new rows to
maintain the illusion that the whole table is available. Meanwhile
the transaction that produced the result is kept alive (modulo
timeout) because the result TVT is still needed until the client
closes the cursor.
[Ways around the problem of cursor based client libraries may include increasing the size of the row cache so that it can hold the entire
expected result table, or (if you can't do that) to make a local copy
of the result as quickly as possible. Remembering that the row cache
is per-cursor, per-connection, increasing the size of the cache(s) may
force restructuring your application to limit the number of open connections.]
On Thu, 26 May 2022 17:39:26 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
There are mechanisms that allow:
- a node to look for work and "pull" it onto itself
- a node to look for workers and "push" work onto them
- a node to look at loads and capacities and MOVE work around
- all of the above but with the nodes contributing "suggestions" to
some other agent(s) that make the actual changes
In my approach, looking for work is just "looking for a load
that you can *assume* -- from another! You already know there is
data FOR THAT TASK to process. And, the TEXT+DATA to handle
that is sitting there, waiting to be "plucked" (no need to
download from the persistent store -- which might have been
powered down to conserve power).
The goal isn't to "help" a process that is running elsewhere
but, rather, to find a better environment for that process.
That's a form of "work stealing". But usually the idea is to migrate
the task to a CPU which has [for some definition] a lighter load:
"lighter" because it's faster, or because it has more memory, etc.
Typically this leaves out the small fry.
The TS model is both more competitive but less ruthless - idle (or
lightly loaded) nodes don't steal from one another, but rather nodes
that ARE doing the same kind of processing compete for the input data sequence. More capable nodes get more done - but not /necessarily/ at
the expense of less capable nodes [that depends on the semantics of
the tuple services].
Very different.
I can create a redundant cloud service *locally*. But, if the cloud is
*remote*, then I'm dependent on the link to the outside world to
gain access to that (likely redundant) service.
Yes. But you wouldn't choose to do that. If someone else chooses to
do it, that is beyond your control.
If you've integrated that "cloud service" into your design, then
you really can't afford to lose that service. It would be like
having the OS reside somewhere beyond my control.
I can tolerate reliance on LOCAL nodes -- because I can exercise
some control over them. ...
So, using a local cloud would mean providing that service on
local nodes in a way that is reliable and efficient.
Letting workload managers on each node decide how to shuffle
around the "work" gives me the storage (in place) that the
cloud affords (the nodes are currently DOING the work!)
But it means that even really /tiny/ nodes need some variant of your "workload manager". It's simpler just to look for something to do
when idle.
In my case, the worst dependency lies in the RDBMS. But, its loss
can be tolerated if you don't have to access data that is ONLY
available on the DB.
[The switch is, of course, the extreme example of a single point failure]
E.g., if you have the TEXT image for a "camera module" residing
on node 23, then you can use that to initialize the TEXT segment
of the camera on node 47. If you want to store something on the
DBMS, it can be cached locally until the DBMS is available. etc.
Yes. Or [assuming a network connection] the DBMS could be made
distributed so it always is available.
Processing of your camera video above is implicitly checkpointed with
every frame that's completed (at whatever stage). It's a perfect
situation for distributed TS.
But that means the post processing has to happen WHILE the video is
being captured. I.e., you need "record" and "record-and-commercial-detect"
primitives. Or, to expose the internals of the "record" operation.
Not at all. What it means is that recording does not produce an
integrated video stream, but rather a sequence of frames. The frame
sequence then can be accessed by 'commercial-detect' which consumes[*]
the input sequence and produces a new output sequence lacking those
frames which represent commercial content. Finally, some other little
program could take that commercial-less sequence and produce the
desired video stream.
But there's no advantage to this if the "commercial detect" is going
to be done AFTER the recording. I.e., you're storing the recording,
frame by frame, in the cloud. I'm storing it on a "record object"
(memory or DB record). And, in a form that is convenient for
a "record" operation (without concern for the "commercial detect"
which may not be used in this case).
Exploiting the frame-by-frame nature only makes sense if you're
going to start nibbling on those frames AS they are generated.
No, frame by frame makes sense regardless. The video, AS a video,
does not need to exist until a human wants to watch. Until that time
[and even beyond] it can just as well exist as a frame sequence. The
only penalty for this is a bit more storage, and storage is [well,
before supply chains fell apart and the current administration
undertook to further ruin the economy, it was] decently cheap.
Snapshots of a single process /may/ be useful for testing or debugging
[though I have doubts about how much]. I'm not sure what purpose they
really can serve in a production environment. After all, you don't
(usually) write programs /intending/ for them to crash.
Stop thinking about crashes. That implies something is "broken".
If an RT task (hard or soft) misses its deadline, something IS
"broken" [for some definition]. Resetting the task automagically to
some prior - hopefully "non-broken" - state is not helpful if it masks
the problem.
Any application that is *inherently* resumable (e.g., my archive
service) can benefit from an externally imposed snapshot that
is later restored. The developer is the one who is qualified to
make this assessment, not the system (it lacks the heuristics
though, arguably, could "watch" an application to see what handles
it invokes).
I guess the question here is: can the developer say "snapshot now!" or
is it something that happens periodically, or even just best effort?
Similarly, the system can't know the nature of a task's deadlines.
But, can provide mechanisms to make use of that deadline data
FROM the developer. If you don't provide it, then I will
assume your deadline is at T=infinity... and your code will
likely never be scheduled! :> (if you try to game it, your
code also may never be scheduled: "That deadline is too soon
for me to meet it so let's not bother trying!")
We've had this conversation before: the problem is not that deadlines
can't be scheduled for ... it's that the critical deadlines can't all necessarily be enumerated.
:
to get D by T3, I need C by T2
to get C by T2, I need B by T1
to get B by T1, I need A by T0
:
ad regressus, ad nauseam.
And yes, a single explicit deadline can proxy for some number of
implicit deadlines. That's not the point and you know it <grin>.
George