• Re: Tracking bug report frequency

    From Martin Brown@21:1/5 to Joe Gwinn on Mon Sep 11 09:05:13 2023
    On 08/09/2023 22:14, Joe Gwinn wrote:
    On Fri, 8 Sep 2023 09:33:33 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:


    NR = Newton Raphson?

    Yes. Although these days my production code actually uses Halley's
    method in preference since it is almost the same speed to execute on
    modern pipeline CPUs and gives cubic rather than quadratic convergence.
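    For readers meeting it for the first time, Halley's update for a root
    of f is x <- x - 2*f*f' / (2*f'^2 - f*f''). A minimal Python sketch of
    the idea (illustrative only - the production code under discussion is
    certainly more careful than this):

```python
def halley(f, df, d2f, x, tol=1e-15, max_iter=20):
    """Halley's method: cubic convergence, at the price of needing
    f, f' and f'' at every iterate."""
    for _ in range(max_iter):
        fx, dfx, d2fx = f(x), df(x), d2f(x)
        step = 2.0 * fx * dfx / (2.0 * dfx * dfx - fx * d2fx)
        x -= step
        if abs(step) < tol * max(1.0, abs(x)):
            break
    return x

# cube root of 2 via f(x) = x**3 - 2
r = halley(lambda t: t**3 - 2, lambda t: 3*t*t, lambda t: 6*t, 1.5)
```

    Each iterate roughly triples the number of correct digits, against
    doubling for NR, which is why the extra f'' can pay for itself.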

    Interesting. I had not heard of Halley's Method, and cubic
    convergence is worthwhile.

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections provided
    that the function is sufficiently differentiable. They have to be
    computed very carefully with considerable attention to detail!

    One curiosity I have seen with them is that for me D_5 is anomalously
    fast despite containing D_4! On certain problems D_5 is *faster* than NR
    (I suspect it is some sort of pipeline stall slowing the others down).

    In the past they all got slower because there was a division for each
    step down the chain. I have a new computational scheme which avoids that
    (at an increased risk of under/over flow in some edge cases)

    The codes all run as expected on simple starting guesses but when the
    fancy cubic ones are used they show this peculiar speed variation. It is
    most noticeable with the Intel compiler (less so on MSC or GCC).

    The reason these high order methods are fashionable right now is that
    the iterative methods require extra function evaluations and for that
    you have to crystallise the next value of x1 and recompute f(x1),f'(x1)
    etc. Historically they would iterate 3-5 times to solve it. Having to
    wait for that new input value is an unavoidable hard pipeline stall.

    But if you can get the initial guess with a relative error below
    10^(-16/N) then in an ideal world the Nth order difference corrector
    gets the answer in a single step and so the next value can be computed.

    Some practitioners prefer what they call modified NR which is a sort of
    hybrid of NR with Halley with slightly better convergence than Halley.
    However it involves a sqrt and isn't always better.

    Equation (4) in this paper (which includes the derivation) - method
    originally due to Laguerre but now seeing a revival in some codes.

    <https://academic.oup.com/mnras/article/467/2/1702/2929272>

    The gotcha is that although as written with abs() it never actually
    fails - a sufficiently bad guess will still give poor answers, but is
    much less likely to explode or oscillate in the way that NR does.
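    The general shape of such a Laguerre-style step looks roughly like the
    following (a hedged sketch of the family, not the paper's exact
    equation (4); the degree parameter n is a free illustrative choice, and
    the abs() placement follows the description above):

```python
import math

def laguerre_step(f, df, d2f, x, n=5.0):
    # Laguerre-style update: abs() under the sqrt means the step is
    # always defined; the sign of the sqrt term is chosen to match f'
    # so the denominator is as large as possible.
    fx, dfx, d2fx = f(x), df(x), d2f(x)
    disc = abs((n - 1.0)**2 * dfx*dfx - n*(n - 1.0) * fx * d2fx)
    denom = dfx + math.copysign(math.sqrt(disc), dfx)
    return x - n * fx / denom

# polish the cube root of 2 (n = 5 is purely illustrative here)
x = 1.5
for _ in range(8):
    x = laguerre_step(lambda t: t**3 - 2, lambda t: 3*t*t, lambda t: 6*t, x)
```

    Near the root the step collapses to an ordinary Newton step, which is
    why a bad guess degrades gracefully instead of exploding.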

    So NR is better because it always warns?

    No. NR can go completely AWOL far too easily and diverge to infinity or oscillate chaotically between a set of values. Halley's method is the
    best of the three for stability, but MNR will get you slightly better
    precision if you can be sure your guesses are always close enough.

    I'm not convinced that it is worth the cost of a sqrt but with CPUs
    improving all the time it is now very close. The latest i5-12600 I have
    can do a sqrt in about the same time as it takes to do a divide!

    Wonder what algorithm it uses?

    My guess is some sort of cunning hardware shift, multiply and add. The
    latest generation of chips stand out in this respect. If only they would implement a hardware cube root then I would be very happy.

    And then there is co-plotting and staring - the Mark I eyeball is a
    very good pattern and anomaly detector, especially of concentricity or
    linearity.

    Looking at the plot of residual errors can be very enlightening. So can
    looking at the histogram for many test cases of the error distribution
    of solve, test and then measure error in the resulting answer.

    It is remarkably easy to home in on any weak spots and edge cases. My
    normal reporting rule is to print out exceptions whenever the worst-case
    high or low error is exceeded. Verbose mode shows when it is equalled.

    Yes. But I always make plots and stare at them.

    I only do that if something catches my attention as potentially odd.

    My initial reaction was that it was tested library code so it must be my
    problem - until I traced into it and saw how it failed. It gives 8
    instead of 16 sig fig in double precision for these pathological data.

    That's a good one.

    I was a bit surprised. It required an astonishingly exact combination of
    factors for it to trigger and one of my real problems hit it square on.

    So biased probing paid off here? Or only for diagnosis?

    It confirmed that given the right awkward parameters the solver would
    fail, but also that it was a very low-risk event: it took an overnight
    run just to find a single example using random data.

    I just happened to be unlucky in my exact choice of problem. Expansion
    around pi/4 was absolutely fine but the expansion around 3pi/4 went
    completely haywire even though it differed only in a handful of sign
    changes to coefficients. I was utterly convinced it was my user error.

    In the 1980s I hit a similar problem, but it wasn't hard to find. In
    scaled integer arithmetic (no floating point then), it's very common
    to multiply two 32-bit integers together, yielding a 64-bit product,
    and then divide that by a 64-bit scaling factor, yielding a 32-bit
    quotient. In the Fortran of the day, if one just wrote this as
    (a*b)/c, the compiler would truncate the product (a*b) to 32 bits
    before dividing by c, yielding a wildly wrong answer, without any
    warnings or drama. The solution was to define a 64-bit intermediate
    variable, and do the operation in two lines: p=a*b, result = p/c.
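    The pitfall is easy to reproduce by simulating the 32-bit truncation
    (Python integers are unbounded, so the mask below makes the old
    compiler's behaviour explicit; the function names are mine, not the
    library MULDIV mentioned in the reply):

```python
MASK32 = 0xFFFFFFFF

def muldiv_truncated(a, b, c):
    # the Fortran-of-the-day behaviour: the product (a*b) is silently
    # truncated to 32 bits before the divide
    return ((a * b) & MASK32) // c

def muldiv_wide(a, b, c):
    # the two-line fix: keep the full product in a wide intermediate,
    # then divide
    p = a * b
    return p // c
```

    With a = b = 100000 and c = 1000 the full product is 10^10, which does
    not fit in 32 bits, so the truncated version is wildly wrong with no
    warnings or drama.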

    ISTR there being a library function MULDIV(a,b,c) for that at least on
    the platforms where I worked. We had REAL*4, REAL*8 and REAL*16 but I
    think only INTEGER*1, INTEGER*2 and INTEGER*4 available (late 70's).

    --
    Martin Brown

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Phil Hobbs@21:1/5 to Martin Brown on Mon Sep 11 08:36:31 2023
    On 2023-09-11 04:05, Martin Brown wrote:
    On 08/09/2023 22:14, Joe Gwinn wrote:
    On Fri, 8 Sep 2023 09:33:33 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:


    NR = Newton Raphson?

    Yes. Although these days my production code actually uses
    Halley's method in preference since it is almost the same speed
    to execute on modern pipeline CPUs and gives cubic rather than
    quadratic convergence.

    Interesting. I had not heard of Halley's Method, and cubic
    convergence is worthwhile.

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections
    provided that the function is sufficiently differentiable. They have
    to be computed very carefully with considerable attention to detail!


    One curiosity I have seen with them is that for me D_5 is
    anomalously fast despite containing D_4 ! On certain problems D5 is
    *faster* than NR (I suspect it is some sort of pipeline stall slowing
    the others down).

    In the past they all got slower because there was a division for
    each step down the chain. I have a new computational scheme which
    avoids that (at an increased risk of under/over flow in some edge
    cases)

    The codes all run as expected on simple starting guesses but when
    the fancy cubic ones are used they show this peculiar speed
    variation. It is most noticeable with the Intel compiler (less so on
    MSC or GCC).

    The reason these high order methods are fashionable right now is
    that the iterative methods require extra function evaluations and for
    that you have to crystallise the next value of x1 and recompute
    f(x1),f'(x1) etc. Historically they would iterate 3-5 times to solve
    it. Having to wait for that new input value is an unavoidable hard
    pipeline stall.

    But if you can get the initial guess with a relative error below
    10^(-16/N) then in an ideal world the Nth order difference corrector
    gets the answer in a single step and so the next value can be
    computed.

    Some practitioners prefer what they call modified NR which is a
    sort of hybrid of NR with Halley with slightly better
    convergence than Halley. However it involves a sqrt and isn't
    always better.

    Equation (4) in this paper (which includes the derivation) -
    method originally due to Laguerre but now seeing a revival in
    some codes.

    <https://academic.oup.com/mnras/article/467/2/1702/2929272>

    The gotcha is that although as written with abs() it never
    actually fails - a sufficiently bad guess will still give poor
    answers, but is much less likely to explode or oscillate in the
    way that NR does.

    So NR is better because it always warns?

    No. NR can go completely AWOL far too easily and diverge to infinity
    or oscillate chaotically between a set of values. Halley's method is
    the best of the three for stability, but MNR will get you slightly
    better precision if you can be sure your guesses are always close
    enough.

    I'm not convinced that it is worth the cost of a sqrt but with
    CPUs improving all the time it is now very close. The latest
    i5-12600 I have can do a sqrt in about the same time as it takes
    to do a divide!

    Wonder what algorithm it uses?

    My guess is some sort of cunning hardware shift, multiply and add.
    The latest generation of chips stand out in this respect. If only
    they would implement a hardware cube root then I would be very
    happy.

    And then there is co-plotting and staring - the Mark I eyeball
    is a very good pattern and anomaly detector, especially of
    concentricity or linearity.

    Looking at the plot of residual errors can be very enlightening.
    So can looking at the histogram for many test cases of the error
    distribution of solve, test and then measure error in the
    resulting answer.

    It is remarkably easy to home in on any weak spots and edge
    cases. My normal reporting rule is printout exceptions whenever
    the worst case high or low error is exceeded. Verbose mode shows
    when it is equalled.

    Yes. But I always make plots and stare at them.

    I only do that if something catches my attention as potentially odd.



    It's been quite a while since I looked into the problem (and I was never
    close to your level of expertise there, Martin). However, this being
    Usenet, I am undeterred. ;)

    Cubic methods are great as long as computing the second derivative is
    cheap. BITD the "Illinois algorithm" and its relatives got a fair
    amount of attention (in select circles). They're regula-falsi based,
    but with hyperlinear convergence. The best that I know is a Bjorck &
    Anderson variant whose efficiency index (basically the average order of convergence per function evaluation) gets up to 8**(1/4) ~ 1.68.
    This compares with 1.0 for bisection or vanilla Regula Falsi, and
    sqrt(2) for the secant method.
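    For reference, a sketch of the basic Illinois modification (plain false
    position, but the retained endpoint's stored function value is halved
    when the new point keeps landing on the same side; the Bjorck &
    Anderson variant mentioned above differs in its details):

```python
import math

def illinois(f, a, b, tol=1e-13, max_iter=100):
    # Illinois variant of regula falsi: halving the stale endpoint's
    # weight breaks the slow one-sided convergence of plain false
    # position, giving superlinear convergence on bracketed roots.
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a bracketing interval"
    x = a
    for _ in range(max_iter):
        x = (a * fb - b * fa) / (fb - fa)   # false-position point
        fx = f(x)
        if abs(fx) < tol:
            return x
        if fx * fb < 0:          # sign change between x and b
            a, fa = b, fb
        else:                    # same side as b: keep a, halve its weight
            fa *= 0.5
        b, fb = x, fx
    return x

root = illinois(math.cos, 1.0, 2.0)   # bracket contains pi/2
```

    Like bisection it can never leave the bracket, which is the whole
    attraction over raw secant.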

    Secant is too squirrelly to use by itself, because of its tendency to
    hit a flat spot and jump straight to East Limbo, never to return.
    Newton has that problem too, so you have to find some finite interval
    where f_1 * f_2 < 0, and reject steps that fall outside it.
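    That bracket-guarded recipe can be sketched as follows (a hypothetical
    helper for illustration, not any particular library's routine):

```python
import math

def safeguarded_newton(f, df, a, b, tol=1e-14, max_iter=100):
    # Newton confined to a bracket [a, b] with f(a)*f(b) < 0: any step
    # that would leave the bracket is replaced by a bisection step, so
    # the iteration cannot jump off to East Limbo.
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    x = 0.5 * (a + b)
    for _ in range(max_iter):
        fx = f(x)
        if fx == 0.0 or (b - a) < tol:
            return x
        if fa * fx < 0:          # root is in [a, x]
            b, fb = x, fx
        else:                    # root is in [x, b]
            a, fa = x, fx
        dfx = df(x)
        x = x - fx / dfx if dfx != 0 else a   # flat spot: force rejection
        if not (a < x < b):      # Newton step rejected
            x = 0.5 * (a + b)
    return x

root = safeguarded_newton(math.cos, lambda t: -math.sin(t), 1.0, 2.0)
```

    Worst case it degrades to bisection; on well-behaved stretches it runs
    at full Newton speed.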

    Besides that, there are two cases with Mr. Newton: if computing the
    derivative is comparatively cheap, as with trig functions, he gets an efficiency index of 2, but if not, he's stuck down at sqrt(2) with
    secant. (Sometimes the derivative is harder to compute than the value,
    e.g. with special functions or ratios of trig polynomials, in which case
    Newton is even worse off.)

    On the other hand, if the derivative can be computed without nasty
    significance loss due to cancellation, N. is likely to work better at
    high precision.

    I've nearly always used derivative-free methods, because I'm usually not
    too concerned with ultimate accuracy, and the best ones can reach an
    efficiency index of very nearly 2.0, which is pretty cool. (Besides,
    I'm too lazy to want to manage all of those derivatives.)

    My initial reaction was that it was tested library code so
    it must be my problem - until I traced into it and saw how
    it failed. It gives 8 instead of 16 sig fig in double
    precision for these pathological data.

    That's a good one.

    I was a bit surprised. It required an astonishingly exact
    combination of factors for it to trigger and one of my real
    problems hit it square on.

    So biased probing paid off here? Or only for diagnosis?

    It confirmed that given the right awkward parameters the solver would
    fail, but also that it was a very low-risk event: it took an
    overnight run just to find a single example using random data.

    I just happened to be unlucky in my exact choice of problem.
    Expansion around pi/4 was absolutely fine but the expansion around
    3pi/4 went completely haywire even though it differed only in a
    handful of sign changes to coefficients. I was utterly convinced it
    was my user error.

    In the 1980s I hit a similar problem, but it wasn't hard to find.
    In scaled integer arithmetic (no floating point then), it's very
    common to multiply two 32-bit integers together, yielding a 64-bit
    product, and then divide that by a 64-bit scaling factor, yielding
    a 32-bit quotient. In the Fortran of the day, if one just wrote
    this as (a*b)/c, the compiler would truncate the product (a*b) to
    32 bits before dividing by c, yielding a wildly wrong answer,
    without any warnings or drama. The solution was define a 64-bit
    intermediate variable, and do the operation in two lines: p=a*b,
    result = p/c.

    ISTR there being a library function MULDIV(a,b,c) for that at least
    on the platforms where I worked. We had REAL*4, REAL*8 and REAL*16
    but I think only INTEGER*1, INTEGER*2 and INTEGER*4 available (late
    70's).

    Cheers

    Phil Hobbs


    --
    Dr Philip C D Hobbs
    Principal Consultant
    ElectroOptical Innovations LLC / Hobbs ElectroOptics
    Optics, Electro-optics, Photonics, Analog Electronics
    Briarcliff Manor NY 10510

    http://electrooptical.net
    http://hobbs-eo.com

  • From Joe Gwinn@21:1/5 to '''newspam'''@nonad.co.uk on Mon Sep 11 16:58:20 2023
    On Mon, 11 Sep 2023 09:05:13 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:

    On 08/09/2023 22:14, Joe Gwinn wrote:
    On Fri, 8 Sep 2023 09:33:33 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:


    NR = Newton Raphson?

    Yes. Although these days my production code actually uses Halley's
    method in preference since it is almost the same speed to execute on
    modern pipeline CPUs and gives cubic rather than quadratic convergence.

    Interesting. I had not heard of Halley's Method, and cubic
    convergence is worthwhile.

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections provided
    that the function is sufficiently differentiable. They have to be
    computed very carefully with considerable attention to detail!

    And differentiation amplifies noise.


    One curiosity I have seen with them is that for me D_5 is anomalously
    fast despite containing D_4 ! On certain problems D5 is *faster* than NR
    (I suspect it is some sort of pipeline stall slowing the others down).

    In the old days, one would see a bunch of NOPs in the generated
    assembly, but that won't work for this. I would code a test case in
    assembly code and time it. Maybe by toggling a wire before and after,
    and looking with a scope.


    In the past they all got slower because there was a division for each
    step down the chain. I have a new computational scheme which avoids that
    (at an increased risk of under/over flow in some edge cases)

    The old dodge was to multiply by the pre computed reciprocal, if this
    could be computed less often than every division.


    The codes all run as expected on simple starting guesses but when the
    fancy cubic ones are used they show this peculiar speed variation. It is
    most noticeable with the Intel compiler (less so on MSC or GCC).

    The reason these high order methods are fashionable right now is that
    the iterative methods require extra function evaluations and for that
    you have to crystallise the next value of x1 and recompute f(x1),f'(x1)
    etc. Historically they would iterate 3-5 times to solve it. Having to
    wait for that new input value is an unavoidable hard pipeline stall.

    Use staggered parallel pipelines?


    But if you can get the initial guess with a relative error below
    10^(-16/N) then in an ideal world the Nth order difference corrector
    gets the answer in a single step and so the next value can be computed.

    Yes.

    I used to do this level of optimizing, but in operating-system
    kernels. High-level interrupt routines warranted personal instruction-by-instruction hand optimization, with a drawing for each
    and every step. It might have been 100 lines of assembly code.
    Multiplied by the interrupt rate and overhead of this or that bit of
    hardware. Eliminating even one clock cycle was worthwhile.


    Some practitioners prefer what they call modified NR which is a sort of
    hybrid of NR with Halley with slightly better convergence than Halley.
    However it involves a sqrt and isn't always better.

    Equation (4) in this paper (which includes the derivation) - method
    originally due to Laguerre but now seeing a revival in some codes.

    <https://academic.oup.com/mnras/article/467/2/1702/2929272>

    The gotcha is that although as written with abs() it never actually
    fails - a sufficiently bad guess will still give poor answers, but is
    much less likely to explode or oscillate in the way that NR does.

    So NR is better because it always warns?

    No. NR can go completely AWOL far too easily and diverge to infinity or
    oscillate chaotically between a set of values. Halley's method is the
    best of the three for stability, but MNR will get you slightly better
    precision if you can be sure your guesses are always close enough.

    I'm not convinced that it is worth the cost of a sqrt but with CPUs
    improving all the time it is now very close. The latest i5-12600 I have
    can do a sqrt in about the same time as it takes to do a divide!

    Wonder what algorithm it uses?

    My guess is some sort of cunning hardware shift, multiply and add. The
    latest generation of chips stand out in this respect. If only they would
    implement a hardware cube root then I would be very happy.

    That makes sense, as many have a barrel shifter in hardware.



    And then there is co-plotting and staring - the Mark I eyeball is a
    very good pattern and anomaly detector, especially of concentricity or
    linearity.

    Looking at the plot of residual errors can be very enlightening. So can
    looking at the histogram for many test cases of the error distribution
    of solve, test and then measure error in the resulting answer.

    It is remarkably easy to home in on any weak spots and edge cases. My
    normal reporting rule is printout exceptions whenever the worst case
    high or low error is exceeded. Verbose mode shows when it is equalled.

    Yes. But I always make plots and stare at them.

    I only do that if something catches my attention as potentially odd.

    I always do it, precisely to improve my chances of noticing something
    odd.


    My initial reaction was that it was tested library code so it must be my
    problem - until I traced into it and saw how it failed. It gives 8
    instead of 16 sig fig in double precision for these pathological data.
    That's a good one.

    I was a bit surprised. It required an astonishingly exact combination of
    factors for it to trigger and one of my real problems hit it square on.

    So biased probing paid off here? Or only for diagnosis?

    It confirmed that given the right awkward parameters the solver would
    fail, but also that it was a very low-risk event: it took an overnight
    run just to find a single example using random data.

    I just happened to be unlucky in my exact choice of problem. Expansion
    around pi/4 was absolutely fine but the expansion around 3pi/4 went
    completely haywire even though it differed only in a handful of sign
    changes to coefficients. I was utterly convinced it was my user error.

    In the 1980s I hit a similar problem, but it wasn't hard to find. In
    scaled integer arithmetic (no floating point then), it's very common
    to multiply two 32-bit integers together, yielding a 64-bit product,
    and then divide that by a 64-bit scaling factor, yielding a 32-bit
    quotient. In the Fortran of the day, if one just wrote this as
    (a*b)/c, the compiler would truncate the product (a*b) to 32 bits
    before dividing by c, yielding a wildly wrong answer, without any
    warnings or drama. The solution was define a 64-bit intermediate
    variable, and do the operation in two lines: p=a*b, result = p/c.

    ISTR there being a library function MULDIV(a,b,c) for that at least on
    the platforms where I worked. We had REAL*4, REAL*8 and REAL*16 but I
    think only INTEGER*1, INTEGER*2 and INTEGER*4 available (late 70's).

    Yes, it either didn't exist yet or didn't cover the specific case we
    had.

    A parallel story from the late 1970s: while building a C-141
    maintenance trainer (a mockup cockpit animated by a midi-computer of
    the day) in Fortran, we were forced to implement parts in assembler -
    specifically a collection of assembly-coded Fortran functions
    providing 32-bit bitwise logic. When I told the application
    programmers that they would be writing assembly, there was much
    wailing and rending of garments. But I provided a template and an
    annotated example, and they soon got the hang of it.

    Thirty or forty years later, I got a call out of the blue from a
    company I'd never heard of. I don't know how they got my name. They
    had just won a contract to port the now-legacy trainer code to C, and
    didn't know what to do with all that assembler. I told him the
    reason, and that this had been just before the ISA Extensions were
    released, but that all that assembler did was bitwise logic, which
    could be done directly in C. The caller was very happy to hear this.

    Joe Gwinn

  • From Martin Brown@21:1/5 to Phil Hobbs on Wed Sep 13 13:54:45 2023
    On 11/09/2023 13:36, Phil Hobbs wrote:
    On 2023-09-11 04:05, Martin Brown wrote:

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections
    provided that the function is sufficiently differentiable. They have
    to be computed very carefully with considerable attention to detail!


    It's been quite a while since I looked into the problem (and I was never
    close to your level of expertise there, Martin).  However, this being
    Usenet, I am undeterred. ;)

    Cubic methods are great as long as computing the second derivative is cheap.  BITD the "Illinois algorithm" and its relatives got a fair
    amount of attention (in select circles).  They're regula-falsi based,
    but with hyperlinear convergence.  The best that I know is a  Bjorck & Anderson variant whose efficiency index (basically the average order of convergence per function evaluation) gets up to  8**(1/4) ~ 1.68.
    This compares with 1.0 for bisection or vanilla Regula Falsi, and
    sqrt(2) for the secant method.

    That's a blast from the past. It used to be popular in some of these
    problems until a way was found to tame potential divergence (basically
    by making sure that the starting guess is always within the range where
    the sequence is absolutely convergent). Some analytical sleight of hand
    is needed to get it spot on or there will be some insanely small but
    still non-zero value where it goes crazy as eccentricity e -> 1.

    There has been a recent paper using bisection but it has various flaws -
    not least that it determines an accurate answer to the wrong question.

    Secant is too squirrelly to use by itself, because of its tendency to
    hit a flat spot and jump straight to East Limbo, never to return.
    Newton has that problem too, so you have to find some finite interval
    where f_1 * f_2 < 0, and reject steps that fall outside it.

    One of the cute features of Kepler's problem E = M + e*sin(E) is that
    one of the simplest naive upper bounds of E_0 = M+e is absolutely
    convergent so if you clip on that you remain numerically stable.
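    That clipping idea can be sketched with plain Newton polishing (a
    rough illustration only - Martin's production scheme is certainly more
    refined than this):

```python
import math

def kepler_E(M, e, tol=1e-14, max_iter=50):
    # Solve Kepler's equation M = E - e*sin(E) for the eccentric
    # anomaly E. Start from the naive upper bound E0 = M + e, then
    # Newton on f(E) = E - e*sin(E) - M, clipping each iterate to
    # [M - e, M + e] so it can never run away.
    E = M + e
    for _ in range(max_iter):
        f = E - e * math.sin(E) - M
        step = f / (1.0 - e * math.cos(E))   # f'(E) = 1 - e*cos(E) > 0
        E = min(max(E - step, M - e), M + e)
        if abs(step) < tol:
            break
    return E

E = kepler_E(1.0, 0.5)
```

    Because the true E always lies in [M - e, M + e], the clip costs
    nothing near the solution and only fires on a runaway step.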

    Besides that, there are two cases with Mr. Newton: if computing the derivative is comparatively cheap, as with trig functions, he gets an efficiency index of 2, but if not, he's stuck down at sqrt(2) with
    secant.  (Sometimes the derivative is harder to compute than the value,
    e.g. with special functions or ratios of trig polynomials, in which case Newton is even worse off.)

    The trend at the moment is to avoid system trig functions and use 5th
    order polynomial expansions around a table of equispaced fixed sin, cos
    values. Right now the trick for speed is to avoid divides where possible.
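    The shape of that table-plus-polynomial idea, as a rough sketch (the
    table size and polynomial order here are illustrative guesses, not the
    schemes being alluded to):

```python
import math

N = 256                                # illustrative table size
STEP = 2 * math.pi / N
TABLE = [(math.sin(k * STEP), math.cos(k * STEP)) for k in range(N)]

def table_sin(x):
    # Nearest table entry plus short truncated series in the small
    # remainder r, combined with the angle-addition formula
    # sin(kh + r) = sin(kh)cos(r) + cos(kh)sin(r). No divides needed.
    k = round(x / STEP)
    r = x - k * STEP                   # |r| <= STEP/2, so series converge fast
    s, c = TABLE[k % N]
    cos_r = 1 - r*r/2 + r**4/24        # truncated cos series
    sin_r = r - r**3/6 + r**5/120      # truncated sin series
    return s * cos_r + c * sin_r
```

    With 256 entries the remainder is at most ~0.012, so the truncated
    series already reach close to double precision.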

    On the other hand, if the derivative can be computed without nasty significance loss due to cancellation, N. is likely to work better at
    high precision.

    We always have analytical derivatives available (although they have some unpleasant properties in the singular corner).

    I've nearly always used derivative-free methods, because I'm usually not
    too concerned with ultimate accuracy, and the best ones can reach an
    efficiency index of very nearly 2.0, which is pretty cool. (Besides,
    I'm too lazy to want to manage all of those derivatives.)

    If I only have the function available numerically as a black box and no analytical derivative I tend to favour Golden ratio search for a maximum
    and then take a punt on computing the derivative numerically and seeing
    where NR actually lands. It either polishes the answer nicely or goes
    AWOL. A few extra function evaluations is usually neither here nor there.

    I did some work a long time ago on computing accurate numerical
    derivatives from equally spaced function evaluations. It still comes in
    handy sometimes (a bit of a lost art from the old days of manual
    computors sat at desks).
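    One classic example of such a formula is the five-point central
    difference (a textbook stencil offered for illustration, not the
    specific work described above):

```python
import math

def deriv5(f, x, h=1e-5):
    # 5-point central difference from equally spaced evaluations:
    # truncation error O(h^4), at the cost of four function calls.
    return (f(x - 2*h) - 8*f(x - h) + 8*f(x + h) - f(x + 2*h)) / (12*h)

d = deriv5(math.sin, 0.5)   # compare with cos(0.5)
```

    The step h trades truncation error against rounding error, which is
    exactly the sort of detail the old hand-computing tables got right.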

    What surprised me most was that a flaw like that could be lurking in
    almost every cubic solver (I bet NAGlib got it right) just waiting for
    the right set of unplayable data to come along.

    --
    Martin Brown

  • From Phil Hobbs@21:1/5 to Martin Brown on Wed Sep 13 15:24:53 2023
    Martin Brown <'''newspam'''@nonad.co.uk> wrote:
    On 11/09/2023 13:36, Phil Hobbs wrote:
    On 2023-09-11 04:05, Martin Brown wrote:

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections
    provided that the function is sufficiently differentiable. They have
    to be computed very carefully with considerable attention to detail!


    It's been quite a while since I looked into the problem (and I was never
    close to your level of expertise there, Martin).  However, this being
    Usenet, I am undeterred. ;)

    Cubic methods are great as long as computing the second derivative is
    cheap.  BITD the "Illinois algorithm" and its relatives got a fair
    amount of attention (in select circles).  They're regula-falsi based,
    but with hyperlinear convergence.  The best that I know is a  Bjorck &
    Anderson variant whose efficiency index (basically the average order of
    convergence per function evaluation) gets up to  8**(1/4) ~ 1.68.
    This compares with 1.0 for bisection or vanilla Regula Falsi, and
    sqrt(2) for the secant method.

    That's a blast from the past. It used to be popular in some of these
    problems until a way was found to tame potential divergence (basically
    by making sure that the starting guess is always within the range where
    the sequence is absolutely convergent). Some analytical sleight of hand
    is needed to get it spot on or there will be some insanely small but
    still non-zero value where it goes crazy as eccentricity e -> 1.

    There has been a recent paper using bisection but it has various flaws -
    not least that it determines an accurate answer to the wrong question.

    Could be a problem.


    Secant is too squirrelly to use by itself, because of its tendency to
    hit a flat spot and jump straight to East Limbo, never to return.
    Newton has that problem too, so you have to find some finite interval
    where f_1 * f_2 < 0, and reject steps that fall outside it.

    One of the cute features of Kepler's problem E = M + e*sin(E) is that
    one of the simplest naive upper bounds of E_0 = M+e is absolutely
    convergent so if you clip on that you remain numerically stable.

    Fun. Last time I used Kepler’s equation was on a problem set in about
    1980. I recall being too lazy to use Newton, so since it was a planetary
    orbit with epsilon <<1, I just rearranged terms and iterated on epsilon.
    (Prof Ovenden was obviously bored with grading Newtonian solutions, because
    he gave me an extra point.) ;)
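    That lazy rearrangement is just fixed-point iteration on
    E = M + e*sin(E), which contracts at rate ~e for small eccentricity
    (a sketch of the homework trick, nothing more):

```python
import math

def kepler_fixed_point(M, e, n_iter=40):
    # For e << 1 the map E -> M + e*sin(E) is a contraction with
    # Lipschitz constant ~e, so plain iteration converges linearly
    # at rate e - perfectly adequate for planetary eccentricities.
    E = M
    for _ in range(n_iter):
        E = M + e * math.sin(E)
    return E

E = kepler_fixed_point(1.0, 0.1)
```

    At e = 0.1 each pass adds about one decimal digit, so a few dozen
    iterations run it down to machine precision.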

    The best thing about it is the names—M is “mean anomaly” and E is “eccentric anomaly”.


    Besides that, there are two cases with Mr. Newton: if computing the
    derivative is comparatively cheap, as with trig functions, he gets an
    efficiency index of 2, but if not, he's stuck down at sqrt(2) with
    secant.  (Sometimes the derivative is harder to compute than the value,
    e.g. with special functions or ratios of trig polynomials, in which case
    Newton is even worse off.)

    The trend at the moment is to avoid system trig functions and use 5th
    order polynomial expansions around a table of equispaced fixed sin, cos
    values. Right now the trick for speed is to avoid divides where possible.

    On the other hand, if the derivative can be computed without nasty
    significance loss due to cancellation, N. is likely to work better at
    high precision.

    We always have analytical derivatives available (although they have some unpleasant properties in the singular corner).

    I've nearly always used derivative-free methods, because I'm usually not
    too concerned with ultimate accuracy, and the best ones can reach an
    efficiency index of very nearly 2.0, which is pretty cool.  (Besides,
    I'm too lazy to want to manage all of those derivatives.)

    If I only have the function available numerically as a black box and no analytical derivative I tend to favour Golden ratio search for a maximum
    and then take a punt on computing the derivative numerically and seeing
    where NR actually lands. It either polishes the answer nicely or goes
    AWOL. A few extra function evaluations is usually neither here nor there.

    I did some work a long time ago on computing accurate numerical
    derivatives from equally spaced function evaluations. It still comes in
    handy sometimes (a bit of a lost art from the old days of human
    computers sat at desks).
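
    The classic equispaced stencils are easy to state; here is one such rule
    (standard textbook coefficients, not the poster's specific scheme): the
    5-point central difference for the first derivative, accurate to O(h^4).

```python
def deriv_5pt(f, x, h):
    """5-point central difference for f'(x), O(h^4) truncation error.

    f'(x) ~ (f(x-2h) - 8 f(x-h) + 8 f(x+h) - f(x+2h)) / (12 h)
    """
    return (f(x - 2*h) - 8.0*f(x - h) + 8.0*f(x + h) - f(x + 2*h)) / (12.0*h)
```

    The step h has to balance truncation error (shrinks like h^4) against
    cancellation roundoff (grows like eps/h), which is exactly the
    "considerable attention to detail" the thread mentions.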

    Long ago I did one of those on an HP41C, along the way to an Adams-Bashforth-Moulton predictor-corrector ODE solver. (It’s easy to remember which is which—“Bashforth” is an excellent name for an extrapolator.) ;)

    What surprised me most was that a flaw like that could be lurking in
    almost every cubic solver (I bet NAGlib got it right) just waiting for
    the right set of unplayable data to come along.

    Cheers

    Phil Hobbs





    --
    Dr Philip C D Hobbs
    Principal Consultant
    ElectroOptical Innovations LLC / Hobbs ElectroOptics
    Optics, Electro-optics, Photonics, Analog Electronics

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Martin Brown@21:1/5 to Joe Gwinn on Fri Sep 15 10:55:34 2023
    On 11/09/2023 21:58, Joe Gwinn wrote:
    On Mon, 11 Sep 2023 09:05:13 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:

    On 08/09/2023 22:14, Joe Gwinn wrote:
    On Fri, 8 Sep 2023 09:33:33 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:


    NR = Newton Raphson?

    Yes. Although these days my production code actually uses Halley's
    method in preference since it is almost the same speed to execute on
    modern pipeline CPUs and gives cubic rather than quadratic convergence.
    Interesting. I had not heard of Halley's Method, and cubic
    convergence is worthwhile.

    Indeed. Halley was pretty clever but forever overshadowed by Newton.
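
    For the record, Halley's update is x -= 2 f f' / (2 f'^2 - f f''). A
    minimal sketch applying it to Kepler's equation E - e sin E = M (the M
    and E anomalies from earlier in the thread; the starting guess and names
    here are my own illustration, not the production code under discussion):

```python
import math

def kepler_halley(M, e, tol=1e-14):
    """Solve Kepler's equation E - e*sin(E) = M by Halley's method.

    Halley's update E -= 2*f*f' / (2*f'^2 - f*f'') gives cubic
    convergence, at the cost of one extra derivative per step.
    """
    E = M + e * math.sin(M)          # simple starting guess
    for _ in range(20):
        s, c = math.sin(E), math.cos(E)
        f = E - e * s - M
        fp = 1.0 - e * c             # f'(E)
        fpp = e * s                  # f''(E)
        dE = 2.0 * f * fp / (2.0 * fp * fp - f * fpp)
        E -= dE
        if abs(dE) < tol:
            break
    return E

E = kepler_halley(1.0, 0.3)
```

    Note that sin(E) and cos(E) are computed once and reused for f, f' and
    f'', which is why Halley costs barely more than Newton per iteration on
    this problem.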

    Some of the latest codes use Delta_4 and Delta_5 corrections provided
    that the function is sufficiently differentiable. They have to be
    computed very carefully with considerable attention to detail!

    And differentiation amplifies noise.

    Symbolic differentiation doesn't. We have analytical expressions for the problem to solve and most of them are differentiable several times apart
    from where there are poles. Those regions need extra care.

    One curiosity I have seen with them is that for me D_5 is anomalously
    fast despite containing D_4 ! On certain problems D5 is *faster* than NR
    (I suspect it is some sort of pipeline stall slowing the others down).

    In the old days, one would see a bunch of NOPs in the generated
    assembly, but that won't work for this. I would code a test case in
    assembly code and time it. Maybe by toggling a wire before and after,
    and looking with a scope.

    I tend to use RDTSC; although it is deprecated for this sort of thing it
    works fairly well. You just have to be aware of its limitations. The
    ultimate way to do it is to run a shim that lets you access the machine
    specific registers and directly measure pipeline stalls and other modes
    of failure. I haven't done that for this code (yet).

    In the past they all got slower because there was a division for each
    step down the chain. I have a new computational scheme which avoids that
    (at an increased risk of under/over flow in some edge cases)

    The old dodge was to multiply by the precomputed reciprocal, if this
    could be computed less often than every division.

    These days you can pretty much rely on any decent compiler to precompute
    static constants into whatever form is best. The strength reduction
    rules work well. You can write x/3.0 confident in the knowledge that it
    will compile to x*(1.0/3) with the 1.0/3 computed at compile time.
    (although it is as well to check this on new compilers)

    Even if you do something like several expressions in a row /x most of
    the latest compilers will compute 1/x and then multiply it out. Trying
    to micromanage such things can actually make it worse since the compiler
    might store the intermediate value in memory if you code it explicitly.

    The codes all run as expected on simple starting guesses but when the
    fancy cubic ones are used they show this peculiar speed variation. It is
    most noticeable with the Intel compiler (less so on MSC or GCC).

    The reason these high order methods are fashionable right now is that
    the iterative methods require extra function evaluations and for that
    you have to crystallise the next value of x1 and recompute f(x1),f'(x1)
    etc. Historically they would iterate 3-5 times to solve it. Having to
    wait for that new input value is an unavoidable hard pipeline stall.

    Use staggered parallel pipelines?

    There is something about the odd numbered differences that makes them
    unusually fast.

    But if you can get the initial guess with a relative error below
    10^(-16/N) then in an ideal world the Nth order difference corrector
    gets the answer in a single step and so the next value can be computed.

    Yes.

    It is only being done because of the requirement for speed and accuracy
    at the same time. Every cycle counts. Someone is even working on a
    dedicated hardware solution using a variant of CORDIC for this problem.

    https://arxiv.org/abs/2008.02894

    They are getting quite close to having working hardware now.
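
    For readers who haven't met it, the CORDIC kernel the paper builds on is
    a pure shift-and-add rotation scheme. A minimal float sketch (mine, not
    the paper's variant; in hardware the 2^-i factors are barrel shifts and
    the atan values come from a small ROM):

```python
import math

def cordic_sincos(theta, n=40):
    """Circular CORDIC in rotation mode.

    Valid for |theta| <= sum(atan(2^-i)) ~ 1.743 rad. The gain K is
    folded in up front, so each iteration is only shifts, adds and a
    table lookup when done in fixed point (floats stand in here).
    """
    K = 1.0
    for i in range(n):
        K /= math.sqrt(1.0 + 4.0 ** -i)   # accumulated rotation gain
    x, y, z = K, 0.0, theta
    for i in range(n):
        d = 1.0 if z >= 0.0 else -1.0     # rotate toward z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    return y, x                           # (sin(theta), cos(theta))
```

    Each iteration buys roughly one more bit of angle, which is why the
    hardware trade-off against a multiplier-based polynomial is so close.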
    I'm not convinced that it is worth the cost of a sqrt but with CPUs
    improving all the time it is now very close. The latest i5-12600 I have
    can do a sqrt in about the same time as it takes to do a divide!

    Wonder what algorithm it uses?

    My guess is some sort of cunning hardware shift, multiply and add. The
    latest generation of chips stand out in this respect. If only they would
    implement a hardware cube root then I would be very happy.

    That makes sense, as many have a barrel shifter in hardware.
    ISTR there being a library function MULDIV(a,b,c) for that at least on
    the platforms where I worked. We had REAL*4, REAL*8 and REAL*16 but I
    think only INTEGER*1, INTEGER*2 and INTEGER*4 available (late 70's).

    Yes, it either didn't exist yet or didn't cover the specific case we
    had.

    I'm pretty sure it was in one of the locally written system libraries. I
    also had a routine to mask out certain harmless underflows in an
    astrophysics FLIC code. It was scaled in SI units which were something
    like h/c^2 which didn't leave a lot of room in 32bit floats.

    The original code spent most of its time in underflow recovery routines.
    I got an immediate 5x speed improvement with no loss of accuracy with
    just a few lines of assembler added to the project.

    Eventually the problem was rescaled to put things into the middle ground
    so that underflows were much less likely.

    --
    Martin Brown

  • From Joe Gwinn@21:1/5 to '''newspam'''@nonad.co.uk on Fri Sep 15 13:57:25 2023
    On Fri, 15 Sep 2023 10:55:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:

    On 11/09/2023 21:58, Joe Gwinn wrote:
    On Mon, 11 Sep 2023 09:05:13 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:

    On 08/09/2023 22:14, Joe Gwinn wrote:
    On Fri, 8 Sep 2023 09:33:33 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:


    NR = Newton Raphson?

    Yes. Although these days my production code actually uses Halley's
    method in preference since it is almost the same speed to execute on
    modern pipeline CPUs and gives cubic rather than quadratic convergence.
    Interesting. I had not heard of Halley's Method, and cubic
    convergence is worthwhile.

    Indeed. Halley was pretty clever but forever overshadowed by Newton.

    Some of the latest codes use Delta_4 and Delta_5 corrections provided
    that the function is sufficiently differentiable. They have to be
    computed very carefully with considerable attention to detail!

    And differentiation amplifies noise.

    Symbolic differentiation doesn't. We have analytical expressions for the
    problem to solve and most of them are differentiable several times apart
    from where there are poles. Those regions need extra care.

    One curiosity I have seen with them is that for me D_5 is anomalously
    fast despite containing D_4 ! On certain problems D5 is *faster* than NR
    (I suspect it is some sort of pipeline stall slowing the others down).

    In the old days, one would see a bunch of NOPs in the generated
    assembly, but that won't work for this. I would code a test case in
    assembly code and time it. Maybe by toggling a wire before and after,
    and looking with a scope.

    I tend to use RDTSC; although it is deprecated for this sort of thing it
    works fairly well. You just have to be aware of its limitations. The
    ultimate way to do it is to run a shim that lets you access the machine
    specific registers and directly measure pipeline stalls and other modes
    of failure. I haven't done that for this code (yet).

    Well, the old EE's approach always involves an oscilloscope and an
    electrical signal from the device under test, simple and physical, so
    it cannot so easily lie. Well, before the advent of digital scopes,
    but still...



    In the past they all got slower because there was a division for each
    step down the chain. I have a new computational scheme which avoids that
    (at an increased risk of under/over flow in some edge cases)

    The old dodge was to multiply by the precomputed reciprocal, if this
    could be computed less often than every division.

    These days you can pretty much rely on any decent compiler to precompute
    static constants into whatever form is best. The strength reduction
    rules work well. You can write x/3.0 confident in the knowledge that it
    will compile to x*(1.0/3) with the 1.0/3 computed at compile time.
    (although it is as well to check this on new compilers)

    Even if you do something like several expressions in a row /x most of
    the latest compilers will compute 1/x and then multiply it out. Trying
    to micromanage such things can actually make it worse since the compiler
    might store the intermediate value in memory if you code it explicitly.

    Yes.


    The codes all run as expected on simple starting guesses but when the
    fancy cubic ones are used they show this peculiar speed variation. It is
    most noticeable with the Intel compiler (less so on MSC or GCC).

    The reason these high order methods are fashionable right now is that
    the iterative methods require extra function evaluations and for that
    you have to crystallise the next value of x1 and recompute f(x1),f'(x1)
    etc. Historically they would iterate 3-5 times to solve it. Having to
    wait for that new input value is an unavoidable hard pipeline stall.

    Use staggered parallel pipelines?

    There is something about the odd numbered differences that makes them
    unusually fast.

    I'd be looking at hardware details like how this all maps onto
    physical memory, especially cache lines and the like.


    But if you can get the initial guess with a relative error below
    10^(-16/N) then in an ideal world the Nth order difference corrector
    gets the answer in a single step and so the next value can be computed.

    Yes.

    It is only being done because of the requirement for speed and accuracy
    at the same time. Every cycle counts. Someone is even working on a
    dedicated hardware solution using a variant of CORDIC for this problem.

    <https://arxiv.org/abs/2008.02894>

    They are getting quite close to having working hardware now.

    I've seen some CORDIC variations, typically for ellipses, but not this
    one.


    I'm not convinced that it is worth the cost of a sqrt but with CPUs
    improving all the time it is now very close. The latest i5-12600 I have
    can do a sqrt in about the same time as it takes to do a divide!

    Wonder what algorithm it uses?

    My guess is some sort of cunning hardware shift, multiply and add. The
    latest generation of chips stand out in this respect. If only they would
    implement a hardware cube root then I would be very happy.

    That makes sense, as many have a barrel shifter in hardware.
    ISTR there being a library function MULDIV(a,b,c) for that at least on
    the platforms where I worked. We had REAL*4, REAL*8 and REAL*16 but I
    think only INTEGER*1, INTEGER*2 and INTEGER*4 available (late 70's).

    Yes, it either didn't exist yet or didn't cover the specific case we
    had.

    I'm pretty sure it was in one of the locally written system libraries. I
    also had a routine to mask out certain harmless underflows in an
    astrophysics FLIC code. It was scaled in SI units which were something
    like h/c^2 which didn't leave a lot of room in 32bit floats.

    Sounds like Planck Scale?


    The original code spent most of its time in underflow recovery routines.
    I got an immediate 5x speed improvement with no loss of accuracy with
    just a few lines of assembler added to the project.


    Been there done that.

    One standard tool is a sampler tied to a hardware timer interrupt
    programmed for random intervals between interrupts, collecting PC
    addresses. The "random interval" must at least be mutually prime with
    respect to all periodic processes in the system under test.

    The histogram of PC addresses shows where the program was spending its
    time, often in a place nobody ever heard of, usually deep in some
    obscure part of the runtime system. Like underflow handling.
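
    The same idea can be sketched in a few lines on a POSIX system (my own
    toy illustration, not the tool Joe describes, and it samples Python
    frames rather than raw PC addresses): a timer signal with a randomized
    re-arm interval records where execution was each time it fires.

```python
import random
import signal
from collections import Counter

samples = Counter()

def _sample(signum, frame):
    # Record where the main thread was when the timer fired.
    samples[(frame.f_code.co_name, frame.f_lineno)] += 1
    # Re-arm with a randomized interval so the sampler doesn't beat
    # against any periodic behaviour in the code under test.
    signal.setitimer(signal.ITIMER_REAL, random.uniform(0.002, 0.008))

def hot_loop(n):
    """Stand-in workload; the histogram should finger this function."""
    total = 0
    for i in range(n):
        total += i * i
    return total

signal.signal(signal.SIGALRM, _sample)
signal.setitimer(signal.ITIMER_REAL, 0.005)   # start sampling
hot_loop(3_000_000)
signal.setitimer(signal.ITIMER_REAL, 0.0)     # stop sampling
```

    Afterwards `samples.most_common()` is the histogram: the heaviest
    entries show where the time actually went, underflow handlers and all.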


    Eventually the problem was rescaled to put things into the middle ground
    so that underflows were much less likely.

    What I've seen used a lot is straight SI units with double precision
    (64-bit) IEEE floating point arithmetic, with all conversions to and
    from legacy units performed only at the interfaces.


    Joe Gwinn
