    From Computer Nerd Kev@21:1/5 to All on Sun Dec 25 07:37:14 2022
    Why is it so difficult to retrain neural networks and get the same
    results?
    By Pete Warden, November 25, 2022
    - https://petewarden.com/2022/11/25/why-is-it-so-difficult-to-retrain-neural-networks-and-get-the-same-results/

    "Last week I had a question from a colleague about reproducibility
    in TensorFlow, specifically in the 1.14 era. He wanted to be able
    to run the same training code multiple times and get exactly the
    same results, which on the surface doesn't seem like an
    unreasonable expectation. Machine learning training is
    fundamentally a series of arithmetic operations applied repeatedly,
    so what makes getting the same results every time so hard? I had
    the same question when we first started TensorFlow, and I was lucky
    enough to learn some of the answers from the numerical programming
    experts on the team, so I want to share a bit of what I discovered.

    There are good guides to achieving reproducibility out there, but
    they don't usually include explanations for why all the steps
    involved are necessary, or why training becomes so slow when you do
    apply them. One reason reproducibility is so hard is that every
    single calculation in the process has the potential to change the
    end results, which means every step is a potential weak link. This
    means you have to worry about everything from random seeds (which
    are actually fairly easy to make reproducible, but can be hard to
    locate in a big code base) to code in external libraries. CuDNN
    doesn't guarantee determinism by default, for example, nor do some
    operations like reductions in TensorFlow's CPU code.
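
    In practice that checklist looks something like the sketch below. It
    uses the TensorFlow 2.x entry points; the 1.14-era code the question
    was about relied on tf.set_random_seed and, for CuDNN, third-party
    patches or environment variables instead, so treat this as an
    illustrative sketch rather than the article's own recipe:

        import os
        import random

        import numpy as np
        import tensorflow as tf

        # Pin every random number generator the training script touches.
        random.seed(42)          # Python's built-in RNG
        np.random.seed(42)       # NumPy, used by much data preprocessing
        tf.random.set_seed(42)   # TensorFlow's global seed

        # PYTHONHASHSEED only takes effect at interpreter startup, so it
        # must be set in the shell before launching; setting it here only
        # covers subprocesses.
        os.environ["PYTHONHASHSEED"] = "0"

        # Added in TF 2.8: raise an error on (or swap out) kernels with
        # no deterministic implementation, such as CuDNN's autotuned
        # convolutions and atomics-based reductions. This is also where
        # much of the reproducibility slowdown comes from.
        tf.config.experimental.enable_op_determinism()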

    It was the code execution part that confused me the most. I could
    understand the need to set random seeds correctly, but why would
    any numerical library with exactly the same inputs sometimes
    produce different outputs? It seemed like there must be some
    horrible design flaw to allow that to happen! Thankfully my
    teammates helped me wrap my head around what was going on.

    The key thing I was missing was timing. To get the best
    performance, numerical code needs to run on multiple cores, whether
    on the CPU or the GPU. The important part to understand is that how
    long each core takes to finish its share of the work is not
    deterministic. Lots of external factors, from the presence of data
    in the cache to interruptions from multi-tasking, can affect the
    timing. This means
    that the order of operations can change. Your high school math
    class might have taught you that x + y + z will produce the same
    result as z + y + x, but in the imperfect world of floating point
    numbers that ain't necessarily so. To illustrate this, I've created
    a short example program in the Godbolt Compiler Explorer." ...
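
    The Godbolt program itself isn't reproduced here, but the effect is
    easy to show with a stand-in sketch in Python (not the article's
    example): in float32, the same three numbers summed in a different
    order give different answers, which is exactly what happens when
    parallel cores combine their partial sums in whatever order they
    happen to finish.

        import numpy as np

        x, y, z = np.float32(1e8), np.float32(1.0), np.float32(-1e8)

        # float32 carries only about 7 significant decimal digits, so
        # 1e8 + 1.0 rounds back to 1e8 and the 1.0 is lost entirely.
        print((x + y) + z)   # 0.0

        # Summed in a different order, the large values cancel first and
        # the 1.0 survives.
        print((x + z) + y)   # 1.0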

    --
     __  __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)