Why is it so difficult to retrain neural networks and get the same
results?
By Pete Warden, November 25, 2022
https://petewarden.com/2022/11/25/why-is-it-so-difficult-to-retrain-neural-networks-and-get-the-same-results/

"Last week I had a question from a colleague about reproducibility
in TensorFlow, specifically in the 1.14 era. He wanted to be able
to run the same training code multiple times and get exactly the
same results, which on the surface doesn't seem like an
unreasonable expectation. Machine learning training is
fundamentally a series of arithmetic operations applied repeatedly,
so what makes getting the same results every time so hard? I had
the same question when we first started TensorFlow, and I was lucky
enough to learn some of the answers from the numerical programming
experts on the team, so I want to share a bit of what I discovered.

There are good guides to achieving reproducibility out there, but
they don't usually include explanations for why all the steps
involved are necessary, or why training becomes so slow when you do
apply them. One reason reproducibility is so hard is that every
single calculation in the process has the potential to change the
end results, which means every step is a potential weak link. This
means you have to worry about everything from random seeds (which
are actually fairly easy to make reproducible, but can be hard to
locate in a big code base) to code in external libraries. CuDNN
doesn't guarantee determinism by default, for example, nor do some
operations like reductions in TensorFlow's CPU code.
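
As a minimal sketch of what that seed-and-determinism setup usually
looks like in Python (the switch and function names below come from
newer TensorFlow releases, not necessarily the 1.14-era APIs the
question was about):

import os

# Ask for deterministic kernels before TensorFlow is imported. The exact
# switches vary by release; TF_DETERMINISTIC_OPS is recognized by newer
# builds, while the 1.14 era needed per-op workarounds instead.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

import random
import numpy as np
import tensorflow as tf

SEED = 42

# Seed every random number generator the training code might touch.
random.seed(SEED)         # Python's built-in RNG (e.g. data shuffling)
np.random.seed(SEED)      # NumPy, used by many input pipelines
tf.random.set_seed(SEED)  # tf.set_random_seed(SEED) in the 1.x API

# Recent releases also expose an explicit switch that forces deterministic
# op implementations, at some cost in training speed:
# tf.config.experimental.enable_op_determinism()
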
It was the code execution part that confused me the most. I could
understand the need to set random seeds correctly, but why would
any numerical library with exactly the same inputs sometimes
produce different outputs? It seemed like there must be some
horrible design flaw to allow that to happen! Thankfully my
teammates helped me wrap my head around what was going on.

The key thing I was missing was timing. To get the best
performance, numerical code needs to run on multiple cores, whether
on the CPU or the GPU. The important part to understand is that how
long each core takes to complete is not deterministic. Lots of
external factors, from the presence of data in the cache to
interruptions from multi-tasking can affect the timing. This means
that the order of operations can change. Your high school math
class might have taught you that x + y + z will produce the same
result as z + y + x, but in the imperfect world of floating point
numbers that ain't necessarily so. To illustrate this, I've created
a short example program in the Godbolt Compiler Explorer." ...
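
To see the non-associativity concretely without the Godbolt example,
here is a minimal Python sketch; the values are chosen only to make the
rounding visible:

# Floating-point addition is not associative, so the result depends on
# the order in which the terms are combined.
print((1e16 + 1.0) - 1e16)   # 0.0 -- the 1.0 is absorbed by the huge term
print((1e16 - 1e16) + 1.0)   # 1.0 -- same three terms, different grouping

# The same effect shows up in reductions: summing identical values in a
# different order can give a different answer.
values = [1e16, 1.0, -1e16, 1.0]
print(sum(values))            # 1.0
print(sum(reversed(values)))  # 0.0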