Forum: >>> Magnum BBS <<<

BG rollouts should be done like placebo-controlled clinical trials

From MK@21:1/5 to All on Wed Apr 5 17:26:35 2023

Warning! Reading this article may cause disillusionment.
Proceed at your own risk. It is quite long but if you take
my word for my rollout results, you can skip the tedious
numbers and just read my comments following each set.

For years I have argued that training BG AI bots through
cubeless 1-point games and then extrapolating cubeful
equities and match equities through the application of
fanciful formulas and METs built from recursive/circular
statistics injects biases and systematic errors that result
in inaccuracies of unknown magnitude.

Consequently, the way rollouts are currently done causes
even the slightest inaccuracies to compound and accumulate
and become mistakenly significant, while going unnoticed
by the unsuspecting bot believers.

The following is a demonstration of this using the sample
position from another thread titled "Slot or not?":

https://groups.google.com/g/rec.games.backgammon/c/tH5CbO_-8c4/m/4j8zw9kSAQAJ

XGID=-dBBB-BD-----A--bb-cbB-b--:1:-1:1:42:0:0:0:0:10

The "placebo" in my "clinical rollout trials" would be the
"random play" but we can't even begin to think about
testing using Ex-Gee because it offers no such features.

Luckily, Noo-BG allows separate player strength settings
including a "noise" setting that may come close enough
to random play here as an initial step towards our goal.

Note: I think non-deterministic noise would be closer to
random play but using it in any settings combinations
causes Noo-BG to crash each and every single time. So,
I had to settle for deterministic noise in my rollouts.

All rollouts done for 1296 trials, with all general options
unchecked except for "Bearoff Truncation" and "Variance
reduction", with all player options unchecked except for
"Deterministic noise = 1" in both columns as appropriate.

All stats below are in order of Win%, Win(g)%, Win(bg)%,
Lose(g)%, Lose(bg)%, Cubeless(eq) and Cubeful(eq),
with equity differences between moves in parentheses.

With our stage set, let's now start by establishing some
base lines for the top 3 moves in the sample position.

GNUbg ID: xs4GADy22QMBBg:QQkKAAAAAAAE

First, let's ask for a hint at grandmaster (3-ply) level
(these rows also show Lose% after Win(bg)% and end
with a single equity):

13/9 7/5 == 80.9 | 42.3 | 4.4 | 19.1 | 4.4 | 0.2 | +0.974
13/11 7/3 == 75.8 | 36.8 | 4.1 | 24.2 | 4.7 | 0.1 | +0.795 | (-0.179)
13/7 == 76.2 | 36.1 | 3.8 | 23.8 | 4.7 | 0.1 | +0.794 | (-0.179)

Next, let's do a cubeless rollout of the top 3 moves:

13/9 7/5 == 80.4 | 42.5 | 4.5 | 4.7 | 0.2 | +1.030
13/11 7/3 == 74.8 | 36.2 | 4.2 | 5.0 | 0.2 | +0.848 | (-0.182)
13/7 == 75.2 | 34.6 | 3.8 | 4.9 | 0.2 | +0.837 | (-0.193)
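
The cubeless equities in these rows can be cross-checked from the five percentages. A minimal sketch, assuming the standard cubeless money-play formula and assuming the gammon figures include backgammons; the function name is made up for illustration and is not anything from the bot:

```python
def cubeless_equity(win, win_g, win_bg, lose_g, lose_bg):
    """Cubeless money equity from percentages given on a 0-100 scale.

    Assumption: gammon percentages are cumulative (they include backgammons).
    """
    w, wg, wbg, lg, lbg = (
        x / 100.0 for x in (win, win_g, win_bg, lose_g, lose_bg)
    )
    # Plain win/loss part, plus one extra point per gammon and per backgammon.
    return (2 * w - 1) + (wg - lg) + (wbg - lbg)


# Cubeless rollout row for 13/9 7/5: 80.4 | 42.5 | 4.5 | 4.7 | 0.2
print(round(cubeless_equity(80.4, 42.5, 4.5, 4.7, 0.2), 3))
```

The result should land within rounding distance of the reported +1.030, since the percentages are only shown to one decimal.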

Then, let's do a cubeful rollout of the same 3 moves:

13/9 7/5 == 80.4 | 42.8 | 4.7 | 4.7 | 0.2 | +1.035 | +0.956
13/11 7/3 == 74.9 | 36.5 | 4.5 | 5.1 | 0.3 | +0.853 | +0.750 | (-0.206)
13/7 == 75.3 | 35.1 | 4.0 | 4.9 | 0.2 | +0.845 | +0.743 | (-0.213)

At this point, we can already see that equity differences
between the moves get bigger in the cubeless rollout vs.
the hint and even more so in the cubeful rollout because
with both sides set to high skill levels, checker and cube
inaccuracies compound and accumulate at higher rates,
by "echoing" (being duplicated) on both sides.

Playing the above 3 moves results in these 3 positions:

13/9 7/5 == GNUbg ID: trUTAAbGzgYAPA:AQEKAAAAAAAE
13/11 7/3 == GNUbg ID: drNDAAbGzgYAPA:AQEKAAAAAAAE
13/7 == GNUbg ID: ttkHAAbGzgYAPA:AQEKAAAAAAAE

At this point, we are ready to do our "clinical rollouts"
for each position, with now the player "O" being on roll,
but before the dice roll. (I did the rollouts separately and
grouped them by type of rollout for ease of comparing.)

Again, let's first do a cubeful rollout with both players
set to grandmaster, to establish a second base line:

13/9 7/5 == 19.7 | 4.5 | 0.2 | 42.9 | 4.5 | -1.034 | -0.955
13/11 7/3 == 25.4 | 5.2 | 0.2 | 36.5 | 4.5 | -0.849 | -0.744 | (+0.211)
13/7 == 24.9 | 5.0 | 0.2 | 35.1 | 4.2 | -0.843 | -0.741 | (+0.214)

The results are very close to the above cubeful rollout
and only slightly different perhaps because we made a
move and are one less turn away from the end of game.

Next, let's do a cubeful rollout with "O" as grandmaster
(3-ply) and "X" as expert (0-ply) with maximum noise:

13/9 7/5 == 74.3 | 36.6 | 4.1 | 6.8 | 0.6 | +0.817 | +1.436
13/11 7/3 == 77.0 | 34.8 | 4.1 | 4.8 | 0.2 | +0.979 | +1.535 | (-0.099)
13/7 == 75.9 | 35.7 | 3.5 | 6.2 | 0.5 | +0.845 | +1.486 | (-0.050)

(Interestingly here 13/7 is better than 13/11 7/3)

Then, let's do a cubeful rollout with "X" as grandmaster
(3-ply) and "O" as expert (0-ply) with maximum noise:

13/9 7/5 == 0.0 | 0.0 | 0.0 | 92.8 | 43.0 | -2.372 | -2.374
13/11 7/3 == 0.8 | 0.2 | 0.0 | 91.2 | 38.9 | -2.283 | -2.281 | (+0.093)
13/7 == 0.0 | 0.0 | 0.0 | 87.8 | 39.4 | -2.272 | -2.270 | (+0.104)

At this point, we can see that equity differences between
the moves instead get smaller when either side is set to
play close to randomly because the checker and the cube
inaccuracies don't "echo" but compound and accumulate
on only one side.

Let's continue, by again doing first a cubeless rollout with
both sides as grandmaster, to establish another base line:

13/9 7/5 == 19.6 | 4.7 | 0.2 | 42.5 | 4.5 | -1.030
13/11 7/3 == 25.2 | 5.0 | 0.2 | 36.2 | 4.2 | -0.848 | (+0.182)
13/7 == 24.8 | 4.9 | 0.2 | 34.6 | 3.8 | -0.837 | (+0.193)

Cubeless equities here are similar to cubeless equities in
the above "clinical cubeful rollout" but equity differences
are yet a little smaller because checker inaccuracies don't
get compounded with cube inaccuracies.

Next, let's do a cubeless rollout with "O" as grandmaster
(3-ply) and "X" as expert (0-ply) with maximum noise:

13/9 7/5 == 79.8 | 48.0 | 6.6 | 6.2 | 0.6 | +1.075
13/11 7/3 == 82.7 | 45.9 | 6.7 | 5.4 | 0.6 | +1.120 | (-0.045)
13/7 == 82.9 | 47.5 | 5.5 | 5.7 | 0.6 | +1.125 | (-0.050)

Then, let's do a cubeless rollout with "X" as grandmaster
(3-ply) and "O" as expert (0-ply) with maximum noise:

13/9 7/5 == 0.0 | 0.0 | 0.0 | 93.1 | 42.9 | -2.375
13/11 7/3 == 0.1 | 0.0 | 0.0 | 91.5 | 38.9 | -2.302 | (+0.073)
13/7 == 0.0 | 0.0 | 0.0 | 91.3 | 35.8 | -2.277 | (+0.098)

At this point, we can again see that when either side plays
close to randomly, equity differences between the moves
get even smaller (although not by as much), compared to
the corresponding "clinical cubeless rollouts" because not
only do checker and cube inaccuracies not compound,
but checker inaccuracies also don't echo and accumulate
on only one side.

Let's complete our test by doing two more rollouts with
both players set to expert (0-ply) with maximum noise,
i.e. both playing almost randomly.

First, cubeful for 1,296 and 5,184 trials to double-stitch:

[1,296] 13/9 7/5 == 13.0 | 0.0 | 0.1 | 43.1 | 7.7 | -1.249 | -1.246
[5,184] 13/9 7/5 == 12.0 | 0.0 | 0.1 | 43.6 | 7.7 | -1.275 | -1.271
[1,296] 13/11 7/3 == 11.0 | 0.2 | 0.0 | 40.7 | 4.9 | -1.233 | -1.226 | (+0.020)
[5,184] 13/11 7/3 == 10.4 | 0.0 | 0.0 | 43.3 | 5.9 | -1.285 | -1.277 | (-0.006)
[1,296] 13/7 == 11.5 | 0.0 | 0.1 | 41.0 | 6.2 | -1.247 | -1.242 | (+0.004)
[5,184] 13/7 == 11.1 | 0.0 | 0.0 | 41.3 | 6.3 | -1.257 | -1.249 | (+0.022)

(Here 13/7 is better than 13/11 7/3 in 1,296 trials and
13/11 7/3 is better than even 13/9 7/5 in 5,184 trials)
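
The ranking flip between the two trial counts can be reproduced from the cubeful column alone. A small sketch, assuming (my reading of the rows, since "O" is on roll) that the equities are from O's point of view, so the move leaving O the lowest equity is best for X; `rank_moves` is a hypothetical helper:

```python
# Cubeful equities copied from the both-sides-noisy rollout rows,
# assumed to be from O's point of view (O is on roll).
cubeful = {
    1296: {"13/9 7/5": -1.246, "13/11 7/3": -1.226, "13/7": -1.242},
    5184: {"13/9 7/5": -1.271, "13/11 7/3": -1.277, "13/7": -1.249},
}


def rank_moves(equities):
    """Return move names ordered from best to worst for X.

    Lower equity for O means a better move for X, so sort ascending.
    """
    return sorted(equities, key=equities.get)


for trials, equities in cubeful.items():
    print(trials, rank_moves(equities))
```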

Second, cubeless again for 1,296 as well as 5,184 trials:

[1,296] 13/9 7/5 == 13.0 | 1.6 | 0.1 | 42.6 | 7.6 | -1.224
[5,184] 13/9 7/5 == 11.8 | 1.3 | 0.1 | 43.0 | 7.7 | -1.256
[1,296] 13/11 7/3 == 11.2 | 0.7 | 0.0 | 39.8 | 4.9 | -1.217 | (+0.007)
[5,184] 13/11 7/3 == 10.7 | 0.6 | 0.0 | 42.5 | 5.9 | -1.264 | (-0.008)
[1,296] 13/7 == 12.1 | 0.7 | 0.1 | 40.2 | 6.2 | -1.215 | (+0.009)
[5,184] 13/7 == 11.4 | 0.7 | 0.0 | 40.6 | 6.3 | -1.232 | (+0.024)

(Here 13/11 7/3 is better than 13/9 7/5 in 5,184 trials)

What we see here is that when both sides make random
checker and random cube decisions, there is practically
no difference between cubeful and cubeless rollouts.

However, what's important is that the equity differences
between the 3 moves have now become the smallest in
this test because there are no accumulated inaccuracies
resulting from human bias injected through METs, fart-ass
formulas, etc. In fact, the 3 moves become so close that
their rankings change in different ways.
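
To put the shrinking differences side by side, here is a rough sketch that takes the cubeful equities of the 3 moves from each "clinical" rollout in this post and prints the gap between the highest and lowest one. The labels and the `spread` helper are made up for illustration; the numbers are only the ones already listed:

```python
# Cubeful equities for 13/9 7/5, 13/11 7/3 and 13/7, in that order.
rollouts = {
    "both grandmaster": (-0.955, -0.744, -0.741),
    "O grandmaster, X noisy": (1.436, 1.535, 1.486),
    "X grandmaster, O noisy": (-2.374, -2.281, -2.270),
    "both noisy, 1,296 trials": (-1.246, -1.226, -1.242),
    "both noisy, 5,184 trials": (-1.271, -1.277, -1.249),
}


def spread(equities):
    """Gap between the highest and lowest equity among the candidate moves."""
    return round(max(equities) - min(equities), 3)


for label, equities in rollouts.items():
    print(f"{label}: {spread(equities):.3f}")
```

The gap only measures how far apart the moves are, not which one is ranked first.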

In conclusion, this "clinical rollouts test" nicely explains
how some players like me can defy the biased bot play
and incur huge ER/PR's but still win because the equity
differences, calculated/estimated by the bots, between
the ranked moves are inaccurate, exaggerated, unrealistic.

Whether candidate moves drift further apart or closer
together, endlessly longer and longer rollouts create a
false impression of increased confidence and precision.

Very often, when it looks like picking the "best" move
would make a big difference, it may not matter much
at all. Conversely, what appears to be a huge blunder
may not even be an error at all.

For the ones who can't escape conditioned faith, living
in a fantasy world can become their reality. I'm sure you
all would deny and wish to refute what I demonstrated
here and I would be very curious to see how you might
go about doing or avoiding that...

MK

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
