• BG rollouts should be done like placebo-controlled clinical trials

    From MK@21:1/5 to All on Wed Apr 5 17:26:35 2023
    Warning! Reading this article may cause disillusionment.
    Proceed at your own risk. It is quite long but if you take
    my word for my rollout results, you can skip the tedious
    numbers and just read my comments following each set.

    For years I have argued that training BG AI bots through
    cubeless 1-point games and then extrapolating cubeful
    equities and match equities through the applications of
    fanciful formulas and MET's built from recursive/circular
    statistics inject biases and systematic errors that result
    in inaccuracies of unknown magnitudes.

    Consequently, the way rollouts are currently done causes
    even the slightest inaccuracies to compound and accumulate
    and become mistakenly significant, while going unnoticed
    by the unsuspecting bot believers.

    The following is a demonstration of this using the sample
    position from another thread titled "Slot or not?":

    https://groups.google.com/g/rec.games.backgammon/c/tH5CbO_-8c4/m/4j8zw9kSAQAJ

    XGID=-dBBB-BD-----A--bb-cbB-b--:1:-1:1:42:0:0:0:0:10

    The "placebo" in my "clinical rollout trials" would be the
    "random play" but we can't even begin to think about
    testing using Ex-Gee because it offers no such features.

    Luckily, Noo-BG allows separate player strength settings
    including a "noise" setting that may come close enough
    to random play here as an initial step towards our goal.

    Note: I think non-deterministic noise would be closer to
    random play but using it in any settings combinations
    causes Noo-BG to crash each and every single time. So,
    I had to settle for deterministic noise in my rollouts.

    All rollouts done for 1296 trials, with all general options
    unchecked except for "Bearoff Truncation" and "Variance
    reduction", with all player options unchecked except for
    "Deterministic noise = 1" in both columns as appropriate.

    All stats below are in order of Win%, Win(g)%, Win(bg)%,
    Lose(g)%, Lose(bg)%, Cubeless(eq) and Cubeful(eq),
    with equity differences between moves in parentheses.
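
    For anyone who wants to check the numbers below with a
    script, here is a minimal sketch of a reader for the row
    format just described. The function name parse_row is
    hypothetical, and the sketch assumes rows look exactly
    like "move == n | n | ... | (diff)"; it does not handle
    the "[1,296]" trial-count prefixes used further down.

```python
def parse_row(line):
    """Split one result row into (move, numbers, difference).

    The trailing parenthesised value, when present, is the equity
    difference against the first move; otherwise it is None.
    """
    move, _, rest = line.partition("==")
    fields = [field.strip() for field in rest.split("|")]
    difference = None
    if fields and fields[-1].startswith("("):
        difference = float(fields.pop().strip("()"))
    return move.strip(), [float(field) for field in fields], difference


if __name__ == "__main__":
    row = "13/11 7/3 == 74.9 | 36.5 | 4.5 | 5.1 | 0.3 | +0.853 | +0.750 | (-0.206)"
    print(parse_row(row))
```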

    With our stage set, let's now start by establishing some
    base lines for the top 3 moves in the sample position.

    GNUbg ID: xs4GADy22QMBBg:QQkKAAAAAAAE

    First, let's ask for a hint at grandmaster (3-ply) level:

    13/9 7/5 == 80.9 | 42.3 | 4.4 | 19.1 | 4.4 | 0.2 | +0.974
    13/11 7/3 == 75.8 | 36.8 | 4.1 | 24.2 | 4.7 | 0.1 | +0.795 | (-0.179)
    13/7 == 76.2 | 36.1 | 3.8 | 23.8 | 4.7 | 0.1 | +0.794 | (-0.179)

    Next, let's do a cubeless rollout of the top 3 moves:

    13/9 7/5 == 80.4 | 42.5 | 4.5 | 4.7 | 0.2 | +1.030
    13/11 7/3 == 74.8 | 36.2 | 4.2 | 5.0 | 0.2 | +0.848 | (-0.182)
    13/7 == 75.2 | 34.6 | 3.8 | 4.9 | 0.2 | +0.837 | (-0.193)
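
    As a cross-check, the cubeless equity column can be
    recomputed from the five percentages with the usual
    money-game formula. This is only a sketch: it assumes the
    gammon percentages include backgammons, which the rows
    above are consistent with, and the function name
    cubeless_equity is hypothetical.

```python
def cubeless_equity(win, win_g, win_bg, lose_g, lose_bg):
    """Cubeless money equity from outcome percentages (0 to 100).

    Assumes win_g and lose_g already include backgammons, so each
    gammon adds one extra point and each backgammon one more.
    """
    lose = 100.0 - win
    points = (win - lose) + (win_g - lose_g) + (win_bg - lose_bg)
    return points / 100.0


# Percentages and reported equities copied from the cubeless rollout above.
ROWS = {
    "13/9 7/5": (80.4, 42.5, 4.5, 4.7, 0.2, 1.030),
    "13/11 7/3": (74.8, 36.2, 4.2, 5.0, 0.2, 0.848),
    "13/7": (75.2, 34.6, 3.8, 4.9, 0.2, 0.837),
}

if __name__ == "__main__":
    for move, (*percentages, reported) in ROWS.items():
        recomputed = cubeless_equity(*percentages)
        print(f"{move:10s} recomputed {recomputed:+.3f}  reported {reported:+.3f}")
```

    The recomputed values agree with the reported column to
    within the rounding of the one-decimal percentages.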

    Then, let's do a cubeful rollout of the same 3 moves:

    13/9 7/5 == 80.4 | 42.8 | 4.7 | 4.7 | 0.2 | +1.035 | +0.956
    13/11 7/3 == 74.9 | 36.5 | 4.5 | 5.1 | 0.3 | +0.853 | +0.750 | (-0.206)
    13/7 == 75.3 | 35.1 | 4.0 | 4.9 | 0.2 | +0.845 | +0.743 | (-0.213)

    At this point, we can already see that equity differences
    between the moves get bigger in the cubeless rollout vs.
    the hint and even more so in the cubeful rollout because
    with both sides set to high skill levels, checker and cube
    inaccuracies compound and accumulate at higher rates,
    by "echoing" (being duplicated) on both sides.

    Playing the above 3 moves results in these 3 positions:

    13/9 7/5 == GNUbg ID: trUTAAbGzgYAPA:AQEKAAAAAAAE
    13/11 7/3 == GNUbg ID: drNDAAbGzgYAPA:AQEKAAAAAAAE
    13/7 == GNUbg ID: ttkHAAbGzgYAPA:AQEKAAAAAAAE

    At this point, we are ready to do our "clinical rollouts"
    for each position, with now the player "O" being on roll,
    but before the dice roll. (I did the rollouts separately and
    grouped them by type of rollout for ease of comparing.)

    Again, let's first do a cubeful rollout with both players
    set to grandmaster, to establish a second base line:

    13/9 7/5 == 19.7 | 4.5 | 0.2 | 42.9 | 4.5 | -1.034 | -0.955
    13/11 7/3 == 25.4 | 5.2 | 0.2 | 36.5 | 4.5 | -0.849 | -0.744 | (+0.211)
    13/7 == 24.9 | 5.0 | 0.2 | 35.1 | 4.2 | -0.843 | -0.741 | (+0.214)

    The results are very close to the above cubeful rollout
    and only slightly different perhaps because we made a
    move and are one less turn away from the end of game.

    Next, let's do a cubeful rollout with "O" as grandmaster
    (3-ply) and "X" as expert (0-ply) with maximum noise:

    13/9 7/5 == 74.3 | 36.6 | 4.1 | 6.8 | 0.6 | +0.817 | +1.436
    13/11 7/3 == 77.0 | 34.8 | 4.1 | 4.8 | 0.2 | +0.979 | +1.535 | (-0.099)
    13/7 == 75.9 | 35.7 | 3.5 | 6.2 | 0.5 | +0.845 | +1.486 | (-0.050)

    (Interestingly here 13/7 is better than 13/11 7/3)

    Then, let's do a cubeful rollout with "X" as grandmaster
    (3-ply) and "O" as expert (0-ply) with maximum noise:

    13/9 7/5 == 0.0 | 0.0 | 0.0 | 92.8 | 43.0 | -2.372 | -2.374
    13/11 7/3 == 0.8 | 0.2 | 0.0 | 91.2 | 38.9 | -2.283 | -2.281 | (+0.093)
    13/7 == 0.0 | 0.0 | 0.0 | 87.8 | 39.4 | -2.272 | -2.270 | (+0.104)

    At this point, we can see that equity differences between
    the moves instead get smaller when either side is set to
    play close to randomly because the checker and the cube
    inaccuracies don't "echo" but compound and accumulate
    only on one side.

    Let's continue, by again doing first a cubeless rollout with
    both sides as grandmaster, to establish another base line:

    13/9 7/5 == 19.6 | 4.7 | 0.2 | 42.5 | 4.5 | -1.030
    13/11 7/3 == 25.2 | 5.0 | 0.2 | 36.2 | 4.2 | -0.848 | (+0.182)
    13/7 == 24.8 | 4.9 | 0.2 | 34.6 | 3.8 | -0.837 | (+0.193)

    Cubeless equities here are similar to cubeless equities in
    the above "clinical cubeful rollout" but equity differences
    are yet a little smaller because checker inaccuracies don't
    get compounded with cube inaccuracies.

    Next, let's do a cubeless rollout with "O" as grandmaster
    (3-ply) and "X" as expert (0-ply) with maximum noise:

    13/9 7/5 == 79.8 | 48.0 | 6.6 | 6.2 | 0.6 | +1.075
    13/11 7/3 == 82.7 | 45.9 | 6.7 | 5.4 | 0.6 | +1.120 | (-0.045)
    13/7 == 82.9 | 47.5 | 5.5 | 5.7 | 0.6 | +1.125 | (-0.050)

    Then, let's do a cubeless rollout with "X" as grandmaster
    (3-ply) and "O" as expert (0-ply) with maximum noise:

    13/9 7/5 == 0.0 | 0.0 | 0.0 | 93.1 | 42.9 | -2.375
    13/11 7/3 == 0.1 | 0.0 | 0.0 | 91.5 | 38.9 | -2.302 | (+0.073)
    13/7 == 0.0 | 0.0 | 0.0 | 91.3 | 35.8 | -2.277 | (+0.098)

    At this point, we can again see that when either side plays
    close to randomly, equity differences between the moves
    get even smaller (although not by as much), compared to
    the corresponding "clinical cubeless rollouts", because not
    only do checker and cube inaccuracies not compound, but
    checker inaccuracies also don't echo and accumulate only
    on one side.

    Let's complete our test by doing two more rollouts with
    both players set to expert (0-ply) with maximum noise,
    i.e. both playing almost randomly.

    First, cubeful for 1,296 and 5,184 trials to double-stitch:

    [1,296] 13/9 7/5 == 13.0 | 0.0 | 0.1 | 43.1 | 7.7 | -1.249 | -1.246
    [5,184] 13/9 7/5 == 12.0 | 0.0 | 0.1 | 43.6 | 7.7 | -1.275 | -1.271
    [1,296] 13/11 7/3 == 11.0 | 0.2 | 0.0 | 40.7 | 4.9 | -1.233 | -1.226 | (+0.020)
    [5,184] 13/11 7/3 == 10.4 | 0.0 | 0.0 | 43.3 | 5.9 | -1.285 | -1.277 | (-0.006)
    [1,296] 13/7 == 11.5 | 0.0 | 0.1 | 41.0 | 6.2 | -1.247 | -1.242 | (+0.004)
    [5,184] 13/7 == 11.1 | 0.0 | 0.0 | 41.3 | 6.3 | -1.257 | -1.249 | (+0.022)

    (Here 13/7 is better than 13/11 7/3 in 1,296 trials and
    13/11 7/3 is better than even 13/9 7/5 in 5,184 trials)

    Second, cubeless again for 1,296 as well as 5,184 trials:

    [1,296] 13/9 7/5 == 13.0 | 1.6 | 0.1 | 42.6 | 7.6 | -1.224
    [5,184] 13/9 7/5 == 11.8 | 1.3 | 0.1 | 43.0 | 7.7 | -1.256
    [1,296] 13/11 7/3 == 11.2 | 0.7 | 0.0 | 39.8 | 4.9 | -1.217 | (+0.007)
    [5,184] 13/11 7/3 == 10.7 | 0.6 | 0.0 | 42.5 | 5.9 | -1.264 | (-0.008)
    [1,296] 13/7 == 12.1 | 0.7 | 0.1 | 40.2 | 6.2 | -1.215 | (+0.009)
    [5,184] 13/7 == 11.4 | 0.7 | 0.0 | 40.6 | 6.3 | -1.232 | (+0.024)

    (Here 13/11 7/3 is better than 13/9 7/5 in 5,184 trials)
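
    The ranking changes noted in the two parentheticals above
    can be reproduced with a short script. This is a sketch
    using only the equities listed in these two rollouts; the
    names rank_for_x and spread are hypothetical. Since "O"
    is on roll, the equities are from O's side, so the move
    that leaves O the lowest equity is the best one for "X".

```python
# Equities for "O" on roll, copied from the both-sides-noisy rollouts above.
CUBEFUL = {
    1296: {"13/9 7/5": -1.246, "13/11 7/3": -1.226, "13/7": -1.242},
    5184: {"13/9 7/5": -1.271, "13/11 7/3": -1.277, "13/7": -1.249},
}
CUBELESS = {
    1296: {"13/9 7/5": -1.224, "13/11 7/3": -1.217, "13/7": -1.215},
    5184: {"13/9 7/5": -1.256, "13/11 7/3": -1.264, "13/7": -1.232},
}


def rank_for_x(equities_for_o):
    """Order X's moves best to worst: a lower O equity is better for X."""
    return sorted(equities_for_o, key=equities_for_o.get)


def spread(equities_for_o):
    """Gap between the best and worst of the candidate moves."""
    values = list(equities_for_o.values())
    return round(max(values) - min(values), 3)


if __name__ == "__main__":
    for label, table in (("cubeful", CUBEFUL), ("cubeless", CUBELESS)):
        for trials, equities in table.items():
            order = " > ".join(rank_for_x(equities))
            print(f"{label:8s} {trials:5d} trials: {order}  (spread {spread(equities):.3f})")
```

    In both tables 13/9 7/5 ranks first at 1,296 trials and
    13/11 7/3 ranks first at 5,184 trials.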

    What we see here is that when both sides make random
    checker and random cube decisions, there is practically
    no difference between cubeful and cubeless rollouts.

    However, what's important is that the equity differences
    between the 3 moves have now become the smallest in
    this test because there are no accumulated inaccuracies
    resulting from human bias injected through MET's, fart-ass
    formulas, etc. In fact, the 3 moves become so close that
    their rankings change in different ways.

    In conclusion, this "clinical rollouts test" nicely explains
    how some players like me can defy the biased bot play
    and incur huge ER/PR's but still win because the equity
    differences, calculated/estimated by the bots, between
    the ranked moves are inaccurate, exaggerated, unrealistic.

    Whether candidate moves drift further apart or nearer,
    endlessly longer and longer rollouts create a false
    impression of increased confidence and precision.

    Very often, when it looks like picking the "best" move
    would make a big difference, it may not matter much
    at all. Conversely, what appears to be a huge blunder
    may not even be an error at all.

    For the ones who can't escape conditioned faith, living
    in a fantasy world can become their reality. I'm sure you
    all would deny and wish to refute what I demonstrated
    here and I would be very curious to see how you might
    go about doing or avoiding that...

    MK
