Tim has spoken earlier about the need to be statistically careful
before reaching conclusions in backgammon. A particular
example of this is that people can naively see unusual runs
of the dice and assert non-randomness without testing.
So, surely we should apply similar statistical discipline to the
idea that XG has a specific problem with blotty boards.
First, let's assume that the concept of a blotty-board position can be well-defined. This might mean that the board is already blotty or it
might mean that blots can be created in the inner board.
At a minimum, this assertion should be supported by some evidence for the following:
1) XG tends to lose more equity in blotty-board positions than in other non-contact positions.
2) XG loses more equity by choosing unnecessarily blotty plays than
by missing opportunities to correctly leave inner board blots.
3) There is an identifiable category of blotty-board positions where XG's play is worse than the best humans. [I would rigorously define "not good" play by a bot as being below best-human standard, but that might be idiosyncratic].
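Point 1, for example, admits a standard test. Here is a minimal sketch of a permutation test, assuming one had per-decision equity-loss samples from rollouts (the numbers below are invented placeholders, not real XG data):

```python
import random

random.seed(0)

# Mean equity lost per decision (invented illustration numbers, not
# real rollout data).
blotty_losses = [0.012, 0.031, 0.005, 0.044, 0.019, 0.027, 0.008, 0.036]
other_losses  = [0.009, 0.014, 0.003, 0.021, 0.011, 0.006, 0.017, 0.010]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(blotty_losses) - mean(other_losses)

# Permutation test: under the null hypothesis that the "blotty" label
# doesn't matter, random relabelings of the pooled samples should
# produce a difference at least as large as the observed one
# reasonably often.
pooled = blotty_losses + other_losses
n = len(blotty_losses)
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
        count += 1

p_value = count / trials
print(f"observed difference: {observed:.4f}, p-value: {p_value:.3f}")
```

With real data one would of course need far more than eight positions per group, but the shape of the test would be the same.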
I don't see Tim looking into any of the above three points.
At the moment, what Tim seems to be doing is noting what good
statistical practice is, and then doing exactly the opposite. He seems
to be doing this:
1) Cherry-pick positions where XG surprisingly leaves blots.
2) Roll these out.
3) Report it whenever XG has made an error.
4) Ignore the situation whenever XG has been correct [this is admittedly
a guess].
5) Naively assert that he has discovered a problem.
Paul
On 10/27/2022 4:02 AM, peps...@gmail.com wrote:
Tim has spoken earlier about the need to be statistically careful
before reaching conclusions in backgammon. A particular
example of this is that people can naively see unusual runs
of the dice and assert non-randomness without testing.
So, surely we should apply similar statistical discipline to the
idea that XG has a specific problem with blotty boards.

One could do that, of course. But for comparison, let's look at
how serious backgammon players try to lower their PR. What do they
do? Do they apply rigorous statistical discipline as you suggest?
No. They run their games through a bot, examine what the bot says
are errors, try to understand them, and then try to adjust their
play accordingly. I don't know of anyone who applies rigorous
statistical procedures to determine whether (for example) they should
step out to the bar point more often if they want to lower their PR. Nevertheless, by all accounts, this non-rigorous procedure seems to
work.
You could take the point of view that what's going on is that people
are fooling themselves. Maybe their PR isn't actually getting better,
or maybe it's getting better for reasons that have nothing to do with
the patterns they think they have discerned. Maybe their PR would
decrease even faster if they were to *not* consult XG at all. I don't
find these hypotheses plausible, but I'm also not going to invest any
time trying to support or refute them using rigorous statistical
methodology.
The observation about XG and blotty boards is similar. If you pay
some attention, you'll see the pattern for yourself, just as when I
pay attention to XG's evaluations of my play, I notice that I cash
when TG far too often. Can I show you a rigorous statistical experiment
that proves that I cash when TG too often? No. Do I care that I have
no such experiment? No.
To be clear, I'm not saying that XG's blotty-board tendencies lose a
lot of equity. In fact, I would say they typically don't, and it's
precisely *because* they usually don't matter much that XG does this
sort of thing. I have a folder with over 400 positions I've collected
where XG makes errors I found interesting, and not very many of these
are blotty-board errors, because after I learned that they usually
don't cost a lot of equity, I mostly stopped collecting them. What
I'm doing by posting to r.g.b. is offering some free advice that if
XGR+ dings you with a 0.057 "error" for making a natural play instead
of its nutty 5/1 board-breaking play, then you should take it with a
grain of salt. If you prefer to ignore the advice until you see
statistical proof, you're of course free to do so.
No, I don't want to ignore your advice. I think the blotty play you allude to
might be a bit less significant than you think it is, but, of course, I'm not telepathic
and don't know your exact thoughts.
On 10/27/2022 8:32 AM, peps...@gmail.com wrote:
No, I don't want to ignore your advice. I think the blotty play you allude to
might be a bit less significant than you think it is, but, of course, I'm not telepathic
and don't know your exact thoughts.

As a point of clarification, there are two different types
of "errors" by XG that we might care about (and by "errors"
I mean a play for which XG's verdict is different depending
on whether you roll it out or use a lower-strength evaluation
or truncated rollout, and where we assume that the rollout
with the strongest settings is "correct"). For simplicity,
let's say that there are just two plausible candidate plays,
A and B, and let's use "XGR+" to refer to the weaker setting.
Without loss of generality, assume that the rollout favors
play A and XGR+ favors play B.
1. The rollout says the equity difference is large.
2. XGR+ says the equity difference is large but the rollout
says the equity difference is small.
(The remaining possibility, that both settings say that the
equity difference is small, I don't really care about.)
People mostly pay attention to 1. This is understandable
if your goal is to assess how well XGR+ plays; in case 2,
the error that XGR+ makes is small, so who cares? But note
that increasingly, computers are being used to *assess human
play*. The BMAB awards titles based on how XG rates your play.
People privately use XG to identify their own errors, usually
focusing on cases where XG says they made a big error. If XG
is playing this kind of role, then case 2 matters just as much
as case 1.
I think that the importance of case 2 errors has been
underestimated or even ignored, because people fail to grasp
the difference between XGR+ as a player and XGR+ as a judge.
So part of the reason I posted that position was to give an
example of a case 2 error. The smallness of the rollout equity
difference was therefore a feature and not a bug.
---
Tim Chow
Could a solution to all this be to find a bg expert who has access
to extremely powerful supercomputers? I would think there are
powerful technologies that could use parallel processing to do
all rollouts instantly for an entire match.
With there being a significant intersection between highly skilled maths/computing people and bg people, I would have thought finding
such a person could be feasible.
Or is all this harder than I think it is?
On 10/27/2022 9:43 AM, peps...@gmail.com wrote:
Could a solution to all this be to find a bg expert who has access
to extremely powerful supercomputers? I would think there are
powerful technologies that could use parallel processing to do
all rollouts instantly for an entire match.
With there being a significant intersection between highly skilled maths/computing people and bg people, I would have thought finding
such a person could be feasible.
Or is all this harder than I think it is?

First of all, my impression is that the BMAB runs on a shoestring
budget, so what is theoretically possible may not be doable in
practice. I recall that one thing Stick wanted was for players to
be able to mark some plays in advance for rolling out (so that he
wouldn't be penalized for plays which he knew would be misevaluated
by XGR+), but the BMAB does not do this. I don't know the reasons,
but my guess is that it would require too much work for them to
accommodate Stick's request.
The other thing is that I don't think XG is designed to run on a
cloud. I'm also not sure there's an easy way to get it to roll
out every last candidate for every decision. For starters, it
has a built-in limit of 32 checker-play candidates for each move.
(Maybe that's enough for BMAB purposes, though.) Probably one
would have to pay Xavier to do some development work to enable
what you're suggesting, and again, presumably the BMAB doesn't
think this is an effective use of whatever limited money it has.
I really like Stick's idea on the rollout (which I can remember from
previous threads). One possible objection (which I don't share) is that
it creates confusion to combine the task of competing with the post-competition
evaluation. However, there is a neat precedent for this.
On 10/28/2022 10:50 AM, peps...@gmail.com wrote:
I really like Stick's idea on the rollout (which I can remember from
previous threads). One possible objection (which I don't share) is that
it creates confusion to combine the task of competing with the post-competition
evaluation. However, there is a neat precedent for this.

Again, I don't know the true objections, but I know that if I were
in charge, what I would dread most would be the logistics.
How do people submit their candidates, how do you do this in a
standardized manner, how do you take elementary precautions against
people trying to cheat by secretly checking a bot before submitting
their candidates, how do you handle lost records, how do you settle
disputes, etc. It's just a nightmare. I'm sure that even with the
current relatively simple system, irregularities occur with some
frequency and cause more headaches than the BMAB would like. In the
end, it's going to make only a small difference for a small number
of people. Not much bang for the buck from the BMAB's point of view.
---
Tim Chow
For example, if an acclaimed mathematician (say someone with a
postdoctoral position at Harvard) claimed that they tried the most
recent Putnam exam by themselves without cheating, and scored 100%,
don't you think people would believe the mathematician rather
than suspect that they're just lying and actually googled the
solutions?
On 10/29/2022 4:51 AM, peps...@gmail.com wrote:
For example, if an acclaimed mathematician (say someone with a
postdoctoral position at Harvard) claimed that they tried the most
recent Putnam exam by themselves without cheating, and scored 100%,
don't you think people would believe the mathematician rather
than suspect that they're just lying and actually googled the
solutions?

Some people would believe it and others would not. People cheat all the
time, sometimes for seemingly inexplicable reasons. And even if they
don't cheat, others will suspect them of cheating.
Years ago, Iancho Hristov was informally tracking various people's
PR's, based on available data from online and in-person recorded
matches. Stick was at the top of Iancho's list, based on his online
play. At one point, Stick's PR, averaged over 50 matches, was 2.0.
Many people were convinced that Stick was cheating, presumably by
consulting a bot at crucial moments. (For the record, at the time,
I was one of the few people posting to BGO who was saying that I
didn't believe that Stick was cheating.) They were saying he should demonstrate a 2.0 PR in live play, or submit to live proctoring.
On another occasion, Neil Kazaross was playing an on-line match
against the rest of BGO and was achieving very low PR's. Again,
there were accusations that he was consulting a bot at crucial moments.
You must have heard about the current hullabaloo over cheating in
chess at high levels. Some people, it seems, cheat *more* when nothing
is at stake; others cheat more when the stakes are higher.
Having said all that, I do think there is one potential benefit to your suggestion, which is that it would probably generate a lot of debate
about whether cheating was going on, and the extra publicity would
probably be good for the BMAB. There's nothing like a good controversy
to get people to pay attention to something they would otherwise have
no interest in.
For statistical purposes, 50 matches might not be all that significant.
A PR of 3.0 is very believable for a world-class player and this might be obtained with long stretches
averaging 4.0 and long stretches averaging 2.0. Finding a specific sample with a PR of 2.0 might
be cherry-picking. So yes, there is no evidence of cheating in your post.
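A toy simulation illustrates the point, under an invented noise model (alternating 100-match stretches with mean PRs of 4.0 and 2.0, per-match noise added on top):

```python
# Toy model: a player whose long-run PR averages about 3.0, produced
# by alternating 100-match stretches averaging 4.0 and 2.0. Some
# 50-match windows then sit near (or below) 2.0, so quoting one such
# window in isolation is cherry-picking. The noise model is invented
# purely for illustration.
import random

random.seed(1)

matches = []
for phase in range(10):                      # 10 stretches of 100 matches
    mean_pr = 4.0 if phase % 2 == 0 else 2.0
    matches += [random.gauss(mean_pr, 0.8) for _ in range(100)]

window = 50
window_means = [
    sum(matches[i:i + window]) / window
    for i in range(len(matches) - window + 1)
]

overall = sum(matches) / len(matches)
best = min(window_means)
print(f"overall PR: {overall:.2f}")
print(f"cherry-picked best 50-match window: {best:.2f}")
```

The overall average comes out near 3.0, while the best 50-match window sits near 2.0, which is exactly the cherry-picking worry.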
As far as I know, the main hullabaloo in chess related to this is over cheating accusations, rather than
actual cheating. The evidence against Niemann is just pitifully weak.
Of course, he could have cheated anyway, despite there being no evidence, but that is obviously
not a fruitful or fair line of discourse.
On 10/29/2022 6:31 PM, peps...@gmail.com wrote:
For statistical purposes, 50 matches might not be all that significant.
A PR of 3.0 is very believable for a world-class player and this might be obtained with long stretches
averaging 4.0 and long stretches averaging 2.0. Finding a specific sample with a PR of 2.0 might
be cherry-picking. So yes, there is no evidence of cheating in your post.

I think that 50 matches is significant.
I'm not a sub-4.0 PR player, but it's not uncommon for me to play 10 consecutive 7-point matches where my overall PR is sub-4.0. But I
don't think I've ever managed to play 50 consecutive 7-point matches
with an overall PR under 4.0. I think the best I've managed is around
4.2 or maybe 4.1-something.
During the "Stick cheats" controversy, the debate wasn't about whether
50 matches was statistically significant. The debate was about how
much of a boost you get from playing online in the comfort of your
home, with the pip count conveniently displayed at all times. Stick
wasn't claiming at the time that he would be able to consistently
play a 2.0 in live play. Several other well-known players acknowledged
that favorable playing conditions would help somewhat, but didn't
believe that it was enough to fully explain Stick's 2.0 performance.
They suggested that a proctor pay Stick a visit at his home and observe
him playing online to confirm that he wasn't secretly consulting a bot,
but Stick said that the presence of a proctor would disturb his concentration. No "resolution" was ever reached, AFAIK.
As far as I know, the main hullabaloo in chess related to this is over cheating accusations, rather than
actual cheating. The evidence against Niemann is just pitifully weak.
Of course, he could have cheated anyway, despite there being no evidence, but that is obviously
not a fruitful or fair line of discourse.

Well, there's a distinction between online cheating and OTB cheating.
Niemann admitted to cheating online, and chess.com and Ken Regan say
they have no statistical evidence that Niemann has cheated OTB. So
far so good, but chess.com says that Niemann cheated a lot more online
than Niemann said he did. Do you think that chess.com's 70-page report
about Niemann's alleged online cheating is "pitifully weak"?
---
Tim Chow
On Sunday, October 30, 2022 at 4:13:10 AM UTC, Tim Chow wrote:
Do you think that chess.com's 70-page report
about Niemann's alleged online cheating is "pitifully weak"?

My phrase "evidence against Niemann" refers to the OTB allegations.
I think if all his OTB accusers can do is point to the online evidence, then their
case is pitifully weak. I haven't seen the chess.com report.
Re bg, an important question is whether it constitutes "cheating" to consult match equity tables. Furthermore, does it constitute cheating to take out
a pen and paper, and write down the computations rather than do them in your head?
I think that now (almost) everyone would say that all these behaviours constitute "cheating".
But I don't think that has always been the case.
Paul
https://www.chess.com/blog/CHESScom/hans-niemann-report
I find their evidence for Niemann's online cheating pretty strong.
Tim Chow
On Sunday, 30 October 2022 at 12:39:14 UTC, Tim Chow wrote:
https://www.chess.com/blog/CHESScom/hans-niemann-report
I find their evidence for Niemann's online cheating pretty strong.
Tim Chow
Your problem is that Carlsen and chess.com are staring down the barrels of a $100m libel lawsuit and "neither Carlsen nor Chess.com produced concrete evidence for their cheating accusations"
https://www.bbc.co.uk/news/world-us-canada-63338375
This isn't going to end nicely, I suspect Carlsen will be financially ruined unless he comes up with a lot more than suspicions.
Your problem is that Carlsen and chess.com are staring down the barrels of a $100m libel lawsuit and "neither Carlsen nor Chess.com produced concrete evidence for their cheating accusations"
So this is Tim's problem? I didn't know that Tim was acting as a guarantor for the pay awards.
There's no way Niemann will win this lawsuit, and I'm sure he knows
it. He's just making a political statement, and maybe hoping he'll
bluff one of the defendants into settling out of court.
---
Tim Chow
The burden of proof in a libel case lies solely with the person making the claim that led to the case being brought. Niemann doesn't have to do anything.
LOL. You don't know anything about law, do you?
---
Tim Chow
From my (aborted) law school days
On 11/3/2022 1:19 PM, Nasti Chestikov wrote:
From my (aborted) law school days
Did you go to law school in the U.S.? Libel law in the U.S. is
rather different from that of many other countries because of the
First Amendment.
The burden of proof is on the plaintiff to prove that the defendant's
claim is false. You don't need to go to law school to know that this
is the way things work in the U.S.