Forum: >>> Magnum BBS <<<

Spell checking identifiers

From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Jun 24 01:38:11 2020

Dear c.compilers,

While experimenting with Rust, I came across this suggestion.

--> foo.rs:5:9
|
5 | return j; // the variable, not the type.
| ^ help: a local variable with a similar name exists: `i`

Here it is suggesting i where I typed j. This is the same problem as
spell checking identifiers with fuzzy matching, so apologies for a po- tentially misleading subject.

So, without going through the source of rustc to find out, I'm curious
about what general techniques people use to make this work? In particu-
lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

I've come across a survey of fuzzy matching algorithms, some of which
work with dictionaries but I have no idea which data structures would
be appropriate in a compiler, nor do I know what criteria I'd use to
choose an appropriate algorithm from such a survey.

As an added bonus, the same technique can of course be used to spell
check identifiers against a natural language dictionary. But since
such a dictionary is more static than the list of identifiers in the
current source file, a precomputed database will work, and a more
expensive indexing method can be used. Is there an indexing method
that works for this, but would not be appropriate for fuzzy matching
against identifiers?

[Apologies for not responding to my other topic yet, I should be able
to reply soon.]

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk
[There's a vast amount of work on edit distance. My guess is they
use something like Levenshtein, but rather than use a constant
distance of 1 between different letters, the distance varies depending
on how different the letters look. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Jun 24 03:56:56 2020

[There's a vast amount of work on edit distance. My guess is they
use something like Levenshtein, but rather than use a constant
distance of 1 between different letters, the distance varies depending
on how different the letters look. -John]

This clang blog specifically mentions Levenshtein,

http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

and it looks like what people do is to go through the entire symbol
table and compute it against the individual erroneous identifier.

I thought that'd be a bit on the expensive side, because C++ files
can have 100k+ (or millions?) of lines after preprocessing, so one
translation unit really can go up to million identifiers in practice.
[I don't know if that actually happens but I don't think it's safe
to assume it doesn't.]

In the 10 years since, people may have changed from standard Levenshtein
as you mention.

But then, maybe compilation speed for erroneous input isn't really
important. rustc is slow for a short input file in both cases [which
could be the startup cost.]

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@u.washington.edu@21:1/5 to Johann 'Myrkraverk' Oskarsson on Tue Jun 23 16:51:38 2020

On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:

(snip)

This clang blog specifically mentions Levenshtein,

http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

and it looks like what people do is to go through the entire symbol
table and compute it against the individual erroneous identifier.

I thought that'd be a bit on the expensive side,

With either constant weighting or character dependent weighting
it is easy to do with dynamic programming. The time is then O(m n)
where m and n are the two lengths.

It seems most obvious to do only variable that are in the appropriate
scope to be misspelled, but I suspect catching variables used out
of scope is also worth doing. Well, in the latter case, you could
hope that they at least spell them the same.

I think you should turn it off for one character names, though,
even though I suspect those are more likely. Too many false
positives!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Derek M. Jones@21:1/5 to All on Wed Jun 24 11:02:21 2020

Johann,

So, without going through the source of rustc to find out, I'm curious
about what general techniques people use to make this work? In particu-
lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

More than you probably wanted to know: http://www.coding-guidelines.com/cbook/sent792.pdf

--
Derek M. Jones
blog:shape-of-code.coding-guidelines.com

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kaz Kylheku@21:1/5 to Johann 'Myrkraverk' Oskarsson on Wed Jun 24 18:12:35 2020

On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:

Dear c.compilers,

While experimenting with Rust, I came across this suggestion.

--> foo.rs:5:9
|
5 | return j; // the variable, not the type.
| ^ help: a local variable with a similar name exists: `i`

Here it is suggesting i where I typed j. This is the same problem as
spell checking identifiers with fuzzy matching, so apologies for a po- tentially misleading subject.

I don't find these kinds of childish diagnostic messages useful at all.

They have started to appear in GCC also.

The good old "undeclared identifier `j`" requires no update, thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gautier_niouzes@hotmail.com@21:1/5 to All on Wed Jun 24 13:08:39 2020

On Tuesday, June 23, 2020 at 8:40:38 PM UTC+2, Johann 'Myrkraverk' Oskarsson

So, without going through the source of rustc to find out, I'm curious
about what general techniques people use to make this work? In particu-
lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

GCC's GNAT front-end for Ada does such checks (search for "GNAT.Spelling_Checker").
The source code (at least versions I've found on the Web) does not refer to a specific algorithm.

A few examples of detection:

hac-parser-expressions.adb:129:35: "Symst" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

hac-parser-expressions.adb:129:35: "Sym_set" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

hac-parser-expressions.adb:129:35: "Sysmet" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Kaz Kylheku on Wed Jun 24 20:08:03 2020

Kaz Kylheku <937-053-0959@kylheku.com> schrieb:

On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:

5 | return j; // the variable, not the type.
| ^ help: a local variable with a similar name exists: `i`

I don't find these kinds of childish diagnostic messages useful at all.

They have started to appear in GCC also.

The good old "undeclared identifier `j`" requires no update, thanks.

At least gcc doesn't do the suggestion in that particular case:

foo.c: In function 'foo':
foo.c:8:7: error: 'j' undeclared (first use in this function)
8 | x[j] = a[i] > b[i];
| ^
foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@u.washington.edu@21:1/5 to Derek M. Jones on Wed Jun 24 18:28:02 2020

On Wednesday, June 24, 2020 at 11:45:44 AM UTC-7, Derek M. Jones wrote:

(snip)

More than you probably wanted to know: http://www.coding-guidelines.com/cbook/sent792.pdf

Interesting.

It seems that the compiler could also suggest when different names
are confusingly similar, other than spelling, or otherwise possibly
confusing to readers. Maybe by how they sound when pronounced.

You might not want J1 and J_one in the same program.

Reminds me that the IBM OS/360 Fortran H compiler, instead of using
a hash table, uses six balanced trees. IBM suggests in one manual
that for faster compilation, one distribute names among the
six lengths. Back in the time when compilation time was more
important than readability.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Johann 'Myrkraverk' Oskarsson@21:1/5 to Thomas Koenig on Thu Jun 25 21:44:45 2020

On 25/06/2020 4:08 am, Thomas Koenig wrote:

Kaz Kylheku <937-053-0959@kylheku.com> schrieb:

On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:

5 | return j; // the variable, not the type.
| ^ help: a local variable with a similar name exists: `i`

I don't find these kinds of childish diagnostic messages useful at all.

They have started to appear in GCC also.

The good old "undeclared identifier `j`" requires no update, thanks.

At least gcc doesn't do the suggestion in that particular case:

foo.c: In function 'foo':
foo.c:8:7: error: 'j' undeclared (first use in this function)
8 | x[j] = a[i] > b[i];
| ^
foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in

A friend showed me the ghc code for this, and it deliberatly (according
to comments) excludes single variable names. I don't understand enough
Haskell to tell exactly what the code is doing, but they seem to be
using the Damerau-Levenshtein distance algorithm based on identifier
names.

Now, whether this is useful is of course debatable, and I would
personally like to give people a choice to either turn it on, or
off. We can argue about defaults if/when I have a published
compiler.

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Johann 'Myrkraverk' Oskarsson@21:1/5 to gah4@u.washington.edu on Thu Jun 25 22:33:37 2020

On 24/06/2020 7:51 am, gah4@u.washington.edu wrote:

On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:

(snip)

This clang blog specifically mentions Levenshtein,

http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

and it looks like what people do is to go through the entire symbol
table and compute it against the individual erroneous identifier.

I thought that'd be a bit on the expensive side,

With either constant weighting or character dependent weighting
it is easy to do with dynamic programming. The time is then O(m n)
where m and n are the two lengths.

Are you talking about doing this one by one through the entire symbol
table?

It seems most obvious to do only variable that are in the appropriate
scope to be misspelled, but I suspect catching variables used out
of scope is also worth doing. Well, in the latter case, you could
hope that they at least spell them the same.

Depending on context, one would also want to do this for type names (as
per the blog above). Depending on the language* and culture**, there
can be thousands of type names in scope.

I think you should turn it off for one character names, though,
even though I suspect those are more likely. Too many false
positives!

rustc obviously does this for one character names, at least in the
case for i and j. I don't know if it's useful to compare a and k.

* C++ and Java come to mind.

** Programming culture, some of them have a name such as Agile, and
eXtreme Programming; others don't have a name.

--
Johann | email: invalid -> com | www.myrkraverk.com/blog/
I'm not from the Internet, I just work there. | twitter: @myrkraverk

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From mac@21:1/5 to All on Thu Jul 9 16:07:45 2020

You might not want J1 and J_one in the same program.

Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but required the same capitalization everywhere

This was much more important in the days of batch systems, where compilers tried to "correct" your code, because it would be a day before you got
another chance. And because student programs couldn't do any damage.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to mac on Fri Jul 10 07:12:27 2020

mac <acolvin@efunct.com> schrieb:

You might not want J1 and J_one in the same program.

Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but required the same capitalization everywhere

Fortran has a a bit of a similar issue with its C interoperability
feature.

Entities with C binding have global identifiers in Fortran. Fortran
is a case-insensitive laguage, so FooBar and foobar look the same
to Fortran, and you can not have a C binding to both (but either
one would work).

In practice, this should not be a big problem.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gah4@u.washington.edu@21:1/5 to Thomas Koenig on Fri Jul 10 13:17:59 2020

On Friday, July 10, 2020 at 8:23:37 AM UTC-7, Thomas Koenig wrote:

(snip)

Fortran has a a bit of a similar issue with its C interoperability
feature.

Entities with C binding have global identifiers in Fortran. Fortran
is a case-insensitive laguage, so FooBar and foobar look the same
to Fortran, and you can not have a C binding to both (but either
one would work).

Much of this was pretty strange before Fortran added the
C interoperability feature. Mostly, Fortran compilers did what
they did, and the C names had to match. That included adding
underscore characters sometimes.

Now Fortran has BIND(C), and more specifically BIND(C, NAME='cname')
where you specify the name as it will be known to C. That name is
case sensitive (at least if the underlying linker supports it),
and might be completely unrelated to the Fortran name.

BIND(C) without NAME= generates lower case, which is usual for C.

I believe IBM PL/I compilers have had the ability to call other
(presumably IBM) languages, with specific keywords. But mostly that
was when the IBM linker was (supposed to be) upper case only.

VAX/VMS, and I suspect other VMS, has a system, defined along
with VAX, for calling routines with other argument passing
conventions, with %VAL(), %REF(), and %DESCR(), or call by value,
reference, and descriptor. I am not sure what they do about case,
but many VMS names have $ in them, so that is allowed.

[Back in the day, IBM PL/I could call Fortran and COBOL but the
procedure names are all eight EBCDIC characters. The extra stuff
involves mapping datatypes. The current PL/I manuals say it
handles many different character encodings like UTF-16. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	293
Nodes:	16 (2 / 14)
Uptime:	231:57:31
Calls:	6,624
Files:	12,171
Messages:	5,319,447

Spell checking identifiers

Who's Online

System Info