• Spell checking identifiers

    From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Jun 24 01:38:11 2020
    Dear c.compilers,

    While experimenting with Rust, I came across this suggestion.

    --> foo.rs:5:9
    |
    5 | return j; // the variable, not the type.
    | ^ help: a local variable with a similar name exists: `i`

    Here it is suggesting i where I typed j. This is the same problem as
    spell checking identifiers with fuzzy matching, so apologies for a po- tentially misleading subject.

    So, without going through the source of rustc to find out, I'm curious
    about what general techniques people use to make this work? In particu-
    lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

    I've come across a survey of fuzzy matching algorithms, some of which
    work with dictionaries but I have no idea which data structures would
    be appropriate in a compiler, nor do I know what criteria I'd use to
    choose an appropriate algorithm from such a survey.

    As an added bonus, the same technique can of course be used to spell
    check identifiers against a natural language dictionary. But since
    such a dictionary is more static than the list of identifiers in the
    current source file, a precomputed database will work, and a more
    expensive indexing method can be used. Is there an indexing method
    that works for this, but would not be appropriate for fuzzy matching
    against identifiers?

    [Apologies for not responding to my other topic yet, I should be able
    to reply soon.]

    --
    Johann | email: invalid -> com | www.myrkraverk.com/blog/
    I'm not from the Internet, I just work there. | twitter: @myrkraverk
    [There's a vast amount of work on edit distance. My guess is they
    use something like Levenshtein, but rather than use a constant
    distance of 1 between different letters, the distance varies depending
    on how different the letters look. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Johann 'Myrkraverk' Oskarsson@21:1/5 to All on Wed Jun 24 03:56:56 2020
    [There's a vast amount of work on edit distance.  My guess is they
    use something like Levenshtein, but rather than use a constant
    distance of 1 between different letters, the distance varies depending
    on how different the letters look. -John]

    This clang blog specifically mentions Levenshtein,


    http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

    and it looks like what people do is to go through the entire symbol
    table and compute it against the individual erroneous identifier.

    I thought that'd be a bit on the expensive side, because C++ files
    can have 100k+ (or millions?) of lines after preprocessing, so one
    translation unit really can go up to million identifiers in practice.
    [I don't know if that actually happens but I don't think it's safe
    to assume it doesn't.]

    In the 10 years since, people may have changed from standard Levenshtein
    as you mention.

    But then, maybe compilation speed for erroneous input isn't really
    important. rustc is slow for a short input file in both cases [which
    could be the startup cost.]

    --
    Johann | email: invalid -> com | www.myrkraverk.com/blog/
    I'm not from the Internet, I just work there. | twitter: @myrkraverk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@u.washington.edu@21:1/5 to Johann 'Myrkraverk' Oskarsson on Tue Jun 23 16:51:38 2020
    On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:

    (snip)

    This clang blog specifically mentions Levenshtein,

    http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

    and it looks like what people do is to go through the entire symbol
    table and compute it against the individual erroneous identifier.

    I thought that'd be a bit on the expensive side,

    With either constant weighting or character dependent weighting
    it is easy to do with dynamic programming. The time is then O(m n)
    where m and n are the two lengths.

    It seems most obvious to do only variable that are in the appropriate
    scope to be misspelled, but I suspect catching variables used out
    of scope is also worth doing. Well, in the latter case, you could
    hope that they at least spell them the same.

    I think you should turn it off for one character names, though,
    even though I suspect those are more likely. Too many false
    positives!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Derek M. Jones@21:1/5 to All on Wed Jun 24 11:02:21 2020
    Johann,

    So, without going through the source of rustc to find out, I'm curious
    about what general techniques people use to make this work?  In particu-
    lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

    More than you probably wanted to know: http://www.coding-guidelines.com/cbook/sent792.pdf

    --
    Derek M. Jones
    blog:shape-of-code.coding-guidelines.com

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Johann 'Myrkraverk' Oskarsson on Wed Jun 24 18:12:35 2020
    On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
    Dear c.compilers,

    While experimenting with Rust, I came across this suggestion.

    --> foo.rs:5:9
    |
    5 | return j; // the variable, not the type.
    | ^ help: a local variable with a similar name exists: `i`

    Here it is suggesting i where I typed j. This is the same problem as
    spell checking identifiers with fuzzy matching, so apologies for a po- tentially misleading subject.

    I don't find these kinds of childish diagnostic messages useful at all.

    They have started to appear in GCC also.

    The good old "undeclared identifier `j`" requires no update, thanks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gautier_niouzes@hotmail.com@21:1/5 to All on Wed Jun 24 13:08:39 2020
    On Tuesday, June 23, 2020 at 8:40:38 PM UTC+2, Johann 'Myrkraverk' Oskarsson

    So, without going through the source of rustc to find out, I'm curious
    about what general techniques people use to make this work? In particu-
    lar the Damerau–Levenshtein distance algorithm is not appropriate for dictionary lookups, as far as I know.

    GCC's GNAT front-end for Ada does such checks (search for "GNAT.Spelling_Checker").
    The source code (at least versions I've found on the Web) does not refer to a specific algorithm.

    A few examples of detection:

    hac-parser-expressions.adb:129:35: "Symst" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

    hac-parser-expressions.adb:129:35: "Sym_set" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

    hac-parser-expressions.adb:129:35: "Sysmet" is undefined hac-parser-expressions.adb:129:35: possible misspelling of "Symset"

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to Kaz Kylheku on Wed Jun 24 20:08:03 2020
    Kaz Kylheku <937-053-0959@kylheku.com> schrieb:
    On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
    5 | return j; // the variable, not the type.
    | ^ help: a local variable with a similar name exists: `i`

    I don't find these kinds of childish diagnostic messages useful at all.

    They have started to appear in GCC also.

    The good old "undeclared identifier `j`" requires no update, thanks.

    At least gcc doesn't do the suggestion in that particular case:

    foo.c: In function 'foo':
    foo.c:8:7: error: 'j' undeclared (first use in this function)
    8 | x[j] = a[i] > b[i];
    | ^
    foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@u.washington.edu@21:1/5 to Derek M. Jones on Wed Jun 24 18:28:02 2020
    On Wednesday, June 24, 2020 at 11:45:44 AM UTC-7, Derek M. Jones wrote:

    (snip)

    More than you probably wanted to know: http://www.coding-guidelines.com/cbook/sent792.pdf

    Interesting.

    It seems that the compiler could also suggest when different names
    are confusingly similar, other than spelling, or otherwise possibly
    confusing to readers. Maybe by how they sound when pronounced.

    You might not want J1 and J_one in the same program.

    Reminds me that the IBM OS/360 Fortran H compiler, instead of using
    a hash table, uses six balanced trees. IBM suggests in one manual
    that for faster compilation, one distribute names among the
    six lengths. Back in the time when compilation time was more
    important than readability.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Johann 'Myrkraverk' Oskarsson@21:1/5 to Thomas Koenig on Thu Jun 25 21:44:45 2020
    On 25/06/2020 4:08 am, Thomas Koenig wrote:
    Kaz Kylheku <937-053-0959@kylheku.com> schrieb:
    On 2020-06-23, Johann 'Myrkraverk' Oskarsson <johann@myrkraverk.invalid> wrote:
    5 | return j; // the variable, not the type.
    | ^ help: a local variable with a similar name exists: `i`

    I don't find these kinds of childish diagnostic messages useful at all.

    They have started to appear in GCC also.

    The good old "undeclared identifier `j`" requires no update, thanks.

    At least gcc doesn't do the suggestion in that particular case:

    foo.c: In function 'foo':
    foo.c:8:7: error: 'j' undeclared (first use in this function)
    8 | x[j] = a[i] > b[i];
    | ^
    foo.c:8:7: note: each undeclared identifier is reported only once for each function it appears in


    A friend showed me the ghc code for this, and it deliberatly (according
    to comments) excludes single variable names. I don't understand enough
    Haskell to tell exactly what the code is doing, but they seem to be
    using the Damerau-Levenshtein distance algorithm based on identifier
    names.

    Now, whether this is useful is of course debatable, and I would
    personally like to give people a choice to either turn it on, or
    off. We can argue about defaults if/when I have a published
    compiler.

    --
    Johann | email: invalid -> com | www.myrkraverk.com/blog/
    I'm not from the Internet, I just work there. | twitter: @myrkraverk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Johann 'Myrkraverk' Oskarsson@21:1/5 to gah4@u.washington.edu on Thu Jun 25 22:33:37 2020
    On 24/06/2020 7:51 am, gah4@u.washington.edu wrote:
    On Tuesday, June 23, 2020 at 12:59:35 PM UTC-7, Johann 'Myrkraverk' Oskarsson wrote:

    (snip)

    This clang blog specifically mentions Levenshtein,

    http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker

    and it looks like what people do is to go through the entire symbol
    table and compute it against the individual erroneous identifier.

    I thought that'd be a bit on the expensive side,

    With either constant weighting or character dependent weighting
    it is easy to do with dynamic programming. The time is then O(m n)
    where m and n are the two lengths.

    Are you talking about doing this one by one through the entire symbol
    table?

    It seems most obvious to do only variable that are in the appropriate
    scope to be misspelled, but I suspect catching variables used out
    of scope is also worth doing. Well, in the latter case, you could
    hope that they at least spell them the same.

    Depending on context, one would also want to do this for type names (as
    per the blog above). Depending on the language* and culture**, there
    can be thousands of type names in scope.

    I think you should turn it off for one character names, though,
    even though I suspect those are more likely. Too many false
    positives!

    rustc obviously does this for one character names, at least in the
    case for i and j. I don't know if it's useful to compare a and k.

    * C++ and Java come to mind.

    ** Programming culture, some of them have a name such as Agile, and
    eXtreme Programming; others don't have a name.

    --
    Johann | email: invalid -> com | www.myrkraverk.com/blog/
    I'm not from the Internet, I just work there. | twitter: @myrkraverk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From mac@21:1/5 to All on Thu Jul 9 16:07:45 2020
    You might not want J1 and J_one in the same program.

    Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but required the same capitalization everywhere

    This was much more important in the days of batch systems, where compilers tried to "correct" your code, because it would be a day before you got
    another chance. And because student programs couldn't do any damage.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to mac on Fri Jul 10 07:12:27 2020
    mac <acolvin@efunct.com> schrieb:
    You might not want J1 and J_one in the same program.

    Similarly, some ancient compiler (Euclid?) had case-insensitive lookup, but required the same capitalization everywhere

    Fortran has a a bit of a similar issue with its C interoperability
    feature.

    Entities with C binding have global identifiers in Fortran. Fortran
    is a case-insensitive laguage, so FooBar and foobar look the same
    to Fortran, and you can not have a C binding to both (but either
    one would work).

    In practice, this should not be a big problem.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@u.washington.edu@21:1/5 to Thomas Koenig on Fri Jul 10 13:17:59 2020
    On Friday, July 10, 2020 at 8:23:37 AM UTC-7, Thomas Koenig wrote:

    (snip)

    Fortran has a a bit of a similar issue with its C interoperability
    feature.

    Entities with C binding have global identifiers in Fortran. Fortran
    is a case-insensitive laguage, so FooBar and foobar look the same
    to Fortran, and you can not have a C binding to both (but either
    one would work).

    Much of this was pretty strange before Fortran added the
    C interoperability feature. Mostly, Fortran compilers did what
    they did, and the C names had to match. That included adding
    underscore characters sometimes.

    Now Fortran has BIND(C), and more specifically BIND(C, NAME='cname')
    where you specify the name as it will be known to C. That name is
    case sensitive (at least if the underlying linker supports it),
    and might be completely unrelated to the Fortran name.

    BIND(C) without NAME= generates lower case, which is usual for C.

    I believe IBM PL/I compilers have had the ability to call other
    (presumably IBM) languages, with specific keywords. But mostly that
    was when the IBM linker was (supposed to be) upper case only.

    VAX/VMS, and I suspect other VMS, has a system, defined along
    with VAX, for calling routines with other argument passing
    conventions, with %VAL(), %REF(), and %DESCR(), or call by value,
    reference, and descriptor. I am not sure what they do about case,
    but many VMS names have $ in them, so that is allowed.

    [Back in the day, IBM PL/I could call Fortran and COBOL but the
    procedure names are all eight EBCDIC characters. The extra stuff
    involves mapping datatypes. The current PL/I manuals say it
    handles many different character encodings like UTF-16. -John]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)