• State-of-the-art algorithms for lexical analysis?

    From Roger L Costello@21:1/5 to All on Sun Jun 5 20:53:47 2022
    Hi Folks,

    Is there a list of algorithms used in lexical analysis?

    Are regular expressions still the best way to specify tokens?

    Is creating a Finite Automaton for regular expressions the state of the art?

    What is the state-of-the-art algorithm for generating a Finite Automaton?

    What is the state-of-the-art algorithm for finding holes in the set of regex patterns?

    What are the state-of-the-art algorithms for lexical analysis?

    If you were to build a lexer-generator tool today, in 2022, what state-of-the-art algorithms would you use?

    /Roger
    [I doubt it. Yes. If you mean a DFA, yes. Same as it was 40 years ago. ... -John]

  • From gah4@21:1/5 to Roger L Costello on Sun Jun 5 16:05:38 2022
    On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:

    (snip)

    > Are regular expressions still the best way to specify tokens?

    Some years ago, I worked with a company that sold hardware
    search processors to a certain three-letter agency that we are not
    supposed to mention, but everyone knows.

    It has a completely different PSL, the Pattern Specification Language,
    which is much more powerful than the usual regular expressions.

    Both standard and extended regular expressions are nice, in that we
    get used to using them, especially with grep, without thinking too
    much about them.

    I suspect, though, that if they hadn't previously been defined, we
    might come up with something different today.

    Among other things, PSL has the ability to define approximate matches,
    such as a word with one or more misspellings, that is, insertions,
    deletions, or substitutions. Ordinary REs don't have that ability.

    There are also PSL expressions for ranges of numbers.
    You can often do that with a very complicated RE, considering
    all of the possibilities. PSL handles those possibilities
    automatically. (Some can expand to complicated code.)

    I suspect that in many cases the usual REs are not optimal for
    lexical analysis, aside from being well known.

    But as noted, DFAs are likely the best way to implement them.

    Though that could change with changes in computer hardware.

  • From Roger L Costello@21:1/5 to All on Mon Jun 6 10:48:24 2022
    gah4 wrote:

    > Pattern Specification Language (PSL) is
    > much more powerful than the usual
    > regular expression.

    Neat!

    > I suspect that if regexes hadn't previously
    > been defined, we might come up with
    > something different today.

    Wow! That is a remarkable statement.

    I will look into PSL. There are algorithms for converting regexes to DFA and then using the DFA to tokenize the input. Are there algorithms for converting PSL to (what?) and then using the (what?) to tokenize the input?

    /Roger

  • From Hans-Peter Diettrich@21:1/5 to All on Mon Jun 6 08:59:47 2022
    On 6/6/22 1:05 AM, gah4 wrote:

    > It has a completely different PSL, the Pattern Specification Language,
    > which is much more powerful than the usual regular expressions.

    I wonder about the need for powerful patterns in programming languages.
    Most items (operators, punctuators, keywords) are fixed literals with a
    fixed ID for use by the parser and code generator. If source code is
    written by humans then the remaining types (identifiers, literals,
    comments) should not have overly complicated syntax. For machine-generated
    source code a lexer is not required; the generator can produce the tokens
    for the parser directly. And if humans are meant to understand the code
    the generator produces, then again the syntax has to be as simple and as
    easy to understand as possible.


    > Among other things, PSL has the ability to define approximate matches,
    > such as a word with one or more misspellings, that is, insertions,
    > deletions, or substitutions. Ordinary REs don't have that ability.

    That's fine for keywords but does not help with user-defined
    identifiers. Still a nice-to-have feature :-)

    > There are also PSL expressions for ranges of numbers.
    > You can often do that with a very complicated RE, considering
    > all of the possibilities. PSL handles those possibilities
    > automatically. (Some can expand to complicated code.)

    I wonder whether this feature is really helpful to the user.


    > I suspect that in many cases the usual REs are not optimal for
    > lexical analysis, aside from being well known.

    > But as noted, DFAs are likely the best way to implement them.

    ACK

    > Though that could change with changes in computer hardware.

    Or with the style of writing. APL already tried to simplify typing; in
    the near future a Chinese programming language with a glyph for each
    token (except literals) could eliminate the need for a lexer. Then a
    demand may arise for speech-to-text tools, and their reverse, instead
    of a lexer, for each natural language.

    DoDi
    [Regular expressions have the advantage that once you've paid the one-time
    cost of making a DFA, the matching is extremely fast. The lexer is usually
    one of the slowest parts of a compiler, since it is the only part that has
    to look at each character of the source program, so this is a place where
    speed matters. Anyone know how fast PSLs are? I've seen fuzzy matchers but
    they haven't been very fast. -John]
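
    A rough illustration of the point about matching speed: once the DFA
    exists, the scanner's inner loop is just a table lookup per character.
    The Python sketch below is purely illustrative (the two-token automaton
    for identifiers and integers, and all names, are invented; nothing here
    comes from the thread):

        ERR = -1
        # transition[state][char_class] -> next state
        # char classes: 0 = letter, 1 = digit, 2 = anything else
        TRANS = [
            [1, 2, ERR],    # state 0: start
            [1, 1, ERR],    # state 1: in identifier (accepting)
            [ERR, 2, ERR],  # state 2: in integer (accepting)
        ]
        ACCEPT = {1: "IDENT", 2: "INT"}

        def char_class(c):
            return 0 if c.isalpha() else 1 if c.isdigit() else 2

        def next_token(text, pos):
            """Maximal-munch scan from pos; returns (kind, lexeme, new_pos)."""
            state, last_accept, i = 0, None, pos
            while i < len(text):
                state = TRANS[state][char_class(text[i])]
                if state == ERR:
                    break
                i += 1
                if state in ACCEPT:
                    last_accept = (ACCEPT[state], i)
            if last_accept is None:
                raise SyntaxError(f"bad character at position {pos}")
            kind, end = last_accept
            return kind, text[pos:end], end

        print(next_token("count42 7", 0))   # ('IDENT', 'count42', 7)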

  • From Kaz Kylheku@21:1/5 to gah4@u.washington.edu on Mon Jun 6 16:00:37 2022
    On 2022-06-05, gah4 <gah4@u.washington.edu> wrote:
    > On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:

    > (snip)

    >> Are regular expressions still the best way to specify tokens?

    > Some years ago, I worked with a company that sold hardware
    > search processors to a certain three-letter agency that we are not
    > supposed to mention, but everyone knows.

    > It has a completely different PSL, the Pattern Specification Language,
    > which is much more powerful than the usual regular expressions.

    > Both standard and extended regular expressions are nice, in that we
    > get used to using them, especially with grep, without thinking too
    > much about them.

    > I suspect, though, that if they hadn't previously been defined, we
    > might come up with something different today.

    Whether or not regexes are defined:

    - we would still have the concept of a machine with a finite number
    of states.

    - the result would hold that a machine with a finite number of states
    can only recognize certain sets of strings (what we call "regular
    languages"), and that those sets can be infinite.

    - the observation would still be made that those sets of strings have
    certain features, like expressing certain kinds of repetitions,
    but not other repetitive patterns such as:
      - N open parentheses followed by N closing ones, for arbitrary N.

    - obvious compressed notations would suggest themselves for expressing
    the features of those sets.

    - someone would dedicate him or herself toward finding the minimal set
    of useful operations in the notation which can capture all such
    sets (e.g. the same process by which we know that ? and + are not
    necessary if we have the Kleene * and branching, because
    A+ is just AA*, and A? is (A|)). The Kleene star and branching would
    surely be rediscovered.

    We would end up with regex under a different name, using different
    notations: maybe some other symbol instead of star, perhaps in
    a different position, like prefix instead of suffix, or whatever.

    > Among other things, PSL has the ability to define approximate matches,
    > such as a word with one or more misspellings, that is, insertions,
    > deletions, or substitutions. Ordinary REs don't have that ability.

    This may be great for some exploratory programming, but doesn't do much
    when you're writing a compiler for a very specific, defined language.

    Programmers misspell not only the fixed tokens of a language, but also program-defined identifiers like function names, variables, and types.

    Today, when a C compiler says "undeclared identifier `pintf`, did you
    mean `printf`?", this is not based on some misspelling support in the
    lexical analyzer, and could not reasonably be. First the error is
    identified in the ordinary way, and then some algorithm that is entirely external to parsing is applied to the symbol tables to find identifiers
    similar to the undeclared one.

    > There are also PSL expressions for ranges of numbers.
    > You can often do that with a very complicated RE, considering
    > all of the possibilities. PSL handles those possibilities
    > automatically. (Some can expand to complicated code.)

    But ranges of numbers are regular sets. You can have a macro operator
    embedded in a regex language which generates that same code.

    For instance for matching the decimal strings 27 to 993, there is a
    regex, and a way of calculating that regex.

    We know there is a regex because the set of strings { "27", "28", ..., "993" }
    is a regular set, being finite. We can form a regex just by combining
    those elements with a | branch operator.

    We can do something which condenses some of the redundancy like:

    9(9(|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7
    |6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8
    |7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9
    |8|7|6|5|4|3|2|1|0))|8(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)
    |7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|
    0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|
    1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|7(9(|9|8|7|6|5|4
    |3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5
    |4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6
    |5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7
    |6|5|4|3|2|1|0))|6(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|
    9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4
    (|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)
    |1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|5(9(|9|8|7|6|5|4|3|2
    |1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3
    |2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4
    |3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5
    |4|3|2|1|0))|4(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|
    7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1|0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|
    8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2|1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|
    9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3|2|1|0))|3(9(|9|8|7|6|5|4|3|2|1|0
    )|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|5|4|3|2|1|0)|6(|9|8|7|6|5|4|3|2|1
    |0)|5(|9|8|7|6|5|4|3|2|1|0)|4(|9|8|7|6|5|4|3|2|1|0)|3(|9|8|7|6|5|4|3|2
    |1|0)|2(|9|8|7|6|5|4|3|2|1|0)|1(|9|8|7|6|5|4|3|2|1|0)|0(|9|8|7|6|5|4|3
    |2|1|0))|2(9(|9|8|7|6|5|4|3|2|1|0)|8(|9|8|7|6|5|4|3|2|1|0)|7(|9|8|7|6|
    5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5|4|3|2|1|0)|4(9|8|7|6|5
    |4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|4|3|2|1|0)|1(9|8|7|6|5|
    4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))|1(9(9|8|7|6|5|4|3|2|1|0)|8(9|8|7|6|
    5|4|3|2|1|0)|7(9|8|7|6|5|4|3|2|1|0)|6(9|8|7|6|5|4|3|2|1|0)|5(9|8|7|6|5
    |4|3|2|1|0)|4(9|8|7|6|5|4|3|2|1|0)|3(9|8|7|6|5|4|3|2|1|0)|2(9|8|7|6|5|
    4|3|2|1|0)|1(9|8|7|6|5|4|3|2|1|0)|0(9|8|7|6|5|4|3|2|1|0))

    where we can better notate sequences like (9|8|7|6|5|4|3|2|1|0) as
    [0-9].

    What I did there was to turn these strings into a trie, and then just
    transliterate that trie into regex syntax.

    (The digits appear in reverse because the trie implementation I'm using
    relies on hash tables, and hash tables don't have a specified order; the
    actual order observed is an artifact of the hashing function. In modern
    systems that function can be perturbed by a randomly generated key to
    thwart hash table attacks.)

    Anyway, that sort of thing being what it is, the mechanism for
    generating it can be readily embedded as syntactic sugar in a
    regex language, without making it non-regular in any way.
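
    To make the trie-to-regex transliteration concrete, here is a small
    sketch of the same idea in Python (illustrative only; the post's own
    version used hash tables and different tooling, and the function names
    here are invented):

        def trie_insert(trie, s):
            node = trie
            for ch in s:
                node = node.setdefault(ch, {})
            node[""] = {}                    # mark "a string may end here"

        def trie_to_regex(node):
            if not node:
                return ""
            parts = []
            for ch in sorted(node):          # "" sorts first: the empty branch
                parts.append(ch + trie_to_regex(node[ch]) if ch else "")
            return parts[0] if len(parts) == 1 else "(" + "|".join(parts) + ")"

        def range_regex(lo, hi):
            trie = {}
            for n in range(lo, hi + 1):
                trie_insert(trie, str(n))
            return trie_to_regex(trie)

        import re
        pat = re.compile(range_regex(27, 993) + r"$")
        assert pat.match("27") and pat.match("500") and pat.match("993")
        assert not pat.match("26") and not pat.match("994")

    The output is equivalent to the hand-expanded alternation above, just
    generated mechanically, and it stays a plain regular expression.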

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    [To put it another way, the set of strings you can recognize with an
    NFA or DFA is the same as the set of strings you can describe with a regex.
    A DFA is such an obvious thing that we would have reverse engineered
    regexes from them if Ken Thompson hadn't done it the other way. -John]

  • From gah4@21:1/5 to Roger L Costello on Mon Jun 6 10:03:55 2022
    On Monday, June 6, 2022 at 8:06:28 AM UTC-7, Roger L Costello wrote:

    (snip)

    > I will look into PSL. There are algorithms for converting regexes to DFA
    > and then using the DFA to tokenize the input. Are there algorithms for
    > converting PSL to (what?) and then using the (what?) to tokenize the input?

    The approximate searches are done using dynamic programming.
    The penalty is 1 for insertion, deletion, or substitution, and the score
    is kept in 3 bits, so up to six spelling errors can be handled.
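
    For readers who haven't seen that technique, here is a minimal software
    sketch of the dynamic program being described -- unit cost per insertion,
    deletion, or substitution, with the score saturating the way a 3-bit
    field would. This is illustrative Python, not the FDF hardware:

        def capped_edit_distance(pattern, word, cap=6):
            """Levenshtein distance, saturating at cap + 1 (like a 3-bit score)."""
            prev = list(range(len(word) + 1))
            for i, pc in enumerate(pattern, start=1):
                curr = [i]
                for j, wc in enumerate(word, start=1):
                    cost = 0 if pc == wc else 1
                    curr.append(min(prev[j] + 1,          # deletion
                                    curr[j - 1] + 1,      # insertion
                                    prev[j - 1] + cost))  # substitution / match
                prev = [min(v, cap + 1) for v in curr]    # saturate the score
            return prev[-1]

        def approx_match(pattern, word, max_errors=2):
            return capped_edit_distance(pattern, word) <= max_errors

        print(approx_match("receive", "recieve"))   # True  (two edits)
        print(approx_match("receive", "rxcxixe"))   # False (three edits)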

    The whole query is then compiled into code for a systolic array,
    which then runs as fast as the data comes off disk.

    FDF2 is a 9U VME board that runs in a VME based Sun system.

    FDF3 connects directly to a SCSI disk, and also to a Sun workstation.
    In searching, it transfers directly from the disk. To load data into
    the disk, the disk is accessed indirectly through the FDF3.
    It is a desktop box, about the size of a large external SCSI disk.

    Some of it is described here:

    https://aclanthology.org/X93-1011.pdf

    along with its use for searching Japanese text, and:

    https://trec.nist.gov/pubs/trec3/papers/paper.ps.gz

  • From Christopher F Clark@21:1/5 to All on Mon Jun 6 20:11:47 2022
    Is this the PSL to which you refer?
    (Common Pattern Specification Language)

    https://aclanthology.org/X98-1004.pdf

    Or is it something else with a similar name? Is there a reference on
    its specification?


    --
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris

  • From Christopher F Clark@21:1/5 to All on Mon Jun 6 21:16:26 2022
    As our moderator wisely states:

    > Regular expressions have the advantage that once you've paid the one-time
    > cost of making a DFA, the matching is extremely fast. The lexer is usually
    > one of the slowest parts of a compiler, since it is the only part that has
    > to look at each character of the source program, so this is a place where
    > speed matters.

    And, for most cases they really are sufficient, and it really behooves
    one to stay within those limits. Why? Because when you get a syntax
    error at the lexical level, which is surprisingly frequent unless you
    never mistype closing quotes, you get whole sections of your code
    misparsed, and rarely does the compiler error correction help much.
    Other single-character errors (a , where a . was meant, a missing or
    extra ( { [ or ] } ), or a stray ;) have similarly disastrous effects
    on program meaning, often not detected until much later.

    And, as I mentioned before, having the lexer be simply a scanner and
    putting any extra semantics into a separate screener (per Frank
    DeRemer's recommendation) makes it all much simpler. You end up with
    small state machines with very few states that easily fit in even
    small machine caches, or can be turned into circuitry, FPGAs or ASICs
    that use minimal numbers of gates. Those things can often run as fast
    as you can read the text in. And the screener, being invoked much less
    frequently, can do more complex things without imposing a significant
    penalty. The screener is essentially running at parser speed and only
    looking at "long" tokens, not single (or double) character ones.

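    A sketch of that scanner/screener split, in Python for illustration
    (the toy token syntax, keyword set, and names are all invented; the
    regex below just stands in for the small scanner DFA):

        import re

        # Scanner: classify characters into raw lexemes (toy token syntax).
        SCAN = re.compile(
            r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[(){};=+*-]))")

        KEYWORDS = {"if", "else", "while", "return"}
        symbol_table = {}

        def screen(kind, lexeme):
            """Screener: keyword recognition and identifier interning,
            run once per token rather than once per character."""
            if kind == "word":
                if lexeme in KEYWORDS:
                    return ("KEYWORD", lexeme)
                return ("IDENT", symbol_table.setdefault(lexeme, len(symbol_table)))
            return (kind.upper(), lexeme)

        def tokens(text):
            pos = 0
            while pos < len(text):
                m = SCAN.match(text, pos)
                if not m:
                    break              # end of input (or an unrecognized character)
                pos = m.end()
                yield screen(m.lastgroup, m.group(m.lastgroup))

        print(list(tokens("while (x) { x = x + 1; }")))
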
    And sadly, you cannot go very much faster. Too often the transitions
    occur at single-character boundaries. One is lucky when it is a
    two-character sequence, and longer sequences terminating a token are
    rare enough to be in the measurement noise. I know because I tried to
    adapt the Boyer-Moore ideas once (skip and reverse) and found that
    they were essentially ineffective for tokenization. They might apply
    occasionally in parsing, but that's not as much of a performance hog.

    Unless you are interested in dealing with nested comments or something
    similar, you don't need a stack in your lexer, and so there is no reason
    to do LL or LR parsing. (Yes, we extended our Yacc++ lexer to do LR
    parsing, but with special casing so that the stack cost was only there
    if you had recursive productions, and it only tracked the start of the
    recursive production, so that you were staying in DFA mode essentially
    all the time. And, while that helped us in a few cases, it isn't
    something I would say was important nor recommend.) The only place I
    might have found it interesting is if we made it recognize tokens inside
    of strings or comments for use in error correction, to help with the
    missing close character cases. That might have made it worthwhile. But
    that would probably have needed to be done only in the presence of
    syntax errors with a string or comment in the recent context.

    In fact, there is only one thing that I have not seen a DFA lexer do that
    I think is worth doing at the lexical level (and not via a screener). That
    is recognizing tokens that start with a length prefix, e.g. 10Habcdefghij.
    Such tokens are common in things like network protocols and they would be
    relatively easy to implement, but I've not seen it done.

    Beyond that, it is my relatively firm belief that one should almost always
    have only simple regular expressions, e.g. that the one for floating-point
    numbers should be one of the most complex ones. Otherwise you are trying
    to do too much in the scanner. And you are asking for trouble when you do.
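
    For a sense of scale, here is roughly what that "most complex" token can
    look like -- an illustrative C-like floating-point literal written as a
    Python regex, not any particular language's exact definition:

        import re

        FLOAT = re.compile(r"""
              (?: \d+ \. \d* | \. \d+ )     # digits with a decimal point
              (?: [eE] [+-]? \d+ )?         # optional exponent
            | \d+ [eE] [+-]? \d+            # or digits with a required exponent
        """, re.VERBOSE)

        for s in ("3.14", ".5e-3", "1e10", "42"):
            print(s, bool(FLOAT.fullmatch(s)))   # True, True, True, False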

    Kind regards,
    Chris

  • From gah4@21:1/5 to All on Mon Jun 6 12:25:56 2022
    On Monday, June 6, 2022 at 8:06:28 AM UTC-7, Roger L Costello wrote:

    (snip, I wrote)
    >> I suspect that if regexes hadn't previously
    >> been defined, we might come up with
    >> something different today.

    > Wow! That is a remarkable statement.

    Well, mostly, regexes were defined based on what was reasonable to do on
    computers at the time. It seems reasonable, then, with the more powerful
    computers of today, to expect that more features would have been added.

    Some of that was done in the later ERE, Extended Regular Expression.

    But there is a strong tendency not to break backward compatibility,
    and so not to add new features later.
    [See my note about DFAs a few messages back. EREs are just syntactic
    sugar on regular REs so sure. PCREs are swell but they are a lot
    slower since backreferences mean you need to be able to back up.
    -John]

  • From Hans-Peter Diettrich@21:1/5 to Christopher F Clark on Tue Jun 7 06:52:45 2022
    On 6/6/22 8:16 PM, Christopher F Clark wrote:

    > In fact, there is only one thing that I have not seen a DFA lexer do that
    > I think is worth doing at the lexical level (and not via a screener). That
    > is recognizing tokens that start with a length prefix, e.g. 10Habcdefghij.
    > Such tokens are common in things like network protocols and they would be
    > relatively easy to implement, but I've not seen it done.

    I'm not sure what you mean. The nnH syntax has to be included in the
    general number syntax (like 0x... or nnE...).

    Or do you mean a token built from the next nn input characters? In that
    case both a lower and an upper bound would be interesting, e.g. for
    (recognized) identifier length or for distinguishing Unicode codepoint
    formats.

    > Beyond that, it is my relatively firm belief that one should almost always
    > have only simple regular expressions, e.g. that the one for floating-point
    > numbers should be one of the most complex ones. Otherwise you are trying
    > to do too much in the scanner. And you are asking for trouble when you do.

    ACK

    DoDi
    [I believe he means Fortran style Hollerith strings, where the number says
    how many characters are in the following string. The number is just a count, not semantically a number in the language. DFAs can't do that other than by enumerating every possible length. -John]

  • From Christopher F Clark@21:1/5 to All on Tue Jun 7 19:40:11 2022
    Yes, as our moderator explained. I was talking about things like
    FORTRAN Hollerith strings, but more importantly network packets, where
    they give the size of the "field" within a packet and then you simply
    take that many characters (or bytes or bits or some other quanta) as
    the "token". This is quite important for parsing "binary" data. And
    sometimes the numbers are text, like I showed, but in many protocols the
    numbers are "binary", e.g. something like

    \xAHabcdefghij, where \xA is a single 8-bit character (octet) whose
    bits are "0000 1010" (or maybe four 8-bit characters -- 4 octets --
    that represent a 32-bit integer).

    And, as our moderator pointed out, this makes a terrible regular
    expression, NFA, or DFA, but it is actually quite easy in nearly any
    programming language. You read the length in, convert it to an integer,
    and then loop reading that many characters from the input and call
    that a "token".
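
    A minimal sketch of that hand-coded rule for the text form (nn followed
    by H and then nn characters); the binary-length variant differs only in
    how the count is decoded. This is illustrative Python with invented
    names, not code from the thread:

        def read_counted_token(text, pos):
            """Read nnH<nn characters> starting at pos; return (lexeme, new_pos)."""
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            if pos == start or pos >= len(text) or text[pos] != "H":
                raise SyntaxError(f"expected a counted token at position {start}")
            count = int(text[start:pos])   # just a count, not a number in the language
            pos += 1                       # skip the 'H'
            if pos + count > len(text):
                raise SyntaxError("counted token runs past the end of the input")
            return text[pos:pos + count], pos + count

        print(read_counted_token("10Habcdefghij, X", 0))   # ('abcdefghij', 13)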

    Kind regards,
    Chris

    --
    Chris Clark                  email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
    23 Bailey Rd                 voice: (508) 435-5016
    Berlin, MA 01503 USA         twitter: @intel_chris

    [Right. When I was writing Fortran lexers, Hollerith strings were among
    the simplest of the kludges I had to use. -John]

  • From Hans-Peter Diettrich@21:1/5 to Christopher F Clark on Wed Jun 8 05:32:40 2022
    On 6/7/22 6:40 PM, Christopher F Clark wrote:

    > And, as our moderator pointed out, this makes a terrible regular
    > expression, NFA, or DFA, but it is actually quite easy in nearly any
    > programming language.

    Now I know what made me think of Hollerith constants with the "H" :-)

    I doubt that it's "quite easy" for humans to use Hollerith constants -
    how often do you have to check whether you got the right number of
    characters when reading or writing such a constant? So the delimited
    form of strings is easier for both humans and DFAs to handle, a
    win-win situation :-)

    DoDi

  • From gah4@21:1/5 to Hans-Peter Diettrich on Thu Jun 9 11:54:14 2022
    On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:

    (snip)

    > Now I know what made me think of Hollerith constants with the "H" :-)

    > I doubt that it's "quite easy" for humans to use Hollerith constants -
    > how often do you have to check whether you got the right number of
    > characters when reading or writing such a constant? So the delimited
    > form of strings is easier for both humans and DFAs to handle, a
    > win-win situation :-)

    It definitely seems that way now.

    There is a document that Knuth calls "Fortran 0", with the description
    of the Fortran language before they finished the first compiler,
    maybe before they started it.

    I never had many of them, but there are plenty of stories about
    "Fortran coding forms", with 80 little boxes on each row,
    to write down what you want punched on cards. Then, as the
    story goes, someone else would punch them for you. I never had
    anyone to punch my cards, though I learned how to use a keypunch
    when I was about nine.

    In any case, if you write your program on a coding form, with
    each character in a little box, it is easy to know how many are
    in each H constant.

    Even more, Fortran I/O depended on getting things in the right
    column until list-directed I/O (the name, as far as I know, borrowed
    from PL/I) was added in 1977.

    IBM added apostrophe-delimited constants to Fortran IV early
    on, but they didn't get into the Fortran 66 standard.

    One reason for the early Fortran character set was the characters
    available on the 026 keypunch. For B5500 ALGOL, you had
    to use multi-punch to get many of the characters that didn't
    have a key. But IBM didn't use that.

  • From Robin Vowels@21:1/5 to All on Fri Jun 10 12:21:00 2022
    From: "gah4" <gah4@u.washington.edu>
    Subject: Re: counted strings

    > On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:

    > In any case, if you write your program on a coding form, with
    > each character in a little box, it is easy to know how many are
    > in each H constant.

    Nevertheless, counting the number of characters was a constant source of
    error. It was easy enough to include the letter 'H' in the character count,
    so that the following character got gobbled up in the Hollerith constant,
    resulting in weird error messages.
    When a Hollerith constant was long enough to require a continuation card,
    it was even easier to lose count, with the continuation character in
    column 6 sometimes being included.
    And when the Hollerith constant required 133 characters, how many could
    reliably count all of them?

  • From Martin Ward@21:1/5 to Robin Vowels on Sat Jun 11 10:52:08 2022
    On 10/06/2022 03:21, Robin Vowels wrote:
    > Nevertheless, counting the number of characters was a constant source
    > of error. It was easy enough to include the letter 'H' in the
    > character count, so that the following character got gobbled up in
    > the Hollerith constant, resulting in weird error messages. When a
    > Hollerith constant was long enough to require a continuation card, it
    > was even easier to lose count, with the continuation character in
    > column 6 sometimes being included. And when the Hollerith constant
    > required 133 characters, how many could reliably count all of them?

    The point about coding forms was that each column of characters
    was numbered, so you just had to take the first column and the last
    and compute last - first + 1 to get the number of characters
    in the string. You didn't have to count each one individually.
    If there was a continuation then you just computed last + 66 - first + 1
    (66 being the statement columns, 7 through 72, available on each card).
    For 133 characters, there would be two continuation cards
    and the last column would be the same as the first:
    so quite easy to count reliably, in fact!

    Back in the days before pocket calculators, many people could
    do simple arithmetic sums in their heads! :-)

    --
    Martin

    Dr Martin Ward | Email: martin@gkc.org.uk | http://www.gkc.org.uk
    G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4

  • From Dennis Boone@21:1/5 to All on Sat Jun 11 11:09:11 2022
    > And when the Hollerith constant required 133 characters, how many could
    > reliably count all of them?

    Such a long Hollerith string would be uncommon, I think. The main
    purpose would seem to be headers on a printed report. It appears that
    the 'T' specifier wasn't available in the early 60s versions of IBM
    FORTRAN, but it certainly was there in FORTRAN 66.

    De
    [Early Fortran mostly read and wrote to tape files so who knows what long strings people might have needed. Either way, I think we've beaten this
    topic long enough. -John]
