Are regular expressions still the best way to specify tokens?

I suspect that in many cases the usual RE is not optimal for
lexical analysis, other than being well known.
But as noted, DFAs are likely the best way to implement them,
though that could change with changes in computer hardware.
On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:
(snip)
Are regular expressions still the best way to specify tokens?
Some years ago, I used to work with a company that sold hardware
search processors to a certain three letter agency that we are not
supposed to mention, but everyone knows.
Their hardware used a completely different language, PSL (Pattern
Specification Language), much more powerful than the usual regular expressions.
Both the standard and extended regular expressions are nice, in that we
get used to using them, especially with grep, and without thinking too
much about them.
I suspect, though, that if they hadn't previously been defined, we
might come up with something different today.
Among others, PSL has the ability to define approximate matches,
such as a word with one or more misspellings, that is, insertions,
deletions, or substitutions. Usual REs don't have that ability.
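Lacking PSL's built-in operator, approximate matching of this sort can be sketched directly. The following minimal Python illustration (function names are my own, not PSL's) computes the edit distance that counts exactly those insertions, deletions, and substitutions:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance:
    # counts insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def approx_match(word: str, target: str, k: int = 1) -> bool:
    # True when `word` is within k edits of `target`.
    return edit_distance(word, target) <= k

assert edit_distance("mispell", "misspell") == 1   # one insertion away
assert approx_match("mispell", "misspell", 1)
```

A dedicated engine would compile this into an automaton rather than run the table per word, but the sketch shows what "one or more misspellings" means operationally.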
There are also PSL expressions for ranges of numbers.
You can often do that with very complicated REs, considering
all of the possibilities; PSL processes those possibilities
automatically. (Some can expand to complicated code.)
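To see how hairy a hand-written range RE gets, here is the classic 0-255 case in Python's `re` module, enumerating by hand the alternatives that a tool like PSL would generate automatically (the pattern is my own sketch, not PSL output):

```python
import re

# Match an integer in the range 0..255, spelled out as alternatives:
# 250-255, 200-249, 100-199, and 0-99 (no leading zeros).
OCTET = re.compile(r'^(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$')

assert OCTET.match("0")
assert OCTET.match("255")
assert not OCTET.match("256")
assert not OCTET.match("007")   # leading zeros rejected
```

A three-digit range already needs four alternatives; a range like 17..4095 needs many more, which is exactly the expansion a range operator hides from you.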
I will look into PSL. There are algorithms for converting regexes to DFA
and then using the DFA to tokenize the input. Are there algorithms for converting PSL to (what?) and then using the (what?) to tokenize the input?
Regular expressions have the advantage that once you've paid the one-time cost
of making a DFA, the matching is extremely fast. The lexer is usually
one of the slowest parts of a compiler, since it is the only part that has to
look at each character of the source program, so this is a place where speed matters.
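To make the one-time-cost point concrete, here is a minimal table-driven DFA lexer sketch in Python (the states, token names, and helper functions are illustrative, not from any particular generator); once the table is built, matching costs one dictionary lookup per input character:

```python
# States: 0 = start, 1 = in identifier, 2 = in integer.
# A missing table entry means reject.

def char_class(c: str) -> str:
    if c.isalpha() or c == "_":
        return "alpha"
    if c.isdigit():
        return "digit"
    return "other"

TABLE = {
    (0, "alpha"): 1, (0, "digit"): 2,
    (1, "alpha"): 1, (1, "digit"): 1,
    (2, "digit"): 2,
}
ACCEPT = {1: "IDENT", 2: "INT"}

def longest_token(s: str, pos: int):
    # Run the DFA, remembering the last accepting position (maximal munch).
    state, last = 0, None
    for i in range(pos, len(s)):
        state = TABLE.get((state, char_class(s[i])), -1)
        if state == -1:
            break
        if state in ACCEPT:
            last = (ACCEPT[state], i + 1)
    return last  # (kind, end index) or None

def tokenize(s: str):
    pos, out = 0, []
    while pos < len(s):
        if s[pos].isspace():
            pos += 1
            continue
        hit = longest_token(s, pos)
        if hit is None:
            raise ValueError(f"bad character at {pos}")
        kind, end = hit
        out.append((kind, s[pos:end]))
        pos = end
    return out
```

A real generated lexer replaces the dictionary with dense arrays indexed by state and character class, but the inner loop has the same shape: no backtracking, one transition per character.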
I suspect that if regexes hadn't previously
been defined, we might come up with
something different today.
Wow! That is a remarkable statement.
In fact, there is only one thing that I have not seen a DFA lexer do that I
think is worth doing at the lexical level (and not via a screener). That is
recognizing tokens that start with a length prefix, e.g. 10Habcdefghij.
Such tokens are common in things like network protocols, and they would be
relatively easy to implement, but I've not seen it done.
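A length-prefixed token cannot be matched by a fixed DFA, since the count determines how many characters follow, but a scanner with one extra counter handles it easily. A hypothetical Python sketch (names my own):

```python
import re

# Recognize a Hollerith-style length-prefixed token such as 10Habcdefghij:
# read the decimal count, then consume exactly that many characters.
PREFIX = re.compile(r'(\d+)H')

def read_hollerith(s: str, pos: int):
    m = PREFIX.match(s, pos)
    if not m:
        return None                  # not a length-prefixed token here
    n = int(m.group(1))
    start = m.end()
    if start + n > len(s):
        raise ValueError("input ends inside Hollerith constant")
    # Return the token body and the position just past it.
    return s[start:start + n], start + n
```

The regex handles only the prefix; the counted consumption is plain code, which is why this belongs in the scanner's driver rather than in the pattern language.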
Beyond that, it is my relatively firm belief that one should almost always have
only simple regular expressions; e.g., the one for floating point numbers
should be one of the most complex ones. Otherwise you are trying
to do too much in the scanner, and you are asking for trouble when you do.
And, as our moderator pointed out, this makes a terrible regular
expression, NFA, or DFA, but it is actually quite easy in nearly any
programming language.
Now I know what made me think of Hollerith constants with the "H" :-)
I doubt that it's "quite easy" to use Hollerith constants for humans -
how often do you have to check whether you got the right number of
characters when reading or writing such a constant? So the delimited
form of strings is easier to handle by both humans and DFAs, a win-win
situation :-)
On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:
In any case, if you write your program on a coding form, with
each character in a little box, it is easy to know how many are
in each H constant.
Nevertheless, counting the number of characters was a constant source
of error. It was easy enough to include the letter 'H' in the
character count, so that the following character got gobbled up in
the Hollerith constant, resulting in weird error messages. When a
Hollerith constant was long enough to require a continuation card, it
was even easier to lose count, the continuation character in column
6 sometimes being included. And when the Hollerith constant required
133 characters, how many could reliably count all of them?