... Likewise, one might modify the variable name
pattern, but I'm not sure how one says "everything that doesn't start with one
of these other 110 patterns".
I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
of 101 BASIC Games (DEF FN is the last feature to add).
Then I got to Super Star Trek. To save memory, SST removes most spaces, so lines look like this:
100FORI=1TO10
Here's my current patterns that match bits of this line:
FOR { return FOR; }
[:,;()\^=+\-*/\<\>] { return yytext[0]; }
[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
yylval.d = atof(yytext);
return NUMBER;
}
"FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
yylval.s = g_string_new(yytext);
return IDENTIFIER;
}
These correctly pick out some parts, numbers and = for instance, so it sees:
100 FORI = 1 TO 10
The problem is that FORI part. Some BASICs allow variable names with more than
two characters, so in theory, FORI could be a variable. These BASICs outlaw that in their parsers; any string that starts with a keyword exits then, so this would always parse as FOR. In lex, FORI is longer than FOR, so it returns
a variable token called FORI.
Is there a way to represent this in lex? Over on Stack Overflow the only suggestion seemed to be to use trailing syntax on the keywords, but that appears to require modifying every one of simple patterns for keywords with some extra (and ugly) syntax. Likewise, one might modify the variable name pattern, but I'm not sure how one says "everything that doesn't start with one
of these other 110 patterns".
Unless I'm mistaken, Integer BASIC is a "deep tokenizer", so is it not
the case the spaces exist only at entry time? IE, if one types
'PRINT"HELLO"' that will actually be LISTed as 'PRINT "HELLO"?
[That is my recollection. Storing tokens takes less space than
storing the input text and on those tiny little
micros, every byte counted. -John]
[You can try start states but in my experience if the tokenizing rules
are very context sensitive, it's easier to give up and hand-code the
lexer. The lexical syntax of Basic isn't that big. -John]
...So any statement either starts
with a keyword (a control character), which determines
the statement type, or it is an assignment with the LET
keyword omitted...
"Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution
to the problem of keywords being part of identifiers in languages with optional spaces.
The problem is that some keywords can appear at places other than the beginning of an identifier.
In fact, in the worst case scenario, the language can be ambiguous.
...
With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and a lexer that returns all possible sets of
tokenizations one can find all the relevant parse trees and then see
if only 1 makes semantic sense.
"Ev. Drikos" <drikosev@gmail.com> posted an interesting albeit partial solution-- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres
to the problem of keywords being part of identifiers in languages with optional spaces.
I won't include it here.
The problem is that some keywords can appear at places other than the beginning of an identifier.
In fact, in the worst case scenario, the language can be ambiguous.
Consider the following "BASIC" program extended with variables that
are more than one letter long
and spaces being optional.
10 LET ITO = 1
20 LET I = 2
30 LET JTOK = 3
40 LET K = 4
50 FOR N = ITOJTOK
60 REM AMBIGUOUS FOR N = I TO JTOK
70 REM OR FOR N = ITOJ TO K
80 PRINT N;
90 NEXT N
100 END
The problem with such solutions is one is tempted to "fix" them one by
one as they are encountered.
Maury Markowitz <maury.markowitz@gmail.com> mentioned this in his post
where ATO was considered.
It could be A TO or AT O (presuming that TO and AT are both keywords)
Note that this is even an issue with 1 letter variable names if one
has both keywords.
As one starts patching up these cases, the "grammar"
(or its recursive descent implementation most likely)
begins to become what I call "ad hack".
With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and
a lexer that returns all possible sets of tokenizations one can find
all the relevant parse
trees and then see if only 1 makes semantic sense.
In the above example, that won't help as both interpretations are
legal programs.
One prints 2 3, the other 1 2 3 4.
I cannot imagine a programmer being happy with the error message:
LINE 50 AMBIGUOUS STATEMENT.
With a GLR parser (or something equivalent in power, e.g. an Earley
parser or CYK) and a lexer that returns all possible sets of
tokenizations one can find all the relevant parse trees and then see
if only 1 makes semantic sense.
On 29/02/2020 11:48, Christopher F Clark wrote:
...
Obviously, those who coded such BASIC parsers had some simpler rules,
ie the position of the first 'TO' might be used for the statement 50.
...
As far as I can tell (and I haven't thought seriously about BASIC for
years), there are roughly two cases.
1) We have already identified the lower bound in "lower bound TO upper bound". This can happen if we saw a number, or maybe an identifier
without the word "TO" in it (followed by a space?). This will not be
true, if we just saw an operator such as "+" as we will be expecting
another identifier to continue the expression...
...
So, what would I do? I would do it in 3 parts. A lexer, a "fixer",
and a parser...
I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs like the original DEC/MS, HP/DG etc.
I have it mostly working for a good chunk
of 101 BASIC Games (DEF FN is the last feature to add).
Then I got to Super Star Trek. To save memory, SST removes most spaces, so lines look like this:
100FORI=1TO10
class Lexer:
"""
Lexer emulates quirky Applesoft tokenizer as closely as possible.
Thus, it diverges from what a normal lexer would do. In
particular, spaces are completely insignificant, except within
strings, except if it finds "AT" followed by a space and then an
"N" appears. That is parsed as "AT N", not "ATN". Fun?;)
"""
I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs like the original DEC/MS, HP/DG etc. ...
The problem is that FORI part. Some BASICs allow variable names with more than
two characters, so in theory, FORI could be a variable. ...
Is there a canonical cure for this sort of problem that isn't worse than the disease?
I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
of 101 BASIC Games (DEF FN is the last feature to add).
Then I got to Super Star Trek. To save memory, SST removes most spaces, so lines look like this:
100FORI=1TO10
Sysop: | Keyop |
---|---|
Location: | Huddersfield, West Yorkshire, UK |
Users: | 296 |
Nodes: | 16 (2 / 14) |
Uptime: | 39:17:55 |
Calls: | 6,648 |
Files: | 12,193 |
Messages: | 5,329,316 |