I remade {tangle,weave}.web and c{tangle,weave}.w, both in a more C-like form, as ordinary contiguous code+comments (+ UNICODE, because I'd like to keep some of the original spirit of making the program+documentation more math-/typeset-friendly) ... and might post what I have on GitHub or somewhere, even though it hasn't yet been baseline-tested. Among other things, I want to remake my C compiler to directly accept UNICODE for the operators. The
glossary I use is (↑,↓,→,≠,≡,≤,≥,∷,∧,∨,¬) for (++,--,->,!=,==,<=,>=,::,&&,||,!), with the latter treated as legacy digraph forms of the former. I'll post a note with the link when I get it up. The context-sensitive grammar is a switch statement that runs up to 4 levels deep. It *might* be yacc-convertible as is, since some versions of yacc allow for stack-hacking, e.g. a $-1 argument in a rule.
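To make the glossary concrete, here's a throwaway sketch of the mapping as a preprocessing pass (Python; the function name and approach are just illustrative, not how the actual compiler does it):

```python
# Illustrative sketch: rewrite the Unicode operators into their ASCII
# "legacy digraph" forms. The mapping is the glossary given above.

GLOSSARY = {
    "↑": "++", "↓": "--", "→": "->", "≠": "!=", "≡": "==",
    "≤": "<=", "≥": ">=", "∷": "::", "∧": "&&", "∨": "||", "¬": "!",
}

def to_legacy(src: str) -> str:
    """Rewrite each Unicode operator as its ASCII digraph form."""
    for uni, ascii_op in GLOSSARY.items():
        src = src.replace(uni, ascii_op)
    return src

print(to_legacy("p→next ≠ NULL ∧ ¬done"))  # p->next != NULL && !done
```

Going the other direction (ASCII to Unicode) is the same table inverted, though a real lexer would do this at the token level rather than by blind string substitution, so as not to touch string literals and comments.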
On Monday, April 5, 2021 at 9:55:35 AM UTC-5, gah4 wrote:
It seems to me that macro processors were popular in the 1970's.
Some years ago, I used to work with Mortran, which is a macro processor
meant to be used to process an improved language like Fortran, which is
then converted into Fortran...
You hit the nail on the head here.
Partially in defense of Knuth, it looks like what he was trying to do (in the background of all this) was establish an entirely new framework for parsing that got out from under the deficiencies of LR; one that was (a) streaming and (b) potentially even parallelizable. By "streaming" I mean that you can literally start and end a partial parse in mid-stream and *still* get meaningful
results, even if the chunk that was processed doesn't form any kind of grammatical unit at all ... which has obvious utility for intelligently processing macros.
That's CYK-on-steroids.
The lack of such a facility also stood as a roadblock to making cfront
directly into a C++ compiler. In the old cfront code (e.g. the first version, release E, which we transcribed and is currently here
http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront
...
and an updated/corrected version of which will hopefully be out on GitHub soon)
there are indications that Stroustrup was trying to make it into a *direct* C++-to-C processor. He had an option in the build script, in release E or the other early releases, to allow for this, but never implemented it. The macro layer stands in the way of everything. So, instead, he set up C++-to-C conversion as a Rube Goldberg device that got rid of macros first ... totally destroying the transparency of the translation process, when what we actually wanted was a translation that kept the macros intact but
intelligently processed them.
(His original aim is much more doable now, by the way, than in the 1980's, because the *forward* advancement of C from the 1980's to C99 actually entails a dramatic *simplification* of what's needed for a "cfront" like utility, and much of what went into the original is no longer needed, on that account. You can almost even write up a utility as a simple editor script, now! ... Except for exception-handling. Most people forget, translation is a *contravariant* function of source language - it simplifies when source language complexifies. So things become *more* doable, not less!)
At the time of LR's inception, there were two major results which were never fully developed or exploited: (a) the Chomsky-Schützenberger Theorem (CST), which *very nearly* gives you the ability to make a 100% streaming representation for translators, with one "frame" per word -- except the CST "kernels" have word boundary markers; and (b) its successor of sorts, Greibach's "Hardest CFL" construction, which successfully incorporates word-boundary markers into the word frames.
I'll show you how it can be done, now, with the newer formalism for context-free expressions.
This:
L → α | L S β,
S → x γ | u L v δ,
S w L ω
is a translation grammar containing input words from the set {u,v,w,x}, output actions from the set {α,β,γ,δ,ω} which have an obvious interpretation in terms of "reduce" (α,β,γ,δ) and "accept" (ω) actions; plus a starting configuration S w L ω. That's typical of what yacc assumes, except for the loosening of restrictions on what's allowed for the "start".
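As a sanity check on where the actions land, here is a throwaway translator for this grammar (Python, names illustrative). It parses top-down for simplicity, but it emits the output actions in exactly the canonical bottom-up order: each reduce action fires at the moment its rule's body is complete, and ω fires on accept.

```python
# Illustrative sketch: translator for
#   L → α | L S β,  S → x γ | u L v δ,  start: S w L ω
# emitting the reduce actions (α, β, γ, δ) and accept action (ω)
# in canonical bottom-up order.

def translate(word):
    pos = 0
    out = []

    def peek():
        return word[pos] if pos < len(word) else None

    def expect(t):
        nonlocal pos
        assert peek() == t, f"expected {t!r} at position {pos}"
        pos += 1

    def L():                  # L → α | L S β
        out.append("α")       # the innermost L → ε reduces first
        while peek() in ("x", "u"):
            S()
            out.append("β")   # reduce L → L S

    def S():                  # S → x γ | u L v δ
        if peek() == "x":
            expect("x"); out.append("γ")
        else:
            expect("u"); L(); expect("v"); out.append("δ")

    S(); expect("w"); L()     # starting configuration: S w L ω
    out.append("ω")
    assert pos == len(word), "trailing input"
    return out

print(translate("uxvwx"))  # ['α', 'γ', 'β', 'δ', 'α', 'γ', 'β', 'ω']
```

For instance, "xw" comes out as γ α ω: reduce S → x, then (after the w) reduce L → ε, then accept.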
It is possible, by the way, to hack yacc to allow for *any* rule body under %start and to allow "%start" to be issued anywhere, and multiple times; by
just treating each "%start α" declaration as being synonymous with a hidden rule of the form "$accept: α $end;" ... because that's actually already how
it is processed and even displayed under the "-v" option in some implementations. With the loosening, you can even add in start actions under %start (thereby eliminating the need for some of bison's kludges).
This translation grammar is in "canonical bottom-up" form - all actions occur at the end, no in-between actions. (In yacc, in-between actions are hacked by turning them into actions for special single-use non-terminals that have empty transitions ... which you can see on display when you run the "-v" option on a source file containing rules with in-between actions.)
The translation grammar has this as its CST-kernel:
<0| (u<u| + v<v| + w<w| + x<x| + (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|)* |L>|w>|S>|0> ω
In the 1960's version of the theorem, the labelled brackets <u|,<v|,...; |u>,|v>,... were interpreted as just extra word symbols (and actually Chomsky
& Schützenberger replaced words with brackets too). In the newer algebraic form ... which may finally see publication soon this year ... the brackets
form an algebra with <i| |j> = 0 if i ≠ j, <i| |i> = 1, which commute with the other symbols. In addition, for translator grammars, input and output symbols commute (e.g. αx = xα, the conversion of αx to xα is, itself, the very essence of a "look-ahead"). One can also conceive of cases where there
are *multiple* input and output channels, each having their own alphabet, the different channels' alphabet commuting with one another.
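The bracket relations can be made concrete in a few lines. This sketch (the list-of-pairs representation is mine, purely illustrative) computes the normal form of a bracket word, treating <i||i> = 1 as annihilation and <i||j> = 0, i ≠ j, as collapse to the zero element; a word is in normal form when no bra stands immediately before a ket, i.e. all kets precede all bras:

```python
# Illustrative sketch: normal forms in the bracket algebra
# <i||j> = 0 (i ≠ j), <i||i> = 1. A monomial is a list of
# ("bra", i) / ("ket", i) symbols; ZERO is the zero element.

ZERO = None

def bra(i): return ("bra", i)
def ket(i): return ("ket", i)

def reduce_word(symbols):
    """Cancel every bra immediately followed by a ket; a label
    mismatch kills the whole monomial (returns ZERO)."""
    out = []
    for s in symbols:
        if s[0] == "ket" and out and out[-1][0] == "bra":
            if out[-1][1] == s[1]:
                out.pop()        # <i||i> = 1: both symbols vanish
            else:
                return ZERO      # <i||j> = 0 for i ≠ j
        else:
            out.append(s)
    return out

print(reduce_word([bra("x"), ket("x")]))            # [] (the identity)
print(reduce_word([bra("x"), ket("v")]))            # None (zero)
print(reduce_word([ket("S"), bra("L"), ket("L")]))  # [('ket', 'S')]
```

The one-pass stack scan suffices because the rewriting is terminating and locally confluent, so the normal form is unique.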
The kernel has the form
I (X + YR)* YF
with
I = <0|
X = u<u| + v<v| + w<w| + x<x|
YR = (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|
YF = |L>|w>|S>|0> ω
and, in fact, YR and YF both factor into
Y = α<α| + β<β| + γ<γ| + δ<δ| + ω<ω|
R = (|α> + |β>|S>|L>)<L| + (|γ>|x> + |δ>|v>|L>|u>)<S|
F = |ω>|L>|w>|S>|0>
and the original kernel is actually equal to I (X + Y + R)* F ... a form which is suited for translation grammars that are *not* in bottom-up canonical form. The reduction I (X + Y + R)* F = I (X + YR)* YF only applies if the
translation grammar is in canonical form.
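The factorization can be checked term by term from the relations <i||i> = 1, <i||j> = 0 (all cross terms vanish) and the commutation of output symbols with brackets; in LaTeX notation:

```latex
% Term-by-term check of YR = Y·R and YF = Y·F.
\begin{align*}
Y\,R &= \alpha\langle\alpha|\,|\alpha\rangle\langle L|
      + \beta\langle\beta|\,|\beta\rangle|S\rangle|L\rangle\langle L|
      + \gamma\langle\gamma|\,|\gamma\rangle|x\rangle\langle S|
      + \delta\langle\delta|\,|\delta\rangle|v\rangle|L\rangle|u\rangle\langle S| \\
     &= (\alpha + |S\rangle|L\rangle\beta)\langle L|
      + (|x\rangle\gamma + |v\rangle|L\rangle|u\rangle\delta)\langle S|
      \;=\; YR, \\
Y\,F &= \omega\langle\omega|\,|\omega\rangle|L\rangle|w\rangle|S\rangle|0\rangle
      = |L\rangle|w\rangle|S\rangle|0\rangle\,\omega \;=\; YF.
\end{align*}
```

Note the ω-term of Y contributes nothing to Y·R (R contains no |ω>), and only the ω-term survives in Y·F.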
By Kleene algebra, the kernel is equal to: I (YR)* (X (YR)*)* YF which
contains the "renormalized" forms of:
Start action: I' = I (YR)* = <0| (YR)*
Inputs: X' = X (YR)* = Σ_x x p'(x), composed of
Word frames: p'(x) = <x| (YR)*
Final action: F' = YF = |L>|w>|S>|0> ω
That's a form suitable for both (a) mid-stream parsing, e.g. a word "vxu"
would be processed as p'(v) p'(x) p'(u) ... for instance, if vxu were a macro body; and (b) a parallelized form of CYK. Except: instead of assembling chunks bottom-up in dynamic programming from length 1, then length 2, then length 3, etc. in a "CYK pyramid", you could do it both in parallel *and* exponentially, from length 1, length 2, length 4, length 8, etc.
So, there are your "word frames" and you have a means to produce both
in-stream and even parallel translators ... which is ideally suited also for intelligently handling macros.
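To make the mid-stream/parallel claim concrete: since the bracket reduction is a normal-form map for a quotient monoid, it respects concatenation, so each chunk can be reduced on its own and the reduced chunks merged pairwise - lengths 1, 2, 4, 8, ... - which is exactly what licenses both in-stream processing and the exponential doubling schedule. A small sketch (Python, representation illustrative):

```python
# Illustrative sketch: chunk-wise reduction of bracket words composes.
# Each chunk reduces independently; reduced chunks concatenate and
# reduce again, and this combine step is associative.

from functools import reduce as fold

ZERO = None

def reduce_word(symbols):
    """Normal form under <i||i> = 1, <i||j> = 0; ZERO on mismatch."""
    out = []
    for s in symbols:
        if s[0] == "ket" and out and out[-1][0] == "bra":
            if out[-1][1] == s[1]:
                out.pop()
            else:
                return ZERO
        else:
            out.append(s)
    return out

def combine(a, b):
    """Multiply two already-reduced frames."""
    if a is ZERO or b is ZERO:
        return ZERO
    return reduce_word(a + b)

chunks = [
    [("bra", "u")],                    # frame contributed by input u
    [("ket", "u"), ("bra", "L")],      # a mid-stream fragment
    [("ket", "L"), ("ket", "S")],
]
whole = fold(combine, (reduce_word(c) for c in chunks))
print(whole == reduce_word(sum(chunks, [])))  # True
```

Because combine is associative, the fold can just as well be done as a balanced tree in parallel, and any contiguous fragment - a macro body, say - reduces to a meaningful frame on its own.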
It is also suitable for both deterministic and non-deterministic parsing; and it ultimately subsumes the stack-based framework - devised originally by
Tomita - for GLR parsing; as well as generalizing to translation grammars
which are not canonical.
The role that Knuth's LR theory plays is that it provides a means to construct a shift-reduce parser. Seen in this context, it's easy to explain both what it is and what it does. A shift-reduce parser provides a set of transitions z: q→q' between states q, q' on a [non-]terminal z, which generates a morphism
<z| ↦ p(z) ≡ Σ_{z:q→q'} |q><q|<q'| for all [non-]terminals z, and <0| ↦ <0|;
|z> ↦ q(z) ≡ Σ_{z:q→q'} |q'>|q><q| for all [non-]terminals z, and |0> ↦ |0>
that (a) yields the same result when substituted into the CST kernel and (b) provides for a more nearly-deterministic push-down transducer when the bracket symbols are subject to the obvious stack interpretation with |0><0| playing
the role of the "empty stack" projection operator.
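One way to see the stack interpretation in action, on a single hypothetical transition z: q0→q1, under one consistent choice of conventions (configurations are ket-words with the stack top leftmost and |0> at the bottom; operators act by left multiplication; all names here are illustrative). With that choice, q(z) acts as "push q1 when q0 is on top" and p(z) is its exact inverse, with mismatches collapsing to zero:

```python
# Illustrative sketch: the stack interpretation of the brackets,
# for one transition z: q0 → q1, so that
#   p(z) = |q0><q0|<q1|   (image of <z|)
#   q(z) = |q1>|q0><q0|   (image of |z>)
# Multiplication = concatenation followed by <i||i> = 1, <i||j> = 0.

ZERO = None

def bra(i): return ("bra", i)
def ket(i): return ("ket", i)

def reduce_word(symbols):
    out = []
    for s in symbols:
        if s[0] == "ket" and out and out[-1][0] == "bra":
            if out[-1][1] == s[1]:
                out.pop()
            else:
                return ZERO
        else:
            out.append(s)
    return out

p_z = [ket("q0"), bra("q0"), bra("q1")]
q_z = [ket("q1"), ket("q0"), bra("q0")]

config = [ket("q0"), ket("0")]        # stack: q0 atop the bottom marker

shifted = reduce_word(q_z + config)   # push q1 on top of q0
print(shifted)                        # [('ket', 'q1'), ('ket', 'q0'), ('ket', '0')]

rolled = reduce_word(p_z + shifted)   # inverse step exposes q0 again
print(rolled == config)               # True

print(reduce_word(p_z + config))      # None: stack not topped by q1 → zero
```

The |0><0| projection then just says a computation only counts if it starts and ends with the bare bottom marker, i.e. with an empty stack.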
It can be written more succinctly in graphical form by defining [q] ≡ |q><q|, q⇐ ≡ |q>, ⇒q ≡ <q|, [q,⋯,q'] = [q] + ⋯ + [q'], with other comma-separated subgraphs understood to be added, e.g. [0]⇒2,[1]⇒3 = |0><0|<2| + |1><1|<3|. In that case, a rule such as S → u L v corresponds to |v>|L>|u><S|, which translates into q(v)q(L)q(u)p(S) and takes the form of a sum of graphs that looks like q⇐q⇐q⇐[q]⇒q.
That's the expression for the "roll-back" action. The absence of the explicit account of the roll-back actions from LR theory is the chief deficiency of the entire formalism! It's normally imposed externally, either as a narrative description or else (in an implementation) as a cookie-cutter code skeleton (like in BSD yacc) or in the system library (like UNIX yacc) or user-specifiable (like in bison).
The right-ward pointing segment on the left corresponds to what is actually called a "lane" in LR-theory, while [q] is called a "laneahead" state. You can read the LALR "lookahead"'s directly off the graph - it's whatever lookaheads the states, q, to the right of the laneahead state have.
The current state of LR(k), for k > 0, by the way, runs pretty much like this. Pager's classical theory - which came out shortly after Knuth's - has been largely superseded by IELR (the IELR paper is here
https://core.ac.uk/download/pdf/82047055.pdf), which is what bison uses; and to a limited extent by Pager and Chen's newer formulation in "hyacc", which better handles LR(1) and a limited subset of LR(k) for k > 1. Pager and Chen, however, haven't figured out how to deal with "cyclic" lookaheads -
which I think correspond to the cases where the (YR)* roll-back expression has star-height 1. Nor do I know how their newer method compares with IELR; other than that IELR is limited to LR(1).
You might be able to handle the situation within the algebraic framework just described here and, thereby, extend their approach to all LR(k) grammars for k > 1.