• TeX syntax?

    From Rock Brentwood@21:1/5 to All on Sun Apr 4 14:08:30 2021
    [ This is a followup to a thread from 2007. ]


    I've looked high and low without success. Where can i find
    something resembling the BNF of Knuth's TeX typesetting syntax?

    It's in the file weave.web, section 14.

    The syntax for TeX was written as a context-sensitive translation
    grammar, suitable for a streaming translator, rather than a
    context-free grammar. It may be possible to convert it to one (either
    directly or as a context-free enveloping grammar with semantic
    constraints). That's a matter that may be worth looking into. But in
    its present form, there is no tree-building required or involved: it
    can stream. The distinction is analogous to that between SSDT's versus
    SDT's ([S]SDT = [simple-]syntax-directed translations). SSDT's can be
    streamed, SDT's require stacking or treeing values and, in effect, SDT
    = SSDT + value-stacking/treeing.
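
    To illustrate the distinction with a toy sketch (mine, not anything from
    weave.web): a translation whose output comes out in the same relative
    order as its input can be emitted symbol by symbol as the input streams
    past, while one that reorders the output has to hold values back. In C,
    roughly:

    #include <stdio.h>

    /* SSDT-style: each input symbol is translated the moment it is read;
       nothing is reordered, so nothing needs to be stacked or tree-built. */
    static void stream_translate(const char *in)
    {
        for (; *in; in++)
            switch (*in) {
            case 'a': fputs("A ", stdout); break;
            case 'b': fputs("B ", stdout); break;
            case '+': fputs("PLUS ", stdout); break;
            }
        putchar('\n');
    }

    /* General SDT-style: turning infix "a+b" into postfix "A B PLUS"
       reorders the output relative to the input, so the operator's output
       action must be deferred (a one-slot stack suffices here). */
    static void postfix_translate(const char *in)
    {
        int pending = 0;
        for (; *in; in++)
            switch (*in) {
            case 'a': fputs("A ", stdout); break;
            case 'b': fputs("B ", stdout); break;
            case '+': pending = 1; break;   /* hold the action back */
            }
        if (pending) fputs("PLUS ", stdout);
        putchar('\n');
    }

    int main(void)
    {
        stream_translate("a+b");    /* prints: A PLUS B */
        postfix_translate("a+b");   /* prints: A B PLUS */
        return 0;
    }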

    TeX is written in Web which is essentially Pascal + hyper-linked
    comments. It is also in C-Web on the main TeX distribution site, which
    is C + hyper-linked comments. They *can* be converted directly to more
    normal programs with the comments embedded. I did so in the local
    versions of my older MiKTeX distribution, but haven't
    regression-tested it yet, since I haven't yet established a working
    baseline to test against.

    The syntax is in - and is an essential part of - the weave.web file. In
    detail:
    Section 14.1 describes the framework used for the syntax
    Section 14.2 lists the "category codes" used
    Section 14.3 lists additional post-processed lexical units used
    Section 14.4 describes a processing layer from lexical units to "scraps"
    Section 14.5 contains the productions for the context-sensitive grammar

    Section 15 implements the parser, the most important routine being
    translate_cases() (its name in the C-Web file), implemented as a master
    "switch" statement (or "case" statement in Pascal) in section 15.7.

    By the way the "open" case (its subcases are in 15.19), "math" subcase
    (its sub-subcases iare in 15.20), "var_head" sub-subcase has a bug in
    it. The "intro" sub-sub-subcase listed a transition to rule 31,
    instead of to rule 33. (I want my money Knuth! :))

    I believe it's possible to convert it all to a context-free grammar,
    albeit with the possible inclusion of a few semantic constraints. Why
    Knuth chose to write everything this way - as borderline obfuscated
    code that cannot be scaled upwards or sideways or integrated in other
    existing code - is beyond me. But it is not maintainable, and
    heavily-laden with Technical Debt; notably, its *critical* reliance on
    the dated assumption that the Future would be Pascal, along with all
    the other assumptions and - more importantly - the now-unnecessary
    restrictions that came out of that.

    Much of the Web framework's very conception and design was premised on
    the assumed necessity of those restrictions; and the whole thing can be
    done on a much simpler foundation, when remade in more up-to-date terms
    (relatively speaking) *natively* in C or C++. Among other things, there
    isn't a need for any Web-like framework: you can simply use ordinary
    comments. I know, because I did so: I rewrote the entire set of Web
    files in my local copy doing just that. When a baseline is established
    and it is regression-validated, I'll put a copy up on GitHub.

    A follow-up to the additional comments at the end of the article:
    Knuth's TeX book is an abomination, describing lexing and parsing
    as mouth, gullet and stomach nonsense.

    I know. It's literally a woven and tangled mess - both the book and the code.

    [Well, he invented most of what we know about parsing, he gets to
    explain it any way he wants. Chapters 7 and 8 describe the syntax
    operationally. -John]

    Discovery. Not invention. Mathematics is not invented, it is
    discovered (and in this case: only a partial and incomplete discovery).

    And that, too, is a complete tangle that we had to remake from the
    bottom up. With recent publications [2-5] establishing the foundations
    for the new algebraic framework ... along with another, currently in
    submission, that may come out in 2021, for the remaking alluded to in
    [3] of the 1963 algebraic formulation by Chomsky and Schützenberger [1]
    that lies at the foundation of all this ... we're now finally in a
    position to refactor both the theory itself and everything that's based
    on it or is an application of it; literally remaking the entire stack
    from the bottom up.

    References:
    [1] Chomsky, N., Schützenberger, M.: "The algebraic theory of
    context-free languages". In: Braffort, P., Hirschberg, D. (eds.)
    Computer Programming and Formal Systems, pp. 118–161. North-Holland,
    Amsterdam (1963)
    [2] H. Leiß et al: "C-dioids and μ-continuous Chomsky algebras". In:
    Desharnais, J., et al. (eds.) RAMiCS 2018. LNCS, vol. 11194, pp. 21–36.
    Springer, Cham (2018)
    [3] M. Hopkins et al: "Coequalizers and Tensor Products for Continuous
    Idempotent Semirings". In: Desharnais, J., et al. (eds.) RAMiCS 2018.
    LNCS, vol. 11194, pp. 37–52. Springer, Cham (2018)
    [4] M. Hopkins: "The algebraic approach I: the algebraization of the
    Chomsky hierarchy". In: Berghammer, R., Möller, B., Struth, G. (eds.)
    RelMiCS 2008. LNCS, vol. 4988, pp. 155–172. Springer, Heidelberg (2008)
    [5] N.B.B. Grathwohl et al: "Infinitary axiomatization of the equational
    theory of context-free languages". In: Baelde, D., Carayol, A. (eds.)
    Fixed Points in Computer Science (FICS 2013). EPTCS, vol. 126, pp. 44–55
    (2013)

  • From gah4@21:1/5 to Rock Brentwood on Mon Apr 5 04:21:34 2021
    On Sunday, April 4, 2021 at 6:12:06 PM UTC-7, Rock Brentwood wrote:
    [ This is a followup to a thread from 2007. ]
    I've looked high and low without success. Where can i find
    something resembling the BNF of Knuth's TeX typesetting syntax?
    It's in the file weave.web, section 14.

    The syntax for TeX was written as a context-sensitive translation
    grammar, suitable for a streaming translator, rather than a
    context-free grammar. It may be possible to convert it to one (either
    directly or as a context-free enveloping grammar with semantic
    constraints). That's a matter that may be worth looking into. But in
    its present form, there is no tree-building required or involved: it
    can stream. The distinction is analogous to that between SSDT's versus
    SDT's ([S]SDT = [simple-]syntax-directed translations). SSDT's can be
    streamed, SDT's require stacking or treeing values and, in effect, SDT
    = SSDT + value-stacking/treeing.

    TeX is written in Web which is essentially Pascal + hyper-linked
    comments. It is also in C-Web on the main TeX distribution site, which
    is C + hyper-linked comments. They *can* be converted directly to more
    normal programs with the comments embedded. I did so in the local
    versions of my older MiKTeX distribution, but haven't
    regression-tested it yet - since I haven't established a working
    baseline yet to work off of.

    (snip)

    It seems to me that macro processors were popular in the 1970's.

    Some years ago, I used to work with Mortran, which is a macro processor
    meant to process an improved, Fortran-like language, which is then
    converted into Fortran. The language is free-form with
    semicolon-terminated statements, and with block-structured if-then-else
    and for loops. It is all done with a relatively simple macro processor
    written in about as close as possible to standard Fortran 66. The only
    extension needed is the ability to compare values read in as characters.

    The macro format is very simple, except that macros have the ability
    to define other macros. Then, as is usual with compilers, the whole
    processor is written in Mortran and translated to Fortran.

    So, TeX is also written as a macro processor, with relatively few
    built-in operations, and many more defined through the macro facility.

    The input sequences recognized by macros likely can't be generalized.
    A macro can specify just about any combination of characters to be
    included between macro arguments. Not so many use that, but they can.

    There are, then, three ways to name macros. Most common is the
    escape character, usually \, followed by some number of letters.
    They can also be the escape character followed by one non-letter.
    Which characters count as letters can be changed at any time, but
    most of the time, for user-level macros, letters are letters.

    The third possibility is that a macro can be any one character with
    the catcode of active. No escape character is needed.
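
    As a rough sketch of that scanning rule in C (simplified: this assumes
    the escape character is \ and that "letter" just means isalpha(), whereas
    real TeX consults the current catcode table for both; active characters
    are recognized by their catcode alone and don't go through this path):

    #include <ctype.h>
    #include <stdio.h>

    /* Scan one control sequence starting at s, where s[0] is the escape
       character.  The name is copied into name[] (capacity cap); the
       return value is the number of input characters consumed. */
    static size_t scan_control_sequence(const char *s, char *name, size_t cap)
    {
        size_t i = 1, n = 0;
        if (isalpha((unsigned char)s[i])) {
            /* escape + a maximal run of letters, e.g. \hskip */
            while (isalpha((unsigned char)s[i])) {
                if (n + 1 < cap)
                    name[n++] = s[i];
                i++;
            }
            while (s[i] == ' ')     /* the space after such a name is eaten */
                i++;
        } else if (s[i] != '\0') {
            name[n++] = s[i++];     /* escape + one non-letter, e.g. \% */
        }                           /* (no space-skipping in that case) */
        name[n] = '\0';
        return i;
    }

    int main(void)
    {
        char name[64];
        size_t used;

        used = scan_control_sequence("\\hskip 1cm", name, sizeof name);
        printf("%s %zu\n", name, used);    /* prints: hskip 7 */
        used = scan_control_sequence("\\% comment", name, sizeof name);
        printf("%s %zu\n", name, used);    /* prints: % 2 */
        return 0;
    }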

    It seems that there are now some programs that accept the TeX
    way of writing mathematical equations, including the standard
    macros for special characters. But not the general ability to
    write macros, such as defining new characters or renaming
    old ones, or any of the more strange things you can do. I suppose
    BNF can be written for that, but likely not for the more general
    syntax.
    [The m4 macro processor tokenizes its input and then checks to see
    if a token is a named macro. It's still used for autoconf
    configuration scripts. -John]

  • From gah4@21:1/5 to John on Mon Apr 5 13:29:02 2021
    On Monday, April 5, 2021 at 7:55:35 AM UTC-7, gah4 wrote:

    (snip, John wrote)
    [The m4 macro processor tokenizes its input and then checks to see
    if a token is a named macro. It's still used for autoconf
    configuration scripts. -John]

    I know about m4, but have managed never to use it.

    Things that I might have done with it, I usually do with awk.

    Continuing with TeX syntax: white space is ignored after a macro named
    with letters (but not after the other kinds). That is useful if you use
    a macro before actual text words. It also means that you can ignore a
    newline that comes after a macro named with letters, but not after the
    other kinds. You can cause TeX to ignore a newline otherwise by ending
    the line with a comment, introduced (usually) by a %.

    (As with all characters, you can change the comment character's catcode,
    but it is usually %.)

    One other funny thing, though. TeX has glue specifications, such that
    one might say:

    \hskip 1cm plus 0.5cm minus 0.5cm

    for horizontal space that can stretch or shrink in line justification.

    The plus and minus parts are optional. Often, macros expand to glue:

    \def\somespace{\hskip 1cm}

    and I actually knew someone who used such a macro followed by the word
    "plus" that wasn't meant as part of a glue specification, resulting in
    a very strange error message.

    I suppose one should put a \relax (the no-op macro) after every glue,
    but that is pretty rare. What is the chance that one might actually
    say 'plus' or 'minus' at that point!

    But white space that is sometimes significant and sometimes not is a
    real complication in actual use, and will also complicate a BNF
    representation.

  • From Rock Brentwood@21:1/5 to All on Fri Apr 9 15:18:40 2021
    I remade {tangle,weave}.web and c{tangle,weave}.w, both in a more C-like
    form as ordinary contiguous code+comments (+ UNICODE, because I'd like to
    keep some of the original spirit of making the program+documentation more
    math- and typeset-friendly) ... and might post what I have on GitHub or
    somewhere, even though it hasn't yet been baseline-tested. Among other
    things, I want to remake my C compiler to directly accept UNICODE for the
    operators. The glossary I use is (↑,↓,→,≠,≡,≤,≥,∷,∧,∨,¬) for
    (++,--,->,!=,==,<=,>=,::,&&,||,!), with the latter treated as legacy
    digraph forms of the former. I'll post a note with the link when I get it
    up. The context-sensitive grammar is a switch statement that runs up to 4
    levels deep. It *might* be yacc-convertible, as is, since some versions of
    yacc may allow for stack-hacking, e.g. an $-1 argument in a rule.
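
    For illustration only (a hypothetical table, not the actual lexer code),
    that glossary amounts to something like the following, with the digraphs
    mapped onto the corresponding Unicode code points:

    #include <stdio.h>
    #include <string.h>

    /* Unicode operators and their legacy ASCII digraph spellings. */
    static const struct { const char *op; const char *digraph; } glossary[] = {
        { "\u2191", "++" }, { "\u2193", "--" }, { "\u2192", "->" },
        { "\u2260", "!=" }, { "\u2261", "==" }, { "\u2264", "<=" },
        { "\u2265", ">=" }, { "\u2237", "::" }, { "\u2227", "&&" },
        { "\u2228", "||" }, { "\u00AC", "!" },
    };

    /* Map a legacy digraph to its canonical Unicode spelling, or NULL. */
    static const char *canonical(const char *tok)
    {
        for (size_t i = 0; i < sizeof glossary / sizeof glossary[0]; i++)
            if (strcmp(tok, glossary[i].digraph) == 0)
                return glossary[i].op;
        return NULL;
    }

    int main(void)
    {
        printf("%s\n", canonical("!="));   /* prints the UTF-8 for U+2260 */
        return 0;
    }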

    On Monday, April 5, 2021 at 9:55:35 AM UTC-5, gah4 wrote:
    It seems to me that macro processors were popular in the 1970's.

    Some years ago, I used to work with Mortran, which is a macro processor
    meant to be used to process an improved language like Fortran, which is
    then converted into Fortran...

    You hit the nail on the head here.

    Partially in defense of Knuth, it looks like what he was trying to do (in
    the background of all this) was establish an entirely new framework for
    parsing, one that got out from under the deficiencies of LR and that was
    (a) streaming and (b) potentially even parallelizable. By "streaming" I
    mean that you can literally start and end a partial parse in mid-stream
    and *still* get meaningful results, even if the chunk that was processed
    doesn't form any kind of grammatical unit at all ... which has obvious
    utility for intelligently processing macros.

    That's CYK-on-steroids.

    The lack of such a facility also stood as a roadblock to making cfront
    directly into a C++ compiler. The old cfront code (e.g. the first
    version, release E, which we transcribed and which is currently here:

    http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront

    ... with an updated/corrected version hopefully out on GitHub soon) shows
    indications that Stroustrup was trying to make it into a *direct* C++ to
    C processor. He had an option in the build script in release E (or the
    other early releases) to allow for this, but never implemented it. The
    macro layer stands in the way of everything. So, instead, he set up the
    C++ to C conversion as a Rube Goldberg device that got rid of macros
    first ... thus totally destroying the transparency of the translation
    process, when what we actually wanted was a translation that kept the
    macros intact, but intelligently processed them.

    (His original aim is much more doable now, by the way, than in the 1980's,
    because the *forward* advancement of C from the 1980's to C99 actually
    entails a dramatic *simplification* of what's needed for a "cfront"-like
    utility, and much of what went into the original is no longer needed, on
    that account. You can almost even write up a utility as a simple editor
    script, now! ... Except for exception-handling. Most people forget,
    translation is a *contravariant* function of the source language - it
    simplifies when the source language complexifies. So things become *more*
    doable, not less!)

    At the time of LR's inception, you had two major results which were never
    fully developed or exploited: (a) the Chomsky-Schützenberger Theorem
    (CST), which *very nearly* gives you the ability to make a 100% streaming
    representation for translators, with one "frame" per word -- except the
    CST "kernels" have word-boundary markers; and (b) its successor of sorts,
    Greibach's "Hardest CFL" construction, which successfully incorporates
    word-boundary markers into the word frames.

    I'll show you how it can be done, now, with the newer formalism for context-free expressions.

    This:
    L → α | L S β,
    S → x γ | u L v δ,
    S w L ω
    is a translation grammar containing input words from the set {u,v,w,x}
    and output actions from the set {α,β,γ,δ,ω}, which have an obvious
    interpretation in terms of "reduce" (α,β,γ,δ) and "accept" (ω) actions;
    plus a starting configuration S w L ω (the third line above). That's
    typical of what yacc assumes, except for the loosening of restrictions
    on what's allowed for the "start".
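
    For instance, working out one case: S w L ω ⇒ x γ w L ω ⇒ x γ w α ω, so
    the input x w is translated to the interleaved stream x γ w α ω - here γ
    is the reduce action for S → x, α the reduce action for the empty L, and
    ω the accept action.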

    It is possible, by the way, to hack yacc to allow for *any* rule body
    under %start and to allow "%start" to be issued anywhere, and multiple
    times, by just treating each "%start α" declaration as synonymous with a
    hidden rule of the form "$accept: α $end;" ... because that's actually
    already how it is processed, and even displayed under the "-v" option,
    in some implementations. With the loosening, you can even add in start
    actions under %start (thereby eliminating the need for some of bison's
    kludges).

    This translation grammar is in "canonical bottom-up" form - all actions
    occur at the end, no in-between actions. (In yacc, in-between actions are
    hacked by turning them into actions for special single-use non-terminals
    that have empty transitions ... which you can see on display when you run
    the "-v" option on a source file containing rules with in-between
    actions.)

    The translation grammar has this as its CST-kernel:
    <0| (u<u| + v<v| + w<w| + x<x| + (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|)* |L>|w>|S>|0> ω

    In the 1960's version of the theorem, the labelled brackets <u|, <v|, ...;
    |u>, |v>, ... were interpreted as just extra word symbols (and actually
    Chomsky & Schützenberger replaced words with brackets too). In the newer
    algebraic form ... which may finally see publication this year ... the
    brackets form an algebra with <i| |j> = 0 if i ≠ j and <i| |i> = 1, and
    they commute with the other symbols. In addition, for translator
    grammars, input and output symbols commute (e.g. αx = xα; the conversion
    of αx to xα is, itself, the very essence of a "look-ahead"). One can also
    conceive of cases where there are *multiple* input and output channels,
    each having its own alphabet, the different channels' alphabets commuting
    with one another.
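
    As a sanity check of the kernel on that same small example: for the
    translated stream x γ w α ω, pick from the starred sum the factors x<x|,
    |x>γ<S|, w<w|, α<L|, in that order, sandwiched between <0| and the final
    factor |L>|w>|S>|0>ω. Moving the brackets past the other symbols leaves
    the bracket word

    <0| <x| |x> <S| <w| <L| |L> |w> |S> |0>

    which collapses to 1 by repeated use of <i| |i> = 1, leaving exactly the
    translated stream x γ w α ω.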

    The kernel has the form
    I (X + YR)* YF
    with
    I = <0|
    X = u<u| + v<v| + w<w| + x<x|
    YR = (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|
    YF = |L>|w>|S>|0> ω
    and, in fact, YR and YF both factor into
    Y = α<α| + β<β| + γ<γ| + δ<δ| + ω<ω|
    R = (|α> + |β>|S>|L>)<L| + (|γ>|x> + |δ>|v>|L>|u>)<S|
    F = |ω>|L>|w>|S>|0>
    and the original kernel is actually equal to I (X + Y + R)* F ... a form which is suited for translation grammars that are *not* in bottom-up canonical form. The reduction I (X + Y + R)* F = I (X + YR)* YF only applies if the
    translation grammar is in canonical form.
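
    (As a quick check of the factorization against the bracket relations: in
    Y·R only the matching terms survive, e.g. α<α| · |α><L| = α<L| and
    β<β| · |β>|S>|L><L| = |S>|L>β<L|, since the cross terms contain
    <i| |j> = 0 with i ≠ j and the output symbols commute with the brackets;
    summing the surviving terms gives back YR. Likewise
    Y·F = ω<ω| · |ω>|L>|w>|S>|0> = |L>|w>|S>|0>ω = YF.)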

    By Kleene algebra, the kernel is equal to: I (YR)* (X (YR)*)* YF which
    contains the "renormalized" forms of:
    Start action: I' = I (YR)* = <0| (YR)*
    Inputs: X' = X (YR)* = Σ_x x p'(x), composed of
    Word frames: p'(x) = <x| (YR)*
    Final action: F' = YF = |L>|w>|S>|0>ω

    That's a form suitable for both (a) mid-stream parsing, e.g. a word "vxu"
    would be processed as p'(v) p'(x) p'(u) ... for instance, if vxu were a
    macro body; and (b) a parallelized form of CYK. Except: instead of
    assembling chunks bottom-up in dynamic programming from length 1, then
    length 2, then length 3, etc. in a "CYK pyramid", you could do it both in
    parallel *and* exponentially, from length 1, length 2, length 4,
    length 8, etc.
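
    For example, a four-symbol chunk s1 s2 s3 s4 contributes the product
    p'(s1) p'(s2) p'(s3) p'(s4); the partial products p'(s1) p'(s2) and
    p'(s3) p'(s4) can be computed independently, in parallel, and then
    multiplied together - and the same doubling applies at every level of
    chunk size.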

    So, there are your "word frames" and you have a means to produce both
    in-stream and even parallel translators ... which is ideally suited also for intelligently handling macros.

    It is also suitable for both deterministic and non-deterministic parsing; and it ultimately subsumes the stack-based framework - devised originally by
    Tomita - for GLR parsing; as well as generalizing to translation grammars
    which are not canonical.

    The role that Knuth's LR theory plays is that it provides a means to
    construct a shift-reduce parser. Seen in this context, it's easy to
    explain both what it is and does. A shift-reduce parser provides a set of
    transitions z: q→q' between states q, q' on a [non-]terminal z, which
    generates a morphism
        <z| ↦ p(z) ≡ Σ_{z:q→q'} |q><q|<q'| for all [non-]terminals z, and <0| ↦ <0|;
        |z> ↦ q(z) ≡ Σ_{z:q→q'} |q'>|q><q| for all [non-]terminals z, and |0> ↦ |0>,
    that (a) yields the same result when substituted into the CST kernel and
    (b) provides for a more nearly-deterministic push-down transducer when
    the bracket symbols are subject to the obvious stack interpretation, with
    |0><0| playing the role of the "empty stack" projection operator.

    It can be written more succinctly in graphical form by defining [q] ≡
    |q><q|, q⇐ ≡ |q>, ⇒q ≡ <q|, and [q,⋯,q'] = [q] + ⋯ + [q'] (with other
    comma-separated subgraphs being understood to be added, e.g.
    [0]⇒2,[1]⇒3 = |0><0|<2| + |1><1|<3|). In that case, a rule such as
    S → u L v corresponds to |v>|L>|u><S|, which translates into
    q(v)q(L)q(u)p(S), which takes the form of a sum of graphs that looks like
    q⇐q⇐q⇐[q]⇒q.

    That's the expression for the "roll-back" action. The absence of an
    explicit account of the roll-back actions from LR theory is the chief
    deficiency of the entire formalism! It's normally imposed externally,
    either as a narrative description or else (in an implementation) as a
    cookie-cutter code skeleton (like in BSD yacc), or in the system library
    (like UNIX yacc), or user-specifiable (like in bison).

    The rightward-pointing segment on the left corresponds to what is
    actually called a "lane" in LR theory, while [q] is called a "laneahead"
    state. You can read the LALR "lookaheads" directly off the graph -
    they're whatever lookaheads the states q to the right of the laneahead
    state have.

    The current state of LR(k), for k > 0, by the way, runs pretty much like
    this. Pager's classical theory - which came out shortly after Knuth's -
    has been largely superseded by IELR (the IELR paper is here:
    https://core.ac.uk/download/pdf/82047055.pdf), which is what bison uses;
    and to a limited extent by Pager and Chen's newer formulation in "hyacc",
    which better handles LR(1) and a limited subset of LR(k) for k > 1. Pager
    and Chen, however, haven't figured out how to deal with "cyclic"
    lookaheads - which I think correspond to the cases where the (YR)*
    roll-back expression has star-height 1. Nor do I know how their newer
    method compares with IELR, other than that IELR is limited to LR(1).

    You might be able to handle the situation within the algebraic framework
    just described here and, thereby, extend their approach to all LR(k)
    grammars for k ≥ 1.

  • From gah4@21:1/5 to Rock Brentwood on Fri Apr 9 18:39:39 2021
    On Friday, April 9, 2021 at 4:20:27 PM UTC-7, Rock Brentwood wrote:
    (His original aim is much more doable now, by the way, than in the 1980's, because the *forward* advancement of C from the 1980's to C99 actually entails
    a dramatic *simplification* of what's needed for a "cfront" like utility, and much of what went into the original is no longer needed, on that account. You can almost even write up a utility as a simple editor script, now! ... Except for exception-handling. Most people forget, translation is a *contravariant* function of source language - it simplifies when source language complexifies.
    So things become *more* doable, not less!)

    I have recently been working with another macro processor from the 1970's, which is written using a set of macros, and converted to standard Fortran 66.

    Among others, the macros implement IF-THEN-ELSE structures and the usual
    DO-WHILE and DO-UNTIL loops.

    I recently realized, though haven't done it yet, that I can rewrite the macros to generate Fortran 90 structured if and looping constructs.

    I did that some years ago with the MORTRAN macros, which also are designed
    to generate standard Fortran 66 code.

    But yes, it is interesting to look at what can be done with old programs
    and modern tools.
