• TeX syntax?

    From Rock Brentwood@21:1/5 to All on Sun Apr 4 14:08:30 2021
    [ This is a followup to a thread from 2007. ]


    I've looked high and low without success. Where can i find
    something resembling the BNF of Knuth's TeX typesetting syntax?

    It's in the file weave.web, section 14.

    The syntax for TeX was written as a context-sensitive translation
    grammar, suitable for a streaming translator, rather than a
    context-free grammar. It may be possible to convert it to one (either
    directly or as a context-free enveloping grammar with semantic
    constraints). That's a matter that may be worth looking into. But in
    its present form, there is no tree-building required or involved: it
    can stream. The distinction is analogous to that between SSDT's versus
    SDT's ([S]SDT = [simple-]syntax-directed translations). SSDT's can be
    streamed, SDT's require stacking or treeing values and, in effect, SDT
    = SSDT + value-stacking/treeing.
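
    To illustrate the distinction with a toy sketch (mine, not anything from
    weave.web): a translation whose output comes out in the same relative
    order as its input can be emitted symbol by symbol as the input streams
    past, while one that reorders the output has to hold values back. In C,
    roughly:

    #include <stdio.h>

    /* SSDT-style: each input symbol is translated the moment it is read;
       nothing is reordered, so nothing needs to be stacked or tree-built. */
    static void stream_translate(const char *in)
    {
        for (; *in; in++)
            switch (*in) {
            case 'a': fputs("A ", stdout); break;
            case 'b': fputs("B ", stdout); break;
            case '+': fputs("PLUS ", stdout); break;
            }
        putchar('\n');
    }

    /* General SDT-style: turning infix "a+b" into postfix "A B PLUS"
       reorders the output relative to the input, so the operator's output
       action must be deferred (a one-slot stack suffices here). */
    static void postfix_translate(const char *in)
    {
        int pending = 0;
        for (; *in; in++)
            switch (*in) {
            case 'a': fputs("A ", stdout); break;
            case 'b': fputs("B ", stdout); break;
            case '+': pending = 1; break;   /* hold the action back */
            }
        if (pending) fputs("PLUS ", stdout);
        putchar('\n');
    }

    int main(void)
    {
        stream_translate("a+b");    /* prints: A PLUS B */
        postfix_translate("a+b");   /* prints: A B PLUS */
        return 0;
    }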

    TeX is written in Web which is essentially Pascal + hyper-linked
    comments. It is also in C-Web on the main TeX distribution site, which
    is C + hyper-linked comments. They *can* be converted directly to more
    normal programs with the comments embedded. I did so in the local
    versions of my older MiKTeX distribution, but haven't
    regression-tested it yet, since I haven't yet established a working
    baseline to test against.

    The syntax is in - and is an essential part of - the weave.web file. In
    detail:
    Section 14.1 describes the framework used for the syntax
    Section 14.2 lists the "category codes" used
    Section 14.3 lists additional post-processed lexical units used
    Section 14.4 describes a processing layer from lexical units to "scraps"
    Section 14.5 contains the productions for the context-sensitive grammar

    Section 15 implements the parser, the most important routine being
    translate_cases() (its name in the C-Web file), implemented as a master
    "switch" statement (or "case" statement in Pascal) in section 15.7.

    By the way the "open" case (its subcases are in 15.19), "math" subcase
    (its sub-subcases iare in 15.20), "var_head" sub-subcase has a bug in
    it. The "intro" sub-sub-subcase listed a transition to rule 31,
    instead of to rule 33. (I want my money Knuth! :))

    I believe it's possible to convert it all to a context-free grammar,
    albeit with the possible inclusion of a few semantic constraints. Why
    Knuth chose to write everything this way - as borderline obfuscated
    code that cannot be scaled upwards or sideways or integrated in other
    existing code - is beyond me. But it is not maintainable, and
    heavily-laden with Technical Debt; notably, its *critical* reliance on
    the dated assumption that the Future would be Pascal, along with all
    the other assumptions and - more importantly - the now-unnecessary
    restrictions that came out of that.

    Much of the Web framework's very conception and design was premised on
    the assumed necessity of those restrictions; and the whole thing can be
    done on a much simpler foundation, when remade in more up-to-date terms
    (relatively speaking) *natively* in C or C++. Among other things, there
    isn't a need for any Web-like framework: you can simply use ordinary
    comments. I know, because I did so: I rewrote the entire set of Web
    files in my local copy doing just that. When a baseline is established
    and it is regression-validated, I'll put a copy up on GitHub.

    A follow-up to the additional comments at the end of the article:
    Knuth's TeX book is an abomination, describing lexing and parsing
    as mouth, gullet and stomach nonsense.

    I know. It's literally a woven and tangled mess - both the book and the code.

    [Well, he invented most of what we know about parsing, he gets to
    explain it any way he wants. Chapters 7 and 8 describe the syntax
    operationally. -John]

    Discovery. Not invention. Mathematics is not invented, it is
    discovered (and in this case: only a partial and incomplete discovery).

    And that, too, is a complete tangle that we had to remake from the
    bottom up. With recent publications [2-5] establishing the foundations
    for the new algebraic framework ... along with another, currently in
    submission, that may come out in 2021, for the remaking alluded to in
    [3] of the 1963 algebraic formulation by Chomsky and Schützenberger [1]
    that lies at the foundation of all this ... we're now finally in a
    position to refactor both the theory itself and everything that's based
    on it or is an application of it; literally remaking the entire stack
    from the bottom up.

    References:
    [1] Chomsky, N., Schützenberger, M.: "The algebraic theory of
    context-free languages". In: Braffort, P., Hirschberg, D. (eds.)
    Computer Programming and Formal Systems, pp. 118–161. North-Holland,
    Amsterdam (1963)
    [2] H. Leiß et al: "C-dioids and μ-continuous Chomsky algebras". In:
    Desharnais, J., et al. (eds.) RAMiCS 2018. LNCS, vol. 11194, pp. 21–36.
    Springer, Cham (2018)
    [3] M. Hopkins et al: "Coequalizers and Tensor Products for Continuous
    Idempotent Semirings". In: Desharnais, J., et al. (eds.) RAMiCS 2018.
    LNCS, vol. 11194, pp. 37–52. Springer, Cham (2018)
    [4] M. Hopkins: "The algebraic approach I: the algebraization of the
    Chomsky hierarchy". In: Berghammer, R., Möller, B., Struth, G. (eds.)
    RelMiCS 2008. LNCS, vol. 4988, pp. 155–172. Springer, Heidelberg (2008)
    [5] N.B.B. Grathwohl et al: "Infinitary axiomatization of the equational
    theory of context-free languages". In: Baelde, D., Carayol, A. (eds.)
    Fixed Points in Computer Science (FICS 2013). EPTCS, vol. 126, pp. 44–55
    (2013)

  • From gah4@21:1/5 to Rock Brentwood on Mon Apr 5 04:21:34 2021
    On Sunday, April 4, 2021 at 6:12:06 PM UTC-7, Rock Brentwood wrote:
    [ This is a followup to a thread from 2007. ]
    I've looked high and low without success. Where can i find
    something resembling the BNF of Knuth's TeX typesetting syntax?
    It's in the file weave.web, section 14.

    The syntax for TeX was written as a context-sensitive translation
    grammar, suitable for a streaming translator, rather than a
    context-free grammar. It may be possible to convert it to one (either
    directly or as a context-free enveloping grammar with semantic
    constraints). That's a matter that may be worth looking into. But in
    its present form, there is no tree-building required or involved: it
    can stream. The distinction is analogous to that between SSDT's versus
    SDT's ([S]SDT = [simple-]syntax-directed translations). SSDT's can be
    streamed, SDT's require stacking or treeing values and, in effect, SDT
    = SSDT + value-stacking/treeing.

    TeX is written in Web which is essentially Pascal + hyper-linked
    comments. It is also in C-Web on the main TeX distribution site, which
    is C + hyper-linked comments. They *can* be converted directly to more
    normal programs with the comments embedded. I did so in the local
    versions of my older MiKTeX distribution, but haven't
    regression-tested it yet - since I haven't established a working
    baseline yet to work off of.

    (snip)

    It seems to me that macro processors were popular in the 1970's.

    Some years ago, I used to work with Mortran, which is a macro processor
    meant to process an improved, Fortran-like language, which is then
    converted into Fortran. The language is free-form with
    semicolon-terminated statements, and with block-structured if-then-else
    and for loops. It is all done with a relatively simple macro processor
    written in about as close as possible to standard Fortran 66. The only
    extension needed is the ability to compare values read in as characters.

    The macro format is very simple, except that macros have the ability
    to define other macros. Then, as is usual with compilers, the whole
    processor is written in Mortran and translated to Fortran.

    So, TeX is also written as a macro processor, with relatively few
    built-in operations, and many more defined through the macro facility.

    The input sequences recognized by macros likely can't be generalized.
    A macro can specify just about any combination of characters to be
    included between macro arguments. Not so many use that, but they can.

    There are, then, three ways to name macros. Most common is the
    escape character, usually \, followed by some number of letters.
    They can also be the escape character followed by one non-letter.
    Which characters count as letters can be changed at any time, but
    most of the time, for user-level macros, letters are letters.

    The third possibility is that a macro can be any one character with
    the catcode of active. No escape character is needed.
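
    As a rough sketch of that scanning rule in C (simplified: this assumes
    the escape character is \ and that "letter" just means isalpha(), whereas
    real TeX consults the current catcode table for both; active characters
    are recognized by their catcode alone and don't go through this path):

    #include <ctype.h>
    #include <stdio.h>

    /* Scan one control sequence starting at s, where s[0] is the escape
       character.  The name is copied into name[] (capacity cap); the
       return value is the number of input characters consumed. */
    static size_t scan_control_sequence(const char *s, char *name, size_t cap)
    {
        size_t i = 1, n = 0;
        if (isalpha((unsigned char)s[i])) {
            /* escape + a maximal run of letters, e.g. \hskip */
            while (isalpha((unsigned char)s[i])) {
                if (n + 1 < cap)
                    name[n++] = s[i];
                i++;
            }
            while (s[i] == ' ')     /* the space after such a name is eaten */
                i++;
        } else if (s[i] != '\0') {
            name[n++] = s[i++];     /* escape + one non-letter, e.g. \% */
        }                           /* (no space-skipping in that case) */
        name[n] = '\0';
        return i;
    }

    int main(void)
    {
        char name[64];
        size_t used;

        used = scan_control_sequence("\\hskip 1cm", name, sizeof name);
        printf("%s %zu\n", name, used);    /* prints: hskip 7 */
        used = scan_control_sequence("\\% comment", name, sizeof name);
        printf("%s %zu\n", name, used);    /* prints: % 2 */
        return 0;
    }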

    It seems that there are now some programs that accept the TeX
    way of writing mathematical equations, including the standard
    macros for special characters. But not the general ability to
    write macros, such as defining new characters or renaming
    old ones, or any of the more strange things you can do. I suppose
    BNF can be written for that, but likely not for the more general
    syntax.
    [The m4 macro processor tokenizes its input and then checks to see
    if a token is a named macro. It's still used for autoconf
    configuration scripts. -John]

  • From gah4@21:1/5 to John on Mon Apr 5 13:29:02 2021
    On Monday, April 5, 2021 at 7:55:35 AM UTC-7, gah4 wrote:

    (snip, John wrote)
    [The m4 macro processor tokenizes its input and then checks to see
    if a token is a named macro. It's still used for autoconf
    configuration scripts. -John]

    I know about m4, but have managed never to use it.

    Things that I might have done with it, I usually do with awk.

    Continuing with TeX syntax: white space is ignored after a macro named
    with letters (but not after the other kinds). That is useful if you use
    a macro before actual text words. It also means that you can ignore a
    newline that comes after a macro named with letters, but not after the
    other kinds. You can cause TeX to ignore a newline otherwise by ending
    the line with a comment, introduced (usually) by a %.

    (As with all characters, you can change the comment character's catcode,
    but it is usually %.)

    One other funny thing, though. TeX has glue specifications, such that
    one might say:

    \hskip 1cm plus 0.5cm minus 0.5cm

    for horizontal space that can stretch or shrink in line justification.

    The plus and minus parts are optional. Often, macros expand to glue:

    \def\somespace{\hskip 1cm}

    and I actually knew someone who used such a macro followed by the word
    "plus" that wasn't meant as part of a glue specification, resulting in
    a very strange error message.

    I suppose one should put a \relax (the no-op macro) after every glue,
    but that is pretty rare. What is the chance that one might actually
    say 'plus' or 'minus' at that point!

    But white space that is sometimes significant and sometimes not is a
    real complication in actual use, and will also complicate a BNF
    representation.

  • From Rock Brentwood@21:1/5 to All on Fri Apr 9 15:18:40 2021
    I remade {tangle,weave}.web and c{tangle,weave}.w, both in a more C-like
    form as ordinary contiguous code+comments (+ UNICODE, because I'd like to
    keep some of the original spirit of making the program+documentation more
    math- and typeset-friendly) ... and might post what I have on GitHub or
    somewhere, even though it hasn't yet been baseline-tested. Among other
    things, I want to remake my C compiler to directly accept UNICODE for the
    operators. The glossary I use is (↑,↓,→,≠,≡,≤,≥,∷,∧,∨,¬) for
    (++,--,->,!=,==,<=,>=,::,&&,||,!), with the latter treated as legacy
    digraph forms of the former. I'll post a note with the link when I get it
    up. The context-sensitive grammar is a switch statement that runs up to 4
    levels deep. It *might* be yacc-convertible, as is, since some versions of
    yacc may allow for stack-hacking, e.g. an $-1 argument in a rule.
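
    For illustration only (a hypothetical table, not the actual lexer code),
    that glossary amounts to something like the following, with the digraphs
    mapped onto the corresponding Unicode code points:

    #include <stdio.h>
    #include <string.h>

    /* Unicode operators and their legacy ASCII digraph spellings. */
    static const struct { const char *op; const char *digraph; } glossary[] = {
        { "\u2191", "++" }, { "\u2193", "--" }, { "\u2192", "->" },
        { "\u2260", "!=" }, { "\u2261", "==" }, { "\u2264", "<=" },
        { "\u2265", ">=" }, { "\u2237", "::" }, { "\u2227", "&&" },
        { "\u2228", "||" }, { "\u00AC", "!" },
    };

    /* Map a legacy digraph to its canonical Unicode spelling, or NULL. */
    static const char *canonical(const char *tok)
    {
        for (size_t i = 0; i < sizeof glossary / sizeof glossary[0]; i++)
            if (strcmp(tok, glossary[i].digraph) == 0)
                return glossary[i].op;
        return NULL;
    }

    int main(void)
    {
        printf("%s\n", canonical("!="));   /* prints the UTF-8 for U+2260 */
        return 0;
    }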

    On Monday, April 5, 2021 at 9:55:35 AM UTC-5, gah4 wrote:
    It seems to me that macro processors were popular in the 1970's.

    Some years ago, I used to work with Mortran, which is a macro processor
    meant to be used to process an improved language like Fortran, which is
    then converted into Fortran...

    You hit the nail on the head here.

    Partially in defense of Knuth, it looks like what he was trying to do (in
    the background of all this) was establish an entirely new framework for
    parsing, one that got out from under the deficiencies of LR and that was
    (a) streaming and (b) potentially even parallelizable. By "streaming" I
    mean that you can literally start and end a partial parse in mid-stream
    and *still* get meaningful results, even if the chunk that was processed
    doesn't form any kind of grammatical unit at all ... which has obvious
    utility for intelligently processing macros.

    That's CYK-on-steroids.

    The lack of such a facility also stood as a roadblock to making cfront
    directly into a C++ compiler. The old cfront code (e.g. the first
    version, release E, which we transcribed and which is currently here:

    http://www.softwarepreservation.org/projects/c_plus_plus/index.html#cfront

    ... with an updated/corrected version hopefully out on GitHub soon) shows
    indications that Stroustrup was trying to make it into a *direct* C++ to
    C processor. He had an option in the build script in release E (or the
    other early releases) to allow for this, but never implemented it. The
    macro layer stands in the way of everything. So, instead, he set up the
    C++ to C conversion as a Rube Goldberg device that got rid of macros
    first ... thus totally destroying the transparency of the translation
    process, when what we actually wanted was a translation that kept the
    macros intact, but intelligently processed them.

    (His original aim is much more doable now, by the way, than in the 1980's,
    because the *forward* advancement of C from the 1980's to C99 actually
    entails a dramatic *simplification* of what's needed for a "cfront"-like
    utility, and much of what went into the original is no longer needed, on
    that account. You can almost even write up a utility as a simple editor
    script, now! ... Except for exception-handling. Most people forget,
    translation is a *contravariant* function of the source language - it
    simplifies when the source language complexifies. So things become *more*
    doable, not less!)

    At the time of LR's inception, you had two major results which were never
    fully developed or exploited: (a) the Chomsky-Schützenberger Theorem
    (CST), which *very nearly* gives you the ability to make a 100% streaming
    representation for translators, with one "frame" per word -- except the
    CST "kernels" have word-boundary markers; and (b) its successor of sorts,
    Greibach's "Hardest CFL" construction, which successfully incorporates
    word-boundary markers into the word frames.

    I'll show you how it can be done, now, with the newer formalism for context-free expressions.

    This:
    L → α | L S β,
    S → x γ | u L v δ,
    S w L ω
    is a translation grammar containing input words from the set {u,v,w,x}
    and output actions from the set {α,β,γ,δ,ω}, which have an obvious
    interpretation in terms of "reduce" (α,β,γ,δ) and "accept" (ω) actions;
    plus a starting configuration S w L ω (the third line above). That's
    typical of what yacc assumes, except for the loosening of restrictions
    on what's allowed for the "start".
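
    For instance, working out one case: S w L ω ⇒ x γ w L ω ⇒ x γ w α ω, so
    the input x w is translated to the interleaved stream x γ w α ω - here γ
    is the reduce action for S → x, α the reduce action for the empty L, and
    ω the accept action.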

    It is possible, by the way, to hack yacc to allow for *any* rule body
    under %start and to allow "%start" to be issued anywhere, and multiple
    times, by just treating each "%start α" declaration as synonymous with a
    hidden rule of the form "$accept: α $end;" ... because that's actually
    already how it is processed, and even displayed under the "-v" option,
    in some implementations. With the loosening, you can even add in start
    actions under %start (thereby eliminating the need for some of bison's
    kludges).

    This translation grammar is in "canonical bottom-up" form - all actions
    occur at the end, no in-between actions. (In yacc, in-between actions are
    hacked by turning them into actions for special single-use non-terminals
    that have empty transitions ... which you can see on display when you run
    the "-v" option on a source file containing rules with in-between
    actions.)

    The translation grammar has this as its CST-kernel:
    <0| (u<u| + v<v| + w<w| + x<x| + (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|)* |L>|w>|S>|0> ω

    In the 1960's version of the theorem, the labelled brackets <u|, <v|, ...;
    |u>, |v>, ... were interpreted as just extra word symbols (and actually
    Chomsky & Schützenberger replaced words with brackets too). In the newer
    algebraic form ... which may finally see publication this year ... the
    brackets form an algebra with <i| |j> = 0 if i ≠ j and <i| |i> = 1, and
    they commute with the other symbols. In addition, for translator
    grammars, input and output symbols commute (e.g. αx = xα; the conversion
    of αx to xα is, itself, the very essence of a "look-ahead"). One can also
    conceive of cases where there are *multiple* input and output channels,
    each having its own alphabet, the different channels' alphabets commuting
    with one another.
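
    As a sanity check of the kernel on that same small example: for the
    translated stream x γ w α ω, pick from the starred sum the factors x<x|,
    |x>γ<S|, w<w|, α<L|, in that order, sandwiched between <0| and the final
    factor |L>|w>|S>|0>ω. Moving the brackets past the other symbols leaves
    the bracket word

    <0| <x| |x> <S| <w| <L| |L> |w> |S> |0>

    which collapses to 1 by repeated use of <i| |i> = 1, leaving exactly the
    translated stream x γ w α ω.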

    The kernel has the form
    I (X + YR)* YF
    with
    I = <0|
    X = u<u| + v<v| + w<w| + x<x|
    YR = (α + |S>|L>β)<L| + (|x>γ + |v>|L>|u>δ)<S|
    YF = |L>|w>|S>|0> ω
    and, in fact, YR and YF both factor into
    Y = α<α| + β<β| + γ<γ| + δ<δ| + ω<ω|
    R = (|α> + |β>|S>|L>)<L| + (|γ>|x> + |δ>|v>|L>|u>)<S|
    F = |ω>|L>|w>|S>|0>
    and the original kernel is actually equal to I (X + Y + R)* F ... a form which is suited for translation grammars that are *not* in bottom-up canonical form. The reduction I (X + Y + R)* F = I (X + YR)* YF only applies if the
    translation grammar is in canonical form.
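
    (As a quick check of the factorization against the bracket relations: in
    Y·R only the matching terms survive, e.g. α<α| · |α><L| = α<L| and
    β<β| · |β>|S>|L><L| = |S>|L>β<L|, since the cross terms contain
    <i| |j> = 0 with i ≠ j and the output symbols commute with the brackets;
    summing the surviving terms gives back YR. Likewise
    Y·F = ω<ω| · |ω>|L>|w>|S>|0> = |L>|w>|S>|0>ω = YF.)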

    By Kleene algebra, the kernel is equal to: I (YR)* (X (YR)*)* YF which
    contains the "renormalized" forms of:
    Start action: I' = I (YR)* = <0| (YR)*
    Inputs: X' = X (YR)* = Σ_x x p'(x), composed of
    Word frames: p'(x) = <x| (YR)*
    Final action: F' = YF = |L>|w>|S>|0>ω

    That's a form suitable for both (a) mid-stream parsing, e.g. a word "vxu"
    would be processed as p'(v) p'(x) p'(u) ... for instance, if vxu were a
    macro body; and (b) a parallelized form of CYK. Except: instead of
    assembling chunks bottom-up in dynamic programming from length 1, then
    length 2, then length 3, etc. in a "CYK pyramid", you could do it both in
    parallel *and* exponentially, from length 1, length 2, length 4,
    length 8, etc.
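
    For example, a four-symbol chunk s1 s2 s3 s4 contributes the product
    p'(s1) p'(s2) p'(s3) p'(s4); the partial products p'(s1) p'(s2) and
    p'(s3) p'(s4) can be computed independently, in parallel, and then
    multiplied together - and the same doubling applies at every level of
    chunk size.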

    So, there are your "word frames" and you have a means to produce both
    in-stream and even parallel translators ... which is ideally suited also for intelligently handling macros.

    It is also suitable for both deterministic and non-deterministic parsing; and it ultimately subsumes the stack-based framework - devised originally by
    Tomita - for GLR parsing; as well as generalizing to translation grammars
    which are not canonical.

    The role that Knuth's LR theory plays is that it provides a means to
    construct a shift-reduce parser. Seen in this context, it's easy to
    explain both what it is and does. A shift-reduce parser provides a set of
    transitions z: q→q' between states q, q' on a [non-]terminal z, which
    generates a morphism
        <z| ↦ p(z) ≡ Σ_{z:q→q'} |q><q|<q'| for all [non-]terminals z, and <0| ↦ <0|;
        |z> ↦ q(z) ≡ Σ_{z:q→q'} |q'>|q><q| for all [non-]terminals z, and |0> ↦ |0>,
    that (a) yields the same result when substituted into the CST kernel and
    (b) provides for a more nearly-deterministic push-down transducer when
    the bracket symbols are subject to the obvious stack interpretation, with
    |0><0| playing the role of the "empty stack" projection operator.

    It can be written more succinctly in graphical form by defining [q] ≡
    |q><q|, q⇐ ≡ |q>, ⇒q ≡ <q|, and [q,⋯,q'] = [q] + ⋯ + [q'] (with other
    comma-separated subgraphs being understood to be added, e.g.
    [0]⇒2,[1]⇒3 = |0><0|<2| + |1><1|<3|). In that case, a rule such as
    S → u L v corresponds to |v>|L>|u><S|, which translates into
    q(v)q(L)q(u)p(S), which takes the form of a sum of graphs that looks like
    q⇐q⇐q⇐[q]⇒q.

    That's the expression for the "roll-back" action. The absence of an
    explicit account of the roll-back actions from LR theory is the chief
    deficiency of the entire formalism! It's normally imposed externally,
    either as a narrative description or else (in an implementation) as a
    cookie-cutter code skeleton (like in BSD yacc), or in the system library
    (like UNIX yacc), or user-specifiable (like in bison).

    The rightward-pointing segment on the left corresponds to what is
    actually called a "lane" in LR theory, while [q] is called a "laneahead"
    state. You can read the LALR "lookaheads" directly off the graph -
    they're whatever lookaheads the states q to the right of the laneahead
    state have.

    The current state of LR(k), for k > 0, by the way, runs pretty much like
    this. Pager's classical theory - which came out shortly after Knuth's -
    has been largely superseded by IELR (the IELR paper is here:
    https://core.ac.uk/download/pdf/82047055.pdf), which is what bison uses;
    and to a limited extent by Pager and Chen's newer formulation in "hyacc",
    which better handles LR(1) and a limited subset of LR(k) for k > 1. Pager
    and Chen, however, haven't figured out how to deal with "cyclic"
    lookaheads - which I think correspond to the cases where the (YR)*
    roll-back expression has star-height 1. Nor do I know how their newer
    method compares with IELR, other than that IELR is limited to LR(1).

    You might be able to handle the situation within the algebraic framework
    just described here and, thereby, extend their approach to all LR(k)
    grammars for k ≥ 1.

  • From gah4@21:1/5 to Rock Brentwood on Fri Apr 9 18:39:39 2021
    On Friday, April 9, 2021 at 4:20:27 PM UTC-7, Rock Brentwood wrote:
    (His original aim is much more doable now, by the way, than in the 1980's, because the *forward* advancement of C from the 1980's to C99 actually entails
    a dramatic *simplification* of what's needed for a "cfront" like utility, and much of what went into the original is no longer needed, on that account. You can almost even write up a utility as a simple editor script, now! ... Except for exception-handling. Most people forget, translation is a *contravariant* function of source language - it simplifies when source language complexifies.
    So things become *more* doable, not less!)

    I have recently been working with another macro processor from the 1970's, which is written using a set of macros, and converted to standard Fortran 66.

    Among others, the macros implement IF-THEN-ELSE structures and the usual
    DO-WHILE and DO-UNTIL loops.

    I recently realized, though haven't done it yet, that I can rewrite the macros to generate Fortran 90 structured if and looping constructs.

    I did that some years ago with the MORTRAN macros, which also are designed
    to generate standard Fortran 66 code.

    But yes, it is interesting to look at what can be done with old programs
    and modern tools.
