• Verbose Regular Expressions

  • From Jon Ribbens@21:1/5 to Thomas 'PointedEars' Lahn on Tue Jun 8 17:48:01 2021
    On 2021-06-08, Thomas 'PointedEars' Lahn <PointedEars@web.de> wrote:
    Jon Ribbens wrote:
    On 2021-05-25, Michael Haufe (TNO) <tno@thenewobjective.com> wrote:
    Since you are supporting comments, it seems like you could support
    named matches without too much additional effort.

    I'm not sure what you mean - named capture groups are already a standard
    part of JavaScript, so I don't need to add them.

    But only because you implicitly limit your runtime environment and target
    implementation; i.e. regardless of the fact that it is standardized, what
    matters here is that you are targeting *only Google V8* JavaScript thanks
    to Node.js as the *only* runtime environment :)

    Yes. I only need support for Node.js 16+ and Chrome 91+, so that's fine
    by me. And adding named capture groups to JavaScript implementations
    that don't already have them would transform the project from a neat
    9-line template-string trick into a complete re-implementation of
    the JavaScript regular expression engine, which would be a whole
    different project...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas 'PointedEars' Lahn@21:1/5 to Jon Ribbens on Tue Jun 8 19:31:28 2021
    Jon Ribbens wrote:

    I've just published an npm package: https://www.npmjs.com/package/verbose-regexp

    It provides a way to use verbose regular expressions in JavaScript and
    TypeScript, similar to re.VERBOSE in Python. White-space at the start and
    end of lines is ignored, as are newlines and anything following // to the
    end of the line.

    It allows you to easily write multi-line regular expressions, and to make your regular expressions more self-documenting using formatting and
    comments.
    […]
    Any comments or thoughts would be appreciated.
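
    The rules described above (trim each line, drop newlines, drop anything
    after //) can be sketched as a template tag. This is a minimal illustration
    only, not the verbose-regexp source; the function body is an assumption
    based on the description in the post.

    ```javascript
    // Hypothetical sketch of a verbose-regex tag; not the package's actual code.
    function rx(strings, ...values) {
      // String.raw rebuilds the literal from strings.raw, so \d stays \d.
      const source = String.raw(strings, ...values);
      const pattern = source
        .split('\n')
        // Drop a trailing // comment, then white-space at both ends of the line.
        .map((line) => line.replace(/\/\/.*$/, '').trim())
        .join('');
      return new RegExp(pattern);
    }

    const dateTime = rx`
      (\d{4}-\d{2}-\d{2})  // date
      T                    // time separator
      (\d{2}:\d{2}:\d{2})  // time
    `;
    console.log(dateTime.source);
    ```

    One limitation of this simple version: a pattern that itself contains //
    (for example to match a URL) would be cut off at that point.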

    I have had support for some PCRE features, including this one, in
    JSX:regexp.js:jsx.regexp.RegExp for some time [1], but supporting it
    through an actual syntax extension via Node.js is a nice idea. (I did
    not even know that that was possible.)

    [1] <https://github.com/PointedEars/JSX/blob/master/regexp.js>

    (Needs fixing now that the new properties of RegExp instances are
    read-only. TODO for the summer break.)
    --
    PointedEars
    FAQ: <http://PointedEars.de/faq> | <http://PointedEars.de/es-matrix>
    <https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
    Twitter: @PointedEars2 | Please do not cc me./Bitte keine Kopien per E-Mail.

  • From Thomas 'PointedEars' Lahn@21:1/5 to Jon Ribbens on Tue Jun 8 19:35:09 2021
    Jon Ribbens wrote:

    On 2021-05-25, Michael Haufe (TNO) <tno@thenewobjective.com> wrote:
    Since you are supporting comments, it seems like you could support
    named matches without too much additional effort.

    I'm not sure what you mean - named capture groups are already a standard
    part of JavaScript, so I don't need to add them.

    But only because you implicitly limit your runtime environment and target
    implementation; i.e. regardless of the fact that it is standardized, what
    matters here is that you are targeting *only Google V8* JavaScript thanks
    to Node.js as the *only* runtime environment :)

  • From Thomas 'PointedEars' Lahn@21:1/5 to Jon Ribbens on Tue Jun 8 20:04:51 2021
    Jon Ribbens wrote:

    […] adding named capture groups to JavaScript implementations
    that don't already have them would transform the project from a neat
    9-line template-string trick into a complete re-implementation of
    the JavaScript regular expression engine, which would be a whole
    different project...

    You might want to reconsider after you have read JSX:regexp.js. It was not trivial to do it, but it certainly did not require what you suggest here :)

  • From Scott Sauyet@21:1/5 to Jon Ribbens on Fri Jun 18 05:46:59 2021
    Jon Ribbens wrote:

    I've just published an npm package: https://www.npmjs.com/package/verbose-regexp

    [ ... ]

    It allows you to easily write multi-line regular expressions, and to make your regular expressions more self-documenting using formatting and comments.

    Very nice! I can think of many times I've wanted something like this.

    [ ... ]
    You can use regular expression flags by accessing them as a property of rx, e.g.:

    const alpha = rx.i`[a-z]+`

    This makes me doubt the use of template tag functions here, especially
    as the implementation then builds 128 separate functions to handle this.

    I don't see a great advantage to

    ```
    const dateTime = rx.gi`
    (\d{4}-\d{2}-\d{2}) // date
    T // time separator
    (\d{2}:\d{2}:\d{2}) // time
    `
    ```

    over

    ```
    const dateTime = rx (`
    (\d{4}-\d{2}-\d{2}) // date
    T // time separator
    (\d{2}:\d{2}:\d{2}) // time
    `, 'gi')
    ```

    especially when I have to remember to include the flags in alphabetic
    order and when it won't automatically update if and when the underlying
    regex engine includes new flags. Is there a compelling advantage to
    this?

    Still, this is great. I had thought of doing this before but somehow
    expected it to be much more difficult. Kudos!

    -- Scott

  • From Jon Ribbens@21:1/5 to Scott Sauyet on Fri Jun 18 15:19:34 2021
    On 2021-06-18, Scott Sauyet <scott@sauyet.com> wrote:
    I don't see a great advantage to

    ```
    const dateTime = rx.gi`
    (\d{4}-\d{2}-\d{2}) // date
    T // time separator
    (\d{2}:\d{2}:\d{2}) // time
    `
    ```

    over

    ```
    const dateTime = rx (`
    (\d{4}-\d{2}-\d{2}) // date
    T // time separator
    (\d{2}:\d{2}:\d{2}) // time
    `, 'gi')
    ```

    especially when I have to remember to include the flags in alphabetic
    order and when it won't automatically update if and when the underlying
    regex engine includes new flags. Is there a compelling advantage to
    this?

    The latter wouldn't work as the function would receive the string after
    escape processing, i.e. given rx(`\d{4}`, 'g') it would receive 'd{4}'.
    You'd have to do rx(String.raw`\d{4}`, 'g') which is starting to become
    very ugly and verbose.
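
    The difference between the escape-processed ("cooked") string and the raw
    one can be seen directly. A small sketch; the helper name `show` is made up
    for illustration.

    ```javascript
    // An untagged template literal is escape-processed before any function sees it.
    const cooked = `\d{4}`;         // the escape \d collapses to a plain "d"
    const raw = String.raw`\d{4}`;  // the backslash is preserved

    // A tag function receives both views of the same literal.
    function show(strings) {
      return { cooked: strings[0], raw: strings.raw[0] };
    }
    const seen = show`\d{4}`;

    console.log(cooked);      // d{4}
    console.log(raw);         // \d{4}
    console.log(seen.raw);    // \d{4}
    ```

    This is why a tag can build the intended pattern while a plain function
    call with a template-literal argument cannot.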

    In Python you could override the attribute accessor for "rx" and make
    property fetches dynamic, so that the flag functions wouldn't need to
    actually exist, and the flags could be in any order, but as far as I'm
    aware JavaScript doesn't have a way of overriding general property
    lookups unfortunately.
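
    For what it is worth, ES2015 `Proxy` objects do allow intercepting general
    property reads through a `get` trap, which is available in Node.js 16 and
    Chrome 91. A hedged sketch of dynamic flag properties follows; `makeTag`
    and this `rx` are hypothetical names, the verbose processing is omitted,
    and interpolated values are not handled.

    ```javascript
    // Hypothetical sketch; not the package's API.
    function makeTag(flags) {
      return (strings) => new RegExp(strings.raw[0], flags);
    }

    const rx = new Proxy(makeTag(''), {
      // The get trap runs for every property read, e.g. rx.gi or rx.ig.
      get(target, prop, receiver) {
        // Leave symbols and real function properties (call, name, ...) alone.
        if (typeof prop === 'string' && !(prop in target)) {
          // Validation is left to RegExp, which rejects unknown or repeated flags.
          return makeTag(prop);
        }
        return Reflect.get(target, prop, receiver);
      },
    });

    const a = rx.gi`[a-z]+`;
    const b = rx.ig`[a-z]+`;  // order no longer matters
    console.log(a.flags, b.flags);
    ```

    With this approach no flag functions need to exist up front, and any flag
    the engine accepts in future would pass through unchanged.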

    As an aside, personally I prefer the flags being up front anyway - it's annoying reading a long regular expression, reaching the end, and
    finding that you now need to go back and read it all again, because
    there's a modifier flag appended that changes its meaning.

  • From Scott Sauyet@21:1/5 to Jon Ribbens on Sat Jun 19 20:25:42 2021
    Jon Ribbens wrote:
    Scott Sauyet wrote:

    An ill-considered alternative.

    The function would receive the string after
    escape processing, i.e. given rx(`\d{4}`, 'g') it would receive 'd{4}'.
    You'd have to do rx(String.raw`\d{4}`, 'g') which is starting to become
    very ugly and verbose.

    Ah yes. Obviously I had not considered the problem thoroughly. This
    does feel like an elegant solution... except for those 127 functions!


    As an aside, personally I prefer the flags being up front anyway
    - it's annoying reading a long regular expression, reaching the
    end, and finding that you now need to go back and read it all again,
    because there's a modifier flag appended that changes its meaning.

    Agreed, although I don't feel it to be a big deal. I usually scan for
    the end of the regex before I even start to analyze it. But they would definitely be better up front.

    If I have some extra time in the next few days, I'll spend part of it
    trying to create an alternative to the way flags are handled.

    Cheers,

    -- Scott

  • From Jon Ribbens@21:1/5 to Scott Sauyet on Sun Jun 20 11:21:36 2021
    On 2021-06-20, Scott Sauyet <scott.sauyet@gmail.com> wrote:
    Jon Ribbens wrote:
    Scott Sauyet wrote:
    An ill-considered alternative.

    The function would receive the string after
    escape processing, i.e. given rx(`\d{4}`, 'g') it would receive 'd{4}'.
    You'd have to do rx(String.raw`\d{4}`, 'g') which is starting to become
    very ugly and verbose.

    Ah yes. Obviously I had not considered the problem thoroughly. This
    does feel like an elegant solution... except for those 127 functions!

    I mean I'm with you on that to an extent, but as overheads go,
    given the general massive overhead of using JavaScript rather than,
    say, C, it's unnoticeable in the end. The syntax itself is very neat,
    it is extremely compact and very readable. rx.gi`foo` is the sort of
    thing one might choose even if one was the language designer and
    unconstrained by the existing parser. And alphabetical-order is the
    sort of thing that I would probably do voluntarily even if not forced
    to ;-)

    Obviously if JavaScript does add further flags to RegExps then there
    are downsides including (a) needing to release a new version of the
    module and (b) O(2^n) in memory and load time. But it seems unlikely
    this would be a weekly occurrence...
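
    To make the O(2^n) point concrete, here is a sketch that enumerates every
    alphabetically ordered flag combination. The seven-flag list is an
    assumption chosen to be consistent with the counts mentioned in this
    thread; the extra flag `x` is purely hypothetical.

    ```javascript
    // Assumed flag set for illustration; the real list depends on the engine.
    const FLAGS = 'dgimsuy';

    // Every non-empty subset of the flags, each kept in the given order.
    function flagCombos(flags) {
      const combos = [];
      for (let mask = 1; mask < 2 ** flags.length; mask++) {
        let combo = '';
        for (let bit = 0; bit < flags.length; bit++) {
          if (mask & (1 << bit)) combo += flags[bit];
        }
        combos.push(combo);
      }
      return combos;
    }

    console.log(flagCombos(FLAGS).length);        // 127
    console.log(flagCombos(FLAGS + 'x').length);  // 255
    ```

    Each additional flag doubles the number of properties that would have to
    be generated.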

    As an aside, personally I prefer the flags being up front anyway
    - it's annoying reading a long regular expression, reaching the
    end, and finding that you now need to go back and read it all again,
    because there's a modifier flag appended that changes its meaning.

    Agreed, although I don't feel it to be a big deal. I usually scan for
    the end of the regex before I even start to analyze it. But they would definitely be better up front.

    If I have some extra time in the next few days, I'll spend part of it
    trying to create an alternative to the way flags are handled.

    The only options I can think of off the top of my head are:

    (a) properties (like it is now)
    (b) a function (e.g. rx('gi')`foo`)
    (c) including the flags in the string parameter

    The constraints for (c) would need to be that the way the flags were
    included could not be a valid regular expression now, or preferably
    in the future either. Python offers a syntax which is, e.g. (?gi),
    which seems suitable.
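
    A rough sketch of what option (c) might look like, assuming a leading
    `(?flags)` prefix is peeled off before the rest is handed to RegExp. The
    prefix syntax and this `rx` are illustrative assumptions, and the verbose
    processing is left out for brevity.

    ```javascript
    // Hypothetical sketch of option (c); not part of the published package.
    function rx(strings) {
      let source = strings.raw[0].trim();
      let flags = '';
      // A leading "(?flags)" group is removed and passed to RegExp separately.
      const m = /^\(\?([a-z]+)\)/.exec(source);
      if (m) {
        flags = m[1];
        source = source.slice(m[0].length);
      }
      return new RegExp(source, flags);
    }

    const alpha = rx`(?gi)[a-z]+`;
    console.log(alpha.source, alpha.flags);
    ```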

    Whatever you suggest I'd want it to be backwards compatible with the
    current release, which I think all of the above could be. In principle
    it could support all of (a), (b), and (c) at once ;-)

  • From Scott Sauyet@21:1/5 to Jon Ribbens on Sun Jun 27 16:37:43 2021
    Jon Ribbens wrote:
    Scott Sauyet wrote:
    Jon Ribbens wrote:

    Ah yes. Obviously I had not considered the problem thoroughly. This
    does feel like an elegant solution... except for those 127 functions!

    I mean I'm with you on that to an extent, but as overheads go,
    given the general massive overhead of using JavaScript rather than,
    say, C, it's unnoticeable in the end.

    Probably, but untidiness always bothers me whether it's a real problem
    or not.

    [ ... ] And alphabetical-order is the
    sort of thing that I would probably do voluntarily even if not forced
    to ;-)

    I just find a large difference between what I would do for my own code
    and what I would put out for public consumption. It's probably not a
    big deal since an out-of-order alternative would fail immediately and
    not lurk for months, but it still feels odd to me.


    Obviously if JavaScript does add further flags to RegExps then there
    are downsides including (a) needing to release a new version of the
    module and (b) O(2^n) in memory and load time. But it seems unlikely
    this would be a weekly occurrence...

    No, but being future-proof is also pretty useful.


    [ ... ]

    If I have some extra time in the next few days, I'll spend part of it
    trying to create an alternative to the way flags are handled.

    The only options I can think of off the top of my head are:

    (a) properties (like it is now)
    (b) a function (e.g. rx('gi')`foo`)
    (c) including the flags in the string parameter

    I probably would not consider (c). That seems too likely to fall apart
    in the future. But (b) sounds like a perfectly reasonable option. I
    haven't found any time for this, and am not likely to soon. But I would
    see (b) as the best option around. Even the ability to name the
    partially applied function seems useful.
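
    Option (b), including a named partially applied tag, might look roughly
    like this. It is a sketch under the assumption that `rx` takes a flags
    string and returns a tag; names such as `insensitive` are made up, and
    interpolated values are not handled.

    ```javascript
    // Hypothetical sketch of option (b): flags first, then the template.
    function rx(flags = '') {
      return (strings) => {
        const pattern = strings.raw[0]
          .split('\n')
          // Drop a trailing // comment, then white-space at both ends of the line.
          .map((line) => line.replace(/\/\/.*$/, '').trim())
          .join('');
        return new RegExp(pattern, flags);
      };
    }

    // The partially applied tag can be given a name and reused.
    const insensitive = rx('i');
    const word = insensitive`[a-z]+`;
    const time = rx('g')`
      \d{2}:\d{2}  // hours and minutes
    `;
    console.log(word.flags, time.source);
    ```

    Because the flags are an ordinary string, their order does not matter and
    nothing has to be generated ahead of time.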


    Whatever you suggest I'd want it to be backwards compatible with the
    current release, which I think all of the above could be. In principle
    it could support all of (a), (b), and (c) at once ;-)

    I assume the emoticon meant you were disclaiming the idea. If you
    actually are suggesting it, then I'd suggest that that way lies madness
    and jQuery.

    Regardless of all the above, I'm impressed. This is a very nice little
    tool.

    -- Scott
