• Interesting paper on regex NFA matching

    From John R Levine@21:1/5 to All on Thu Jan 25 22:10:56 2024
    The easiest way to implement regular expressions is to turn them into
    NFAs, but that has a well-known problem: if you feed hostile input to
    the NFAs, they can take exponential time, which makes this a way to
    DDoS the server. If you translate the NFA to a DFA it runs in linear
    time, but if the regex is ambiguous, as many are, the DFA may match
    something different from the NFA.
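    For illustration, a minimal sketch of the hostile-input problem as it
    shows up in a backtracking matcher (Python's re module here; the
    pattern (a+)+$ is the standard ReDoS example, not one from the paper):

        import re, time

        # (a+)+$ is ambiguous: on a near-miss input the backtracker tries
        # exponentially many ways to split the a's between the inner and
        # outer +, so even a short string takes seconds to reject.
        hostile = "a" * 26 + "b"
        start = time.perf_counter()
        re.match(r"(a+)+$", hostile)
        print(f"{time.perf_counter() - start:.1f}s to reject 27 characters")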

    This paper on the Arxiv preprint server proposes some rather complex
    tweaks to NFA regex matching to fix many performance issues while giving
    the same answers as before.

    https://arxiv.org/abs/2401.12639

    Regards,
    John Levine, johnl@taugh.com, Taughannock Networks, Trumansburg NY
    Please consider the environment before reading this e-mail. https://jl.ly

  • From cross@spitfire.i.gajendra.net@21:1/5 to johnl@taugh.com on Fri Jan 26 04:15:30 2024
    In article <24-01-003@comp.compilers>, John R Levine <johnl@taugh.com> wrote:
    The easiest way to implement regular expressions is to turn them into
    NFAs, but that has a well-known problem: if you feed hostile input to
    the NFAs, they can take exponential time, which makes this a way to
    DDoS the server. If you translate the NFA to a DFA it runs in linear
    time, but if the regex is ambiguous, as many are, the DFA may match
    something different from the NFA.

    NFAs won't take exponential time if implemented carefully,
    though they may be superlinear. Still, we're talking quadratic,
    not exponential.

    Backtracking implementations can take exponential time, but
    those aren't all NFAs.

    Converting an NFA to a DFA could be exponential, however.

    https://swtch.com/~rsc/regexp/regexp1.html
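    A minimal sketch of the state-set simulation Cox describes (the names
    and the step interface are mine, not the article's):

        def simulate(start_states, accept_states, step, text):
            """step(state, ch): iterable of successor NFA states, with
            epsilon transitions assumed already closed over."""
            current = set(start_states)
            for ch in text:
                current = {t for s in current for t in step(s, ch)}
                if not current:                # no live states: reject early
                    return False
            # the set holds at most all NFA states, so the whole scan is
            # O(len(text) * number_of_states), never exponential
            return bool(current & accept_states)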

    This paper on the Arxiv preprint server proposes some rather complex
    tweaks to NFA regex matching to fix many performance issues while giving
    the same answers as before.

    https://arxiv.org/abs/2401.12639

    If I'm reading the abstract right, they modify a backtracking
    implementation so that it can efficiently match some extended
    regexes. Note, however, that such so-called "extended" regexes
    are not really regular expressions in the formal sense; that is,
    they do not describe members of the regular languages, but
    rather languages beyond those expressible as a DFA. Regardless,
    this work is interesting, since AFAIK the speedup for this
    class of regex to linear time is novel. Thanks for passing it
    on.

    - Dan C.

  • From Kaz Kylheku@21:1/5 to John R Levine on Fri Jan 26 04:16:38 2024
    On 2024-01-26, John R Levine <johnl@taugh.com> wrote:
    The easiest way to implement regular expressions is to turn them into
    NFAs, but that has a well-known problem: if you feed hostile input to
    the NFAs, they can take exponential time, which makes this a way to
    DDoS the server.

    Do you have a reference for this? I seem to recall that NFA simulation
    is quadratic or cubic at worst.

    The exponential problem that immediately comes to mind in connection
    with regexes is that the NFA to DFA subset construction can experience
    an exponential state explosion.

    The NFA simulator won't experience that exponential explosion because
    it only generates the subsets that occur from the specific input
    that it's given; it's not reasoning about all possible inputs.

    if the regex is ambiguous, as many are, the DFA may match something
    different from the NFA.

    Is that also documented somewhere? DFA matching something different
    from the NFA has to be a bug. The algorithms we have in this area
    are correct.

    What does it mean for a regex to be ambiguous?

    This paper on the Arxiv preprint server proposes some rather complex
    tweaks to NFA regex matching to fix many performance issues while giving
    the same answers as before.

    https://arxiv.org/abs/2401.12639

    I'm going to look at this in more detail, but I can tell you from the
    abstract that they are taking on backtracking regex implementations that
    suffer from "catastrophic backtracking", which informs me of the general direction.

    Backtracking regexes are used to implement the hacky features found
    in PCRE (Perl-compatible regexes).

    They have almost nothing to do with building NFA graphs and simulating
    them, or going to DFA from that.

    Backtracking follows and interprets the abstract syntax tree of a regex.

    One known problem with backtracking regex implementations is
    that they don't treat branches, R1|R2|R3|...|Rn, correctly.

    Backtrackers iterate over the branches left to right and try them.
    They stop when they encounter a match for the first one, and declare
    that the entire disjunction matched. That would be fine if we are just
    asking whether there is a match, but it's incorrect if we are matching a
    regex prefix (or looking for a substring) and need the longest match.
    For instance "a|aa" matches "aardvark", but only the first "a"!

    If we simulate the a|aa NFA graph, what we do is feed every character of aardvark until we hit an impasse (the machine informs us that no more
    input can be accepted). At that point, we consider the most recent
    character position where we had been in an accepted state. That position
    will be the second "a" of "aardvark". After we feed the first "a"
    to the simulator, it will report that it's in an acceptance state. But
    we keep going; we feed the second "a", and for that it also reports
    acceptance. When we feed the "r" the machine can indicate that it has
    hit a dead end: that character cannot be matched. It is at this point
    that we can take a shortcut and stop feeding more characters. The last
    time we had an acceptance state was just after we fed the second "a",
    and so we know that the regex matched the "aa" part of the word.
    The NFA doesn't care whether the original syntax is a|aa or aa|a: the
    operator is commutative.
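    A quick way to see the difference (a sketch using Python's
    backtracking re module for the PCRE-style side):

        import re

        # A backtracking engine tries the branches in written order,
        # so the two spellings of the same set give different answers:
        print(re.match(r"a|aa", "aardvark").group())   # -> "a"
        print(re.match(r"aa|a", "aardvark").group())   # -> "aa"

        # A state-set NFA simulation that remembers the last position at
        # which an accepting state was live reports "aa" for both
        # spellings, because | is treated as plain set union.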

    A DFA implementation will not exhibit the behavior of searching through
    the branches of a|aa left to right for a match, because it follows from
    the NFA. So because of this kind of situation, the backtracking regex
    faces difficulties in translation to DFA. It is caused by the incorrect,
    loose, hacky interpretation of the backtracking implementation, which
    pulls stunts like not making A|B commutative.

    NFA simulation isn't backtracking; it doesn't hit those kinds of
    explosions inherent in backtracking.

    Backtracking is not required for implementing classic regexes, like
    POSIX EREs and BREs and whatnot, only Perl-like cruft.

    Backtracking gets into trouble due to searches occurring at multiple
    levels of recursion. It's like any recursive search with an exploding
    space. It's clear memoization can help under the right conditions,
    whereby it can recognize redundancy in the space, to avoid processing
    subspaces that are identical to ones already processed ("overlapping subproblems").
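    As a toy illustration of that memoization idea (my own sketch of a
    matcher for literal/'.'/'*' patterns, not the algorithm from the
    paper): caching results by (pattern position, input position) makes
    each subproblem cost O(1) after the first visit, so the whole match
    is O(len(pattern) * len(text)) instead of exponential.

        from functools import lru_cache

        def match(pattern: str, text: str) -> bool:
            @lru_cache(maxsize=None)
            def m(pi: int, ti: int) -> bool:
                if pi == len(pattern):
                    return ti == len(text)
                first = ti < len(text) and pattern[pi] in (text[ti], '.')
                if pi + 1 < len(pattern) and pattern[pi + 1] == '*':
                    # either skip the starred atom, or consume one more char
                    return m(pi + 2, ti) or (first and m(pi, ti + 1))
                return first and m(pi + 1, ti + 1)
            return m(0, 0)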

    (The NFA to DFA construction involves a kind of memoization. How so?
    The algorithm generates subsets of NFA states for imaginary inputs,
    and turns those sets of NFA states into single DFA states. As it
    probes into that space, it has to recognize subsets it has already
    generated before! Each such seen-before subset represents an
    existing DFA state that has already been created. The transitions
    should go to that state rather than making a new one which will
    turn out identical.)
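    A minimal sketch of that seen-before check (the move/eclose interfaces
    here are assumptions made up for the sketch, not any particular
    library):

        from collections import deque

        def subset_construction(start, alphabet, move, eclose):
            """move(q, a): set of NFA states reachable from q on a;
            eclose(q): epsilon closure of a single NFA state."""
            start_set = frozenset(eclose(start))
            dfa = {}                   # frozenset of NFA states -> {symbol: frozenset}
            work = deque([start_set])
            while work:
                S = work.popleft()
                if S in dfa:           # seen-before subset: reuse the existing
                    continue           # DFA state instead of creating a duplicate
                dfa[S] = {}
                for a in alphabet:
                    T = frozenset(t for q in S for r in move(q, a) for t in eclose(r))
                    dfa[S][a] = T
                    work.append(T)
            return start_set, dfa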

    (Also, caching approaches are possible in simulating an NFA:
    instead of computing a DFA ahead of time, we lazily generate some of
    it as we go through the input, and cache some of it. Say up to 32
    states, with some LRU replacement or whatever. It's like "JIT" for
    NFAs. We get some of the speed benefit of DFA, but the state
    explosion of the DFA is thwarted.)
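    A sketch of that caching idea, reusing the same assumed move/eclose
    interfaces (a real engine would bound the cache, e.g. with LRU
    eviction):

        def run_cached(start_set, accepting, move, eclose, text):
            cache = {}                         # (state set, symbol) -> state set
            S, last_accept = start_set, -1
            for i, ch in enumerate(text):
                if (S, ch) not in cache:       # build this DFA transition on demand
                    cache[(S, ch)] = frozenset(t for q in S
                                                 for r in move(q, ch)
                                                 for t in eclose(r))
                S = cache[(S, ch)]
                if not S:
                    break                      # dead end: no longer prefix can match
                if S & accepting:
                    last_accept = i            # end of longest accepted prefix so far
            return last_accept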

    (I personally am not so interested in backtracking regexes; and they are
    not directly applicable to compilers. You'd never want to design a
    lexical analyzer that requires "lookaround" or whatnot in order to
    recognize tokens. It's nice they are generating Arxiv papers for
    someone; no genuine intellectual inquiry is invalid; good luck! It's
    not applicable to the "Programming Languages" category it has been filed
    under. I will skim through it, though.)

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    [You're right, it's polynomial, but that's bad enough.

    Ambiguous means there are multiple subpatterns that can match the
    same string, e.g. (foo|[a-z]+) matching "foo". Often your code cares
    which one it matches even though that's arguably a bug. -John]

  • From Christopher F Clark@21:1/5 to Kaz Kylheku on Sat Jan 27 12:47:33 2024
    Kaz Kylheku <433-929-6894@kylheku.com> wrote:

    (I personally am not so interested in backtracking regexes; and they are
    not directly applicable to compilers. You'd never want to design a
    lexical analyzer that requires "lookaround" or whatnot in order to
    recognize tokens. It's nice they are generating Arxiv papers for
    someone; no genuine intellectual inquiry is invalid; good luck! It's
    not applicable to the "Programming Languages" category it has been filed under. I will skim through it, though.)

    You are mistaken that they are not applicable to programming languages.

    Even Pascal [originally] had cases where lookahead and backtracking
    are required to properly tokenize the input.

    The well-known case is "1.": if it is followed by another "." it is two
    tokens, and if it is followed by a digit it is a floating-point number.
    I don't recall what it should be if followed by any other character.

    In any case, parser generators like ANTLR use backtracking to great
    effect (and yes, we "borrowed" that idea in Yacc++) and so did the
    entire PEG (parsing expression grammar) community, using memoization
    to give linear time performance in many cases.

    Thus, to me, it is no surprise to see this reintegrated back into the
    PCRE world.

    Moreover, even if it is of little interest [mistakenly in my opinion] to the "programming language" community, it is of definite interest in the
    "real world:" PCRE regular expressions are both used (and abused)
    in many applications, including SNORT, where linear performance and
    preventing DDoS attacks would be of prime importance. (While at Intel,
    I had the honor of working with the Hyperscan (Sensory Networks) team,
    and improving PCRE performance for SNORT was a major "selling point".)

    And "greedy" (longest match), "non-greedy" (shortest match) and other variations like "lexically first in the grammar" matching all have their applications. Saying that only one of them is correct is ignoring the
    needs of actual users.

    And, yes, we had discussions about "automata theory correct" interpretations
    of what certain PCRE expressions meant and how even "canonical"
    implementations had various "bugs" that made their meaning ambiguous.

    By the way, to give credit where due, we "borrowed" the idea of
    "tunnel automatons" from Josef Grosch to implement controlled
    backtracking in Yacc++. It lets one do subset construction (i.e.
    the LR(0) algorithm) for items, breaking out for "syntactic predicates"
    exactly as needed.

    --
    Chris Clark                   email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.      Web Site: http://world.std.com/~compres
    23 Bailey Rd                  voice: (508) 435-5016
    Berlin, MA 01503 USA          twitter: @intel_chris

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Sat Jan 27 20:20:29 2024
    On 2024-01-27, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    Kaz Kylheku <433-929-6894@kylheku.com> wrote:

    (I personally am not so interested in backtracking regexes; and they are
    not directly applicable to compilers. You'd never want to design a
    lexical analyzer that requires "lookaround" or whatnot in order to
    recognize tokens. It's nice they are generating Arxiv papers for
    someone; no genuine intellectual inquiry is invalid; good luck! It's
    not applicable to the "Programming Languages" category it has been filed
    under. I will skim through it, though.)

    You are mistaken that they are not applicable to programming languages.

    Even Pascal [originally] had cases where lookahead and backtracking
    are required to properly tokenize the input.

    The well-known case is "1.": if it is followed by another "." it is two
    tokens, and if it is followed by a digit it is a floating-point number.
    I don't recall what it should be if followed by any other character.

    I think that case can be handled by technology like Lex trailing
    contexts, which doesn't require backtracking.

    {digits}/"."          {
                            // ... set up semantic value ...
                            return INTEGER;
                          }

    {digits}"."{digits}   {
                            // ... set up semantic value ...
                            return FLOAT;
                          }

    Another possibility is to recognize {digits}. as one lexeme and call
    unput('.'). That's a minor form of backtracking that doesn't require us
    to do anything funny to the regex implementation.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

    [Patterns with alternatives and trailing contexts do have to back up
    and the flex manual warns it's quite expensive. See the
    PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

  • From Christopher F Clark@21:1/5 to Kaz Kylheku on Mon Jan 29 10:39:58 2024
    Although, our esteemed moderator already corrected this error:
    [Patterns with alternatives and trailing contexts do have to back up
    and the flex manual warns it's quite expensive. See the
    PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

    I still want to write an extended reply to this, where
    Kaz Kylheku wrote:

    I think that case can be handled by technology like Lex trailing
    contexts, which doesn't require backtracking.


    {digits}/"."          {
                            // ... set up semantic value ...
                            return INTEGER;
                          }

    {digits}"."{digits}   {
                            // ... set up semantic value ...
                            return FLOAT;
                          }


    Another possibility is to recognize {digits}. as one lexeme and call
    unput('.'). That's a minor form of backtracking that doesn't require us
    to do anything funny to the regex implementation.

    You may not recognize it as backtracking, but it is reading two characters
    past the end of the integer token to disambiguate it. And while one
    character of lookahead is generally "free", two characters require more
    extensive buffering and mechanisms to reset the state.

    Thus, what LEX and flex are implementing is not strictly a pure
    regular expression matcher, but one with a slight extension to what
    can be recognized. While it is an extremely useful and appropriate
    convention for the application area, it isn't strictly within the
    bounds. The engine which runs such machines needs to have the hooks to implement that extension.

    One can prove this by using the hierarchy of language families, where
    regular expressions are a subset of LL(1) grammars, which are a subset
    of LR(1) grammars. If you try to write an LL(1) grammar for that, you
    will find yourself unable to construct a proper FIRST set which
    disambiguates the two rules. Similarly, if one attempts to make an LR
    grammar for it using an algorithm that first builds an LR(0) machine,
    you will see that it has a shift/reduce conflict on the "." character,
    and one which cannot be disambiguated by one character of lookahead.

    Thus, what is implemented in LEX is essentially "longest, lexically
    first" matching, where the lexer saves the last state where a token was
    accepted and continues attempting to match, and on failure backs up to
    the last recognized match (and then applies the lexically first rule
    that caused that match).
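    A minimal sketch of that loop (the names and interfaces are my own
    assumptions, not lex's internals; rule_for is presumed to return the
    lowest-numbered, i.e. lexically first, rule accepting in a state):

        def next_token(step, start, rule_for, text, pos):
            """step(state, ch): next DFA state or None.
            Returns (rule, end_of_lexeme) for the longest match, or None."""
            state, best, i = start, None, pos
            while i < len(text):
                state = step(state, text[i])
                if state is None:          # no transition: stop scanning
                    break
                i += 1
                if rule_for(state) is not None:
                    best = (rule_for(state), i)   # remember the longest match so far
            return best                    # the scanner then backs the input up to 'end'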

    The only difference between this and what PEG (and PCRE) mechanisms
    do is that they implement "lexical first, longest" match. That reorders the set of matches that are considered. As the paper you dismissed notes,
    that can be compared to depth-first, versus breadth-first search of the
    tree of potential matches. Both strategies have their advantages and
    areas where they are the proper choice and places where they are the
    wrong choice.

    If one finds oneself writing "trailing context" or "unput" in one's lexical specification, one is using lookahead. And sometimes in lexers one
    uses lookahead without even noticing it, as LEX doesn't emit warnings
    when it does so. (We had to disable such warnings in our implementation
    to get it to approximate LEX behavior.)

    -------

    One final note is that language classes are determined by writing a
    grammar which accepts (not lexes or parses) the given language.
    That is, one can write a grammar that doesn't disambiguate the cases
    (i.e. combines them into one case), but then one hasn't correctly
    tokenized it. I don't know of an automata theory (or compiler) book
    that mentions this distinction, much less highlights its implications,
    so I will forgive not knowing it unless one makes an argument that
    depends upon it, which you did.

    --
    Chris Clark                   email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.      Web Site: http://world.std.com/~compres
    23 Bailey Rd                  voice: (508) 435-5016
    Berlin, MA 01503 USA          twitter: @intel_chris

  • From Kaz Kylheku@21:1/5 to Christopher F Clark on Mon Jan 29 19:28:02 2024
    On 2024-01-29, Christopher F Clark <christopher.f.clark@compiler-resources.com> wrote:
    Although, our esteemed moderator already corrected this error:
    [Patterns with alternatives and trailing contexts do have to back up
    and the flex manual warns it's quite expensive. See the
    PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]

    I still want to write an extended reply to this, where
    Kaz Kylheku wrote:

    I think that case can be handled by technology like Lex trailing
    contexts, which doesn't require backtracking.


    {digits}/"."          {
                            // ... set up semantic value ...
                            return INTEGER;
                          }

    {digits}"."{digits}   {
                            // ... set up semantic value ...
                            return FLOAT;
                          }


    Another possibility is to recognize {digits}. as one lexeme and call
    unput('.'). That's a minor form of backtracking that doesn't require us
    to do anything funny to the regex implementation.

    You may not recognize it as backtracking, but it is reading two characters past the end of the integer token to disambiguate it.

    But some backtracking is required for recognizing the longest matching
    prefixes of the input, even if we use nothing but basic regex features!

    If we consider the trivial pattern (a*b)+:

    Given the input "aaabaaabaaac" the prefix which constitutes a token
    is the "aaabaaab".

    In order to extract that token, the machine will have to consume
    symbols all the way through the "aaac" which is excluded from the
    lexeme. Only at the "c" character does it hit the situation that no
    transition is available, and so the matching terminates. Then it has to
    return to the most recent position in the input where the automaton had
    been in an acceptance state.
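    For what it's worth, a quick check with Python's re module shows the
    same prefix being carved off:

        import re
        print(re.match(r"(a*b)+", "aaabaaabaaac").group())  # -> "aaabaaab"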

    It is not the same as the backtracking of a graph/tree search algorithm,
    which returns to prior points to try alternatives, due to doing a
    depth-first search.

    At the point where we back up in the input, we are not trying
    different branches of a depth-first search. We have completed
    a depth-first search!

    One input retrace at the conclusion of a successful search is not
    qualitatively the same situation as multiple, cascaded retraces
    integrated into the search algorithm.

    Thus, what is implemented in LEX is essentially "longest, lexically
    first" matching, where the lexer saves the last state where a token was
    accepted and continues attempting to match, and on failure backs up to
    the last recognized match (and then applies the lexically first rule
    that caused that match).

    Yes; that is required in any regex matching function that searches
    for a prefix, or infix.

    The fundamental automaton's job is to recognize strings: does a given
    input string belong to the language (set of strings) described by that automaton/regex.

    In lexical analysis, we solve the problem of what is the longest prefix
    of the input that belongs to that set of strings; the obvious way to do
    that is to match as far as we can and then return to the most recent
    good position.

    The only difference between this and what PEG (and PCRE) mechanisms
    do is that they implement "lexical first, longest" match.

    I think when you say "lexical first", you might be referring to the
    rules themselves: rules that are lexically earlier in the grammar
    "win" over later rules.

    (In Lex, that still matters. If two rules match tokens of exactly
    the same length, the lexically earlier rule gets the token.)

    This "lexical first, longest" is why the order of R1|R2|..|RN matters
    in some so-called regex implementations.

    Unfortunately, as far as regexes go, that is broken.

    It may still be a language describing a regular set, or at least
    in some cases, but the regex language for doing that requires the
    disjunction operator to be commutative.

    This is just historically so, but with justification.

    The people who invented regular expressions decades ago, as far back as
    the 1960's and earlier, and wrote papers about them decided on a
    commutative disjunction, for good reasons.

    A commutative disjunction is necessary in order to have a direct
    relationship between sets and regular expressions.

    If R1 describes a set of strings S1, and R2 describes a set of strings
    S2, then R1|R2 describes S1 ∪ S2.

    Because ∪ commutes, we want | to commute, since it means the same thing.

    If | doesn't commute, then it doesn't correspond to the ∪ operator. The relationship to sets has been severed or muddled.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Christopher F Clark@21:1/5 to Kaz Kylheku on Thu Feb 1 04:09:48 2024
    Kaz Kylheku wrote:

    But some backtracking is required for recognizing the longest matching prefixes of the input, even if we use nothing but basic regex features!

    The key word in your sentence is "prefixes". The recognition problem
    is not about prefixes, but about complete strings. This is why my answer mentioned that the distinction between lexing (and parsing) and recognition
    is not covered in any automata books I have read.

    The definition of regular expressions is that when the finite automaton
    halts (at the end of the string, or when there are no transitions out
    of the state that it is in) it is either in an accepting state or not.
    That's the language recognition problem. There is no mention of "backing
    up to the last state that accepted." And for language recognition
    that is important; otherwise, any random trailing garbage after a
    legal string would still be considered a match because a prefix
    matched.

    Thus, given your regular expression (a*b)+,

    b, ab, aab, aaab, ..., bb, abb, aabb, ..., bab, baab, ..., aaaabaaab
    are all in the language, but aaabaaabaaac does not match that
    regular expression. A prefix of it does, but not the entire string.
    And, thus the string aaabaaabaaac does not satisfy the language
    recognition problem.

    Now, when you ask does a string which is a prefix of aaabaaabaaac
    match the regular expression, you get two strings which qualify: aaab
    and aaabaaab. But that's not the *language recognition* problem.
    That's a different problem, one relevant to lexing.

    However, either string satisfies the question of whether it is in the
    language. Thus, both strings aaab and aaabaaab satisfy the language
    recognition problem. Only one satisfies your definition of the "correct"
    lexing problem, because you are fixated on a specific extension of how
    regular expressions need to work. It is not a bad extension, but it is
    an extension of the meaning of regular expressions in the lexing
    context, which is not the context of language recognition, where
    regular expressions were defined.
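    The recognition-versus-lexing split can be stated concretely with
    Python's re module, where fullmatch plays the recognition role and
    match the prefix-lexing role (just a sketch, of course):

        import re

        # Recognition: is the whole string in the language?
        print(re.fullmatch(r"(a*b)+", "aaabaaabaaac"))        # -> None
        print(bool(re.fullmatch(r"(a*b)+", "aaab")))          # -> True
        print(bool(re.fullmatch(r"(a*b)+", "aaabaaab")))      # -> True

        # Lexing: which prefix gets carved off?  Only one answer survives.
        print(re.match(r"(a*b)+", "aaabaaabaaac").group())    # -> "aaabaaab"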

    Notably, if one focuses on the language recognition problem, the one
    where prefixes don't count, only whole strings do, then whether the |
    operator is commutative or not, does not matter. It only matters when
    one can terminate (and accept) without considering the entire string.

    Still, there are cases where one might prefer to get the shortest
    answer rather than the longest answer, or cases where one might prefer
    a shorter answer when two possible answers match. Those situations
    require a more complex language, of which regular expressions are only
    a subset. This class of languages is what the PCRE (and PEG) extensions
    allow one to access. It is possible to express every regular expression
    in those languages, trivially so with PCRE and with only a small amount
    of care with PEGs (whose / operator is non-commutative and chosen that
    way so as not to be confused with the | operator). And, moreover, you
    can get the same results for prefix matching if that is what you desire.
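    For example, sketching the longest/shortest preference with Python's
    PCRE-like quantifiers:

        import re
        text = "aaabaaabaaac"
        print(re.match(r"(a*b)+", text).group())     # greedy -> "aaabaaab"
        print(re.match(r"(a*?b)+?", text).group())   # lazy   -> "aaab"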

    However, you can also get results where other matches are the ones
    that are "preferred" and returned. The user simply has more choices.
    Now, if one hasn't chosen wisely and one has carelessly written an
    "extended" regular expression which returns a different preferred choice
    than the one which one expected, that is the fault of the writer not the formalism. People are just as capable of writing buggy regular expressions
    as they are of writing buggy programs, and that's true whether using
    the extensions or not. Maybe the user really wanted "(a+b)+" or the
    more complex "(a*b)+a*" or some other variant.

    And while there are buggy implementations of PCRE regular expressions,
    that doesn't mean the formalism itself is bad.

    --
    Chris Clark                   email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.      Web Site: http://world.std.com/~compres
    23 Bailey Rd                  voice: (508) 435-5016
    Berlin, MA 01503 USA          twitter: @intel_chris
