The easiest way to implement regular expressions is to turn them into
NFAs, but that has well known problems: if you feed hostile input to
the NFAs, they can take exponential time, and it's a way to DDoS the
server. If you translate the NFA to a DFA it runs in linear time, but
if the regex is ambiguous, as many are, the DFA may match something
different from the NFA.

This paper on the Arxiv preprint server proposes some rather complex
tweaks to NFA regex matching to fix many performance issues while
giving the same answers as before.

https://arxiv.org/abs/2401.12639
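A quick illustration of the ambiguity point (my example, not from the thread or the paper), using Python's `re` module, which has NFA-style backtracking semantics:

```python
import re

# The pattern "a|ab" is ambiguous on the input "ab": both alternatives
# can match at position 0. A backtracking (NFA-style) engine such as
# Python's re takes the first alternative that succeeds, so it matches "a".
m = re.match(r"a|ab", "ab")
print(m.group(0))  # -> a

# A POSIX longest-match engine (e.g. grep -E) must report "ab" instead --
# a different answer for the same pattern and input.
#
# The DDoS angle: a pattern like (a|a)*$ on an input of many "a"s followed
# by a "b" can drive a backtracking engine to exponential time (don't run it).
```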
(I personally am not so interested in backtracking regexes; and they are
not directly applicable to compilers. You'd never want to design a
lexical analyzer that requires "lookaround" or whatnot in order to
recognize tokens. It's nice they are generating Arxiv papers for
someone; no genuine intellectual inquiry is invalid; good luck! It's
not applicable to the "Programming Languages" category it has been filed
under. I will skim through it, though.)
Kaz Kylheku <433-929-6894@kylheku.com> wrote:
> (I personally am not so interested in backtracking regexes; and they are
> not directly applicable to compilers. You'd never want to design a
> lexical analyzer that requires "lookaround" or whatnot in order to
> recognize tokens. It's nice they are generating Arxiv papers for
> someone; no genuine intellectual inquiry is invalid; good luck! It's
> not applicable to the "Programming Languages" category it has been filed
> under. I will skim through it, though.)
You are mistaken that they are not applicable to programming languages.
Even Pascal [originally] had cases where lookahead and backtracking
are required to properly tokenize the input.
The well-known case is "1.": if it is followed by another "." it is two
tokens (as in the subrange "1..10"), and if it is followed by a digit it
is part of a floating point number.
I don't recall what it should be if followed by any other character.
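A minimal Python sketch of the two-character lookahead this case needs (my illustration, not from the thread):

```python
def scan_number(src: str, i: int) -> tuple[str, str, int]:
    """Scan a number starting at src[i]; return (token_type, lexeme, next_pos).

    After the integer part, seeing "." is not enough to decide: "1.." is
    INTEGER "1" followed by "..", while "1.5" is a FLOAT. We must peek
    at the character after the "." before committing.
    """
    j = i
    while j < len(src) and src[j].isdigit():
        j += 1
    # Peek two characters past the integer part: "." followed by a digit
    # means a float; anything else means the integer ends here.
    if j < len(src) and src[j] == "." and j + 1 < len(src) and src[j + 1].isdigit():
        k = j + 1
        while k < len(src) and src[k].isdigit():
            k += 1
        return ("FLOAT", src[i:k], k)
    return ("INTEGER", src[i:j], j)

print(scan_number("1..10", 0))  # ('INTEGER', '1', 1)
print(scan_number("1.5", 0))    # ('FLOAT', '1.5', 3)
```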
I think that case can be handled by technology like Lex trailing
contexts, which doesn't require backtracking.

{digits}/"." {
    // ... set up semantic value ...
    // (the dots are quoted so they match a literal ".", not any character)
    return INTEGER;
}
{digits}"."{digits} {
    // ... set up semantic value ...
    return FLOAT;
}

Another possibility is to recognize {digits}"." as one lexeme and call
unput('.'). That's a minor form of backtracking that doesn't require us
to do anything funny to the regex implementation.

[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
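For illustration, a toy Python version of the unput() approach (hypothetical scanner; in real lex the action would call unput() in C):

```python
class Scanner:
    """Toy scanner: consume digits plus a trailing ".", and if the "."
    turns out not to begin a fraction, push it back onto the input."""

    def __init__(self, src: str):
        self.src = src
        self.pos = 0

    def unput(self) -> None:
        # Push the most recently consumed character back onto the input.
        self.pos -= 1

    def next_number(self) -> tuple[str, str]:
        start = self.pos
        while self.pos < len(self.src) and self.src[self.pos].isdigit():
            self.pos += 1
        if self.pos < len(self.src) and self.src[self.pos] == ".":
            self.pos += 1  # consume the "." as part of {digits}"."
            if self.pos < len(self.src) and self.src[self.pos].isdigit():
                while self.pos < len(self.src) and self.src[self.pos].isdigit():
                    self.pos += 1
                return ("FLOAT", self.src[start:self.pos])
            self.unput()  # not a float after all: give the "." back
        return ("INTEGER", self.src[start:self.pos])

print(Scanner("1..10").next_number())  # ('INTEGER', '1')
print(Scanner("1.5").next_number())    # ('FLOAT', '1.5')
```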
Although, our esteemed moderator already corrected this error:
[Patterns with alternatives and trailing contexts do have to back up
and the flex manual warns it's quite expensive. See the
PERFORMANCE CONSIDERATIONS and DEFICIENCIES parts of the man page. -John]
I still want to write an extended reply to this, where
Kaz Kylheku wrote:
> I think that case can be handled by technology like Lex trailing
> contexts, which doesn't require backtracking.
>
> {digits}/"." {
>     // ... set up semantic value ...
>     return INTEGER;
> }
> {digits}"."{digits} {
>     // ... set up semantic value ...
>     return FLOAT;
> }
>
> Another possibility is to recognize {digits}"." as one lexeme and call
> unput('.'). That's a minor form of backtracking that doesn't require us
> to do anything funny to the regex implementation.
You may not recognize it as backtracking, but it is reading two
characters past the end of the integer token to disambiguate it.
Thus, what is implemented in lex is essentially "longest, lexically
first" matching: the lexer saves the last state where a token was
accepted and continues attempting to match, and on failure backs up to
the last recognized match (and then applies the lexically first rule
that caused that match).
The only difference between this and what PEG (and PCRE) mechanisms
do is that they implement "lexical first, longest" match.
But some backtracking is required for recognizing the longest matching
prefixes of the input, even if we use nothing but basic regex features!
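The difference between the two disciplines can be demonstrated with a toy rule set (my example, not from the thread):

```python
import re

# A keyword rule listed before a catch-all identifier rule, as in a
# typical lexer specification.
RULES = [("IF", re.compile(r"if")), ("IDENT", re.compile(r"[a-z]+"))]

def lex_style(src, i):
    """Lex semantics: longest match wins; ties go to the earliest rule."""
    best = None
    for name, rx in RULES:
        m = rx.match(src, i)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        return None
    return best[0], best[1].group(0)

def peg_style(src, i):
    """PEG/PCRE semantics: the first listed rule that matches wins,
    regardless of length."""
    for name, rx in RULES:
        m = rx.match(src, i)
        if m:
            return name, m.group(0)
    return None

# On "iffy" the two disciplines disagree:
print(lex_style("iffy", 0))  # ('IDENT', 'iffy') -- "longest, lexically first"
print(peg_style("iffy", 0))  # ('IF', 'if')      -- "lexically first, longest"
```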