Are regular expressions still the best way to specify tokens?

I suspect that in many cases the usual RE is not optimal for
lexical analysis, other than being well known.
But as noted, DFAs are likely the best way to implement them,
though that could change with changes in computer hardware.
On Sunday, June 5, 2022 at 2:08:12 PM UTC-7, Roger L Costello wrote:
(snip)
Are regular expressions still the best way to specify tokens?
Some years ago, I used to work with a company that sold hardware
search processors to a certain three letter agency that we are not
supposed to mention, but everyone knows.
Their hardware used a completely different language, PSL (Pattern
Specification Language), much more powerful than the usual regular expressions.
Both the standard and extended regular expressions are nice, in that we
get used to using them, especially with grep, and without thinking too
much about them.
I suspect, though, that if they hadn't previously been defined, we
might come up with something different today.
Among others, PSL has the ability to define approximate matches,
such as a word with one or more misspellings, that is, insertions,
deletions, or substitutions. Usual REs don't have that ability.
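Lacking PSL's built-in operator, approximate matching of this sort can be sketched directly. The following minimal Python illustration (function names are my own, not PSL's) computes the edit distance that counts exactly those insertions, deletions, and substitutions:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance:
    # counts insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def approx_match(word: str, target: str, k: int = 1) -> bool:
    # True when `word` is within k edits of `target`.
    return edit_distance(word, target) <= k

assert edit_distance("mispell", "misspell") == 1   # one insertion away
assert approx_match("mispell", "misspell", 1)
```

A dedicated engine would compile this into an automaton rather than run the table per word, but the sketch shows what "one or more misspellings" means operationally.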
There are also PSL expressions for ranges of numbers.
You can often do that with very complicated REs, considering
all of the possibilities; PSL processes those possibilities
automatically. (Some can expand to complicated code.)
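To see how hairy a hand-written range RE gets, here is the classic 0-255 case in Python's `re` module, enumerating by hand the alternatives that a tool like PSL would generate automatically (the pattern is my own sketch, not PSL output):

```python
import re

# Match an integer in the range 0..255, spelled out as alternatives:
# 250-255, 200-249, 100-199, and 0-99 (no leading zeros).
OCTET = re.compile(r'^(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$')

assert OCTET.match("0")
assert OCTET.match("255")
assert not OCTET.match("256")
assert not OCTET.match("007")   # leading zeros rejected
```

A three-digit range already needs four alternatives; a range like 17..4095 needs many more, which is exactly the expansion a range operator hides from you.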
I will look into PSL. There are algorithms for converting regexes to DFA
and then using the DFA to tokenize the input. Are there algorithms for converting PSL to (what?) and then using the (what?) to tokenize the input?
Regular expressions have the advantage that once you've paid the one-time cost
of making a DFA, the matching is extremely fast. The lexer is usually
one of the slowest parts of a compiler, since it is the only part that has to
look at each character of the source program, so this is a place where speed matters.
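To make the one-time-cost point concrete, here is a minimal table-driven DFA lexer sketch in Python (the states, token names, and helper functions are illustrative, not from any particular generator); once the table is built, matching costs one dictionary lookup per input character:

```python
# States: 0 = start, 1 = in identifier, 2 = in integer.
# A missing table entry means reject.

def char_class(c: str) -> str:
    if c.isalpha() or c == "_":
        return "alpha"
    if c.isdigit():
        return "digit"
    return "other"

TABLE = {
    (0, "alpha"): 1, (0, "digit"): 2,
    (1, "alpha"): 1, (1, "digit"): 1,
    (2, "digit"): 2,
}
ACCEPT = {1: "IDENT", 2: "INT"}

def longest_token(s: str, pos: int):
    # Run the DFA, remembering the last accepting position (maximal munch).
    state, last = 0, None
    for i in range(pos, len(s)):
        state = TABLE.get((state, char_class(s[i])), -1)
        if state == -1:
            break
        if state in ACCEPT:
            last = (ACCEPT[state], i + 1)
    return last  # (kind, end index) or None

def tokenize(s: str):
    pos, out = 0, []
    while pos < len(s):
        if s[pos].isspace():
            pos += 1
            continue
        hit = longest_token(s, pos)
        if hit is None:
            raise ValueError(f"bad character at {pos}")
        kind, end = hit
        out.append((kind, s[pos:end]))
        pos = end
    return out
```

A real generated lexer replaces the dictionary with dense arrays indexed by state and character class, but the inner loop has the same shape: no backtracking, one transition per character.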
I suspect that if regexes hadn't previously
been defined, we might come up with
something different today.
Wow! That is a remarkable statement.
In fact, there is only one thing that I have not seen a DFA lexer do that I
think is worth doing at the lexical level (and not via a screener). That is
recognizing tokens that start with a length prefix, e.g. 10Habcdefghij.
Such tokens are common in things like network protocols, and they would be
relatively easy to implement, but I've not seen it done.
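A length-prefixed token cannot be matched by a fixed DFA, since the count determines how many characters follow, but a scanner with one extra counter handles it easily. A hypothetical Python sketch (names my own):

```python
import re

# Recognize a Hollerith-style length-prefixed token such as 10Habcdefghij:
# read the decimal count, then consume exactly that many characters.
PREFIX = re.compile(r'(\d+)H')

def read_hollerith(s: str, pos: int):
    m = PREFIX.match(s, pos)
    if not m:
        return None                  # not a length-prefixed token here
    n = int(m.group(1))
    start = m.end()
    if start + n > len(s):
        raise ValueError("input ends inside Hollerith constant")
    # Return the token body and the position just past it.
    return s[start:start + n], start + n
```

The regex handles only the prefix; the counted consumption is plain code, which is why this belongs in the scanner's driver rather than in the pattern language.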
Beyond that, it is my relatively firm belief that one should almost always have
only simple regular expressions; e.g., the one for floating point numbers
should be one of the most complex ones. Otherwise you are trying
to do too much in the scanner, and you are asking for trouble when you do.
And, as our moderator pointed out, this makes a terrible regular
expression, NFA, or DFA, but it is actually quite easy in nearly any
programming language.
Now I know what made me think of Hollerith constants with the "H" :-)
I doubt that it's "quite easy" to use Hollerith constants for humans -
how often do you have to check whether you got the right number of
characters when reading or writing such a constant? So the delimited
form of strings is easier to handle by both humans and DFAs, a win-win
situation :-)
On Thursday, June 9, 2022 at 9:33:52 AM UTC-7, Hans-Peter Diettrich wrote:
In any case, if you write your program on a coding form, with
each character in a little box, it is easy to know how many are
in each H constant.
Nevertheless, counting the number of characters was a constant source
of error. It was easy enough to include the letter 'H' in the
character count, so that the following character got gobbled up in
the Hollerith constant, resulting in weird error messages. When a
Hollerith constant was long enough to require a continuation card, it
was even easier to lose count, the continuation character in column
6 sometimes being included. And when the Hollerith constant required
133 characters, how many could reliably count all of them?