Forum: >>> Magnum BBS <<<

Is it the job of a parser to validate the input data?

From Roger L Costello@21:1/5 to All on Wed Aug 11 22:24:49 2021

Hello Compiler Experts!

There are many data formats which contain things like this:

A number, N
N occurrences of something

For example, 3 followed by the names of three students:

3
John Doe
Sally Smith
Judy Jones

I have a question about parsing such data. Is it the job of a parser to ensure that the number of student names matches the number? Or, is it the job of the parser to merely tokenize whatever is in the input and then create an abstract syntax tree containing the tokens?

I imagine you will tell me, "it depends". But what is typically the case?

/Roger
[You can indeed do it either way. I prefer to do the counting in the AST creation
so it can produce errors like "too few names" rather than a generic "syntax error",
although putting it all in the parser makes it more likely that the language you
parse is actually the language you think you're parsing. -John]

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Christopher F Clark@21:1/5 to Roger L Costello on Thu Aug 12 15:10:40 2021

Roger L Costello <costello@mitre.org> asked:

There are many data formats which contain things like this:

A number, N
N occurrences of something

For example, 3 followed by the names of three students:

3
John Doe
Sally Smith
Judy Jones

I have a question about parsing such data. Is it the job of a parser to ensure
that the number of student names matches the number? Or, is it the job of the
parser to merely tokenize whatever is in the input and then create an abstract
syntax tree containing the tokens?

It is almost always done in the AST creation routines, not only do you
as our insightful moderator mentioned generally get better error
messages that way, but curiously, the features of extract a number,
turn it into a count, and apply that count (and yes those might be 3
distinct operations) to be how many items a list involves has not been implemented in any parser generator or lexer generator that I have
ever seen. That's a bizarre omission, particularly since it is a
common feature in many languages like networking protocols. Doing
fixed counts isn't rare, but doing a count held in a "register" or
"variable" seems to not be done.

The conversion step should generally be deferred to "semantic (aka
action) code or a predicate" as the process is messy and best handled
by some well tuned code not something a lexer/parser generator just
outputs and hopes it is semantically correct.

I have i(all 3 steps) on my near-endless to-do list to fix that for
Yacc++, but it isn't near the top of it.

By the way, when working with Michella Becchi on doing a hardware
regular expression engine at Intel, she studied the problem of counted
regular expressions and proposed some interesting implementation
details of how to handle them. Anyone interested in high speed regular expression implementations would be well advised to look up her papers
on the topic.

-- ****************************************************************************** Chris Clark email: christopher.f.clark@compiler-resources.com Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris ------------------------------------------------------------------------------

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to costello@mitre.org on Thu Aug 12 09:34:00 2021

On Wed, 11 Aug 2021 22:24:49 +0000, Roger L Costello
<costello@mitre.org> wrote:

There are many data formats which contain things like this:

A number, N
N occurrences of something

For example, 3 followed by the names of three students:

3
John Doe
Sally Smith
Judy Jones

I have a question about parsing such data. Is it the job of a parser to ensure >that the number of student names matches the number? Or, is it the job of the >parser to merely tokenize whatever is in the input and then create an abstract >syntax tree containing the tokens?

I imagine you will tell me, "it depends". But what is typically the case?

It's the job of a parser to ensure that the input's syntax is correct.
What that means exactly is up to the developer.

If you consider that in your 'language' a list consists of a number
followed by exactly that many strings ... well then you could argue
that the parser should enforce that.

However, as John mentioned, often it is difficult to generate really
meaningful error messages during parsing. I would contend that in
your example the /syntax/ of lists is really is a number followed by
zero or more strings (number string*), and that verifying the string
count is semantics, not syntax. I believe that, whenever possible,
semantics are best left until after parsing is finished.

YMMV,
George

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From luser droog@21:1/5 to Christopher F Clark on Fri Sep 3 21:37:53 2021

On Thursday, August 12, 2021 at 10:52:15 AM UTC-5, Christopher F Clark wrote:

It is almost always done in the AST creation routines, not only do you
as our insightful moderator mentioned generally get better error
messages that way, but curiously, the features of extract a number,
turn it into a count, and apply that count (and yes those might be 3
distinct operations) to be how many items a list involves has not been implemented in any parser generator or lexer generator that I have
ever seen. That's a bizarre omission, particularly since it is a
common feature in many languages like networking protocols. Doing
fixed counts isn't rare, but doing a count held in a "register" or
"variable" seems to not be done.

I think the omission comes from difficulty in the formalization. Having to apply such a dynamic count moves the parser from context-free to context-sensitive, jumping to a new level in the Chomsky hierarchy.
So all the training wheels come off and it all gets scary.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	295
Nodes:	16 (2 / 14)
Uptime:	20:16:49
Calls:	6,640
Files:	12,188
Messages:	5,325,290

Is it the job of a parser to validate the input data?

Who's Online

System Info