Any advice on what node types should be in an IR tree? There seem to be
many options and I am not sure where to strike the balance.
My guess is that it's best to keep the number of different node types reasonably small. That should mean less code is needed to walk the tree.
But when does that number become too small?
For example, consider expressions.
There /could/ be one node type for "+" and another for "*", as in a node
with these fields:
+
left
right
Alternatively, both operators could be handled by a single node type of "dyadic operation". Such a node would have four fields:
nodetype = dyadic
operator (e.g. "+")
left
right
Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.
In fact, that form is effectively an s expression of the form
(operator arguments)
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
Any advice on what node types should be in an IR tree? There seem to be
many options and I am not sure where to strike the balance.
My guess is that it's best to keep the number of different node types reasonably small. That should mean less code is needed to walk the tree.
But when does that number become too small?
For example, consider expressions.
There /could/ be one node type for "+" and another for "*", as in a node
with these fields:
+
left
right
Alternatively, both operators could be handled by a single node type of "dyadic operation". Such a node would have four fields:
nodetype = dyadic
operator (e.g. "+")
left
right
That could be further generalised by having one type of node which would
be usable for both monadic and dyadic operators:
nodetype = operation
operator
number of arguments: 1 or 2
arguments
Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.
In fact, that form is effectively an s expression of the form
(operator arguments)
So for expressions, which of the above schemes is best to use for nodes
in an IR tree?
(A potential fly in the ointment is whether such a node is suitable for functions which modify some of their parameters.)
Then there are other (non-expression) nodes.
To allow for things like code hoisting I think I need the tree initially
to be at a fairly high level, containing things such as loops and
selections though it would be lowered later to use labels and
conditional branches.
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
Is that enough? Too many? Too few? What types are missing?
On 23/11/2022 15:39, Dmitry A. Kazakov wrote:
a - b + c
produces
operation => +
operands => a,-b,c
Both of those are curious. Why have more than two operands?
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
operator a+b
call f(a,b)
Then various ligatures that are not operators, but some meta-operators
e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
etc. Usually you cannot treat them as first-class operators, because a
in a::b might not be an expression.
I think I understood your post up until that paragraph. What do you mean
by ligatures and meta operators? And aside from the semantics what's
wrong with "a in a::b"?
I do not create one tree for all program, just for individual
expressions. IMO there is no need to generate a tree for a switch or a
loop, unless you have switch-operator and loop-operator, of course.
Again, a curious comment. My sources are naturally hierarchical so a
tree seems to be a good way to represent them. Are you saying another
form could be better?
On 2022-11-23 15:38, James Harris wrote:
Any advice on what node types should be in an IR tree? There seem to
be many options and I am not sure where to strike the balance.
I have an abstract node type with derived node types for expression
terms like identifiers and literals and for branches. The branch node
type is like:
operation - an enumeration
location - the source code location
operands - the nodes list (1..n)
operation can be for example "+" or "call." E.g.
f(a,b,c)
produces a node
operation => call
operands => f,a,b,c
a + b + c
produces
operation => +
operands => a,b,c
a - b + c
produces
operation => +
operands => a,-b,c
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
operator a+b
call f(a,b)
Then various ligatures that are not operators, but some meta-operators
e.g. if you have built-in ranges a..b, components a.b, namespaces a::b
etc. Usually you cannot treat them as first-class operators, because a
in a::b might not be an expression.
I do not create one tree for all program, just for individual
expressions. IMO there is no need to generate a tree for a switch or a
loop, unless you have switch-operator and loop-operator, of course.
On 23/11/2022 14:38, James Harris wrote:
Any advice on what node types should be in an IR tree? There seem to
be many options and I am not sure where to strike the balance.
My comments will be about an AST. I don't know whether you are using
'IR' as a synonym for the AST; I think it's generally used for the next
stage beyond AST.
The problem is, there are loads of ways to do this, they will all work.
There is no right or wrong way.
Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.
How practical this is depends on your language, the arrangements for
matching an arbitrary length list to the node.
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
Is that enough? Too many? Too few? What types are missing?
My ASTs are unusual (although I only discovered this recently) in that
they only represent executable code. That is, the bodies of functions,
or init data for a variable.
Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
scripting where declarations /are/ executed from the top of the module.
However even without declaration nodes, I use either 130 or 200 node
types (depending on whether ADD, SUB etc have dedicated tags, or grouped under one BINOP tag). My C compiler uses 80.
But you must have already done all this in your compiler; didn't that
use an AST?
On 23/11/2022 18:33, Bart wrote:
However even without declaration nodes, I use either 130 or 200 node
types (depending on whether ADD, SUB etc have dedicated tags, or
grouped under one BINOP tag). My C compiler uses 80.
Wow, those numbers sound high! Maybe I have been trying to cut the
numbers down unnecessarily.
On 23/11/2022 14:38, James Harris wrote:
Any advice on what node types should be in an IR tree? There seem to
be many options and I am not sure where to strike the balance.
My comments will be about an AST. I don't know whether you are using
'IR' as a synonym for the AST; I think it's generally used for the next
stage beyond AST.
My guess is that it's best to keep the number of different node types
reasonably small. That should mean less code is needed to walk the
tree. But when does that number become too small?
For example, consider expressions.
There /could/ be one node type for "+" and another for "*", as in a
node with these fields:
+
left
right
Alternatively, both operators could be handled by a single node type
of "dyadic operation". Such a node would have four fields:
nodetype = dyadic
operator (e.g. "+")
left
right
The problem is, there are loads of ways to do this, they will all work.
There is no right or wrong way.
I use both of the above methods across three active compilers. With the first, its easy to dispatch directly the "+" node when you need to deal
with "add".
With the second, you have to dispatch first on "dyadic" and then on "operator". But it's not a big deal.
That could be further generalised by having one type of node which
would be usable for both monadic and dyadic operators:
nodetype = operation
operator
number of arguments: 1 or 2
arguments
Further, why limit the number of arguments to 1 or 2? If one were to
allow an arbitrary number of arguments then such a type of node could
also be used for function calls.
How practical this is depends on your language, the arrangements for
matching an arbitrary length list to the node.
Currently I allow 3 operands for one language, and 2 for another.
Remember this not just about expressions, it has to be able to represent anything.
For variable-length lists, any of the operands (although usually just
the first) is interpreted as the root of a linked list. (With some
compiler versions in dynamic code, each operand could directly be a list.)
In fact, that form is effectively an s expression of the form
(operator arguments)
So for expressions, which of the above schemes is best to use for
nodes in an IR tree?
(A potential fly in the ointment is whether such a node is suitable
for functions which modify some of their parameters.)
Then there are other (non-expression) nodes.
To allow for things like code hoisting I think I need the tree
initially to be at a fairly high level, containing things such as
loops and selections though it would be lowered later to use labels
and conditional branches.
So what other node types are needed? One could have
program
function
identifier: constant or variable or parameter
assignment
iteration
sequential selection
switch
break iteration
break sequential selection
break switch
Is that enough? Too many? Too few? What types are missing?
My ASTs are unusual (although I only discovered this recently) in that
they only represent executable code. That is, the bodies of functions,
or init data for a variable.
Others use the AST also for non-executable elements, like function declarations. I couldn't see the point of this, unless this is a
scripting where declarations /are/ executed from the top of the module.
However even without declaration nodes, I use either 130 or 200 node
types (depending on whether ADD, SUB etc have dedicated tags, or grouped under one BINOP tag). My C compiler uses 80.
But you must have already done all this in your compiler; didn't that
use an AST?
Sysop: | Keyop |
---|---|
Location: | Huddersfield, West Yorkshire, UK |
Users: | 300 |
Nodes: | 16 (2 / 14) |
Uptime: | 01:10:31 |
Calls: | 6,706 |
Calls today: | 6 |
Files: | 12,235 |
Messages: | 5,349,918 |