• Address manipulations

    From James Harris@21:1/5 to All on Sun Nov 7 23:57:38 2021
    I'll set out below what to my knowledge is a novel way of looking at
    certain aspects of expression parsing. Don't be alarmed, it doesn't
    parse Martian. In fact, I think (subject to correction) that it
    implements the normal kind of parsing that a programmer would be
    familiar with. But AISI it handles some of it in a simpler, more
    natural, and more understandable way than I've seen anywhere else.


    To explain, since the 1960s it has been traditional to think of some identifiers as resolving to lvalues and others to rvalues. However, I
    suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

    X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in certain contexts. But programmers find it natural if in other contexts X is
    implicitly and automatically dereferenced to yield a value. Classically,
    in the assignment

    X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    What matters is semantics but contexts are easiest to discuss in terms
    of the syntax so I'll do that. In simple terms one could say that if an expression (of any sort) is followed by one of

    = (assignment)
    . (field selection)
    ( (function invocation)
    [ (array lookup)

    or is tweaked with increment or decrement operators (as in C's ++ and
    --) then the /address/ is used. In all other contexts, however, an
    implicit dereference is automatically inserted by a compiler such that the
    value at the designated address is used instead. To illustrate, consider

    A[2][4]

    Note that after both A *and* the first closing square bracket there is
    no dereference. In syntax terms one can consider that that's because
    each is followed by one of the aforementioned symbols. IOW both A and
    the first closing square bracket are followed by an opening square
    bracket so there is no dereference. But there /is/ an automatic
    dereference after the final square bracket because it is not followed by
    one of the listed symbols. So the key as to whether an automatic
    dereference is inserted or not is what comes next after an expression.

    That's very flexible, allowing expressions to work with an arbitrary
    number of addresses. For example,

    B = A[2][4][6][8][10]

    etc. That expression uses addresses all the way through. Each array
    lookup results in yet another address. Only after the final square
    bracket would there be a dereference.
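A minimal C sketch of that chain, assuming a hypothetical two-dimensional array `A` (the names `element_address` and `load_element` and the dimensions are illustrative, not from the post). Each indexing step only turns one address into another; the single memory read happens at the end.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 3, COLS = 5 };            /* hypothetical dimensions */
static int A[ROWS][COLS];

/* Each indexing step maps an address to another address; nothing is read. */
static int *element_address(int (*base)[COLS], size_t i, size_t j)
{
    assert(i < ROWS && j < COLS);
    int (*row)[COLS] = base + i;        /* address of row i */
    return *row + j;                    /* array decays to a pointer: no load */
}

/* The one dereference happens only after the final index. */
static int load_element(size_t i, size_t j)
{
    return *element_address(A, i, j);
}
```

Calling `load_element(2, 4)` corresponds to the value of `A[2][4]`, while `element_address(A, 2, 4)` corresponds to the same expression with its final dereference withheld.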

    Of course, it's not just array indexing. Anything which /produces/ an
    address can have its output fed into anything which /uses/ an address
    and such operators can be combined arbitrarily. For example,

    vectors[1](2).data[3] = y

    Such an expression may be horrendous but illustrates how a programmer
    could combine addresses in any way desired. Only after the y would there
    be a dereference.

    (Perhaps it's strange that as programmers we accept the inconsistency
    that some contexts get implicit dereferences and some don't. But we
    would probably not want to write all deref or no-deref points in code.
    So we are where we are.)

    Importantly, it is always possible to dereference an address to get a
    value but there is no way to operate on a value to get its address. For
    that reason my precedence table has all the address-consuming operators
    first. That's probably true of most other languages as well but I've not
    seen that set out as a rationale.

    Consider how C uses its 'address of' operator, & as a prefix.

    &X gets the address of X
    &X[4] gets the address of X[4]
    &X.f gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument.
    As stated, it is not possible to get from a value to an address. So &E
    cannot evaluate E and then take its address. Therefore & is not an
    operator in the normal sense that it manipulates a value. Instead, &E
    inhibits the automatic dereference that would have been inserted at the
    end of E: it prevents emission of the dereference that the compiler
    would otherwise have emitted.
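As a hedged illustration in standard C: the addresses that `&X[4]` and `&R.f` yield can be computed by address arithmetic alone, with no read of the element or field. The record type `struct rec` and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct rec { int a; int f; };           /* hypothetical record type */

static int X[8];
static struct rec R;

/* Address of X[4] without writing &X[4]: base address plus four elements. */
static int *index_address(void)
{
    return X + 4;
}

/* Address of R.f without writing &R.f: record address plus the field offset. */
static int *field_address(void)
{
    return (int *)((char *)&R + offsetof(struct rec, f));
}

/* & yields exactly those addresses; neither X[4] nor R.f is loaded. */
static void check_address_of(void)
{
    assert(&X[4] == index_address());
    assert(&R.f == field_address());
}
```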

    There is, perhaps, an additional oddity that an 'operator' at the
    beginning of a subexpression really applies at the end of that
    subexpression.

    It may be more straightforward for & to be placed at the location where
    the dereference would otherwise have been.

    Assuming for discussion purposes that trailing & and infix & can be distinguished (so we don't need to use another symbol) the above
    expressions would become

    X& the address of X
    X[4]& the address of X[4]
    X.f& the address of field f

    Then the unary trailing & joins the symbols in the list above and
    becomes just another of the operators which, when it appears after an expression, inhibits the automatic dereference that would otherwise have occurred at that point:

    = assign
    . field selection
    ( function call
    [ index
    & nothing except, like all the others, inhibit dereference

    To summarise, there would no longer be the conceptual difference between lvalues and rvalues. All identifiers would be considered as producing
    their addresses, never their values. There would instead be contexts in
    which automatic dereference takes place, and the programmer would put &
    in any of those places where the automatic dereference was to be inhibited.
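The rule can be sketched as a small C routine that copies an expression and writes `@` wherever an implicit dereference would be inserted. This is only a toy under stated assumptions: every identifier and every closing `]` is taken to yield an address, every `&` is treated as the trailing form (no infix `&`), the result of a call's closing `)` is not tracked, and the names `inhibits_deref` and `mark_derefs` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero if what follows (after spaces) is one of the listed symbols. */
static int inhibits_deref(const char *p)
{
    while (*p == ' ')
        p++;
    if (p[0] == '=')
        return p[1] != '=';                 /* assignment, but not == */
    if ((p[0] == '+' || p[0] == '-') && p[1] == p[0])
        return 1;                           /* ++ or -- */
    return *p != '\0' && strchr(".([&", *p) != NULL;
}

/* Copy src to dst, writing '@' wherever an implicit dereference would go. */
static void mark_derefs(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    const char *p = src;

    assert(cap > 0);
    while (*p != '\0') {
        int yields_address;

        if (isalpha((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_') {
                assert(n + 2 < cap);        /* room for char, '@' and NUL */
                dst[n++] = *p++;
            }
            yields_address = 1;             /* an identifier names an address */
        } else {
            yields_address = (*p == ']');   /* so does a completed index */
            assert(n + 2 < cap);
            dst[n++] = *p++;
        }
        if (yields_address && !inhibits_deref(p))
            dst[n++] = '@';
    }
    dst[n] = '\0';
}
```

Under those assumptions, `X = X` becomes `X = X@`, `A[2][4]` becomes `A[2][4]@`, and `X[4]&` is left unchanged because the trailing `&` inhibits the dereference.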

    AFAIK that's a new way of looking at addresses in expressions but maybe
    you know otherwise.

    More importantly, as a programmer how easy would you find it to think in
    those terms?

    I wanted to go on to say more but this post is already overlong. I'll
    come back to some of the other points.

    Naturally, comments welcome!


    --
    James Harris

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bart@21:1/5 to James Harris on Mon Nov 8 00:25:58 2021
    On 07/11/2021 23:57, James Harris wrote:

    To explain, since the 1960s it has been traditional to think of some identifiers as resolving to lvalues and others to rvalues. However, I suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

      X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in certain contexts. But programmers find it natural if in other contexts X is implicitly and automatically dereferenced to yield a value. Classically,
    in the assignment

      X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    (Haven't we been here before?)

    In X = X, both sides are dereferenced, one for reading, one for writing:

    mov D0, [x]
    mov [x], D0

    If you like, emulate a language that doesn't dereference automatically
    by writing &X instead of X. Then to do that assignment, you'd need to write:

    *(&X) = *(&X)

    Why would you need * on both sides if only one side is dereferenced?

  • From Rod Pemberton@21:1/5 to James Harris on Sun Nov 7 20:24:50 2021
    On Sun, 7 Nov 2021 23:57:38 +0000
    James Harris <james.harris.1@gmail.com> wrote:

    I'll set out below what to my knowledge is a novel way of looking at
    certain aspects of expression parsing. Don't be alarmed, it doesn't
    parse Martian. In fact, I think (subject to correction) that it
    implements the normal kind of parsing that a programmer would be
    familiar with. But AISI it handles some of it in a simpler, more
    natural, and more understandable way than I've seen anywhere else.


    To explain, since the 1960s it has been traditional to think of some identifiers as resolving to lvalues and others to rvalues. However,
    I suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

    X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in
    certain contexts.

    Well, that depends on the language. E.g., in the variety of PL/1 I
    programmed, all variables were passed-by-reference. I.e., they were
    always treated as addresses. You could specify pass-by-value, if
    desired (unneeded). For C, they are pretty much always treated as
    addresses, which is mostly unseen or unnoticed by the programmer, or
    even rejected as a concept by those pedants on c.l.c.. Of course, they
    are passed-by-value in C, but sometimes by reference, if coded that way.

    But programmers find it natural if in other
    contexts X is implicitly and automatically dereferenced to yield a
    value. Classically, in the assignment

    X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    What? ... Of course, it is. X is dereferenced twice here.

    You must get the address of X in both instances.
    You need the address on the right to read/access X's value.
    You need the address on the left to write/store X's value.

    What matters is semantics but contexts are easiest to discuss in
    terms of the syntax so I'll do that. In simple terms one could say
    that if an expression (of any sort) is followed by one of

    = (assignment)
    . (field selection)
    ( (function invocation)
    [ (array lookup)

    or is tweaked with increment or decrement operators (as in C's ++ and
    --) then the /address/ is used.

    Personally, I think you're looking at this all the wrong way around.
    Treat everything as an address from the get-go. Then, you should be
    able to recognize that everything is an address.

    E.g.,
    printf("Hello World\n");

    "Hello World\n" <-- placeholder for the address of the string constant:
    Hello World\n\0 which is stored somewhere else

    Hello World\n\0 <-- string constant stored at the address of a
    placeholder, which you see as: "Hello World\n", i.e., which
    is essentially an unnamed or hidden variable, or compiler-created
    temporary variable

    In all other contexts, however, an implicit [dereferences]

    I don't give deference to dereferences.

    If everything is an address from the get-go, there are no implicit dereferences. As I said, the variety of PL/1 I programmed worked this
    way, as does much of C, whether recognized as such or not.

    In all other contexts, however, an implicit [dereference] is
    automatically inserted by a compiler such that the value at the
    designated address is used instead. To illustrate, consider

    A[2][4]


    Let's make it proper with an assignment:

    B = A[2][4];

    Start with A is an address.
    Also, B is an address.
    Adjust A's address by [2][4], which depends on the type's size.
    Read data from the adjusted address of whatever type A was declared as.
    Store data read from A at address B.
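Those five steps can be written out in C roughly as follows. This is a sketch with hypothetical dimensions and helper names (`byte_offset`, `assign_b`), assuming a row-major `int` array as in C; the adjustment is scaled by the element size, as the steps describe.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ROWS = 3, COLS = 5 };            /* hypothetical dimensions */
static int A[ROWS][COLS];
static int B;

/* Adjustment for [i][j] in bytes; it depends on the element type's size. */
static size_t byte_offset(size_t i, size_t j)
{
    assert(i < ROWS && j < COLS);
    return (i * COLS + j) * sizeof(int);
}

/* B = A[i][j], spelled out as address, adjust, read, store. */
static void assign_b(size_t i, size_t j)
{
    const unsigned char *src = (const unsigned char *)A;  /* A is an address */
    src += byte_offset(i, j);                             /* adjust it       */
    memcpy(&B, src, sizeof B);                            /* read, then store at B */
}
```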

    Note that after both A *and* the first closing square bracket there
    is no dereference. In syntax terms one can consider that that's
    because each is followed by one of the aforementioned symbols. IOW
    both A and the first closing square bracket are followed by an
    opening square bracket so there is no dereference. But there /is/ an
    automatic dereference after the final square bracket because it is
    not followed by one of the listed symbols. So the key as to whether
    an automatic dereference is inserted or not is what comes next after
    an expression.

    In my explanation above, the assignment operator = does the dereference
    of adjusted address for A to read and the dereference of B's address to
    store.

    That's very flexible, allowing expressions to work with an arbitrary
    number of addresses. For example,

    B = A[2][4][6][8][10]

    etc. That expression uses addresses all the way through. Each array
    lookup results in yet another address. Only after the final square
    bracket would there be a dereference.

    ...

    Of course, it's not just array indexing. Anything which /produces/ an
    address can have its output fed into anything which /uses/ an address
    and such operators can be combined arbitrarily. For example,

    vectors[1](2).data[3] = y

    Please, trust me and treat everything as an address. It will make your
    life much easier.

    Such an expression may be horrendous but illustrates how a programmer
    could combine addresses in any way desired. Only after the y would
    there be a dereference.

    (Perhaps it's strange that as programmers we accept the inconsistency
    that some contexts get implicit dereferences and some don't. But we
    would probably not want to write all deref or no-deref points in
    code. So we are where we are.)

    IMO, implicit dereferences are just used to explain away the fact that everything is really an address, because this fact is beyond the
    comprehension of novices who are not taught about addresses and data
    types, but are taught about variables and strings, etc.

    Importantly, it is always possible to dereference an address to get a
    value but there is no way to operate on a value to get its address.
    For that reason my precedence table has all the address-consuming
    operators first. That's probably true of most other languages as well
    but I've not seen that set out as a rationale.

    Consider how C uses its 'address of' operator, & as a prefix.

    &X gets the address of X
    &X[4] gets the address of X[4]
    &X.f gets the address of field f

    Yet C's & is not a normal operator.

    ...

    It does not transform its argument.

    Correct.

    It actually tells C to **NOT** dereference the address of X, or
    adjusted address from X, as is normally done prior to an assignment or pass-by-value to a function, thereby leaving the address instead of the
    value.

    As stated, it is not possible to get from a value to an
    address. So &E cannot evaluate E and then take its address. Therefore
    & is not an operator in the normal sense that it manipulates a value. Instead, &E inhibits the automatic dereference that would have been
    inserted at the end of E: it prevents emission of the dereference
    that the compiler would otherwise have emitted.

    Yes. This is a result of everything in C being an address, a concept
    rejected by C pedants on c.l.c. and elsewhere, and even you ...

    There is, perhaps, an additional oddity that an 'operator' at the
    beginning of a subexpression really applies at the end of that
    subexpression.

    It may be more straightforward for & to be placed at the location
    where the dereference would otherwise have been.

    Assuming for discussion purposes that trailing & and infix & can be distinguished (so we don't need to use another symbol) the above
    expressions would become

    X& the address of X
    X[4]& the address of X[4]
    X.f& the address of field f

    Then the unary trailing & joins the symbols in the list above and
    becomes just another of the operators which, when it appears after an expression, inhibits the automatic dereference that would otherwise
    have occurred at that point:

    = assign
    . field selection
    ( function call
    [ index
    & nothing except, like all the others, inhibit dereference

    To summarise, there would no longer be the conceptual difference
    between lvalues and rvalues.

    ...

    All identifiers would be considered as
    producing their addresses, never their values.

    As stated previously here and numerous other posts, that is, IMO, the
    correct approach.

    There would instead be
    contexts in which automatic dereference takes place, and the
    programmer would put & in any of those places where the automatic
    dereference was to be inhibited.

    AFAIK that's a new way of looking at addresses in expressions but
    maybe you know otherwise.

    New? Perhaps, a new understanding for you, I guess.

    More importantly, as a programmer how easy would you find it to think
    in those terms?

    I already do. Have for decades, in regards to C. No one discussing C
    ever agrees with me though.

    I wanted to go on to say more but this post is already overlong. I'll
    come back to some of the other points.

    Naturally, comments welcome!

    --
    Is this the year that Oregon ceases to exist?

  • From James Harris@21:1/5 to Bart on Mon Nov 8 07:53:21 2021
    On 08/11/2021 00:25, Bart wrote:
    On 07/11/2021 23:57, James Harris wrote:

    To explain, since the 1960s it has been traditional to think of some
    identifiers as resolving to lvalues and others to rvalues. However, I
    suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

       X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in
    certain contexts. But programmers find it natural if in other contexts
    X is implicitly and automatically dereferenced to yield a value.
    Classically, in the assignment

       X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    (Haven't we been here before?)

    Some. Although I think this is the first time I've ever written here
    about some aspects such as the trailing & construct.


    In X = X, both sides are dereferenced, one for reading, one for writing:

             mov  D0,  [x]
             mov  [x], D0

    Rather than dereferenced do you mean that both sides are /accessed/?

    I should explain what I mean by dereferencing. I mean, effectively,
    following a pointer. I don't know x86-64 asm but in x86-32 asm the
    dereference operation would be of the form

    mov eax, [eax]

    IOW EAX contains an address and that instruction replaces it with the
    value at that address, aka it follows a pointer, aka it 'dereferences'!

    To make it clearer consider assignments with different variables

    B = A

    that could translate to

    lea eax, [A]
    mov eax, [eax] ;<=== A is dereferenced
    lea ebx, [B]
    ;<=== B is not dereferenced
    mov [ebx], eax ;<=== the assign operation

    A and B are both treated the same way - i.e. as addresses. However, B
    cannot be dereferenced - i.e. its address cannot be converted to a value
    - because its address is what's needed.
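The same sequence can be mirrored in C with explicit pointers. This is only an illustrative sketch (the variable and function names are hypothetical); note that C itself still spells the final store with `*`, which is part of what the thread is debating.

```c
#include <assert.h>

static int A = 7;
static int B = 0;

/* B = A, one C statement per instruction of the assembly above. */
static void assign_b_from_a(void)
{
    int *src = &A;        /* lea eax, [A]   : address of A            */
    int value = *src;     /* mov eax, [eax] : A is dereferenced       */
    int *dst = &B;        /* lea ebx, [B]   : address of B is kept    */
    *dst = value;         /* mov [ebx], eax : the assign operation    */
    assert(B == A);
}
```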


    If you like, emulate a language that doesn't dereference automatically
    by writing &X instead of X. Then to do that assignment, you'd need to
    write:

       *(&X) = *(&X)

    Why would you need * on both sides if only one side is dereferenced?


    If I read that right the * on the RHS will be honoured but the one on
    the left will not! Part of my thesis is that the LHS's * will be
    inhibited by the = assignment. That's easy to see in simple assembly. If
    the RHS of your expression is translated to

    lea eax, [X] ;the & operator
    mov eax, [eax] ;the * operator

    then the LHS would correspondingly be translated to

    lea eax, [X] ;the & operator

    But the * operator on the LHS would be suppressed. If you would
    translate it differently I suggest there would still be one fewer
    dereferences on the LHS than on the RHS. Do you see now what I am
    getting at?


    --
    James Harris

  • From David Brown@21:1/5 to James Harris on Mon Nov 8 09:52:41 2021
    On 08/11/2021 00:57, James Harris wrote:

    Consider how C uses its 'address of' operator, & as a prefix.

      &X      gets the address of X
      &X[4]   gets the address of X[4]
      &X.f    gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument

    I'm not commenting on your main points at the moment - I think it is an interesting view, and worth thinking about.

    However, your comment that "C's & is not a normal operator" is somewhat
    bizarre - it implies there is such a thing as a "normal operator". C
    has all sorts of operators - function calls are operators, sizeof and
    _Alignof are operators (neither of which evaluates their operand, and
    the operand can be a type rather than an expression), assignment is an
    operator (while in many languages, it is a statement). Casts are
    operators that may or may not affect the value of the operand. The
    comma operator evaluates and then discards its first operand. Structure
    and union member access are operators.

    I suppose you mean to say that "&" is somewhat different from addition
    or multiplication. Alternatively, you could say that most operators in
    C are not normal operators!

  • From James Harris@21:1/5 to Rod Pemberton on Mon Nov 8 08:41:00 2021
    On 08/11/2021 01:24, Rod Pemberton wrote:
    On Sun, 7 Nov 2021 23:57:38 +0000
    James Harris <james.harris.1@gmail.com> wrote:

    ...

    Classically, in the assignment

    X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    What? ... Of course, it is. X is dereferenced twice here.

    You must get the address of X in both instances.
    You need the address on the right to read/access X's value.
    You need the address on the left to write/store X's value.

    You seem to be thinking of /accesses/ rather than dereferences. Bart
    did, too. By dereference I mean fetching the value at an address. For
    example, in C

    **p

    will have one more dereference than

    *p
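To make "one more dereference" countable, here is a toy model in C: memory is an array of words, and a dereference replaces an address with the word stored there. The memory layout, `deref`, and `follow` are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { MEM_WORDS = 8 };
static size_t mem[MEM_WORDS];   /* toy memory: a word holds a value or an address */
static int deref_count;         /* number of dereferences performed so far */

/* Dereference: replace an address by the word stored at that address. */
static size_t deref(size_t addr)
{
    assert(addr < MEM_WORDS);
    deref_count++;
    return mem[addr];
}

/* Follow `levels` pointers, starting from the address `addr`. */
static size_t follow(size_t addr, int levels)
{
    while (levels-- > 0)
        addr = deref(addr);
    return addr;
}
```

If the variable `p` lives at address 1, then in this model the C expression `*p` corresponds to `follow(1, 2)` and `**p` to `follow(1, 3)`: the second performs exactly one more dereference than the first.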



    What matters is semantics but contexts are easiest to discuss in
    terms of the syntax so I'll do that. In simple terms one could say
    that if an expression (of any sort) is followed by one of

    = (assignment)
    . (field selection)
    ( (function invocation)
    [ (array lookup)

    or is tweaked with increment or decrement operators (as in C's ++ and
    --) then the /address/ is used.

    Personally, I think you're looking at this all the wrong way around.

    :-)

    Treat everything as an address from the get-go.

    I do. In my compiler every identifier is initially treated as an
    address. The difference is in /where/ dereference operations should be inserted.

    ...

    Let's make it proper with an assignment:

    B = A[2][4];

    Start with A is an address.
    Also, B is an address.
    Adjust A's address by [2][4], which depends on the type's size.
    Read data from the adjusted address of whatever type A was declared as.
    Store data read from A at address B.

    ...

    In my explanation above, the assignment operator = does the dereference
    of adjusted address for A to read and the dereference of B's address to store.

    A key question for you: Would the assignment operator still do the
    dereference in

    A = B + C

    ?

    ...

    = assign
    . field selection
    ( function call
    [ index
    & nothing except, like all the others, inhibit dereference

    To summarise, there would no longer be the conceptual difference
    between lvalues and rvalues.

    ...

    All identifiers would be considered as
    producing their addresses, never their values.

    As stated previously here and numerous other posts, that is, IMO, the
    correct approach.

    Sounds good but do you also accept my thesis about expressions having
    implicit dereference points? I am saying they take place in those places
    which are not followed by one of the above symbols.

    For example,

    print A + 4
    print B[0] + 4

    Aren't there implicit dereference points in there where I've put @ signs
    in the following?

    print A@ + 4
    print B[0]@ + 4

    NB no deref immediately after B even though there is one after A.


    --
    James Harris

  • From Dmitry A. Kazakov@21:1/5 to Rod Pemberton on Mon Nov 8 09:57:29 2021
    On 2021-11-08 02:24, Rod Pemberton wrote:
    On Sun, 7 Nov 2021 23:57:38 +0000
    James Harris <james.harris.1@gmail.com> wrote:

    Of course, it's not just array indexing. Anything which /produces/ an
    address can have its output fed into anything which /uses/ an address
    and such operators can be combined arbitrarily. For example,

    vectors[1](2).data[3] = y

    Please, trust me and treat everything as an address. It will make your
    life much easier.

    Not true even for the assembler James calls language. Even machine code
    needs to have registers and immediates.

    Everything (object) is a set of instructions bringing the computations
    into the state corresponding to the actual value of the object in the
    actual context.

    Note that even this set is not fixed, it may vary, as the object can be
    stored in a register, it can be packed in a way you cannot address it,
    it can be marshaled over a network connection, the subprogram can be
    inlined, a closure with the object can be passed indirectly via display or whatever, and so on and so forth.

    --
    Regards,
    Dmitry A. Kazakov
    http://www.dmitry-kazakov.de

  • From James Harris@21:1/5 to David Brown on Mon Nov 8 10:03:01 2021
    On 08/11/2021 08:52, David Brown wrote:
    On 08/11/2021 00:57, James Harris wrote:

    Consider how C uses its 'address of' operator, & as a prefix.

      &X      gets the address of X
      &X[4]   gets the address of X[4]
      &X.f    gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument

    I'm not commenting on your main points at the moment - I think it is an interesting view, and worth thinking about.

    Cool. Considered views are appreciated.


    However, your comment that "C's & is not a normal operator" is somewhat bizarre - it implies there is such a thing as a "normal operator". C
    has all sorts of operators - function calls are operators, sizeof and _Alignof are operators (neither of which evaluates their operand, and
    the operand can be a type rather than an expression), assignment is an operator (while in many languages, it is a statement). Casts are
    operators that may or may not affect the value of the operand. The
    comma operator evaluates and then discards its first operand. Structure
    and union member access are operators.

    I suppose you mean to say that "&" is somewhat different from addition
    or multiplication. Alternatively, you could say that most operators in
    C are not normal operators!

    Maybe it comes down to nomenclature. I think of an operator as something
    which 'operates' on one or more 'operands' (ostensibly at run time but,
    for example, operations involving constants may be pre-evaluated in the compiler).

    I agree that C has some 'operators' which do not do that - particularly
    sizeof which rather than emitting code to calculate the size really
    changes subsequent evaluation rules (!) so that what follows is not even evaluated! The oddity of that is, IMO, reflected in the number of times
    the C standards include words such as

    "except in the case of sizeof ..."

    So I agree with you about sizeof and _Alignof (not that I've ever used
    it but I can guess from the name what it's for).

    However, function calls, assignment, casts, and comma fit what I would
    call operators because they operate on values ostensibly at run time.

    Structure and union member accesses are interesting ones. First
    impression is that I would call them operators because they add the
    field offset to the expression on their left and are thus examples of
    what I call 'addressing functions', as is array indexing or even
    locating a node in a complex data structure.

    That said, compared with C's sizeof & is more of an operator because
    while it says /not/ to do something it at least says not to do something
    to the value to which a subexpression evaluates. :-) In my original post
    the issue is raised as to whether & is better preceding or following the subexpression it relates to.

    I have wondered in the past whether there's a more logical replacement
    for 'operators' which change evaluation rules such as sizeof but I've
    not [yet :-)] come up with one.


    --
    James Harris

  • From Bart@21:1/5 to David Brown on Mon Nov 8 10:19:30 2021
    On 08/11/2021 08:52, David Brown wrote:
    On 08/11/2021 00:57, James Harris wrote:

    Consider how C uses its 'address of' operator, & as a prefix.

      &X      gets the address of X
      &X[4]   gets the address of X[4]
      &X.f    gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument

    I'm not commenting on your main points at the moment - I think it is an interesting view, and worth thinking about.

    However, your comment that "C's & is not a normal operator" is somewhat bizarre - it implies there is such a thing as a "normal operator".

    Well, you can't implement it with a function! Such as:

    int a;
    int* p = addressof(a);

    Although you can do it if you twist the language around, but in general,
    if 'a' normally means its value, you can't turn a value into the address
    where it's stored.
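A short C sketch of why a by-value function cannot do this job, together with two possible readings of "twisting the language around". All three names (`copy_has_same_address`, `identity`, `ADDRESSOF`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* v is a fresh copy of the caller's value, so &v is not the caller's address. */
static int copy_has_same_address(int v, const int *original)
{
    assert(original != NULL);
    return &v == original;              /* always 0: two distinct objects */
}

/* One twist: the caller must already supply the address; nothing is recovered. */
static int *identity(int *p)
{
    return p;
}

/* Another twist: textual substitution, so & still sees the name, not its value. */
#define ADDRESSOF(x) (&(x))
```

With `int a;`, `copy_has_same_address(a, &a)` is 0, whereas `ADDRESSOF(a)` and `identity(&a)` both give `&a`, but only because `&` was applied to the name before any value was taken.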

  • From Bart@21:1/5 to James Harris on Mon Nov 8 10:36:52 2021
    On 08/11/2021 07:53, James Harris wrote:
    On 08/11/2021 00:25, Bart wrote:

    In X = X, both sides are dereferenced, one for reading, one for writing:

              mov  D0,  [x]
              mov  [x], D0

    Rather than dereferenced do you mean that both sides are /accessed/?

    I should explain what I mean by dereferencing. I mean, effectively,
    following a pointer.

    Following a pointer and then doing what? If you have a chain of derefs
    like this:

    ***p = 0;

    The first two will be read, the last used for writing.

    IOW EAX contains an address and that instruction replaces it with the
    value at that address, aka it follows a pointer, aka it 'dereferences'!

    OK, now I understand. If you have a machine with one register which
    contains a pointer, and read the address at the pointer:

    mov R, [R]

    then R is replaced with the target. But that doesn't happen here:

    mov [R], 0 # R is unchanged

    It needn't happen here either:

    mov R2, [R] # R is unchanged

    I see 'dereferencing' as something to do with the type system.

    If P is a pointer, it might have type T*. If you dereference it, the
    value you get has type T. The '*' reference has disappeared! But that
    happens whether reading or writing:

    *Q = *P

    Both P and Q have type T*. During and after the assigning, they will
    still have type T*.

    To implement the assignment, * is used to dereference P's value of type
    T* to get a value X of type T, and * is used to dereference Q's value of
    type T*, to store X of type T.


      lea eax, [X]    ;the & operator

    But the * operator on the LHS would be suppressed.

    Only because you haven't shown it. But to write to the address in
    eax to complete the assignment, you have to use [eax].

    You seem to want to distinguish between an address used for reading
    ([eax]), and an address used for writing ([eax]).

  • From Charles Lindsey@21:1/5 to James Harris on Mon Nov 8 11:38:53 2021
    On 07/11/2021 23:57, James Harris wrote:
    I'll set out below what to my knowledge is a novel way of looking at certain aspects of expression parsing. Don't be alarmed, it doesn't parse Martian. In fact, I think (subject to correction) that it implements the normal kind of parsing that a programmer would be familiar with. But AISI it handles some of it
    in a simpler, more natural, and more understandable way than I've seen anywhere
    else.


    To explain, since the 1960s it has been traditional to think of some identifiers
    as resolving to lvalues and others to rvalues. However, I suggest below that another way of looking at matters is that when parsing an expression the presence of an identifier name such as

      X

    /always/ results not in the value but in the address of the named identifier X.
    An address is, of course, how it is interpreted in certain contexts. But programmers find it natural if in other contexts X is implicitly and automatically dereferenced to yield a value. Classically, in the assignment

      X = X

    even though they look the same the last X is dereferenced while the first is not.

    I think you have just re-invented Algol68.

    --
    Charles H. Lindsey ---------At my New Home, still doing my own thing------
    Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk Email: chl@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
    PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

  • From Andy Walker@21:1/5 to Charles Lindsey on Mon Nov 8 13:55:10 2021
    On 08/11/2021 11:38, Charles Lindsey wrote:
    On 07/11/2021 23:57, James Harris wrote:
    To explain, since the 1960s it has been traditional to think of
    some identifiers as resolving to lvalues and others to rvalues.

    I think you mean the '70s, or perhaps even the '80s? It
    didn't become in any way "traditional" until well after C became
    popular. Also, I suspect you meant "expression" rather than
    "identifier"?

    [...] Classically, in the assignment
    X = X
    even though they look the same the last X is dereferenced while the
    first is not.
    I think you have just re-invented Algol68.

    Everyone gets there in the end! It's so-o-o much simpler.

    --
    Andy Walker, Nottingham.
    Andy's music pages: www.cuboid.me.uk/andy/Music
    Composer of the day: www.cuboid.me.uk/andy/Music/Composers/Mendelssohn

  • From Bart@21:1/5 to James Harris on Mon Nov 8 14:08:46 2021
    On 07/11/2021 23:57, James Harris wrote:
    I'll set out below what to my knowledge is a novel way of looking at
    certain aspects of expression parsing. Don't be alarmed, it doesn't
    parse Martian. In fact, I think (subject to correction) that it
    implements the normal kind of parsing that a programmer would be
    familiar with. But AISI it handles some of it in a simpler, more
    natural, and more understandable way than I've seen anywhere else.


    To explain, since the 1960s it has been traditional to think of some identifiers as resolving to lvalues and others to rvalues. However, I suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

      X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in certain contexts. But programmers find it natural if in other contexts X is implicitly and automatically dereferenced to yield a value. Classically,
    in the assignment

      X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    What matters is semantics but contexts are easiest to discuss in terms
    of the syntax so I'll do that. In simple terms one could say that if an expression (of any sort) is followed by one of

      =   (assignment)
      .   (field selection)
      (   (function invocation)
      [   (array lookup)

    or is tweaked with increment or decrement operators (as in C's ++ and
    --) then the /address/ is used. In all other contexts, however, an
    implicit dereference is automatically inserted by a compiler such that the value at the designated address is used instead. To illustrate, consider

    This is pretty much my approach in my static language. However it is a
    little simplistic, and introduces some restrictions.

    For example, it makes it harder to apply "." and "[]" to values that are
    not in memory, since there is no address. This would apply to arrays and structs small enough to be located in registers, passed as value
    parameters, or returned from a function by value. Example:

    f().m
    g()[i]

    With a dynamic language, then it may be different yet again. In mine,
    anything you can apply "." or "[]" to is generally manipulated by
    reference, but can be used as though it was all by value. These will work:

    f().m
    g()[i]

    As can this:

    (a+b).m # (with suitable types and overloads)
    (c+d)[i] # c, d can be strings for example

    Here, "." and "[]" are being applied to an rvalue, something else that
    doesn't have an address.

  • From David Brown@21:1/5 to Bart on Mon Nov 8 16:35:02 2021
    On 08/11/2021 11:19, Bart wrote:
    On 08/11/2021 08:52, David Brown wrote:
    On 08/11/2021 00:57, James Harris wrote:

    Consider how C uses its 'address of' operator, & as a prefix.

       &X      gets the address of X
       &X[4]   gets the address of X[4]
       &X.f    gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument
    I'm not commenting on your main points at the moment - I think it is an
    interesting view, and worth thinking about.

    However, your comment that "C's & is not a normal operator" is somewhat
    bizarre - it implies there is such a thing as a "normal operator".

    Well, you can't implement it with a function! Such as:

     int a;
     int* p = addressof(a);

    Although you can do it if you twist the language around, but in general,
    if 'a' normally means its value, you can't turn a value into the address where it's stored.


    It is correct that you can't implement the & address operator as a
    function in C. But I can't see how that is relevant - you also cannot implement most other C operators as functions. My point is that as C
    operators go, & does not stand out as being unusual.

    Oh, and I'm sure you'll be pleased to hear that in C++, not only can "addressof" be implemented as a function, but it is part of the standard library (so that you can always get the address of an object, even if
    its class has overridden the unary & operator).

  • From James Harris@21:1/5 to Bart on Mon Nov 8 18:30:03 2021
    On 08/11/2021 10:36, Bart wrote:
    On 08/11/2021 07:53, James Harris wrote:
    On 08/11/2021 00:25, Bart wrote:

    In X = X, both sides are dereferenced, one for reading, one for writing:
              mov  D0,  [x]
              mov  [x], D0

    Rather than dereferenced do you mean that both sides are /accessed/?

    I should explain what I mean by dereferencing. I mean, effectively,
    following a pointer.

    Following a pointer and then doing what? If you have a chain of derefs
    like this:

     ***p = 0;

    The first two will be read, the last used for writing.

    IOW EAX contains an address and that instruction replaces it with the
    value at that address, aka it follows a pointer, aka it 'dereferences'!

    OK, now I understand. If you have a machine with one register which
    contains a pointer, and read the address at the pointer:

      mov R, [R]

    then R is replaced with the target.

    Yes, that's approximately the model. Your R could, in practice, be the
    value at the top of the evaluation stack - even if the top word of the evaluation stack is kept in a register, if you see what I mean. But,
    yes, what I am calling a dereference would replace TOS with what TOS
    points at.

    Every variable reference such as

    X

    would (in terms of the parse tree) add a node for the address of X. Then
    in certain contexts only it would add a node to replace TOS with what
    TOS points at.
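
    A minimal sketch of that node-insertion rule, written in Python purely for illustration. The node classes (`Name`, `Add`, `Assign`), the `lower` function and the op names are all hypothetical; none of them comes from a compiler discussed in this thread. Every name first yields an address, and a `deref` op is appended only when the context wants a value.

```python
from dataclasses import dataclass

# Hypothetical mini-AST: names, addition and assignment only.
@dataclass
class Name:
    ident: str

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Assign:
    target: object
    value: object

def lower(node, want_address=False):
    """Return a list of stack-machine ops for node.

    A Name always produces its address first; a 'deref' op follows
    unless the surrounding context asked for the address itself.
    """
    if isinstance(node, Name):
        ops = [("addr", node.ident)]
        if not want_address:
            ops.append(("deref",))
        return ops
    if isinstance(node, Add):
        if want_address:
            raise TypeError("a sum has no address")
        return lower(node.left) + lower(node.right) + [("add",)]
    if isinstance(node, Assign):
        # The target is an address context; the value is not.
        return (lower(node.target, want_address=True)
                + lower(node.value)
                + [("store",)])
    raise TypeError(f"unknown node {node!r}")

ops = lower(Assign(Name("X"), Name("X")))
print(ops)  # [('addr', 'X'), ('addr', 'X'), ('deref',), ('store',)]
```

    In `X = X` only the second X picks up a `deref`, which is the asymmetry under discussion.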

    But that doesn't happen here:

      mov [R], 0           # R is unchanged

    It needn't happen here either:

      mov R2, [R]          # R is unchanged

    For both of those consider a fully generic model of assignment which
    strips out any recognition of particular cases:

    (expression 0) = (expression 1)

    In that, expression 0 can be absolutely anything legal which results in
    an address. Similarly, expression 1, type checking permitting, can be absolutely anything legal which results in the value to be stored at the aforementioned address. In register terms you could have the evaluated
    result of expression 0 in R0 and the evaluated result of expression 1 in
    R1. Then the assignment would be

    mov [R0], R1




    I see 'dereferencing' as something to do with the type system.

    Perhaps that's because an explicit dereference does, indeed, always
    convert one type to another, as you point out below. But the important
    point, here, is that a dereference replaces TOS with what TOS points at.


    If P is a pointer, it might have type T*. If you dereference it, the
    value you get has type T. The '*' reference has disappeared! But that
    happens whether reading or writing:

      *Q = *P

    Both P and Q have type T*. During and after the assigning, they will
    still have type T*.

    To implement the assignment, * is used to dereference P's value of type
    T* to get a value X of type T, and * is used to dereference Q's value of
    type T*, to store X of type T.

    The pointers complicate matters a little but don't change anything. Your
    *Q=*P expression as used in C would still insert an automatic
    dereference after P. If @ indicates where that dereference happens then
    the expression would be

    *Q = *P@

    And it's not the assignment operator which inserts the dereference.
    Consider

    *Q = *P@ + *P@

    C would add those two auto dereferences. Why? In simple terms because
    neither P is followed by one of the operators which inhibit auto
    dereference. By contrast, Q is followed by = and so it gets no auto dereference.




       lea eax, [X]    ;the & operator

    But the * operator on the LHS would be suppressed.

    Only because you haven't shown it. But to write to the address in
    eax to complete the assignment, you have to use [eax].

    You seem to want to distinguish between an address used for reading
    ([eax]), and an address used for writing ([eax]).


    I can only suggest to think of it in terms of R0 and R1, as above, where
    the expression on the LHS is evaluated to produce R0 and the expression
    on the RHS is evaluated to produce R1. R0 has to end up holding an
    address (because it will be used in the assignment operation). By
    contrast, R1 has to end up holding a value (because it could have been
    formed by operators which work on values - e.g. addition - and thus have
    no addressable location).
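
    The R0/R1 picture can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anyone's actual code generator: memory is a flat list, addresses are list indices, and the symbol table, the variable addresses and the helper names (`address_of`, `deref`, `assign`) are invented for the example.

```python
# Hypothetical flat memory: an address is simply a list index.
memory = [0] * 16
symbols = {"X": 3, "Y": 7}   # assumed addresses of two variables

def address_of(name):
    """What a bare identifier yields: an address."""
    return symbols[name]

def deref(address):
    """Replace an address with the value stored there."""
    return memory[address]

def assign(r0, r1):
    """mov [R0], R1: R0 holds an address, R1 holds a value."""
    memory[r0] = r1

memory[symbols["Y"]] = 40

# X = Y + 2
r0 = address_of("X")              # LHS stays an address
r1 = deref(address_of("Y")) + 2   # RHS is dereferenced, then added to
assign(r0, r1)
```

    The sum in R1 has no address of its own, which is why only R0 can serve as the target of the store.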


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Mon Nov 8 19:54:46 2021
    On 08/11/2021 18:30, James Harris wrote:
    On 08/11/2021 10:36, Bart wrote:
    On 08/11/2021 07:53, James Harris wrote:
    On 08/11/2021 00:25, Bart wrote:

    In X = X, both sides are dereferenced, one for reading, one for
    writing:

              mov  D0,  [x]
              mov  [x], D0

    Rather than dereferenced do you mean that both sides are /accessed/?

    I should explain what I mean by dereferencing. I mean, effectively,
    following a pointer.

    Following a pointer and then doing what? If you have a chain of
    derefs like this:

      ***p = 0;

    The first two will be read, the last used for writing.

    IOW EAX contains an address and that instruction replaces it with the
    value at that address, aka it follows a pointer, aka it 'dereferences'!

    OK, now I understand. If you have a machine with one register which
    contains a pointer, and read the address at the pointer:

       mov R, [R]

    then R is replaced with the target.

    Yes, that's approximately the model. Your R could, in practice, be the
    value at the top of the evaluation stack - even if the top word of the evaluation stack is kept in a register, if you see what I mean. But,
    yes, what I am calling a dereference would replace TOS with what TOS
    points at.

    So for you, dereferencing can only ever produce an rvalue.

    Using an analogy of numbered lockers, if you had a card in your hand
    with locker number 37 on it, dereferencing is the process of opening
    door 37, and extracting some artefact.

    But if you had the card in one hand, and already had an artefact in the
    other, what would you call the process of opening door 37, and
    /inserting/ that object?

    To me, acting on that '37' by opening the door to the locker is
    'dereferencing' whether you put something in or take something out.

    Going back to code, take this example:

    *Q += *P

    Now, *Q has to be dereferenced to extract a value, modify it with *P,
    and put it back.

    But that doesn't happen here:

       mov [R], 0           # R is unchanged

    It needn't happen here either:

       mov R2, [R]          # R is unchanged

    For both of those consider a fully generic model of assignment which
    strips out any recognition of particular cases:

      (expression 0) = (expression 1)

    In that, expression 0 can be absolutely anything legal which results in
    an address. Similarly, expression 1, type checking permitting, can be absolutely anything legal which results in the value to be stored at the aforementioned address. In register terms you could have the evaluated
    result of expression 0 in R0 and the evaluated result of expression 1 in
    R1. Then the assignment would be

      mov [R0], R1

    At one time I used to transform my assignments so that:

    A := B

    was processed as:

    (&A)^ := B

    (&A)^ is exactly equivalent to the auto-dereferencing of variables that
    would go on (as lvalue or rvalue), but it was done like that to ensure
    the LHS was actually an lvalue. So trying:

    345 := B

    wouldn't work.



    I see 'dereferencing' as something to do with the type system.

    Perhaps that's because an explicit dereference does, indeed, always
    convert one type to another, as you point out below. But the important
    point, here, is that a dereference replaces TOS with what TOS points at.

    It depends on the 'instruction' set of the virtual machine.

    I normally do A := B with:

    push B
    pop A

    I could also do it like this:

    push &B
    pushptr # replace TOS with *TOS - your 'deref'
    pop A

    or doing it both sides:

    push &B
    pushptr
    push &A
    popptr

    In the case of C := A := B, the last popptr would be replaced with:

    storeptr # does not pop the stack
    pop C # or push &C; popptr

    So for me, it's also about the operations. Simple loads and stores use
    PUSH and POP (or STORE), which use immediate operands; anything more
    elaborate uses the more general purpose PUSHPTR and POPPTR (and
    STOREPTR), whose operands are addresses.
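
    For readers who want to step through those sequences, here is a toy Python interpreter for the instructions named above. It is a sketch based only on the descriptions in this post: the exact semantics (for instance that POPPTR pops the address first and then the value) are assumptions, `pushaddr` stands in for what is written `push &X`, and the memory layout is invented.

```python
def run(program, memory, symbols):
    """Execute a list of (op, *args) tuples on a simple stack machine."""
    stack = []
    for op, *args in program:
        if op == "push":          # push the value of a named variable
            stack.append(memory[symbols[args[0]]])
        elif op == "pushaddr":    # 'push &X': push the address of X
            stack.append(symbols[args[0]])
        elif op == "pop":         # pop a value into a named variable
            memory[symbols[args[0]]] = stack.pop()
        elif op == "pushptr":     # replace TOS with the value it points at
            stack.append(memory[stack.pop()])
        elif op == "popptr":      # pop an address, then pop a value into it
            address = stack.pop()
            memory[address] = stack.pop()
        elif op == "storeptr":    # like popptr, but the value stays on the stack
            address = stack.pop()
            memory[address] = stack[-1]
        else:
            raise ValueError(f"unknown op {op!r}")
    return stack

symbols = {"A": 0, "B": 1, "C": 2}   # hypothetical addresses

# A := B with immediate operands
simple_memory = [0, 9, 0]
run([("push", "B"), ("pop", "A")], simple_memory, symbols)

# C := A := B using the pointer forms
chain_memory = [0, 9, 0]
leftover = run(
    [("pushaddr", "B"), ("pushptr",),
     ("pushaddr", "A"), ("storeptr",),
     ("pop", "C")],
    chain_memory, symbols,
)
```

    After the second program all three variables hold B's value and the stack is empty.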

  • From James Harris@21:1/5 to Bart on Mon Nov 8 20:31:51 2021
    On 08/11/2021 19:54, Bart wrote:
    On 08/11/2021 18:30, James Harris wrote:
    On 08/11/2021 10:36, Bart wrote:

    ...

    OK, now I understand. If you have a machine with one register which
    contains a pointer, and read the address at the pointer:

       mov R, [R]

    then R is replaced with the target.

    Yes, that's approximately the model. Your R could, in practice, be the
    value at the top of the evaluation stack - even if the top word of the
    evaluation stack is kept in a register, if you see what I mean. But,
    yes, what I am calling a dereference would replace TOS with what TOS
    points at.

    So for you, dereferencing can only ever produce an rvalue.

    I think l/r relates to how a value is treated rather than to anything
    intrinsic about it. Dereferencing could produce a pointer, for example!


    Using an analogy of numbered lockers, if you had a card in your hand
    with locker number 37 on it, dereferencing is the process of opening
    door 37, and extracting some artefact.

    No, if I had a card with 37 on it dereferencing would be opening locker
    37 and replacing the card with what's in the locker. In asm

      mov eax, 37       <== get card in hand
      mov eax, [eax]    <== dereference

    and there could be as many of the latter as necessary

      mov eax, [eax]    <== dereference
      mov eax, [eax]    <== dereference
      mov eax, [eax]    <== dereference
      mov eax, [eax]    <== dereference

    In that sense, the general case is really simple. Every dereference
    would be exactly that one instruction.
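
    The same idea as a small Python sketch, with a dictionary standing in for the lockers. The locker contents and the `deref_chain` helper are made up for the example; each loop iteration plays the part of one `mov eax, [eax]`.

```python
def deref_chain(memory, address, count):
    """Follow a pointer `count` times, replacing the address each time."""
    for _ in range(count):
        address = memory[address]   # one 'mov eax, [eax]'
    return address

# Hypothetical lockers: 37 holds 12, 12 holds 5, 5 holds 99.
lockers = {37: 12, 12: 5, 5: 99}

once = deref_chain(lockers, 37, 1)    # 12
thrice = deref_chain(lockers, 37, 3)  # 99
```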


    But if you had the card in one hand, and already had an artefact in the other, what would you call the process of opening door 37, and
    /inserting/ that object?

    I would call that /accessing/.


    To me, acting on that '37' by opening the door to the locker is 'dereferencing' whether you put something in or take something out.

    Fine but that's not what I was thinking of when I used the term. See

    https://en.wikipedia.org/wiki/Dereference_operator

    where it speaks about *returning the value at the pointer address*.


    Going back to code, take this example:

       *Q += *P

    Now, *Q has to be dereferenced to extract a value, modify it with *P,
    and put it back.


    ...

    I see 'dereferencing' as something to do with the type system.

    Perhaps that's because an explicit dereference does, indeed, always
    convert one type to another, as you point out below. But the important
    point, here, is that a dereference replaces TOS with what TOS points at.

    It depends on the 'instruction' set of the virtual machine.

    I normally do A := B with:

        push B
        pop A

    Fine but that wouldn't work if A and B were arbitrary expressions.


    I could also do it like this:

        push &B
        pushptr      # replace TOS with *TOS - your 'deref'
        pop A

    What would that look like if the assignment were

    A := B + C

    ?


    or doing it both sides:

        push &B
        pushptr
        push &A
        popptr

    That looks close to what I have been talking about but what if the terms
    were expressions such as

    A[2] := B[3] + C[4]

    ?


    --
    James Harris

  • From Bart@21:1/5 to James Harris on Mon Nov 8 21:08:42 2021
    On 08/11/2021 20:31, James Harris wrote:
    On 08/11/2021 19:54, Bart wrote:

    I would call that /accessing/.

    OK. We'll have to disagree on that point. Except, what would you call
    what happens on the LHS here:

    *Q += *P



    To me, acting on that '37' by opening the door to the locker is
    'dereferencing' whether you put something in or take something out.

    Fine but that's not what I was thinking of when I used the term. See

      https://en.wikipedia.org/wiki/Dereference_operator

    where it speaks about *returning the value at the pointer address*.

    That looks a poorly written article.

    Note that it uses examples of "*" and "^" for dereference operators, but
    fails to address the fact those same operators are also used on the LHS
    of an assignment.

    However look on the section on Pascal, where it mentions 'dereference'
    but the only examples are on the LHS of an assignment, notably:

    Complex^ := Complex

    I could also do it like this:

         push &B
         pushptr      # replace TOS with *TOS - your 'deref'
         pop A

    What would that look like if the assignment were

      A := B + C

    If the pushes were done via PUSHPTR, then:

                     Stack (grows LTR)

        push &B      &B
        pushptr      B
        push &C      B &C
        pushptr      B C
        add          B+C
        pop A        -

    That looks close to what I have been talking about but what if the terms
    were expressions such as

      A[2] := B[3] + C[4]

    That gets complicated to do by hand. The actual IR I generate for that is:

        push           &b
        push           3          i64
        pushptroff                i64 8 -8
        push           &c
        push           4          i64
        pushptroff                i64 8 -8
        add                       i64
        push           &a
        push           2          i64
        popptroff                 i64 8 -8


    PUSHPTROFF is like PUSHPTR but takes an offset, which can be scaled and
    a further constant offset added (the 8 and -8 shown). It's equivalent to
    this C:

    *((char*)A+2*8-8) = *((char*)B+3*8-8) + *((char*)C+4*8-8)

    The -8 is due to my arrays being 1-based. This reduces down to this x64
    code:

    mov D0, [b+16]
    mov D1, [c+24]
    add D0, D1
    mov [a+8], D0
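
    The offsets in that x64 code follow directly from the scale and bias. A short Python check of the arithmetic, assuming 8-byte elements and 1-based indexing as described above; the base addresses and the `element_address` helper are hypothetical, chosen only so the example can run.

```python
ELEMENT_SIZE = 8   # bytes per i64 element
BIAS = -8          # constant offset that makes the arrays 1-based

def element_address(base, index, scale=ELEMENT_SIZE, bias=BIAS):
    """Address computed by a PUSHPTROFF/POPPTROFF-style operation."""
    return base + index * scale + bias

# Hypothetical base addresses for the three arrays.
bases = {"a": 0x1000, "b": 0x2000, "c": 0x3000}

# Byte-addressed memory modelled as a dict of address -> i64 value.
memory = {
    element_address(bases["b"], 3): 10,   # B[3], at b+16
    element_address(bases["c"], 4): 32,   # C[4], at c+24
}

# A[2] := B[3] + C[4]
d0 = memory[element_address(bases["b"], 3)]
d1 = memory[element_address(bases["c"], 4)]
memory[element_address(bases["a"], 2)] = d0 + d1   # stored at a+8
```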

  • From James Harris@21:1/5 to Bart on Tue Nov 9 10:42:51 2021
    On 08/11/2021 21:08, Bart wrote:
    On 08/11/2021 20:31, James Harris wrote:
    On 08/11/2021 19:54, Bart wrote:

    I would call that /accessing/.

    OK. We'll have to disagree on that point.

    Sure.

    Except,

    :-)

    what would you call
    what happens on the LHS here:

       *Q += *P

    I don't know. I don't support augmented assignment at the moment but
    thinking it through I guess that in the general case that would resolve to

    R0 = LHS expression
    R1 = RHS expression

    then

    add [R0], R1

    There's no dereference in the assignment itself. But in

    *Q += *P * *P

    I'd say that the two arguments to * would be dereferenced before the multiplication takes place. Therefore, in your example,

    *Q += *P

    while there would be a dereference after P it would be due to P's context
    (i.e. not being followed by a symbol which suppresses dereferences)
    rather than to the assignment operation.
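
    One possible reading of that, sketched in Python. The memory layout is invented, and the way the explicit `*` and the automatic dereference are split up here is an interpretation of the model rather than something spelled out in the thread: the name yields the variable's address, the explicit `*` follows it to the pointed-at location, and the automatic dereference (applied to the right-hand operands only) fetches the value.

```python
# Hypothetical memory: P lives at address 0, Q at address 1.
# P points at address 2 (which holds 6); Q points at address 3 (which holds 10).
memory = [2, 3, 6, 10]
symbols = {"P": 0, "Q": 1}

def star(address):
    """Explicit '*': follow the pointer stored at this address."""
    return memory[address]

def auto_deref(address):
    """Implicit dereference inserted where a value is needed."""
    return memory[address]

# *Q += *P * *P
r0 = star(symbols["Q"])                   # address of Q's target, no auto deref
operand = auto_deref(star(symbols["P"]))  # value of P's target
r1 = operand * operand
memory[r0] += r1                          # add [R0], R1
```

    The read of the old value at `[R0]` happens inside the final add, not as a separate dereference of the left-hand side.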

    ...

    I could also do it like this:

         push &B
         pushptr      # replace TOS with *TOS - your 'deref'
         pop A

    What would that look like if the assignment were

       A := B + C

    If the pushes were done via PUSHPTR, then:

                         Stack (grows LTR)

            push &B      &B
            pushptr      B
            push &C      B &C
            pushptr      B C
            add          B+C
            pop A        -

    OK. Then as it replaces &B with B I'd say your pushptr instructions are dereference operations.

    Note that your code results in the /value/ B+C on the stack, not an
    address.



    That looks close to what I have been talking about but what if the
    terms were expressions such as

       A[2] := B[3] + C[4]

    That gets complicated to do by hand. The actual IR I generate for that is:

        push           &b
        push           3          i64
        pushptroff                i64 8 -8
        push           &c
        push           4          i64
        pushptroff                i64 8 -8
        add                       i64
        push           &a
        push           2          i64
        popptroff                 i64 8 -8


    PUSHPTROFF is like PUSHPTR but takes an offset, which can be scaled and
    a further constant offset added (the 8 and -8 shown). It's equivalent to
    this C:

        *((char*)A+2*8-8) = *((char*)B+3*8-8) + *((char*)C+4*8-8)

    The -8 is due to my arrays being 1-based. This reduces down to this x64
    code:

              mov D0, [b+16]
              mov D1, [c+24]
              add D0, D1
              mov [a+8], D0

    OK. In resolving to [a+8] your optimiser has defeated part of the point
    I wanted to make which was that the LHS can be some arbitrarily complex expression and so would end up being an address stored in a register but
    your example does still show that the RHS gets resolved to D0 holding a
    value rather than an address.


    --
    James Harris

  • From James Harris@21:1/5 to Andy Walker on Sat Feb 12 18:07:32 2022
    On 08/11/2021 13:55, Andy Walker wrote:
    On 08/11/2021 11:38, Charles Lindsey wrote:
    On 07/11/2021 23:57, James Harris wrote:
    To explain, since the 1960s it has been traditional to think of
    some identifiers as resolving to lvalues and others to rvalues.

        I think you mean the '70s, or perhaps even the '80s?  It
    didn't become in any way "traditional" until well after C became
    popular.  Also, I suspect you meant "expression" rather than
    "identifier"?

    I meant 'identifier' in the context of the post but you are right that
    this is meant to apply to subexpressions. I've just broached the wider
    subject in my reply to Charles (qv).

    If I have gone down a path the designers of Algol went down then I am in
    good company, albeit a long way behind them.

    As also mentioned to Charles, ISTM such flexibility needs to be tamed
    and made safe to use.



    --
    James Harris

  • From James Harris@21:1/5 to Charles Lindsey on Sat Feb 12 18:02:42 2022
    On 08/11/2021 11:38, Charles Lindsey wrote:
    On 07/11/2021 23:57, James Harris wrote:
    I'll set out below what to my knowledge is a novel way of looking at
    certain aspects of expression parsing. Don't be alarmed, it doesn't
    parse Martian. In fact, I think (subject to correction) that it
    implements the normal kind of parsing that a programmer would be
    familiar with. But AISI it handles some of it in a simpler, more
    natural, and more understandable way than I've seen anywhere else.


    To explain, since the 1960s it has been traditional to think of some
    identifiers as resolving to lvalues and others to rvalues. However, I
    suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

       X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in
    certain contexts. But programmers find it natural if in other contexts
    X is implicitly and automatically dereferenced to yield a value.
    Classically, in the assignment

       X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    I think you have just re-invented Algol68.


    It's funny you should say that. I do think there's a similarity to
    Algol68 in returns from functions which I hadn't even mentioned but will
    do so now.

    Consider a subexpression such as

    A + B

    I suggested before that both A and B should initially be taken
    semantically as being addresses and that in the context in which they
    appear both would be dereferenced because, in simple terms, they are not followed by one of the symbols which inhibit dereferences:

    . member selection
    ( function call
    [ array indexing
    = assignment
    * do nothing (other than inhibit dereference)

    Now consider

    A[1] + B(0)

    To be consistent, the subexpression A[1] would also yield the /address/
    of the element rather than its value. Then assignment to an element
    would happen naturally:

    A[1] = A[2]

    Because of the = sign the A[1] would not be dereferenced. By contrast,
    because there's no inhibiting symbol the A[2] /would/ be dereferenced -
    exactly as for simple variables and as expected in most familiar
    programming languages.

    Now, here's the point I wanted to add: To be even more consistent the
    same would be true of B(0). It, too, would result in the address of the
    return value rather than the value itself. AIUI that is what Algol68
    also does but it could lead to some strange expressions such as

    B(0) = 5

    To 'tame' that I am thinking that a function would have to explicitly
    mark any return as being writeable if it could be assigned to by the
    caller. Any return which was not thus marked would only be treatable as
    a value. I think that offers the best of both worlds: consistent
    treatment but with the default option being safe.
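
    A rough Python sketch of how such a marking might behave. Everything here is hypothetical (the `Ref` type, its `writable` flag, the two sample functions and the choice of exception); it only illustrates the proposed default of value-only returns, not an implemented language feature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """An address returned by a function, value-only unless marked."""
    address: int
    writable: bool = False

def deref(memory, ref):
    """Reading through a returned address is always allowed."""
    return memory[ref.address]

def assign(memory, ref, value):
    """Writing is allowed only when the return was marked writeable."""
    if not ref.writable:
        raise PermissionError("function result is not marked writeable")
    memory[ref.address] = value

def b_plain(index):
    """A function whose return is not marked: usable as a value only."""
    return Ref(index)

def b_writable(index):
    """A function that explicitly marks its return as writeable."""
    return Ref(index, writable=True)

memory = [1, 2, 3]

assign(memory, b_writable(0), 5)      # B(0) = 5 is accepted
try:
    assign(memory, b_plain(1), 5)     # rejected: default is value-only
    rejected = False
except PermissionError:
    rejected = True
```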


    --
    James Harris

  • From Charles Lindsey@21:1/5 to James Harris on Sun Feb 13 15:26:07 2022
    On 12/02/2022 18:02, James Harris wrote:
    On 08/11/2021 11:38, Charles Lindsey wrote:

    I think you have just re-invented Algol68.


    It's funny you should say that. I do think there's a similarity to Algol68 in returns from functions which I hadn't even mentioned but will do so now.

    Consider a subexpression such as

      A + B

    I suggested before that both A and B should initially be taken semantically as
    being addresses and that in the context in which they appear both would be dereferenced because, in simple terms, they are not followed by one of the symbols which inhibit dereferences:

      . member selection
      ( function call
      [ array indexing
      = assignment
      * do nothing (other than inhibit dereference)

    Now consider

      A[1] + B(0)

    To be consistent, the subexpression A[1] would also yield the /address/ of the
    element rather than its value. Then assignment to an element would happen naturally:

      A[1] = A[2]

    Because of the = sign the A[1] would not be dereferenced. By contrast, because
    there's no inhibiting symbol the A[2] /would/ be dereferenced - exactly as for
    simple variables and as expected in most familiar programming languages.

    Now, here's the point I wanted to add: To be even more consistent the same would
    be true of B(0). It, too, would result in the address of the return value rather
    than the value itself. AIUI that is what Algol68 also does but it could lead to
    some strange expressions such as

      B(0) = 5

    To 'tame' that I am thinking that a function would have to explicitly mark any
    return as being writeable if it could be assigned to by the caller. Any return
    which was not thus marked would only be treatable as a value. I think that offers the best of both worlds: consistent treatment but with the default option
    being safe.

    Yes, Algol 68 takes care of all that. The LHS of an assignment MUST be a reference (otherwise you are trying to assign to a constant). That means you know, at compile time, the exact type expected on the RHS (hence it is a "strong" context), so you can use any known coercion to make it so (usually dereferencing as in your examples). In the case of an operator in an expression,
    the context of each operand is "weak", so fewer coercions are permitted; the reason for this is that operators can be overloaded (there is no overloading of
    functions in Algol 68).

    C gets to more or less the same result by mumbling about LHS and RHS values, which is harder to get your head around.

    When it comes to arrays (and structures too), if the type of A is reference-to-row-of-something, then the type of A[0] is reference-to-something, so you can assign to it, dereferencing the RHS if necessary. But if the type of A is just row-of-something, then it is a constant array, and A[0] is a constant something.

    --
    Charles H. Lindsey ---------At my New Home, still doing my own thing------
    Tel: +44 161 488 1845 Web: https://www.clerew.man.ac.uk Email: chl@clerew.man.ac.uk Snail-mail: Apt 40, SK8 5BF, U.K.
    PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5

  • From Alexei A. Frounze@21:1/5 to James Harris on Sun Feb 13 13:49:23 2022
    On Sunday, November 7, 2021 at 3:57:40 PM UTC-8, James Harris wrote:
    [Joining late, haven't read all of the conversation.]

    I'll set out below what to my knowledge is a novel way of looking at
    certain aspects of expression parsing. Don't be alarmed, it doesn't
    parse Martian. In fact, I think (subject to correction) that it
    implements the normal kind of parsing that a programmer would be
    familiar with. But AISI it handles some of it in a simpler, more
    natural, and more understandable way than I've seen anywhere else.

    Not sure about simpler/more natural. There may be "some" regularization,
    true.

    To explain, since the 1960s it has been traditional to think of some
    identifiers as resolving to lvalues and others to rvalues.

    C enumeration constants are never lvalues by design.
    C arrays are a somewhat artificial construct and so can be viewed
    as both or neither (you can't directly assign to an entire array
    with = unless you're assigning to a struct that contains an array,
    and you can't pass an array by value unless it's again wrapped in a
    struct, but an array still contains some other lvalues in the end).

    However, I
    suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

    X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in certain
    contexts. But programmers find it natural if in other contexts X is
    implicitly and automatically dereferenced to yield a value. Classically,
    in the assignment

    X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    Um... It may be somewhat confusing because dereferences for the
    purpose of reading from memory and dereferences for the purpose of
    writing to memory appear somewhat different.

    C's assign operators require their left operand to be an lvalue.
    What is an lvalue (in, perhaps, a somewhat mechanistic view)?
    It's an expression formed by dereferencing an address. And you
    naturally need a memory address to both read and write memory.

    But your = operator by itself screams in your face "I'm a memory
    writing dereference!". Effectively, you may think that the lvalue's
    own dereference and the one implied by the = operator are
    duplicating one another or are two parts of one thing. Either way,
    when you're writing a C compiler, once you've checked the types
    in the assignment expression, you end up either eliminating the
    lvalue's own dereference or you somehow fuse it with =
    because in the end you generate just a single memory store
    instruction that represents both the dereference and =.

    Given this you may indeed think that the left operand of =
    needs to be no more than an address and it's somehow
    different from the right operand of =. But you need to consider
    both, the dereference and =, together.

    What matters is semantics but contexts are easiest to discuss in terms
    of the syntax so I'll do that. In simple terms one could say that if an
    expression (of any sort) is followed by one of

    = (assignment)
    . (field selection)
    ( (function invocation)
    [ (array lookup)

    or is tweaked with increment or decrement operators (as in C's ++ and
    --) then the /address/ is used. In all other contexts, however, an
    implicit dereference is automatically inserted by a compiler such that the
    value at the designated address is used instead. To illustrate, consider

    A[2][4]

    Note that after both A *and* the first closing square bracket there is
    no dereference.

    Strictly speaking, there is one, and its result is, as usual, an element of
    the array. That element happens to be another array, which luckily needs no
    memory read/write (yet); only pointer arithmetic is needed here.
    But with a further dereference you will have to access memory because
    that array element is not an array anymore.

    But in a different language every array element access (dereference/
    subscript) may involve memory access. Java's multidimensional
    arrays are implemented that way: an element of one array is a pointer to
    (or an address of) another array. And you need to fetch addresses of
    subarrays, you can't simply compute them by adding an index to
    the pointer to the enclosing array.
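
    The difference between the two layouts can be sketched in Python. The
    flat 'memory' list, the base addresses and the 'reads' counter are all
    assumptions made for illustration. With one contiguous block the element
    address comes from arithmetic alone; with an array of row addresses one
    fetch is needed before the final element address is known.

```python
# Hypothetical flat memory; an address is an index into this list.
memory = [0] * 64
reads = 0                     # counts memory fetches performed while indexing

def load(addr):
    """A dereference: fetch the value stored at addr."""
    global reads
    reads += 1
    return memory[addr]

ROWS, COLS = 3, 5

# C-style layout: one contiguous block, rows stored back to back.
A = 0                                   # base address of the block

def c_style_addr(i, j):
    """Address of A[i][j] by arithmetic alone; nothing is read from memory."""
    return A + i * COLS + j

# Java-style layout: an array of row addresses, each row stored separately.
J = 20                                  # base address of the table of row addresses
row_base = [30, 40, 50]                 # where each row happens to live
for i, base in enumerate(row_base):
    memory[J + i] = base                # the outer array holds addresses

def java_style_addr(i, j):
    """Address of J[i][j]; the row address must be fetched from memory first."""
    return load(J + i) + j

addr_c = c_style_addr(2, 4)
reads_after_c = reads                   # still 0: only arithmetic so far
addr_j = java_style_addr(2, 4)
reads_after_java = reads                # 1: one fetch for the row address
```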

    In syntax terms one can consider that that's because
    each is followed by one of the aforementioned symbols. IOW both A and
    the first closing square bracket are followed by an opening square
    bracket so there is no dereference. But there /is/ an automatic
    dereference after the final square bracket because it is not followed by
    one of the listed symbols. So the key as to whether an automatic
    dereference is inserted or not is what comes next after an expression.

    That's very flexible, allowing expressions to work with an arbitrary
    number of addresses. For example,

    B = A[2][4][6][8][10]

    etc. That expression uses addresses all the way through. Each array
    lookup results in yet another address. Only after the final square
    bracket would there be a dereference.

    Doesn't have to be that way. Java is an example.

    Of course, it's not just array indexing. Anything which /produces/ an
    address can have its output fed into anything which /uses/ an address
    and such operators can be combined arbitrarily. For example,

    vectors[1](2).data[3] = y

    Such an expression may be horrendous but illustrates how a programmer
    could combine addresses in any way desired. Only after the y would there
    be a dereference.

    It's probably important to note that the above expression in C produces
    a temporary value (the function return value) that needs to hang around
    for a while in order for .data to be accessed off it. Mechanically
    it needs to be an lvalue, but it's short lived and messing with it is
    therefore troublesome, hence the standard says modifying the return
    value yields undefined behavior. That is, if that data member is a
    pointer, the expression may be well formed. If data is an array, you
    have UB right there where you attempt to modify its 3rd element.

    (Perhaps it's strange that as programmers we accept the inconsistency
    that some contexts get implicit dereferences and some don't. But we
    would probably not want to write all deref or no-deref points in code.
    So we are where we are.)

    Definitely, you don't have to expose the underlying mechanics when
    it creates unnecessary friction (e.g. in form of verbosity and mental
    effort). But with enough shortcuts you may end up looking at a
    collection of nonuniform things. C arrays have their own problems,
    C string literals add to this, and then again you don't have to have both
    . and -> to access members of a structure. And then there's C++ with
    a mess of different ways to construct and initialize objects using
    different syntaxes.
    I particularly like Pascal's approach to passing variables by reference:
    just put "var" before the parameter and the additional associated
    dereferences will be generated by the compiler.

    Importantly, it is always possible to dereference an address to get a
    value but there is no way to operate on a value to get its address.

    When implementing a C compiler you may treat most (if not all)
    expressions as trees with operators in non-leaf nodes and
    integer/float numbers and addresses in leaf nodes.
    That's all there is, pretty much.
    Structures, arrays, complex numbers don't map onto the CPU registers
    and don't make it to the backend level.
    So, your "cond ? struct1 : struct2" transforms into
    "*(cond ? &struct1 : &struct2)" under the hood just like
    "struct1 = struct2" transforms into
    "memcpy(&struct1, &struct2, sizeof struct1)"
    or something similar that can more readily be translated into
    CPU instructions and mapped into its registers.
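
    A rough Python model of that lowering, assuming a flat 'memory' list in
    which a struct is just a run of consecutive cells. The helper names
    select and copy_block, and the addresses used, are hypothetical. The
    conditional picks an address rather than a struct, and the assignment
    becomes a block copy between two addresses.

```python
# Hypothetical flat memory; a "struct" is just SIZE consecutive cells.
memory = [0] * 16
SIZE = 3
struct1, struct2, dest = 0, 4, 8        # base addresses, chosen arbitrarily
memory[struct1:struct1 + SIZE] = [1, 2, 3]
memory[struct2:struct2 + SIZE] = [7, 8, 9]

def select(cond, addr_a, addr_b):
    """Models (cond ? &a : &b): the conditional yields an address, not a struct."""
    return addr_a if cond else addr_b

def copy_block(dst, src, size):
    """Models memcpy(dst, src, size): struct assignment becomes a block copy."""
    memory[dst:dst + size] = memory[src:src + size]

# dest = cond ? struct1 : struct2
# is lowered to a copy from whichever address the conditional selects.
copy_block(dest, select(False, struct1, struct2), SIZE)
```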

    For
    that reason my precedence table has all the address-consuming operators
    first. That's probably true of most other languages as well but I've not
    seen that set out as a rationale.

    Consider how C uses its 'address of' operator, & as a prefix.

    &X gets the address of X
    &X[4] gets the address of X[4]
    &X.f gets the address of field f

    Yet C's & is not a normal operator. It does not transform its argument.
    As stated, it is not possible to get from a value to an address. So &E
    cannot evaluate E and then take its address. Therefore & is not an
    operator in the normal sense that it manipulates a value. Instead, &E
    inhibits the automatic dereference that would have been inserted at the
    end of E: it prevents emission of the dereference that the compiler
    would otherwise have emitted.

    There is, perhaps, an additional oddity that an 'operator' at the
    beginning of a subexpression really applies at the end of that
    subexpression.

    Well, if it helps to read the code, you could use parens, e.g. &(X[4]),
    but they are meaningless here. You could also prohibit large and complex
    expressions and require them to be broken down into shorter and
    simpler ones with e.g. temporary variables at every step, but
    that (temporaries and low code density) in itself is problematic.
    If you read X[4] into a temporary, taking its (temporary's) address
    wouldn't give you the address within X[], which is kinda bad.
    I think postfix expressions (and I mean not just postfix ++ and -- but
    all of this subscripting, calling, member accessing) in C are more useful
    than not.

    It may be more straightforward for & to be placed at the location where
    the dereference would otherwise have been.

    Assuming for discussion purposes that trailing & and infix & can be
    distinguished (so we don't need to use another symbol) the above
    expressions would become

    X& the address of X
    X[4]& the address of X[4]
    X.f& the address of field f

    Should we also use numeric negation this way, e.g. X[4]- in place of
    -X[4]? That would look pretty awkward to mathy people (not that
    they'd find it an insurmountable obstacle, I hope).
    I think what we've got in C here is good enough.

    Then the unary trailing & joins the symbols in the list above and
    becomes just another of the operators which, when it appears after an
    expression, inhibits the automatic dereference that would otherwise have
    occurred at that point:

    = assign
    . field selection
    ( function call
    [ index
    & nothing except, like all the others, inhibit dereference

    To summarise, there would no longer be the conceptual difference between
    lvalues and rvalues. All identifiers would be considered as producing
    which automatic dereference takes place, and the programmer would put &
    in any of those places where the automatic dereference was to be inhibited.

    Then you also need to distinguish pointer arithmetic from non-pointer
    arithmetic if you still want to keep both.
    If a-b now gives me the distance between a and b in memory instead of the
    numeric difference of the values stored at addresses a and b, it's kinda bad.
    Similarly, I don't always mean a pointer when I write a+1.
    No?

    AFAIK that's a new way of looking at addresses in expressions but maybe
    you know otherwise.

    More importantly, as a programmer how easy would you find it to think in those terms?

    I'd keep implicit pointers hidden. Seems like you want to expose them for no good reason.

    Alex

  • From James Harris@21:1/5 to Alexei A. Frounze on Mon Feb 14 18:38:49 2022
    On 13/02/2022 21:49, Alexei A. Frounze wrote:
    On Sunday, November 7, 2021 at 3:57:40 PM UTC-8, James Harris wrote:

    [Joining late, haven't read all of the conversation.]

    No problem. Welcome!

    ...

    However, I
    suggest below that another way of looking at matters is that when
    parsing an expression the presence of an identifier name such as

    X

    /always/ results not in the value but in the address of the named
    identifier X. An address is, of course, how it is interpreted in certain
    contexts. But programmers find it natural if in other contexts X is
    implicitly and automatically dereferenced to yield a value. Classically,
    in the assignment

    X = X

    even though they look the same the last X is dereferenced while the
    first is not.

    Um... It may be somewhat confusing because dereferences for the
    purpose of reading from memory and dereferences for the purpose of
    writing to memory appear somewhat different.

    Bart said similar but by 'dereference' I mean essentially what C's '*'
    prefix operator does.

    As humans we understand assignment so perhaps we focus on that specific
    case but consider that /in general/ an assignment requires two
    expressions. The first has to result in an address (or reference, if you
    prefer) but the second expression naturally results in a value. For
    example, in

    X = X + 1

    the LHS has to result in an address whereas the RHS naturally results in
    a value rather than an address.


    C's assign operators require their left operand to be an lvalue.
    What is an lvalue (in, perhaps, a somewhat mechanistic view)?
    It's an expression formed by dereferencing an address. And you
    naturally need a memory address to both read and write memory.

    But your = operator by itself screams in your face "I'm a memory
    writing dereference!". Effectively, you may think that the lvalue's
    own dereference and the one implied by the = operator are
    duplicating one another or are two parts of one thing. Either way,
    when you're writing a C compiler, once you've checked the types
    in the assignment expression, you end up either eliminating the
    lvalue's own dereference or you somehow fuse it with =
    because in the end you generate just a single memory store
    instruction that represents both the dereference and =.

    I should be clearer about terms:

    * reference: an address (or the equivalent)
    * dereference: fetch the value at the reference

    Such a dereference is /a monadic operation/ which takes what it assumes
    to be an address and yields the value at that address.

    As mentioned, it's akin to C's monadic asterisk operator (except that
    it's how code is processed; it's not present in source code).


    Given this you may indeed think that the left operand of =
    needs to be no more than an address and it's somehow
    different from the right operand of =. But you need to consider
    both, the dereference and =, together.

    In the example of

    X = X + 1

    note that both X's would initially be addresses but the second would be
    dereferenced (as defined above) because it is not followed by one of the
    operators which inhibit dereferences.

    ...

    In all other contexts, however, an
    implicit dereference is automatically inserted by a compiler such that the
    value at the designated address is used instead. To illustrate, consider

    A[2][4]

    Note that after both A *and* the first closing square bracket there is
    no dereference.

    Strictly speaking, there is one, and its result is, as usual, an element of
    the array. That element happens to be another array, which luckily needs no
    memory read/write (yet); only pointer arithmetic is needed here.

    Ah, no. I am proposing that

    A[2][4] = A[2][4] + 1

    would parse in exactly the same way as X = X + 1, above. The inner
    subexpression

    A[2][4]

    appears twice just as X appeared twice and the latter instance would be
    dereferenced just as the latter X was dereferenced because it is not
    followed by an operator which inhibits dereferences.

    Neither A nor A[2] would be dereferenced (as defined above) at any
    point. The expressions A, A[2] and A[2][4] would manipulate only addresses.


    But with a further dereference you will have to access memory because
    that array element is not an array anymore.

    Yes, a dereference changes the type from 'ref T' to T.

    ...

    Of course, it's not just array indexing. Anything which /produces/ an
    address can have its output fed into anything which /uses/ an address
    and such operators can be combined arbitrarily. For example,

    vectors[1](2).data[3] = y

    Such an expression may be horrendous but illustrates how a programmer
    could combine addresses in any way desired. Only after the y would there
    be a dereference.

    It's probably important to note that the above expression in C produces
    a temporary value (the function return value) that needs to hang around
    for a while in order for .data to be accessed off it. Mechanically
    it needs to be an lvalue, but it's short lived and messing with it is
    therefore troublesome, hence the standard says modifying the return
    value yields undefined behavior. That is, if that data member is a
    pointer, the expression may be well formed. If data is an array, you
    have UB right there where you attempt to modify its 3rd element.

    That sounds important but I can't parse it. If vectors[1] holds the
    address of a function and that function returns an address why is a
    temporary needed?

    Here's how the expression may be parsed:

    get the /address/ of the 'vectors' array
    because it's followed by "[" don't dereference it
    add 1 * sizeof a vector
    call the function at that address (with 2 as a parameter)
    add the offset of the field called 'data'
    add 3 * sizeof each element of data
    use that address in the assignment

    Each stage produces an address, even the function call.
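
    Those steps can be traced in a small Python model. Everything concrete
    here is an assumption made for illustration: the flat 'memory' list, the
    sizes and offsets, the 'code' table that maps an address to the function
    living there, and the record layout that f1 hands out. The point is only
    that each step transforms one address into another and that nothing is
    fetched until y is read.

```python
# Hypothetical flat memory; an address is an index into this list.
memory = [0] * 64

SIZEOF_VECTOR = 1      # each element of 'vectors' occupies one cell (assumed)
OFFSET_DATA = 2        # 'data' starts two cells into a record (assumed)
SIZEOF_ELEM = 1        # each element of 'data' occupies one cell (assumed)

VECTORS = 8            # address of the 'vectors' array
RECORDS = 32           # where the records handed out by the functions live
RECORD_SIZE = 8

def f0(n):
    raise AssertionError("not the function selected by vectors[1]")

def f1(n):
    """Returns the address of record n; no value is copied out."""
    return RECORDS + n * RECORD_SIZE

# Models "the function at that address": code is looked up by address.
code = {VECTORS + 0 * SIZEOF_VECTOR: f0, VECTORS + 1 * SIZEOF_VECTOR: f1}

y_addr = 4
memory[y_addr] = 99                # y holds 99

addr = VECTORS                     # address of 'vectors'; "[" follows, so no dereference
addr = addr + 1 * SIZEOF_VECTOR    # add 1 * sizeof a vector
addr = code[addr](2)               # call the function at that address with 2
addr = addr + OFFSET_DATA          # add the offset of the field called 'data'
addr = addr + 3 * SIZEOF_ELEM      # add 3 * sizeof each element of data
memory[addr] = memory[y_addr]      # y is dereferenced; its value is stored at addr
```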


    (Perhaps it's strange that as programmers we accept the inconsistency
    that some contexts get implicit dereferences and some don't. But we
    would probably not want to write all deref or no-deref points in code.
    So we are where we are.)

    Definitely, you don't have to expose the underlying mechanics when
    it creates unnecessary friction (e.g. in form of verbosity and mental
    effort).

    Though it's inconsistent. For example,

    X[Y]

    Even though the two names have the same form we dereference one (Y) but
    not the other (X).

    I'm not complaining about that, BTW, just pointing out that that's what
    we as programmers have got used to so a compiler has to deal with it.

    ...

    Consider how C uses its 'address of' operator, & as a prefix.

    &X gets the address of X
    &X[4] gets the address of X[4]
    &X.f gets the address of field f

    ...

    There is, perhaps, an additional oddity that an 'operator' at the
    beginning of a subexpression really applies at the end of that
    subexpression.

    Well, if it helps to read the code, you could use parens, e.g. &(X[4]),
    but they are meaningless here. You could also prohibit large and complex
    expressions and require them to be broken down into shorter and
    simpler ones with e.g. temporary variables at every step, but
    that (temporaries and low code density) in itself is problematic.
    If you read X[4] into a temporary, taking its (temporary's) address
    wouldn't give you the address within X[], which is kinda bad.
    I think postfix expressions (and I mean not just postfix ++ and -- but
    all of this subscripting, calling, member accessing) in C are more useful
    than not.

    I don't think there's a problem. ISTM that

    X&

    is a good way to yield X's address.


    It may be more straightforward for & to be placed at the location where
    the dereference would otherwise have been.

    Assuming for discussion purposes that trailing & and infix & can be
    distinguished (so we don't need to use another symbol) the above
    expressions would become

    X& the address of X
    X[4]& the address of X[4]
    X.f& the address of field f

    Should we also use numeric negation this way, e.g. X[4]- in place of
    -X[4]? That would look pretty awkward to mathy people (not that
    they'd find it an insurmountable obstacle, I hope).

    If you want consistently all unaries to be at the front you'd end up
    with something like

    -x
    &x
    [4]x
    ()x

    :-(

    In reality programmers expect some operators to be prefix and some to be
    postfix. I didn't invent this!

    ...

    To summarise, there would no longer be the conceptual difference between
    lvalues and rvalues. All identifiers would be considered as producing
    their addresses, never their values. There would instead be contexts in
    which automatic dereference takes place, and the programmer would put &
    in any of those places where the automatic dereference was to be inhibited.

    Then you also need to distinguish pointer arithmetic from non-pointer
    arithmetic if you still want to keep both.
    If a-b now gives me the distance between a and b in memory instead of the
    numeric difference of the values stored at addresses a and b, it's kinda bad.

    That's not the proposal. An expression such as a - b would use the
    /values/ of both.

    Similarly, I don't always mean a pointer when I write a+1.
    No?

    Ditto. That would parse to 'the /value/ of a' plus 1.

    Maybe you've not understood the proposal. Not blaming you. I should
    reiterate it. A name such as N in these contexts:

    N=
    N(
    N[
    N.
    N$

    would use the address of N. In all other contexts it would result in the
    value stored at N.

    There's nothing novel in that, BTW. It's what programmers expect
    expressions to mean. If anything, this is just a certain way to look at
    expressions to understand them.

    A useful way to look at this is that /all/ occurrences of N result in
    the address of N followed by a dereference, and that each of the above
    forms requires that dereference to be present and deletes it. That is,
    in fact, how I intend to parse it.
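
    A toy version of that emit-then-delete scheme, in Python. It is only a
    sketch under stated assumptions: the names INHIBITORS, tokenize and
    mark_dereferences are invented here, "(" is assumed always to be a call
    (never grouping), "&" is assumed always to be the trailing form, and the
    output is an annotated token list rather than real code generation. A
    DEREF marker is emitted after every name, "]" and ")", and an inhibiting
    symbol deletes the marker immediately before it.

```python
import re

# Symbols which, when they follow an expression, cancel its pending dereference.
INHIBITORS = {"=", ".", "(", "[", "&"}

def tokenize(src):
    """Split into names, integer literals and single-character symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", src)

def mark_dereferences(src):
    """Return the tokens with DEREF inserted wherever a dereference survives."""
    out = []
    for tok in tokenize(src):
        if tok in INHIBITORS:
            # These forms require a pending dereference and delete it.
            if not out or out[-1] != "DEREF":
                raise SyntaxError(f"{tok!r} needs an address on its left")
            out.pop()
        out.append(tok)
        # Names, index results and call results are addresses: emit a
        # provisional dereference that a following inhibitor may remove.
        if re.match(r"[A-Za-z_]", tok) or tok in ("]", ")"):
            out.append("DEREF")
    return out

simple = mark_dereferences("X = X + 1")
nested = mark_dereferences("A[2][4] = A[2][4] + 1")
index = mark_dereferences("X[Y]")
addr_of = mark_dereferences("p = X&")
```

    In this model X = X + 1 keeps a single DEREF, after the second X, and
    X[Y] keeps one after Y and one after the closing bracket but none after X.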



    --
    James Harris
