• Practice of code generation for elemental procedures

    From Bastiaan Braams@21:1/5 to All on Tue Dec 14 03:48:46 2021
    I am wondering how compiler authors deal with possible code bloat in connection with elemental procedures. If such a procedure has k intent(in) arguments, then any subset of those might be arrays in an invocation of the procedure (all arrays conformable), which looks like 2^k possibilities right there.
    Here is the reason for my interest. Think of a procedure for some special function of mathematical physics taking arguments (n, x), an integer and a real number. The integer controls some recurrence, and for fixed argument (n) the operations involving (x) are all identical. If the procedure is going to be called with scalar argument (n) and array argument (x), then there is a lot of efficiency to be gained by putting the operations involving x in an inner loop. On the other hand, if the argument (n) should be an array, then the loop might as well be put on the outside. Any chance that the compiler will produce such properly optimized code?
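    To make the two loop orderings concrete, a minimal toy sketch (it computes r = x**n by repeated multiplication, chosen only so that both variants do the same arithmetic and just the loop structure differs):

    program loop_order_sketch
       ! Toy sketch: r = x**n by repeated multiplication, for scalar n and array x,
       ! written with the two possible loop orderings.
       implicit none
       integer, parameter :: n = 5
       real :: x(4) = [1.0, 2.0, 3.0, 4.0]
       real :: r(4)
       integer :: i, k

       ! (a) operations involving x in the inner loop: the n-dependent loop
       !     runs once and the inner loop over x can be vectorized
       r = 1.0
       do k = 1, n
          do i = 1, size(x)
             r(i) = r(i)*x(i)
          end do
       end do
       print *, r

       ! (b) the shape an elemental expansion gives: one loop over the elements,
       !     with the n-dependent control repeated for every element
       do i = 1, size(x)
          r(i) = 1.0
          do k = 1, n
             r(i) = r(i)*x(i)
          end do
       end do
       print *, r
    end program loop_order_sketch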

  • From Thomas Koenig@21:1/5 to Bastiaan Braams on Tue Dec 14 17:23:16 2021
    Bastiaan Braams <bjbraams@gmail.com> wrote:

    I am wondering how compiler authors deal with possible code
    bloat in connection with elemental procedures.

    By implementing the procedure as scalar and looping.

    You can look over gfortran's shoulders, so to speak,
    by specifying -fdump-tree-original.

    An example:

    module x
      implicit none
    contains
      elemental function ele (a, b, c) result(res)
        real, intent(in) :: a, b, c
        real :: res
        res = a + 2*b + 4*c
      end function ele
    end module x

    program main
      use x
      implicit none
      real :: a(3), b(3), c
      call random_number(a)
      call random_number(b)
      call random_number(c)
      print *,ele(a,b,c)
    end program main

    results in (part of the dump):

    _gfortran_st_write (&dt_parm.2);
    {
      real(kind=4) D.3909;

      D.3909 = c;
      {
        integer(kind=8) S.3;

        S.3 = 1;
        while (1)
          {
            if (S.3 > 3) goto L.1;
            {
              real(kind=4) D.3911;

              D.3911 = ele (&a[S.3 + -1], &b[S.3 + -1], &D.3909);
              _gfortran_transfer_real_write (&dt_parm.2, &D.3911, 4);
            }
            S.3 = S.3 + 1;
          }
        L.1:;
      }
    }
    _gfortran_st_write_done (&dt_parm.2);
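    In Fortran terms, the dump corresponds roughly to the rewrite of the main program below (a paraphrase, not literal compiler output): the caller expands the elemental reference into a scalar loop, evaluating the scalar argument c once and calling ele once per array element.

    program main_scalarized
       use x
       implicit none
       real :: a(3), b(3), c, tmp
       integer :: i
       call random_number(a)
       call random_number(b)
       call random_number(c)
       do i = 1, 3
          tmp = ele(a(i), b(i), c)
          print *, tmp    ! the generated code transfers each value into a single output record
       end do
    end program main_scalarized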

  • From gah4@21:1/5 to Thomas Koenig on Tue Dec 14 21:17:53 2021
    On Tuesday, December 14, 2021 at 9:23:18 AM UTC-8, Thomas Koenig wrote:
    Bastiaan Braams <bjbr...@gmail.com> wrote:
    I am wondering how compiler authors deal with possible code
    bloat in connection with elemental procedures.

    By implementing the procedure as scalar and looping.

    In some cases, it would be an optimization to pass an array and
    have it processed inside the routine.

    I do remember, around 1973, when the IBM OS/360 compilers
    changed from handling implied-DO loops outside the I/O routines
    to handling them inside. The new library was backwards compatible,
    but old libraries were not forward compatible.

    In the case of compilers doing global optimization,
    I suppose it is possible to do it inside the routine. Only code
    for the combinations actually used would be generated.

  • From Bastiaan Braams@21:1/5 to All on Tue Dec 14 23:21:27 2021
    On Wednesday, December 15, 2021 at 6:17:55 AM UTC+1, gah4 wrote:
    On Tuesday, December 14, 2021 at 9:23:18 AM UTC-8, Thomas Koenig wrote:
    Bastiaan Braams <bjbr...@gmail.com> wrote:
    I am wondering how compiler authors deal with possible code
    bloat in connection with elemental procedures.

    By implementing the procedure as scalar and looping.
    In some cases, it would be an optimization to pass an array and
    have it processed inside the routine.

    I do remember about 1973, when the IBM OS/360 compilers
    changed to doing implied-DO loops from outside to inside
    the I/O routines. The new library was backwards compatible,
    but old libraries were not forward compatible.

    In the case of compilers doing global optimization,
    I suppose it is possible to do it in the routine. Only code
    for the actual used combination would be generated.
    Thank you Thomas and Glen, this is very informative. So I think that the following code structure is the proper way to deal with the situation mentioned in my original post. Function Foo(n,x) can be written as an elemental function, but this will not
    lead to efficient code for the case that argument (n) is scalar and argument (x) is a vector. (Let's say that this case is of special interest.) So then we still write Foo as an elemental function, but we also supply a specific function for the special
    case and compose the two versions into a generic interface.

    module M
       implicit none ; private
       public :: Foo          ! make the generic visible to users of the module
       interface Foo
          module procedure Foo_Scalar
          module procedure Foo_Vector
       end interface Foo
    contains
       elemental function Foo_Scalar (n, x) result (r)
          integer, intent (in) :: n
          real, intent (in) :: x
          real :: r
          ! Return Foo[n](x). Elemental function.
          ...
       end function Foo_Scalar
       pure function Foo_Vector (n, x) result (r)
          integer, intent (in) :: n
          real, intent (in) :: x(0:)
          real :: r(0:size(x)-1)
          ! Return Foo[n](x) for scalar argument n and rank-1 array argument x.
          ...
       end function Foo_Vector
    end module M

    Additional versions can be written and included in the generic interface for higher-rank array arguments x.
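    For example, a rank-2 specific might look like the following sketch (Foo_Matrix is a name made up here for illustration; it would be listed as another module procedure in interface Foo). Combining an elemental specific with nonelemental ones in a generic is fine: when a reference could match both, the nonelemental specific is the one invoked, so there is no ambiguity.

    pure function Foo_Matrix (n, x) result (r)
       integer, intent (in) :: n
       real, intent (in) :: x(:,:)
       real :: r(size(x,1), size(x,2))
       integer :: j
       ! Reuse the rank-1 specific column by column, so only Foo_Vector
       ! needs a hand-optimized inner loop.
       do j = 1, size(x,2)
          r(:,j) = Foo_Vector(n, x(:,j))
       end do
    end function Foo_Matrix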

  • From gah4@21:1/5 to bjbr...@gmail.com on Wed Dec 15 00:11:48 2021
    On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:

    (snip)

    Thank you Thomas and Glen, this is very informative.
    So I think that the following code structure is the proper way to deal
    with the situation mentioned in my original post. Function Foo(n,x)
    can be written as an elemental function, but this will not lead to
    efficient code for the case that argument (n) is scalar and argument (x)
    is vector. (Let's say that this case is of special interest.)
    So then we still write Foo as an elemental function, but we also supply a specific function for the special case and compose the two versions
    into a generic interface.

    I am sometimes surprised how much work, over the years, has been done
    to make function calls faster, even as computers got faster, so there was
    less need for the speed.

    Assuming your function does a reasonable amount of computation,
    the function call overhead should be small enough for most of us.

    The rules for generic interfaces are funny, and I was going to guess
    that what you asked wouldn't work; that is, that it would be
    an ambiguity and so would not compile.

    But I put the part you wrote, removing the ... lines, into gfortran,
    and it seems to take it.

    So, yes, that is how you do it.

  • From Thomas Koenig@21:1/5 to gah4@u.washington.edu on Wed Dec 15 21:24:11 2021
    gah4 <gah4@u.washington.edu> wrote:
    On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:

    (snip)

    Thank you Thomas and Glen, this is very informative.
    So I think that the following code structure is the proper way to deal
    with the situation mentioned in my original post. Function Foo(n,x)
    can be written as an elemental function, but this will not lead to
    efficient code for the case that argument (n) is scalar and argument (x)
    is vector. (Let's say that this case is of special interest.)
    So then we still write Foo as an elemental function, but we also supply a
    specific function for the special case and compose the two versions
    into a generic interface.

    I am sometimes surprised how much work, over the years, has been done
    to make function calls faster, even as computers got faster, so there was less need for the speed.

    In languages like C++, almost everything is a function call
    to some virtual function. Reducing that overhead can go a
    long way towards making things faster.

    Also, if anything, bloat has been increasing faster than the
    increase in processor speed in the last few years...

  • From Dick@21:1/5 to Thomas Koenig on Wed Dec 15 19:12:24 2021
    On 12/15/21 4:24 PM, Thomas Koenig wrote:

    In languages like C++, almost everything is a function call
    to some virtual function.
    Not necessarily; it depends on your programming style.

  • From Bastiaan Braams@21:1/5 to Thomas Koenig on Wed Dec 15 22:53:25 2021
    On Wednesday, December 15, 2021 at 10:24:14 PM UTC+1, Thomas Koenig wrote:
    gah4 <ga...@u.washington.edu> wrote:
    On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:

    (snip)

    Thank you Thomas and Glen, this is very informative.
    So I think that the following code structure is the proper way to deal
    with the situation mentioned in my original post. Function Foo(n,x)
    can be written as an elemental function, but this will not lead to
    efficient code for the case that argument (n) is scalar and argument (x) is vector. (Let's say that this case is of special interest.)
    So then we still write Foo as an elemental function, but we also supply a specific function for the special case and compose the two versions
    into a generic interface.

    I am sometimes surprised how much work, over the years, has been done
    to make function calls faster, even as computers got faster, so there was less need for the speed.
    In languages like C++, almost everything is a function call
    to some virtual function. Reducing that overhead can go a
    long way towards making things faster.

    Also, if anything, bloat has been increasing faster than the
    increase in processor speed in the last few years...
    Just to be sure: I was not concerned about the overhead of the function call. Once again, here is the structure of that function Foo, written as a scalar elemental function. This is the three-term recurrence for the Laguerre polynomial r = L[n](x).
    if (n==0) then
       r = 1
    else if (n==1) then
       r = 1-x
    else
       t1 = 1
       t2 = 1-x
       do k = 2, n
          t0 = t1 ; t1 = t2
          t2 = ((2*k-1-x)*t1-(k-1)*t0)/k
       end do
       r = t2
    end if
    Suppose now that this function is called with scalar argument (n) and array argument (x). Then I would think that we really want a loop for each use of the variable x or of the temporaries t0, t1, t2; we don't want one single loop over the entire code, with the test on (n) inside the loop. And if Foo is declared elemental and there is no special version for the case of scalar (n) and array (x), then I expect the compiler to do just that: place the single loop outside, around the whole if-then-else.
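    For concreteness, this is how the rank-1 specific from the generic-interface post might be filled in for this recurrence (a sketch: the interface of Foo_Vector is taken from that post, the body from the scalar code above). The test on n and the do k loop execute once, and only whole-array operations over x remain inside, which is the loop placement argued for here.

    pure function Foo_Vector (n, x) result (r)
       integer, intent (in) :: n
       real, intent (in) :: x(0:)
       real :: r(0:size(x)-1)
       real :: t0(0:size(x)-1), t1(0:size(x)-1), t2(0:size(x)-1)
       integer :: k
       ! Laguerre recurrence L[n](x) with the k loop outside and
       ! elementwise (vectorizable) array operations over x inside.
       if (n==0) then
          r = 1
       else if (n==1) then
          r = 1-x
       else
          t1 = 1
          t2 = 1-x
          do k = 2, n
             t0 = t1 ; t1 = t2
             t2 = ((2*k-1-x)*t1-(k-1)*t0)/k
          end do
          r = t2
       end if
    end function Foo_Vector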

  • From Bastiaan Braams@21:1/5 to Thomas Koenig on Wed Dec 15 23:06:05 2021
    On Wednesday, December 15, 2021 at 10:24:14 PM UTC+1, Thomas Koenig wrote:
    gah4 <ga...@u.washington.edu> wrote:
    On Tuesday, December 14, 2021 at 11:21:29 PM UTC-8, bjbr...@gmail.com wrote:

    (snip)

    Thank you Thomas and Glen, this is very informative.
    So I think that the following code structure is the proper way to deal
    with the situation mentioned in my original post. Function Foo(n,x)
    can be written as an elemental function, but this will not lead to
    efficient code for the case that argument (n) is scalar and argument (x) is vector. (Let's say that this case is of special interest.)
    So then we still write Foo as an elemental function, but we also supply a specific function for the special case and compose the two versions
    into a generic interface.

    I am sometimes surprised how much work, over the years, has been done
    to make function calls faster, even as computers got faster, so there was less need for the speed.
    In languages like C++, almost everything is a function call
    to some virtual function. Reducing that overhead can go a
    long way towards making things faster.

    Also, if anything, bloat has been increasing faster than the
    increase in processor speed in the last few years...

    Just to be sure: I was not concerned about the overhead of the function call. Once again, here is the structure of that function Foo, written as a scalar elemental function. This is the three-term recurrence for the Laguerre polynomial r = L[n](x).

    if (n==0) then
       r = 1
    else if (n==1) then
       r = 1-x
    else
       t1 = 1
       t2 = 1-x
       do k = 2, n
          t0 = t1 ; t1 = t2
          t2 = ((2*k-1-x)*t1-(k-1)*t0)/k
       end do
       r = t2
    end if

    Suppose now that this function is called with scalar argument (n) and array argument (x). Then I would think that we really want a loop within each branch of the if-then-else construct, and maybe within the (do k = 2, n) as well; we don't want one single loop over the entire code, with the test on (n) inside the loop. And if Foo is declared elemental and there is no special version for the case of scalar (n) and array (x), then I expect the compiler to do just that: place the single loop outside, around the whole if-then-else.

    However, to add to my confusion I just did a timing test. In one instance Foo is just an elemental function, with no further overloading; in the other the elemental function is overloaded with a function for the case of scalar (n) and array (x) arguments. It makes no difference in the timing when using gfortran; I tried it with -O0 and with -O3.
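    For reference, such a comparison can be set up in one program along the following lines (a sketch, not the program actually timed; it assumes the module M from the earlier post, with Foo_Scalar also made public so the elemental version can be called directly, and made-up values for the array size and n):

    program time_foo
       use M                          ! module from the earlier post; assumes public :: Foo, Foo_Scalar
       implicit none
       integer, parameter :: m = 1000000, n = 20
       real :: x(m), r(m)
       real :: time0, time1
       call random_number(x)

       call cpu_time(time0)
       r = Foo(n, x)                  ! generic reference: resolves to the nonelemental rank-1 specific
       call cpu_time(time1)
       print *, 'array specific:', time1 - time0, sum(r)

       call cpu_time(time0)
       r = Foo_Scalar(n, x)           ! elemental reference, expanded element by element
       call cpu_time(time1)
       print *, 'elemental     :', time1 - time0, sum(r)
    end program time_foo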

  • From gah4@21:1/5 to bjbr...@gmail.com on Wed Dec 15 23:46:41 2021
    On Wednesday, December 15, 2021 at 11:06:07 PM UTC-8, bjbr...@gmail.com wrote:

    (snip, I wrote)

    I am sometimes surprised how much work, over the years, has been done
    to make function calls faster, even as computers got faster, so there was less need for the speed.

    (snip)

    Just to be sure... I was not concerned about the overhead of the function call.
    Once again the structure of that function Foo, here written as a scalar elemental function. This is the three-term recurrence for
    Laguerre polynomial r = L[n](x).

    Oh, that one. Many years ago (Fortran 66 days) I was working on a program written by someone else that was way too slow. It computed a bunch of
    integrals with Gaussian quadrature, the first time I knew about that.

    Now, Gaussian quadrature is pretty fast, but it was doing a lot of them.
    As well as I remember now, there was a subroutine that would compute
    two of them, then divide, with the second one being a normalization.

    In any case, for reasons similar to what you say, it was recomputing
    the same thing many times. I think to fix it, I separately computed the integrals, moved some things out of a loop, and so eliminated much
    duplication.

    Suppose now that this function is called with scalar argument (n) and
    array argument (x). Then I would think that we really want a loop within
    each branch of the if-then-else construct and maybe within the (do k = 2, n) as well; we don't want one single loop over the entire code, with the test
    on (n) inside the loop. And if Foo is declared elemental and there is not
    a special version for the case of scalar (n) and array (x), then I expect the compiler to do just that, place the loop outside the if-then-else.

    However, to add to my confusion I just did a timing test.
    One instance where Foo is just an elemental function, no further
    overloading, and another version where the elemental function is
    overloaded with a function for arguments scalar (n) and array (x).
    Makes no difference in the timing when using gfortran;
    I tried it with -O0 and with -O3.

    I think that sounds right. The statements outside the inner loop
    are fairly fast, and not so many, so doing them multiple times
    makes only a small difference. Branch prediction will mean almost
    none of the delay from doing the if that you might have expected.

    If there were a large part of the calculation depending on n,
    but not on x, then the difference would be larger.

    I think your idea is good, but it doesn't apply to this example.
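    As a rough back-of-envelope illustration (made-up numbers, purely for scale, not measurements from the thread): with n around 20 the recurrence performs on the order of a hundred floating-point operations per element of x, including roughly twenty divisions, which dominate the cost. The test on n and the per-element call overhead amount to at most a few tens of cycles with well-predicted branches, so the dispatch cost is a small fraction of the per-element work, which is consistent with the two versions timing the same.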

  • From gah4@21:1/5 to Dick on Wed Dec 15 23:19:11 2021
    On Wednesday, December 15, 2021 at 4:12:28 PM UTC-8, Dick wrote:
    On 12/15/21 4:24 PM, Thomas Koenig wrote:

    In languages like C++, almost everything is a function call
    to some virtual function.

    Not necessarily, it depends on your programming style.

    C++ or not, modern programming style more often uses smaller
    functions and more function calls.

    Fortran has a long history of especially large subroutines and
    functions. I used to wonder about this in relation to IBM S/360
    and S/370 code, where instructions have a 12-bit offset, and
    one normally has a base register to index from. That works well
    for program units up to 4K bytes. For slightly bigger programs,
    the compiler will generate two or three base registers. Past that,
    it might do indirect addressing for each, loading an address
    from memory each time.

    It is not so obvious what other systems do with large routines,
    but in general optimizing compilers do better with ones that aren't
    so big.

  • From James Van Buskirk@21:1/5 to All on Sat Dec 18 21:08:59 2021
    "gah4" wrote in message news:0ab6410a-af0e-4f5d-a32d-520912de99e3n@googlegroups.com...

    On Wednesday, December 15, 2021 at 11:06:07 PM UTC-8, bjbr...@gmail.com wrote:

    (snip, I wrote)

    Just to be sure... I was not concerned about the overhead of the function call.
    Once again the structure of that function Foo, here written as a scalar elemental function. This is the three-term recurrence for
    Laguerre polynomial r = L[n](x).

    Oh, that one. Many years ago (Fortran 66 days) I was working on a program written by someone else that was way too slow. It computed a bunch of
    integrals with Gaussian quadrature, the first time I knew about that.

    Now, Gaussian quadrature is pretty fast, but it was doing a lot of them.
    As well as I remember now, there was a subroutine that would compute
    two of them, then divide, with the second one being a normalization.

    In any case, for reasons similar to what you say, it was recomputing
    the same thing many times. I think to fix it, I separately computed the integrals, moved some things out of a loop, and so eliminated much duplication.

    Here you know that the O.P. is not doing Gauss-Laguerre quadrature because
    his function returns only L[n](x) and not also L[n-1](x). I'm not sure what application needs so many values of Laguerre polynomials: one I can think
    of is hydrogenlike atoms, but they need generalized Laguerre polynomials
    and there is no extra parameter in his code either.

    Suppose now that this function is called with scalar argument (n) and
    array argument (x). Then I would think that we really want a loop within
    each branch of the if-then-else construct and maybe within the (do k = 2, n)
    as well; we don't want one single loop over the entire code, with the test
    on (n) inside the loop. And if Foo is declared elemental and there is not
    a special version for the case of scalar (n) and array (x), then I expect the
    compiler to do just that, place the loop outside the if-then-else.

    However, to add to my confusion I just did a timing test.
    One instance where Foo is just an elemental function, no further overloading, and another version where the elemental function is
    overloaded with a function for arguments scalar (n) and array (x).
    Makes no difference in the timing when using gfortran;
    I tried it with -O0 and with -O3.

    I think that sounds right. The statements outside the inner loop
    are fairly fast, and not so many, so doing them multiple times
    makes a small difference. Branch prediction will mean almost
    no delay to doing the if, that you might have expected.

    If there was a large part of the calculation depending on n,
    but not on x, then the difference would be larger.

    I think your idea is good, but doesn't apply to this example.

    Well, I looked at some of my code:

    subroutine laguerre(n,alpha,x,y,y1,nsturm)
       integer, intent(in) :: n
       real(wp), intent(in) :: alpha
       real(wp), intent(in) :: x
       real(wp), intent(out) :: y
       real(wp), intent(out) :: y1
       integer, optional :: nsturm
       integer k
       real(wp) y2

       y = 1
       y1 = 0
       if(present(nsturm)) nsturm = 0
       do k = 0, n-1
          y2 = y1
          y1 = y
          y = ((2*k+alpha+1-x)*y1-k*y2)/(k+alpha+1)
          if(present(nsturm)) nsturm = merge(nsturm+1, nsturm, y/=0 .AND. y*y1 <= 0)
       end do
    end subroutine laguerre

    On comparison, I see that there are early-out tests in the O.P.'s code which
    are unnecessary. Of course, for his usage there is no need for the Sturm-sequence value, nor for the parameter alpha, which makes it generalized rather
    than simple Laguerre. Also, he doesn't have to return L[n-1](x) because, as pointed out earlier, he can't be doing Gaussian quadrature.

    But there is a big potential advantage in the array call vs. the elemental call: an x86 processor can perform 4 wide multiply-accumulates on double precision data via two pipelines. Have you checked whether gfortran produces 4-wide AVX code for the array version? Those divisions take a long time. It might be useful to create an array of reciprocals on function entry and then reuse it in the main loop.
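    A sketch of that reciprocal idea, applied to an array version of the recurrence (hypothetical code, not the routine above; wp is assumed to be the working-precision kind, and the Sturm counting is dropped): the divisions 1/(k+alpha+1) are done once per k, so the per-element work over x(:) is multiply-adds that the compiler is free to vectorize.

    pure subroutine laguerre_array(n, alpha, x, y, y1)
       integer, parameter :: wp = kind(1.0d0)    ! assumed working precision
       integer,  intent(in)  :: n
       real(wp), intent(in)  :: alpha
       real(wp), intent(in)  :: x(:)
       real(wp), intent(out) :: y(size(x)), y1(size(x))
       real(wp) :: y2(size(x)), rk(0:n-1)
       integer  :: k
       do k = 0, n-1
          rk(k) = 1/(k+alpha+1)                  ! one division per k, reused below
       end do
       y = 1
       y1 = 0
       do k = 0, n-1
          y2 = y1
          y1 = y
          y = ((2*k+alpha+1-x)*y1 - k*y2)*rk(k)  ! elementwise over x: multiply-adds only
       end do
    end subroutine laguerre_array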

    But in any case, don't try to guess what the compiler is doing here; look
    at the assembly code it generates.
