• spread vs do

    From evan@21:1/5 to All on Tue Jun 14 02:09:16 2022
    Hi all. I was reading the discussion on
    Why is the Fortran intrinsic function "spread" often slower than explicit iteration
    https://stackoverflow.com/questions/55717904/why-is-the-fortran-intrinsic-function-spread-often-slower-than-explicit-iterat
    and wanted to try it myself. I modified the code provided by Steve Lionel in the answer (see the end of this message) and got the following timings (compiled using gfortran -O3, but similar results with -O0, version 9.4.0)
    Iteration 1 4.42133093
    Spread 1 0.309983253
    Iteration 3 0.212992191
    Spread 3 0.917150021
    So it seems that the answer to the question "should you use spread or a do loop?" is: it depends on the dimension you are iterating over. Is there a consistently optimal way to code these kinds of calculations?

    -------------------------------------------------------------------
    module benchmarks
       implicit none
       integer, parameter :: n = 500
       integer :: j
       real :: d2(n,n)
       real :: d3(n,n,n)
    contains
       ! Iteration 1
       subroutine benchmark_i1(res, n)
          integer n
          real, intent(out) :: res(n,n,n)
          do j = 1, n
             res(j,:,:) = d2*d3(j,:,:)
          end do
       end subroutine
       ! Spread 1
       subroutine benchmark_s1(res, n)
          integer n
          real, intent(out) :: res(n,n,n)
          res = d3*spread(d2, 1, n)
       end subroutine
       ! Iteration 3
       subroutine benchmark_i3(res, n)
          integer n
          real, intent(out) :: res(n,n,n)
          do j = 1, n
             res(:,:,j) = d2*d3(:,:,j)
          end do
       end subroutine
       ! Spread 3
       subroutine benchmark_s3(res, n)
          integer n
          real, intent(out) :: res(n,n,n)
          res = d3*spread(d2, 3, n)
       end subroutine
    end module

    program main
       use benchmarks
       real :: tstart, tend
       real :: res(n,n,n)
       call random_number(d2)
       call random_number(d3)
       ! Iteration
       call cpu_time(tstart)
       call benchmark_i1(res, n)
       call cpu_time(tend)
       write(*,*) 'Iteration 1', tend-tstart, sum(res)
       ! Spread
       call cpu_time(tstart)
       call benchmark_s1(res, n)
       call cpu_time(tend)
       write(*,*) 'Spread 1', tend-tstart, sum(res)
       ! Iteration
       call cpu_time(tstart)
       call benchmark_i3(res, n)
       call cpu_time(tend)
       write(*,*) 'Iteration 3', tend-tstart, sum(res)
       ! Spread
       call cpu_time(tstart)
       call benchmark_s3(res, n)
       call cpu_time(tend)
       write(*,*) 'Spread 3', tend-tstart, sum(res)
    end program

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ron Shepard@21:1/5 to evan on Tue Jun 14 10:17:43 2022
    On 6/14/22 4:09 AM, evan wrote:
    Hi all. I was reading the discussion on
    Why is the Fortran intrinsic function "spread" often slower than explicit iteration
    https://stackoverflow.com/questions/55717904/why-is-the-fortran-intrinsic-function-spread-often-slower-than-explicit-iterat
    and wanted to try it myself. I modified the code provided by Steve Lionel in the answer (see the end of this message) and got the following timings (compiled using gfortran -O3, but similar results with -O0, version 9.4.0)
    Iteration 1 4.42133093
    Spread 1 0.309983253
    Iteration 3 0.212992191
    Spread 3 0.917150021
    So it seems that the answer to the question "should you use spread or a do loop?" is: it depends on the dimension you are iterating over. Is there a consistently optimal way to code these kinds of calculations?

    Whoever told you that do loops were to be avoided at all costs?

    Many times, do loops are simpler for both the compiler and a human
    to understand. In these cases, the spread() code is literally written so
    that memory allocation is required for an intermediate quantity. The
    compiler can sometimes recognize that the allocation is not
    necessary, and then decide whether there are other reasons, such as
    sequential memory accesses, why it would be good to do it that way
    anyway. This is probably one of those cases. Sometimes the compiler
    gets it right, other times not. It can also depend on the array
    dimensions: for small arrays it might not matter if you step through an
    array the wrong way, but for big arrays it can be critical.
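
    For what it's worth, the "wrong way" here comes down to Fortran's
    column-major storage: the first index varies fastest in memory. A minimal
    sketch (the array name and size are just for illustration, not from the
    benchmark above):

    ```fortran
    program order
       implicit none
       integer, parameter :: n = 2000
       real :: a(n,n)
       integer :: i, j
       a = 0.0
       ! Cache-friendly: the inner loop runs over the FIRST index,
       ! so consecutive iterations touch adjacent memory.
       do j = 1, n
          do i = 1, n
             a(i,j) = a(i,j) + 1.0
          end do
       end do
       ! Cache-unfriendly: the inner loop strides through memory
       ! n elements at a time.
       do i = 1, n
          do j = 1, n
             a(i,j) = a(i,j) + 1.0
          end do
       end do
       print *, a(n,n)
    end program order
    ```

    For small n both versions fit in cache and the difference is negligible;
    for large n the strided version can be many times slower.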

    This of course will depend on the compiler and which optimizations are
    enabled at the time. It may also change when you update to newer
    versions of the same compiler. There have been a dozen gfortran updates
    since 9.4.0.

    If this is a critical part of your runtime, then this is something that
    you should check periodically. You might include both algorithms in your
    source code and switch between them when your testing shows that the
    optimal approach has changed since the last time you tested. This is one
    of the reasons a preprocessor that supports conditional compilation is
    useful with the language.
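
    A sketch of such a compile-time switch, assuming the file goes through
    the C preprocessor (e.g. a .F90 file, or gfortran -cpp) and that
    USE_SPREAD is a macro you define yourself; the subroutine is a
    hypothetical wrapper, not part of the benchmark above:

    ```fortran
    ! Build one variant or the other with e.g.:
    !   gfortran -cpp -DUSE_SPREAD file.F90   (spread version)
    !   gfortran -cpp file.F90                (loop version)
    subroutine multiply(res, d2, d3, n)
       implicit none
       integer, intent(in)  :: n
       real,    intent(in)  :: d2(n,n), d3(n,n,n)
       real,    intent(out) :: res(n,n,n)
    #ifdef USE_SPREAD
       res = d3 * spread(d2, 1, n)
    #else
       integer :: j
       do j = 1, n
          res(j,:,:) = d2 * d3(j,:,:)
       end do
    #endif
    end subroutine multiply
    ```

    Either branch compiles to a valid subroutine on its own, so switching is
    a single flag change rather than an edit to the source.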

    $.02 -Ron Shepard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From evan@21:1/5 to Ron Shepard on Tue Jun 14 09:57:44 2022
    On Tuesday, 14 June 2022 at 16:17:49 UTC+1, Ron Shepard wrote:
    Whoever told you that do loops were to be avoided at all costs?

    No one told me that. I am looking for a consistent way to program such calculations.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to Ron Shepard on Tue Jun 14 09:29:35 2022
    On Tuesday, June 14, 2022 at 8:17:49 AM UTC-7, Ron Shepard wrote:

    (snip)

    Whoever told you that do loops were to be avoided at all costs?

    It seems to be a rumor that has been going on for years.

    Well, at least since array expressions, and especially some
    of the more interesting array intrinsic functions, were added to the
    standard, there has been interest in using them.

    Well, even longer ago than that, some would initialize arrays:

    DATA X/1000000*1.234/

    which on many systems writes one million values into the object program
    and then loads them from disk to run the program. That is much slower
    than a DO loop or, after it was added, an array assignment:

    X = 1.234

    After array expressions were added, compilers were pretty good at
    doing simple assignment at least as well as the DO loop, and maybe
    faster.

    But some took it as a challenge to figure out how to write a complicated
    array expression, to do what could be done in a simple DO loop.

    Now, say you want to find out if an array has any negative elements:

    DO I=1,1000000
    IF(X(I).LT.0) GOTO 10
    END DO

    or

    IF(ANY(X<0)) GOTO 10

    The array expression looks nice, and maybe compilers now will
    figure it out. But for many years the compiler would generate a
    temporary array of LOGICAL values, 1,000,000 elements long,
    and then check whether ANY were .TRUE.. The nice thing about
    the DO loop is that it exits as soon as the first one is found.

    OK, maybe some compilers now will figure that one out. How about:

    IF(ANY(SIN(X) < 0)) GOTO 10

    I suspect you can always make a complicated array expression that
    a compiler won't figure out has an easy loop with an exit test.

    I don't remember now any of the complicated array expressions
    I have seen over the years, but there are complicated ones!
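
    For the record, modern Fortran can write the early-exit scan without a
    GOTO; a sketch using EXIT (the function name is illustrative):

    ```fortran
    function has_negative(x) result(found)
       implicit none
       real, intent(in) :: x(:)
       logical :: found
       integer :: i
       found = .false.
       do i = 1, size(x)
          if (x(i) < 0.0) then
             found = .true.
             exit   ! stop scanning at the first negative element
          end if
       end do
    end function has_negative
    ```

    Like the GOTO version, this allocates no temporary and stops at the
    first hit, regardless of what the compiler does with ANY().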

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to evan on Tue Jun 14 11:03:29 2022
    On Tuesday, June 14, 2022 at 9:57:47 AM UTC-7, evan wrote:
    On Tuesday, 14 June 2022 at 16:17:49 UTC+1, Ron Shepard wrote:
    Whoever told you that do loops were to be avoided at all costs?

    No one told me that. I am looking for a consistent way to program such calculations.

    There is an old saying, credited to D. Knuth:

    "premature optimization is the root of all evil"

    Unless it makes a significant difference in actual time, as in more time
    than it takes to discuss, write in the way that is most readable.

    As array expressions get more complicated, they get less readable, in a
    different way than DO loops do. It isn't always so obvious then, but you
    can decide for yourself which one you find more readable.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ron Shepard@21:1/5 to All on Tue Jun 14 19:32:40 2022
    On 6/14/22 1:03 PM, gah4 wrote:
    [...]
    There is an old saying, credited to D. Knuth:

    "premature optimization is the root of all evil"

    Unless it makes a significant difference in actual time, as in more time
    than it takes to discuss, write in the way that is most readable.

    Yes, clear code is a goal, but there is also algorithm choice. I have
    always interpreted Knuth's comment in that way. If you start with the
    wrong algorithm, say one that scales as N**2 rather than N, and then you
    spend a lot of time trying to optimize register usage and memory access
    and so on for that wrong algorithm, then you are reluctant to throw away
    all of that work and implement the right algorithm.

    In poker, there is a similar expression: "Don't throw good money after
    bad." The idea is that once you are invested in a bad hand, you don't
    want to stop; you want to keep betting on it.

    $.02 -Ron Shepard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From pehache@21:1/5 to All on Thu Jun 16 13:15:42 2022
    Le 14/06/2022 à 11:09, evan a écrit :
    Hi all. I was reading the discussion on
    Why is the Fortran intrinsic function "spread" often slower than explicit iteration

    https://stackoverflow.com/questions/55717904/why-is-the-fortran-intrinsic-function-spread-often-slower-than-explicit-iterat
    and wanted to try it myself. I modified the code provided by Steve Lionel in the
    answer (see the end of this message) and got the following timings (compiled using
    gfortran -O3, but similar results with -O0, version 9.4.0)
    Iteration 1 4.42133093
    Spread 1 0.309983253
    Iteration 3 0.212992191
    Spread 3 0.917150021
    So it seems that the answer to the question "should you use spread or a do loop?" is: it depends on the dimension you are iterating over. Is there a consistently optimal way to code these kinds of calculations?

    (snip code)

    Your i1 case is the worst one for a good reason: in each pass of the loop
    you are working on res(j,:,:) and d3(j,:,:), which are non-contiguous
    slices of arrays of 125 x 10^6 elements, that is 500 MB each. This is
    highly inefficient in terms of memory access.

    If you want to express the very same calculation with loops, the most efficient way is:

    subroutine benchmark_i1(res, n)
       integer n
       real, intent(out) :: res(n,n,n)
       integer :: i1, i2, i3
       do i3 = 1, n
          do i2 = 1, n
             do i1 = 1, n
                res(i1,i2,i3) = d2(i2,i3)*d3(i1,i2,i3)
             end do
          end do
       end do
    end subroutine

    You may replace the inner loop with array syntax, which any not-too-dumb compiler can optimize without needing a temporary array:

    subroutine benchmark_i1(res, n)
       integer n
       real, intent(out) :: res(n,n,n)
       integer :: i2, i3
       do i3 = 1, n
          do i2 = 1, n
             res(:,i2,i3) = d2(i2,i3)*d3(:,i2,i3)
          end do
       end do
    end subroutine

    In the s1 and s3 cases the code will spend a bit of time allocating and
    populating a large temporary array as a result of spread(), but once
    that is done the compiler knows how to order the hidden loops in the
    most efficient way.

    So, generally speaking, the most important thing is to know how to order the loops.

    Regarding do loops versus spread():

    - array syntax can be more or less readable than the equivalent do
    loops, depending on the case. I tend to find spread() less readable
    and generally prefer do loops, but it's also a matter of personal
    preference.

    - apart maybe from some trivial cases, spread() will always be less
    efficient than the equivalent do loops, because most of the time the
    compiler has to allocate and populate a temporary array (and in
    addition this can cause memory problems if the array is large)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gah4@21:1/5 to pehache on Thu Jun 16 07:55:12 2022
    On Thursday, June 16, 2022 at 6:15:46 AM UTC-7, pehache wrote:

    (snip)
    - array syntax can be more or less readable than the equivalent do
    loops, depending on the case. I tend to find spread() less readable
    and generally prefer do loops, but it's also a matter of personal preference.

    - apart maybe from some trivial cases, spread() will always be less
    efficient than the equivalent do loops, because most of the time the
    compiler has to allocate and populate a temporary array (and in addition this can cause memory problems if the array is large)

    Yes, this is the big problem with complicated array expressions.

    I suspect compilers are getting better, but many will use a temporary
    array, even though a person could easily see how to avoid one.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)