From evan@21:1/5 to All on Tue Jun 14 02:09:16 2022
Hi all. I was reading the discussion on
Why is the Fortran intrinsic function "spread" often slower than explicit iteration
and wanted to try it myself. I modified the code provided by Steve Lionel in the answer (see the end of this message) and got the following timings (compiled with gfortran 9.4.0 using -O3; similar results with -O0):
Iteration 1 4.42133093
Iteration 3 0.212992191
So it seems that the answer to the question "should you use spread or do" is: it depends on the dimension that you are iterating over. Is there a consistently optimal way to write this type of calculation?

-------------------------------------------------------------------
module benchmarks
implicit none
integer, parameter :: n=500
integer :: j
real :: d2(n,n)
real :: d3(n,n,n)
contains
! Iteration 1
subroutine benchmark_i1(res,n)
integer n
real, intent(out) :: res(n,n,n)
do j = 1, n
res(j,:,:) = d2*d3(j,:,:)
end do
end subroutine
subroutine benchmark_s1(res,n)
integer n
real, intent(out) :: res(n,n,n)
! spread equivalent of benchmark_i1 (reconstructed; the body was missing from the post)
res = spread(d2,1,n)*d3
end subroutine
!Iteration 3
subroutine benchmark_i3(res,n)
integer n
real, intent(out) :: res(n,n,n)
do j = 1, n
res(:,:,j) = d2*d3(:,:,j)
end do
end subroutine
subroutine benchmark_s3(res,n)
integer n
real, intent(out) :: res(n,n,n)
! spread equivalent of benchmark_i3 (reconstructed; the body was missing from the post)
res = spread(d2,3,n)*d3
end subroutine
end module

program main
use benchmarks
real :: tstart,tend
real :: res(n,n,n)
call random_number(d2)
call random_number(d3)
! Iteration
call cpu_time(tstart)
call benchmark_i1(res,n)
call cpu_time(tend)
write(*,*) 'Iteration 1', tend-tstart, sum(res)
call cpu_time(tstart)
call benchmark_s1(res,n)
call cpu_time(tend)
! Iteration
call cpu_time(tstart)
call benchmark_i3(res,n)
call cpu_time(tend)
write(*,*) 'Iteration 3', tend-tstart, sum(res)
call cpu_time(tstart)
call benchmark_s3(res,n)
call cpu_time(tend)
end program

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
• From Ron Shepard@21:1/5 to evan on Tue Jun 14 10:17:43 2022
On 6/14/22 4:09 AM, evan wrote:
Hi all. I was reading the discussion on
Why is the Fortran intrinsic function "spread" often slower than explicit iteration
and wanted to try it myself. I modified the code provided by Steve Lionel in the answer (see the end of this message) and got the following timings (compiled using gfortran -O3, but similar results with -O0, version 9.4.0)
Iteration 1 4.42133093
Iteration 3 0.212992191
So it seems that the answer to the question should you use spread or do is it depends on the dimension that you are iterating over. Is there a consistent optimal code for performing these type of calculations?

Whoever told you that do loops were to be avoided at all costs?

Many times, the do loops are simpler, for both the compiler and a human,
to understand. In these cases, the spread() code is literally written so
that memory allocation is required for an intermediate quantity. The
compiler can then sometimes realize that the allocation is not
necessary, and then decide if there are other reasons, such as
sequential memory accesses, where it would be good to do it that way
anyway. This is probably one of those cases. Sometimes the compiler is
getting it right, other times not. It can also depend on the array
dimensions. For small arrays, it might not matter if you step through an
array the wrong way, but for big arrays it could be critical.
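The point about stepping through an array the wrong way can be checked with a small self-contained timing sketch; the array size and variable names here are my own choices, not code from the thread. Fortran stores arrays in column-major order, so the first index should vary fastest in the inner loop:

```fortran
program stride_demo
   implicit none
   integer, parameter :: n = 2000
   real :: a(n,n), s
   real :: t0, t1
   integer :: i, j

   call random_number(a)

   ! Cache-friendly: the first index varies fastest (unit stride).
   s = 0.0
   call cpu_time(t0)
   do j = 1, n
      do i = 1, n
         s = s + a(i,j)
      end do
   end do
   call cpu_time(t1)
   write(*,*) 'inner loop on first index :', t1 - t0, s

   ! Cache-hostile: consecutive accesses are n elements apart in memory.
   s = 0.0
   call cpu_time(t0)
   do i = 1, n
      do j = 1, n
         s = s + a(i,j)
      end do
   end do
   call cpu_time(t1)
   write(*,*) 'inner loop on second index:', t1 - t0, s
end program stride_demo
```

The sums are printed so the compiler cannot eliminate the loops as dead code.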

This of course will depend on the compiler and which optimizations are
enabled at the time. It may also change when you update to newer
versions of the same compiler. There have been a dozen gfortran updates
since 9.4.0.

If this is a critical part of your runtime, then this is something that
you should check periodically. You might include both algorithms in your
source code, and switch back and forth between them when your testing
shows that the optimal approach has changed since the last time you
tested it. This is one of the reasons why a preprocessor that supports conditional compilation is useful in the language.
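A minimal sketch of that switch using the preprocessor; the subroutine and macro names are my own invention, assuming the d2/d3 arrays from the benchmark earlier in the thread. With gfortran, a capital .F90 extension runs the preprocessor automatically:

```fortran
! multiply.F90 -- compile with: gfortran -O3 [-DUSE_SPREAD] multiply.F90
subroutine multiply(res, d2, d3, n)
   implicit none
   integer, intent(in)  :: n
   real,    intent(in)  :: d2(n,n), d3(n,n,n)
   real,    intent(out) :: res(n,n,n)
   integer :: j
#ifdef USE_SPREAD
   ! Array-expression variant: one line, but may create a large temporary.
   res = spread(d2, 3, n) * d3
#else
   ! Loop variant: iterate over the last (slowest-varying) dimension.
   do j = 1, n
      res(:,:,j) = d2 * d3(:,:,j)
   end do
#endif
end subroutine multiply
```

Both branches compute the same result, so switching between them is a one-flag rebuild rather than a source edit.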

\$.02 -Ron Shepard

• From evan@21:1/5 to Ron Shepard on Tue Jun 14 09:57:44 2022
On Tuesday, 14 June 2022 at 16:17:49 UTC+1, Ron Shepard wrote:
Whoever told you that do loops were to be avoided at all costs?

No one told me that. I am looking for a consistent way to program such calculations.

• From gah4@21:1/5 to Ron Shepard on Tue Jun 14 09:29:35 2022
On Tuesday, June 14, 2022 at 8:17:49 AM UTC-7, Ron Shepard wrote:

(snip)

Whoever told you that do loops were to be avoided at all costs?

It seems to be a rumor that has been going on for years.

Well, at least back to when array expressions, and especially some
of the more interesting array intrinsic functions, were added to the
standard, there has been interest in using them.

Well, even longer ago than that, some would initialize arrays:

DATA X/1000000*1.234/

which on many systems writes one million values into the object program,
which then have to be loaded from disk before the program can run, much
slower than a DO loop or, after it was added, an array assignment:

X = 1.234

After array expressions were added, compilers were pretty good at
doing simple assignment at least as well as the DO loop, and maybe
faster.

But some took it as a challenge to figure out how to write a complicated
array expression, to do what could be done in a simple DO loop.

Now, say you want to find out if an array has any negative elements:

DO I=1,1000000
IF(X(I).LT.0) GOTO 10
END DO

or

IF(ANY(X<0)) GOTO 10

The array expression looks nice, and maybe compilers now will
figure it out. But for many years the compiler would generate a
temporary array of LOGICAL values 1000000 elements long,
and then check whether ANY of them were .TRUE.. The nice thing about
the DO loop is that it exits as soon as the first one is found.
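For what it's worth, the same early-exit search can be written in modern Fortran without a GOTO, using EXIT; this is a generic sketch, not code from the thread:

```fortran
program any_negative
   implicit none
   real :: x(1000000)
   logical :: found
   integer :: i

   call random_number(x)   ! values in [0,1); shift so negatives exist
   x = x - 0.5

   found = .false.
   do i = 1, size(x)
      if (x(i) < 0.0) then
         found = .true.
         exit               ! leave the loop at the first negative element
      end if
   end do
   write(*,*) 'any negative:', found
end program any_negative
```

Like the GOTO version, this never allocates a temporary LOGICAL array and stops scanning at the first hit.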

OK, maybe some compilers now will figure that one out. How about:

IF(ANY(SIN(X) < 0)) GOTO 10

I suspect you can always construct a complicated array expression that
a compiler won't recognize as an easy loop with an exit test.

I don't remember now any of the complicated array expressions
I have seen over the years, but there are complicated ones!

• From gah4@21:1/5 to evan on Tue Jun 14 11:03:29 2022
On Tuesday, June 14, 2022 at 9:57:47 AM UTC-7, evan wrote:
On Tuesday, 14 June 2022 at 16:17:49 UTC+1, Ron Shepard wrote:
Whoever told you that do loops were to be avoided at all costs?

No one told me that. I am looking for a consistent way to program such calculations.

There is an old saying, credited to D. Knuth:

"premature optimization is the root of all evil"

Unless it makes a significant difference in actual run time, as in more time
than it takes to discuss, write it in the way that is most readable.

As array expressions get more complicated, they become less readable in
a different way than DO loops do. The tradeoff isn't always so obvious,
then, but you can decide for yourself which one you find more readable.

• From Ron Shepard@21:1/5 to All on Tue Jun 14 19:32:40 2022
On 6/14/22 1:03 PM, gah4 wrote:
[...]
There is an old saying, credited to D. Knuth:

"premature optimization is the root of all evil"

Unless it makes a significant difference in actual time, as in more time
than it takes to discuss, write in the way that is most readable.

Yes, clear code is a goal, but there is also algorithm choice. I have
always interpreted Knuth's comment in that way. If you start with the
wrong algorithm, say one that scales as N**2 rather than N, and then you
spend a lot of time trying to optimize register usage and memory access
and so on for that wrong algorithm, then you are reluctant to throw away
all of that work and implement the right algorithm.

In poker, there is a similar expression: "Don't throw good money after
bad." The idea is that once you are invested in a bad hand, you don't
want to fold; you want to keep betting on it.

\$.02 -Ron Shepard

• From pehache@21:1/5 to All on Thu Jun 16 13:15:42 2022
Le 14/06/2022 à 11:09, evan a écrit :
Hi all. I was reading the discussion on
Why is the Fortran intrinsic function "spread" often slower than explicit iteration

and wanted to try it myself. I modified the code provided by Steve Lionel in the
answer (see the end of this message) and got the following timings (compiled using
gfortran -O3, but similar results with -O0, version 9.4.0)
Iteration 1 4.42133093
Iteration 3 0.212992191
So it seems that the answer to the question should you use spread or do is it depends on the dimension that you are iterating over. Is there a consistent optimal code for performing these type of calculations?

-------------------------------------------------------------------
module benchmarks
implicit none
integer, parameter :: n=500
integer :: j
real :: d2(n,n)
real :: d3(n,n,n)
contains
! Iteration 1
subroutine benchmark_i1(res,n)
integer n
real, intent(out) :: res(n,n,n)
do j = 1, n
res(j,:,:) = d2*d3(j,:,:)
end do
end subroutine
subroutine benchmark_s1(res,n)
integer n
real, intent(out) :: res(n,n,n)
! spread equivalent of benchmark_i1 (reconstructed; the body was missing from the post)
res = spread(d2,1,n)*d3
end subroutine
!Iteration 3
subroutine benchmark_i3(res,n)
integer n
real, intent(out) :: res(n,n,n)
do j = 1, n
res(:,:,j) = d2*d3(:,:,j)
end do
end subroutine
subroutine benchmark_s3(res,n)
integer n
real, intent(out) :: res(n,n,n)
! spread equivalent of benchmark_i3 (reconstructed; the body was missing from the post)
res = spread(d2,3,n)*d3
end subroutine
end module

program main
use benchmarks
real :: tstart,tend
real :: res(n,n,n)
call random_number(d2)
call random_number(d3)
! Iteration
call cpu_time(tstart)
call benchmark_i1(res,n)
call cpu_time(tend)
write(*,*) 'Iteration 1', tend-tstart, sum(res)
call cpu_time(tstart)
call benchmark_s1(res,n)
call cpu_time(tend)
! Iteration
call cpu_time(tstart)
call benchmark_i3(res,n)
call cpu_time(tend)
write(*,*) 'Iteration 3', tend-tstart, sum(res)
call cpu_time(tstart)
call benchmark_s3(res,n)
call cpu_time(tend)
end program

Your i1 case is the worst one for a good reason: in each pass of the loop
you are working on res(j,:,:) and d3(j,:,:), which are non-contiguous
slices of arrays of 125x10^6 elements, i.e. about 500 MB each. Fortran
arrays are column-major, so slicing on the first index touches memory with
a stride of n. This is highly inefficient in terms of memory access.

If you want to express the very same calculations with loops, the most efficient way is:

subroutine benchmark_i1(res,n)
integer n
integer i1, i2, i3
real, intent(out) :: res(n,n,n)
do i3 = 1, n
do i2 = 1, n
do i1 = 1, n
res(i1,i2,i3) = d2(i2,i3)*d3(i1,i2,i3)
end do
end do
end do
end subroutine

You may replace the inner loop with array syntax, which any not-too-dumb compiler can optimize without needing a temporary array:

subroutine benchmark_i1(res,n)
integer n
integer i2, i3
real, intent(out) :: res(n,n,n)
do i3 = 1, n
do i2 = 1, n
res(:,i2,i3) = d2(i2,i3)*d3(:,i2,i3)
end do
end do
end subroutine

In the s1 and s3 cases the code will spend a bit of time allocating and
populating a large temporary array as a result of spread(), but once that
is done the compiler knows how to order the hidden loops in the most
efficient way.

So, generally speaking, the most important thing is to know how to order the loops.

Regarding do loops versus spread() :

- array syntax can be more or less readable than the equivalent do
loops, depending on the case. I tend to find spread() less readable
and generally prefer do loops, but it's also a matter of personal
preference.

- apart maybe from some trivial cases, spread() will always be less
efficient than the equivalent do loops, because most of the time the
compiler has to allocate and populate a temporary array (and in addition
this can cause memory problems if the array is large)

• From gah4@21:1/5 to pehache on Thu Jun 16 07:55:12 2022
On Thursday, June 16, 2022 at 6:15:46 AM UTC-7, pehache wrote:

(snip)
- array syntaxes can be more or less readable than the equivalent do
loops, depending on the cases. But I tend to find spread() less readable
and generally prefer do loops. But it's also a matter of personal preferences.

- apart maybe for some trivial cases, spread() will always be less
efficient than the equivalent do loops, because most of time the compiler
has to allocate and populate a temporary array (and in addition this can result in memory issues if this is a large array)

Yes, this is the big problem with complicated array expressions.

I suspect compilers are getting better, but many will still use a temporary
array even when a person could easily see how to avoid one.
