• Speed of writing formatted matrices of floats in Fortran and C++

    From Beliavsky@21:1/5 to All on Fri May 6 05:03:42 2022
    I am finding that writing 100x100 matrices of floats is about 25% faster with gfortran than with g++ using printf on Windows, and 4 times faster with gfortran than with g++ on WSL2. I wonder what people see on Linux and whether the C++ performance can be improved (or whether the comparison is invalid for some reason). The codes and scripts are at https://github.com/Beliavsky/Formatted_output_speed . Maybe Fortran benefits because a whole row of a matrix can be written with

    write (iu,"(*(1x,f0.6))") x(i,:)

    instead of looping over each element.
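
    For comparison, a rough C++ counterpart of the row-at-a-time write could format each row into a buffer and hand it to the C library in one call. This is only a sketch under assumptions, not the code from the linked repository: `append_row` and `write_matrix` are hypothetical names, the matrix is assumed to be stored row-major in one `std::vector<double>`, and `" %.6f"` is used as an approximation of `1x,f0.6`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: append one row as " v1 v2 ... vn\n" (six decimals) to `out`.
void append_row(std::string& out, const double* row, std::size_t ncol) {
    char buf[352];  // room for " %.6f" of the largest finite double
    for (std::size_t j = 0; j < ncol; ++j) {
        int n = std::snprintf(buf, sizeof buf, " %.6f", row[j]);
        assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
        out.append(buf, static_cast<std::size_t>(n));
    }
    out.push_back('\n');
}

// Hypothetical helper: write an nrow x ncol row-major matrix, one fwrite per row.
// Returns false if any write is short.
bool write_matrix(std::FILE* fp, const std::vector<double>& x,
                  std::size_t nrow, std::size_t ncol) {
    assert(fp != nullptr && x.size() == nrow * ncol);
    std::string line;
    for (std::size_t i = 0; i < nrow; ++i) {
        line.clear();  // keeps the capacity, so later rows do not reallocate
        append_row(line, x.data() + i * ncol, ncol);
        if (std::fwrite(line.data(), 1, line.size(), fp) != line.size()) {
            return false;
        }
    }
    return true;
}
```

    Each element is still formatted by its own `snprintf` call, so this only reduces the number of calls into the stream layer; whether that changes the timings would have to be measured.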

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Ram@21:1/5 to Beliavsky on Fri May 6 14:05:22 2022
    Beliavsky <beliavsky@aol.com> writes:
    > I am finding that writing 100x100 matrices of floats is about
    > 25% faster with gfortran than g++ using printf on Windows

    Console output can be very slow. If one is writing to a
    console, one should make sure to use the same console system
    for all languages tested. Writing to a file on disk often
    will be faster.
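
    One way to check this on a given system is to time the same formatted writes against different destinations. The following is a minimal sketch, assuming `std::chrono::steady_clock` is an adequate timer; `seconds_taken` and `write_values` are hypothetical names, not part of the benchmark under discussion.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical helper: run `work` once and return the elapsed wall time in seconds.
template <class F>
double seconds_taken(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical workload: write n formatted values to fp, then flush so that
// buffered output is included in the measurement. Returns false on error.
bool write_values(std::FILE* fp, int n) {
    assert(fp != nullptr);
    for (int k = 0; k < n; ++k) {
        if (std::fprintf(fp, " %.6f", k * 0.5) < 0) {
            return false;
        }
    }
    return std::fputc('\n', fp) != EOF && std::fflush(fp) == 0;
}
```

    Calling `seconds_taken([&] { write_values(stdout, 10000); })` and then the same with a `FILE*` obtained from `std::fopen` would give the two figures to compare, ideally with both programs under test writing to the same kind of destination.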

    I/O usually is deemed to be slow, so I think that the
    overhead of additional instructions for a loop in C++
    should be negligible compared to time for memory
    accesses, serialization of float values, and I/O.

    In addition to times for input and output, times for
    accesses to memory outside of cache memory also play a role.
    These could be increased if a matrix is stored distributed
    over memory areas that are distant from each other, in
    contrast to a matrix that is stored in a single block.
    Regular patterns of memory access are faster than
    irregular patterns.

    This might depend on how matrices are implemented in your
    C++ program.
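
    To make the distinction concrete: a `std::vector<std::vector<double>>` allocates every row separately, while a single flat buffer keeps the whole matrix in one block. Below is a minimal sketch of the single-block layout; the `Matrix` class is a hypothetical illustration, not the type used in the benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix: all elements in one contiguous block, row-major.
class Matrix {
public:
    Matrix(std::size_t nrow, std::size_t ncol)
        : nrow_(nrow), ncol_(ncol), data_(nrow * ncol, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) {
        assert(i < nrow_ && j < ncol_);
        return data_[i * ncol_ + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < nrow_ && j < ncol_);
        return data_[i * ncol_ + j];
    }

    // Pointer to the first element of row i; the row's ncol_ values follow it.
    const double* row(std::size_t i) const {
        assert(i < nrow_);
        return data_.data() + i * ncol_;
    }

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }

private:
    std::size_t nrow_;
    std::size_t ncol_;
    std::vector<double> data_;
};
```

    With this layout, walking the matrix row by row touches memory strictly in order, which is the regular access pattern described above.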

    An L1 cache reference might take 0.5 ns, an L2 cache
    reference 7 ns, while a main memory reference might take 100
    ns. When optimizing, it helps to be aware of the orders of
    magnitude.

    CPUS HAVE A HIERARCHICAL CACHE SYSTEM
    From a 2014 talk by Chandler Carruth

    One cycle on a 3 GHz processor                  1 ns
    L1 cache reference                            0.5 ns
    Branch mispredict                               5 ns
    L2 cache reference                              7 ns             14x L1 cache
    Mutex lock/unlock                              25 ns
    Main memory reference                         100 ns             20x L2, 200x L1
    Compress 1K bytes with Snappy               3,000 ns
    Send 1K bytes over 1 Gbps network          10,000 ns   0.01 ms
    Read 4K randomly from SSD                 150,000 ns   0.15 ms
    Read 1 MB sequentially from memory        250,000 ns   0.25 ms
    Round trip within same datacenter         500,000 ns   0.5 ms
    Read 1 MB sequentially from SSD         1,000,000 ns   1 ms      4x memory
    Disk seek                              10,000,000 ns   10 ms     20x datacenter round trip
    Read 1 MB sequentially from disk       20,000,000 ns   20 ms     80x memory, 20x SSD
    Send packet CA->Netherlands->CA       150,000,000 ns   150 ms

  • From Ron Shepard@21:1/5 to Beliavsky on Fri May 6 11:29:06 2022
    On 5/6/22 7:03 AM, Beliavsky wrote:
    > I am finding that writing 100x100 matrices of floats is about 25% faster with gfortran than g++ using printf on Windows and 4 times faster with gfortran than g++ on WSL2. I wonder what people see on Linux and if the C++ performance can be improved (or if the comparison is invalid for some reason). The codes and scripts are at https://github.com/Beliavsky/Formatted_output_speed . Maybe Fortran benefits because a whole row of a matrix can be written with
    >
    > write (iu,"(*(1x,f0.6))") x(i,:)
    >
    > instead of looping over each element.

    In addition to treating the i/o list as a sequence of scalars or as a
    vector, there is also the question, in both languages, of how often the
    format string is parsed. With FORMAT statements, compilers would
    typically parse the format strings at compile time, so no run time
    overhead occurred for that. Then when f77 allowed format strings held in
    character variables, there were sometimes significant differences in i/o
    costs between the two approaches. This was because the format string was
    parsed anew for each execution of the write statement. Then over time,
    this was optimized by compilers. First, the literal strings and
    character parameters were singled out and parsed at compile time the
    same way as format statements. Then the compilers started recognizing
    when variable strings were unchanged between write statements, and
    optimized that parsing at run time.

    I was in a computer users group back in the 1980s. Over a period of a
    few years, there were frequent discussions about the costs of the
    different fortran compilers on the IBM mainframe machines. Some users
    were pushing for the compilers that supported f77, others wanted to keep
    the old compilers because at first they were more efficient. My codes
    would run for hours at a time and then print a few pages of output, so
    my concern was the run time efficiency of the linear algebra, do loop
    executions, and so on. Someone in another group presented results that
    showed the f77 compiler was about 8x slower than the old f66+ compiler.
    He always showed the ratios, not the actual times. This went on for
    months. Then just by chance, someone asked him what were his actual
    times. It turned out that they were executions of a few seconds each.
    His jobs did a little bit of calculation and i/o, and then printed out
    listings that were a couple hundred pages long. His computer times were
    all dominated by i/o costs, which at that time were slower for the f77
    compiler than they were for the old f66+ compiler. So people like me,
    running jobs for hours at a time, were being held hostage by people
    running codes that consumed a few seconds of time, and it was over
    trivial issues like i/o library runtime performance.

    The joys of shared computing on mainframes.

    $.02 -Ron Shepard

  • From Beliavsky@21:1/5 to Ron Shepard on Fri May 6 09:43:15 2022
    Great story. Please consider also posting it at https://fortran-lang.discourse.group/t/anecdotal-fortran/704 .
