• poor performance while processing a file one byte a time

    From Mateusz Viste@21:1/5 to All on Thu Jan 20 15:16:22 2022
    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to
    speed this up?

    I have also tried to replace str_split() and ord() with unpack('C*'),
    but it was even slower. Anything else I could try?


    Mateusz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 11:08:23 2022
    On 2022-01-20 09:16, Mateusz Viste wrote:
    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to speed this up?

    Your foreach(str_split) line is an obvious place to start. str_split()
    creates an array from a string. In your case, the input buffer is
    1024*1024 bytes long, so you're splitting that megabyte string and (re-)creating an array of more than a million elements for _each
    iteration of the loop_. (Which will be a million+ times.) Why? The
    very first thing you should consider is pulling that out and doing it
    just once:

    $buff = fread($fd, 1024 * 1024);
    $whatever = str_split($buff);
    foreach ($whatever as $b)
        ...

    (There's nothing specific to PHP about that advice either. It's equally applicable to C, although modern C compilers _may_ make that
    optimization for you.)

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 17:23:31 2022
    On Thu, 20 Jan 2022 11:08:23 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    Your foreach(str_split) line is an obvious place to start.
    str_split() creates an array from a string. In your case, the input
    buffer is 1024*1024 bytes long, so you're splitting that megabyte
    string and (re-)creating an array of more than a million elements for
    _each iteration of the loop_. (Which will be a million+ times.)

    Are you really sure about that? My understanding is that the foreach()
    argument is processed only once... Isn't that the case?

    Why? The very first thing you should consider is pulling that out
    and doing it just once:

    $buff = fread($fd, 1024 * 1024);
    $whatever = str_split($buff);
    foreach ($whatever as $b)

    Initially I had that version of the code, but it threw this error on
    the str_split() line:

    "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted
    (tried to allocate 4096 bytes)"

    I reduced the fread() buffer to 64K and the error went away. But the
    speed is still the same. Code is this:

    while (!feof($fd)) {
        $buffstr = fread($fd, 64 * 1024);
        $buffarr = str_split($buffstr);

        foreach ($buffarr as $b) {
            $result += ord($b);
            $result &= 0xffff;
        }
    }
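    A quick way to see why the 1 MiB buffer blew through the 128 MB limit:
    each element produced by str_split() costs dozens of bytes, not one,
    since every entry needs its own bucket and zval. A minimal sketch to
    measure it, assuming a PHP 7+ CLI:

```php
<?php
// Measure the real per-element memory cost of str_split() on a
// 1 MiB string: each array element needs a bucket + zval, which is
// far more than the single byte it represents.
$before = memory_get_usage();
$arr = str_split(str_repeat('a', 1024 * 1024)); // one element per byte
$after = memory_get_usage();
printf("per-element cost: ~%d bytes\n",
       intdiv($after - $before, count($arr)));
```

    On a typical PHP 7/8 build this reports on the order of 30 bytes per
    element, so each 1 MiB chunk balloons into tens of megabytes of array,
    and with temporary copies in flight the 128 MB default is quickly
    exhausted.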

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make that optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C. There
    is a "for", and its initialization expression is evaluated exactly once.


    Mateusz

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 17:37:03 2022
    On Thu, 20 Jan 2022 11:08:23 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    Your foreach(str_split) line is an obvious place to start.
    str_split() creates an array from a string. In your case, the input
    buffer is 1024*1024 bytes long, so you're splitting that megabyte
    string and (re-)creating an array of more than a million elements for
    _each iteration of the loop_. (Which will be a million+ times.)

    Okay, I have checked it, and I can confirm now that you are mistaken in
    your belief. See this program:

    <?php

    function getArr() {
        echo "getArr() call\n";
        return array(1, 2, 3);
    }

    foreach (getArr() as $i) {
        echo "{$i}\n";
    }



    And here is its output:

    $ php t.php
    getArr() call
    1
    2
    3

    The foreach() initialization is clearly processed only once.

    Any other ideas?


    Mateusz

  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 12:38:06 2022
    On 2022-01-20 11:23, Mateusz Viste wrote:

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make that
    optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C. There
    is a "for", and it's initialization argument is processed exactly once.

    I was speaking more broadly about pulling invariant code out of any and
    all loops regardless of where it appears in said loop, rather than
    relying on the interpreter or compiler to handle it for you.

  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 12:41:58 2022
    On 2022-01-20 11:37, Mateusz Viste wrote:

    The foreach() initialization is clearly processed only once.

    Any other ideas?

    My next question is why create the array at all?

    You can just use a simple for loop to iterate over the string, with
    $buff[$i] to access it character by character. That would avoid the
    overhead (of both memory use and computation time) that's involved in
    creating the associative array.

  • From Mateusz Viste@21:1/5 to Arno Welzel on Thu Jan 20 19:45:24 2022
    On Thu, 20 Jan 2022 19:23:49 +0100
    Arno Welzel <usenet@arnowelzel.de> wrote:
    It works, but it is really slow (approximately 100x slower than the original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could
    use to speed this up?

    By *not* using arrays.

    That's what I figured, but I did not find any way to avoid using them. :)

    Just use the string as it is:

    $len = strlen($buff);
    $pos = 0;
    while ($pos < $len) {
        $result += ord($buff[$pos]);
        $result &= 0xffff;
        $pos++;
    }

    Yes, this is faster, but only by 10%. If I understand it right,
    addressing a string like an array is costly in PHP, which has to
    emulate array-like access under the hood anyway. I was hoping there
    might be a different approach that would be faster by at least an
    order of magnitude... But if PHP doesn't have any faster construct for
    this kind of thing, I will live with it.
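    One construct worth trying before settling, sketched here under the
    assumption that the checksum really is just a running byte sum masked
    to 16 bits (the name fn_fast() is a hypothetical stand-in): unpack('C*')
    was slow when paired with a per-byte foreach, but feeding its result
    straight into array_sum() keeps the whole inner loop inside PHP's
    C-level functions, and the mask only needs to be applied once per
    chunk because addition mod 0x10000 commutes:

```php
<?php
// Sketch: the same masked byte sum, with no interpreted per-byte loop.
// unpack('C*') turns the chunk into an array of integers (one per byte)
// and array_sum() adds them at C speed.
function fn_fast($fname) {
    $fd = fopen($fname, 'rb');
    if ($fd === false) return 0;

    $result = 0;
    while (!feof($fd)) {
        $buff = fread($fd, 64 * 1024);
        if ($buff === false || $buff === '') break;
        // Masking once per chunk equals masking after every byte:
        // ((a & 0xffff) + b) & 0xffff == (a + b) & 0xffff
        $result = ($result + array_sum(unpack('C*', $buff))) & 0xffff;
    }
    fclose($fd);
    return $result;
}
```

    A 64 KiB chunk sums to at most 65536 * 255, far below PHP's integer
    range, so no intermediate overflow is possible before the mask is
    applied.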

    However this may still be quite slow on larger files. Since it seems
    you want to create some kind of checksum based on the file content,
    you may want to use something else like hash_file(), sha1_file() or
    md5_file() and use the result of these calls instead of processing
    the whole file content in a loop.

    Sadly, that won't work. The kind of checksum I am computing is not
    supported by PHP, which is why I do it byte by byte myself. Another
    solution is to do it with my C code by system()-calling it from PHP,
    but that's really ugly. At this point I'd rather stick with a slow, 100%
    PHP solution.


    Mateusz

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 19:47:21 2022
    On Thu, 20 Jan 2022 12:41:58 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    On 2022-01-20 11:37, Mateusz Viste wrote:

    The foreach() initialization is clearly processed only once.

    Any other ideas?

    My next question is why create the array at all?

    You can just use a simple for loop to iterate over the string, with
    $buff[$i] to access it character by character. That would avoid the
    overhead (of both memory use and computation time) that's involved in creating the associative array.

    Yes, I am aware, and that is indeed also something I had tested, but it
    doesn't appear to be significantly faster than the array version.
    Apparently accessing a string's bytes in an array-like fashion is very
    costly.

    Mateusz

  • From Arno Welzel@21:1/5 to All on Thu Jan 20 19:23:49 2022
    Mateusz Viste:

    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to speed this up?

    By *not* using arrays.

    str_split() creates an array based on a string: <https://www.php.net/manual/en/function.str-split.php>

    I have also tried to replace str_split() and ord() with unpack('C*'),
    but it was even slower. Anything else I could try?

    Just use the string as it is:

    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            $len = strlen($buff);
            $pos = 0;
            while ($pos < $len) {
                $result += ord($buff[$pos]);
                $result &= 0xffff;
                $pos++;
            }
        }

        fclose($fd);
        return($result);
    }

    However this may still be quite slow on larger files. Since it seems you
    want to create some kind of checksum based on the file content, you may
    want to use something else like hash_file(), sha1_file() or md5_file()
    and use the result of these calls instead of processing the whole file
    content in a loop.


    --
    Arno Welzel
    https://arnowelzel.de

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 19:49:05 2022
    On Thu, 20 Jan 2022 12:38:06 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    On 2022-01-20 11:23, Mateusz Viste wrote:

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make
    that optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C.
    There is a "for", and its initialization expression is evaluated
    exactly once.

    I was speaking more broadly about pulling invariant code out of any
    and all loops regardless of where it appears in said loop, rather than relying on the interpreter or compiler to handle it for you.

    Ah, so you were answering a question that wasn't asked, and one that
    is irrelevant to the case at hand. Okay, that's fair. It's what
    Usenet is all about, after all. ;-)

    Mateusz

  • From Arno Welzel@21:1/5 to All on Thu Jan 20 20:27:32 2022
    Mateusz Viste:

    On Thu, 20 Jan 2022 19:23:49 +0100
    Arno Welzel <usenet@arnowelzel.de> wrote:
    [...]
    However this may still be quite slow on larger files. Since it seems
    you want to create some kind of checksum based on the file content,
    you may want to use something else like hash_file(), sha1_file() or
    md5_file() and use the result of these calls instead of processing
    the whole file content in a loop.

    Sadly, that won't work. The kind of checksum I am computing is not
    supported by PHP, which is why I do it byte by byte myself. Another
    solution is to do it with my C code by system()-calling it from PHP,
    but that's really ugly. At this point I'd rather stick with a slow, 100%
    PHP solution.

    Or you could create your own PHP extension, which can then be used in
    the script ;-)


    --
    Arno Welzel
    https://arnowelzel.de
