• poor performance while processing a file one byte a time

    From Mateusz Viste@21:1/5 to All on Thu Jan 20 15:16:22 2022
    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to
    speed this up?

    I have also tried to replace str_split() and ord() with unpack('C*'),
    but it was even slower. Anything else I could try?


    Mateusz

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 11:08:23 2022
    On 2022-01-20 09:16, Mateusz Viste wrote:
    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to speed this up?

    Your foreach(str_split) line is an obvious place to start. str_split()
    creates an array from a string. In your case, the input buffer is
    1024*1024 bytes long, so you're splitting that megabyte string and (re-)creating an array of more than a million elements for _each
    iteration of the loop_. (Which will be a million+ times.) Why? The
    very first thing you should consider is pulling that out and doing it
    just once:

    $buff = fread($fd, 1024 * 1024);
    $whatever = str_split($buff);
    foreach ($whatever as $b)
        ...

    (There's nothing specific to PHP about that advice either. It's equally applicable to C, although modern C compilers _may_ make that
    optimization for you.)

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 17:23:31 2022
    On Thu, 20 Jan 2022 11:08:23 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    Your foreach(str_split) line is an obvious place to start.
    str_split() creates an array from a string. In your case, the input
    buffer is 1024*1024 bytes long, so you're splitting that megabyte
    string and (re-)creating an array of more than a million elements for
    _each iteration of the loop_. (Which will be a million+ times.)

    Are you really sure about that? My understanding is that the foreach()
    argument is processed only once... Isn't that the case?

    Why? The very first thing you should consider is pulling that out
    and doing it just once:

    $buff = fread($fd, 1024 * 1024);
    $whatever = str_split($buff);
    foreach ($whatever as $b)

    Initially I had that version of the code, but it threw this error on
    the str_split() line:

    "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted
    (tried to allocate 4096 bytes)"

    I reduced the fread() buffer to 64K and the error went away. But the
    speed is still the same. Code is this:

    while (!feof($fd)) {
        $buffstr = fread($fd, 64 * 1024);
        $buffarr = str_split($buffstr);

        foreach ($buffarr as $b) {
            $result += ord($b);
            $result &= 0xffff;
        }
    }
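    A quick way to see why the 1 MiB buffer blew through the 128 MB limit:
    each element produced by str_split() costs dozens of bytes, not one,
    since every entry needs its own bucket and zval. A minimal sketch to
    measure it, assuming a PHP 7+ CLI:

```php
<?php
// Measure the real per-element memory cost of str_split() on a
// 1 MiB string: each array element needs a bucket + zval, which is
// far more than the single byte it represents.
$before = memory_get_usage();
$arr = str_split(str_repeat('a', 1024 * 1024)); // one element per byte
$after = memory_get_usage();
printf("per-element cost: ~%d bytes\n",
       intdiv($after - $before, count($arr)));
```

    On a typical PHP 7/8 build this reports on the order of 30 bytes per
    element, so each 1 MiB chunk balloons into tens of megabytes of array,
    and with temporary copies in flight the 128 MB default is quickly
    exhausted.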

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make that optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C. There
    is a "for", and its initialization expression is evaluated exactly once.


    Mateusz

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 17:37:03 2022
    On Thu, 20 Jan 2022 11:08:23 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    Your foreach(str_split) line is an obvious place to start.
    str_split() creates an array from a string. In your case, the input
    buffer is 1024*1024 bytes long, so you're splitting that megabyte
    string and (re-)creating an array of more than a million elements for
    _each iteration of the loop_. (Which will be a million+ times.)

    Okay, I have checked it, and I can confirm now that you are mistaken in
    your belief. See this program:

    <?php

    function getArr() {
        echo "getArr() call\n";
        return array(1, 2, 3);
    }

    foreach (getArr() as $i) {
        echo "{$i}\n";
    }



    And here is its output:

    $ php t.php
    getArr() call
    1
    2
    3

    The foreach() initialization is clearly processed only once.

    Any other ideas?


    Mateusz

  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 12:38:06 2022
    On 2022-01-20 11:23, Mateusz Viste wrote:

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make that
    optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C. There
    is a "for", and it's initialization argument is processed exactly once.

    I was speaking more broadly about pulling invariant code out of any and
    all loops regardless of where it appears in said loop, rather than
    relying on the interpreter or compiler to handle it for you.

  • From John-Paul Stewart@21:1/5 to Mateusz Viste on Thu Jan 20 12:41:58 2022
    On 2022-01-20 11:37, Mateusz Viste wrote:

    The foreach() initialization is clearly processed only once.

    Any other ideas?

    My next question is why create the array at all?

    You can just use a simple for loop to iterate over the string, with
    $buff[$i] to access it character by character. That would avoid the
    overhead (of both memory use and computation time) that's involved in
    creating the associative array.

  • From Mateusz Viste@21:1/5 to Arno Welzel on Thu Jan 20 19:45:24 2022
    On Thu, 20 Jan 2022 19:23:49 +0100
    Arno Welzel <usenet@arnowelzel.de> wrote:
    It works, but it is really slow (approximately 100x slower than the original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could
    use to speed this up?

    By *not* using arrays.

    That's what I figured, but I did not find any way to avoid using them. :)

    Just use the string as it is:

    $len = strlen($buff);
    $pos = 0;
    while ($pos < $len) {
        $result += ord($buff[$pos]);
        $result &= 0xffff;
        $pos++;
    }

    Yes, this is faster, but only by 10%. If I understand it right,
    addressing a string like an array is costly in PHP, which has to
    emulate array-like access under the hood anyway. I was hoping there
    might be a different approach that would be faster by at least an
    order of magnitude... But if PHP doesn't have any faster construct for
    this kind of thing, I will live with it.
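    One construct worth trying before settling, sketched here under the
    assumption that the checksum really is just a running byte sum masked
    to 16 bits (the name fn_fast() is a hypothetical stand-in): unpack('C*')
    was slow when paired with a per-byte foreach, but feeding its result
    straight into array_sum() keeps the whole inner loop inside PHP's
    C-level functions, and the mask only needs to be applied once per
    chunk because addition mod 0x10000 commutes:

```php
<?php
// Sketch: the same masked byte sum, with no interpreted per-byte loop.
// unpack('C*') turns the chunk into an array of integers (one per byte)
// and array_sum() adds them at C speed.
function fn_fast($fname) {
    $fd = fopen($fname, 'rb');
    if ($fd === false) return 0;

    $result = 0;
    while (!feof($fd)) {
        $buff = fread($fd, 64 * 1024);
        if ($buff === false || $buff === '') break;
        // Masking once per chunk equals masking after every byte:
        // ((a & 0xffff) + b) & 0xffff == (a + b) & 0xffff
        $result = ($result + array_sum(unpack('C*', $buff))) & 0xffff;
    }
    fclose($fd);
    return $result;
}
```

    A 64 KiB chunk sums to at most 65536 * 255, far below PHP's integer
    range, so no intermediate overflow is possible before the mask is
    applied.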

    However this may still be quite slow on larger files. Since it seems
    you want to create some kind of checksum based on the file content,
    you may want to use something else like hash_file(), sha1_file() or
    md5_file() and use the result of these calls instead of processing
    the whole file content in a loop.

    Sadly, that won't work. The kind of checksum I am computing is not
    supported by PHP, which is why I do it byte by byte myself. Another
    solution is to do it with my C code by system()-calling it from PHP,
    but that's really ugly. At this point I'd rather stick with a slow, 100%
    PHP solution.


    Mateusz

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 19:47:21 2022
    On Thu, 20 Jan 2022 12:41:58 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    On 2022-01-20 11:37, Mateusz Viste wrote:

    The foreach() initialization is clearly processed only once.

    Any other ideas?

    My next question is why create the array at all?

    You can just use a simple for loop to iterate over the string, with
    $buff[$i] to access it character by character. That would avoid the
    overhead (of both memory use and computation time) that's involved in creating the associative array.

    Yes, I am aware, and that is indeed also something I had tested, but it
    doesn't appear to be significantly faster than the array version.
    Apparently accessing a string's bytes in an array-like fashion is very
    costly.

    Mateusz

  • From Arno Welzel@21:1/5 to All on Thu Jan 20 19:23:49 2022
    Mateusz Viste:

    Hello,

    I am processing some files using PHP. Basically I read every byte of
    the file and perform a simple operation on it to compute a sum.

    My initial implementation was in C, but now I am trying to redo the
    same thing in PHP. This is what my PHP code looks like:


    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            foreach (str_split($buff) as $b) {
                $result += ord($b);
                $result &= 0xffff;
            }
        }

        fclose($fd);
        return($result);
    }

    It works, but it is really slow (approximately 100x slower than the
    original C code). I know that I should not expect much performance
    from interpreted PHP code, but still - is there any trick I could use to speed this up?

    By *not* using arrays.

    str_split() creates an array based on a string: <https://www.php.net/manual/en/function.str-split.php>

    I have also tried to replace str_split() and ord() with unpack('C*'),
    but it was even slower. Anything else I could try?

    Just use the string as it is:

    function fn($fname) {
        $fd = fopen($fname, 'rb');
        if ($fd === false) return(0);

        $result = 0;

        while (!feof($fd)) {
            $buff = fread($fd, 1024 * 1024);

            $len = strlen($buff);
            $pos = 0;
            while ($pos < $len) {
                $result += ord($buff[$pos]);
                $result &= 0xffff;
                $pos++;
            }
        }

        fclose($fd);
        return($result);
    }

    However this may still be quite slow on larger files. Since it seems you
    want to create some kind of checksum based on the file content, you may
    want to use something else like hash_file(), sha1_file() or md5_file()
    and use the result of these calls instead of processing the whole file
    content in a loop.


    --
    Arno Welzel
    https://arnowelzel.de

  • From Mateusz Viste@21:1/5 to John-Paul Stewart on Thu Jan 20 19:49:05 2022
    On Thu, 20 Jan 2022 12:38:06 -0500
    John-Paul Stewart <jpstewart@personalprojects.net> wrote:

    On 2022-01-20 11:23, Mateusz Viste wrote:

    (There's nothing specific to PHP about that advice either. It's
    equally applicable to C, although modern C compilers _may_ make
    that optimization for you.)

    That's hardly applicable in C, since there is no "foreach" in C.
    There is a "for", and its initialization expression is evaluated
    exactly once.

    I was speaking more broadly about pulling invariant code out of any
    and all loops regardless of where it appears in said loop, rather than relying on the interpreter or compiler to handle it for you.

    Ah, so you were answering a question that wasn't asked, and one that
    is irrelevant to the case at hand. Okay, that's fair. It's what
    Usenet is all about, after all. ;-)

    Mateusz

  • From Arno Welzel@21:1/5 to All on Thu Jan 20 20:27:32 2022
    Mateusz Viste:

    On Thu, 20 Jan 2022 19:23:49 +0100
    Arno Welzel <usenet@arnowelzel.de> wrote:
    [...]
    However this may still be quite slow on larger files. Since it seems
    you want to create some kind of checksum based on the file content,
    you may want to use something else like hash_file(), sha1_file() or
    md5_file() and use the result of these calls instead of processing
    the whole file content in a loop.

    Sadly, that won't work. The kind of checksum I am computing is not
    supported by PHP, which is why I do it byte by byte myself. Another
    solution is to do it with my C code by system()-calling it from PHP,
    but that's really ugly. At this point I'd rather stick with a slow, 100%
    PHP solution.

    Or you could create your own PHP extension, which can then be used in
    the script ;-)


    --
    Arno Welzel
    https://arnowelzel.de
