Forum: >>> Magnum BBS <<<

Byte-offset of lines in a text file

From Janis Papanagnou@21:1/5 to All on Mon Apr 3 13:03:07 2023

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested:

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

Even though that code is fast enough for my (MB sized) files, using
Kornshell's pattern seek-redirections to locate the newlines in the
file seems to be significantly faster than the 'read' based approach:

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done

On a 320 MB test file the first script requires ~8 seconds and the
second one ~0.3 seconds.

Janis

[*] Does 'sed' maybe support such a function? Or is there any other
standard tool I missed?

[**] Based on a feature of newer versions of Kornshell (e.g. ksh93u+).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Apr 3 13:32:01 2023

On 03.04.2023 13:03, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
[...]

Don't use the second variant (the one using 3<#$'\n' ); it does
*not* run reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.

Janis


From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:13:12 2023

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

Well, I have an observation that may lead to an explanation.

My lorem_ipsum.txt file has a number of "blank" lines, the first of which
is at displacement 648. Your script properly reports all the lines that
follow that blank line. It appears that, somehow, your script ignores
everything before the first blank line.

[1] Your script resulted in

[snip]

09:50 $ ./bo.ksh lorem_ipsum.txt
0
649
710

[snip]

09:50 $
and egrep tells me
09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
0

[snip]

636
648

My blank line, above

649
710

[snip]

Hope this helps in the diagnosis
--
Lew Pitcher
"In Skills We Trust"


From Lew Pitcher@21:1/5 to Janis Papanagnou on Mon Apr 3 13:56:16 2023

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done
09:50 $ ./bo.ksh lorem_ipsum.txt
0
649
710
767
832
896
957
1019
1085
1152
1153
1213
1277
1340
1401
1464
1530
1594
1660
1724
1788
1850
1913
1925
1926
2478
2479
09:50 $
and egrep tells me
09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
0
57
113
170
229
285
341
396
452
513
574
636
648
649
710
767
832
896
957
1019
1085
1152
1153
1213
1277
1340
1401
1464
1530
1594
1660
1724
1788
1850
1913
1925
1926
2478
2479
09:52 $
The first 12 lines of my test file are
09:55 $ head -12 lorem_ipsum.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus ultricies, risus sed consectetur mattis, orci
leo eleifend nisl, quis lobortis urna enim at diam. Duis
placerat ac orci ut cursus. Morbi commodo purus et dapibus
lobortis. Maecenas at ante lectus. Duis semper magna in
nisi accumsan pharetra. Mauris porttitor lorem erat, ac
condimentum quam faucibus dictum. Cras et tortor orci.
Quisque fringilla porttitor semper. Nunc imperdiet enim
est, tristique maximus nunc convallis sagittis. Pellentesque
cursus odio elit, ac viverra tortor varius quis. In bibendum
viverra turpis, ut eleifend lectus malesuada at. Vivamus quis
orci nulla.
09:55 $

--
Lew Pitcher
"In Skills We Trust"


From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:18:59 2023

On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done

Awwwwww fsck!

I copied the wrong script. Your followup noted that this version
had problems.

I'll retry with the correct script.

Sorry to have been a nuisance :-(

Retesting with the /correct/ script shows that it duplicated
the results of my egrep pipe. It looks like this script is a
winner.

Thanks for the education; I learned something new today. :-)
--
Lew Pitcher
"In Skills We Trust"


From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Mon Apr 3 14:28:19 2023

In article <u0edfh$2utkp$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 03.04.2023 13:03, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
[...]

Don't use the second variant (the one using 3<#$'\n' ), it is
*not* running reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.

I don't know what the overall goal is, but wouldn't this be easier:

$ awk '{ print tot+0;tot += length + 1 }' file

Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...

Or z/OS...
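
One caveat to the one-liner above: awk's length counts characters, not bytes, so in a UTF-8 locale some implementations (gawk, for instance) would undercount lines that contain multibyte characters. A minimal sketch that forces byte semantics through the C locale; the function name byte_offsets is made up for illustration, and LF line endings are still assumed:

```shell
# byte_offsets FILE - print the byte offset at which each line starts.
# Assumption: lines end in a single LF byte.
# LC_ALL=C makes awk's length() count bytes rather than characters.
byte_offsets() {
    LC_ALL=C awk '{ print tot + 0; tot += length($0) + 1 }' "$1"
}
```

It would be called as: byte_offsets file > file.idx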

--
The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/ModernXtian


From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:16:05 2023

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done

Awwwwww fsck!

I copied the wrong script. Your followup noted that this version
had problems.

I'll retry with the correct script.

Sorry to have been a nuisance :-(
--
Lew Pitcher
"In Skills We Trust"


From Janis Papanagnou@21:1/5 to Kenny McCormack on Mon Apr 3 19:36:59 2023

On 03.04.2023 16:28, Kenny McCormack wrote:

In article <u0edfh$2utkp$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 03.04.2023 13:03, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
[...]

Don't use the second variant (the one using 3<#$'\n' ), it is
*not* running reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.

I don't know what the overall goal is,

The goal was to create an index file.[*]

but wouldn't this be easier:

$ awk '{ print tot+0;tot += length + 1 }' file

Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...

Or z/OS...

Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
variant; I like it for its readability.
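
For what it's worth, a terminator-agnostic version could be sketched by dumping the file as decimal byte values with od and letting awk walk them, treating LF, CR, and CRLF each as one line terminator. This is only a sketch under those assumptions, not a tested tool: the name line_offsets_any_eol is invented here, and converting every byte to text makes it far slower than the plain awk variant.

```shell
# line_offsets_any_eol FILE - print the byte offset of each line start,
# accepting LF, CR, or CRLF as the line terminator.
line_offsets_any_eol() {
    # od prints every byte as an unsigned decimal number; awk walks them.
    od -An -v -tu1 < "$1" | awk '
        BEGIN { atstart = 1; pos = 0; prevcr = 0 }
        {
            for (i = 1; i <= NF; i++) {
                b = $i + 0
                if (prevcr && b == 10) {     # LF that completes a CRLF pair
                    prevcr = 0; pos++
                    continue
                }
                if (atstart) { print pos; atstart = 0 }
                if (b == 10 || b == 13) atstart = 1   # next byte starts a line
                prevcr = (b == 13)
                pos++
            }
        }'
}
```

Like the awk one-liner, it prints nothing for an empty file and prints no extra entry after the final terminator.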

Janis

[*] The task actually is (in a JavaScript context) to use a low-level
JavaScript file access function without loading the whole (huge) file
into memory; this function was suggested to me, and it requires such
an index that I have to create beforehand. (I thought there'd be some
standard tool for such a (standard-?) task available.)
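
To illustrate what such an index buys, here is a small shell sketch of the lookup side: given a byte offset taken from the index, it fetches just that line without reading what comes before it. The helper name line_at is hypothetical, and this is only a shell analogue of the access pattern, not the JavaScript function mentioned above.

```shell
# line_at FILE OFFSET - print the line that starts at byte OFFSET (0-based).
line_at() {
    # tail -c +N starts output at byte N counting from 1, hence OFFSET + 1.
    tail -c +"$(( $2 + 1 ))" "$1" | head -n 1
}
```

For example, line_at file 649 would print the line whose index entry is 649.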


From Janis Papanagnou@21:1/5 to Lew Pitcher on Mon Apr 3 19:49:46 2023

On 03.04.2023 16:18, Lew Pitcher wrote:

On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done

Awwwwww fsck!

I copied the wrong script. Your followup noted that this version
had problems.

I'll retry with the correct script.

Sorry to have been a nuisance :-(

You haven't been. I appreciate any tests and feedback.

Retesting with the /correct/ script shows that it duplicated
the results of my egrep pipe. It looks like this script is a
winner.

You seem to have been using the 3<#$'\n' based variant? - And it
works? - Still not reliably in my environment. I'll have to examine
that further.

Thanks for the education; I learned something new today. :-)

Thanks for your tests! (And for the overall confirmation.) I'm a bit
reluctant when using Kornshell's newer "redirection" operators; they
seem to not be reliable as I experienced [in my environment] in the
past. (Maybe it's advisable to test and confirm that in Martijn's
ksh93u+m, which generally seems to be much more reliable.)

Janis


From Janis Papanagnou@21:1/5 to Lew Pitcher on Mon Apr 3 19:56:16 2023

On 03.04.2023 16:16, Lew Pitcher wrote:

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

I copied the wrong script. Your followup noted that this version
had problems.

I'll retry with the correct script.

Argh! - And I missed this post.

So you made the same observation that I made. - Thanks!

Janis


From Spiros Bousbouras@21:1/5 to Janis Papanagnou on Tue Apr 4 02:45:34 2023

On Mon, 3 Apr 2023 19:36:59 +0200
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:

On 03.04.2023 16:28, Kenny McCormack wrote:

I don't know what the overall goal is,

The goal was to create an index file.[*]

but wouldn't this be easier:

$ awk '{ print tot+0;tot += length + 1 }' file

Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...

Or z/OS...

Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
variant; I like it for its readability.

A generalisation is

awk -v nob=$(echo | wc -c) '{ print tot+0 ; tot += length + nob }' file

But I haven't tested it on a system where the newline sequence is different
than the single LF byte. And there are (or used to be) operating systems
where there is no notion of newline sequence and files are made of
records where each record is a line.
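
A portability wrinkle in the nob=$(echo | wc -c) form: some wc implementations (the BSD and macOS ones, for example) pad the count with leading blanks, and because the substitution is unquoted the padding would split the -v argument. A hedged sketch that normalises the count through shell arithmetic first; offsets_nob is a hypothetical name, and the C locale is added so that length counts bytes:

```shell
# offsets_nob FILE - line start offsets, with the newline width measured
# on the local system instead of being hard-coded as 1.
offsets_nob() {
    # $(( ... )) strips any blanks that wc may put around the number.
    nob=$(( $(echo | wc -c) ))
    LC_ALL=C awk -v nob="$nob" '{ print tot + 0; tot += length($0) + nob }' "$1"
}
```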

A different consideration : is it ok if the output is the same regardless
of whether the input ends in a newline sequence or not ? With the above
awk scripts it will be the same.
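
If the distinction matters to whatever consumes the index, the last byte of the file can be inspected separately. A small sketch, assuming a POSIX tail and od; the helper name ends_with_newline is invented here:

```shell
# ends_with_newline FILE - succeed if the last byte of FILE is LF.
# An empty file has no last byte, so the check fails for it.
ends_with_newline() {
    last=$(tail -c 1 < "$1" | od -An -tu1 | tr -d ' \n')
    [ "$last" = 10 ]
}
```

Used as: ends_with_newline file || echo "last line is unterminated"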

--
As someone once joked, "It's easier to prove the Riemann hypothesis than to get someone to read your proof!"
http://empslocal.ex.ac.uk/people/staff/mrwatkin/zeta/RHproofs.htm


From Janis Papanagnou@21:1/5 to Janis Papanagnou on Tue Apr 4 09:27:59 2023

On 03.04.2023 13:03, Janis Papanagnou wrote:

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

[...]

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

Just occurred to me; to shorten that a bit and avoid duplicate pieces
of code...

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

while echo $( 3<# ) ; read -u3
do :
done
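
The shortened loop relies on the rule that a while condition may be a list of commands and only the exit status of the last one decides whether the loop continues. A plain sh sketch of the same idiom, without the ksh offset operator, so it only counts iterations; echo_then_read and its output text are made up for illustration:

```shell
# echo_then_read - read lines from stdin using "while cmd1; cmd2" as condition.
# The echo runs before every read attempt, including the final failing one,
# so it prints one more line than there are input lines.
echo_then_read() {
    n=0
    while echo "check $n"; IFS= read -r _line
    do n=$((n + 1))
    done
}
```

Fed two lines, e.g. printf 'a\nb\n' | echo_then_read, it prints three "check" lines, which mirrors how the script above prints an offset before each read attempt.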

[...]

Janis



From Janis Papanagnou@21:1/5 to Lew Pitcher on Thu Apr 6 07:49:46 2023

On 03.04.2023 16:18, Lew Pitcher wrote:

Retesting with the /correct/ script shows that it duplicated
the results of my egrep pipe. It looks like this script is a
winner.

Not really. The shell's read-loop is slow and the egrep/awk pipe
seems to be a lot faster. As long as I cannot make the shell's
pattern seek functional and reliable it makes sense to stay with
the pipe. Wasn't aware of grep's '-b' option; thanks for that!
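
For reference, the pipe can be written without egrep (which recent GNU grep releases flag as obsolescent) and without awk. A minimal sketch, assuming a grep that supports -b (a GNU/BSD extension, not POSIX) and a plain text input file; the name grep_offsets is hypothetical:

```shell
# grep_offsets FILE - print the byte offset of every line in FILE.
grep_offsets() {
    # The empty pattern matches every line; -b prefixes each with its
    # byte offset followed by a colon, and cut keeps only that prefix.
    LC_ALL=C grep -b '' "$1" | cut -d: -f1
}
```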

Janis

