Hi, Janis
On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while read -u3
do echo $( 3<# )
done
[snip]
Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.
I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.
I don't have an explanation for this behaviour.
[1] Your script resulted in [snip]
09:50 $ ./bo.ksh lorem_ipsum.txt [snip]
0
649
710
09:50 $ [snip]
and egrep tells me
09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
0
636    My blank line, above
648
649 [snip]
710
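For anyone repeating this cross-check, the grep pipeline can be wrapped in a small function. This is only a sketch: the name `offsets_grep` is made up for illustration, and `-b` is a GNU/BSD grep extension rather than a POSIX option. It uses `grep -b ''` (the empty pattern matches every line) instead of `egrep`, which recent GNU grep releases flag as obsolescent.

```shell
# offsets_grep FILE - print the byte offset at which each line of FILE starts.
# Hypothetical helper name; relies on the non-POSIX -b option of GNU/BSD grep.
offsets_grep() {
    # The empty pattern matches every line; -b prefixes each match with "OFFSET:".
    # cut keeps only the offset, even if the line itself contains colons.
    LC_ALL=C grep -b '' "$1" | cut -d: -f1
}
```

Called as `offsets_grep lorem_ipsum.txt`, it prints one offset per line, which is what the egrep | awk pipeline above produces.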
On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
Hi, Janis
On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while read -u3
do echo $( 3<# )
done
[snip]
Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.
I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.
I don't have an explanation for this behaviour.
[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done
Awwwwww fsck!
I copied the wrong script. Your followup noted that this version
had problems.
I'll retry with the correct script.
Sorry to have been a nuisance :-(
On 03.04.2023 13:03, Janis Papanagnou wrote:
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
[...]
Don't use the second variant (the one using 3<#$'\n' ), it is
*not* running reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.
In article <u0edfh$2utkp$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 03.04.2023 13:03, Janis Papanagnou wrote:
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
[...]
Don't use the second variant (the one using 3<#$'\n' ), it is
*not* running reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.
I don't know what the overall goal is,
but wouldn't this be easier:
$ awk '{ print tot+0;tot += length + 1 }' file
Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...
Or z/OS...
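One caveat on "bytes are bytes": in a multibyte locale such as UTF-8, some awk implementations (gawk, for one) make `length` count characters rather than bytes, so the running total drifts on non-ASCII text. Below is a hedged sketch of the same one-liner pinned to the C locale. The function name `offsets_awk` is invented here, and like the original it assumes every line ends in exactly one newline byte.

```shell
# offsets_awk FILE - print the starting byte offset of each line of FILE.
# Hypothetical helper name; same logic as the awk one-liner above.
offsets_awk() {
    # LC_ALL=C makes length() count bytes, not multibyte characters.
    # "+ 1" accounts for the newline that awk strips from each record.
    LC_ALL=C awk '{ print tot + 0; tot += length($0) + 1 }' "$1"
}
```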
On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:
On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
Hi, Janis
On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while read -u3
do echo $( 3<# )
done
[snip]
Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.
I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.
I don't have an explanation for this behaviour.
[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done
Awwwwww fsck!
I copied the wrong script. Your followup noted that this version
had problems.
I'll retry with the correct script.
Sorry to have been a nuisance :-(
Retesting with the /correct/ script shows that it duplicated
the results of my egrep pipe. It looks like this script is a
winner.
Thanks for the education; I learned something new today. :-)
On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
I copied the wrong script. Your followup noted that this version
had problems.
I'll retry with the correct script.
On 03.04.2023 16:28, Kenny McCormack wrote:
I don't know what the overall goal is,
The goal was to create an index file.[*]
but wouldn't this be easier:
$ awk '{ print tot+0;tot += length + 1 }' file
Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...
Or z/OS...
Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
variant; I like it for its readability.
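A possible way to get terminator-agnostic offsets with standard tools is to walk the file byte by byte and treat LF, CR, and CRLF each as one line ending. The sketch below is an illustration, not a tuned solution: the function name `offsets_anyeol` is hypothetical, and pushing every byte through od, tr and awk will be slow on large files.

```shell
# offsets_anyeol FILE - print the starting byte offset of each line of FILE,
# accepting LF, CR or CRLF as line terminators. Hypothetical helper name.
offsets_anyeol() {
    # od prints every byte as an unsigned decimal; tr puts one value per line.
    LC_ALL=C od -An -v -tu1 "$1" | tr -s ' ' '\n' | awk '
        BEGIN { bol = 1; pos = 0; prev = -1 }
        /^[0-9]+$/ {
            b = $0 + 0
            # A byte starts a new line if the previous byte ended one,
            # unless it is the LF half of a CRLF pair.
            if (bol && !(prev == 13 && b == 10)) print pos
            bol = (b == 10 || b == 13)
            prev = b
            pos++
        }'
}
```

On a plain LF file this prints one offset per line start and, unlike the ksh script, does not print the end-of-file offset after the last line.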
I just needed to determine the byte-offsets of all lines in a text file
to create an index file.
On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while read -u3
do echo $( 3<# )
done
Even though that code is fast enough for my (MB sized) files, using
Kornshell's pattern seek-redirections to locate the newlines in the
file seems to be significantly faster than the 'read' based approach:
#!/bin/ksh
# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename
f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"
echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done
On a 320 MB test file the first script requires ~8 seconds and the
second one ~0.3 seconds.
Janis
[*] Does 'sed' maybe support such a function? Or is there any other
standard tool I missed?
[**] Based on a feature of newer versions of Kornshell (e.g. ksh93u+).
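As a sketch of what such an index file might then be used for: with the offset list saved to a file, a single line can be fetched by its offset. The function name `line_at` and its argument order are hypothetical, and the index is assumed to hold one starting offset per line, in the format the scripts above print.

```shell
# line_at FILE INDEX N - print line N (1-based) of FILE using the offsets in INDEX.
# Hypothetical helper; INDEX holds one starting byte offset per line.
line_at() {
    # Pick the N-th offset from the index file.
    off=$(sed -n "${3}p" "$2")
    [ -n "$off" ] || return 1
    # tail -c +K starts output at byte K, counted from 1, hence the +1.
    tail -c "+$((off + 1))" "$1" | head -n 1
}
```

For example, after `./byteoffset data.txt > data.idx`, the call `line_at data.txt data.idx 3` would print the third line of data.txt (both file names are placeholders).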