• Byte-offset of lines in a text file

    From Janis Papanagnou@21:1/5 to All on Mon Apr 3 13:03:07 2023
    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.

    On a quick search I couldn't find any Unix tool/shell solution[*], so I
    wrote this quick hack[**], which I share here in case anyone's interested:

    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while read -r -u3   # -r: a trailing backslash must not join two lines
    do echo $( 3<# )
    done


    Even though that code is fast enough for my (MB-sized) files, using
    Kornshell's pattern seek-redirections to locate the newlines in the
    file seems to be significantly faster than the 'read'-based approach:

    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while 3<#$'\n'
    do echo $( 3<# )
    done


    On a 320 MB test file the first script requires ~8 seconds and the
    second one ~0.3 seconds.

    Janis

    [*] Does 'sed' maybe support such a function? Or is there any other
    standard tool I missed?

    [**] Based on a feature of newer versions of Kornshell (e.g. ksh93u+).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Mon Apr 3 13:32:01 2023
    On 03.04.2023 13:03, Janis Papanagnou wrote:
    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.
    [...]

    Don't use the second variant (the one using 3<#$'\n' ), it is
    *not* running reliably, as I noticed with more tests! The first
    one (using read -u3 ) runs just fine, though.

    Janis
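
    One way to test either variant for reliability is to check its output
    against the file itself: every offset other than 0 must sit directly
    after a newline byte. A minimal sketch in POSIX sh, assuming LF line
    ends; the name check_offsets is made up for illustration, and head -c
    is widely available but not strictly POSIX.

```shell
# check_offsets FILE  (hypothetical helper, not from the thread)
# Reads byte offsets on stdin, one per line, and reports every offset
# that does not sit directly after a newline byte in FILE.
check_offsets() {
    file=$1
    bad=0
    while read -r off
    do
        [ "$off" -eq 0 ] && continue    # offset 0 is always a line start
        # tail -c +K starts at byte K (1-based), i.e. the byte in front
        # of the 0-based offset; od prints that byte as octal (LF is 012).
        prev=$(tail -c "+$off" "$file" | head -c 1 | od -An -to1 | tr -d '[:space:]')
        if [ "$prev" != 012 ]
        then
            echo "offset $off is not at a line start" >&2
            bad=1
        fi
    done
    return "$bad"
}
```

    Usage would be along the lines of "./byteoffset file | check_offsets file".
    It starts several processes per offset, so it is only meant for
    spot-checking small test files.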

  • From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:13:12 2023
    On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

    Hi, Janis

    On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.

    On a quick search I couldn't find any Unix tool/shell solution[*] so I
    wrote this quick hack[**] that I share here in case anyone's interested

    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while read -u3
    do echo $( 3<# )
    done

    [snip]

    Your script intrigued me. While I don't normally use kornshell, I decided
    to try the script out to see what it did.

    I have a multi-line lorem ipsum test file that I fed to your script, and
    it came up with some funny numbers. Specifically, it missed the first
    few lines of the file. I double-checked, both visually and with grep[1]
    egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
    and it appears that your script somehow ignores the first ~600 bytes
    of my test file.

    I don't have an explanation for this behaviour.

    Well, I have an observation, that may lead to an explanation.

    My lorem_ipsum.txt file has a number of "blank" lines, the first of which
    is at displacement 648. Your script properly reports all the lines that
    follow that blank line. It appears that, somehow, your script ignores
    everything before the first blank line.

    [1] Your script resulted in
    [snip]
    09:50 $ ./bo.ksh lorem_ipsum.txt
    0
    649
    710
    [snip]
    09:50 $
    and egrep tells me
    09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
    0
    [snip]
    636
    648
    My blank line, above
    649
    710
    [snip]

    Hope this helps in the diagnosis
    --
    Lew Pitcher
    "In Skills We Trust"

  • From Lew Pitcher@21:1/5 to Janis Papanagnou on Mon Apr 3 13:56:16 2023
    Hi, Janis

    On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.

    On a quick search I couldn't find any Unix tool/shell solution[*] so I
    wrote this quick hack[**] that I share here in case anyone's interested

    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while read -u3
    do echo $( 3<# )
    done

    [snip]

    Your script intrigued me. While I don't normally use kornshell, I decided
    to try the script out to see what it did.

    I have a multi-line lorem ipsum test file that I fed to your script, and
    it came up with some funny numbers. Specifically, it missed the first
    few lines of the file. I double-checked, both visually and with grep[1]
    egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
    and it appears that your script somehow ignores the first ~600 bytes
    of my test file.

    I don't have an explanation for this behaviour.

    [1] Your script resulted in
    09:50 $ cat ./bo.ksh
    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while 3<#$'\n'
    do echo $( 3<# )
    done
    09:50 $ ./bo.ksh lorem_ipsum.txt
    0
    649
    710
    767
    832
    896
    957
    1019
    1085
    1152
    1153
    1213
    1277
    1340
    1401
    1464
    1530
    1594
    1660
    1724
    1788
    1850
    1913
    1925
    1926
    2478
    2479
    09:50 $
    and egrep tells me
    09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
    0
    57
    113
    170
    229
    285
    341
    396
    452
    513
    574
    636
    648
    649
    710
    767
    832
    896
    957
    1019
    1085
    1152
    1153
    1213
    1277
    1340
    1401
    1464
    1530
    1594
    1660
    1724
    1788
    1850
    1913
    1925
    1926
    2478
    2479
    09:52 $
    The first 12 lines of my test file are
    09:55 $ head -12 lorem_ipsum.txt
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Phasellus ultricies, risus sed consectetur mattis, orci
    leo eleifend nisl, quis lobortis urna enim at diam. Duis
    placerat ac orci ut cursus. Morbi commodo purus et dapibus
    lobortis. Maecenas at ante lectus. Duis semper magna in
    nisi accumsan pharetra. Mauris porttitor lorem erat, ac
    condimentum quam faucibus dictum. Cras et tortor orci.
    Quisque fringilla porttitor semper. Nunc imperdiet enim
    est, tristique maximus nunc convallis sagittis. Pellentesque
    cursus odio elit, ac viverra tortor varius quis. In bibendum
    viverra turpis, ut eleifend lectus malesuada at. Vivamus quis
    orci nulla.
    09:55 $
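
    The two listings above can also be compared mechanically rather than
    by eye. A small sketch, assuming each listing has been saved to a file
    with one offset per line; the function name and the file names in the
    usage note are placeholders, not anything used in the thread.

```shell
# missing_offsets CANDIDATE REFERENCE  (hypothetical helper)
# Prints every offset that appears in REFERENCE but not in CANDIDATE.
missing_offsets() {
    awk '
        FILENAME == ARGV[1] { seen[$1] = 1; next }  # first file: remember offsets
        !($1 in seen)                               # second file: print unseen ones
    ' "$1" "$2"
}
```

    With the script output saved as ksh.out and the egrep output as
    grep.out, "missing_offsets ksh.out grep.out" would list the twelve
    offsets from 57 through 648, i.e. every line start before the one
    that follows the first blank line.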


    --
    Lew Pitcher
    "In Skills We Trust"

  • From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:18:59 2023
    On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:

    On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

    [...]

    Awwwwww fsck!

    I copied the wrong script. Your followup noted that this version
    had problems.

    I'll retry with the correct script.

    Sorry to have been a nuisance :-(

    Retesting with the /correct/ script shows that it duplicated
    the results of my egrep pipe. It looks like this script is a
    winner.

    Thanks for the education; I learned something new today. :-)
    --
    Lew Pitcher
    "In Skills We Trust"

  • From Kenny McCormack@21:1/5 to janis_papanagnou+ng@hotmail.com on Mon Apr 3 14:28:19 2023
    In article <u0edfh$2utkp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 03.04.2023 13:03, Janis Papanagnou wrote:
    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.
    [...]

    Don't use the second variant (the one using 3<#$'\n' ), it is
    *not* running reliably, as I noticed with more tests! The first
    one (using read -u3 ) runs just fine, though.

    I don't know what the overall goal is, but wouldn't this be easier:

    $ awk '{ print tot+0;tot += length + 1 }' file

    Note that this works fine in Unix, because in Unix bytes are bytes.
    It might need updating to work correctly under DOS/Windows. Or VMS...

    Or z/OS...

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/ModernXtian
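
    One caveat with the awk variant: in a multibyte locale such as UTF-8,
    some awk implementations (gawk, for one) make length count characters
    rather than bytes, so the offsets drift after the first non-ASCII
    character. A sketch that forces the C locale for the awk call; the
    wrapper name is invented for illustration.

```shell
# awk_offsets FILE  (hypothetical wrapper around the one-liner above)
# LC_ALL=C makes length count bytes instead of multibyte characters.
awk_offsets() {
    LC_ALL=C awk '{ print tot + 0; tot += length($0) + 1 }' "$1"
}
```

    Because the line terminator is still taken to be a single LF, a CR in
    front of it is simply counted as part of the line, so CRLF files should
    come out right as well; files with bare CR line ends would not.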

  • From Lew Pitcher@21:1/5 to Lew Pitcher on Mon Apr 3 14:16:05 2023
    On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

    [...]

    Awwwwww fsck!

    I copied the wrong script. Your followup noted that this version
    had problems.

    I'll retry with the correct script.

    Sorry to have been a nuisance :-(
    --
    Lew Pitcher
    "In Skills We Trust"

  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Mon Apr 3 19:36:59 2023
    On 03.04.2023 16:28, Kenny McCormack wrote:
    In article <u0edfh$2utkp$1@dont-email.me>,
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 03.04.2023 13:03, Janis Papanagnou wrote:
    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.
    [...]

    Don't use the second variant (the one using 3<#$'\n' ), it is
    *not* running reliably, as I noticed with more tests! The first
    one (using read -u3 ) runs just fine, though.

    I don't know what the overall goal is,

    The goal was to create an index file.[*]

    but wouldn't this be easier:

    $ awk '{ print tot+0;tot += length + 1 }' file

    Note that this works fine in Unix, because in Unix bytes are bytes.
    It might need updating to work correctly under DOS/Windows. Or VMS...

    Or z/OS...

    Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
    variant; I like it for its readability.

    Janis

    [*] The task actually is (in a Javascript context) to use a low-level
    Javascript file access function without loading the whole (huge) file
    into memory; this function was suggested to me, and it requires such
    an index that I have to create beforehand. (I thought there'd be some
    standard tool for such a (standard-?) task available.)
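
    For readers wondering how such an index gets used: with the offset
    list in hand, a single line can be fetched by skipping straight to its
    first byte instead of reading everything before it. A rough shell
    sketch of that lookup (the JavaScript function mentioned above is not
    shown in the thread, so this is only an analogue); get_line and its
    arguments are illustrative names.

```shell
# get_line FILE INDEX N  (hypothetical helper)
# Prints line N (1-based) of FILE, using INDEX, a file that holds one
# byte offset per line as produced by the scripts in this thread.
get_line() {
    off=$(sed -n "${3}p" "$2")          # N-th entry of the index
    [ -n "$off" ] || return 1           # the index has no such entry
    # tail -c +K starts output at byte K (1-based), hence off + 1.
    tail -c "+$((off + 1))" "$1" | head -n 1
}
```

    For example, "get_line data.txt data.idx 3" prints the third line of
    data.txt, given that data.idx was built from that same file.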

  • From Janis Papanagnou@21:1/5 to Lew Pitcher on Mon Apr 3 19:49:46 2023
    On 03.04.2023 16:18, Lew Pitcher wrote:
    On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:

    On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

    [...]

    Awwwwww fsck!

    I copied the wrong script. Your followup noted that this version
    had problems.

    I'll retry with the correct script.

    Sorry to have been a nuisance :-(

    You haven't been. I appreciate any tests and feedback.


    Retesting with the /correct/ script shows that it duplicated
    the results of my egrep pipe. It looks like this script is a
    winner.

    You seem to have been using the 3<#$'\n' based variant? - And it
    works? - Still not reliably in my environment. I'll have to examine
    that further.


    Thanks for the education; I learned something new today. :-)

    Thanks for your tests! (And for the overall confirmation.) I'm a bit
    reluctant when using Kornshell's newer "redirection" operators; they
    seem to not be reliable as I experienced [in my environment] in the
    past. (Maybe it's advisable to test and confirm that in Martijn's
    ksh93u+m, which generally seems to be much more reliable.)

    Janis

  • From Janis Papanagnou@21:1/5 to Lew Pitcher on Mon Apr 3 19:56:16 2023
    On 03.04.2023 16:16, Lew Pitcher wrote:
    On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:


    I copied the wrong script. Your followup noted that this version
    had problems.

    I'll retry with the correct script.

    Argh! - And I missed this post.

    So you made the same observation that I made. - Thanks!

    Janis

  • From Spiros Bousbouras@21:1/5 to Janis Papanagnou on Tue Apr 4 02:45:34 2023
    On Mon, 3 Apr 2023 19:36:59 +0200
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 03.04.2023 16:28, Kenny McCormack wrote:
    I don't know what the overall goal is,

    The goal was to create an index file.[*]

    but wouldn't this be easier:

    $ awk '{ print tot+0;tot += length + 1 }' file

    Note that this works fine in Unix, because in Unix bytes are bytes.
    It might need updating to work correctly under DOS/Windows. Or VMS...

    Or z/OS...

    Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
    variant; I like it for its readability.

    A generalisation is

    awk -v nob=$(echo | wc -c) '{ print tot+0 ; tot += length + nob }' file

    But I haven't tested it on a system where the newline sequence is
    different than the single LF byte. And there are (or used to be) operating systems
    where there is no notion of newline sequence and files are made of
    records where each record is a line.

    A different consideration : is it ok if the output is the same regardless
    of whether the input ends in a newline sequence or not ? With the above
    awk scripts it will be the same.

    --
    As someone once joked, "It's easier to prove the Riemann hypothesis than to get someone to read your proof!"
    http://empslocal.ex.ac.uk/people/staff/mrwatkin/zeta/RHproofs.htm
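
    On that last consideration: if the distinction matters, whether the
    final line is newline-terminated can be checked separately from the
    offset list. A small sketch using a common shell idiom, assuming a
    text file without NUL bytes; the function name is invented.

```shell
# ends_with_newline FILE  (hypothetical helper)
# Succeeds if FILE is empty or its last byte is a newline.
ends_with_newline() {
    # Command substitution strips a trailing newline, so for text files
    # the result is empty when the last byte is a newline (or when the
    # file has no bytes at all).
    [ -z "$(tail -c 1 "$1")" ]
}
```

    A caller could use it to decide whether to append the file size as one
    extra entry, for instance.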

  • From Janis Papanagnou@21:1/5 to Janis Papanagnou on Tue Apr 4 09:27:59 2023
    On 03.04.2023 13:03, Janis Papanagnou wrote:
    I just needed to determine the byte-offsets of all lines in a text file
    to create an index file.

    [...]

    #!/bin/ksh

    # byteoffset - create a byte-offset list for the lines in a given file
    #
    # Usage: byteoffset filename

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    echo $( 3<# )
    while read -u3
    do echo $( 3<# )
    done

    Just occurred to me; to shorten that a bit and avoid duplicate pieces
    of code...

    f=${1:?"Usage: ${0##*/} filename"}
    exec 3<"$f"

    while echo $( 3<# ) ; read -u3
    do :
    done


    [...]

    Janis
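
    The loop shape used here, with the echo placed in the condition list
    of the while so that it also runs once more before the failing read,
    carries over to shells that lack ksh93's $( 3<# ). A sketch in POSIX sh
    that does the byte counting itself; it assumes LF line ends and no NUL
    bytes, and the function name is made up.

```shell
# sh_offsets FILE  (hypothetical; POSIX sh analogue of the loop above)
sh_offsets() (
    LC_ALL=C            # make ${#line} count bytes, not characters
    n=0
    # The condition list runs both commands; only the exit status of
    # the last one (read) decides whether the loop continues.
    while printf '%s\n' "$n"; IFS= read -r line
    do n=$((n + ${#line} + 1))
    done < "$1"
)
```

    For a newline-terminated file this sketch prints one number more than
    there are lines; the last one is the file size. It is a pure read loop,
    so it will not be fast on large files.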

  • From Jalen Q@21:1/5 to Janis Papanagnou on Tue Apr 4 22:09:18 2023
    On Monday, April 3, 2023 at 6:03:13 AM UTC-5, Janis Papanagnou wrote:
    [...]

  • From Janis Papanagnou@21:1/5 to Lew Pitcher on Thu Apr 6 07:49:46 2023
    On 03.04.2023 16:18, Lew Pitcher wrote:

    Retesting with the /correct/ script shows that it duplicated
    the results of my egrep pipe. It looks like this script is a
    winner.

    Not really. The shell's read-loop is slow and the egrep/awk pipe
    seems to be a lot faster. As long as I cannot make the shell's
    pattern seek functional and reliable it makes sense to stay with
    the pipe. Wasn't aware of grep's '-b' option; thanks for that!

    Janis
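
    For reference, the pipe can be written without egrep (which newer GNU
    grep releases flag as obsolescent) and without the awk stage. A sketch,
    assuming a grep that supports -b (GNU and BSD grep do; the option is
    not in POSIX) and a plain text input; the function name is illustrative.

```shell
# grep_offsets FILE  (hypothetical wrapper)
# -b prefixes each output line with its byte offset, the empty pattern
# matches every line, and cut keeps only the number in front of the
# first colon.
grep_offsets() {
    LC_ALL=C grep -b '' "$1" | cut -d: -f1
}
```

    This prints one offset per line of input and no extra entry for the
    end of the file.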
