• Pipe cleanup of text - help needed

    From Java Jive@21:1/5 to All on Sun Aug 1 18:02:49 2021
    XPost: alt.os.linux

    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Desired:

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook - End 5.png
    Unknown Person's Notebook - Insert 00a.png
    Unknown Person's Notebook - Insert 00b.png
    Unknown Person's Notebook - Insert 01.png
    Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
    Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
    Unknown Person's Notebook - Insert 03 - 17800208.png
    Unknown Person's Notebook - Insert 04.png
    Unknown Person's Notebook - Insert 05.png
    Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
    Unknown Person's Notebook - Insert 07a.png
    Unknown Person's Notebook - Insert 07b.png
    Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
    Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
    Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 10.png
    Unknown Person's Notebook - Insert 11a.png
    Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 12.png
    Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
    Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
    Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
    ...
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
    Unknown Person's Notebook.txt

    Original output for ls -1pR <etc>

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    Unknown Person's Notebook - 02.png
    Unknown Person's Notebook - 03.png
    Unknown Person's Notebook - 04.png
    Unknown Person's Notebook - 05.png
    Unknown Person's Notebook - 06.png
    Unknown Person's Notebook - 07.png
    Unknown Person's Notebook - 08.png
    Unknown Person's Notebook - 09.png
    Unknown Person's Notebook - 10.png
    Unknown Person's Notebook - 11.png
    Unknown Person's Notebook - 12.png
    Unknown Person's Notebook - 13.png
    Unknown Person's Notebook - 14.png
    Unknown Person's Notebook - 15.png
    Unknown Person's Notebook - 16.png
    Unknown Person's Notebook - 17.png
    Unknown Person's Notebook - 18.png
    Unknown Person's Notebook - 19.png
    Unknown Person's Notebook - 20.png
    Unknown Person's Notebook - 21.png
    Unknown Person's Notebook - 22.png
    Unknown Person's Notebook - 23.png
    Unknown Person's Notebook - 24.png
    Unknown Person's Notebook - 25.png
    Unknown Person's Notebook - 26.png
    Unknown Person's Notebook - 27.png
    Unknown Person's Notebook - 28.png
    Unknown Person's Notebook - 29.png
    Unknown Person's Notebook - 30.png
    Unknown Person's Notebook - 31.png
    Unknown Person's Notebook - 32.png
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    Unknown Person's Notebook - End 1.png
    Unknown Person's Notebook - End 2.png
    Unknown Person's Notebook - End 3.png
    Unknown Person's Notebook - End 4.png
    Unknown Person's Notebook - End 5.png
    Unknown Person's Notebook - Insert 00a.png
    Unknown Person's Notebook - Insert 00b.png
    Unknown Person's Notebook - Insert 01.png
    Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
    Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
    Unknown Person's Notebook - Insert 03 - 17800208.png
    Unknown Person's Notebook - Insert 04.png
    Unknown Person's Notebook - Insert 05.png
    Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
    Unknown Person's Notebook - Insert 07a.png
    Unknown Person's Notebook - Insert 07b.png
    Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
    Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
    Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 10.png
    Unknown Person's Notebook - Insert 11a.png
    Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 12.png
    Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
    Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
    Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 2.png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 3.png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 4.png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 5.png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
    Unknown Person's Notebook.txt

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Richard Kettlewell@21:1/5 to Java Jive on Sun Aug 1 18:26:05 2021
    Java Jive <java@evij.com.invalid> writes:
    On 01/08/2021 17:02, Java Jive wrote:
    < snip >

    WTF is Java.Jive@f1.n221.z2.fidonet.fi and why is he duplicating my
    posts here?

    Someone is running a broken fido/usenet gateway.

    Injection-Info: gioia.aioe.org; logging-data="10598"; posting-host="F7FIqN6dkowTZ1CLxZIWTQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";

    They’re injecting via aioe.org, so I guess complaints to there.

    --
    https://www.greenend.org.uk/rjk/

  • From Grant Taylor@21:1/5 to Java Jive on Sun Aug 1 11:57:40 2021
    XPost: alt.os.linux

    On 8/1/21 11:02 AM, Java Jive wrote:
    I want to clean this up so that only the first and last of each
    section are output, separated by a single line containing just '...'.

    It's not just the /first/ and /last/ line of a group. There also seems to
    be a minimum group size. E.g. "Genealogy Of Job" is only two lines, but
    you aren't inserting "..." between the first and last member of the group.

    Can anyone suggest a way of doing this by piping the output through
    awk or sed on the fly, rather than having to write a program to
    post-process the index?

    I don't see a way to do this in the 90 seconds that I've looked at it.
    However I do see a thread that might be worth pulling at. Maybe someone
    else, perhaps the OP, will see the next step.

    I would be inclined to drop the last item (term?) from the base file name,
    with the intention of turning this:

    Unknown Person's Notebook - End 0
    Unknown Person's Notebook - End 1
    Unknown Person's Notebook - End 2
    Unknown Person's Notebook - End 3
    Unknown Person's Notebook - End 4
    Unknown Person's Notebook - End 5

    Into this:

    Unknown Person's Notebook - End
    Unknown Person's Notebook - End
    Unknown Person's Notebook - End
    Unknown Person's Notebook - End
    Unknown Person's Notebook - End
    Unknown Person's Notebook - End

    This seems like something you could run through uniq (-c) as a start at
    finding the "duplicate" / incremental parts, i.e. the bases of the file
    names.

    You could probably use that as information to drive a decision to
    truncate the output or not.
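
    A minimal sketch of that uniq idea, assuming one file name per line and
    that the varying part is the number directly before the extension. The
    extension is kept in the base so that differently typed files are not
    merged. The function name count_bases is made up for illustration:

```shell
# Hypothetical helper: drop the number that sits directly before the
# extension, then let uniq -c count adjacent identical bases.
count_bases() {
    sed -E 's/[0-9]+(\.[^.]*)$/\1/' | uniq -c
}

printf '%s\n' \
    "Unknown Person's Notebook - End 0.png" \
    "Unknown Person's Notebook - End 1.png" \
    "Unknown Person's Notebook - End 2.png" \
    "Unknown Person's Notebook.txt" | count_bases
```

    This reports a count of 3 for the "End" base and 1 for the .txt file,
    which is the information needed to decide whether a group is worth
    abbreviating.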

    I feel like this may need multiple passes through the input; one to
    identify when things need to be abbreviated / truncated and another as
    the source of the data to be abbreviated / truncated or not. This means
    that it's not exactly conducive to a typical STDIN -> STDOUT like filter.

    The next thing to think about is trying to leverage sed's hold space and
    doing a comparison of the current line to the hold space. -- I don't
    do this often enough to know how to do this. But, this probably does
    have the advantage of being able to do this in a single pass.

    Seeing as how this plays on comparing adjacent lines of text, it will
    almost certainly be predicated on the list being sorted.
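
    As an illustration of the hold-space mechanics only (not a full
    solution), here is a small sketch that carries the previous line into the
    next cycle and prints each line next to its predecessor. The name
    pair_lines is made up, and sorted input with one name per line is assumed:

```shell
pair_lines() {
    # Line 1: copy it to the hold space and start the next cycle.
    # Later lines: x swaps in the previous line, G appends the current one,
    # and the s command joins the pair and prints it.
    sed -n -e '1{h;d;}' -e 'x;G;s/\n/ -> /p'
}

printf '%s\n' 'End 0.png' 'End 1.png' 'End 2.png' | pair_lines
```

    A real filter would replace the final s command with a test on whether
    the two halves of the pair belong to the same series.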

    However, you can't blindly strip off the file extension (and last part
    of the name). Lest you combine file-1.png, file-2.jpg, and file-3.gif.

    You really seem to be talking about something that can dynamically allow
    for one element in a series of file (base) names to differ and
    conditionally truncate them. But you don't want to truncate
    file-1.{png,jpg,gif}, where the base name is the same and the extension
    is the only part that differs.

    This seems like a non-trivial problem for simple text parsing.



    --
    Grant. . . .
    unix || die

  • From Java Jive@21:1/5 to Java Jive on Sun Aug 1 18:20:17 2021
    On 01/08/2021 17:02, Java Jive wrote:
    < snip >

    WTF is Java.Jive@f1.n221.z2.fidonet.fi and why is he duplicating my
    posts here?

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

  • From Paul@21:1/5 to Java Jive on Sun Aug 1 16:16:23 2021
    XPost: alt.os.linux

    Java Jive wrote:
    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Desired:

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook - End 5.png

    Awk (Gawk) has the ability to store things in arrays.

    For example, in awk, I can reverse the order of lines in
    a text file. A file with lines 1..10 can be emitted in
    order 10..1. This requires the usage of an array in memory,
    which grows as the file (or piped input) is acquired, then
    the memory array is dumped in the END() clause of the program.
    In such a situation, a 10GB text file cannot be processed
    by a 2GB RAM machine. "A person has to know their limits."
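
    A minimal sketch of that array technique, with a made-up function name
    reverse_lines. Every line is stored under its line number and nothing is
    printed until the END block, which is why the whole input must fit in
    memory:

```shell
reverse_lines() {
    awk '{ line[NR] = $0 }    # store each input line under its line number
         END { for (i = NR; i >= 1; i--) print line[i] }'
}

printf '%s\n' 1 2 3 | reverse_lines
```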

    We might also have to decide what to do about

    Unknown Person's Notebook - 1.png
    ...
    Unknown Person's Notebook - 33.png

    or the multiple iterator case (which is "easy" from
    a sorting perspective, but how do we know which
    iterator is the least significant one). Maybe the
    controlling iterator is the one on the right.

    Unknown 01 Person's Notebook - 01.png
    Unknown 01 Person's Notebook - 02.png
    Unknown 02 Person's Notebook - 01.png
    Unknown 02 Person's Notebook - 02.png
    Unknown 03 Person's Notebook - 01.png
    Unknown 03 Person's Notebook - 02.png

    output:

    Unknown 01 Person's Notebook - 01.png Group 01
    ...
    Unknown 03 Person's Notebook - 01.png
    Unknown 01 Person's Notebook - 02.png Group 02
    ...
    Unknown 03 Person's Notebook - 02.png

    You could scan for digits from the right, and
    assume the operator is logically minded. Or something.
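
    One way to scan from the right in awk, sketched under the assumption that
    the extension itself contains no digits (a name ending in .mp3 would fool
    it). The function name split_counter and the "|" separators are made up
    for illustration:

```shell
split_counter() {
    awk '{
        # The leftmost match of "digits, then non-digits up to the end"
        # starts at the right-most run of digits in the line.
        if (match($0, /[0-9]+[^0-9]*$/)) {
            head = substr($0, 1, RSTART - 1)
            tail = substr($0, RSTART)
            match(tail, /^[0-9]+/)
            print head "|" substr(tail, 1, RLENGTH) "|" substr(tail, RLENGTH + 1)
        } else {
            print $0 "||"    # no digits at all: empty counter and suffix
        }
    }'
}

printf '%s\n' "Unknown 01 Person's Notebook - 02.png" | split_counter
```

    The three fields are the text before the controlling iterator, the
    iterator itself, and the remainder, so two names belong to the same
    series when their first and third fields agree.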

    The version of Gawk I traditionally use, only knows
    ASCII. I don't know what the latest evolution is, in terms
    of, say, UTF-8. Part of the problem, is the notion of
    a character being one byte wide, and what does the
    Gawk program do when the characters are variable width.
    One side effect, is the runtime could be considerably
    slower. Or, the memory array representation could be
    "very inefficient" and four times larger than normal.
    Sorta like how some image editing programs now use
    absurdly wide internal representations.

    The first part of any program, is "a complete specification".
    The effort to write the program goes up exponentially,
    if the program specification is "dribbling in". For example,
    one of my attempts to tame some ls -R output, ran into
    character set problems. And my solution at the time, was
    to delete the offending files ("save as web page complete"
    was the source of the bad file names).

    Awk can store the entire input in memory, if you want it to.

    *******

    I'll offer these two.

    find /media/FOREIGN -type d -exec ls -al -1 -d {} + > dirlist.txt
    find /media/FOREIGN -type f -exec ls -al -1 {} + > filelist.txt

    The "dirlist" is a succinct summary, with less detail
    than you would like.

    But it also didn't require writing a program.

    Paul

  • From Martin Gregorie@21:1/5 to Java Jive on Sun Aug 1 20:50:52 2021
    On Sun, 01 Aug 2021 18:02:49 +0200, Java Jive wrote:

    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Desired:

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook - End 5.png
    Unknown Person's Notebook - Insert 00a.png
    Unknown Person's Notebook - Insert 00b.png
    Unknown Person's Notebook - Insert 01.png
    Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
    Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
    Unknown Person's Notebook - Insert 03 - 17800208.png
    Unknown Person's Notebook - Insert 04.png
    Unknown Person's Notebook - Insert 05.png
    Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
    Unknown Person's Notebook - Insert 07a.png
    Unknown Person's Notebook - Insert 07b.png
    Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
    Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
    Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 10.png
    Unknown Person's Notebook - Insert 11a.png
    Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 12.png
    Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
    Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
    Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
    ...
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
    Unknown Person's Notebook.txt

    Original output for ls -1pR <etc>

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    Unknown Person's Notebook - 02.png
    Unknown Person's Notebook - 03.png
    Unknown Person's Notebook - 04.png
    Unknown Person's Notebook - 05.png
    Unknown Person's Notebook - 06.png


    I generally handle this by removing the (first) part of the filename that matches the directory that contains that group of files, but so far,
    haven't needed to auto-trim any filenames.

    ---
    I initially wrote a fairly short PHP script that built HTML index pages
    on the fly as I navigated through the archive.

    This worked fairly well until I decided that the best index for an image
    file was a thumbnail image captioned with the image filename. Doing all
    this in PHP and letting it call ImageMagick to generate the thumbnails on
    the fly when the directory was entered worked, but was tooth-achingly
    slow if there were more than one or two image files in the directory.

    At this point, I rewrote the indexing and thumbnail generating code in
    Java and now run it as an automatic overnight process. Unless I've slung a
    big bunch of new photos into the archive it typically runs in under 5
    minutes each night. It could be quicker if it didn't rescan the entire
    archive looking for added, deleted and changed image files and
    generating/deleting/regenerating thumbnails as needed. It uses date
    comparison to recognise thumbnails that need regeneration: if the original
    image is more recent than the th
  • From Java Jive@21:1/5 to Richard Kettlewell on Sun Aug 1 22:44:50 2021
    On 01/08/2021 18:26, Richard Kettlewell wrote:
    Java Jive <java@evij.com.invalid> writes:
    On 01/08/2021 17:02, Java Jive wrote:
    < snip >

    WTF is Java.Jive@f1.n221.z2.fidonet.fi and why is he duplicating my
    posts here?

    Someone is running a broken fido/usenet gateway.

    Injection-Info: gioia.aioe.org; logging-data="10598"; posting-host="F7FIqN6dkowTZ1CLxZIWTQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";

    They’re injecting via aioe.org, so I guess complaints to there.

    Done, thanks for your explanation, we'll have to wait and see what the
    result of the complaint is.

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

  • From Java Jive@21:1/5 to Java Jive on Sun Aug 1 23:16:21 2021
    XPost: alt.os.linux

    On 01/08/2021 18:02, Java Jive wrote:

    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Desired:

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook - End 5.png
    Unknown Person's Notebook - Insert 00a.png
    Unknown Person's Notebook - Insert 00b.png
    Unknown Person's Notebook - Insert 01.png
    Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India.png
    Unknown Person's Notebook - Insert 02b - Sketch Of Monument & Outcrop, Dekklan, India.png
    Unknown Person's Notebook - Insert 03 - 17800208.png
    Unknown Person's Notebook - Insert 04.png
    Unknown Person's Notebook - Insert 05.png
    Unknown Person's Notebook - Insert 06 - Sketch Of Crocodile.png
    Unknown Person's Notebook - Insert 07a.png
    Unknown Person's Notebook - Insert 07b.png
    Unknown Person's Notebook - Insert 08 - Sketch Of Boat.png
    Unknown Person's Notebook - Insert 09a - Sketch Of Building.png
    Unknown Person's Notebook - Insert 09b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 10.png
    Unknown Person's Notebook - Insert 11a.png
    Unknown Person's Notebook - Insert 11b - Fragment Of Writing.png
    Unknown Person's Notebook - Insert 12.png
    Unknown Person's Notebook - Insert 13 - Sketch Of Bird.png
    Unknown Person's Notebook - Insert 14a - Sketch Of Ancient Ruins.png
    Unknown Person's Notebook - Insert 14b - Sketch Of Ancient Building (partly completed).png
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 1.png
    ...
    Unknown Person's Notebook - Insert 15a - 'La Poèsie didactique des Hébreu' - 6.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 1.png
    Unknown Person's Notebook - Insert 15b - Genealogy Of Job - 2.png
    Unknown Person's Notebook.txt

    Thanks Grant & Paul. To clarify:

    There's no point in putting '...' in between Genealogy Of Job - 1 & 2
    because there's nothing missing and it would make the index longer, not
    shorter. The minimum series length that it's worthwhile for is 3.

    There's only ever one iterator in operation at a time, and it's always
    the last number in the filename.

    So how would I truncate the current line in awk or sed, $0 in the
    former, and hold it for comparison to the following lines until there's
    a mismatch? I've used sed for very simple s/pattern/replace/ type
    operations, but its inner workings are something of a mystery. I've
    only ever done the simplest things in awk.

    I can see exactly how I would write a shell program to do this with the
    input read from a file dump of ls output, but I can't help feeling there
    must be a better way of doing it on the fly.
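
    One possible single-pass sketch in awk, under the rules stated above: the
    iterator is the number directly before the extension, a series is only
    abbreviated when it has at least 3 members, and everything else (directory
    headers, blank lines, names such as "Insert 00a.png") passes through
    unchanged. It assumes sorted input with one name per line and does not
    check that the numbers are consecutive. The name collapse_runs and the "#"
    placeholder are made up for illustration:

```shell
collapse_runs() {
    awk '
    # Print the buffered series: first, "...", last for 3 or more members,
    # otherwise the one or two members unchanged.
    function emit_run() {
        if (n >= 3) {
            print first; print "..."; print last
        } else if (n == 2) {
            print first; print last
        } else if (n == 1) {
            print first
        }
        n = 0
    }
    {
        # A series member ends in digits directly followed by the extension.
        # Its key is the line with those digits replaced by "#".
        if (match($0, /[0-9]+\.[^.]*$/)) {
            ext = substr($0, RSTART)
            sub(/^[0-9]+/, "", ext)
            key = substr($0, 1, RSTART - 1) "#" ext
        } else {
            key = ""
        }
        # A new key, or a line that is not part of any series, ends the run.
        if (key == "" || key != prev) emit_run()
        prev = key
        if (key == "") { print; next }
        if (n == 0) first = $0
        last = $0
        n++
    }
    END { emit_run() }'
}

printf '%s\n' 'Dir:' 'A - 01.png' 'A - 02.png' 'A - 03.png' \
    'B 1.png' 'B 2.png' 'A.txt' | collapse_runs
```

    Used on the fly this would be something like
    ls -1pR Family\ History/Unknown | collapse_runs, with only the first and
    last line of each run ever held in memory.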

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

  • From William Unruh@21:1/5 to Paul on Sun Aug 1 22:28:25 2021
    XPost: alt.os.linux

    On 2021-08-01, Paul <nospam@needed.invalid> wrote:
    Java Jive wrote:
    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Each section means what?

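    If you just want the first and last:

    awk '{ T = $0; if (NR == 1) print T }   # remember each line, print the first
    END  { if (NR > 2) print "..."          # mark the omitted middle
           if (NR > 1) print T }'           # print the last line once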



  • From Java Jive@21:1/5 to William Unruh on Mon Aug 2 00:14:24 2021
    XPost: alt.os.linux

    On 01/08/2021 23:28, William Unruh wrote:
    On 2021-08-01, Paul <nospam@needed.invalid> wrote:
    Java Jive wrote:
    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Each section means what?

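    If you just want the first and last:

    awk '{ T = $0; if (NR == 1) print T }   # remember each line, print the first
    END  { if (NR > 2) print "..."          # mark the omitted middle
           if (NR > 1) print T }'           # print the last line once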

    Yes, that gives me the first and last line in each directory listing, as follows ...

    ~ # ls -1pR Family\ History/Unknown | awk '{T=$0; if (NR==1) print T};
    END {if (NR > 2) print "..."; print T}'
    Family History/Unknown:
    ...
    Unknown Person's Notebook.txt

    ... which is a start and may help me by example work out the sort of
    thing I want, for which many thanks. However, what I was really after
    was the truncation of long lists of essentially the same filename where
    only the page number at the end varies, to give something like this ...

    Family History/Unknown:
    Blah-blah 01
    ...
    Blah-blah 55
    Widgetry 1
    ...
    Widgetry 6

    ... etc, so obviously some sort of comparison between the current line
    and the previous line, or the first of the current series of similar
    lines, is required.

    It's late here in the UK, so I'm off to bed now, but I'll have a closer
    look at your example to-morrow and see if I can extend it to do what I want.

    Thanks again.

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

  • From Paul@21:1/5 to Java Jive on Sun Aug 1 21:16:23 2021
    Java Jive wrote:
    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Desired:

    Family History/Unknown/Unknown Person's Notebook:
    Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook - 33.png
    Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook - End 5.png

    Awk (Gawk) has the ability to store things in arrays.

    For example, in awk, I can reverse the order of lines in
    a text file. A file with lines 1..10 can be emitted in
    order 10..1. This requires the usage of an array in memory,
    which grows as the file (or piped input) is acquired, then
    the memory array is dumped in the END block of the program.
    In such a situation, a 10GB text file cannot be processed
    by a 2GB RAM machine. "A person has to know their limits."
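
    As a concrete illustration of the array approach, a minimal sketch of
    the line reversal (the function name reverse_lines is just a label for
    this example). Every line is stored under its line number and nothing
    is printed until the END block runs, which is why memory use grows
    with the input.

```shell
# reverse_lines: emit stdin last line first; the whole input is held in memory.
reverse_lines() {
    awk '
        { line[NR] = $0 }                                 # store every line by number
        END { for (i = NR; i >= 1; i--) print line[i] }   # dump the array backwards
    '
}
```

    Run as: seq 1 10 | reverse_lines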

    We might also have to decide what to do about

    Unknown Person's Notebook - 1.png
    ...
    Unknown Person's Notebook - 33.png

    or the multiple iterator case (which is "easy" from
    a sorting perspective, but how do we know which
    iterator is the least significant one). Maybe the
    controlling iterator is the one on the right.

    Unknown 01 Person's Notebook - 01.png
    Unknown 01 Person's Notebook - 02.png
    Unknown 02 Person's Notebook - 01.png
    Unknown 02 Person's Notebook - 02.png
    Unknown 03 Person's Notebook - 01.png
    Unknown 03 Person's Notebook - 02.png

    output:

    Unknown 01 Person's Notebook - 01.png Group 01
    ...
    Unknown 03 Person's Notebook - 01.png
    Unknown 01 Person's Notebook - 02.png Group 02
    ...
    Unknown 03 Person's Notebook - 02.png

    You could scan for digits from the right, and
    assume the operator is logically minded. Or something.

    The version of Gawk I traditionally use, only knows
    ASCII. I don't know what the latest evolution is, in terms
    of, say, UTF-8. Part of the problem, is the notion of
    a character being one byte wide, and what does the
    Gawk program do when the characters are variable width.
    One side effect, is the runtime could be considerably
    slower. Or, the memory array representation could be
    "very inefficient" and four times larger than normal.
    Sorta like how some image editing programs now use
    absurdly wide internal representations.

    The first part of any program, is "a complete specification".
    The effort to write the program goes up exponentially,
    if the program specification is "dribbling in". For example,
    one of my attempts to tame some ls -R output, ran into
    character set problems. And my solution at the time, was
    to delete the offending files ("save as web page complete"
    was the source of the bad file names).

    Awk can store the entire input in memory, if you want it to.

    *******

    I'll offer these two.

    find /media/FOREIGN -type d -exec ls -al -1 -d {} + > dirlist.txt
    find /media/FOREIGN -type f -exec ls -al -1 {} + > filelist.txt

    The "dirlist" is a succinct summary, with less detail
    than you would like.

    But it also didn't require writing a program.

    Paul

  • From Martin Gregorie@21:1/5 to Java Jive on Sun Aug 1 21:50:52 2021
    On Sun, 01 Aug 2021 18:02:49 +0200, Java Jive wrote:

    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    I generally handle this by removing the (first) part of the filename
    that matches the directory that contains that group of files, but so
    far, haven't needed to auto-trim any filenames.

    - ---
    I initially wrote a fairly short PHP script that built HTML index pages
    on the fly as I navigated through the archive.

    This worked fairly well until I decided that the best index for an image
    file was a thumbnail image captioned with the image filename. Doing all
    this in PHP and letting it call ImageMagick to generate the thumbnails on
    the fly when the directory was entered worked, but was tooth-achingly
    slow if there were more than one or two image files in the directory.

    At this point, I rewrote the indexing and thumbnail generating code in
    Java and run it as an automatic overnight process. Unless I've slung a
    big bunch of new photos into the archive it typically runs in under 5
    minutes each night. It could be quicker if it didn't rescan the entire
    archive looking for added, deleted and changed image files and
    generating/deleting/regenerating thumbnails as needed. It uses date
    comparison to recognise thumbnails that need regeneration: if the
    original image is more recent than the thumbnail, the latter gets
    regenerated.

    Overall I'm very pleased with the result. Index pages are created on the
    fly as requested using PHP but, now that it doesn't need to generate
    thumbnail images, it's fast enough to not annoy me when looking for
    stuff in the archive.

    Occasionally the overnight process generates a duff thumbnail which
    requires manual reconstruction with GIMP. That's because it's designed to
    work on .JPG files, so meeting a PNG or GIF image, or the non-standard
    JPG images from some NASA sources, can create a duff thumbnail: the Java
    image conversion library doesn't auto-recognise image formats at all well
    and it trips over JPG images that don't exactly follow the JPEG
    specification.


    --
    Martin | martin at
    Gregorie | gregorie dot org

  • From William Unruh@21:1/5 to Paul on Sun Aug 1 23:28:25 2021
    On 2021-08-01, Paul <nospam@needed.invalid> wrote:
    Java Jive wrote:
    I have an archive of scanned documents which I need to index. A typical
    sample output of ls is appended. I want to clean this up so that only
    the first and last of each section are output, separated by a single
    line containing just '...'. Can anyone suggest a way of doing this by
    piping the output through awk or sed on the fly, rather than having to
    write a program to post-process the index?

    Each section means what?

    If you just want the first and last:

    awk ' { T = $0; if (NR == 1) print T }
    END { if (NR > 2) print "..."; if (NR > 1) print T } '



  • From Paul@21:1/5 to Java Jive on Mon Aug 2 05:00:58 2021
    XPost: alt.os.linux

    Java Jive wrote:
    On 01/08/2021 18:02, Java Jive wrote:

    I have an archive of scanned documents which I need to index. A
    typical sample output of ls is appended. I want to clean this up so
    that only the first and last of each section are output, separated by
    a single line containing just '...'. Can anyone suggest a way of
    doing this by piping the output through awk or sed on the fly, rather
    than having to write a program to post-process the index?

    Thanks Grant & Paul. To clarify:

    There's no point in putting '...' in between Genealogy Of Job - 1 & 2
    because there's nothing missing and it would make the index longer, not
    shorter. The minimum series length that it's worthwhile for is 3.

    There's only ever one iterator in operation at a time, and it's always
    the last number in the filename.

    So how would I truncate the current line in awk or sed, $0 in the
    former, and hold it for comparison to the following lines until there's
    a mismatch? I've used sed for very simple s/pattern/replace/ type
    operations, but its inner workings are something of a mystery. I've
    only ever done the simplest things in awk.

    I can see exactly how I would write a shell program to do this with the
    input read from a file dump of ls output, but I can't help feeling there
    must be a better way of doing it on the fly.


    Not thoroughly tested. Will show some awk syntax, no guarantees
    it meets the specs :-)

    ********************************* redund.awk ******************************
    # howtorun
    # gawk -f redund.awk inputfile.txt > outputfile.txt
    # ls-like-program-piped-to | gawk -f redund.awk

    # I usually put data samples inline like this, so I can stare at
    # them while writing snippets for stuff.

    # Unknown Person's Notebook - 01.png
    # Unknown Person's Notebook - 02.png
    # Unknown Person's Notebook - 03.png
    # Unknown Person's Notebook - End 0.png
    # Unknown Person's Notebook - End 1.png
    # Unknown Person's Notebook - End 2.png
    # Unknown Person's Notebook - Insert 00a.png
    # Unknown Person's Notebook - Insert 00b.png
    # Unknown Person's Notebook - Insert 01.png
    # Unknown Person's Notebook - Insert 02a - Sketch Of Monument, Dekklan, India, 30451.png
    # 000001
    # 000002
    # 000003
    # 000004.png.jpg # not compressible with the others

    # Test some commands first, copy stuff from Internet, etc
    #
    # gawk '{match($0,/[0-9]{6}/);print substr($0,RSTART,RLENGTH)}' Input_file

    # gawk "{match($0,/[0-9]/);print substr($0,RSTART,RLENGTH)}" Works for the first digit only

    # gawk "{match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)}" Works to detect first instance

    # gawk "{match($0,/[[:digit:]]+/,arr);print arr[1] " " arr[2]}" Probably gawk5 only, cannot use

    # for(i=length($0);i>0;i--) x=x substr($0,i,1); A way to reverse a string, not needed

    BEGIN {
      FS = "."     # peel off extension, if present, using $0 processing
      oldok = 0
      oldroot = "" # not a problem with oldok false
    }

    { # check the end of $1 for digits
      # match($0,/[[:digit:]]+/);print substr($0,RSTART,RLENGTH)
      # By using a field separator of ".", we must disqualify NF>2 cases like 000004.png.jpg
      # split() can be used instead of FS and $0 processing, for more general programming solutions
      # Since the field separator is used very little in this program, it can be "wasted" like this.

      ok = (NF<=2)   # boolean for string with compressible digits on end, initial determination

      match($1, /[[:digit:]]+/ )  # side effect... sets RSTART RLENGTH
      ok = (RSTART+RLENGTH-1 == length($1)) && ok  # ok true is equal to 1, false is 0
      root = substr($1,1,RSTART-1)                 # Empty string for filename "000001"
      # print ok " " $1 " \"" root "\""            # the usual debug statement

      if ( root == oldroot && ok == 1 && oldok == 1) {
         cntr++
      }

      if ( root != oldroot || ok == 0) { # new assignment
         # Check processing of stuff in buffer
         if (oldok == 1) {
            if (cntr > 2) {
               print "..."
            }
            if (cntr > 1) {
               print oldstr
            }
            # cntr = 1 has already been printed
         }
         cntr = 1
         print $0  # opening stanza of a potential compression
      }

      # bookkeeping
      oldroot = root
      oldok = ok
      oldstr = $0

      # When I make doodles like the following in the source, it means I'm
      # struggling with the if-then-else order and making the code
      # as succinct as possible. This table started me out on the
      # wrong leg, and it took a second try to make a better if-then-else

      # root  ok   oldroot  oldok  cntr oldstr
      # xxx        yyy                          dump previous if oldok,cntr,oldstr, define cntr = 1, print opening line
      # xxx   1    xxx      1                   increment cntr

    }
    ********************************* end redund.awk ******************************
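
    One thing the listing above appears to leave out is an END block, so a
    run that is still open when the input ends never gets its '...' and
    closing line. Below is a compact, self-contained sketch of the same
    buffer-and-flush idea with that flush added. It is not the listing
    itself: it keys on the name with its trailing number masked instead of
    using FS = ".", and the names compress_runs, flush_run and key are
    made up for this example.

```shell
# compress_runs: collapse runs of names that differ only in a trailing number.
compress_runs() {
    awk '
        # Close the previous run: "..." only when lines were actually skipped.
        function flush_run() {
            if (cntr > 2) print "..."
            if (cntr > 1) print oldstr
        }
        {
            # Compressible names end in digits plus at most one extension.
            ok = (match($0, /[0-9]+(\.[^.]*)?$/) > 0)
            key = ""
            if (ok) {
                ext = substr($0, RSTART)
                sub(/^[0-9]+/, "", ext)                  # keep only the extension
                key = substr($0, 1, RSTART - 1) "#" ext  # name with the number masked
            }
            if (ok && oldok && key == oldkey) {
                cntr++               # still inside the current run
            } else {
                flush_run()          # finish whatever run came before
                print                # first line of a possible new run
                cntr = 1
            }
            oldkey = key; oldok = ok; oldstr = $0
        }
        END { flush_run() }          # flush the run still open at end of input
    '
}
```

    Usage: ls -1 | compress_runs. A run of two names is printed in full,
    since a '...' between them would hide nothing.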

    Paul

  • From Java Jive@21:1/5 to Paul on Tue Aug 3 01:14:16 2021
    XPost: alt.os.linux

    On 02/08/2021 10:00, Paul wrote:

    Not thoroughly tested. Will show some awk syntax, no guarantees
    it meets the specs :-)

    Thanks very much, you and William Unruh have been a great help, and your
    example above was very nearly perfect. However, I realised that it
    relied on the files not having a 'dot' earlier in the filenames, and
    when I searched through all the files to check whether that was true, I
    found some files created by others that did have more than one 'dot', so
    I resolved to rewrite it, using the same general approach as yourself.

    Below is what I came up with, and is working pretty well. It's been
    amended to deal with some situations I hadn't originally anticipated:

    Notebooks with numbered pages sometimes had single pages in a run which
    were blank, and so not worth scanning, and for these I created dummy
    text files stating that the pages were blank, to make it clear that no
    pages containing data were omitted from the scanning process. This
    meant that other filename extensions had to be allowed to match.

    Because some filenames included brackets, it was necessary to escape
    these before creating the RE to do the matching.
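
    A portable variant of that escaping step, as a sketch only: it uses
    plain gsub() rather than gawk's gensub(), and escapes the other
    extended-regex metacharacters (dot, star, plus and so on) as well as
    brackets. Whether the wider set is wanted depends on the filenames in
    the archive; the function name escape_ere is hypothetical.

```shell
# escape_ere: print each input line with ERE metacharacters backslash-escaped,
# so the result can be used as a literal inside a dynamic regex.
escape_ere() {
    awk '{
        s = $0
        # In the replacement, "\\\\&" yields a backslash plus the matched character.
        gsub(/[][(){}.*+?^$|\\]/, "\\\\&", s)
        print s
    }'
}
```

    For example, "Building (partly completed).png" comes out with the
    parentheses and the dot each preceded by a backslash.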

    Some documents with numbered pages included extra notes that needed
    separate scans because they were on the back of the previous page,
    resulting in this sort of thing ...
    Blah-blah - 5.png
    Blah-blah - 5 Note.png
    ... and I've adapted it to deal with that as well. This has had the
    unfortunate side-effect of overcompressing some of the test data that I
    gave before, but elsewhere saves so much work that I've decided to keep
    it in.

    Apart from that, it all looks good:

    #!/bin/awk -f
    ##########
    # An AWK program to make an index for our Family History
    ########################################################
    # Note: gensub() is a gawk extension, so awk here must be gawk.

    BEGIN {
        # Init variables
        ################
        # Ensure whole line is one field
        FS = "\n";
        # Current line pattern RE
        cPattern = "";
        # Previous line
        pLine = "";
        # Count of similar lines
        count = 0;
    }

    {
        # Test for being within an existing numbered run
        if( (cPattern != "") && match($0, cPattern) )
        # Yes, so increase counter and remember this line
        # in case it turns out to be the last in this run
        {
            count++;
            pLine = $0;
        }
        else
        # Not part of existing numbered run
        {
            # So first exit what is now the previous numbered run
            if( pLine != "" )
            {
                if( count > 2 )
                    print "...";
                print pLine;
                pLine = "";
            }
            # And we need to print this line anyway
            print $0;
            # Now test for a new numbered run. Some files created
            # by others have more than one dot '.' in the filename,
            # so we must be certain to match only to the last.
            if ( match( $0, /^.*[ #px][[:digit:]]{1,3}\.[^.]+$/) )
            # Potentially the start of a new numbered run
            # so re-initialise for this new one
            {
                # Escape brackets for new RE
                cPattern = gensub( /([][(){}])/, "\\\\\\1", "g", $0 );
                # Build new RE for matching following input lines
                cPattern = gensub( /(^.*[ #px])[[:digit:]]{1,3}(\.[^.]+)$/,
                                   "\\1[0-9]{1,3}[^.]*\\.[^.]+", "g", cPattern );
                count = 1;
            }
            else
            {
                # Not part of a numbered run, reinitialise ready for next
                cPattern = "";
                count = 0;
            }
        }
    }

    END {
        # Flush a run that is still open when the input ends
        if( pLine != "" )
        {
            if( count > 2 )
                print "...";
            print pLine;
        }
    }

    Results:

    Family History/Unknown/Unknown Person's Notebook:

    Unknown Person's Notebook/Unknown Person's Notebook - 01.png
    ...
    Unknown Person's Notebook/Unknown Person's Notebook - 33.png
    Unknown Person's Notebook/Unknown Person's Notebook - End 0.png
    ...
    Unknown Person's Notebook/Unknown Person's Notebook - End 5.png
    Unknown Person's Notebook/Unknown Person's Notebook - Insert 00a.png
    Unknown Person's Notebook/Unknown Person's Notebook - Insert 00b.png
    Unknown Person's Notebook/Unknown Person's Notebook - Insert 01.png
    ...
    Unknown Person's Notebook/Unknown Person's Notebook - Insert 15b -
    Genealogy Of Job - 2.png
    Unknown Person's Notebook/Unknown Person's Notebook.txt
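
    The over-compression of the "Insert" lines above follows from the
    "[^.]*" that lets "5 Note.png" join the run started by "5.png". A small
    sketch of that behaviour, assuming a fixed literal prefix instead of one
    derived from the first filename. The helper name in_run and the sample
    names are made up, and "{1,3}" is spelled out so that awks without
    interval support accept it.

```shell
# Hypothetical helper: print the names that the run pattern would absorb.
# $1 is the literal text up to the page number (no regex metacharacters).
in_run() {
    awk -v prefix="$1" '
        BEGIN {
            # Same shape as the pattern built by the script:
            # prefix, 1 to 3 digits, anything but a dot, then the extension
            pat = prefix "[0-9][0-9]?[0-9]?[^.]*\\.[^.]+"
        }
        $0 ~ pat { print }
    '
}

printf '%s\n' \
    'Blah-blah - 5.png' \
    'Blah-blah - 5 Note.png' \
    'Blah-blah - 6a - Sketch.png' \
    'Blah-blah - End 0.png' | in_run 'Blah-blah - '
```

    The first three names are printed: the "Note" file joins the run as
    intended, but so does the captioned "6a - Sketch" file, which is the
    trade-off described above. "End 0.png" is left out because no digit
    follows the prefix.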

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Martin Gregorie@21:1/5 to All on Tue Aug 3 01:03:22 2021
    XPost: alt.os.linux

    On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:

    Years ago, when I was getting into awk, I found the O'Reilly book,
    "sed & awk", subtitled "UNIX Power Tools" to be really helpful.

    I think it explains how awk works, how to use it and shows what it can
    do better than anything else I've found. It contains a lot of
    non-trivial example code too.

    That's not to knock the awk manpage, which is a good reference guide,
    specially for the various built-in functions, just that I think the
    book explains the way to structure awk scripts rather better than the
    manpage.


    --
    Martin | martin at
    Gregorie | gregorie dot org

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paul@21:1/5 to Martin Gregorie on Tue Aug 3 00:12:48 2021
    XPost: alt.os.linux

    Martin Gregorie wrote:
    On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:

    Years ago, when I was getting into awk, I found the O'Reilly book,
    "sed & awk", subtitled "UNIX Power Tools" to be really helpful.

    I think it explains how awk works, how to use it and shows what it can
    do better than anything else I've found. It contains a lot of
    non-trivial example code too.

    That's not to knock the awk manpage, which is a good reference guide,
    specially for the various built-in functions, just that I think the
    book explains the way to structure awk scripts rather better than the
    manpage.


    --
    Martin | martin at
    Gregorie | gregorie dot org


    For zero dollars, you can get Arnold Robbins "Gawk.pdf",
    which is all the instruction manual you need. Many a happy hour
    spent flipping through that. There are at least three versions
    of the manual, for Gawk3, Gawk4, and Gawk5 (just got a copy a
    couple days ago, so don't know if Gawk5 is distributed as a package
    yet).

    I also have the gray book, which I bought in 1991. Written
    by the three guys A, W, and K. ISBN 0-201-07981-X.

    It's far from a perfect language. Sometimes it crushes the problem
    you're working on. And other times, it's the problem (take sorting
    as an example of migraine-induction).

    Paul

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Java Jive@21:1/5 to Paul on Tue Aug 3 11:57:18 2021
    XPost: alt.os.linux

    On 03/08/2021 05:12, Paul wrote:

    Martin Gregorie wrote:

    On Tue, 03 Aug 2021 01:14:16 +0100, Java Jive wrote:

    Years ago, when I was getting into awk, I found the O'Reilly book,
    "sed & awk", subtitled "UNIX Power Tools" to be really helpful.
    I think it explains how awk works, how to use it and shows what it can
    do better than anything else I've found. It contains a lot of
    non-trivial example code too.

    That's not to knock the awk manpage, which is a good reference guide,
    specially for the various built-in functions, just that I think the
    book explains the way to structure awk scripts rather better than the
    manpage.

    For zero dollars, you can get Arnold Robbins "Gawk.pdf",
    which is all the instruction manual you need.

    Yes, I found that online, and it's been most useful over the last couple
    of days.

    --

    Fake news kills!

    I may be contacted via the contact address given on my website:
    www.macfh.co.uk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to nospam@needed.invalid on Thu Aug 5 20:54:37 2021
    XPost: alt.os.linux

    In article <seafo1$9me$1@dont-email.me>,
    Paul <nospam@needed.invalid> wrote:
    ...
    It's far from a perfect language.

    Well, nothing is. But I have very (very!) rarely found any job in AWK's
    general problem domain (i.e., I wouldn't use it to write an OS, for
    example) that AWK wasn't the best tool for.

    Sometimes it crushes the problem you're working on. And other times, it's
    the problem (take sorting as an example of migraine-induction).

    Surprised to hear you say this. GAWK has several built-in sorting
    capabilities. Maybe a review of the fine manual is in order?

    BTW, once, long ago, before GAWK had these capabilities, I did code up a
    qsort in AWK code. Worked pretty well, but I wouldn't recommend it to anyone...
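
    For reference, a minimal sketch of one of those built-in facilities,
    asort(). It is a GNU awk extension, so this needs gawk rather than a
    POSIX awk. The function name sort_lines is made up for illustration.

```shell
# Hypothetical helper: sort input lines with gawk's built-in asort().
sort_lines() {
    gawk '
        { lines[NR] = $0 }
        END {
            # asort() sorts the values, reindexes them from 1 and
            # returns the number of elements
            n = asort(lines)
            for (i = 1; i <= n; i++)
                print lines[i]
        }
    '
}
```

    Usage would be along the lines of "ls | sort_lines". gawk also provides
    asorti() for sorting by index and PROCINFO["sorted_in"] for controlling
    the order in which a for-in loop visits an array.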

    --
    The difference between communism and capitalism?
    In capitalism, man exploits man. In communism, it's the other way around.

    - Daniel Bell, The End of Ideology (1960) -

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Spiros Bousbouras@21:1/5 to Java Jive on Wed Aug 18 18:20:04 2021
    On Sun, 1 Aug 2021 22:44:50 +0100
    Java Jive <java@evij.com.invalid> wrote:
    On 01/08/2021 18:26, Richard Kettlewell wrote:
    Java Jive <java@evij.com.invalid> writes:
    On 01/08/2021 17:02, Java Jive wrote:
    snip <

    WTF is Java.Jive@f1.n221.z2.fidonet.fi and why is he duplicating my
    posts here?

    Someone is running a broken fido/usenet gateway.

    Injection-Info: gioia.aioe.org; logging-data="10598"; posting-host="F7FIqN6dkowTZ1CLxZIWTQ.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";

    They’re injecting via aioe.org, so I guess complaints to there.

    Done, thanks for your explanation, we'll have to wait and see what the
    result of the complaint is.

    I discovered today that some administrator or representative or
    something of aioe posts to and reads news.admin.peering, so if private
    emails do not help (which they don't seem to have done), perhaps posting
    on that group might help. Personally I don't want to spend any (more)
    time on this now.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)