• Scraping and Less buffering

    From frogger@21:1/5 to All on Sat Mar 19 17:57:09 2022
    Hello all.

    I wrote a shell scraper of a news website and one of the options is to
    keep re-accessing the initial webpage (while loop) at regular intervals,
    grab all news links and scrape text of the ones which are new. This
    option keeps the script running for days straight. When stdout is
    redirected to a *file*, it works as expected.
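
    The shape of that loop, with a placeholder URL and purely
    illustrative commands (not the actual script), is roughly this:

    # Illustration only: poll a page, collect the links, fetch ones not seen before.
    seen=seen_links.txt
    touch "$seen"
    while true; do
        curl -s https://example.com/news |
            grep -Eo 'https?://[^"]+' |
            while read -r url; do
                if ! grep -qxF "$url" "$seen"; then
                    printf '%s\n' "$url" >>"$seen"
                    curl -s "$url"       # fetch the article (text extraction omitted)
                fi
            done
        sleep 300                        # regular interval between polls
    done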

    If instead we pipe the output to `less', there is a buffering stall.
    Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    script hangs and only continues to run when we scroll down in the
    `less' display. But if it hangs for a few hours, the scraping tool
    only resumes a few hours later too, failing to scrape a lot of news
    from the initial webpage during that time.
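
    The stall itself is easy to reproduce outside the script: a writer
    into a pipe blocks as soon as the reader stops draining it and the
    kernel pipe buffer (64 KiB by default on Linux) fills up. A tiny
    illustration, unrelated to the actual script:

    # The counter printed to stderr stops advancing after roughly 64 KiB has
    # gone into the pipe, because the sleeping "reader" never drains it.
    { i=0; while printf '%063d\n' "$i"; do i=$((i+1)); echo "$i" >&2; done; } | sleep 60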

    I tried piping to `less -B -b4096' and `less -B -b-1' to no avail:

    -B or --auto-buffers
    By default, when data is read from a pipe, buffers are allocated
    automatically as needed. If a large amount of data is read from
    the pipe, this can cause a large amount of memory to be allocated.
    The -B option disables this automatic allocation of buffers for
    pipes, so that only 64 KB (or the amount of space specified by the
    -b option) is used for the pipe.

    -bn or --buffers=n
    Specifies the amount of buffer space less will use for each file,
    in units of kilobytes (1024 bytes). By default 64 KB of buffer
    space is used for each file (unless the file is a pipe; see the -B
    option). The -b option specifies instead that n kilobytes of
    buffer space should be used for each file. If n is -1, buffer
    space is unlimited; that is, the entire file can be read into
    memory.


    I use GNU coreutils and Arch Linux. My understanding is that `less
    -Bb-1' should have worked, but some internal buffering still gets in
    the way. Any suggestions? I know of `stdbuf' and tried it a little,
    but could not make it work. If I was not clear, please let me know.
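
    For what it is worth, a typical `stdbuf' attempt would look something
    like the following (`scraper.sh' is only a placeholder name). Note
    that `stdbuf' only changes the stdio buffering of the writing
    process; it cannot keep the write itself from blocking once `less'
    stops draining the pipe and the kernel pipe buffer fills, which may
    be why it made no difference here:

    # Line-buffered stdout from the writer (placeholder script name):
    stdbuf -oL ./scraper.sh | less
    # Fully unbuffered stdout:
    stdbuf -o0 ./scraper.sh | less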

    Thanks,
    JSN

  • From Ben Bacarisse@21:1/5 to frogger on Sat Mar 19 22:06:19 2022
    frogger <somebody@invalid.com> writes:

    > I wrote a shell scraper of a news website and one of the options is to
    > keep re-accessing the initial webpage (while loop) at regular
    > intervals, grab all news links and scrape text of the ones which are
    > new. This option keeps the script running for days straight. When
    > stdout is redirected to a *file*, it works as expected.
    >
    > If instead we pipe the output to `less', there is a buffering stall.
    > Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    > script hangs and only continues to run when we scroll down in the
    > `less' display. But if it hangs for a few hours, the scraping tool
    > only resumes a few hours later too, failing to scrape a lot of news
    > from the initial webpage during that time.

    I'm not 100% sure what you want, but does:

    $ scraper >file & less file

    and then using the F command do something like you want?
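
    A closely related form, assuming a reasonably recent less, is to
    start directly in follow mode from the command line (Ctrl-C drops
    back to normal paging, and F resumes following):

    $ scraper >file & less +F file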

    --
    Ben.

  • From frogger@21:1/5 to Ben Bacarisse on Sat Mar 19 19:41:17 2022
    On 19/03/2022 19:06, Ben Bacarisse wrote:
    > frogger <somebody@invalid.com> writes:
    >
    >> I wrote a shell scraper of a news website and one of the options is to
    >> keep re-accessing the initial webpage (while loop) at regular
    >> intervals, grab all news links and scrape text of the ones which are
    >> new. This option keeps the script running for days straight. When
    >> stdout is redirected to a *file*, it works as expected.
    >>
    >> If instead we pipe the output to `less', there is a buffering stall.
    >> Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    >> script hangs and only continues to run when we scroll down in the
    >> `less' display. But if it hangs for a few hours, the scraping tool
    >> only resumes a few hours later too, failing to scrape a lot of news
    >> from the initial webpage during that time.
    >
    > I'm not 100% sure what you want, but does:
    >
    > $ scraper >file & less file
    >
    > and then using the F command do something like you want?


    Hey Ben!

    That is *exactly* what I am trying just now. I was not sure it would
    be a good solution, but it seems it is the straightforward one, since
    it was so obvious to you.

    So what I just did, and which works, is:

    scraperFunction >file & tail -f file | less

    I also added a trap inside the script to kill the forks and remove
    the temp file; a rough sketch of that is below.
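
    A rough sketch of that cleanup, with placeholder names rather than
    the script's real ones:

    tmpfile=$(mktemp)
    scraperFunction >"$tmpfile" &        # background scraper loop
    scraper_pid=$!

    cleanup() { kill "$scraper_pid" 2>/dev/null; rm -f "$tmpfile"; }
    trap cleanup EXIT                    # kill the fork and remove the temp file
    trap 'exit 130' INT TERM             # route signals through the EXIT trap

    tail -f "$tmpfile" | less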

    Thanks, Ben!
    JSN
