• Scraping and Less buffering

    From frogger@21:1/5 to All on Sat Mar 19 17:57:09 2022
    Hello all.

    I wrote a shell scraper of a news website and one of the options is to
    keep re-accessing the initial webpage (while loop) at regular intervals,
    grab all news links and scrape text of the ones which are new. This
    option keeps the script running for days straight. When stdout is
    redirected to a *file*, it works as expected.
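
    The shape of that loop, with a placeholder URL and purely
    illustrative commands (not the actual script), is roughly this:

    # Illustration only: poll a page, collect the links, fetch ones not seen before.
    seen=seen_links.txt
    touch "$seen"
    while true; do
        curl -s https://example.com/news |
            grep -Eo 'https?://[^"]+' |
            while read -r url; do
                if ! grep -qxF "$url" "$seen"; then
                    printf '%s\n' "$url" >>"$seen"
                    curl -s "$url"       # fetch the article (text extraction omitted)
                fi
            done
        sleep 300                        # regular interval between polls
    done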

    If instead we pipe the output to `less', there is a buffering stall.
    Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    script hangs and only continues to run when we scroll down in the
    `less' display. But if it hangs for a few hours, the scraping tool
    only resumes a few hours later too, failing to scrape a lot of news
    from the initial webpage during that time.
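
    The stall itself is easy to reproduce outside the script: a writer
    into a pipe blocks as soon as the reader stops draining it and the
    kernel pipe buffer (64 KiB by default on Linux) fills up. A tiny
    illustration, unrelated to the actual script:

    # The counter printed to stderr stops advancing after roughly 64 KiB has
    # gone into the pipe, because the sleeping "reader" never drains it.
    { i=0; while printf '%063d\n' "$i"; do i=$((i+1)); echo "$i" >&2; done; } | sleep 60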

    I tried piping to `less -B -b4096' and `less -B -b-1' to no avail:

    -B or --auto-buffers
    By default, when data is read from a pipe, buffers are allocated
    automatically as needed. If a large amount of data is read from
    the pipe, this can cause a large amount of memory to be allocated.
    The -B option disables this automatic allocation of buffers for
    pipes, so that only 64 KB (or the amount of space specified by the
    -b option) is used for the pipe.

    -bn or --buffers=n
    Specifies the amount of buffer space less will use for each file,
    in units of kilobytes (1024 bytes). By default 64 KB of buffer
    space is used for each file (unless the file is a pipe; see the -B
    option). The -b option specifies instead that n kilobytes of
    buffer space should be used for each file. If n is -1, buffer
    space is unlimited; that is, the entire file can be read into
    memory.


    I use GNU coreutils and Arch Linux. My understanding is that `less
    -Bb-1' should have worked, but some internal buffering still gets in
    the way. Any suggestions? I know of `stdbuf' and tried it a little,
    but could not make it work. If I was not clear, please let me know.
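
    For what it is worth, a typical `stdbuf' attempt would look something
    like the following (`scraper.sh' is only a placeholder name). Note
    that `stdbuf' only changes the stdio buffering of the writing
    process; it cannot keep the write itself from blocking once `less'
    stops draining the pipe and the kernel pipe buffer fills, which may
    be why it made no difference here:

    # Line-buffered stdout from the writer (placeholder script name):
    stdbuf -oL ./scraper.sh | less
    # Fully unbuffered stdout:
    stdbuf -o0 ./scraper.sh | less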

    Thanks,
    JSN

  • From Ben Bacarisse@21:1/5 to frogger on Sat Mar 19 22:06:19 2022
    frogger <somebody@invalid.com> writes:

    > I wrote a shell scraper of a news website and one of the options is to
    > keep re-accessing the initial webpage (while loop) at regular
    > intervals, grab all news links and scrape text of the ones which are
    > new. This option keeps the script running for days straight. When
    > stdout is redirected to a *file*, it works as expected.
    >
    > If instead we pipe the output to `less', there is a buffering stall.
    > Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    > script hangs and only continues to run when we scroll down in the
    > `less' display. But if it hangs for a few hours, the scraping tool
    > only resumes a few hours later too, failing to scrape a lot of news
    > from the initial webpage during that time.

    I'm not 100% sure what you want, but does:

    $ scraper >file & less file

    and then using the F command do something like you want?
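
    A closely related form, assuming a reasonably recent less, is to
    start directly in follow mode from the command line (Ctrl-C drops
    back to normal paging, and F resumes following):

    $ scraper >file & less +F file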

    --
    Ben.

  • From frogger@21:1/5 to Ben Bacarisse on Sat Mar 19 19:41:17 2022
    On 19/03/2022 19:06, Ben Bacarisse wrote:
    > frogger <somebody@invalid.com> writes:
    >
    >> I wrote a shell scraper of a news website and one of the options is to
    >> keep re-accessing the initial webpage (while loop) at regular
    >> intervals, grab all news links and scrape text of the ones which are
    >> new. This option keeps the script running for days straight. When
    >> stdout is redirected to a *file*, it works as expected.
    >>
    >> If instead we pipe the output to `less', there is a buffering stall.
    >> Once `less' has buffered roughly 8-64 KB, the whole while loop in the
    >> script hangs and only continues to run when we scroll down in the
    >> `less' display. But if it hangs for a few hours, the scraping tool
    >> only resumes a few hours later too, failing to scrape a lot of news
    >> from the initial webpage during that time.
    >
    > I'm not 100% sure what you want, but does:
    >
    > $ scraper >file & less file
    >
    > and then using the F command do something like you want?


    Hey Ben!

    That is *exactly* what I am trying just now. I was not sure it would
    be a good solution, but it seems it is the straightforward one, since
    it was so obvious to you.

    So what I just did, and which works, is:

    scraperFunction >file & tail -f file | less

    I also added a trap inside the script to kill the forks and remove
    the temp file; a rough sketch of that is below.
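
    A rough sketch of that cleanup, with placeholder names rather than
    the script's real ones:

    tmpfile=$(mktemp)
    scraperFunction >"$tmpfile" &        # background scraper loop
    scraper_pid=$!

    cleanup() { kill "$scraper_pid" 2>/dev/null; rm -f "$tmpfile"; }
    trap cleanup EXIT                    # kill the fork and remove the temp file
    trap 'exit 130' INT TERM             # route signals through the EXIT trap

    tail -f "$tmpfile" | less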

    Thanks, Ben!
    JSN
