• Re: The size of pipes

    From Janis Papanagnou@21:1/5 to Felix Palmen on Sun Apr 23 18:34:25 2023
    On 23.04.2023 18:21, Felix Palmen wrote:

    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.

    See my recent reply on a different view.

    So, the real concern is whether you'll have enough RAM.

    Not if sorting is (alternatively or also) done over files.

    Even at a time when 640k was considered an immense amount of memory by
    some, much larger data sets were already being sorted. (Speaking about
    real OS computers, not about toys.) On earth-bound computers there was
    usually much more disk/drum/tape storage than core memory available.
    Even if that's "legacy", the principles are still the same. Unless
    responsible folks start putting everything into a global memory
    cloud. :-/

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Felix Palmen@21:1/5 to All on Sun Apr 23 18:21:44 2023
    * Kenny McCormack <gazelle@shell.xmission.com>:
    David W. Hodgins <dwhodgins@nomail.afraid.org> wrote:
    ...
    Keep in mind. When sorting a file, the last line in the input may end up
    becoming the first line in the output. The sort cannot write anything to
    the pipe or output file until it's sorted the entire input. With a pipe,
    the temporary file is in RAM rather than being a named file on disk.

    This actually raises an interesting point. Pipes are not infinite in size, and they could, theoretically block if enough is written on the write end [...]
    Something to keep in mind if you ever decide to sort very large files in a pipeline. And it is probably a better idea not to do so; to sort it all at once, using multiple key specifications on the command line.

    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.
    So, the real concern is whether you'll have enough RAM.

    The only alternative would be to sort on the file contents. I don't know whether some sort utility can do that (it certainly would create other
    issues when sorting by "text lines" of very different lengths), but
    that's not possible with pipes anyways, they can't be seeked.

    --
    Dipl.-Inform. Felix Palmen <felix@palmen-it.de> ,.//..........
    {web} http://palmen-it.de {jabber} [see email] ,//palmen-it.de
    {pgp public key} http://palmen-it.de/pub.txt // """""""""""
    {pgp fingerprint} 6936 13D5 5BBF 4837 B212 3ACC 54AD E006 9879 F231

  • From David W. Hodgins@21:1/5 to Felix Palmen on Sun Apr 23 12:35:14 2023
    On Sun, 23 Apr 2023 12:21:44 -0400, Felix Palmen <felix@palmen-it.de> wrote:
    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.
    So, the real concern is whether you'll have enough RAM.

    Yes, sort has to read the entire input file before it can write anything
    to the output file, as the last record read may have to be the first one written.

    The way that's handled in low ram systems is to use temporary files, where it sorts chunks into each temporary file and then merges the temporary files to create the final output file.
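
The chunk-and-merge approach described above can be sketched in shell. This is only an illustration under assumptions: `external_sort` and its chunk-size argument are hypothetical names, and sort itself already does the equivalent internally, so nobody needs this in practice.

```shell
# external_sort [LINES_PER_CHUNK]: sort stdin using bounded chunks on disk.
# Hypothetical helper for illustration only.
external_sort() {
    chunk_lines=${1:-100000}            # lines per chunk (assumed tuning knob)
    workdir=$(mktemp -d) || return 1    # scratch directory for the chunks

    # 1. Cut the input stream into fixed-size chunk files ("-" means stdin).
    #    With default two-letter suffixes POSIX split allows 676 chunks.
    split -l "$chunk_lines" - "$workdir/chunk."

    # Collect the chunk names; with empty input the glob matches nothing.
    set -- "$workdir"/chunk.*
    [ -e "$1" ] || { rm -rf "$workdir"; return 0; }

    # 2. Sort each chunk in place, so only one chunk is handled at a time.
    for f in "$@"; do
        sort -o "$f" "$f"
    done

    # 3. Merge the already-sorted chunks; -m merges without re-sorting.
    sort -m "$@"
    rm -rf "$workdir"
}
```

For example, `printf '3\n1\n2\n5\n4\n' | external_sort 2` writes three chunk files, sorts each one, and prints the lines 1 to 5 in order.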

    By default the temporary files are stored in /tmp, which on many systems is
    now a virtual file system kept in RAM.

    Either ensure /tmp is mounted on a disk file system with enough free space
    or instruct sort to use another directory.

    See "man sort" for the -T (aka --temporary-directory=DIR) option.
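
A small sketch of both steps, checking where /tmp lives and pointing sort elsewhere. `fs_of` and `sort_on_disk` are hypothetical helper names, `SORT_TMP` is a made-up variable, and /var/tmp is only an assumed example of a disk-backed directory; `-T` is the option from the man page.

```shell
# fs_of [DIR]: print the "Filesystem" column that df reports for DIR.
# A value of "tmpfs" suggests the directory is RAM-backed (Linux naming).
fs_of() {
    df -P "${1:-/tmp}" | awk 'NR == 2 { print $1 }'
}

# sort_on_disk [sort args...]: run sort with its temporary files placed in
# $SORT_TMP. The variable name and the /var/tmp default are assumptions.
sort_on_disk() {
    sort -T "${SORT_TMP:-/var/tmp}" "$@"
}
```

Usage would be along the lines of `fs_of /tmp` followed by `sort_on_disk -n big.txt > sorted.txt`, where big.txt stands for whatever input you have. GNU sort also honours the TMPDIR environment variable when -T is not given.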

    Regards, Dave Hodgins

  • From Felix Palmen@21:1/5 to All on Sun Apr 23 18:58:41 2023
    * Janis Papanagnou <janis_papanagnou+ng@hotmail.com>:
    On 23.04.2023 18:21, Felix Palmen wrote:

    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.

    See my recent reply on a different view.

    So, even if it starts working on "chunks", this won't change anything:
    the data from the pipe must be read in order to work with it, so the
    size of the pipe won't be a problem here.

    It seems the assumption behind this was that all the data to be sorted
    must fit into the pipe buffer. But this isn't the case.
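
A quick way to convince yourself, assuming a system where the pipe buffer is far smaller than the input (commonly 64 KiB on Linux): the hypothetical `pipe_demo` below pushes roughly 1.3 MB through a pipe into sort. The writer simply blocks whenever the pipe is full and resumes as sort reads.

```shell
# pipe_demo: feed sort more data through a pipe than the pipe can hold.
# Hypothetical name; the input size is an arbitrary choice.
pipe_demo() {
    # 200000 numbers in descending order: about 1.3 MB of text.
    seq 200000 -1 1 |
        sort -n |
        sed -n '1,3p'      # print only the first three lines of the result
}
```

Running `pipe_demo` prints 1, 2 and 3 on separate lines, so the whole input made it through even though it never fit into the pipe at once.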

    So, the real concern is whether you'll have enough RAM.

    Not if sorting is (alternatively or also) done over files.

    Sure this *can* be done, that's why I mentioned the possibility. I
    wasn't aware sort utils these days actually do it.

  • From David W. Hodgins@21:1/5 to Felix Palmen on Sun Apr 23 13:46:56 2023
    On Sun, 23 Apr 2023 12:58:41 -0400, Felix Palmen <felix@palmen-it.de> wrote:
    It seems the assumption behind this was that all the data to be sorted
    must fit into the pipe buffer. But this isn't the case.

    As the last line of the input file(s) may be the first line of the final output,
    all of the data must be sorted before anything is written to the pipe.

    Either all of the data has to fit in RAM, or it has to be sorted in chunks
    with those chunks stored on disk and then merged to produce the output.

    The coreutils package's sort can use temporary (unnamed) files as needed in the directory specified by the TMPDIR environment variable (/tmp on most systems). They won't show up in ls as they are unnamed.

    It's not clear from the man page if it will always use temporary files or only if instructed to. So checking the source ... https://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/sort.c;h=8ca7a88c48ec07eccd952b14739e427721466c5d;hb=HEAD

    If I'm reading it right, it always uses temporary files doing a sort/merge. Given that it started in 1988, it's not surprising that it's designed to work in a low ram environment.

    So if you're in a low ram environment either ensure the $TMPDIR directory is not in ram, or include the --temporary-directory=DIR to specify another directory that is on a disk file system with enough free space.

    Regards, Dave Hodgins

  • From Felix Palmen@21:1/5 to All on Sun Apr 23 19:29:50 2023
    * Janis Papanagnou <janis_papanagnou+ng@hotmail.com>:
    s/doing/finishing/

    Agreed.

    It boils down to this; sorting can _start_ sorting with fewer data
    (something like a pipe-full), it can also _continue_ sorting with
    more parts of data, and to _finish_ sorting it naturally must have
    had all data available.

    All correct, but I really doubt the relevance of the parentheses. The
    size of the pipe will never be of much interest (except maybe for
    performance), mostly because you can't seek a pipe anyways.

  • From Janis Papanagnou@21:1/5 to Felix Palmen on Sun Apr 23 19:16:25 2023
    On 23.04.2023 18:58, Felix Palmen wrote:
    * Janis Papanagnou <janis_papanagnou+ng@hotmail.com>:
    On 23.04.2023 18:21, Felix Palmen wrote:

    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.

    s/doing/finishing/

    See my recent reply on a different view.

    So, even if it starts working on "chunks", this won't change anything:
    the data from the pipe must be read in order to work with it, so the
    size of the pipe won't be a problem here.

    It seems the assumption behind this was that all the data to be sorted
    must fit into the pipe buffer. But this isn't the case.

    It boils down to this; sorting can _start_ sorting with fewer data
    (something like a pipe-full), it can also _continue_ sorting with
    more parts of data, and to _finish_ sorting it naturally must have
    had all data available.

    Janis

  • From Janis Papanagnou@21:1/5 to Felix Palmen on Sun Apr 23 19:53:36 2023
    On 23.04.2023 19:29, Felix Palmen wrote:
    * Janis Papanagnou <janis_papanagnou+ng@hotmail.com>:
    s/doing/finishing/

    Agreed.

    It boils down to this; sorting can _start_ sorting with fewer data
    (something like a pipe-full), it can also _continue_ sorting with
    more parts of data, and to _finish_ sorting it naturally must have
    had all data available.

    All correct, but I really doubt the relevance of the parentheses.

    It's here just to demonstrate a magnitude, no less, no more.

    But I seem to recall - faint memories from 4 decades ago - that
    I/O-buffer size (similar to pipe-buffer size) was part of the
    rationale for choosing such values and how to dimension them
    (for optimum processing speed, yes, for performance as you say
    below).

    The size of the pipe will never be of much interest (except maybe for
    performance), mostly because you can't seek a pipe anyways.

    Seeking on the pipe isn't necessary since the pipe is just the
    transfer medium, unstructured per se, with records likely even
    cut off at the front or rear of a read (because it transmits
    octets, not data records). You'll have the data transferred
    into a structured in-memory representation anyway.

    Janis

  • From Spiros Bousbouras@21:1/5 to Janis Papanagnou on Sun Apr 23 18:03:36 2023
    On Sun, 23 Apr 2023 19:16:25 +0200
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    On 23.04.2023 18:58, Felix Palmen wrote:
    * Janis Papanagnou <janis_papanagnou+ng@hotmail.com>:
    On 23.04.2023 18:21, Felix Palmen wrote:

    This won't be a concern here. You need the whole data to sort something,
    so the sort utility must read until EOF anyways before doing its work.

    s/doing/finishing/

    See my recent reply on a different view.

    So, even if it starts working on "chunks", this won't change anything:
    the data from the pipe must be read in order to work with it, so the
    size of the pipe won't be a problem here.

    It seems the assumption behind this was that all the data to be sorted
    must fit into the pipe buffer. But this isn't the case.

    It boils down to this; sorting can _start_ sorting with fewer data
    (something like a pipe-full), it can also _continue_ sorting with
    more parts of data, and to _finish_ sorting it naturally must have
    had all data available.

    I think Kenny was worried in <u23fpe$2opsm$1@news.xmission.com>
    and <u23ito$2osbe$1@news.xmission.com> about a deadlock situation where
    no progress gets made because of low pipe capacity. I can't think of
    a scenario where this can happen, even if sort interleaves sorting and
    reading from a pipe.

  • From Janis Papanagnou@21:1/5 to David W. Hodgins on Sun Apr 23 20:17:47 2023
    On 23.04.2023 19:46, David W. Hodgins wrote:

    If I'm reading it right, it always uses temporary files doing a sort/merge. Given that it started in 1988, it's not surprising that it's designed to
    work in a low ram environment.

    Some test-run[*] finished here...

    The data created and fed into 'sort' is larger than my free RAM.

    $ time seq 1000000000 -1 1 | sort -n | N=1 is-sorted
    0

    real 58m8.18s
    user 54m18.16s
    sys 1m49.34s

    Janis

    [*] 'is-sorted' is an awk script, and "0" means it's okay (=sorted).
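
The script itself was not posted. A minimal stand-in, assuming N names the field to check and that numeric order is meant, might look like this; `is_sorted` is a hypothetical reconstruction, not Janis's code.

```shell
# is_sorted: print 0 if field N (default 1) of stdin is in non-decreasing
# numeric order, otherwise print 1. Hypothetical stand-in for 'is-sorted'.
is_sorted() {
    awk -v n="${N:-1}" '
        # A value smaller than its predecessor means the input is unsorted.
        NR > 1 && $n + 0 < prev { bad = 1; exit }
        { prev = $n + 0 }
        # exit in a main rule still runs END, so the verdict is always printed.
        END { print bad + 0 }
    '
}
```

A scaled-down version of the test run would then be `seq 1000 -1 1 | sort -n | N=1 is_sorted`, which prints 0.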

  • From Felix Palmen@21:1/5 to All on Mon Apr 24 19:07:04 2023
    * Kaz Kylheku <864-117-4973@kylheku.com>:
    In fact, I suspect, a pipe doesn't have to store anything. It can be a
    pure rendezvous. The write() call can block until the reader performs a
    read(), or vice versa, at which time MIN(read_size, write_size) bytes
    can be transferred directly between their respective buffers, that value
    then being returned from the read and write.

    Yes. IIRC, L4 uses some similar mechanism for IPC. It needs support from
    the scheduler of course. And to make it most efficient, the size should
    be agreed upon on both sides, so that won't work with typical pipe
    semantics.
