• "find ... -exec sort -u {} ..." vs "find ... -exec cat {} \;| sort -u .

    From hongyi.zhao@gmail.com@21:1/5 to All on Sat Nov 27 05:25:24 2021
    See my tests below:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.

    Regards,
    HZ

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Elvidge@21:1/5 to hongy...@gmail.com on Sat Nov 27 14:11:42 2021
    On 27/11/2021 01:25 pm, hongy...@gmail.com wrote:
    See my following testings:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.

    Regards,
    HZ


    What, exactly, are you trying to do?

    --
    Chris Elvidge
    England

  • From Lew Pitcher@21:1/5 to hongy...@gmail.com on Sat Nov 27 14:27:27 2021
    On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:

    See my following testings:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why
    such a mistake would occur.

    There's no mistake. The first command doesn't do the same thing
    as the second and third commands do.

    The first command overwrites file aaa with the sorted contents of each
    file found. While file aaa will, at times, contain the contents of each
    file (sorted, of course), its final contents are the sorted contents of
    the /last/ file found by the find(1) command.
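
    The overwrite behaviour described above can be reproduced on throwaway
    data. The following is only a sketch: the scratch directory and the file
    names "one" and "two" are made up for illustration, and it assumes a
    sort(1) that accepts -o after the file operand, as the original command
    does.

```shell
# Hypothetical demo data: two small files in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/source"
printf 'b\na\n' > "$demo/source/one"
printf 'd\nc\n' > "$demo/source/two"

# One sort run per file; every run truncates and rewrites aaa.
find "$demo/source" -type f -exec sort -u {} -o "$demo/aaa" \;

# One sort run over the concatenated contents of all files.
find "$demo/source" -type f -exec cat {} \; | sort -u -o "$demo/ccc"

aaa_lines=$(wc -l < "$demo/aaa" | tr -d ' ')
ccc_lines=$(wc -l < "$demo/ccc" | tr -d ' ')
echo "aaa: $aaa_lines lines, ccc: $ccc_lines lines"
```

    With this data, aaa should end up holding only the 2 lines of whichever
    file find(1) visited last, while ccc holds all 4 lines.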

    The other two commands /attempt/ to sort the contents of /all/ the found
    files into files bbb and ccc respectively.

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of
    files found, /and/ the maximum size of an argument vector (argv[]), there
    is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    The third command uses cat(1) to concatenate the contents of all the found files into one stream that sort(1) will sort into file ccc.

    Any hints will be highly appreciated.

    Regards,
    HZ




    --
    Lew Pitcher
    "In Skills, We Trust"

  • From hongyi.zhao@gmail.com@21:1/5 to hongy...@gmail.com on Sat Nov 27 18:04:45 2021
    On Sunday, November 28, 2021 at 9:53:49 AM UTC+8, hongy...@gmail.com wrote:
    On Saturday, November 27, 2021 at 10:27:32 PM UTC+8, Lew Pitcher wrote:
    On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:

    See my following testings:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur.
    There's no mistake. The first command doesn't do the same things
    as the second and third command does.

    The first command overwrites file aaa with the sorted contents of each file found. While file aaa will, at times, contain the contents of each file (sorted, of course), it's final contents are the sorted contents of the /last/ file found by the find(1) command.
    Thank you for your prompt. Using the following method will get the same results as other methods:

    $ find ./source -type f -exec sort -u {} + > aaa
    The other two commands /attempt/ to sort the contents of /all/ the found files into files bbb and ccc respectively.

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of files found, /and/ the maximum size of an argument vector (argv[]), there is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    The third command uses cat(1) to concatenate the contents of all the found files into one stream that sort(1) will sort into file ccc.
    Therefore, the second method is not as reliable as the following two methods and should be avoided:

    $ find ./source -type f -exec sort -u {} + > aaa
    $ find ./source -type f -exec cat {} + | sort -u -o ccc

    Based on the following explanation of the find command man page:

    $ man find | egrep -A 12 -- '-exec command \{\} +'
    -exec command {} +
    This variant of the -exec action runs the specified command on the selected
    files, but the command line is built by appending each selected file name at
    the end; the total number of invocations of the command will be much less than
    the number of matched files. The command line is built in much the same way
    that xargs builds its command lines. Only one instance of `{}' is allowed
    within the command, and (when find is being invoked from a shell) it should be
    quoted (for example, '{}') to protect it from interpretation by shells. The
    command is executed in the starting directory. If any invocation with the `+'
    form returns a non-zero value as exit status, then find returns a non-zero
    exit status. If find encounters an error, this can sometimes cause an immedi‐
    ate exit, so some pending commands may not be run at all. This variant of
    -exec always returns true.

    It seems that using the + variant of the -exec action is faster:

    $ time find ./source -type f -exec cat {} \;| sort -u -o ccc

    real 0m3.251s
    user 0m3.188s
    sys 0m0.066s
    $ time find ./source -type f -exec cat {} + | sort -u -o ccc

    real 0m2.895s
    user 0m2.827s
    sys 0m0.075s
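
    A way to see where the difference comes from, independent of timing
    noise, is to count how many processes each variant starts. This is a
    rough sketch with made-up scratch files; the sh -c 'echo run' command is
    only a stand-in that prints one line per invocation.

```shell
# Hypothetical demo data: three tiny files in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/source"
for name in one two three; do
    printf '%s\n' "$name" > "$demo/source/$name"
done

# Every started sh prints one line, so the line count equals the process count.
per_file=$(find "$demo/source" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
batched=$(find "$demo/source" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

printf 'per-file form: %s runs, batched form: %s runs\n' "$per_file" "$batched"
```

    With three small files, the \; form should start three processes and the
    + form a single one.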

    Regards,
    HZ

  • From hongyi.zhao@gmail.com@21:1/5 to Lew Pitcher on Sat Nov 27 17:53:46 2021
    On Saturday, November 27, 2021 at 10:27:32 PM UTC+8, Lew Pitcher wrote:
    On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:

    See my following testings:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why
    such a mistake would occur.
    There's no mistake. The first command doesn't do the same things
    as the second and third command does.

    The first command overwrites file aaa with the sorted contents of each
    file found. While file aaa will, at times, contain the contents of each
    file (sorted, of course), it's final contents are the sorted contents of
    the /last/ file found by the find(1) command.

    Thank you for your prompt reply. Using the following method will get the same results as other methods:

    $ find ./source -type f -exec sort -u {} + > aaa


    The other two commands /attempt/ to sort the contents of /all/ the found files into files bbb and ccc respectively.

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of
    files found, /and/ the maximum size of an argument vector (argv[]), there
    is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    The third command uses cat(1) to concatenate the contents of all the found files into one stream that sort(1) will sort into file ccc.

    Therefore, the second method is not as reliable as the following two methods and should be avoided:

    $ find ./source -type f -exec sort -u {} + > aaa
    $ find ./source -type f -exec cat {} + | sort -u -o ccc

    Regards,
    HZ

  • From Janis Papanagnou@21:1/5 to hongy...@gmail.com on Sun Nov 28 03:36:30 2021
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:

    Yes, but your numbers below say little; a difference of that size
    might just as well result from caching effects. Often you get an
    order of magnitude or more of speed increase. Note that post-processing
    with sort unnecessarily affects any speed comparisons of find usage
    variants. Note also that the find built-in -exec/+ could also be done
    using xargs (-print0 | xargs -0).

    Janis


    $ time find ./source -type f -exec cat {} \;| sort -u -o ccc

    real 0m3.251s
    user 0m3.188s
    sys 0m0.066s
    $ time find ./source -type f -exec cat {} + | sort -u -o ccc

    real 0m2.895s
    user 0m2.827s
    sys 0m0.075s

    Regards,
    HZ


  • From Janis Papanagnou@21:1/5 to hongy...@gmail.com on Mon Nov 29 01:52:57 2021
    On 29.11.2021 01:29, hongy...@gmail.com wrote:
    On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:
    Yes, but your numbers below are of little expressiveness, they might
    also result just from caching effects in that given magnitude. Often
    you get a magnitude or more speed increase. Note that post-processing
    with sort unnecessarily affects any speed comparisons of find usage
    variants. Note also that the find built-in -exec/+ could also be done
    using xargs (-print0 | xargs -0).

    Lew Pitcher told the following shortcoming of xargs based solution [1]:

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of
    files found, /and/ the maximum size of an argument vector (argv[]), there
    is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

    Hence, I try to avoid using xargs.

    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    (In the preceding examples the syntactic details have been omitted
    for clarity, to better see the differences.)

    Janis

  • From hongyi.zhao@gmail.com@21:1/5 to Janis Papanagnou on Sun Nov 28 16:29:27 2021
    On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:
    Yes, but your numbers below are of little expressiveness, they might
    also result just from caching effects in that given magnitude. Often
    you get a magnitude or more speed increase. Note that post-processing
    with sort unnecessarily affects any speed comparisons of find usage
    variants. Note also that the find built-in -exec/+ could also be done
    using xargs (-print0 | xargs -0).

    Lew Pitcher pointed out the following shortcoming of the xargs-based solution [1]:

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of
    files found, /and/ the maximum size of an argument vector (argv[]), there
    is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

    Hence, I try to avoid using xargs.


    Janis

    $ time find ./source -type f -exec cat {} \;| sort -u -o ccc

    real 0m3.251s
    user 0m3.188s
    sys 0m0.066s
    $ time find ./source -type f -exec cat {} + | sort -u -o ccc

    real 0m2.895s
    user 0m2.827s
    sys 0m0.075s

    Regards,
    HZ


  • From hongyi.zhao@gmail.com@21:1/5 to Janis Papanagnou on Sun Nov 28 21:01:29 2021
    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
    On 29.11.2021 01:29, hongy...@gmail.com wrote:
    On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:
    Yes, but your numbers below are of little expressiveness, they might
    also result just from caching effects in that given magnitude. Often
    you get a magnitude or more speed increase. Note that post-processing
    with sort unnecessarily affects any speed comparisons of find usage
    variants. Note also that the find built-in -exec/+ could also be done
    using xargs (-print0 | xargs -0).

    Lew Pitcher told the following shortcoming of xargs based solution [1]:

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of files found, /and/ the maximum size of an argument vector (argv[]), there is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

    Hence, I try to avoid using xargs.
    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Wonderful explanation, which exposes a gap in my knowledge. But I still want to know how likely this problem is to occur. OTOH, I think this problem is, to some extent, related to the following values:

    $ xargs --show-limits
    Your environment variables take up 8115 bytes
    POSIX upper limit on argument length (this system): 2086989
    POSIX smallest allowable upper limit on argument length (all systems): 4096
    Maximum length of command we could actually use: 2078874
    Size of command buffer we are actually using: 131072
    Maximum parallelism (--max-procs must be no greater): 2147483647

    Execution of xargs will continue now, and it will try to read its input and run commands; if this is not what you wanted to happen, please type the end-of-file keystroke.
    Warning: echo will be run at least once. If you do not want that to happen, then press the interrupt keystroke.


    Anyway, if xargs fails to do the trick, maybe parallel [1] doesn't have this issue.

    [1] https://www.gnu.org/software/parallel/

  • From hongyi.zhao@gmail.com@21:1/5 to Janis Papanagnou on Sun Nov 28 21:10:16 2021
    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
    On 29.11.2021 01:29, hongy...@gmail.com wrote:
    On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:
    Yes, but your numbers below are of little expressiveness, they might
    also result just from caching effects in that given magnitude. Often
    you get a magnitude or more speed increase. Note that post-processing
    with sort unnecessarily affects any speed comparisons of find usage
    variants. Note also that the find built-in -exec/+ could also be done
    using xargs (-print0 | xargs -0).

    Lew Pitcher told the following shortcoming of xargs based solution [1]:

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number of files found, /and/ the maximum size of an argument vector (argv[]), there is a chance that the file bbb will only contain the sorted contents of
    a subset of the files found by find(1).

    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

    Hence, I try to avoid using xargs.
    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Yes. They give exactly the same result:

    $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
    $

  • From Janis Papanagnou@21:1/5 to hongy...@gmail.com on Mon Nov 29 07:15:14 2021
    On 29.11.2021 06:10, hongy...@gmail.com wrote:
    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Yes. They give exactly the same result:

    As explained [in another context] more thoroughly upthread that may
    also just be coincidence; it's a hint but it's certainly no proof.
    (A difference would have proven it false, but no difference doesn't
    say anything, strictly speaking.)


    $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
    $

    BTW, with 'xargs', in the general case it's usually better to use
    NUL-separated data, as in

    find ./source -type f -print0 | xargs -0 cat | sort -u
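
    Why NUL separation matters can be seen with a file name that contains a
    blank. This is a small sketch with a made-up file name; it assumes a
    find(1) and xargs(1) that support -print0 and -0 (GNU and BSD versions
    do).

```shell
# Hypothetical demo data: one file whose name contains a space.
demo=$(mktemp -d)
mkdir "$demo/source"
printf 'hello\n' > "$demo/source/two words.txt"

# Default splitting: xargs breaks the name at the blank, so cat is handed
# two names that do not exist and produces no output.
plain=$(find "$demo/source" -type f | xargs cat 2>/dev/null | sort -u)

# NUL-delimited: the name reaches cat in one piece.
safe=$(find "$demo/source" -type f -print0 | xargs -0 cat | sort -u)

printf 'plain: [%s]\nsafe:  [%s]\n' "$plain" "$safe"
```

    The plain pipeline should come back empty here, while the NUL-separated
    one yields the file's single line.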


    Janis

  • From Janis Papanagnou@21:1/5 to hongy...@gmail.com on Mon Nov 29 07:05:08 2021
    On 29.11.2021 06:01, hongy...@gmail.com wrote:

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Wonderful explanation, which hits the flaw of my knowledge. But I
    still want to know what is the probability that this problem will
    occur.

    It's not so much a question about the probability but rather about
    the reliability of a construct.

    OTOH, I think this problem is, to some extent, related to the
    following values:

    $ xargs --show-limits
    [...]

    The issue stems from the fact of a limited exec-buffer size and
    that [shell-external] commands will operate on that limited buffer.
    Whenever your sample size - actually the argument list size - will
    exceed that limit the outcome is unreliable and depends on the data
    used; it may work in 10 cases and fail in 100, or vice versa, it
    may work for all your application cases (because you are operating
    only on toy data), or it may always fail (because you are working
    with huge amounts of scientific data), or anything else.

    To understand the issue it suffices to assume small values, say a
    buffer-size of 15 and a few short arguments.

    Say you have the file arguments A B C D ... Z and want to sort
    them. Say in the buffer there's room for only 5, so that sorting
    with above 'find'-based constructs will result in many calls;
    sort A B C D E
    sort F G H I J
    ...
    sort Z
    and the output will be the concatenation of the individual calls.
    A..E will be sorted, F..J will be sorted, etc. but A..Z will not
    be sorted after the concatenation of the individual sorted parts.
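
    The A..Z scenario above can be tried on a small scale. In this sketch
    (file names and contents are made up) xargs -n 1 stands in for a full
    exec buffer by forcing one sort call per file:

```shell
# Hypothetical demo data: two unsorted files.
demo=$(mktemp -d)
printf 'd\na\n' > "$demo/A"
printf 'c\nb\n' > "$demo/B"

# -n 1 forces a separate sort run per file, mimicking a full argument buffer.
chunked=$(printf '%s\n' "$demo/A" "$demo/B" | xargs -n 1 sort -u)

# Concatenate first, then sort exactly once.
whole=$(printf '%s\n' "$demo/A" "$demo/B" | xargs cat | sort -u)

# Unquoted expansion joins the lines with blanks for compact display.
printf 'chunked: %s\n' "$(echo $chunked)"
printf 'whole:   %s\n' "$(echo $whole)"
```

    Each chunk is sorted on its own (a d, then b c), but only the second
    pipeline gives a b c d.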

    Very subtle errors can occur this way if one is not aware of that
    fact; the result may look correct if one looks at the first few MB
    of the result, but may actually be wrong.

    Whether other tools (like the one mentioned below) circumvent the
    exec-buffer issue must be checked - but I wouldn't expect they do.
    What a tool would need is either the ability to see all data in
    one call, or a way to create partly sorted data and make further
    sort runs on that partly sorted data; merge sort is an algorithm
    that works that way (it was used on sequentially operating tape
    archives, especially in former times).

    Janis


    Anyway, if the xargs failed to do the trick, maybe parallel [1]
    doesn't have this issue.

    [1] https://www.gnu.org/software/parallel/


  • From hongyi.zhao@gmail.com@21:1/5 to Janis Papanagnou on Mon Nov 29 00:21:24 2021
    On Monday, November 29, 2021 at 2:15:19 PM UTC+8, Janis Papanagnou wrote:
    On 29.11.2021 06:10, hongy...@gmail.com wrote:
    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Yes. They give exactly the same result:
    As explained [in another context] more thoroughly upthread that may
    also just be coincidence; it's a hint but it's certainly no proof.
    (A difference would have proven it false, but no difference doesn't
    say anything, strictly speaking.)

    I don't quite understand what you mean above. What I mean is that the two methods you said *don't* have that issue give exactly the same result.


    $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
    $
    BTW, with 'xargs', in the general case it's usually better to use NUL-separated data, as in

    find ./source -type f -print0 | xargs -0 cat | sort -u

    Thank you for pointing this out.

  • From Janis Papanagnou@21:1/5 to hongy...@gmail.com on Mon Nov 29 10:08:44 2021
    On 29.11.2021 09:21, hongy...@gmail.com wrote:
    On Monday, November 29, 2021 at 2:15:19 PM UTC+8, Janis Papanagnou
    wrote:
    On 29.11.2021 06:10, hongy...@gmail.com wrote:

    Yes. They give exactly the same result:
    As explained [in another context] more thoroughly upthread that may
    also just be coincidence; it's a hint but it's certainly no proof.
    (A difference would have proven it false, but no difference
    doesn't say anything, strictly speaking.)

    I don't quite understand what you mean above. Here, I mean, the two
    methods you mentioned *don't* have that issue give exactly the same
    result.

    The reasoning to assume that both are equivalent is non-conclusive.
    * Observing a difference means you have proven it _wrong_.
    * Observing no difference means you have _not proven_ it wrong
    (other tests might still prove it wrong).
    Not being able to prove something wrong does not automatically mean
    that it is correct. - Think about it.
    It may be true but that's not proven.
    You might be able to change your data in a way where it fails.
    Mind, I didn't say it's wrong, I said it's not proven to be correct.

    Physics (and other sciences) is full of attempts to prove something
    wrong without any chance of proving something correct.

    Janis

  • From Chris Elvidge@21:1/5 to Janis Papanagnou on Mon Nov 29 10:16:00 2021
    On 29/11/2021 06:15 am, Janis Papanagnou wrote:
    On 29.11.2021 06:10, hongy...@gmail.com wrote:
    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
    It's necessary to understand the mechanics behind the tools to make
    an educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.

    Yes. They give exactly the same result:

    As explained [in another context] more thoroughly upthread that may
    also just be coincidence; it's a hint but it's certainly no proof.
    (A difference would have proven it false, but no difference doesn't
    say anything, strictly speaking.)


    You're trying to use logic again.


    $ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
    $

    BTW, with 'xargs', in the general case it's usually better to use NUL-separated data, as in

    find ./source -type f -print0 | xargs -0 cat | sort -u


    Janis



    --
    Chris Elvidge
    England

  • From Janis Papanagnou@21:1/5 to Chris Elvidge on Mon Nov 29 12:20:55 2021
    On 29.11.2021 11:16, Chris Elvidge wrote:

    You're trying to use logic again.

    Don't recall; was that something bad?

    Janis

  • From Chris Elvidge@21:1/5 to Janis Papanagnou on Mon Nov 29 12:28:07 2021
    On 29/11/2021 11:20 am, Janis Papanagnou wrote:
    On 29.11.2021 11:16, Chris Elvidge wrote:

    You're trying to use logic again.

    Don't recall; was that something bad?

    Janis


    It is w.r.t. HZ

    --
    Chris Elvidge
    England

  • From Lew Pitcher@21:1/5 to hongy...@gmail.com on Mon Nov 29 16:42:41 2021
    On Sun, 28 Nov 2021 21:01:29 -0800, hongy...@gmail.com wrote:

    On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou
    wrote:
    On 29.11.2021 01:29, hongy...@gmail.com wrote:
    On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou
    wrote:
    On 28.11.2021 03:04, hongy...@gmail.com wrote:
    [...]

    It seems that use the + variant of the -exec action is faster:
    Yes, but your numbers below are of little expressiveness, they might
    also result just from caching effects in that given magnitude. Often
    you get a magnitude or more speed increase. Note that
    post-processing with sort unnecessarily affects any speed
    comparisons of find usage variants. Note also that the find built-in
    -exec/+ could also be done using xargs (-print0 | xargs -0).

    Lew Pitcher told the following shortcoming of xargs based solution
    [1]:

    The second command depends on xargs(1) to provide sort(1) with a list
    of input files. As the size of this list depends on both the number
    of files found, /and/ the maximum size of an argument vector
    (argv[]), there is a chance that the file bbb will only contain the
    sorted contents of a subset of the files found by find(1).

    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ

    Hence, I try to avoid using xargs.
    It's necessary to understand the mechanics behind the tools to make an
    educated decision.

    The point is that (in the examples) the use of 'cat' is exactly to
    avoid that issue (otherwise we wouldn't need it and could just use
    'sort' at the place where you have used 'cat').

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have that
    issue.

    Wonderful explanation, which hits the flaw of my knowledge. But I still
    want to know what is the probability that this problem will occur.
    [snip]

    The caution, to me, is that unless you are aware of both the limits of the tools that you use (like the argv[] limits imposed through xargs(1) ), and
    the conditions under which you will use these tools (like the number of
    files that find(1) will find, to pass along to xargs(1)), you rely on
    "clever code". The problem with "clever code" isn't that it works, but
    that it can fail in a non-obvious and unexpected manner, which becomes difficult to detect, let alone debug.

    For your find|xargs|sort solution, how would you have known, in any
    particular execution of that pipeline, that find(1) would have exceeded
    the number of filenames that xargs(1) could pass to sort(1)?
    How would you debug that sort of problem?

    Brian Kernighan had, for a while, as his .sig something to the effect that
    Debugging is twice as hard as writing the code in the first place.
    Therefore, if you write the code as cleverly as possible, you are, by
    definition, not smart enough to debug it.

    The find|xargs|sort solution looks to fall under the heading of "clever
    code".

    Just my opinion, of course.
    --
    Lew Pitcher
    "In Skills, We Trust"

  • From Kaz Kylheku@21:1/5 to Chris Elvidge on Mon Nov 29 18:29:24 2021
    On 2021-11-27, Chris Elvidge <chris@mshome.net> wrote:
    On 27/11/2021 01:25 pm, hongy...@gmail.com wrote:
    See my following testings:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    $ wc aaa bbb ccc
    17715 17898 157889 aaa
    875968 1063102 9971040 bbb
    875968 1063102 9971040 ccc
    1769651 2144102 20099969 total

    So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.

    Regards,
    HZ


    What, exactly, are you trying to do?

    To avoid learning how to understand programming, while a child goes from
    birth to graduating with a master's degree in CS.

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

  • From Geoff Clare@21:1/5 to hongy...@gmail.com on Tue Nov 30 13:43:06 2021
    hongy...@gmail.com wrote:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc

    Using the following method will get the same results as other methods:

    $ find ./source -type f -exec sort -u {} + > aaa

    Only if find is able to pass all the pathnames to a single execution
    of sort. If there is more than one execution of sort, then this will
    give a different result than all three of the other methods.

    Unlike the first two, it will not lose any data, but the output will
    be in "chunks" where each chunk contains sorted and de-duped data,
    but the overall output is highly likely to be disordered at the chunk boundaries and it may contain duplicates from different chunks.
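
    One way to check a result file for this kind of damage is sort -c -u,
    which POSIX specifies to fail when the input is out of order or contains
    duplicate lines. A small sketch, with made-up files and xargs -n 1
    standing in for multiple executions of sort:

```shell
# Hypothetical demo data: the line "b" occurs in both files.
demo=$(mktemp -d)
printf 'b\na\n' > "$demo/A"
printf 'b\nc\n' > "$demo/B"

# Two separate sort runs whose outputs are simply concatenated.
printf '%s\n' "$demo/A" "$demo/B" | xargs -n 1 sort -u > "$demo/chunked"

# A single sort run over the concatenated input.
cat "$demo/A" "$demo/B" | sort -u > "$demo/whole"

# sort -c -u exits non-zero on disorder or on duplicate lines.
if sort -c -u "$demo/chunked" 2>/dev/null; then chunked_ok=yes; else chunked_ok=no; fi
if sort -c -u "$demo/whole" 2>/dev/null; then whole_ok=yes; else whole_ok=no; fi

echo "chunked ok: $chunked_ok, whole ok: $whole_ok"
```

    Here the chunked file happens to be in order but carries "b" twice, so
    the check rejects it. A passing check on one data set says nothing about
    the next run, so this is a safety net rather than a fix.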

    Therefore, the second method is not as reliable as the following two methods and should be avoided:

    $ find ./source -type f -exec sort -u {} + > aaa
    $ find ./source -type f -exec cat {} + | sort -u -o ccc

    The first of these two is not reliable, the second is reliable (as
    is the previous version with \; instead of +, but the version
    with + is more efficient).

    --
    Geoff Clare <netnews@gclare.org.uk>

  • From hongyi.zhao@gmail.com@21:1/5 to Geoff Clare on Tue Nov 30 22:56:33 2021
    On Tuesday, November 30, 2021 at 10:11:06 PM UTC+8, Geoff Clare wrote:
    hongy...@gmail.com wrote:

    $ find ./source -type f -exec sort -u {} -o aaa \;
    $ find ./source -type f | xargs sort -o bbb -u
    $ find ./source -type f -exec cat {} \;| sort -u -o ccc
    Using the following method will get the same results as other methods:

    $ find ./source -type f -exec sort -u {} + > aaa
    Only if find is able to pass all the pathnames to a single execution
    of sort. If there is more than one execution of sort, then this will
    give a different result than all three of the other methods.

    Unlike the first two, it will not lose any data, but the output will
    be in "chunks" where each chunk contains sorted and de-duped data,
    but the overall output is highly likely to be disordered at the chunk boundaries and it may contain duplicates from different chunks.
    Therefore, the second method is not as reliable as the following two methods and should be avoided:

    $ find ./source -type f -exec sort -u {} + > aaa
    $ find ./source -type f -exec cat {} + | sort -u -o ccc
    The first of these two is not reliable, the second is reliable (as
    is the previous version with \; instead of +, but the version
    with + is more efficient).

    Thank you for your elaboration. Your analysis basically coincides with the one given by Janis Papanagnou [1]:

    I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
    issue.

    "find -exec cat | sort" or "find | xargs cat | sort" *don't* have
    that issue.


    As a result, I currently use the following two approaches:

    $ find ./source -type f -exec cat {} + | sort -uo american-english-exhaustive
    or
    $ find ./source -type f -print0 | xargs -0 cat | sort -uo american-english-exhaustive


    [1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/K6bf2bHABAAJ
