See my following tests:
$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
$ wc aaa bbb ccc
17715 17898 157889 aaa
875968 1063102 9971040 bbb
875968 1063102 9971040 ccc
1769651 2144102 20099969 total
So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.
Regards,
HZ
On Saturday, November 27, 2021 at 10:27:32 PM UTC+8, Lew Pitcher wrote:
On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:
See my following testings:
$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
$ wc aaa bbb ccc
17715 17898 157889 aaa
875968 1063102 9971040 bbb
875968 1063102 9971040 ccc
1769651 2144102 20099969 total
So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur.
There's no mistake. The first command doesn't do the same thing
as the second and third commands do.
The first command overwrites file aaa with the sorted contents of each file found. While file aaa will, at times, contain the contents of each file (sorted, of course), its final contents are the sorted contents of the /last/ file found by the find(1) command.
Thank you for your prompt reply. Using the following method will get the same results as the other methods:
$ find ./source -type f -exec sort -u {} + > aaa
The other two commands /attempt/ to sort the contents of /all/ the found files into files bbb and ccc respectively.
The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of files found, /and/ the maximum size of an argument vector (argv[]), there is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).
The third command uses cat(1) to concatenate the contents of all the found files into one stream that sort(1) will sort into file ccc.
Therefore, the second method is not as reliable as the following two methods and should be avoided:
$ find ./source -type f -exec sort -u {} + > aaa
$ find ./source -type f -exec cat {} + | sort -u -o ccc
On Sat, 27 Nov 2021 05:25:24 -0800, hongy...@gmail.com wrote:
See my following testings:
$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
$ wc aaa bbb ccc
17715 17898 157889 aaa
875968 1063102 9971040 bbb
875968 1063102 9971040 ccc
1769651 2144102 20099969 total
So, the first usage seems to be incorrect. But I can't understand why
such a mistake would occur.
There's no mistake. The first command doesn't do the same thing
as the second and third commands do.
The first command overwrites file aaa with the sorted contents of each
file found. While file aaa will, at times, contain the contents of each
file (sorted, of course), its final contents are the sorted contents of
the /last/ file found by the find(1) command.
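The overwriting can be reproduced on a small scale. The following is only a sketch: the scratch directory and the two tiny files are made up for illustration, and -o is placed before the file operand for portability.

```shell
#!/bin/sh
# Hypothetical demo data: two small files in a scratch directory.
tmp=$(mktemp -d)
mkdir "$tmp/source"
printf 'b\na\n' > "$tmp/source/one"
printf 'd\nc\n' > "$tmp/source/two"

# One sort per file; every run truncates and rewrites the same output file.
find "$tmp/source" -type f -exec sort -u -o "$tmp/aaa" {} \;

# Only the lines of whichever file find visited last are left in aaa.
lines_in_aaa=$(wc -l < "$tmp/aaa" | tr -d ' ')
lines_total=$(cat "$tmp/source/one" "$tmp/source/two" | sort -u | wc -l | tr -d ' ')
echo "aaa: $lines_in_aaa lines, all files together: $lines_total lines"
rm -rf "$tmp"
```

Which of the two files ends up in aaa depends on the order in which find visits them, which is not specified.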
The other two commands /attempt/ to sort the contents of /all/ the found files into files bbb and ccc respectively.
The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of
files found, /and/ the maximum size of an argument vector (argv[]), there
is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).
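The effect of a split argument list can be simulated without thousands of files. In this sketch, xargs -n 1 is an assumption used to force one file per sort invocation; the file names and contents are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/source"
for name in f1 f2 f3; do
    printf '%s\n' "line from $name" > "$tmp/source/$name"
done

# -n 1 forces one file per sort invocation, standing in for an argument
# list that is too long to fit into a single command line.
find "$tmp/source" -type f | xargs -n 1 sort -u -o "$tmp/bbb"

# Each invocation rewrote bbb, so only the last batch is left.
bbb_lines=$(wc -l < "$tmp/bbb" | tr -d ' ')
echo "bbb holds $bbb_lines of 3 input lines"
rm -rf "$tmp"
```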
The third command uses cat(1) to concatenate the contents of all the found files into one stream that sort(1) will sort into file ccc.
[...]
It seems that using the + variant of the -exec action is faster:
$ time find ./source -type f -exec cat {} \;| sort -u -o ccc
real 0m3.251s
user 0m3.188s
sys 0m0.066s
$ time find ./source -type f -exec cat {} + | sort -u -o ccc
real 0m2.895s
user 0m2.827s
sys 0m0.075s
Regards,
HZ
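One way to see where a difference between the two variants could come from is to count how often the command is started, rather than to time it. A minimal sketch, assuming five hypothetical files; the inner sh -c only prints a marker line per invocation.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/source"
for i in 1 2 3 4 5; do
    echo "data $i" > "$tmp/source/file$i"
done

# Each started command prints one line, so wc -l counts the invocations.
runs_semicolon=$(find "$tmp/source" -type f -exec sh -c 'echo started' sh {} \; | wc -l | tr -d ' ')
runs_plus=$(find "$tmp/source" -type f -exec sh -c 'echo started' sh {} + | wc -l | tr -d ' ')

printf 'terminated by ";": %s invocations, terminated by "+": %s\n' "$runs_semicolon" "$runs_plus"
rm -rf "$tmp"
```

With many files, the per-file variant pays the process start-up cost once for every file, while the + variant pays it once per batch.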
On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
On 28.11.2021 03:04, hongy...@gmail.com wrote:
[...]
It seems that use the + variant of the -exec action is faster:
Yes, but your numbers below are of little expressiveness, they might
also result just from caching effects in that given magnitude. Often
you get a magnitude or more speed increase. Note that post-processing
with sort unnecessarily affects any speed comparisons of find usage
variants. Note also that the find built-in -exec/+ could also be done
using xargs (-print0 | xargs -0).
Lew Pitcher pointed out the following shortcoming of the xargs-based solution [1]:
The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of
files found, /and/ the maximum size of an argument vector (argv[]), there
is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).
[1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ
Hence, I try to avoid using xargs.
On 28.11.2021 03:04, hongy...@gmail.com wrote:
[...]
It seems that use the + variant of the -exec action is faster:
Yes, but your numbers below are of little expressiveness, they might
also result just from caching effects in that given magnitude. Often
you get a magnitude or more speed increase. Note that post-processing
with sort unnecessarily affects any speed comparisons of find usage
variants. Note also that the find built-in -exec/+ could also be done
using xargs (-print0 | xargs -0).
Janis
$ time find ./source -type f -exec cat {} \;| sort -u -o ccc
real 0m3.251s
user 0m3.188s
sys 0m0.066s
$ time find ./source -type f -exec cat {} + | sort -u -o ccc
real 0m2.895s
user 0m2.827s
sys 0m0.075s
Regards,
HZ
On 29.11.2021 01:29, hongy...@gmail.com wrote:
On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou wrote:
On 28.11.2021 03:04, hongy...@gmail.com wrote:
[...]
It seems that use the + variant of the -exec action is faster:
Yes, but your numbers below are of little expressiveness, they might
also result just from caching effects in that given magnitude. Often
you get a magnitude or more speed increase. Note that post-processing
with sort unnecessarily affects any speed comparisons of find usage
variants. Note also that the find built-in -exec/+ could also be done
using xargs (-print0 | xargs -0).
Lew Pitcher told the following shortcoming of xargs based solution [1]:
The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number of files found, /and/ the maximum size of an argument vector (argv[]), there is a chance that the file bbb will only contain the sorted contents of
a subset of the files found by find(1).
[1] https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ
Hence, I try to avoid using xargs.
It's necessary to understand the mechanics behind the tools to make
an educated decision.
The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.
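The point about cat can be checked on a toy example. This is a sketch with hypothetical files; xargs -n 1 is used here only to force several cat invocations, as a stand-in for an argument list that gets split.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/source"
printf 'b\na\n' > "$tmp/source/one"
printf 'a\nc\n' > "$tmp/source/two"
printf 'c\nd\n' > "$tmp/source/three"

# cat may be started once (+) or once per file (xargs -n 1); either way
# its output is a single stream and sort runs exactly once at the end.
via_exec=$(find "$tmp/source" -type f -exec cat {} + | sort -u)
via_xargs=$(find "$tmp/source" -type f | xargs -n 1 cat | sort -u)

[ "$via_exec" = "$via_xargs" ] && echo "same result"
printf '%s\n' "$via_exec"
rm -rf "$tmp"
```

Because sort sees one stream, it does not matter how many times cat was started to produce it.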
On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
It's necessary to understand the mechanics behind the tools to make
an educated decision.
The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.
Yes. They give exactly the same result:
$ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
$
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.
Wonderful explanation, which exposes a gap in my knowledge. But I
still want to know how likely it is that this problem will occur.
OTOH, I think this problem is, to some extent, related to the
following values:
$ xargs --show-limits
[...]
Anyway, if xargs fails to do the trick, maybe parallel [1]
doesn't have this issue.
[1] https://www.gnu.org/software/parallel/
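Whether the list gets split is less a matter of probability than of whether the combined length of the pathnames fits within the system's argument limit. A rough sketch, assuming a POSIX getconf and taking the directory to inspect as an optional first argument (defaulting to the current directory as a stand-in for ./source). xargs implementations may apply a smaller buffer of their own, so the comparison is an upper bound only.

```shell
#!/bin/sh
# POSIX getconf reports the upper bound on argument plus environment bytes.
arg_max=$(getconf ARG_MAX)

# Rough size of the pathname list find would hand over; the newline after
# each name roughly matches the terminating NUL of each argument.
dir=${1:-.}
path_bytes=$(find "$dir" -type f | wc -c | tr -d ' ')

echo "ARG_MAX=$arg_max, pathname bytes under $dir: $path_bytes"
if [ "$path_bytes" -lt "$arg_max" ]; then
    echo "the list may fit into one invocation"
else
    echo "the list must be split across several invocations"
fi
```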
On 29.11.2021 06:10, hongy...@gmail.com wrote:
On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
It's necessary to understand the mechanics behind the tools to make
an educated decision.
The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.
Yes. They give exactly the same result:
As explained [in another context] more thoroughly upthread that may
also just be coincidence; it's a hint but it's certainly no proof.
(A difference would have proven it false, but no difference doesn't
say anything, strictly speaking.)
$ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
$
BTW, with 'xargs', in the general case it's usually better to use NUL-separated data, as in
find ./source -type f -print0 | xargs -0 cat | sort -u
On Monday, November 29, 2021 at 2:15:19 PM UTC+8, Janis Papanagnou
wrote:
On 29.11.2021 06:10, hongy...@gmail.com wrote:
Yes. They give exactly the same result:
As explained [in another context] more thoroughly upthread that may
also just be coincidence; it's a hint but it's certainly no proof.
(A difference would have proven it false, but no difference
doesn't say anything, strictly speaking.)
I don't quite understand what you mean above. I mean that the two
methods which you said *don't* have that issue give exactly the same
result.
On 29.11.2021 06:10, hongy...@gmail.com wrote:
On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou wrote:
It's necessary to understand the mechanics behind the tools to make
an educated decision.
The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that
issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have
that issue.
Yes. They give exactly the same result:
As explained [in another context] more thoroughly upthread that may
also just be coincidence; it's a hint but it's certainly no proof.
(A difference would have proven it false, but no difference doesn't
say anything, strictly speaking.)
$ diff <( find ./source -type f -exec cat {} + | sort -u ) <( find ./source -type f | xargs cat | sort -u )
$
BTW, with 'xargs', in the general case it's usually better to use NUL-separated data, as in
find ./source -type f -print0 | xargs -0 cat | sort -u
Janis
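The reason for preferring NUL-separated data shows up as soon as a pathname contains a blank. A minimal sketch with two hypothetical files; it assumes a find with -print0 and an xargs with -0, as in the command above.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/source"
printf 'x\n' > "$tmp/source/plain"
printf 'y\n' > "$tmp/source/with space"

# Blank/newline-separated input: xargs splits "with space" into two names,
# so cat fails on them and the line "y" is lost.
split_out=$(find "$tmp/source" -type f | xargs cat 2>/dev/null | sort -u)

# NUL-separated input: every pathname arrives as exactly one argument.
nul_out=$(find "$tmp/source" -type f -print0 | xargs -0 cat | sort -u)

printf 'plain xargs: %s\n' "$(echo $split_out)"
printf 'with -print0 and -0: %s\n' "$(echo $nul_out)"
rm -rf "$tmp"
```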
You're trying to use logic again.
On 29.11.2021 11:16, Chris Elvidge wrote:
You're trying to use logic again.
Don't recall; was that something bad?
Janis
On Monday, November 29, 2021 at 8:53:02 AM UTC+8, Janis Papanagnou
wrote:
On 29.11.2021 01:29, hongy...@gmail.com wrote:
On Sunday, November 28, 2021 at 10:36:35 AM UTC+8, Janis Papanagnou
wrote:
On 28.11.2021 03:04, hongy...@gmail.com wrote:
[...]
It seems that use the + variant of the -exec action is faster:
Yes, but your numbers below are of little expressiveness, they might
also result just from caching effects in that given magnitude. Often
you get a magnitude or more speed increase. Note that
post-processing with sort unnecessarily affects any speed
comparisons of find usage variants. Note also that the find built-in
-exec/+ could also be done using xargs (-print0 | xargs -0).
Lew Pitcher told the following shortcoming of xargs based solution
[1]:
The second command depends on xargs(1) to provide sort(1) with a list
of input files. As the size of this list depends on both the number
of files found, /and/ the maximum size of an argument vector
(argv[]), there is a chance that the file bbb will only contain the
sorted contents of a subset of the files found by find(1).
[1]
https://groups.google.com/g/comp.unix.shell/c/ha5t3U54GmY/m/M3tIUPtPBAAJ
[snip]
Hence, I try to avoid using xargs.
It's necessary to understand the mechanics behind the tools to make an
educated decision.
The point is that (in the examples) the use of 'cat' is exactly to
avoid that issue (otherwise we wouldn't need it and could just use
'sort' at the place where you have used 'cat').
I.e. "find -exec sort" or "find | xargs sort" _both_ *have* that issue.
"find -exec cat | sort" or "find | xargs cat | sort" *don't* have that
issue.
Wonderful explanation, which hits the flaw of my knowledge. But I still
want to know what is the probability that this problem will occur.
On 27/11/2021 01:25 pm, hongy...@gmail.com wrote:
See my following testings:
$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
$ wc aaa bbb ccc
17715 17898 157889 aaa
875968 1063102 9971040 bbb
875968 1063102 9971040 ccc
1769651 2144102 20099969 total
So, the first usage seems to be incorrect. But I can't understand why such a mistake would occur. Any hints will be highly appreciated.
Regards,
HZ
What, exactly, are you trying to do?
hongy...@gmail.com wrote:
$ find ./source -type f -exec sort -u {} -o aaa \;
$ find ./source -type f | xargs sort -o bbb -u
$ find ./source -type f -exec cat {} \;| sort -u -o ccc
Using the following method will get the same results as other methods:
$ find ./source -type f -exec sort -u {} + > aaa
Only if find is able to pass all the pathnames to a single execution
of sort. If there is more than one execution of sort, then this will
give a different result than all three of the other methods.
Unlike the first two, it will not lose any data, but the output will
be in "chunks" where each chunk contains sorted and de-duped data,
but the overall output is highly likely to be disordered at the chunk
boundaries and it may contain duplicates from different chunks.
Therefore, the second method is not as reliable as the following two methods and should be avoided:
$ find ./source -type f -exec sort -u {} + > aaa
$ find ./source -type f -exec cat {} + | sort -u -o ccc
The first of these two is not reliable, the second is reliable (as
is the previous version with \; instead of +, but the version
with + is more efficient).
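The "chunks" effect can be made visible on two small files. This is a sketch only: xargs -n 1 is an assumption that stands in for find splitting "-exec sort -u {} +" into several sort runs, and the file contents are hypothetical.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/source"
printf 'b\na\n' > "$tmp/source/one"
printf 'a\nc\n' > "$tmp/source/two"

# Several sort runs share one stdout; each contributes its own sorted,
# de-duplicated chunk, but nothing merges the chunks afterwards.
chunked=$(find "$tmp/source" -type f | xargs -n 1 sort -u)
# One sort over the concatenated stream is the reliable variant.
single=$(find "$tmp/source" -type f -exec cat {} + | sort -u)

chunked_lines=$(printf '%s\n' "$chunked" | wc -l | tr -d ' ')
single_lines=$(printf '%s\n' "$single" | wc -l | tr -d ' ')
echo "chunked: $chunked_lines lines (line 'a' appears twice), single sort: $single_lines lines"
rm -rf "$tmp"
```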