An Opinionated Guide to xargs
source link: https://www.oilshell.org/blog/2021/08/xargs.html
This post has everything you need to know about xargs, an essential tool for shell programming. It's based on my #comments on this "wrong" post:
xargs considered harmful (codefaster.substack.com, via lobste.rs): 24 points, 31 comments on 2021-07-16
Apologies to the author for the criticism, but it generated a great discussion! Here is what lies ahead:
- An introduction to xargs, and a discussion of alternatives.
- Tips on using it, accompanied by #sample-code in the blog-code/xargs directory.
- High-level thoughts on shell and the Oil language.
Preliminaries
What Is xargs?
It's an adapter between text streams and argv arrays, two essential concepts in shell. You pass it flags that specify how to split stdin. Then it generates arguments and invokes processes. Example:
$ echo 'alice bob' | xargs -n 1 -- echo hi
hi alice
hi bob
What's happening here?
- xargs splits the input stream on whitespace, producing 2 arguments, alice and bob.
- We passed -n 1, so xargs then passes each argument to a separate echo hi $ARG command. By default, it passes as many args to a command as possible, like echo hi alice bob.
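To see the difference between the default batching and -n 1 side by side, here is a minimal sketch. It assumes bash and an xargs that accepts `--` (GNU xargs does); the variable names are only for illustration.

```shell
# Default: xargs packs as many arguments as possible into one command.
batched=$(echo 'alice bob' | xargs -- echo hi)

# -n 1: xargs starts one command per argument.
one_each=$(echo 'alice bob' | xargs -n 1 -- echo hi)

printf 'default batching:\n%s\n' "$batched"
printf 'with -n 1:\n%s\n' "$one_each"
```

The first variable holds a single line and the second holds one line per argument, which is the whole difference between the two modes.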
It may help to mentally replace xargs with the word each. As in, for each word, line, or token, invoke this process with these args. In fact, I propose an each builtin for the Oil language below.
(This explanation was derived from this comment on the same thread.)
Which Flags Should I Know About?
You should know how to control:
- The algorithm for splitting text into arguments (-d, -0). Discussed below.
- How many arguments are passed to each process (-n). This determines the total number of processes started.
- Whether processes are run in sequence or in parallel (-P).
When Globs Are Enough
The blog post suggests rm $(ls | grep foo) as an alternative to xargs. In this case, it's better to just use a glob, which is built into the shell:
rm *foo*
Prefer xargs Over Shell's Word Splitting
Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful. (Related: Oil Doesn't Require Quoting Everywhere.)
For example, if you want the power of regexes to filter names, you can pipe to egrep, then explicitly split its output by newlines:
# Remove Python and C++ unit tests
ls | egrep '.*_test\.(py|cc)' | xargs -d $'\n' -- rm
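The word-splitting hazard is easy to reproduce. The sketch below assumes bash and GNU xargs (for -d), and uses two made-up filenames in a scratch directory. It counts how many arguments each approach produces from the same ls output.

```shell
# Scratch directory with two files, one of which has a space in its name.
tmp=$(mktemp -d)
touch "$tmp/foo one.txt" "$tmp/foo_two.txt"

# Unquoted $() goes through shell word splitting: the space splits one name in two.
words=( $(ls "$tmp") )
split_count=${#words[@]}

# xargs -d $'\n' treats each line as exactly one argument.
line_count=$(ls "$tmp" | xargs -d $'\n' -n 1 -- echo | wc -l | tr -d ' ')

echo "shell word splitting: $split_count args; xargs -d newline: $line_count args"
rm -r "$tmp"
```

With two files on disk, the unquoted substitution yields three words, so a command like rm would be handed names that don't exist.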
Usage Tips
Now that we've introduced xargs and discussed alternatives, here's some advice for using it.
Choose One of 3 Ways of Splitting stdin
In the comment, I suggest using only these three styles of splitting:
- xargs (the default): when you want "words" without spaces. For example, you can produce two args from the string 'alice bob'.
- xargs -d $'\n': When you want the args to be lines, as in the egrep example above. (Note that $'\n' is bash syntax for a newline character, and Oil uses this syntax too.)
- xargs -0: When you want to handle untrusted data. Someone could put a newline in a filename, but this is safe with NUL-delimited tokens.
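The three styles can be compared by counting the arguments each one produces. This is a sketch under the assumption of bash and GNU xargs; the `sh -c 'echo $#' sh` trick is just a hypothetical stand-in for a real command, and it prints how many arguments it received.

```shell
# Style 1, default: split on whitespace. Three words, so three args.
by_words=$(printf '%s\n' 'alice bob' 'carol' | xargs -- sh -c 'echo $#' sh)

# Style 2, -d $'\n': split on newlines only. Two lines, so two args.
by_lines=$(printf '%s\n' 'alice bob' 'carol' | xargs -d $'\n' -- sh -c 'echo $#' sh)

# Style 3, -0: split on NUL bytes. A token may even contain a newline.
by_nul=$(printf '%s\0' 'alice bob' $'new\nline' | xargs -0 -- sh -c 'echo $#' sh)

echo "words=$by_words lines=$by_lines nul=$by_nul"
```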
Most of my scripts use the second style, and occasionally the third. Unix
tools generally work better on streams of lines than streams of "words" or
NUL-delimited tokens. Those formats make it harder to filter the list of
tasks/items. That is, grep doesn't support an analogous -0 flag.
(This is one motivation for Oil's QSN serialization format. It's line-based, so regular grep still works. It can also represent every string, including those with NULs.)
xargs Can Invoke Shell Functions With the $0 Dispatch Pattern
The original post discusses xargs -I {}, which allows you to control where each argument is substituted in the argv array.
I occasionally use -I, but more often I use xargs with what I call the $0 Dispatch Pattern. I outlined this shell programming pattern last month, but I still need to elaborate on it.
The basic idea is to avoid the mini language of -I {} and just use shell, by recursively invoking shell functions. I use this all over Oil's own shell scripts, and elsewhere.
Example:
do_one() {
  # Rather than xargs -I {}, it's more flexible to
  # use a function with $1
  echo "Do something with $1"
  cp --verbose "$1" /tmp
}

do_all() {
  # Call the do_one function for each item.
  # Also add -P to make it parallel
  cat tasks.txt | xargs -n 1 -d $'\n' -- $0 do_one
}

"$@"  # dispatch on $0; or use 'runproc' in Oil
Now run this script with either:
- my_script.sh do_one $ARG to test the work that's done on each item. You want to make this correct first.
- my_script.sh do_all to do work on all items.
This breaks the problem down nicely: make it work on one item, and then figure out which items to run it on. When you combine them, they will work, unlike the "sed into bash" solution given in the original post.
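If you want to try the pattern without touching real files, here is a self-contained sketch. It assumes bash, GNU xargs, and a temp directory that allows executing scripts. The script name demo.sh and the two task names are made up, and do_one only echoes instead of copying anything.

```shell
# Write a throwaway script that uses the dispatch pattern.
tmp=$(mktemp -d)
cat > "$tmp/demo.sh" <<'EOF'
#!/usr/bin/env bash
do_one() {
  echo "Do something with $1"
}
do_all() {
  # Hypothetical task list; "$0" is quoted so a path with spaces would survive.
  printf '%s\n' 'a b.txt' 'c.txt' | xargs -n 1 -d $'\n' -- "$0" do_one
}
"$@"  # dispatch: the first argument names the function to run
EOF
chmod +x "$tmp/demo.sh"

# First test the work on a single item, then run it on all items.
out_one=$("$tmp/demo.sh" do_one 'a b.txt')
out_all=$("$tmp/demo.sh" do_all)
printf '%s\n---\n%s\n' "$out_one" "$out_all"
rm -r "$tmp"
```

Because xargs re-invokes the script itself, do_one runs in a fresh process for each task, with the task as $1.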
In other words: Use the Shell Language, Not Mini-Languages Like xargs -I {}. This reduces language cacophony.
Preview Tasks With an echo Prefix
Before running a command like:
$ cat tasks.txt | xargs -n 1 -- $0 do_one
It's often useful to preview it with echo:
$ cat tasks.txt | xargs -n 1 -- echo $0 do_one
demo.sh do_one filename
demo.sh do_one with
demo.sh do_one spaces.txt
# Oops! We split the input the wrong way.
# We wanted xargs -d $'\n'.
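Here is the same preview as a runnable sketch, with the wrong and the corrected splitting next to each other. It assumes bash and GNU xargs. The literal string demo.sh stands in for $0, and the two task names are invented.

```shell
tasks=$'filename with spaces.txt\nplain.txt'

# Default whitespace splitting: the first task is broken into three commands.
wrong=$(printf '%s\n' "$tasks" | xargs -n 1 -- echo demo.sh do_one)

# Newline splitting: one command per task, spaces preserved.
right=$(printf '%s\n' "$tasks" | xargs -n 1 -d $'\n' -- echo demo.sh do_one)

printf 'wrong:\n%s\n\nright:\n%s\n' "$wrong" "$right"
```

Since echo only prints the command lines, nothing is executed until you remove the prefix.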
xargs -P Automatically Parallelizes Tasks
In the do_all example above, you can add -P 8 to the xargs invocation to automatically parallelize it! For example, if you have 1000 independent tasks, xargs will run up to 8 of them at a time, so it can keep 8 CPUs busy.
I've used -P 32 to make day-long jobs take an hour! You can't do that with a for loop.
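A quick way to convince yourself is to time some fake tasks. This sketch assumes bash (for the SECONDS counter), GNU xargs, and seq. Each "task" is a hypothetical sh -c one-liner that sleeps for a second, standing in for real work.

```shell
SECONDS=0  # bash's built-in elapsed-time counter, in whole seconds

# Four one-second tasks, up to four at a time. Output order is not
# deterministic under -P, so sort it before using it.
results=$(seq 1 4 | xargs -n 1 -P 4 -- sh -c 'sleep 1; echo "task $1 done"' sh | sort)
elapsed=$SECONDS

echo "$results"
echo "elapsed: about ${elapsed}s (run in sequence, this would take about 4s)"
```

Dropping -P 4 from the command makes the same four tasks run one after another.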
This is one of my favorite tricks, and 3 years ago I gave a 5 minute presentation about it at #recurse-center.
Try xargs -P Before GNU Parallel
Some shell users use GNU parallel to parallelize processes. I avoid it because it has yet another mini-language with {} and :::.
I don't think there are any problems GNU parallel can solve that xargs -P combined with shell functions can't solve. Let me know if you have a counterexample.
Use -n To Batch Args, not -L
The original article talks a lot about xargs -L, which I never use. It looks like a mini data language to be avoided: e.g. trailing blanks meaning something special.
I asserted that -n was always better than -L in the comments, and nobody found a counterexample.
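For reference, this is what batching with -n looks like. It's a minimal sketch, assuming bash and an xargs that accepts `--`; the five single-letter items are placeholders.

```shell
# Five items, at most two per command: three invocations of echo.
batches=$(printf '%s\n' a b c d e | xargs -n 2 -- echo batch:)
echo "$batches"
```

The last command simply gets whatever is left over, here a single item.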
Benefits Of This Style
Start One rm Process, Not 10,000
A lobste.rs user asked why you would use find | xargs rather than find -exec. The answer is that it can be much faster. If you're trying to rm 10,000 files, you can start one process instead of 10,000 processes!
It's basically
rm one two three
instead of
rm one
rm two
rm three
Other commenters pointed out that you can use find -exec + instead of find -exec \;, but I'd say that's another mini-language to be avoided.
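You can count the processes yourself by swapping rm for echo, which prints one line per invocation. The sketch below assumes bash (for brace expansion), GNU xargs, and a scratch directory of 200 empty files with made-up names.

```shell
tmp=$(mktemp -d)
touch "$tmp"/file{1..200}.txt

# Default batching: as few echo processes as the argument-size limit allows.
batched=$(find "$tmp" -type f | xargs -d $'\n' -- echo | wc -l | tr -d ' ')

# -n 1: one echo process per file, like find -exec \;
per_item=$(find "$tmp" -type f | xargs -d $'\n' -n 1 -- echo | wc -l | tr -d ' ')

echo "batched: $batched process(es); one per item: $per_item processes"
rm -r "$tmp"
```

On a typical system the batched count is 1 for an input this small, but that depends on the limits xargs computes, so treat the exact number as environment-specific.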
Recap
To repeat, here are the benefits of the style I advocate:
- A Clean Problem Separation: Figure out what to do on each item (what's a task?), then figure out what items to do it on (what tasks should I run?)
- Easy Testing by previewing tasks with echo. This avoids running long batch jobs on the wrong input!
- Better Performance. xargs lets you start as few processes as possible.
  - It also lets you start those processes in parallel. You can't do this with a for loop.
- Fewer Languages to Remember. We use plain shell and a few flags to xargs.
Conclusions
This post explained xargs, gave advice on using it, and justified the advice.
The most important takeaway is that you can invoke and parallelize shell
functions with xargs, via the $0 Dispatch Pattern.
I use this pattern all the time, but I've rarely seen it used in the wild. Try it out and let me know what you think! You can start from the sample code in the blog-code/xargs directory.
Slogan: Shell-Centric Shell Programming
You may have noticed a high level pattern to this advice: We avoid "mini-languages" in various tools, and use the shell language instead:
- Shell functions and $1, instead of xargs -I {}
- xargs and xargs -n 1 instead of find -exec + and find -exec \;
- -n instead of -L (to avoid an ad hoc data language)
- xargs with simple flags instead of GNU parallel
- Bash syntax $'\n' instead of tool syntax '\n'. Using the shell is simpler than relying on every tool to understand the 2 character escape sequence \n.
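The last point is easy to check in bash: the two spellings produce different strings before any tool sees them. A minimal sketch, with illustrative variable names:

```shell
bash_newline=$'\n'  # bash expands this to a single newline character
tool_escape='\n'    # two literal characters, backslash and n, left for the tool

echo "lengths: ${#bash_newline} ${#tool_escape}"  # prints: lengths: 1 2
```

With $'\n', the tool receives an actual newline byte and doesn't need to know about escape sequences at all.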
(Remember that I'm working on Slogans, Fallacies, and Concepts for shell programming.)
Oil Language: each Builtin
I sketched an idea for each and every builtins in Oil in this comment on the same thread. On second thought, I think we should only have each, and it should look something like this:
# Items are lines by default.
# Start as few processes as possible, like xargs
# and 'find -exec +'
find . | each {
  rm --verbose @_items  # remove many files
}

# Start one process for each item, like 'xargs -n 1'
# and 'find -exec \;'
find . | each --one {
  echo "Do something with $_item"
}

# Parallelize it
find . | each --one -P 8 {
  echo "Do something with $_item"
  sleep 0.1
}
So the separate do_one and do_all functions are avoided with Ruby-like blocks in the Oil language. Just like the cd builtin accepts a block, the each builtin can as well.
Let me know what you think!
Appendices
Alternative Shell Challenge #2
I issued an "alternative shell challenge" last year: Can you redirect stdout
of a shell function that invokes both builtins and external processes?
- Four More Posts in "Shell: The Good Parts" (February 2020)
Here's another challenge for alternative shells: can you parallelize your shell's notion of functions with xargs -P 8? As shown above, it's done in Bourne shell with xargs -P 8 -- $0 myfunc.
Both of these challenges relate to shell functions and the Perlis-Thompson
Principle, an important idea in
#software-architecture. They explain why the Oil
language is designed around procs
rather than Python- or
JavaScript-like functions.
More Comments
- Comment on a shell injection problem in the original post. I referenced it in a later comment on shell injections.
  - sed | bash is a bad, dangerous pattern. You can pipe data rather than code.
  - String hygiene was one of the concepts in last month's Summer Blog Backlog: Understanding and Using Shell.
- Tip on GNU xargs --show-limits. If your input stream produces too many arguments, xargs will invoke multiple processes, even when -n isn't specified.
  - However, I've never encountered a case where this matters. That is, it's OK if you need 1 or 2 rm processes, but not 10,000! (for performance)
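To look at the limits on your own machine, something like the following should work. It assumes GNU xargs, since --show-limits and -r are GNU options. The report goes to stderr, so the sketch redirects it, and the exact wording and numbers vary by system.

```shell
# Print the argument-size limits xargs will use. With empty stdin and -r,
# no command is actually run.
limits=$(xargs --show-limits -r </dev/null 2>&1)
echo "$limits"
```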