GNU Parallel can do anything, but is too complicated to be useful

 3 years ago
source link: https://www.ctrl.blog/entry/not-gnu-parallel.html

GNU Parallel is a utility that lets you run command jobs in parallel, both locally and on remote hosts over the network. It’s incredibly powerful when you need something more flexible than xargs, and it’s especially useful with small computer clusters.

You don’t need to set up and configure anything. There is no long-lived daemon process or central scheduler, as is customary in distributed compilers. You only need to set up key-based authentication for each remote host (so parallel can log in automatically), and install Perl and Parallel on each system. That’s it.
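Before involving Parallel at all, it can save time to confirm that each host accepts a non-interactive key-based login and has both programs on its PATH. The following is a minimal Python sketch of such a preflight check, not something Parallel provides. The function names and the `run` parameter (there so the check can be exercised without a network) are illustrative assumptions.

```python
import subprocess


def preflight_command(host):
    """Build an ssh command that fails unless key-based login works
    and both perl and parallel are found on the remote PATH."""
    return [
        "ssh",
        "-o", "BatchMode=yes",     # never prompt for a password; fail instead
        "-o", "ConnectTimeout=5",  # give up quickly on unreachable hosts
        host,
        "command -v perl >/dev/null && command -v parallel >/dev/null",
    ]


def host_ready(host, run=subprocess.run):
    """Return True when the preflight command exits with status 0."""
    result = run(preflight_command(host), capture_output=True)
    return result.returncode == 0


# Example (hypothetical host name):
#     host_ready("node1.example")
```

A host that still asks for a password, is unreachable, or lacks either program makes `host_ready` return False, which is cheaper to discover here than halfway through a job.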

You can configure it to transfer files from anywhere on your local system to a remote host. However, it can’t transfer files back to any directory other than the current working directory. For whatever reason, my odd jobs often involve moving files between different directories, and GNU Parallel complicates that tremendously. You’ll need to introduce more complications in the form of another tool to move files elsewhere on your system.

GNU Parallel can help you get jobs done quicker, but you need to do a cost-benefit analysis before you use it. Its command syntax is unique and non-intuitive for command-line users, and getting it up and running can be a real head-scratcher. I spend hours in the command line every week, but I can’t make heads or tails of Parallel’s unique syntax. Intuitively, I always expect it to be quick to set it up, but I always run into unique problems that I’ve never experienced before. Its manual page is excellently written, but also practically incomprehensible.

I recently wanted to use GNU Parallel for a job. I needed something to send files to a remote cluster, execute a few commands, wait for them to finish, and transfer the results back. GNU Parallel is perfectly suited, but getting it to work just right took hours.
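For comparison, the same four steps (send, execute, wait, fetch) can be sketched directly on top of rsync and ssh. This is a hedged Python illustration under stated assumptions: the remote directory already exists, paths contain no whitespace, and every host name, path, and command in the usage comment is hypothetical. Because rsync takes an explicit destination, the results can land in any local directory.

```python
import posixpath
import shlex
import subprocess


def job_commands(host, local_file, remote_dir, command, result_name, results_dir):
    """Build the send, run, and fetch commands for one file on one host.

    `command` is a list such as ["gzip", "-k"]; the remote file path is
    appended to it. `result_name` is the file the command is expected to
    leave in `remote_dir`.
    """
    remote_file = posixpath.join(remote_dir, posixpath.basename(local_file))
    remote_result = posixpath.join(remote_dir, result_name)
    # ssh joins its arguments into one string for the remote shell,
    # so each part is quoted before joining.
    remote_line = " ".join(shlex.quote(part) for part in command + [remote_file])
    return [
        ["rsync", "-a", local_file, f"{host}:{remote_dir}/"],
        ["ssh", "-o", "BatchMode=yes", host, remote_line],
        ["rsync", "-a", f"{host}:{remote_result}", f"{results_dir}/"],
    ]


def run_job(commands, run=subprocess.run):
    """Run the commands in order; stop and return False at the first failure.

    subprocess.run blocks until each command exits, which covers the
    "wait for them to finish" step.
    """
    for cmd in commands:
        if run(cmd).returncode != 0:
            return False
    return True


# Example (all values hypothetical):
#     cmds = job_commands("node1", "/data/in/a.txt", "/tmp/work",
#                         ["gzip", "-k"], "a.txt.gz", "/data/out")
#     ok = run_job(cmds)
```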

GNU Parallel is the one tool where command cheat-sheets never seem to help me get going. There are just too many situations where it behaves unexpectedly, or where combining arguments causes the underlying tools to change behavior in unexpected ways. It takes a lot of time to prepare the data and get it into just the right format to feed to parallel.

It took hours of reading and re-reading the manuals for parallel, rsync, and ssh just to get it up and running. GNU Parallel is built on top of the latter two tools, and you need to be very familiar with both to intuit how parallel will behave in different situations. The different arguments you give parallel frequently influence the underlying tools in ways not detailed in the former’s manual.

For example, the --cleanup argument is outright dangerous! It tries to recursively remove directories it has transferred files into instead of just the files themselves. You can easily end up instructing it to delete system-critical directories like /tmp or personal directories like ~/Documents. Use it with great care!
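In a hand-rolled script, one way to sidestep that class of problem is to delete only the exact remote files a job created and never a directory. The Python sketch below illustrates the idea; the function name and the refusal rules are assumptions made for this example, and it does not describe how Parallel implements --cleanup.

```python
import shlex


def cleanup_command(host, remote_files):
    """Build an ssh command that removes the listed files and nothing else."""
    if not remote_files:
        raise ValueError("nothing to clean up")
    for path in remote_files:
        # Refuse relative paths and anything written as a directory.
        if not path.startswith("/") or path.endswith("/"):
            raise ValueError(f"refusing to remove {path!r}")
    # No -r flag: rm refuses to delete a directory even if one slips through.
    quoted = " ".join(shlex.quote(path) for path in remote_files)
    return ["ssh", "-o", "BatchMode=yes", host, f"rm -f -- {quoted}"]


# Example (hypothetical host and paths):
#     cleanup_command("node1", ["/tmp/work/a.txt", "/tmp/work/a.txt.gz"])
```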

As many times before, I got it to work in the end. However, I ended up spending way more time than I’d expected, and it needed a metric ton of scaffolding and helper scripts. It would have been much faster to write my own small script that used rsync to transfer files, and ssh to execute remote commands. I often try to shoe-horn parallel into my workflow, but it just never seems to fit.

I ended up rewriting the job logic into a Ruby script in about 20 minutes. The script monitors the availability of remote hosts and can rerun jobs that failed on one host on another. I got everything I wanted, except it doesn’t involve GNU Parallel. Frankly, the process was way simpler without the complexities that GNU Parallel brought with it.
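The rerun-on-another-host behavior is the part that is easiest to picture in code. What follows is not the Ruby script described above, which is not shown in this article, but a minimal Python sketch of the same idea. It runs jobs one at a time for brevity, whereas a real job manager would keep every host busy. The function name, the `run_on_host` callback, and `max_attempts` are illustrative assumptions.

```python
from collections import deque


def run_with_failover(jobs, hosts, run_on_host, max_attempts=3):
    """Run each job on some host, retrying a failed job on a different host.

    `run_on_host(job, host)` must return True on success and False on failure.
    Returns (done, failed): `done` maps each finished job to the host that
    completed it, and `failed` lists jobs that used up all their attempts.
    """
    done, failed = {}, []
    rotation = deque(hosts)
    for job in jobs:
        # Never try the same host twice for one job.
        for _ in range(min(max_attempts, len(hosts))):
            host = rotation[0]
            rotation.rotate(-1)  # the next attempt, or the next job, uses another host
            if run_on_host(job, host):
                done[job] = host
                break
        else:
            # Every attempt failed (or there were no hosts at all).
            failed.append(job)
    return done, failed
```

The callback is where the rsync and ssh calls would go; keeping it separate means the retry logic can be tested with a fake that simulates an unreachable host.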

My script is task-specific and not a generic utility like GNU Parallel. However, I think I’ve finally had enough of GNU Parallel. I believe that I’ve given it more credit than it’s due. It’s incredibly powerful, but it’s also an incredibly hungry time sink. For future parallelizable odd jobs, I’ll skip it completely and just write a minimal job manager to orchestrate the job.

