How to use GNU Parallel

GNU Parallel is a shell utility for running commands in parallel. It reads a list of inputs, such as URLs or file names, and starts multiple processes at the same time, giving each process one piece of the input to work on.

Parallel’s basic functionality is similar to xargs – please read more about the use of xargs here: Parallel processing in Linux.

Basic usage

Parallel can be downloaded from the GNU website at https://www.gnu.org/software/parallel/. It is also available as a package (parallel) in most Linux distributions. macOS users can install it using Homebrew.

The easiest way to learn something is to look at some examples so let’s run some tests to show you how parallel works.

First, it will need an input – we’ll use the utility “seq” that can generate numbers:

$ seq 10
1
2
3
4
5
6
7
8
9
10

If we feed this into parallel, we can run any command for each line at the same time. By default, parallel runs at most one job per CPU core at a time and appends each input line to the end of the command:

$ seq 10 | parallel echo
1
2
3
4
5
6
7
8
9
10

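What we can’t see here just yet is that these jobs ran concurrently. On the 8-core machine used for this example, parallel ran up to 8 jobs at once, each in its own job slot. The first 8 jobs started right away, and jobs 9 and 10 started as soon as a slot became free, so job slot 1 might have run “echo 1” and then “echo 9”, job slot 2 “echo 2” and then “echo 10”, and so on. Job slots are not tied to a particular CPU.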
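To see exactly which commands parallel builds from its input, a dry run can help. The sketch below assumes GNU Parallel is installed and uses its --dry-run option (print each command instead of running it) together with -k (keep the output in input order). The function name show_commands is made up for this example.

```shell
# Print the commands parallel would run for the numbers 1 to 3,
# without executing them. show_commands is a hypothetical helper name.
show_commands() {
  seq 3 | parallel -k --dry-run echo
}

# Usage:
#   show_commands
# Each printed line is one command: "echo" followed by one input line.
```

Each line of the dry run corresponds to one job, which makes it a handy check before running anything expensive.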

Let’s complicate things a bit and run an “echo; sleep” command with custom parameters:

$ seq 10 | parallel -j3 echo job slot: {%}, sleeping {}\; sleep {}
job slot: 1, sleeping 1
job slot: 2, sleeping 2
job slot: 3, sleeping 3
job slot: 1, sleeping 4
job slot: 2, sleeping 5
job slot: 3, sleeping 6
job slot: 1, sleeping 7
job slot: 2, sleeping 8
job slot: 3, sleeping 9
job slot: 1, sleeping 10

Here, -j3 tells parallel to run at most 3 jobs at a time. The rest is the command to run: first echo a message, then sleep for the given number of seconds.

If we measure how long this takes, we get about 22 seconds. That matches the busiest job slot: slot 1 ran sleep 1, sleep 4, sleep 7 and sleep 10, which adds up to 1 + 4 + 7 + 10 = 22 seconds:

$ time seq 10 | parallel -j3 echo job slot: {%}, sleeping {}\; sleep {}
job slot: 1, sleeping 1
job slot: 2, sleeping 2
job slot: 3, sleeping 3
job slot: 1, sleeping 4
job slot: 2, sleeping 5
job slot: 3, sleeping 6
job slot: 1, sleeping 7
job slot: 2, sleeping 8
job slot: 3, sleeping 9
job slot: 1, sleeping 10

real	0m22.157s
user	0m0.141s
sys	0m0.132s
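The 22 seconds can be checked with a little arithmetic. The sketch below is not part of parallel. It is a small simulation that assumes each new job goes to whichever slot becomes free first and that starting a job costs no time. The function name simulate_makespan is made up for this example.

```shell
# simulate_makespan SLOTS DURATION...
# Hypothetical helper: hands each job to the slot that is free earliest
# and prints the time at which the last slot finishes.
simulate_makespan() {
  slots=$1
  shift
  printf '%s\n' "$@" | awk -v slots="$slots" '
    {
      # find the slot that becomes free first (lowest number wins ties)
      best = 1
      for (i = 2; i <= slots; i++)
        if (free_at[i] + 0 < free_at[best] + 0) best = i
      free_at[best] += $1
    }
    END {
      max = 0
      for (i = 1; i <= slots; i++)
        if (free_at[i] + 0 > max) max = free_at[i]
      print max
    }'
}

simulate_makespan 3 1 2 3 4 5 6 7 8 9 10   # prints 22
simulate_makespan 1 1 2 3 4 5 6 7 8 9 10   # prints 55 (one job at a time)
```

With three slots the total drops from 55 seconds of sleeping to 22, which is the time the busiest slot needs.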

Escaping semicolons separating multiple commands

In these examples, we need to escape the semicolon that separates the commands, so that the shell passes it on to parallel instead of interpreting it itself.

parallel abc ; def
parallel abc \; def
parallel "abc ; def"

In the first instance, the shell would run two commands – parallel abc, then def. This is clearly not what we want, because we need parallel to process the whole command including the semicolon.

In the second and third examples above, parallel receives the semicolon as part of its command and processes it just fine. Escaping the semicolon with a backslash and quoting the whole command are equivalent.
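To see why the last two forms are equivalent, it helps to look at what a program actually receives from the shell. The sketch below uses a made-up stand-in function, show_command, that joins its arguments with spaces. This is only an illustration of the idea, under the assumption that parallel assembles its command in a similar way.

```shell
# show_command is a hypothetical stand-in for parallel: it joins all of
# its arguments with single spaces and prints the result.
show_command() {
  echo "command seen: $*"
}

show_command abc \; def     # three arguments: abc, ; and def
show_command "abc ; def"    # one argument containing the semicolon
```

Both calls print the same line, because in both cases the semicolon reaches the program instead of being consumed by the shell.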

Using parameters as input

In some cases, we need parallel to run commands based on input parameters from the command line instead of files or the standard input. To do this, we can use the “:::” (three colons) operator appended after the command.

This example is equivalent to the previous one but instead of taking its input from the output of the “seq” command, it takes it from the command line.

$ parallel -j3 echo job slot: {%}, sleeping {}\; sleep {} ::: 1 2 3 4 5 6 7 8 9 10
job slot: 1, sleeping 1
job slot: 2, sleeping 2
job slot: 3, sleeping 3
job slot: 1, sleeping 4
job slot: 2, sleeping 5
job slot: 3, sleeping 6
job slot: 1, sleeping 7
job slot: 2, sleeping 8
job slot: 3, sleeping 9
job slot: 1, sleeping 10

Combining multiple input sources

Parallel can have multiple input sources and combine them by running the specified command once for each combination of the arguments. For example, this command:

$ parallel echo ::: 1 2 3 ::: 4 5 6
1 4
1 5
1 6
2 4
2 5
2 6
3 4
3 5
3 6

takes “1 2 3” as input 1 and “4 5 6” as input 2. For the first parameter in input 1, it walks through every parameter in input 2, then does the same for the second parameter in input 1, and so on. As a result, the command runs once per combination: the number of values in input 1 times the number of values in input 2, in this case 3 * 3 = 9.
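For comparison, here is a plain shell sketch that produces the same combinations with two nested loops. It runs one command after another, so it only illustrates the order in which the arguments are combined, not the parallelism. The function name combine_inputs is made up for this example.

```shell
# combine_inputs: a sequential picture of "::: 1 2 3 ::: 4 5 6".
combine_inputs() {
  for a in 1 2 3; do        # input source 1
    for b in 4 5 6; do      # input source 2
      echo "$a $b"
    done
  done
}

combine_inputs
```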

Appending multiple input sources

We can also pair up parameters from the input sources by position, using the “:::+” argument instead of the usual “:::”. This command appends the first parameter of the second input to the first parameter of the first input, the second to the second, and so on. The number of commands run is limited by the shortest input:

$ parallel echo ::: 1 2 3 4 5 :::+ 6 7 8
1 6
2 7
3 8

Here, the three numbers from input 2 are appended to the first three numbers of input 1. Parallel then stops because there are no more parameters left in input 2, so 4 and 5 are never used.
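The pairing can be pictured with a plain shell sketch as well. This is only a sequential illustration, not how parallel is implemented, and the function name link_inputs is made up for this example.

```shell
# link_inputs: pairs the values position by position and stops as soon
# as the second list runs out.
link_inputs() {
  set -- 6 7 8                 # input source 2, held in "$@"
  for a in 1 2 3 4 5; do       # input source 1
    [ "$#" -gt 0 ] || break    # second list exhausted: stop
    echo "$a $1"
    shift
  done
}

link_inputs
```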

Special command variables

Parallel needs to know where to inject its input into the command, so there are a few special variables that get replaced with its input. By default, the input is appended to the command, e.g. “seq 10 | parallel echo”.

The parameter {} is replaced with the whole input line, so these two are equivalent:

seq 10 | parallel echo
seq 10 | parallel echo {}

There are a few other useful parameters, for example:

  • {%} – the job slot the command is running in
  • {#} – the job sequence number (in our example it counts from 1 to 10)
  • {n} – if “n” is a number, it refers to the current value of the nth input source (usable with multiple input sources)
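Here is a short sketch that combines {#} with the positional parameters {1} and {2}. It assumes GNU Parallel is installed. The -k option keeps the output in input order, and the function name label_jobs and the values a, b, x, y are made up for this example.

```shell
# label_jobs: print the job sequence number together with the value
# taken from each of the two input sources.
label_jobs() {
  parallel -k echo 'job {#}: first={1} second={2}' ::: a b ::: x y
}

# Usage:
#   label_jobs
# The first line printed should be "job 1: first=a second=x".
```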

Processing multiple files

Let’s create a few files first to have something to work on:

seq 10 | parallel echo {} \> {}.txt

This created 10 txt files named 1.txt, 2.txt, 3.txt etc., each containing a number. We can then use an asterisk (a shell wildcard) to build a command that is run once for each file. For example, to compress all of the files in parallel:

$ parallel gzip -v ::: *.txt
1.txt:		  -99.9% -- replaced with 1.txt.gz
10.txt:		  -99.9% -- replaced with 10.txt.gz
2.txt:		  -99.9% -- replaced with 2.txt.gz
3.txt:		  -99.9% -- replaced with 3.txt.gz
4.txt:		  -99.9% -- replaced with 4.txt.gz
5.txt:		  -99.9% -- replaced with 5.txt.gz
6.txt:		  -99.9% -- replaced with 6.txt.gz
7.txt:		  -99.9% -- replaced with 7.txt.gz
8.txt:		  -99.9% -- replaced with 8.txt.gz
9.txt:		  -99.9% -- replaced with 9.txt.gz

Here, the shell expands the asterisk to “1.txt 10.txt 2.txt 3.txt…”, then parallel runs a separate gzip command for each argument, with as many running at the same time as there are CPUs in the machine.

There are a few parameters that help with file processing:

  • {//} – the folder the file that’s being processed is in
  • {/} – the basename of the file (without the folder)
  • {.} – the input without its file extension (any folder part is kept)
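These parameters behave roughly like familiar shell tools. The sketch below does not use parallel at all. It only shows the plain shell counterpart of each parameter for a hypothetical path, photos/2024/cat.jpg, and the function name path_parts is made up for this example.

```shell
# path_parts: show what each file parameter would roughly contain
# for a given input line.
path_parts() {
  path=$1
  echo "{//} -> $(dirname "$path")"    # folder only
  echo "{/} -> $(basename "$path")"    # file name without the folder
  echo "{.} -> ${path%.*}"             # input without its extension
}

path_parts "photos/2024/cat.jpg"
# {//} -> photos/2024
# {/} -> cat.jpg
# {.} -> photos/2024/cat
```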

Running parallel jobs on other computers

Parallel can distribute jobs among remote computers over SSH. Because files often need to be copied to the remote computer first, it can also transfer files back and forth and clean them up after each command completes.

You’ll need to use public key authentication without a password to use this feature, because otherwise each job would have to wait for interactive input.

Let’s assume we have a folder of images that we want to optimize and we want to use three remote computers to do the heavy lifting for us. The computers are accessible as host1, host2, and host3 and they have “jpegoptim” installed on them. We want to copy each file over, process it, retrieve the result, then clean up by removing the remote files.

parallel -j4 --sshlogin host1 --sshlogin host2 --sshlogin host3 --trc {} jpegoptim ::: *jpg

Here, --trc is shorthand for transfer, return, cleanup. Parallel logs in to each machine and runs up to 4 jobs on it at a time: it copies each file over, runs the command, retrieves the optimized file, then removes the remote copy.
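Written out in full, the command could look like the sketch below. It assumes that --trc {} stands for --transfer --return {} --cleanup, as described in the parallel manual, and uses -S, the short form of --sshlogin, with a comma-separated host list. host1, host2 and host3 are the placeholder host names from the text, and optimize_remote is a made-up function name.

```shell
# optimize_remote: optimize the given JPEG files on three remote hosts,
# running up to 4 jobs on each host.
optimize_remote() {
  parallel -j4 -S host1,host2,host3 \
    --transfer --return {} --cleanup \
    jpegoptim {} ::: "$@"
}

# Usage (needs passwordless SSH access to the hosts):
#   optimize_remote *.jpg
```

Spelling the options out makes it easier to drop one of the steps, for example the cleanup, when the remote copies should be kept.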

Combine parallel with the “find” command

When processing files, letting the shell do the file globbing (expanding an asterisk into file names on the command line) has a few drawbacks: by default it does not descend into subfolders, and the operating system limits how long a command line can be, which caps the number of files that can be passed this way.

Similarly to xargs, it’s possible to feed the output of the “find -print0” command into parallel to process files that way. The “-print0” argument in the find command outputs files terminated by NUL characters for safety and parallel can process the same input format when the “-0” parameter is set.

$ find . -name '*.txt' -print0 | parallel -0 echo {%} {}
1 ./10.txt
2 ./9.txt
3 ./8.txt
4 ./5.txt
5 ./4.txt
6 ./6.txt
7 ./7.txt
8 ./3.txt
1 ./2.txt
2 ./1.txt
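The NUL-terminated format pays off when file names contain spaces. The sketch below assumes GNU Parallel is installed. It searches a folder recursively and prints the basename of every .txt file, and the result is sorted because find does not return files in a fixed order. The function name list_txt_names is made up for this example.

```shell
# list_txt_names DIR
# Print "found: NAME" for every .txt file below DIR, including files
# in subfolders and files with spaces in their names.
list_txt_names() {
  find "$1" -name '*.txt' -print0 | parallel -0 echo 'found: {/}' | sort
}

# Usage:
#   list_txt_names .
```

A file such as “sub dir/b c.txt” is handed to echo as a single argument, which would not be the case if the names were split on whitespace.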
