Article describing tool (for citations):
O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
Authors' website for obtaining code:
http://www.gnu.org/software/parallel/
All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizable:
- Run the same program on many files
- Run the same program on every sequence
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. If the jobs differ in runtime, some CPUs finish their 8 jobs early and then sit idle.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
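The difference between the two strategies can be sketched with a small simulation. This is not GNU Parallel code: it is an illustration in plain shell and awk, and the job runtimes, the makespan function and the jobs_list helper are all made up for the example.

```shell
# makespan MODE CPUS: read one job duration per line on stdin and print
# the time at which the last CPU finishes.
makespan() {
  awk -v mode="$1" -v cpus="$2" '
    { d[NR] = $1 }
    END {
      per_cpu = NR / cpus
      for (c = 0; c < cpus; c++) load[c] = 0
      for (i = 1; i <= NR; i++) {
        if (mode == "static") {
          # Job i is pinned to a CPU before anything starts.
          c = int((i - 1) / per_cpu)
        } else {
          # The next job goes to whichever CPU becomes free first.
          c = 0
          for (k = 1; k < cpus; k++) if (load[k] < load[c]) c = k
        }
        load[c] += d[i]
      }
      max = 0
      for (c = 0; c < cpus; c++) if (load[c] > max) max = load[c]
      print max
    }'
}

# Made-up runtimes in seconds: 8 slow jobs followed by 24 fast ones.
jobs_list() {
  awk 'BEGIN { for (i = 0; i < 8; i++) print 4; for (i = 0; i < 24; i++) print 1 }'
}

echo "static split: $(jobs_list | makespan static 4) s"
echo "dynamic:      $(jobs_list | makespan dynamic 4) s"
```

With these invented runtimes the static split takes 32 s, because one CPU gets all the slow jobs, while starting a new job whenever a CPU becomes free takes 14 s.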
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
EXAMPLE: Replace a for-loop
It is often faster to write a command using GNU Parallel than to write a for loop:
for i in *.gz; do
  zcat "$i" > "$(basename "$i" .gz).unpacked"
done
can be written as:
parallel 'zcat {} > {.}.unpacked' ::: *.gz
The added benefit is that the zcats are run in parallel, one per CPU core.
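To see what the replacement strings turn into, the sketch below rebuilds the command lines in plain shell. The strip_ext function is a hypothetical helper that only approximates {.} (the input with its last extension removed), and the file names are invented.

```shell
# strip_ext PATH: print PATH without its last extension, roughly what {.} yields.
strip_ext() {
  case "${1##*/}" in
    ?*.*) printf '%s\n' "${1%.*}" ;;   # basename has an extension: drop it
    *)    printf '%s\n' "$1" ;;        # no extension: leave the path alone
  esac
}

for f in reads_1.fastq.gz data/reads_2.fastq.gz; do
  # The command line built from the template 'zcat {} > {.}.unpacked'
  printf 'zcat %s > %s.unpacked\n' "$f" "$(strip_ext "$f")"
done
```

With GNU Parallel itself, adding --dry-run prints the commands it would run without running them, which is a safe way to check a template before using it on real data.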
EXAMPLE: Parallelizing BLAT
This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:
cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl
EXAMPLE: Blast on multiple machines
Assume you have a 1 GB fasta file that you want to blast. GNU Parallel can then split the fasta file into 100 KB chunks and run 1 job per CPU core:
cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results
If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:
cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result
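The point of combining --pipe with --block and --recstart '>' is that the input is cut into chunks of roughly the block size, but only at the start of a record, so no sequence is split in two. The sketch below imitates that idea in awk. It is an approximation under my own assumptions, not how GNU Parallel is implemented; chunk_fasta and the tiny test input are hypothetical.

```shell
# chunk_fasta BLOCK PREFIX: read FASTA on stdin and write PREFIX.0, PREFIX.1, ...
chunk_fasta() {
  awk -v block="$1" -v prefix="$2" '
    BEGIN { n = 0; size = 0 }
    # Only start a new chunk at a record header, and only once the
    # current chunk has reached the requested size.
    /^>/ && size >= block { close(out); n++; size = 0 }
    { out = prefix "." n; print > out; size += length($0) + 1 }
  '
}

tmp=$(mktemp -d)
printf '>s1\nACGT\n>s2\nACGT\n>s3\nACGT\n>s4\nACGT\n' | chunk_fasta 10 "$tmp/chunk"
grep -c '^>' "$tmp"/chunk.*   # number of records in each chunk
```

Every chunk starts with a '>' header line, so each one is a valid FASTA file that can be handed to a separate blastp or blat process.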
EXAMPLE: Run bigWigToWig for each chromosome
If you have one file per chromosome it is easy to parallelize processing each file. Here we run bigWigToWig for chromosomes 1..19 plus X, Y and M. These will run in parallel, but only one job per CPU core. The {} will be substituted with the arguments following the separator ':::'.
parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M
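Note that {1..19} is expanded by the shell (bash or zsh), not by GNU Parallel. In a shell without brace expansion the same argument list can be produced with seq. The sketch below does that and prints the command line each argument would give; the chromosomes function is a hypothetical helper.

```shell
# Print one chromosome name per line: 1..19, then X, Y and M.
chromosomes() {
  seq 1 19
  printf '%s\n' X Y M
}

# Show the command that corresponds to each argument.
chromosomes | while read -r chr; do
  printf 'bigWigToWig -chrom=chr%s wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr%s.map\n' "$chr" "$chr"
done
```

GNU Parallel reads arguments from standard input when no ':::' is given, so the output of chromosomes can also be piped straight into the parallel command above with the ':::' part left out.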
EXAMPLE: Running composed commands
GNU Parallel is not limited to running a single command. It can run a composed command. Here is how you process multiple FASTA files using Biopieces (which uses pipes to communicate):
parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna
See also: http://code.google.com/p/biopieces/wiki/HowTo#Howto_use_Biopieces_with_GNU_Parallel
EXAMPLE: Running experiments
Experiments often have several parameters where every combination should be tested. Assume we have a program called experiment that takes 3 arguments: --age --sex --chr:
experiment --age 18 --sex M --chr 22
Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:
parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
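Several ':::' groups give every combination of their values, much like nested loops. The serial sketch below spells that out and counts the jobs. The combos function is a hypothetical helper, and experiment is the placeholder program from the text, so the commands are only printed, not run.

```shell
# Print one command line per combination of age, sex and chromosome.
combos() {
  for age in $(seq 1 80); do
    for sex in M F; do
      for chr in $(seq 1 22) X Y; do
        echo "experiment --age $age --sex $sex --chr $chr"
      done
    done
  done
}

combos | wc -l        # 80 ages x 2 sexes x 24 chromosomes
combos | head -n 3    # the first few command lines
```

That is 3840 jobs, which is why saving each output to its own file in a single directory quickly becomes unwieldy.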
To save the output in different files you could do:
parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y
But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:
parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y
This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.
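The layout can be pictured as one directory level per argument position followed by one per value. The sketch below builds the same kind of path by hand for a single job. It only imitates the directory structure described above; run_with_results is a hypothetical helper, and echo stands in for the experiment program.

```shell
# run_with_results DIR AGE SEX CHR: run one job and file its output under DIR.
run_with_results() {
  job_dir="$1/1/$2/2/$3/3/$4"
  mkdir -p "$job_dir"
  # 'echo' stands in for the hypothetical experiment program.
  echo "experiment --age $2 --sex $3 --chr $4" > "$job_dir/stdout" 2> "$job_dir/stderr"
}

outdir=$(mktemp -d)
run_with_results "$outdir" 80 M X
find "$outdir" -type f | sort   # lists .../1/80/2/M/3/X/stderr and .../stdout
```

Because each job gets its own directory, results for a given parameter value can be collected afterwards with an ordinary glob such as outputdir/1/80/2/*/3/*/stdout.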
If you have many different parameters it may be handy to name them:
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y
Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout
If one of your parameters takes on many different values, these can be read from a file using '::::':
echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y
EXAMPLE (advanced): Using GNU Parallel to parallelize your own scripts
Assume you have a BASH/Perl/Python script called launch. It takes one argument, ID:
launch ID
Using parallel you can run multiple IDs in parallel using:
parallel launch ::: ID1 ID2 ...
But you would like to hide this complexity from the user, so the user only has to do:
launch ID1 ID2 ...
You can do that using --shebang-wrap. Change the shebang line from:
#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python
to:
#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python
You further develop your script so it now takes an ID and a DIR:
launch ID DIR
You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:
#!/usr/bin/parallel --shebang-wrap bash
And now you can run:
launch ID1 ID2 ID3 ::: DIR
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
#ilovefs
If you like GNU Parallel:
- Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
- Post the intro videos on Reddit/Diaspora*/forums/blogs/Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
- Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
- Request or write a review for your favourite blog or magazine
- Request or build a package for your favourite distribution (if it is not already there)
- Invite me for your next conference
When using programs that use GNU Parallel to process data for publication it is a requirement that you cite (use: parallel --bibtex). If you want to license GNU Parallel without this requirement, contact me.
If GNU Parallel saves you money:
- (Have your company) donate to FSF https://my.fsf.org/donate/