본문 바로가기

Linux & HPC

GNU Parallel

Article describing tool (for citations):

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

Authors' website for obtaining code:

http://www.gnu.org/software/parallel/

All new computers have multiple cores. Many bioinformatics tools are serial in nature and will therefore not use the multiple cores. However, many bioinformatics tasks (especially within NGS) are extremely parallelizeable:

  • Run the same program on many files
  • Run the same program on every sequence

GNU Parallel is a general parallelizer and makes is easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straight forward way to parallelize is to run 8 jobs on each CPU:

Simple scheduling

GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:

GNU Parallel scheduling

Installation

A personal installation does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

EXAMPLE: Replace a for-loop

It is often faster to write a command using GNU Parallel than making a for loop:

for i in *gz; do 
  zcat $i > $(basename $i .gz).unpacked
done

can be written as:

parallel 'zcat {} > {.}.unpacked' ::: *.gz

The added benefit is that the zcats are run in parallel - one per CPU core.

EXAMPLE: Parallelizing BLAT

This will start a blat process for each processor and distribute foo.fa to these in 1 MB blocks:

cat foo.fa | parallel --round-robin --pipe --recstart ">" "blat -noHead genome.fa stdin >(cat) >&2" >foo.psl

EXAMPLE: Blast on multiple machines

Assume you have a 1 GB fasta file that you want blast, GNU Parallel can then split the fasta file into 100 KB chunks and run 1 jobs per CPU core:

cat 1gb.fasta | parallel --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > results

If you have access to the local machine, server1 and server2, GNU Parallel can distribute the jobs to each of the servers. It will automatically detect how many CPU cores are on each of the servers:

cat 1gb.fasta | parallel -S :,server1,server2 --block 100k --recstart '>' --pipe blastp -evalue 0.01 -outfmt 6 -db db.fa -query - > result

EXAMPLE: Run bigWigToWig for each chromosome

If you have one file per chomosome it is easy to parallelize processing each file. Here we do bigWigToWig for chromosome 1..19 + X Y M. These will run in parallel but only one job per CPU core. The {} will be substituted with arguments following the separator ':::'.

parallel bigWigToWig -chrom=chr{} wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr{}.map ::: {1..19} X Y M

EXAMPLE: Running composed commands

GNU Parallel is not limited to running a single command. It can run a composed command. Here is now you process multiple FASTA files using Biopieces (which uses pipes to communicate):

parallel 'read_fasta -i {} | extract_seq -l 5 | write_fasta -o {.}_trim.fna -x' ::: *.fna

See also: http://code.google.com/p/biopieces/wiki/HowTo#Howto_use_Biopieces_with_GNU_Parallel

EXAMPLE: Running experiments

Experiments often have several parameters where every combination should be tested. Assume we have a program called experimentthat takes 3 arguments: --age --sex --chr:

experiment --age 18 --sex M --chr 22

Now we want to run experiment for every combination of ages 1..80, sex M/F, chr 1..22+XY:

parallel experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

To save the output in different files you could do:

parallel experiment --age {1} --sex {2} --chr {3} '>' output.{1}.{2}.{3} ::: {1..80} ::: M F ::: {1..22} X Y

But GNU Parallel can structure the output into directories so you avoid having thousands of output files in a single dir:

parallel --results outputdir experiment --age {1} --sex {2} --chr {3} ::: {1..80} ::: M F ::: {1..22} X Y

This will create files like outputdir/1/80/2/M/3/X/stdout containing the standard output of the job.

If you have many different parameters it may be handy to name them:

parallel --result outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} ::: AGE {1..80} ::: SEX M F ::: CHR {1..22} X Y

Then the output files will be named like outputdir/AGE/80/CHR/Y/SEX/F/stdout

If one of your parameters take on many different values, these can be read from a file using '::::'

echo AGE > age_file
seq 1 80 >> age_file
parallel --results outputdir --header : experiment --age {AGE} --sex {SEX} --chr {CHR} :::: age_file ::: SEX M F ::: CHR {1..22} X Y

EXAMPLE(advanced): Using GNU Parallel to parallelize you own scripts

Assume you have BASH/Perl/Python script called launch. It takes one arguments, ID:

launch ID

Using parallel you can run multiple IDs in parallel using:

parallel launch ::: ID1 ID2 ...

But you would like to hide this complexity from the user, so the user only has to do:

launch ID1 ID2 ...

You can do that using --shebang-wrap. Change the shebang line from:

#!/usr/bin/env bash
#!/usr/bin/env perl
#!/usr/bin/env python

to:

#!/usr/bin/parallel --shebang-wrap bash
#!/usr/bin/parallel --shebang-wrap perl
#!/usr/bin/parallel --shebang-wrap python

You further develop your script so it now takes an ID and a DIR:

launch ID DIR

You would like it to take multiple IDs but only one DIR, and run the IDs in parallel. Again just change the shebang line to:

#!/usr/bin/parallel --shebang-wrap bash

And now you can run:

launch ID1 ID2 ID3 ::: DIR

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

#ilovefs

If you like GNU Parallel:

  • Give a demo at your local user group/team/colleagues (remember to show them --bibtex)
  • Post the intro videos on Reddit/Diaspora*/forums/blogs/ Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
  • Get the merchandise https://www.gnu.org/s/parallel/merchandise.html
  • Request or write a review for your favourite blog or magazine
  • Request or build a package for your favourite distribution (if it is not already there)
  • Invite me for your next conference

When using programs that use GNU Parallel to process data for publication it is a requirement that you cite (use: parallel --bibtex). If you want to license GNU Parallel without this requirement, contact me.

If GNU Parallel saves you money:


 출처 : https://www.biostars.org/p/63816/


꽤 재밌는 내용이라 가져와본다. 내 클러스터 서버에 적용해봐야지.