Tutorial: Piping With Samtools, Bwa And Bedtools
82
gravatar for Ying W
6.5 years ago by
Ying W3.8k
South San Francisco, CA
Ying W3.8k wrote:

In this tutorial I will introduce some concepts related to unix piping. Piping is a very useful feature to avoid creation of intermediate use once files. It is assumed that bedtools, samtools, and bwa are installed.

Lets begin with a typical command to do paired end mapping with bwa: (./ means look in current directory only)

# -t 4 is for using 4 threads/cores
bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam

Supposed we wish to compress sam to bam, sort, remove duplicates, and create a bed file.

samtools view -Shu s1.sam > s1.bam
samtools sort s1.bam s1_sorted
samtools rmdup -s s1_sorted.bam s1_sorted_nodup.bam
bamToBed -i s1_sorted_nodup.bam > s1_sorted_nodup.bed

This workflow above creates many files that are only used once (such as s1.bam) and we can use the unix pipe utility to reduce the number intermediate files created. The pipe function is the character | and what it does is take the output from one command and sets it as input for next command (assuming next command knows how to deal with this input).

For example, when you type in head myfile.txt the command 'head' will read first 10 lines of myfile.txt and output it (by default) to the display (stdout) so you will see the first 10 lines printed on your screen. However, if you were to do something like this:

head myfile.txt | wc -l

you will instead get 10 printed out on your screen. What the pipe command did was take the output of head command and used it as input for the wordcount (wc) command. wc -l command will count the number of lines passed in (which is 10 since head by default prints the first 10 lines). An equivalent command would be

head myfile.txt > myfile_top10_lines.txt
wc -l myfile_top10_lines.txt

By using the pipe command we are able to eliminate the intermediate file that was generated (myfile_top10_lines.txt).

Below is an example of how pipe command can be used in samtools

bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq | \
samtools view -Shu - | \
samtools sort - - | \
samtools rmdup -s - - | \
tee s1_sorted_nodup.bam | \
bamToBed > s1_sorted_nodup.bed

A couple of things I wanted to point out here:

  • The \ symbol after the pipe symbol tells the terminal that I want to keep writing command on the next line, that way you can have multiline commands instead of super-long command on one line.
  • The tee command takes the information being piped and creates a file while passing the data along the pipe, this command is useful if you want to save an intermediate file but still need to do additional processing. (Think of a T shaped drain pipe)
  • The - symbol in samtools it to tell the samtools program to take the input from pipe
  • It is not recommended you run this command all at once like I showed above if you have a giant bam file because sort will take forever. Running things in a pipe can be more space efficient since you are not storing intermediate files but can create issues/bottlenecks elsewhere. If problems arise, run the pieces without piping to troubleshoot.
  • The two bwa aln commands can be run in parallel (for example: on another machine) before the sampe command

Other common uses of pipes that I have not shown above it to pipe output into snp calling, you can find some examples here: (NOTE: pileup command is depreciated, the new way to call SNPs using samtools using mpileup command)

samtools merge - *.bam | tee merged.bam | samtools rmdup - - | tee rmdup.bam | samtools mpileup - uf ./hg19.fasta - | bcftools view -bvcg - | gzip > var.raw.bcf.gz

Compressing files with gzip can save a lot of space, but makes these files non-human readable (if you use the head or less command it will give you gibberish). To view these files you can use zless and zcat command.

zless var.raw.bcf.gz #similar to gzip -cd var.raw.bcf.gz | less
zcat var.raw.bcf.gz | head

In case you are wondering if there is a way to run bwa sampe without storing intermediate sai files the command below will do this:

bwa sampe ./hg19.fasta <(bwa aln -t 4 ./hg19.fasta ./s1_1.fastq) <(bwa aln -t 4 ./hg19.fasta ./s1_2.fastq) ./s1_1.fastq ./s1_2.fastq | samtools view -Shb /dev/stdin > s1.bam

Notice here that I used the -b command instead of -u command in samtools view. -u is useful for piping since it spits out an uncompressed bam so you don't have to waste CPU cycles compressing/decompressing at every step, however, if you have a file you want to store, it would be better to compress it with -b. Additionally, I used /dev/stdin instead of using the - symbol, they both mean the same thing (afaik) and I wanted to point it out that /dev/stdin might be a more explicit way of showing things. (ie samtools view -Shu /dev/stdin )

Update: newer versions of bwa use the bwa mem command to map:

# 4 cores, -M is for Picard compatibility
bwa mem -M -t 4 ./hg19.fasta ./s1_1.fastq ./s1_2.fastq > s1.sam

This command would replace the following lines of code in the tutorial above:

bwa aln -t 4 ./hg19.fasta ./s1_1.fastq > ./s1_1.sai
bwa aln -t 4 ./hg19.fasta ./s1_2.fastq > ./s1_2.sai
bwa sampe ./hg19.fasta ./s1_1.sai ./s1_2.sai ./s1_1.fastq ./s1_2.fastq > s1.sam

While piping can be useful in some cases, it can also be bottlenecked by slow steps. Resource utilization can be monitored by programs like nmon and ole.tange's guide on using parallel in the comments below will be a better solution than using pipes when there are slow steps and you are processing many files.

ADD COMMENTlink modified 2.9 years ago by Yaseen Ladak20 • written 6.5 years ago by Ying W3.8k
2

I think there may be a typo in the piping example, where "samtools sort - - | \" should be "samtools sort - | \". Thanks for the great write-up. Cheers!

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by tantrev20

Thanks for the great tutorial! However, I think the " > ./s1_2.sai" part of the above bwa sampe command without intermediate sai files shouldn't be included.

Also, do you know of a way to pipe into and out of the samtools index command?

ADD REPLYlink written 6.3 years ago by Bilby10

you're right, thx for catching that, I've changed it. You do not need to pipe anything into samtools index because its creating a .bai file. you could append a 'samtools index s1.bam;' to the end of your command

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Ying W3.8k

Hi Ying, I was trying to follow the workflow described above but could not get it to work using bwa-mem.. Have you found a way to do this? samtools view complains about truncated input.

ADD REPLYlink written 4.9 years ago by thomasvangurp0

this would probably be best as a new question with more details, from the sounds of it though, maybe bwa did not finish or create a proper file

ADD REPLYlink written 4.9 years ago by Ying W3.8k

The \ symbol is not needed if put after | in Bash. Bash automatically knows the line continues if the last char is a |.

ADD REPLYlink written 4.7 years ago by ole.tange3.2k

thanks for the tip

ADD REPLYlink written 4.7 years ago by Ying W3.8k

This is great, however nowadays I would use bwa mem which can align paired ends data, and substitute the work of bwa aln and bwa sampe.

ADD REPLYlink written 4.6 years ago by Giovanni M Dall'Olio26k
11
gravatar for ole.tange
4.6 years ago by
ole.tange3.2k
Denmark
ole.tange3.2k wrote:

Piping is highly efficient, but you will often experience, that a single program in the pipeline will slow the whole thing down. If you have a situation like this:

A | B | C

where A and C are fast, but B is slow, you can often speed up the process by parallelizing B. If B reads a record (e.g. a line), processes the line, and sends some output to C, you can parallelize B using GNU Parallel GNU Parallel - parallelize serial command line programs without changing them

A | parallel --pipe B | C

The above assumes:

  • A record is a line (\n is the record delimiter)
  • The order to process data is unimportant (It is OK that line 1000 is processed before line 999)

GNU Parallel will read blocks of 1 MB, find a \n, chop there and pass the block on to B.

If the order is must be kept (The output of processing line 1000 must be after the output from line 999), then use -k (--keep-order):

A | parallel --pipe -k B | C

If the records start with '>' then you can use:

A | parallel --pipe --recstart '>' B | C

If the records start with '@' and end with '\n' and the body of the record can contain both '@' and '\n', then you cannot simply split on neither '@' nor '\n':

A | parallel --pipe --recend '\n' --recstart '@' B | C

This will only split if '\n' is followed immediately by '@', and '\n' will be kept with the previous record.

If the 1 MB block size is too small, adjust it with --block:

A | parallel --pipe --block 10M B | C

If B is really slow at starting (e.g. it has to read a database before it can start processing) you can spread the blocks round robin (if 4 jobs are started, job 3 will get block 3,7,11,15...):

A | parallel --pipe --round-robin B | C

As this will mix blocks --keep-order will not work, and C will only receive its first input after all output from A has been processed.

GNU Parallel has many other ways of parallelizing. The tutorial will get you through many of them GNU Parallel tutorial

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by ole.tange3.2k

The parallelizing seems like a pretty neat idea. The question is, can the programs Ying W mentioned be parallelized? Has anyone tried it and can share how it worked out?

ADD REPLYlink written 2.7 years ago by alons260
6
gravatar for Pablo Marin-Garcia
6.1 years ago by
Spain
Pablo Marin-Garcia1.8k wrote:

Good tutorial, thanks. Only a note: pileup is deprecated use mpileup instead. The SAMtools wiki at sourceforge is a bit old (last entry in FAQ from 2010), and pileup action has been deprecated since then. For calling SNPs and specially indels now you need to use mpileup and bcftools. You can read more about mpileup here http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html and Calling SNPs/INDELs with SAMtools/BCFtools

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Pablo Marin-Garcia1.8k
2
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Supposed we wish to compress sam to bam, sort, remove duplicates, and create a bed file.

One likely faster way to do this might be to skip the sam to bam conversion and convert directly to a sorted BED file:

$ sam2bed < reads.sam > sorted_redundant.bed

The sam2bed (convert2bed) task converts columns directly to BED and does sorting with sort-bed, which is faster than GNU sort (both serial and parallel) and sortBed. The sam2bed step also preserves all columns, whereas I think bamToBed with sortBed will discard data.

To make non-redundant, add an awk statement to remove duplicates:

$ sam2bed < reads.sam | awk '{ if (a[$0]++ == 0) print $0; }' > sorted_nonredundant.bed

If you want compression, the Starch format offers a 33% space savings over the bam format, generally:

$ sam2bed < reads.sam | awk '{ if (a[$0]++ == 0) print $0; }' | starch - > sorted_nonredundant.starch

Starch removes redundancy in the coordinate data before compression.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds26k
1
gravatar for Zee
4.5 years ago by
Zee20
Malaysia
Zee20 wrote:

With Novoalign we do the equivalent

 

1. Create the aligned , unsorted bam file. The aligner will use all or most cores (controlled  with -c) so we don't want to exhaust resources:

novoalign -d genome.novoindex -f s_1_1.fastq.gz s_1_2.fastq.gz -a -oSAM \

|  samtools view -uS - > s1.bam

2. The next part can be piped as above

 

novosort -c $CORES -m 16g s1.bam | samtools rmdup -s - - | tee s1_sorted_nodup.bam \

| bamToBed -i stdin > s1_sorted_nodup.bed

 

With the "-m 16g" we're just telling the sorting program to use 16G of RAM and it will consume $CORES CPUs leaving the remainder for samtools rmdup.

 

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Zee20
0
gravatar for deepthitheresa
6.4 years ago by
Canada
deepthitheresa20 wrote:

Hi

I tried piping samtools. It is not working properly. Only first command is doing correctly.

Anyway thank you for sharing the concept of piping.

Deeps

ADD COMMENTlink written 6.4 years ago by deepthitheresa20

very often when piping into a samtools command you will need a - to tell the program you are piping something in, you can also use /dev/stdin instead of - in the place of the file and it should also work, if you post an example i could help you out

ADD REPLYlink written 6.4 years ago by Ying W3.8k
0
gravatar for deepthitheresa
6.4 years ago by
Canada
deepthitheresa20 wrote:

Hi Ying,

I already posted the issue in the Biostar. I am sending you the link.

http://www.biostars.org/post/show/12747/how-to-combine-two-samtools-commands/#45006

Thanks, Deepthi

ADD COMMENTlink written 6.4 years ago by deepthitheresa20
0
gravatar for Casey Bergman
6.3 years ago by
Casey Bergman17k
Athens, GA, USA
Casey Bergman17k wrote:

Hi -

I think you need to add a -b flag to the following steps to get samtools view to output BAM format which is required for samtools sort:

samtools view -Shub s1.sam > s1.bam

and

samtools view -Shub - | \
ADD COMMENTlink written 6.3 years ago by Casey Bergman17k

No, the -u flag replaces the -b flag when piping to save time on compressing and decompressing bam files

ADD REPLYlink written 6.3 years ago by Bilby10

Right, I hadn't tried this for the pipe, but I think b is needed for the first command, no?

ADD REPLYlink written 6.3 years ago by Casey Bergman17k

Yep, if its not piped then you need the -b and not the -u as I understand it.

ADD REPLYlink written 6.3 years ago by Bilby10

I've added a small part at the end to clarify this.

ADD REPLYlink written 6.2 years ago by Ying W3.8k
0
gravatar for eva
3.7 years ago by
eva10
Finland
eva10 wrote:
samtools sort - - | samtools rmdup -s - - |

 part of the pipe wasn't working for me(giving invalid bam binary header error). I had to use a different way of piping for the sort part as was suggested here http://sourceforge.net/p/samtools/mailman/message/27792510/  like this:

samtools sort -o - tmp| samtools rmdup -s - - |
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by eva10
0
gravatar for Yaseen Ladak
2.9 years ago by
Yaseen Ladak20
United Kingdom, London, Imperial College London
Yaseen Ladak20 wrote:

Hi All

I have a question. I wish to run bowtie over 3 cores and get an output of aligned sorted and indexed bam files.

bowtie2 -p 3 -x ../genomes/hg38/hg38 -1 tg/1/S1R1_val_1.fq.gz -2 tg/1/S1R2_val_2.fq.gz | samtools view -Shu - | samtools sort - - | samtools index > bam/S1_srt.bam

The above command just foes till sorted how can I also include a index via the pipe such that I get a sorted aligned bam and also the index in the same pipe?

I look forward for your help.

Thanks,

Yaseen

ADD COMMENTlink written 2.9 years ago by Yaseen Ladak20
0
gravatar for Yaseen Ladak
2.9 years ago by
Yaseen Ladak20
United Kingdom, London, Imperial College London
Yaseen Ladak20 wrote:

Hi All

I have a question. I wish to run bowtie over 3 cores and get an output of aligned sorted and indexed bam files.

bowtie2 -p 3 -x ../genomes/hg38/hg38 -1 tg/1/S1R1_val_1.fq.gz -2 tg/1/S1R2_val_2.fq.gz | samtools view -Shu - | samtools sort - - | samtools index > bam/S1_srt.bam

The above command just foes till sorted how can I also include a index via the pipe such that I get a sorted aligned bam and also the index in the same pipe?

I look forward for your help.

Thanks,

Yaseen

ADD COMMENTlink written 2.9 years ago by Yaseen Ladak20

remove the second dash in the sort command

ADD REPLYlink written 2.9 years ago by andrew.j.skelton735.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1142 users visited in the last hour