Question

Samtools Pileup Got Really Slow After I Switched To Stampy For Mapping

0

Entering edit mode

12.1 years ago

Applet ▴ 150

So here is my perplexing question.

When I used to run my pipeline on the same input files but using BWA for the mapping, the pileup step took less than 1 hour.

I switched to stampy for the mapping and now the pileup step takes 23 hours. Nothing else has changed.

I'm using an old version of samtools (Version: 0.1.7 (r510)) that we need for compatibility with other downstream programs, but if you guys think it would help, I'd consider upgrading samtools.

Other info: sim_w501.fastq is about a 6GB file. The paerent.fa file is about 136MB.

Here are the calls I make in the pipeline:

#Quality trim the reads
python TQSfastq.py -f sim_w501.fastq -t 20 -c 30 -q -o sim_w501.fastq
#Stampy still requires a bwa index. Generate that for my reference genome
bwa index -a bwtsw parent1.fa
#Generate stampy hashes
stampy.py -G parent1.stampy.msg parent1.fa
stampy.py -g parent1.stampy.msg -H parent1.stampy.msg
#Run stampy
stampy.py --bwaoptions="-q10 parent1.fa" -g parent1.stampy.msg -h parent1.stampy.msg -M sim_w501.fastq -o update_reads-aligned-parent1.sam
#Convert the output to BAM (from SAM)
samtools view -bhS -o update_reads-aligned-parent1.bam update_reads-aligned-parent1.sam
#Convert the output from SAM to BAM
samtools view -h -o update_reads-aligned-parent1.sam update_reads-aligned-parent1.stampy.bam
#Remove any bad reads that couldn't be mapped
filter-sam.py -i update_reads-aligned-parent1.sam -o update_reads-aligned-parent1.filtered.sam -a aln -s 1
#Convert back to BAM format
samtools view -b -o update_reads-aligned-parent1.bam update_reads-aligned-parent1.filtered.sam
#Sort the BAM file as it speeds up the next calmd step significantly
samtools sort update_reads-aligned-parent1.bam update_reads-aligned-parent1.bam.sorted
#Add in MD tags, stampy leaves these out
samtools calmd -b update_reads-aligned-parent1.bam.sorted.bam parent1.fa > update_reads-aligned-parent1.sorted.calmd.bam
#I'm not sure what this step is for
samtools index update_reads-aligned-parent1.bam.sorted.calmd.bam
#Run the pileup
samtools pileup -f parent1.fa update_reads-aligned-parent1.sorted.calmd.bam -c > update_reads-aligned-parent1.pileup

pileup samtools bwa • 5.3k views

ADD COMMENT • link updated 12.1 years ago by Swbarnes2 ★ 1.6k • written 12.1 years ago by Applet ▴ 150

0

Entering edit mode

you are running an old version of samtools. "pileup" has been deprecated .

ADD REPLY • link 12.1 years ago by Pierre Lindenbaum 161k

score 1 · Answer 1 · 2012-03-15

1

Entering edit mode

12.1 years ago

Vikas Bansal ★ 2.4k

I think you really need to read about samtools, stampy and bwa. Assuming that you want to map your reads using stampy and you need pileup file in the end, follow these steps-> As your reference genome (parent.fa) is not very big, use stampy without bwa ->

First you have to build a genome file:

  ./stampy.py -G hg18 parent.fa

After building the genome file, you need to build a hash table:

  ./stampy.py -g hg18 -H hg18

Map the reads:

  ./stampy.py -g hg18 -h hg18 -M sim_w501.fastq

According to stampy manual ->

Note that the hash file may be large (up to 2 Gb); to improve startup speed you may want to keep it on a local drive. Note also that the file system must support memory mapped files for Stampy to work; NFS does, but e.g. GlusterFS does not. Some file systems support memory mapped files but become exceedingly slow; if this happens try moving both files to a local drive.

You have to check this yourself.

So when you mapped the reads, stampy will give you output in sam format. Now as you mentioned, "Remove any bad reads that couldn't be mapped" and by this if you mean to remove unmapped reads, use:

 samtools view -bS -F 4 stampy_output.sam    -o output.bam
# Here -F 4 says that skip the flag 4 (means skip unmapped reads)

Now you can convert this bam file into pileup file.

ADD COMMENT • link 12.1 years ago by Vikas Bansal ★ 2.4k

0

Entering edit mode

Thanks. Why do you suggest not using BWA with stampy? I thought I read that it's almost always faster. I'm already doing the other stampy hash steps. The only issue I'm having trouble with is generating the pileup is very slow.

ADD REPLY • link 12.1 years ago by Applet ▴ 150

0

Entering edit mode

The last step where you are generating pileup file, try it without -c.

ADD REPLY • link 12.1 years ago by Vikas Bansal ★ 2.4k

0

Entering edit mode

I believe the -c gives me special formatting in the output file which I need.

ADD REPLY • link 12.1 years ago by Applet ▴ 150

0

Entering edit mode

If the problem is during converting sam to pileup -> another solution is to compare the sam file which you got initially using bwa with the sam file you got from stampy.

ADD REPLY • link 12.1 years ago by Vikas Bansal ★ 2.4k

score 1 · Answer 2 · 2012-03-15

1

Entering edit mode

12.1 years ago

Swbarnes2 ★ 1.6k

Why do you have so many conversions to and from .sam and .bam?

You should pipe whatever it is that makes the .sam straight into samtools view so it can make the .bam on the fly. You can filter out unmapped reads at that step too.

Indexing isn't required for pileup or mpileup, unless you with to specify a region of the genome for your pileup. It's for slicing a region out of your .bam and visualizers like IGV require it too.

ADD COMMENT • link 12.1 years ago by Swbarnes2 ★ 1.6k

0

Entering edit mode

That's good advice. I don't think that would speed up the pileup step though. That's the really slow part that I want to fix. See the last line in my list of commands.

ADD REPLY • link 12.1 years ago by Applet ▴ 150