Question: Samtools Pileup Got Really Slow After I Switched To Stampy For Mapping
gravatar for Applet
8.7 years ago by
Applet150 wrote:

So here is my perplexing question.

When I used to run my pipeline on the same input files but using BWA for the mapping, the pileup step took less than 1 hour.

I switched to stampy for the mapping and now the pileup step takes 23 hours. Nothing else has changed.

I'm using an old version of samtools (Version: 0.1.7 (r510)) that we need for compatibility with other downstream programs, but if you guys think it would help, I'd consider upgrading samtools.

Other info: sim_w501.fastq is about a 6GB file. The paerent.fa file is about 136MB.

Here are the calls I make in the pipeline:

#Quality trim the reads
python -f sim_w501.fastq -t 20 -c 30 -q -o sim_w501.fastq
#Stampy still requires a bwa index. Generate that for my reference genome
bwa index -a bwtsw parent1.fa
#Generate stampy hashes -G parent1.stampy.msg parent1.fa -g parent1.stampy.msg -H parent1.stampy.msg
#Run stampy --bwaoptions="-q10 parent1.fa" -g parent1.stampy.msg -h parent1.stampy.msg -M sim_w501.fastq -o update_reads-aligned-parent1.sam
#Convert the output to BAM (from SAM)
samtools view -bhS -o update_reads-aligned-parent1.bam update_reads-aligned-parent1.sam
#Convert the output from SAM to BAM
samtools view -h -o update_reads-aligned-parent1.sam update_reads-aligned-parent1.stampy.bam
#Remove any bad reads that couldn't be mapped -i update_reads-aligned-parent1.sam -o update_reads-aligned-parent1.filtered.sam -a aln -s 1
#Convert back to BAM format
samtools view -b -o update_reads-aligned-parent1.bam update_reads-aligned-parent1.filtered.sam
#Sort the BAM file as it speeds up the next calmd step significantly
samtools sort update_reads-aligned-parent1.bam update_reads-aligned-parent1.bam.sorted
#Add in MD tags, stampy leaves these out
samtools calmd -b update_reads-aligned-parent1.bam.sorted.bam parent1.fa > update_reads-aligned-parent1.sorted.calmd.bam
#I'm not sure what this step is for
samtools index update_reads-aligned-parent1.bam.sorted.calmd.bam
#Run the pileup
samtools pileup -f parent1.fa update_reads-aligned-parent1.sorted.calmd.bam -c > update_reads-aligned-parent1.pileup
samtools pileup bwa • 4.1k views
ADD COMMENTlink written 8.7 years ago by Applet150

you are running an old version of samtools. "pileup" has been deprecated .

ADD REPLYlink written 8.7 years ago by Pierre Lindenbaum131k
gravatar for Vikas Bansal
8.7 years ago by
Vikas Bansal2.4k
Berlin, Germany
Vikas Bansal2.4k wrote:

I think you really need to read about samtools, stampy and bwa. Assuming that you want to map your reads using stampy and you need pileup file in the end, follow these steps-> As your reference genome (parent.fa) is not very big, use stampy without bwa ->

First you have to build a genome file:

  ./ -G hg18 parent.fa

After building the genome file, you need to build a hash table:

  ./ -g hg18 -H hg18

Map the reads:

  ./ -g hg18 -h hg18 -M sim_w501.fastq

According to stampy manual ->

Note that the hash file may be large (up to 2 Gb); to improve startup speed you may want to keep it on a local drive. Note also that the file system must support memory mapped files for Stampy to work; NFS does, but e.g. GlusterFS does not. Some file systems support memory mapped files but become exceedingly slow; if this happens try moving both files to a local drive.

You have to check this yourself.

So when you mapped the reads, stampy will give you output in sam format. Now as you mentioned, "Remove any bad reads that couldn't be mapped" and by this if you mean to remove unmapped reads, use:

 samtools view -bS -F 4 stampy_output.sam    -o output.bam
# Here -F 4 says that skip the flag 4 (means skip unmapped reads)

Now you can convert this bam file into pileup file.

ADD COMMENTlink modified 8.7 years ago • written 8.7 years ago by Vikas Bansal2.4k

Thanks. Why do you suggest not using BWA with stampy? I thought I read that it's almost always faster. I'm already doing the other stampy hash steps. The only issue I'm having trouble with is generating the pileup is very slow.

ADD REPLYlink written 8.7 years ago by Applet150

The last step where you are generating pileup file, try it without -c.

ADD REPLYlink written 8.7 years ago by Vikas Bansal2.4k

I believe the -c gives me special formatting in the output file which I need.

ADD REPLYlink written 8.7 years ago by Applet150

If the problem is during converting sam to pileup -> another solution is to compare the sam file which you got initially using bwa with the sam file you got from stampy.

ADD REPLYlink written 8.7 years ago by Vikas Bansal2.4k
gravatar for Swbarnes2
8.7 years ago by
Swbarnes21.5k wrote:

Why do you have so many conversions to and from .sam and .bam?

You should pipe whatever it is that makes the .sam straight into samtools view so it can make the .bam on the fly. You can filter out unmapped reads at that step too.

Indexing isn't required for pileup or mpileup, unless you with to specify a region of the genome for your pileup. It's for slicing a region out of your .bam and visualizers like IGV require it too.

ADD COMMENTlink written 8.7 years ago by Swbarnes21.5k

That's good advice. I don't think that would speed up the pileup step though. That's the really slow part that I want to fix. See the last line in my list of commands.

ADD REPLYlink written 8.7 years ago by Applet150
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1684 users visited in the last hour