Question: Best Practices / Faster Samtools Mpileup, Any Tips?
gravatar for Kevin
8.0 years ago by
Kevin630 wrote:

I need to run samtools mpileup on 38 individuals for whole genome sequencing. I intend to parallelize the process by splitting by chromosomes. I thought of splitting by regions to get more parallel chunks but I was told that each mpileup process consumes quite a fair bit of memory and it will segfault if it runs out of memory.

I am looking for tips on how to speedup the mpileup calls as I think from past experiences, it took 2 weeks for mpileup calls on 100 individuals for chr1.

I also separate ref.fa for male and female subjects. Is it alright if I were to use the male ref.fa for all idv ?


samtools mpileup • 12k views
ADD COMMENTlink written 8.0 years ago by Kevin630

What kind of hardware do you have access to?

ADD REPLYlink written 8.0 years ago by Alex Paciorkowski3.4k

20 linux servers with ~ 32 Gb ram ... but not a whole lot of hdd space .. which is the bummer .. the shared HPC resource has way more cores but only 500 Gb scratch .. which is just enough for one chromosome from 38 idv i guess

ADD REPLYlink written 8.0 years ago by Kevin630
gravatar for David Langenberger
8.0 years ago by
David Langenberger9.5k wrote:

Using xargs, you can parallelize your call by chromosomes (-P is the number of parallel processes started parallel - don't use more than the number of cores you have):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c "samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - > tmp.{}.vcf"

To merge the results afterwards, you might want to do something like this (not the best way):

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'

I use xargs quite a lot, but I cannot say anything about the memory usage of mpileup. So, you should check the memory usage, before using to many parallel jobs.

ADD COMMENTlink written 8.0 years ago by David Langenberger9.5k

If there was Biostars Gold I would give it to you, very nice piece of code here.

ADD REPLYlink written 3.0 years ago by mbcouger0

There is a small error

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].bcf >> yourFile.vcf");'

should be

samtools view -H yourFile.bam | grep "\@SQ" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system("cat tmp.$F[0].vcf >> yourFile.vcf");'
ADD REPLYlink modified 20 months ago • written 20 months ago by vellamike0
gravatar for Kevin
7.8 years ago by
Kevin630 wrote:

Ok just learnt about the new calling by region option in mpileup.

Calling SNPs/INDELs in small regions (see ) splitchr -l 500000 ref.fa.fai | xargs -i \     echo samtools mpileup
 -C50 -m3 -F0.0002 -DSuf ref.fa -r {} -b bam.list \| bcftools \     view -bcvg - \> part-{}.bcf

Random position retrieval that works. One of the most powerful features of mpileup is that you can specify a region with -r chrom:start-stop and it will report pileup output for the specified position(s). The old pileup command had this option, but took a long time because it looked at all positions and just reported the ones within your desired region. Instead, mpileup leverages BAM file indexing to retrieve data quite rapidly: In my experience, it takes about 1 second to retrieve the pileup for several samples at any given position in the human genome. Multi-sample, rapid random access has lots of uses for bio-informaticians; for example, I can retrieve all bases observed in all samples at a variant of interest to look at the evidence in each sample.

ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Kevin630

I confirm it is about 1 second per lookup now, but that adds up when I have 1000s of SNPs per bacterial genome, and 100s of genomes... but thanks for the info, it's very useful.

ADD REPLYlink written 6.8 years ago by Torst950
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 962 users visited in the last hour