Question: Samtools mpileup - Individual Pile-ups vs Group Pile-ups
gravatar for jrh154
7 months ago by
jrh1540 wrote:

Hi All,

First time posting here (or really forums in general), so apologies if I am breaking any etiquette.

My question regards using the mpileup feature in Samtools. I am curious if it matters whether or not it matters if I run mpileup individually for each bam file versus running mpileup as one process for all the bam files. E.g., if I have 3 bam files, 1.bam, 2.bam, and 3.bam, I can either run samtools mpileup 1.bam and then samtools mpileup 2.bam, etc or run samptools mpileup 1.bam, 2.bam, 3.bam (I know the syntax here is wrong, just trying to illustrate the problem).

The reason I'm asking: We are trying to use cluster computer resources (i.e., Westgrid), and as far as we can tell, there is no easy way to run the mpileup over multiple processors. It therefore seems like it we would be able to save a lot of time by running the mpileup individually as separate jobs on the cluster. Is this legit/would the results we get be legitimate? (It seems that read depth would decrease in the individual case?) Also, is there a convenient way to combine the resulting bcf files into a single bcf file?

Our group is very new to cluster computing and also fairly new to the whole GBS/NGS sequencing process, any help or resources you can point us to would be greatly appreciated!

sequencing mpileup snp samtools • 260 views
ADD COMMENTlink written 7 months ago by jrh1540

I don't know if the results would be different in the case of samtools, but if you want to reduce the time, I suggest you give BBMap's CallVariants a try. In this case the command line would be (with the latest version, currently 36.77): in=1.bam,2.bam,3.bam ref=reference.fasta vcf=variants.vcf multisample ploidy=2

This is multithreaded and much faster than mpileup+bcftools. For CallVariants, the results would be different between running one at a time and all at once. The variant calls per sample would be the same, but with the combined version, there would be an entry for each variant per sample - in other words, if sample 1 has a variation but sample 2 and sample 3 don't, the individual vcf files for 2 and 3 would not show that variation. But in the combined version, all 3 would have an entry for the variation indicating the coverage and allele frequency, which might be 0.99 for sample 1 and 0.00 for sample 2 and sample 3.

ADD REPLYlink modified 7 months ago • written 7 months ago by Brian Bushnell13k

Cool, I will definitely look into using BBMap. If we wanted to compare the results from CallVariants with previous results obtained using Samtools, will there be any significant differences/would these still be valid comparisons?

ADD REPLYlink written 7 months ago by jrh1540

The results would be somewhat different. I suggest you compare a sample or two with both tools to get an idea of what the differences are, and examine the bam files visually in IGV at any sites with discrepancies to determine which seems to be correct. Overall, it's not safe to compare samples that have been processed with different methodologies - whether that is read preprocessing, sequencing platform, mapping program, variant caller, reference fasta, etc. However, the results should agree most of the time. I have noticed samtools+bcftools doing weird things like calling indels at under 2% allele fraction, which doesn't make much sense to me for default settings.

ADD REPLYlink written 7 months ago by Brian Bushnell13k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1244 users visited in the last hour