Question: SAMtools mpileup and bcftools doesn't call InDels...
5
gravatar for Kirill
3.3 years ago by
Kirill250
Australia
Kirill250 wrote:

Hi Guys, 

I have been researching fair bit over the last few days about SNPs/InDels calling, My project pretty much exactly summarised here http://seqanswers.com/forums/showthread.php?t=57494&highlight=samtools+mpileup. My NGS data has also been generated by cripsr/cas9 method. 

I used this method

HaplotypeCaller -R genome.fa -I sorted.bam -stand_call_conf 30 -stand_emit_conf 10 -o sorted-out.vcf -L chrX --downsampling_type NONE

and it sort of worked. I do get some InDels picked up. I want to call the same InDels with samtools mpileup. Two reasons:

1. I feel I'm not understanding how InDels/SNP being called in general

2. The actual project is a bit more involved. we are interested in both SNP and InDels some of which will only be found in fewer reads.

This is what I did to try replicate HyplotypeCaller witih samtools mpileup

samtools mpileup -g -f genome.fa sorted.bam > sorted-mpileup.bcf 

bcftools call -c  -V snps sorted-mpileup.bcf | less

I get no InDels found however I do know that there is an InDel from both IGV and GATK HaplotypeCaller. And I also generate pile file and looked into it and it shows me that I do have and InDel...so what is going on...?

Also can anyone reiterate here why mpileup has max per-file depth set to 8000 and how does this effects my SNP/InDel calling given that I have ~ 400,000 reads in my bam files? 

Also also when I do

samtool mpileup -g -f genome.fa sorted.bam > sorted-mpileup..bcf

I get DP tag included as a default and given that I have more than 8000 reads in my bam file DP tag for all SNP is 7999. But the question is what is the different when I include -t DP,DPR option during my mpileup..? because I end up with two DP tags in my vcf file with slightly different values..?

(this is really a testRun data and it is an amplicon data however in future I'll be calling SNP/InDel from genome wide NGS data) 

Note please assume that I'm complete noob in variant calling and provide as much explanation as you can. much appreciate your time in advance 

I understand this is really a few different questions in one, but I believe this will be a good recourse for others to refer to given that SNP/InDels calling is common topic and I couldn't find any good resources that would give a good guide for SNP/InDel calling hence me trying to include as much description about variant calling as I can and hope that others would to through comments and answers.

Cheers, 

ADD COMMENTlink modified 17 months ago by tassa.saldi0 • written 3.3 years ago by Kirill250

I just want to add a bit more info to this question. The main problem is that I can see an InDel in mpileup format. i.e

samtools -f genome.fa sorted.bam > sorted-mpileup.mpileup 

but when I run above line with an option `-b` meaning convert to bcf format I loose that InDel..

Can anyone at least explain what is happening there..?

Cheers

ADD REPLYlink written 3.3 years ago by Kirill250

Kirill, Hi! just very quickly... Did you find any reasonable explanation for the disappearing indels issue? I could not find any other post which would describe the problem in such detail! This would be great if you could share any piece of information you might have found to solve it. Many thanks!

ADD REPLYlink written 19 months ago by gajkun0

@gajkun I pretty much stopped (never really got to use) samtools mpile up. Mainly because it was slow and I found another (easier in my view) solution. Evgeniia (below) gave very good anwser on how to use samtools mpileup with bcftools. I don't do a lot of varinat calling, but when I do I mostly use freebayes https://github.com/ekg/freebayes

ADD REPLYlink written 18 months ago by Kirill250
9
gravatar for Evgeniia Golovina
3.3 years ago by
New Zealand
Evgeniia Golovina940 wrote:

I used samtools 1.1 and bcftools 1.1 with the following commands to call SNP and indels. And it works fine:

1) To call only SNPs:

samtools mpileup --skip-indels -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
bcftools index  <output.bcf> <indexed.bcf>
bcftools call --skip-variants indels --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>

2) To call only Indels:

samtools mpileup -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
bcftools index  <output.bcf> <indexed.bcf>
bcftools call --skip-variants snps --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>

3) To call both SNPs and indels:

samtools mpileup -d 250 -m 1 -E --BCF --output-tags DP,DV,DP4,SP -f <reference genome.fa> -o <output.bcf> <list of input bam files>
bcftools index  <output.bcf> <indexed.bcf>
bcftools call --multiallelic-caller --variants-only  -O v <output.bcf> -o <output.vcf>

 

You can reed more about options there --> C: SNP call using bcftools or in the samtools/bcftools manuals.

I hope, it helps a little.

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Evgeniia Golovina940

@Evgeniia thanks for the answer. I do appreciate it. with samtool mpileup options you suggested, alot of flags are options, the way I understand... e.g -d default to 250 and -m to 1 so they are actually redundant in this case. Also I'm not sure if I need to index my file before bcftools call.. do I ? I followed your option 2 exactly, but still can't get my InDel that I get if I run HaplotyCaller. Can you explain to me why am I not getting the same result and also what is the difference between "multiallelic-caller" (this is what you suggested) and another option bcftools call offer "consensus-caller" ..?

Thanks 

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Kirill250

The consensus-caller assumes only biallelic sites. Multiallelic caller - multiallelic sites.

You can reads about the difference between biallelic and multiallelic sites here --> http://gatkforums.broadinstitute.org/gatk/discussion/6455/biallelic-vs-multiallelic-sites

ADD REPLYlink written 18 months ago by Evgeniia Golovina940
0
gravatar for tassa.saldi
17 months ago by
tassa.saldi0 wrote:

Hi,

Not sure how helpful this is, but I think the disappearing deletions and insertions are due to indels only being called for sites that do not have another type of mutation (i.e. mismatch).

Anyone know how to get mpileup and bcftools to report both an indel and a mismatch at the same site regardless of coverage or percent representation of the mutation (basically to just output everything)?

ADD COMMENTlink written 17 months ago by tassa.saldi0

BBMap's variant-caller has the option to call everything:

callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fasta ploidy=2 clearfilters

By default, though, it does not mask variants because they coincide, so you don't necessarily need "clearfilters". Additionally, there's a "rarity" flag; e.g. "rarity=0.05" would not penalize variants for having a low coverage depth unless the depth was below a 0.05 allele fraction.

ADD REPLYlink written 17 months ago by Brian Bushnell16k

Awesome! I will give it a try. Thanks!

ADD REPLYlink written 17 months ago by tassa.saldi0

Hello, I went ahead and had BBtools installed on our university server (there is a lot of cool stuff in there!). I ran the the following command:

callvariants.sh in=C4_DMS_genome_merge_sorted.sam out=bbtools_out.txt ref=~/Datafiles/human/genome.fa ploidy=2 rarity=0.00001 clearfilters (also tried rarity=0)

and got the following error:

Exception in thread "Thread-5" java.lang.AssertionError: 45 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmC MD:Z:45C104 at stream.MDWalker.fixMatch(MDWalker.java:83) at stream.SamLine.toShortMatch(SamLine.java:1291) at stream.SamLine.toRead(SamLine.java:1971) at stream.SamLine.toRead(SamLine.java:1831) at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270) at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133) Exception in thread "Thread-3" java.lang.AssertionError: 91 CCCCCCCmmmmmmmmmmmmmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmm MD:Z:20A63C59 at stream.MDWalker.fixMatch(MDWalker.java:83) at stream.SamLine.toShortMatch(SamLine.java:1291) at stream.SamLine.toRead(SamLine.java:1971) at stream.SamLine.toRead(SamLine.java:1831) at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270) at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133) Exception in thread "Thread-4" java.lang.AssertionError: 55 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDmmmmmmmmmmmmmmmmmmmmmmmmm MD:Z:55C33C1T0A1C25C30 at stream.MDWalker.fixMatch(MDWalker.java:83) at stream.SamLine.toShortMatch(SamLine.java:1291) at stream.SamLine.toRead(SamLine.java:1971) at stream.SamLine.toRead(SamLine.java:1831) at stream.SamReadStreamer$ProcessThread.makeReads(SamReadStreamer.java:270) at stream.SamReadStreamer$ProcessThread.run(SamReadStreamer.java:133)

Any ideas? Thanks so much!

ADD REPLYlink written 17 months ago by tassa.saldi0

It appears that your sam file is corrupt; the MD tags are incorrect. How were they generated? For those reads which according to the cigar string appear to contain deletions, there is no corresponding "^" symbol in the MD tag. You might be able to fix them with samtools "callmd" function. Unfortunately my assertion messages are not sufficient in this case; I really should have printed out the cigar string and read name so it would be easier to see what's going on and replicate. Perhaps you could use grep to fish out a couple of those lines? For example:

grep MD:Z:55C33C1T0A1C25C30 C4_DMS_genome_merge_sorted.sam
ADD REPLYlink written 17 months ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1751 users visited in the last hour