Question: Reducing sequencing errors in .vcf file
gravatar for lcc1844
5.0 years ago by
United Kingdom
lcc184430 wrote:

I am trying different things in Samtools to get a reasonable .vcf file for 2 human exomes I have sequenced. 

Whatever I do I keep getting a file of ~600Mb which has millions of rows and can't fully open in Excel. I would expect ~22,000 variant calls per person so I assume this file is filled with sequencing blips that aren't genuine variants. 

The method I have tried are making a pileup and converting a bcf to vcf via the varFilter command: 

samtools mpileup -g -f hg19.fa > raw.bcf


bcftools view -vcg raw.bcf > var.bcf

bcftools view var.bcf | perl varFilter > var-final.vcf


I have also tried this: 

 samtools mpileup -uf hg19.fa srt.bam | bcftools call -mv - > file.vcf


Can anyone tell me how to best call rare variants in Samtools? 

Many thanks 




next-gen alignment • 1.8k views
ADD COMMENTlink modified 5.0 years ago by Vivek2.4k • written 5.0 years ago by lcc184430

Just because you got more than expected doesn't mean they are fake. Look into the data and see what is present! Where did you get the idea people carry 22,000 variants?

Your samtools command is fine, other tools will make other lists of variants.

Samtools doesn't know what a rare variant is, you'll have to define rare and do some post-filtering. Maybe you want nonsynonymous exonic variants only?

ADD REPLYlink written 5.0 years ago by karl.stamm3.6k

I think OP needs to follow GATK Best Practices, in that OP needs to annotate variants and then filter by functionality, region, MAF to get to their desired result.

ADD REPLYlink written 5.0 years ago by RamRS25k

Hi thanks for your reply, I have looked at the data and that is why I think they are fake. I am getting millions of variant calls, the literature suggests and general is that the average exome carries around 22,000 variants (this includes synonymous changes). I also have experience analysing exome data once the variants have been called (by someone else). This is why I was hoping for something similar. I thought using varFilter would help select genuine variants, and I have also used -D to change the depth of the call I want to keep but nothing changed in my output. 

I was putting off using GATK because I didn't find it user friendly, not that any of these tools are to someone who has no experience at this! You're right, I do want variants that are nonsyn and exonic. I also want to scream into a pillow. 

ADD REPLYlink written 5.0 years ago by lcc184430

If the sample was good and the sequencing was done to a sufficient depth, variants can rarely be fake. Your target will need filtering to be found. Try picking only exonic variants - and remember, samtools is really sensitive and good for low coverage data, but GATK is better for high coverage human data. Maybe an expert on samtools can give you better inputs on optimizing it for your case.

GATK might look like it is a bit more complicated, but it is definitely so for a reason. It has a ton of downstream analysis tools and the best documentation I've seen yet. You'd benefit by giving it a go. Most problems you ever face in GATK will have a straightforward solution already discussed on their forums.

ADD REPLYlink written 5.0 years ago by RamRS25k

Are these millions of sites really variant from reference? Maybe you're seeing all callable sites including those homozygous-reference. 

The other possibility is that the BAM is not aligned to this hg19.fa, so you will get basically every site called as mutant. You have to be sure the BAM is aligned to the same reference file as you call variants with.  Maybe you can post some of the VCF you think is wrong?

(a lot of GATK's unfriendliness is to prevent these mistakes. )

ADD REPLYlink written 5.0 years ago by karl.stamm3.6k
gravatar for RamRS
5.0 years ago by
Houston, TX
RamRS25k wrote:

Do not open VCF files in Excel. It's not supposed to work. Use a plain text editor, such as gedit, TextWrangler or Notepad++, or better, use R.

And yes, human exome VCF will contain tons of variants. If you wish to identify rare variants, filter the vcf by MAF using vcftools.

Oh, and do not use Excel for Bioinformatics

EDIT: For human exomes, you're better off using GATK. Plus, you can use the bundle that contains gold standard indels and SNPs to your advantage.

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by RamRS25k
gravatar for Vivek
5.0 years ago by
Vivek2.4k wrote:

Before going at the obvious variant filtering and subsequent annotation steps you could initially supply your exome target regions BED file to samtools mpileup using the -l option to restrict variant calling to the regions of interest.

ADD COMMENTlink written 5.0 years ago by Vivek2.4k

This is a good idea too. GATK best practices includes this in the pipeline, so it kinda slipped my mind.

ADD REPLYlink written 5.0 years ago by RamRS25k

If I'm not wrong, since GATK became pay for use for industry, samtools is the go-to variant calling tool for some pipelines. So that workflow might not be an option for everyone.

ADD REPLYlink written 5.0 years ago by Vivek2.4k

Ah, I see. I did not know GATK had that commercial aspect to it - thank you for that information!

ADD REPLYlink written 5.0 years ago by RamRS25k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2358 users visited in the last hour