Reducing sequencing errors in .vcf file
2
0
Entering edit mode
9.2 years ago
lcc1844 ▴ 40

I am trying different things in Samtools to get a reasonable .vcf file for 2 human exomes I have sequenced.

Whatever I do I keep getting a file of ~600Mb which has millions of rows and can't fully open in Excel. I would expect ~22,000 variant calls per person so I assume this file is filled with sequencing blips that aren't genuine variants.

The method I have tried are making a pileup and converting a bcf to vcf via the varFilter command:

samtools mpileup -g -f hg19.fa aln.srt.bam > raw.bcf
bcftools view -vcg raw.bcf > var.bcf
bcftools view var.bcf | perl vcfutils.pl varFilter > var-final.vcf

I have also tried this:

samtools mpileup -uf hg19.fa srt.bam | bcftools call -mv - > file.vcf

Can anyone tell me how to best call rare variants in Samtools?

Many thanks

alignment next-gen • 3.0k views
ADD COMMENT
0
Entering edit mode

Just because you got more than expected doesn't mean they are fake. Look into the data and see what is present! Where did you get the idea people carry 22,000 variants?

Your samtools command is fine, other tools will make other lists of variants.

Samtools doesn't know what a rare variant is, you'll have to define rare and do some post-filtering. Maybe you want nonsynonymous exonic variants only?

ADD REPLY
0
Entering edit mode

I think OP needs to follow GATK Best Practices, in that OP needs to annotate variants and then filter by functionality, region, MAF to get to their desired result.

ADD REPLY
0
Entering edit mode

Hi thanks for your reply, I have looked at the data and that is why I think they are fake. I am getting millions of variant calls, the literature suggests and general is that the average exome carries around 22,000 variants (this includes synonymous changes). I also have experience analysing exome data once the variants have been called (by someone else). This is why I was hoping for something similar. I thought using varFilter would help select genuine variants, and I have also used -D to change the depth of the call I want to keep but nothing changed in my output.

I was putting off using GATK because I didn't find it user friendly, not that any of these tools are to someone who has no experience at this! You're right, I do want variants that are nonsyn and exonic. I also want to scream into a pillow.

ADD REPLY
0
Entering edit mode

If the sample was good and the sequencing was done to a sufficient depth, variants can rarely be fake. Your target will need filtering to be found. Try picking only exonic variants - and remember, samtools is really sensitive and good for low coverage data, but GATK is better for high coverage human data. Maybe an expert on samtools can give you better inputs on optimizing it for your case.

GATK might look like it is a bit more complicated, but it is definitely so for a reason. It has a ton of downstream analysis tools and the best documentation I've seen yet. You'd benefit by giving it a go. Most problems you ever face in GATK will have a straightforward solution already discussed on their forums.

ADD REPLY
0
Entering edit mode

Are these millions of sites really variant from reference? Maybe you're seeing all callable sites including those homozygous-reference.

The other possibility is that the BAM is not aligned to this hg19.fa, so you will get basically every site called as mutant. You have to be sure the BAM is aligned to the same reference file as you call variants with. Maybe you can post some of the VCF you think is wrong?

(a lot of GATK's unfriendliness is to prevent these mistakes)

ADD REPLY
1
Entering edit mode
9.2 years ago
Ram 43k

Do not open VCF files in Excel. It's not supposed to work. Use a plain text editor, such as gedit, TextWrangler or Notepad++, or better, use R.

And yes, human exome VCF will contain tons of variants. If you wish to identify rare variants, filter the vcf by MAF using vcftools.

Oh, and do not use Excel for Bioinformatics.

EDIT: For human exomes, you're better off using GATK. Plus, you can use the bundle that contains gold standard indels and SNPs to your advantage.

ADD COMMENT
0
Entering edit mode
9.2 years ago
Vivek ★ 2.7k

Before going at the obvious variant filtering and subsequent annotation steps you could initially supply your exome target regions BED file to samtools mpileup using the -l option to restrict variant calling to the regions of interest.

ADD COMMENT
0
Entering edit mode

This is a good idea too. GATK best practices includes this in the pipeline, so it kinda slipped my mind.

ADD REPLY
0
Entering edit mode

If I'm not wrong, since GATK became pay for use for industry, samtools is the go-to variant calling tool for some pipelines. So that workflow might not be an option for everyone.

ADD REPLY
0
Entering edit mode

Ah, I see. I did not know GATK had that commercial aspect to it - thank you for that information!

ADD REPLY

Login before adding your answer.

Traffic: 2184 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6