How To Select The Top 30% Variants (Based On Quality) From A Vcf File?
4
2
Entering edit mode
10.8 years ago
William ★ 5.2k

How can I select the top 30% variants (based on quality) from a VCF file?

So all the the variants (or positions) that have a variant quality score (or Quality by Depth(QD) score ?) that is in the top 30% of all quality scores (or QDscores).

vcf • 5.2k views
ADD COMMENT
3
Entering edit mode
10.8 years ago

"in the top 30% of all quality scores":

gunzip -c input.vcf.gz | awk -F '        '  -v M=` gunzip -c  input.vcf.gz | grep -v -E "^#" | cut -d ' ' -f 6 | sort -k1n | tail -1` '/^#/ {print;next;} { if(($6)>= M*2.0/3.0) print;}'
ADD COMMENT
0
Entering edit mode

Hi Pierre, An extension to the original question if I may... Would you happen to know how to get a value representing the top 10% of Phred-scaled strand bias P-value from a VCF? I would do it in excel land look at the SP value under the FORMAT column in the vcf. Wondering if there would be a command line method..

ADD REPLY
3
Entering edit mode
10.8 years ago

If you know the minimum quality, you can use vcftools:

vcftools --gzvcf input.vcf.gz --minQ <float> ...

To find the range of your qualities you can use --site-quality option in vcftools, in a separate command.

ADD COMMENT
2
Entering edit mode
10.8 years ago

I wouldn't recommend hard filtering variants. If you are working with human data you can use VSQR to model false positives and false negatives.

Of course I am a huge hypocrite and commonly hard filter VCFs.

Care to tell us a little more about why you want 30%?

ADD COMMENT
1
Entering edit mode

Because I want to use VQSR and don't have an external truth set for indels. So I want to use the top x percent of my indels as a truth set to get VQSR started. Luckily I have a BAC contig based call set for couple of MB region to to see if this works and which truth sensitivity level I should take.

ADD REPLY
1
Entering edit mode

Gotcha. It would be nice if you could describe your approach as may non-model organism folks, including myself, are thinking about how to tackle this problem.

ADD REPLY
0
Entering edit mode

Sure I'll add a reply here or a how to post.

ADD REPLY
0
Entering edit mode

Known, truth, and training sets is used by the algorithm for modelling. What if I do not have this for my organism in question. How sure can I be that the top x% of the variants are fit to be used as the training set?

ADD REPLY
1
Entering edit mode
10.8 years ago
William ★ 5.2k

I've decided to use the variants with a QD above 30 using GATK select variant:

GenomeAnalysisTK.jar -T SelectVariants -R reference.fa --variant input.vcf -o input_QDAbove30.vcf -select "QD > 30.0"
ADD COMMENT

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6