Filter out low coverage and minor alleles of a frequency above 40%.
4.8 years ago
pennakiza ▴ 60

Hello,

I was wondering whether anyone could help me filter my VCF file (from freebayes) so that I can check for heterogeneity in my single-genome sequencing data.

I assume I should use:

vcftools --vcf my.vcf --maf 0.4 --recode --recode-INFO-all --out my_filtered

but I am not sure how to add an option that also filters out low-coverage sites.

Thanks!

vcf variant calling freebayes HIV

Comment:

Can you provide more detail on your starting VCF, and exactly how you would like to filter it?

Depth of coverage is (generally) a sample-level statistic. There isn't one coverage value for the whole variant row; there's a value for every genotype call in the row. Assuming you have a multi-sample VCF, what do you want to do with low coverage genotypes? Remove the whole row if any genotypes have low coverage? Change the genotype to missing if the coverage is low? Etc.

On the other hand, you may have a DP statistic in the INFO field that represents something like the mean coverage across all samples; is this what you're referring to?
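The two policies mentioned above (drop the whole row, or set only the low-coverage genotypes to missing) behave quite differently on a multi-sample file. Here is a minimal Python sketch of both, assuming tab-separated VCF data lines whose FORMAT column contains GT and DP. The function names, the threshold, and the two-sample row are hypothetical and only there for illustration; this is not how any particular tool implements it.

```python
def sample_depths(fields):
    """Return the FORMAT-level DP of each sample in a split VCF data line."""
    keys = fields[8].split(":")
    dp_index = keys.index("DP")
    depths = []
    for sample in fields[9:]:
        values = sample.split(":")
        # Missing or truncated entries are treated as depth 0.
        raw = values[dp_index] if dp_index < len(values) else "."
        depths.append(int(raw) if raw.isdigit() else 0)
    return depths


def drop_row_if_any_low(line, min_dp):
    """Policy 1: return None when any sample falls below min_dp."""
    fields = line.rstrip("\n").split("\t")
    if any(dp < min_dp for dp in sample_depths(fields)):
        return None
    return line.rstrip("\n")


def mask_low_genotypes(line, min_dp):
    """Policy 2: keep the row but set low-depth genotypes to missing."""
    fields = line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    gt_index = keys.index("GT")
    for i, dp in enumerate(sample_depths(fields)):
        if dp < min_dp:
            values = fields[9 + i].split(":")
            values[gt_index] = "./."
            fields[9 + i] = ":".join(values)
    return "\t".join(fields)


# Hypothetical two-sample row, for illustration only.
row = "chr1\t100\t.\tA\tG\t50\t.\tDP=35\tGT:DP\t0/1:30\t1/1:5"
print(drop_row_if_any_low(row, 10))
print(mask_low_genotypes(row, 10))
```

With a single-sample VCF the two policies mostly coincide, since there is only one genotype per row to judge.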

Reply:

So, I've got a VCF file of a single genome (HIV provirus), and what I would like to do is confirm that there really is only one genome by checking whether any well-supported variants remain. In this context, I thought that if I get rid of any rows with low coverage and any minor alleles that appear at a frequency below 40%, I can get an idea of how "clean" my sample is. My VCF looks like this:

K03455|HIVHXB2CG 6237 . GA GGGATCAGGAA 10230.7 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=314;CIGAR=1M9I1M;DP=324;DPB=1799.5;DPRA=0;EPP=11.9728;EPPR=0;GTI=0;LEN=9;MEANALT=9;MQM=59.9108;MQMR=0;NS=1;NUMALT=1;ODDS=500.205;PAIRED=0.996815;PAIREDR=0;PAO=22;PQA=828;PQR=828;PRO=22;QA=11521;QR=0;RO=0;RPL=126;RPP=29.5935;RPPR=0;RPR=188;RUN=1;SAF=201;SAP=56.5641;SAR=113;SRF=0;SRP=0;SRR=0;TYPE=ins GT:DP:RO:QR:AO:QA:GL 1/1:324:0:0:314:11521:-1036.08,-121.014,0
K03455|HIVHXB2CG 6252 . TTGTGGAGATGGGG TG 8560.2 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=269;CIGAR=1M12D1M;DP=280;DPB=49.6429;DPRA=0;EPP=4.82659;EPPR=0;GTI=0;LEN=12;MEANALT=2;MQM=59.8959;MQMR=0;NS=1;NUMALT=1;ODDS=378.905;PAIRED=0.996283;PAIREDR=0;PAO=1.5;PQA=37;PQR=0;PRO=0.5;QA=9548;QR=0;RO=0;RPL=129;RPP=3.98706;RPPR=0;RPR=140;RUN=1;SAF=158;SAP=20.8422;SAR=111;SRF=0;SRP=0;SRR=0;TYPE=del GT:DP:RO:QR:AO:QA:GL 1/1:280:0:0:269:9548:-862.024,-81.8802,0
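
One thing worth keeping in mind: as far as I know, the vcftools --maf option works from the called genotypes, not from read counts, so on a single sample it may not reflect how many reads support each allele. A read-level check can instead use the RO, AO and DP values that freebayes writes per sample. Below is a minimal Python sketch of that idea. The function names and the interpretation of the 40% cutoff (a site looks mixed when at least two alleles each reach 40% of the reads) are assumptions for illustration, and the example values are copied from the first record above.

```python
def allele_fractions(format_field, sample_field):
    """Return (ref_fraction, [alt_fractions]) from freebayes RO, AO and DP."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    depth = int(values["DP"])
    if depth == 0:
        return 0.0, []
    ref_fraction = int(values["RO"]) / depth
    # AO is comma-separated when a site has more than one ALT allele.
    alt_fractions = [int(ao) / depth for ao in values["AO"].split(",")]
    return ref_fraction, alt_fractions


def looks_mixed(format_field, sample_field, min_fraction=0.4):
    """True when at least two alleles each reach min_fraction of the reads."""
    ref_fraction, alt_fractions = allele_fractions(format_field, sample_field)
    supported = [f for f in [ref_fraction] + alt_fractions if f >= min_fraction]
    return len(supported) >= 2


# FORMAT and sample columns of the record at position 6237.
fmt = "GT:DP:RO:QR:AO:QA:GL"
sample = "1/1:324:0:0:314:11521:-1036.08,-121.014,0"
ref_fraction, alt_fractions = allele_fractions(fmt, sample)
print(ref_fraction, alt_fractions, looks_mixed(fmt, sample))
```

For this record 314 of 324 reads support the insertion and none support the reference, so under the stated assumption the site would not be flagged as mixed.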

Answer:
4.8 years ago
bari.ballew ▴ 460

Since it looks like you are using a single-sample VCF, this is pretty straightforward. Just add something like --min-meanDP n, where n is the depth cutoff, to your vcftools command to filter out low depth sites. From the vcftools documentation:

--min-meanDP <float>
--max-meanDP <float>

Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site.
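
To make the documented behaviour concrete, here is a minimal Python sketch of a mean-depth filter following the description quoted above: it averages the FORMAT-level DP over all samples in a row and keeps the row when that mean lies within the inclusive bounds. It is not the vcftools implementation. The function names are made up, every sample is assumed to carry an integer DP, and the example rows are abbreviated (the third one is hypothetical, added only to show a site being dropped).

```python
def mean_sample_depth(line):
    """Mean of the FORMAT-level DP values over all samples in one data line."""
    fields = line.rstrip("\n").split("\t")
    dp_index = fields[8].split(":").index("DP")
    # Assumes every sample entry has an integer DP value.
    depths = [int(sample.split(":")[dp_index]) for sample in fields[9:]]
    return sum(depths) / len(depths)


def filter_by_mean_depth(lines, min_mean_dp=None, max_mean_dp=None):
    """Keep header lines, plus data lines whose mean depth is within bounds."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        mean_dp = mean_sample_depth(line)
        if min_mean_dp is not None and mean_dp < min_mean_dp:
            continue
        if max_mean_dp is not None and mean_dp > max_mean_dp:
            continue
        kept.append(line)
    return kept


vcf_lines = [
    "##fileformat=VCFv4.2",
    "K03455|HIVHXB2CG\t6237\t.\tGA\tGGGATCAGGAA\t10230.7\t.\tDP=324\tGT:DP\t1/1:324",
    "K03455|HIVHXB2CG\t6252\t.\tTTGTGGAGATGGGG\tTG\t8560.2\t.\tDP=280\tGT:DP\t1/1:280",
    # Hypothetical low-depth site, not taken from the thread.
    "K03455|HIVHXB2CG\t7000\t.\tA\tG\t30.0\t.\tDP=12\tGT:DP\t0/1:12",
]
for kept_line in filter_by_mean_depth(vcf_lines, min_mean_dp=100):
    print(kept_line)
```

With one sample per row the mean is simply that sample's DP, which is why the option is enough for a single-sample file.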
