Calculate individual missingness in vcf NOT using VCFtools
0
0
Entering edit mode
5.0 years ago
evan.hersh • 0

Hello All,

I've got a bunch of SNPs from my ddRAD data, and am attempting to follow a set of filtering guidelines made by the good folks who make dDocent (https://github.com/jpuritz/dDocent/blob/master/tutorials/Filtering%20Tutorial.md). This guide (and many others) suggest filtering individuals that have a lots of missing data. Individual missingness can be easily calculated using VCFtools with the "--missing-indv" parameter. Unfortunately, VCFtools does not support polyploidy, and over half of my samples are polyploids, so I'm trying to find an alternative. I've done a fair amount of googling to see if this can be done using bcftools or vcflib or something, but haven't had any luck. Please let me know if you have any suggestions.

Thanks!

Evan

snp filtering RAD-seq polyploid • 3.2k views
ADD COMMENT
0
Entering edit mode

and over half of my samples are polyploids

can you please please post one line of such VCF ?

ADD REPLY
0
Entering edit mode

Here's the first line from my vcf. SNPs were called using freebayes as part of the dDocent pipeline.

dDocent_Contig_2        61      .       C       A       0.00151623      .       AB=0.0909091;ABP=34.9902;AC=1;AF=0.00325733;AN=307;AO=11;CIGAR=1X;DP=9788;D
PB=9788;DPRA=0.9472;EPP=26.8965;EPPR=21094.7;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=114;NUMALT=1;ODDS=0;PAIRED=1;PAIREDR=0.995084;PAO=0;PQA=0;PQR=0;PRO=0;
QA=259;QR=342663;RO=9765;RPL=3;RPP=7.94546;RPPR=3.14928;RPR=8;RUN=1;SAF=3;SAP=7.94546;SAR=8;SRF=4902;SRP=3.34853;SRR=4863;TYPE=snp;technology.Illumina=1  G
T:DP:AD:RO:QR:AO:QA    0/0:62:62,0:62:2177:0:0 0/0:53:52,0:52:1807:0:0 0/0:64:64,0:64:2336:0:0 0/0:14:14,0:14:432:0:0  0/0:32:32,0:32:1154:0:0 0/0:10:10,0:
10:356:0:0  0/0:57:57,0:57:1995:0:0 0/0:104:104,0:104:3535:0:0      0/0:36:36,0:36:1272:0:0 0/0:62:62,0:62:2207:0:0 0/0:56:56,0:56:1971:0:0 0/0:50:50,0:50:
1749:0:0 0/0:44:44,0:44:1549:0:0 0/0:55:55,0:55:1947:0:0 0/0:22:22,0:22:771:0:0  0/0/0:108:108,0:108:3774:0:0    0/0/0:91:90,1:90:3074:1:14      0/0/0:129:
129,0:129:4512:0:0    0/0/0:146:146,0:146:5151:0:0    0/0/0:99:98,0:98:3406:0:0       0/0/0:95:95,0:95:3352:0:0       0/0/0:111:111,0:111:3835:0:0    0/0/0
:128:128,0:128:4483:0:0    0/0/0:100:98,2:98:3488:2:74     0/0/0:117:117,0:117:3966:0:0    0/0/0:155:155,0:155:5370:0:0    0/0/0:117:117,0:117:4153:0:0    
0/0/0:98:98,0:98:3260:0:0       0/0/0:84:84,0:84:2838:0:0       0/0/0:139:139,0:139:4863:0:0    0/0:32:32,0:32:1139:0:0 0/0:17:17,0:17:625:0:0  0/0:24:24,0
:24:838:0:0  0/0/0:140:138,0:138:4782:0:0    0/0/0:38:38,0:38:1302:0:0       0/0/0:102:102,0:102:3602:0:0    0/0/0:74:74,0:74:2564:0:0       0/0/0:140:140,
0:140:4898:0:0    0/0/0:174:173,0:173:6165:0:0    0/0/0:77:77,0:77:2736:0:0       0/0/0:116:116,0:116:4065:0:0    0/0/0:147:147,0:147:5124:0:0    0/0/0:105
:105,0:105:3661:0:0    0/0/0:64:64,0:64:2228:0:0       0/0/0:138:137,1:137:4913:1:14   0/0/0:138:138,0:138:4881:0:0    0/0/0:92:92,0:92:3370:0:0       0/0/
0:67:67,0:67:2389:0:0       0/0/0:88:88,0:88:3120:0:0       0/0/0:71:71,0:71:2486:0:0       0/0/0:74:74,0:74:2587:0:0       0/0/0:70:70,0:70:2466:0:0
       0/0/0:98:98,0:98:3525:0:0       0/0:44:43,0:43:1523:0:0 0/0:34:33,1:33:1186:1:14        0/0:50:50,0:50:1768:0:0 0/0/0:60:60,0:60:2061:0:0       0/0/
0:95:95,0:95:3322:0:0       0/0/0:145:145,0:145:5096:0:0    0/0/0:92:91,0:91:3250:0:0       0/0/0:119:119,0:119:4123:0:0    0/0:53:53,0:53:1733:0:0 0/0:58:
58,0:58:2059:0:0 0/0:72:72,0:72:2632:0:0 0/1:22:20,2:20:695:2:74 0/0:72:72,0:72:2564:0:0 0/0:26:26,0:26:898:0:0  0/0:26:26,0:26:952:0:0  0/0:40:40,0:40:140
8:0:0 0/0:31:31,0:31:1020:0:0 0/0:56:56,0:56:1907:0:0 0/0:28:27,1:27:931:1:14 0/0:24:24,0:24:860:0:0  0/0:92:92,0:92:3301:0:0 0/0:101:101,0:101:3603:0:0
      0/0:94:94,0:94:3374:0:0 0/0/0:132:132,0:132:4619:0:0    0/0/0:125:125,0:125:4358:0:0    0/0/0:122:121,1:121:4270:1:14   0/0/0:101:100,0:100:3566:0:0
    0/0/0:111:111,0:111:3875:0:0    0/0/0:112:112,0:112:3968:0:0    0/0/0:84:84,0:84:3011:0:0       0/0/0:96:96,0:96:3428:0:0       0/0/0:72:72,0:72:2450:0
:0       0/0/0:104:102,0:102:3582:0:0    0/0/0:91:91,0:91:3189:0:0       0/0/0:144:144,0:144:5119:0:0    0/0/0:114:113,0:113:3952:0:0    0/0/0:100:100,0:10
0:3584:0:0    0/0/0:124:124,0:124:4423:0:0    0/0/0:93:93,0:93:3220:0:0       0/0/0:115:115,0:115:3999:0:0    0/0/0:98:98,0:98:3480:0:0       0/0/0:91:91,0
:91:3164:0:0       0/0/0:108:108,0:108:3722:0:0    0/0/0:106:106,0:106:3606:0:0    0/0/0:92:91,0:91:3175:0:0       0/0/0:112:112,0:112:3999:0:0    0/0/0:13
1:131,0:131:4666:0:0    0/0:116:116,0:116:4051:0:0      0/0/0:159:159,0:159:5586:0:0    0/0:32:32,0:32:1078:0:0 0/0:60:60,0:60:2088:0:0 0/0:22:22,0:22:744:
0:0  0/0/0:88:88,0:88:3134:0:0       0/0/0:112:111,1:111:3726:1:27   0/0/0:149:149,0:149:5236:0:0    0/0/0:130:130,0:130:4658:0:0    0/0/0/0:88:87,1:87:306
6:1:14    0/0/0/0:94:94,0:94:3229:0:0     0/0/0/0:85:85,0:85:2992:0:0     0/0/0/0:59:59,0:59:2058:0:0     0/0/0/0:58:58,0:58:2077:0:0`
ADD REPLY
0
Entering edit mode

unless I'm wrong. I don't see any "missing data" here.

ADD REPLY
0
Entering edit mode

If I understand correctly (I'm new to all of this), the line I just posted is one site (out of >400,000) from one contig (out of ~22,000). I'm also not exactly sure how "individual missingness" is calculated using VCFtools, but I think it takes each individual and calculates the percentage of sites (out of the >400,000 raw SNP sites provided by freebayes) for which that individual has data. I know that I certainly have missing data, because when I follow the SNP-filtering guide posted above (only works on my diploids, but I tested it) I have a few individuals that have close to 50% missing data.

ADD REPLY
0
Entering edit mode

These 0/0 calls are homozygous ref; so, not exactly missing - this information is just a valuable as knowing that a variant / mutation is present. BCFtools should have some way of calculating these. If not, use my script here (the second one): A: calculate Per variant Heterozygosity from VCF file

ADD REPLY

Login before adding your answer.

Traffic: 1505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6