Unusual reports from "bcftools stats" (making me question my data)
2
0
Entering edit mode
3.4 years ago
rightmirem ▴ 70

I've generated a bunch (70) of VCF files containing SNP data for individual humans (diploid, one sample per file). I've also merged all 70 samples into a single BCF file.

The issue is; when I ran bcftools stats I got some unusual (I think impossible) Allele Frequency data (based on what I see directly in the file).

Heres a line from both the individual sample, and the compined samples. These are indicative of the entire file.

Sample: Individual, diploid (human)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample-63
1       12383   .       G       A       36.9773 .       DP=76; VDB=0.320743; SGB=-0.693145; RPB=0.870155; MQB=0.979967; MQSB=0.839692; BQB=0.935797; MQ0F=0.0526316; ICB=1; HOB=0.5; AC=1; AN=2; DP4=19,14,21,20; MQ=4       GT:PL   0/1:67,0,5


Sample: 70 Individuals, diploid (human)

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample-1     Sample-2     Sample-3     Sample-4     Sample-5     Sample-6     Sample-7     Sample-8     Sample-9     Sample-10     Sample-11     Sample-12     Sample-13     Sample-14     Sample-15     Sample-16     Sample-17     Sample-18     Sample-19     Sample-20     Sample-21     Sample-22     Sample-23     Sample-24     Sample-25     Sample-26     Sample-27     Sample-28     Sample-29     Sample-30     Sample-31     Sample-32     Sample-33     Sample-34     Sample-35     Sample-36     Sample-37     Sample-38     Sample-39     Sample-40     Sample-41     Sample-42     Sample-43     Sample-44     Sample-45     Sample-46     Sample-47     Sample-48     Sample-49     Sample-50     Sample-51     Sample-52     Sample-53     Sample-54     Sample-55     Sample-56     Sample-57     Sample-58     Sample-59     Sample-60     Sample-61     Sample-62     Sample-63     Sample-64     Sample-65     Sample-66     Sample-67     Sample-68     Sample-69     Sample-70

1   12383   .       G       A       115     PASS    VDB=0.350935; SGB=-0.693147;RPB=0.996445; MQB=0.597159; MQSB=0.155977; BQB=0.983101; MQ0F=0.00813008; ICB=1; HOB=0.5; MQ=8; DP=2869; DP4=569,817,659,762; AN=56; AC=31      GT:PL   0/1:110,0,39    0/1:58,0,7      ./.:.   ./.:.   ./.:.   0/1:84,0,78     ./.:.   0/1:95,0,0      ./.:.   ./.:.   ./.:.   0/1:148,0,23    ./.:.   ./.:.   ./.:.   0/1:88,0,12     0/1:77,0,92     0/1:93,0,12     0/1:74,0,67     0/1:88,0,13     ./.:.   ./.:.   0/1:70,0,49     0/1:69,0,61     1/1:128,14,0    ./.:.   ./.:.   ./.:.   ./.:.   0/1:72,0,12     ./.:.   0/1:68,0,15     0/1:88,0,46     ./.:.   ./.:.   ./.:.   0/1:113,0,60    ./.:.   ./.:.   ./.:.   1/1:93,1,0      ./.:.   ./.:.   0/1:82,0,80      ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   ./.:.   0/1:97,0,42     ./.:.    0/1:53,0,16     ./.:.   0/1:75,0,28     0/1:85,0,47     ./.:.   0/1:111,0,121   1/1:99,20,0     0/1:67,0,5      0/1:62,0,0      ./.:.   0/1:72,0,55     ./.:.   ./.:.   ./.:.   ./.:.


The important thing of note here is; both the individual file, and the 70-samples file have been filtered for SNPs only (so, the only GT data is either 0/X or X/X...but no 0/0). There are the ./.:. data which (based on my understanding) means the individual file was REF/REF for that point - and since only SNPs are being captured (NOT 0/0)...there is no data for that sample at POS=12383. Also note the AN=56 and the AC=31 8in the combined file line).

The definition of AN (as I understand it, as confirmed by the lines in MY files) is that AN is the sum of all the IDENTIFIED alleles in a line, regardless of whether they were REF or ALT. So 0/1 (AN=2)and 1/1 (AN=2). ACis all the IDENTIFIED REF in a line. So 0/1 (AC=1)and 1/1 (AC=2). AF(Allele frequency) is AC/AN.

So, (again, as I understand it) in the individual file The AN will ALWAYS be 2 (since REF/REF or 0/0 were all filtered out). And the AC can ONLY be 1 or 2 (not three unless trisomy, and not 0 because filtered).

Also notice in the combined file, the ./.:. refers to a sample that was REF/REF (and thus not included in the file). But if you look at the AC and AN in the combined line, the ./.:. are NOT included in those counts.

QUESTIONS:

(one) Since, in the individual file(s), AN=2 and AC > 0 - it is impossible for the AF to be less than 0.5 (Q1: correct?)

(two) Since ALL the data in the combined file comes from similarly filtered individual files, the same is true for the combined file (AF cannot be less than 0.5) (Q2: Correct?).

It IS possible for the AN of a single line (70 samples) to be as high as 140 (70 samples x 2 chromosomes)...however (since the filter excludes 0/0) then the minimum AC would be 70. No matter how you sliced it, AF >= 0.5

(three) So, a quick glance at the output of bcftools stats tells me something's wrong, because theyre claiming I have a number of SNPs at AF=0.1, AF=0.11, AF=0.12

The question is; based on the above, where/how is it calculating that AF? Am I doing it wrong, is bcftools doing it wrong (based on the AN and AC in the lines themselves), or is the stats portion doing something else?

    # AF, Stats by non-reference allele frequency:
# AF    [2]id   [3]allele frequency     [4]number of SNPs       [5]number of transitions        [6]number of transversions      [7]number of indels     [8]repeat-consistent    [9]repeat-inconsistent  [10]not applicable
AF      1       0.100000        374     137     237     0       0       0       0
AF      1       0.110000        652     241     411     0       0       0       0
AF      1       0.120000        612     224     388     0       0       0       0
AF      1       0.130000        467     157     310     0       0       0       0
AF      1       0.140000        734     272     462     0       0       0       0
AF      1       0.150000        407     140     267     0       0       0       0
AF      1       0.160000        787     274     513     0       0       0       0
AF      1       0.170000        340     115     225     0       0       0       0
AF      1       0.180000        526     152     374     0       0       0       0
AF      1       0.190000        880     334     546     0       0       0       0
AF      1       0.200000        313     115     198     0       0       0       0
AF      1       0.210000        451     156     295     0       0       0       0
(...snip...)
AF      1       0.480000        1020    396     624     0       0       0       0
AF      1       0.490000        8252448 5590474 2661974 0       0       0       0
AF      1       0.500000        96973   65778   31195   0       0       0       0
AF      1       0.510000        316516  214884  101632  0       0       0       0
AF      1       0.520000        103985  68975   35010   0       0       0       0
AF      1       0.530000        67025   43946   23079   0       0       0       0
AF      1       0.520000        103985  68975   35010   0       0       0       0
AF      1       0.530000        67025   43946   23079   0       0       0       0
AF      1       0.540000        50260   32635   17625   0       0       0       0
AF      1       0.550000        50590   32823   17767   0       0       0       0
AF      1       0.560000        31551   20233   11318   0       0       0       0
AF      1       0.570000        31220   19894   11326   0       0       0       0
AF      1       0.580000        13029   8194    4835    0       0       0       0
AF      1       0.590000        26025   16457   9568    0       0       0       0
AF      1       0.600000        12164   7604    4560    0       0       0       0
(...snip...)
AF      1       0.970000        529     318     211     0       0       0       0
AF      1       0.980000        73      40      33      0       0       0       0
AF      1       0.990000        169643  100968  68675   0       0       0       0

SNP next-gen alignment software error • 2.2k views
4
Entering edit mode
3.4 years ago

Hello again,

Welcome to Bioinformatics.

Also notice in the combined file, the ./.:. refers to a sample that was REF/REF (and thus not included in the file). But if you look at the AC and AN in the combined line, the ./.:. are NOT included in those counts.

This is not true. This annotation means that there was not sufficient data in the sample such that any call —reference or otherwise— could be made. Think of it as 'NA'. These cannot reliably be assumed to be either ref or alt alleles. They are just 'missing'. In light of this, I literally counted my own AC/AN for your variant at position 12383 and I got:

• AN=56
• AC=31

So, the numbers in your VCF are correct.

(one) Since, in the individual file(s), AN=2 and AC > 0 - it is impossible for the AF to be less than 0.5 (Q1: correct?)

Yes, for germline DNA and when we're referring to a single sample, we would expect the AF to be either 0.5 or 1.0. The AF will most likely drift from this when we merge multiple samples together (see below), and also at multi-allelic sites, i.e., where more than one variant is called at the same position in the same individual (however, it is good practice to split these multi-allelic sites into multiple lines). The AF will also differ immensely when analysing, for example, a bulk tumour biopsy, where you would in fact be analysing millions of different tumour clones together.

(two) Since ALL the data in the combined file comes from similarly filtered individual files, the same is true for the combined file (AF cannot be less than 0.5) (Q2: Correct?).

Now that we're referring to the situation where we have merged multiple samples, the AF will most like change. If I merge a VCF of my variants with those of my 3 brothers and 3 of my cousins (I have >40 in fact), look at the merged records in my 'pseudo VCF':

CHR POS    REF ALT Me    Brother1 Brother2 Brother3 Cousin1 Cousin2 Cousin3
1   12345  T   G   0/1   0/1      0/1      0/1      0/0     0/0     ./.

• AN is 6 * 2 = 12
• AC is 4
• AF is 4/12 = 0.333 recurring

Cousin3's data is not included as the information was missing. It's NA.

It IS possible for the AN of a single line (70 samples) to be as high as 140 (70 samples x 2 chromosomes)...however (since the filter excludes 0/0) then the minimum AC would be 70. No matter how you sliced it, AF >= 0.5

No, 0/0 is counted as 2 alleles... it's just that they are both reference alleles.

(three) So, a quick glance at the output of bcftools stats tells me something's wrong, because theyre claiming I have a number of SNPs at AF=0.1, AF=0.11, AF=0.12

The question is; based on the above, where/how is it calculating that AF? Am I doing it wrong, is bcftools doing it wrong (based on the AN and AC in the lines themselves), or is the stats portion doing something else?

See my example above.

0
Entering edit mode

I had typed out an answer but Kevin beat me (and did a better job). I think it's probably still worth providing a link to the VCF documentation for future reference on what the fields and terms within the VCF file refer to.

1
Entering edit mode

Hey Russ, my apologies. I think and work quite rapidly. Please feel free to provide your answer if you still wish. It's good to get all opinions/ideas out there.

1
Entering edit mode

No need to apologize! You provided a nice succinct answer, my additions would be redundant :) Cheers.

0
Entering edit mode
3.4 years ago
rightmirem ▴ 70

Hi @kevin

Wow and wow. Great response. And everything you say makes sense.

"Welcome to bioinformatics" Ya :D (LOL) Thanks!

(So, two more quick questions below)

===

One thing for clarification (I think my OP was unclear on this point)...due to filtering, my merged BCF file has NO 0/0 or 0|0data in it. All samples in all lines are either 0/X or X/X(In fact, I don't even have 0|X or X|X data [ with the pipe ]...just forward slash data).

I just confirmed this using:

bcftools view MergedCALL.bcf | grep -P ".*0[/\|]0.*"


...which responded with no lines found. So, (again, just to confirm my thinking) if NO lines in the merged file have 0/0or 0|0, then we're back to a minimum AF of 0.5 (or greater)....correct?

I DO have multiallelic and INDEL data in the file however, but it's a small percentage...

bcftools query -f '%ALT\n' MergedCALL.bcf | grep -P ".{2,}" | wc -l
32939


But out of a 13765007 line file, this is only 0.0024% ... BUT the anomalous data above is not in % but in counts of variants. And for all counts where AF < 0.49 the count is an average of < 1000 (NOTE*: it's less than 0.49 , NOT less than or equal to 0.49...this is important for question 2).

Q1: Could the multiallelic data be the cause for the small number (~ less than 1000) of variant counts AF<0.49?

Q2: AF=0.49 (NOT AF=0.5) has a large variant count. And AF=0.5 has an unexpectedly LOW count. What could cause THAT? (I would expect the AF=0.5 to be the first, large count.)

(snip)
AF  0.5 12610
AF  0.49    3679530
AF  0.48    489
(snip)