Question: Calculate Variant Allele Frequency in a TCGA dataset
14 months ago, mp85 wrote:

Hello everyone,

I have little to no prior knowledge of biology (roughly high school level), but I do have a strong machine learning background. A project I am involved in requires making predictions from a dataset of tumor samples. One of the predictors we need is Variant Allele Frequency (VAF), so I downloaded one tumor dataset from the TCGA data portal to see how this calculation might be done. I understand what Variant Allele Frequency is in general, but I cannot work out how the calculation is done in practice.

The dataset has the following columns (it has many more in fact, but for my needs I just summarized all numerical columns):

Statistic         N      Mean         St. Dev.       Min        Max    
t_depth           2     84.000         36.770         58        110    
t_ref_count       2     70.000         32.527         47         93    
t_alt_count       2     13.500         3.536          11         16     
n_depth           2     88.500         45.962         56        121     
ALLELE_NUM        2     1.000          0.000          1          1     
TRANSCRIPT_STRAND 2     0.000          1.414          -1         1     
PICK              1     1.000                         1          1     
TSL               2     1.000          0.000          1          1     
MINIMISED         2     1.000          0.000          1          1     

I would like to add a column, say named vafs, where the Variant Allele Frequency is calculated for each row (each tumor sample). From my (very basic) understanding, t_ref_count and t_alt_count are the columns needed to calculate the Variant Allele Frequency. Is that correct? Do I need other columns to perform the calculation? And how precisely is this calculation done?

As an aside, I am going to ask a field expert at some point (I am not going to do it all by myself, since I lack the knowledge), but I need to at least grasp how this can be obtained before going any further.

Tags: R, genome
14 months ago, igor wrote:

Yes, you're correct. VAF is t_alt_count / (t_ref_count + t_alt_count).
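
As a minimal sketch of that formula, the snippet below adds a vafs value to each row. The function name variant_allele_frequency and the two example rows are assumptions made for illustration: the counts are taken from the Min and Max columns of the summary table in the question, and pairing them into rows this way is a guess, not the actual TCGA records. Returning None when no reads cover the site is also just one possible convention.

```python
from typing import Optional


def variant_allele_frequency(t_ref_count: int, t_alt_count: int) -> Optional[float]:
    """Fraction of reads supporting the alternate allele in the tumor sample.

    Returns None when there are no reference or alternate reads,
    since the ratio is undefined in that case (an assumed convention).
    """
    total = t_ref_count + t_alt_count
    if total == 0:
        return None
    return t_alt_count / total


# Illustrative rows only: values borrowed from the Min/Max columns of the
# summary table; the pairing into rows is an assumption.
rows = [
    {"t_depth": 58, "t_ref_count": 47, "t_alt_count": 11},
    {"t_depth": 110, "t_ref_count": 93, "t_alt_count": 16},
]

# Add the new "vafs" column to every row.
for row in rows:
    row["vafs"] = variant_allele_frequency(row["t_ref_count"], row["t_alt_count"])

for row in rows:
    print(row["t_ref_count"], row["t_alt_count"], round(row["vafs"], 4))
```

The same one-line ratio carries over directly to a vectorized column operation in R or in a dataframe library; the only extra care needed is for rows where both counts are zero.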

When dealing with allele frequencies, also be careful about the context. The term can refer to the fraction of reads supporting a variant in a single sample (since you are dealing with cancer data, that is probably the case) or to the fraction of individuals in a population who carry a mutation.

14 months ago, poisonAlien commented:

+1 for mentioning the "fraction of individuals with a mutation in a population." VAF can be an ambiguous term without proper context.