Question: Getting 1000 Genomes phase one MAF values
3.3 years ago by
eurioste20 wrote:

I need the global minor allele frequencies (MAF) for all SNPs in the 1000 Genomes phase 1 low-coverage data. I have the 1000 Genomes .vcf files. What is the easiest way to get these values? Is there a suggested tool?

1000 genomes maf • 1.8k views
modified 3.1 years ago by Cyriac Kandoth5.5k • written 3.3 years ago by eurioste20

So, the question is, how do I get the MAF from a VCF? It doesn't matter that this VCF is the 1000 genomes vcf, right?

written 3.3 years ago by WouterDeCoster44k

Perhaps you could clarify this question? Just so you know, 1000 genomes does not strictly refer to humans. I assume you are talking about humans, though.

Anyway, for the purposes of the forum, it would be useful if you explained what you are trying to accomplish, why you are trying to accomplish it, what you have tried, and... well, what organism you are working with is always helpful.

Explaining what you mean by "phase 1" would also be helpful, so that people don't have to look it up. I looked it up, and read about it here. But, well... it's not clear to me why anyone would care about that. As far as I can tell, phase 1 is a preliminary, inaccurate part of the human 1000 genomes project. Why would you want to use that for anything, when there are subsequent, more accurate phases?

modified 3.3 years ago • written 3.3 years ago by Brian Bushnell17k

1000 genomes does not strictly refer to humans

Doesn't it? As far as I know, the 1000 Genomes Project covers humans only, while the 10K genome project also includes other species.

written 3.3 years ago by WouterDeCoster44k

So, I used to think "thousand genomes" applied to humans only. Then I started working at JGI, and found out that there are other projects called "thousand genomes" that are related to other organisms (such as Aspergillus). My co-workers were baffled when I assumed that when they said "thousand genomes", they were talking about human genomes.

I'm not sure how important this is in most of the world. When I was at UT Southwestern, "thousand genomes" strictly meant a human project. But at JGI, it strictly means not a human project. So, I think it is useful to specify the organism, and also to provide a link to the project, to prevent unnecessary confusion.

written 3.3 years ago by Brian Bushnell17k

Alright - I wasn't aware of other thousand genome projects.

written 3.3 years ago by WouterDeCoster44k

I see similar "vocabulary bubbles" in people who study cancer genetics (like me) vs people who spent their careers studying genetics of germline/mendelian diseases. Every research institute invents its own vocabulary. Clearly, we don't get out much. :)

written 3.1 years ago by Cyriac Kandoth5.5k
3.1 years ago by
Cyriac Kandoth5.5k
Memorial Sloan Kettering, New York, USA
Cyriac Kandoth5.5k wrote:

If you already have the 1000 Genomes Phase 1 VCFs downloaded, then the global allele frequency is stored in the AF tag of the INFO column. The VCF format is complex, so don't try to write your own code to parse out INFO/AF. If you need a flattened tab-delimited format, use a tool like bcftools query. Download and install bcftools as explained here, and then you can use a command like this:

bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%AF\t%AC\t%AN\n' ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz

This generates output that looks like this:

22      16050408        T       C       0.06    134     2184
22      16050984        C       G       0.0023  5       2184
22      16051722        TA      T       0.01    32      2184
22      16052239        A       G       0.46    1010    2184
22      16053659        A       C       0.76    1655    2184

The columns are CHROM, POS, REF, ALT, AF, AC, AN. The ALT allele count (AC) and total allele count (AN) are useful to know even though you only need AF, which equals AC/AN (the AF values in the output above are rounded).
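As a quick sanity check on the AF = AC/AN relationship, here is a minimal Python sketch. The function name af_from_counts is hypothetical, and the three rows are copied from the example output above; it only shows that the reported AF matches AC/AN up to rounding.

```python
def af_from_counts(ac: int, an: int) -> float:
    """Return the ALT allele frequency as AC / AN."""
    if an <= 0:
        raise ValueError("AN must be positive")
    return ac / an


# (AC, AN, AF as printed) copied from the example bcftools output
rows = [(134, 2184, 0.06), (5, 2184, 0.0023), (1655, 2184, 0.76)]

for ac, an, reported in rows:
    # Print the unrounded ratio next to the AF value reported in the VCF
    print(f"AC/AN = {af_from_counts(ac, an):.4f}  reported AF = {reported}")
```

If you need more decimal places than the AF field provides, computing AC/AN yourself is one option.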

An important point is that AF is the frequency of the ALT allele. But sometimes the REF is the minor allele. For example, if AF is 0.76, then it is too common in the population to be called the "minor" allele. So if AF is greater than 0.50, then set MAF=(1-AF). Otherwise MAF=AF. I'm not very experienced in germline genetics, so there may be other caveats. But this should be enough to get you started.
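The AF-to-MAF rule described here can be sketched in Python as follows. This is only an illustration, not part of bcftools: the function names maf_from_af and add_maf_column are hypothetical, and the code assumes the seven-column tab-delimited layout produced by the query command above, with a single AF value per row (one ALT allele per site).

```python
def maf_from_af(af: float) -> float:
    """Fold an ALT allele frequency into a minor allele frequency."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"AF out of range: {af}")
    # If ALT is the more common allele, REF is the minor one
    return 1.0 - af if af > 0.5 else af


def add_maf_column(rows):
    """Yield each tab-delimited row with a MAF column appended.

    Assumes columns CHROM, POS, REF, ALT, AF, AC, AN and one AF per row.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        maf = maf_from_af(float(fields[4]))  # AF is the fifth column
        yield "\t".join(fields + [f"{maf:.4g}"])


if __name__ == "__main__":
    # Two rows copied from the example output
    sample = [
        "22\t16050408\tT\tC\t0.06\t134\t2184",
        "22\t16053659\tA\tC\t0.76\t1655\t2184",
    ]
    for line in add_maf_column(sample):
        print(line)
```

For real data you could pass an open file handle holding the bcftools query output to add_maf_column instead of the sample list.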

modified 3.0 years ago • written 3.1 years ago by Cyriac Kandoth5.5k

Hi Cyriac

You explained this very well. I can't get MAF in my VCF annotation from either VEP or Annovar; instead, I get 1000G_All. Is the formula you mentioned above [MAF = 1-AF] correct? I haven't been able to find any documentation on it. It would be really helpful if you could send a link or a document that describes this formula.


written 2.2 years ago by gs9590