Getting 1000 Genomes phase one MAF values
1
0
6.9 years ago
eurioste ▴ 20

I need to get the global 1000 Genomes phase 1 minor allele frequencies for all 1000 Genomes low-coverage phase 1 SNPs. I have the 1000 Genomes .vcf files. What is the easiest way to get these values? Is there a tool you would suggest?

maf 1000 genomes • 3.5k views
ADD COMMENT
1

So, the question is, how do I get the MAF from a VCF? It doesn't matter that this VCF is the 1000 genomes vcf, right?

ADD REPLY
0

Perhaps you could clarify this question? Just so you know, 1000 genomes does not strictly refer to humans. I assume you are talking about humans, though.

Anyway, for the purposes of the forum, it would be useful if you explained what you are trying to accomplish, why you are trying to accomplish it, what you have tried, and... well, what organism you are working with is always helpful.

Explaining what you mean by "phase 1" would also be helpful, so that people don't have to look it up. I looked it up, and read about it here. But, well... it's not clear to me why anyone would care about that. As far as I can tell, phase 1 is a preliminary, inaccurate part of the human 1000 genomes project. Why would you want to use that for anything, when there are subsequent, more accurate phases?

ADD REPLY
0

1000 genomes does not strictly refer to humans

Doesn't it? As far as I know, the 1000 Genomes Project is human-only, while the 10K genome project also includes other species.

ADD REPLY
2

So, I used to think "thousand genomes" applied to humans only. Then I started working at JGI, and found out that there are other projects called "thousand genomes" that are related to other organisms (such as Aspergillus). My co-workers were baffled when I assumed that when they said "thousand genomes", they were talking about human genomes.

I'm not sure how important this is in most of the world. When I was at UT Southwestern, "thousand genomes" strictly meant a human project. But at JGI, it strictly means not a human project. So, I think it is useful to specify the organism, and also to provide a link to the project, to prevent unnecessary confusion.

ADD REPLY
1

Alright - I wasn't aware of other thousand genome projects.

ADD REPLY
0

I see similar "vocabulary bubbles" in people who study cancer genetics (like me) vs people who spent their careers studying genetics of germline/mendelian diseases. Every research institute invents its own vocabulary. Clearly, we don't get out much. :)

ADD REPLY
1
6.6 years ago

If you already have the Phase 1 1000 Genomes VCFs downloaded, then the global allele frequency is stored in the AF tag of the INFO column. The VCF format is complex, so don't try to write your own code to parse out INFO/AF. If you need a flattened tab-delimited format, use a tool like bcftools query. Download and install bcftools as explained here, and then you can use a command like this:

bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%AF\t%AC\t%AN\n' ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz

This generates output that looks like this:

22      16050408        T       C       0.06    134     2184
22      16050984        C       G       0.0023  5       2184
22      16051722        TA      T       0.01    32      2184
22      16052239        A       G       0.46    1010    2184
22      16053659        A       C       0.76    1655    2184

The columns are CHROM, POS, REF, ALT, AF, AC, AN. The ALT allele count (AC) and total allele count (AN) are useful to know, even though you're only seeking AF, because the AF stored in these files is AC/AN rounded (for example, 134/2184 is about 0.0614, reported as 0.06).

An important point is that AF is the frequency of the ALT allele, but sometimes REF is the minor allele. For example, if AF is 0.76, then the ALT allele is too common in the population to be called the "minor" allele. So if AF is greater than 0.50, set MAF=(1-AF); otherwise MAF=AF. Note that this only holds for sites with a single ALT allele, where the REF and ALT frequencies sum to 1. I'm not very experienced in germline genetics, so there may be other caveats. But this should be enough to get you started.
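To make the MAF rule above concrete, here is a minimal sketch that appends a MAF column to the tab-delimited output shown earlier. It assumes the seven-column layout CHROM, POS, REF, ALT, AF, AC, AN, recomputes the frequency from AC/AN rather than using the rounded AF, and simply skips any record with more than one ALT allele. The function name add_maf is made up for illustration; it is not part of bcftools.

```shell
# add_maf: hypothetical helper (not part of bcftools).
# Reads tab-delimited CHROM POS REF ALT AF AC AN lines on stdin
# and prints each line with an extra MAF column appended.
add_maf() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    # Assumption: records with several ALT alleles (comma in ALT) are skipped,
    # because MAF = 1 - AF is only valid when there are exactly two alleles.
    if ($4 ~ /,/ || $7 == 0) next
    af = $6 / $7                      # ALT frequency from AC/AN (stored AF is rounded)
    maf = (af > 0.5) ? 1 - af : af    # use the rarer of the two alleles
    printf "%s\t%.4f\n", $0, maf
  }'
}
```

You could then pipe the query from above into it, e.g. `bcftools query --format '%CHROM\t%POS\t%REF\t%ALT\t%AF\t%AC\t%AN\n' ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | add_maf`. How to handle multi-allelic sites properly is left open here.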

ADD COMMENT
0

Hi Cyriac

You explained this very well. I'm not able to get MAF in my VCF annotation from either VEP or Annovar. Instead, I'm getting 1000G_All. So is the formula you mentioned above [MAF = 1-AF] exactly correct? I have not been able to find any documentation on it. It would be really helpful if you could send a link or a document that describes this formula.

Thanks

ADD REPLY
