Get the Minor Allele Frequency for Specific Populations from NCBI dbSNP
20 months ago
andy.wang ▴ 40

Hello,

I have a long list of SNPs (rs-ids) in a .txt file. I want to get their minor allele frequency (MAF) for a specific population (e.g. 1000Genomes Europe, not Global).

The NCBI batch query service has been down for a long time. It has been replaced by an API, Variation Services, but the "frequency" service is still under construction. I then downloaded all the data from dbSNP directly, hoping to query it locally. The best I can get is still just the Global MAF, not the MAF for a specific population such as Europe.

I also tried some R packages such as "rsnps", but they too work fine for the global MAF and not for the European MAF. If you know of a package that does this, please tell me so I can try it; I would appreciate it a lot.

Finally, I moved to web scraping and queried the NCBI website directly (append the rs-id to "https://www.ncbi.nlm.nih.gov/snp/" and extract the 1000Genomes Europe MAF). There is a limit on query speed, however: the best I could do is 3 queries per second, and anything faster causes an error from the NCBI server. If the number of SNPs becomes very large, this method is unrealistic (for 1 million SNPs, I estimate it would need about 20 days).
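
As a rough check on the rate-limit arithmetic above, here is a minimal Python sketch. The function names are made up for illustration, and it assumes a perfectly steady request rate with no retries or parsing overhead, so it gives a lower bound and not a prediction.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_needed(n_queries: int, queries_per_second: float) -> float:
    """Wall-clock days needed to issue n_queries at a steady rate."""
    return n_queries / queries_per_second / SECONDS_PER_DAY


def effective_rate(n_queries: int, days: float) -> float:
    """Average queries per second implied by finishing n_queries in `days`."""
    return n_queries / (days * SECONDS_PER_DAY)


# Best case: 1 million rs-ids at the 3 queries/s ceiling mentioned in the post.
print(f"{days_needed(1_000_000, 3.0):.1f} days at 3 queries/s")

# The 20-day estimate corresponds to a much lower sustained rate.
print(f"{effective_rate(1_000_000, 20):.2f} queries/s implied by 20 days")
```

At the 3 queries/s ceiling the request time alone is just under 4 days. A 20-day run would correspond to roughly 0.6 queries/s sustained, which is plausible once page download and parsing time are included. Either way, a local lookup avoids the limit entirely.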

Do you have any suggestions on how to get the MAF for a specific population? Thank you very much for your help.

SNP snp dbSNP NCBI MAF • 1.3k views

Via biomaRt in R, you can obtain the global MAF (see A: How to retrieve Gene name from SNP ID using biomaRt); however, this is not what you need.

If I were you, I would take the time to set up an ANNOVAR installation on your computer. With this, you can easily annotate genetics data in many ways, including MAFs for populations from projects such as 1000 Genomes and GME (Greater Middle East), among others. Take a look: https://doc-openbio.readthedocs.io/projects/annovar/en/latest/

Another option, of course, is Ensembl's VEP (Variant Effect Predictor).

Kevin


Thank you very much for your response. I will take a look at them.

19 months ago
andy.wang ▴ 40

Recently, I also tried the SNPnexus website to test how well it returns MAFs for a list of rs-ids. Overall, the good thing about SNPnexus is that it is fast. My concern is that the results might be incomplete.

It requires an input format of two columns, e.g. "dbsnp rs100". I tested lists of 100, 1,000, 10,000, and 30,000 SNPs in turn (I haven't tried the maximum input size of 100,000). SNPnexus returns the results within a few seconds. However, I have two concerns:

1. I input a list of 31,351 rs-ids, but SNPnexus returned only 30,806 SNPs. The count is the same on repeated runs. The same thing happens with other input sizes (for example, an input of 1,000 returns 980). I haven't yet checked which SNPs are missing.
2. With the same input of 31,351 rs-ids, the number of SNPs returned with a MAF differs between runs. Certainly some SNPs are not in the 1000 Genomes study or have no MAF, but I don't know why I got 4,070 SNPs on the first try and 3,938 on the second with the same input. For now, my guess is that some are duplicates caused by multiple alternative alleles or, more likely, that some SNPs are missing.
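
One way to investigate both concerns is to compare the submitted rs-ids with the ones that come back. The following is a minimal Python sketch, not anything SNPnexus provides: `compare_ids` is a hypothetical helper, and the rs-ids in the demo are placeholders. It assumes you have already read both lists into memory, for example from the input file and from the rs-id column of the downloaded results.

```python
from collections import Counter


def compare_ids(submitted, returned):
    """Compare submitted rs-ids with the rs-ids found in a result table.

    Returns (missing, duplicated, unexpected):
      missing    - submitted ids that never appear in the results
      duplicated - ids that appear more than once, with their counts
      unexpected - returned ids that were not submitted
    """
    submitted_set = set(submitted)
    counts = Counter(returned)
    missing = sorted(submitted_set - set(counts))
    duplicated = {rsid: n for rsid, n in counts.items() if n > 1}
    unexpected = sorted(set(counts) - submitted_set)
    return missing, duplicated, unexpected


# Placeholder ids, only to show the shape of the output.
submitted = ["rs100", "rs200", "rs300", "rs400"]
returned = ["rs100", "rs200", "rs200", "rs400"]

missing, duplicated, unexpected = compare_ids(submitted, returned)
print("missing:", missing)
print("duplicated:", duplicated)
print("unexpected:", unexpected)
```

Running this on the outputs of two separate submissions would show whether the run-to-run difference comes from duplicated rows (for example, one row per alternative allele) or from ids that are dropped. Ids listed as unexpected might be worth checking for rs-ids that were merged and reported under a different identifier.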

I will keep digging more to figure out what happens and update my findings here.

19 months ago
andy.wang ▴ 40

Final update:

I did a lot of research and tried many approaches, and I ended up with the two most accessible ones. I am recording my findings here:

1. Thanks to Pierre's answer. If you go to dbSNP's ftp server and download the latest (20130502) files, there is, for each chromosome, a list of SNPs with rs-ids and their allele frequencies. For now, I just used gunzip -c xxx.vcf.gz | grep -v ^# | cut -f1-8 | head -10 to check the first 10 lines and 8 columns. The 8th column does contain all the frequencies I want. Because this approach queries the data locally, I am not too worried about speed as long as our lab's CPU and memory are reliable. One thing worth noting is that the SNPs are in build 37. If you want build 38, further checking may be needed.

2. Many thanks to Kevin's reply on this page. Ensembl's VEP is certainly another good option. So far I have only tried 1000 SNPs, and it looks good. You can output the result in .txt format and extract the information you need easily. I haven't tried a much larger number of SNPs. VEP looks like a full-featured tool, and you can use it in 3 ways: on the web, locally, or through the API.
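
For the local approach in point 1, the gunzip/grep/cut check can be extended to pull out only the rs-ids of interest. Below is a minimal Python sketch under a few assumptions: the file is a standard tab-separated VCF (ID in column 3, INFO in column 8), `select_records` and `select_from_vcf` are hypothetical helper names, and the demo lines are invented for illustration and are not real 1000 Genomes records.

```python
import gzip


def select_records(lines, wanted_ids):
    """Yield (rsid, info_string) for VCF data lines whose ID is in wanted_ids."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        # The ID column may hold several identifiers separated by ';'.
        for rsid in fields[2].split(";"):
            if rsid in wanted_ids:
                yield rsid, fields[7]


def select_from_vcf(vcf_gz_path, wanted_ids):
    """Stream a gzipped VCF from disk without loading it into memory."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        yield from select_records(handle, wanted_ids)


# Invented lines, only to show the expected layout.
demo_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\trs1\tA\tG\t100\tPASS\tEUR_AF=0.5\n",
    "1\t2000\trs2\tC\tT\t100\tPASS\tEUR_AF=0.25\n",
]
wanted = {"rs2"}  # a set keeps each membership test cheap
for rsid, info in select_records(demo_lines, wanted):
    print(rsid, info)
```

With real data, the rs-id list would be read from the .txt file into a set and passed to `select_from_vcf` once per chromosome file. Each file is read sequentially, so memory use stays small even for a large list.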

I expect that our team will come back to this question and review these approaches very soon. Anyway, thanks to everyone for your attention.

Andy


Hey, of course, the 1000 Genomes Phase III data also contains the allele frequencies. However, Pierre's answer and link are old; so, follow Step 1 here, and you will be able to download all of the most recent release from 2013: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

I just checked the data on my disk and indeed it contains (copied from header):

##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
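
To read one of these fields out of the INFO column, something like the following Python sketch could be used. The helper names are hypothetical and the example INFO strings are invented to match the header above. Because the fields are declared with Number=A, each holds one value per ALT allele, so a comma-separated list is possible at multi-allelic sites.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def population_afs(info, key="EUR_AF"):
    """Return the per-ALT-allele frequencies stored under `key`.

    Missing values ('.') become None so positions still line up with ALT.
    Returns an empty list when the key is absent.
    """
    value = parse_info(info).get(key)
    if value is None or value is True:
        return []
    return [None if x == "." else float(x) for x in value.split(",")]


# Invented INFO strings for illustration.
single = "AC=10;AN=5008;EAS_AF=0;EUR_AF=0.002;AFR_AF=0;AMR_AF=0;SAS_AF=0"
multi = "AC=5,3;AN=5008;EUR_AF=0.1,0.05;MULTI_ALLELIC"

print(population_afs(single))
print(population_afs(multi))
print(population_afs(single, key="AFR_AF"))
```

Note that these values are frequencies of the ALT alleles, which are not always the minor alleles.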


Hi Andy,

Thank you so much for this post.

For several days now, I have been trying to extract the EUR MAF for a list of rsIDs. What I noticed from all the Biostars posts and replies is that everyone suggests the well-known databases for retrieving this information. The 1000G VCF files from the ftp server, which I also downloaded for my chromosomes of interest, do give the population AFs, but when I double-checked in the 1000G browser, those AFs are not necessarily the MAF for that specific SNP.

I checked VEP, dbSNP and the 1000G browser.

Could you please let me know if you were able to extract the EUR MAF, and not just the EUR AF, for your variants of interest? For example, the minor allele is not necessarily the reference allele, so different databases report different values.
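
The AF-versus-MAF distinction raised here can be made concrete with a small Python sketch. It assumes the population AF values are frequencies of the ALT alleles (as in the 1000G VCF headers quoted earlier in this thread), so the REF frequency is 1 minus their sum. For multi-allelic sites it takes the second most frequent allele as the minor allele, which is one common convention and may not match every database. `minor_allele` is a hypothetical helper and the example values are invented.

```python
def minor_allele(ref, alts, alt_afs):
    """Return (allele, frequency) for the second most frequent allele.

    ref     - reference allele, e.g. "A"
    alts    - list of ALT alleles, e.g. ["G"]
    alt_afs - ALT allele frequencies in the same order, e.g. [0.83]
    """
    if len(alts) != len(alt_afs):
        raise ValueError("need one frequency per ALT allele")
    ref_af = max(0.0, 1.0 - sum(alt_afs))  # guard against rounding below zero
    alleles = [(ref, ref_af)] + list(zip(alts, alt_afs))
    alleles.sort(key=lambda pair: pair[1], reverse=True)
    return alleles[1]


# ALT is rare: the reported AF is already the MAF.
print(minor_allele("C", ["T"], [0.002]))

# ALT is common: the minor allele is REF and the MAF is 1 - AF.
allele, freq = minor_allele("A", ["G"], [0.83])
print(allele, round(freq, 4))
```

For a bi-allelic SNP this reduces to min(AF, 1 - AF), which would explain why a file's EUR_AF and a browser's EUR MAF disagree whenever the ALT allele is the common one in that population.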

Any help would be greatly appreciated.