I have been looking for a way to find which nsSNPs (identified by rs ID, e.g. rs769971095) belong to which population(s) and, if possible, which gender. I have come across vcftools, but I am struggling to see how to use it for this. If I am right, this information can be taken from the VCF files here: "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp//release/20130502/supporting/functional_annotation" (basically from the "filtered" and "unfiltered" folders at that link?). Could you please let me know what the solution might be? Thank you so much! Please let me know if anything is not clear.
ENSEMBL's browser allows you to search for lots of information on SNPs, including allele frequencies in each population group studied by 1000 Genomes: http://grch37.ensembl.org/index.html
Other than that, you could literally run a program like ANNOVAR or Variant Effect Predictor on all 1000 Genomes SNPs and short indels if you wanted very comprehensive annotation. Take a look at my thread here to see how you could download 1000 Genomes in VCF format and then annotate it. In my protocol, you can download a PED file, which contains gender-specific information for each of the 1000 Genomes samples.
Edit: SNPs don't 'belong' to any particular population. The vast majority have allele frequencies that vary between populations, higher in some than in others. A minority of SNPs have 0% frequency in certain groups, as they have only been encountered in very isolated population groups.
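As a rough illustration of how the PED file mentioned above could be used, the Python sketch below maps each sample ID to its population and gender. The column names ("Individual ID", "Population", "Gender") and the 1 = male / 2 = female coding are assumptions about the PED header, so check them against the file you actually download. `load_sample_info` and the path `samples.ped` are hypothetical names, and the two demo rows are made up purely to show the format.

```python
import csv
import io


def load_sample_info(ped_handle):
    """Map sample ID -> (population, gender) from a tab-delimited PED file.

    Assumes a header row with 'Individual ID', 'Population' and 'Gender'
    columns, and gender coded as 1 (male) or 2 (female).
    """
    gender_labels = {"1": "male", "2": "female"}
    info = {}
    for row in csv.DictReader(ped_handle, delimiter="\t"):
        info[row["Individual ID"]] = (
            row["Population"],
            gender_labels.get(row["Gender"], "unknown"),
        )
    return info


# Made-up rows, only to demonstrate the expected layout.
demo_ped = io.StringIO(
    "Family ID\tIndividual ID\tPaternal ID\tMaternal ID\tGender\tPopulation\n"
    "FAM_A\tSAMPLE_A\t0\t0\t1\tGBR\n"
    "FAM_B\tSAMPLE_B\t0\t0\t2\tPJL\n"
)
demo_info = load_sample_info(demo_ped)

# With a real file (hypothetical path):
# with open("samples.ped", newline="") as handle:
#     sample_info = load_sample_info(handle)
```

Once you know which samples carry the variant allele in the VCF genotype columns, this lookup tells you each carrier's population and gender.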
Thanks much! The link you provided is also helpful. However, could you please share a way, given an rs number (let's say "rs769971095"), to get the allele frequencies for each population it belongs to, from the VCF files or from a source you know is better? Also, with help from your post I could plot it too. Thanks!
For that particular SNP, it appears that the C allele dominates all populations. The only population where T appears is SAS (South Asian). You can take a look here: http://grch37.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=3:12625875-12626875;v=rs769971095;vdb=variation;vf=135759093
I got that information by following my link to ENSEMBL (above), searching for your SNP and then clicking on 'Population Genetics' when the search record appeared.
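The SAS group contains: Gujarati Indian from Houston, Texas (GIH); Punjabi from Lahore, Pakistan (PJL); Bengali from Bangladesh (BEB); Sri Lankan Tamil from the UK (STU); and Indian Telugu from the UK (ITU).
source: http://www.internationalgenome.org/category/phenotype/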
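If you would rather pull the numbers from the VCF files themselves, here is a minimal Python sketch. It assumes the 1000 Genomes Phase 3 VCFs carry per-super-population frequencies in the INFO column under the keys AF, AFR_AF, AMR_AF, EAS_AF, EUR_AF and SAS_AF; confirm this against the header of your own files. The function names and the file path are hypothetical. A variant that is not in the 1000 Genomes call set will simply not be returned.

```python
import gzip

# INFO keys assumed to hold overall and super-population allele frequencies.
FREQ_KEYS = ("AF", "AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "SAS_AF")


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def parse_info_frequencies(info_field):
    """Extract the frequency keys from one INFO string.

    Each value is a list with one entry per ALT allele; '.' becomes None.
    """
    freqs = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key in FREQ_KEYS:
            freqs[key] = [None if v == "." else float(v) for v in value.split(",")]
    return freqs


def frequencies_for_rsids(lines, wanted):
    """Scan VCF lines and return REF, ALT and frequencies for the wanted rs IDs."""
    wanted = set(wanted)
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        # The ID column can hold several identifiers separated by ';'.
        for rsid in set(fields[2].split(";")) & wanted:
            found[rsid] = {
                "ref": fields[3],
                "alt": fields[4].split(","),
                "freqs": parse_info_frequencies(fields[7]),
            }
    return found


# With a real file (hypothetical path):
# with open_vcf("chr3.vcf.gz") as handle:
#     hits = frequencies_for_rsids(handle, ["rs769971095", "rs559632360"])
```

The resulting dictionary is easy to hand to a plotting library, with one bar per super-population for each rs ID. Scanning whole chromosome files line by line is slow, so for many lookups an indexed query (for example with tabix) would be the more practical route.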
Thanks much! Based on your suggestion I think I am getting closer to results. Could you please comment on the following related queries:
a) I found the respective population genetics info for 2 rsIDs: rs559632360 & rs769971095
For rs769971095 the super-populations shown are: ALL, AFR, AMR, ASJ, EAS, FIN, NFE, OTH, SAS. For rs559632360 the super-populations shown are: ALL, AFR, AMR, EAS, SAS, EUR.
For rs559632360, it also shows population genetics from "1000 Genomes Project Phase 3 & gnomAD exomes" along with "subpopulation" information, whereas for rs769971095 it shows only "gnomAD exomes" population genetics.
http://grch37.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=3:12625875-12626875;v=rs769971095;vdb=variation;vf=135759093
http://grch37.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=3:12632759-12633759;v=rs559632360;vdb=variation;vf=92299087#population_freq_SAS
Does this mean that for "rs769971095" there is no "1000 Genomes Project Phase 3" data available? I am interested to know whether these two rsIDs belong to one population. Can it be said that these rsIDs share the same population? If yes, which population do they share? It would be great to know how to make a reasonable interpretation of this.
Also, I need to do this for many rsIDs. Could you please let me know how this process can be automated?
Thanks much! DK
Yes, they appear to generally have similar frequencies across each population. The only difference is that rs559632360 has a higher frequency in Non-Finnish Europeans (NFE), whereas rs769971095 has a higher frequency in SAS (South Asians). I would not read too much into the fact that one was listed under 1000 Genomes Phase 3 whilst the other was not. As far as I know, they are still documenting all variants identified in 1000 Genomes and many may not have even made it to dbSNP yet (or else they have already been identified by other projects, like gnomADe).
Automating this process is not easy! This tool on ENSEMBL's website may be what you need (it outputs all sorts of info, including allele frequencies; check the output options further down the page): http://grch37.ensembl.org/Homo_sapiens/Tools/VEP?db=core
Alternatively, there is this, which allows you to look up SNPs within defined regions.
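For many rsIDs, one scripted route that may be worth trying is the Ensembl REST API. The sketch below assumes the GRCh37 server at https://grch37.rest.ensembl.org and the `variation/human/<rsID>?pops=1` endpoint, and that each entry in the returned "populations" list carries "population", "allele" and "frequency" fields; check the current REST documentation before relying on it. `fetch_variant` and `population_table` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed GRCh37 REST server; adjust if you work on a different assembly.
SERVER = "https://grch37.rest.ensembl.org"


def fetch_variant(rsid, server=SERVER):
    """Download the variation record for one rs ID, including population data."""
    url = f"{server}/variation/human/{rsid}?pops=1"
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def population_table(record):
    """Reshape a variation record into {population: {allele: frequency}}."""
    table = {}
    for entry in record.get("populations", []):
        table.setdefault(entry["population"], {})[entry["allele"]] = entry["frequency"]
    return table


# Example loop (needs network access, so left commented out):
# for rsid in ["rs769971095", "rs559632360"]:
#     for population, alleles in sorted(population_table(fetch_variant(rsid)).items()):
#         print(rsid, population, alleles)
```

Keeping the download and the reshaping in separate functions means you can cache the raw JSON once and re-run the summary or plotting step without querying the server again. For long lists, pause between requests so you stay within the service's rate limits.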
Hope that this helps somewhat.