1000Genomes population allele frequencies for list of SNPs
5.4 years ago
biosol ▴ 170

Hi all,

I have a list of more than 100 SNPs (rsXXXXXX) and I would like to obtain the allele frequencies of each of them in each of the 1000Genomes populations (ideally not manually). Is there any tool, R package, etc. that can download them all at once? I thought I could maybe obtain those frequencies from the UCSC database or directly from the 1000Genomes database, but I'm open to any suggestions. Thank you very much in advance!

snps 1000Genomes • 5.0k views
5.4 years ago

Solution 1: The raw variant call data can be downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ . Once you have those files, the INFO column in each VCF file contains superpopulation allele frequencies, and there are a bunch of tools which can look up the INFO column entry for a particular rsID. The one complication is that, since there's a separate VCF file per chromosome, you may first need to figure out which rsIDs are on which chromosomes.
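For Solution 1, here is a minimal sketch of the INFO lookup using only awk, in case you don't want to install anything. It assumes the superpopulation frequencies are stored as INFO keys ending in AF (AF, EAS_AF, AMR_AF, AFR_AF, EUR_AF, SAS_AF) and that the ID column holds a single rsID per record; `info_afs` is just a name made up for this example.

```shell
# info_afs: hypothetical helper, not part of any tool.
#   $1    = file with one rsID per line
#   stdin = uncompressed VCF text
# Prints CHROM, POS, ID, REF, ALT plus every INFO entry named AF or <POP>_AF.
info_afs() {
  awk -F '\t' -v OFS='\t' '
    NR == FNR { if ($1 != "") want[$1] = 1; next }   # first input: load the rsID list
    /^#/ { next }                                    # skip VCF header lines
    ($3 in want) {                                   # exact match on the ID column
      out = $1 OFS $2 OFS $3 OFS $4 OFS $5
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^([A-Z]+_)?AF=/) out = out OFS kv[i]
      print out
    }
  ' "$1" -
}
```

You would then run something like `gzip -dc <one per-chromosome VCF>.vcf.gz | info_afs rslist.txt` once per chromosome file. Dedicated VCF tools will be faster on files this size; this only shows the idea.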

Solution 2: With plink 2.0, http://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 provides a single merged dataset containing all chromosomes (download the boldfaced links, then rename phase3_corrected.psam to all_phase3.psam). The command

plink2 --pfile all_phase3 vzs --extract [your list of rsIDs] --export vcf

exports a VCF with only the rsIDs you care about; the precomputed superpopulation allele frequencies will be in the INFO column of this freshly generated VCF. You can also define your own populations with --keep and compute allele frequencies on the fly with --freq.
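A rough sketch of the --keep plus --freq route, for one superpopulation at a time. It assumes the .psam has a header line with a column named SuperPop and the sample ID in the first column; check your copy, since I'm going from memory. `make_keep_list` and `superpop_freq` are hypothetical helper names, and the keep_*/freq_* file names are made up too.

```shell
# make_keep_list: hypothetical helper.
#   $1 = superpopulation label (e.g. EUR), $2 = path to the .psam file
# Finds the SuperPop column by header name (an assumption about the file
# layout) and prints the first-column sample ID of every matching row.
make_keep_list() {
  awk -v pop="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SuperPop") c = i; next }
    c && $c == pop { print $1 }
  ' "$2"
}

# superpop_freq: hypothetical wrapper; needs plink2 and the all_phase3 fileset.
#   $1 = superpopulation label, $2 = file with one rsID per line
superpop_freq() {
  make_keep_list "$1" all_phase3.psam > "keep_$1.txt"
  plink2 --pfile all_phase3 vzs \
         --keep "keep_$1.txt" \
         --extract "$2" \
         --freq \
         --out "freq_$1"
}
```

Calling `superpop_freq EUR rslist.txt` should leave the frequencies in freq_EUR.afreq. The same pattern works for any sample subset you can write out as a list of IDs.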


Sorry, I don't know what is going wrong, but my plink2 reports that it doesn't recognise the "--pfile" option. I'm running the command in a folder containing all_phase3.pgen.zst, all_phase3.psam, all_phase3.pvar.zst and my list of rsIDs. Is it possible that it isn't recognising some of the files, or is it more likely a problem with my plink2 installation?

Thanks again :)

  1. You need to decompress the .pgen.zst file first; see the instructions at the top of the resources page.

  2. This requires plink 2.0, not 1.9. What do you get when you type "plink2 --version"?
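To rule out the file side of the two points above, a small pre-flight check along these lines may help. `check_pfile` is a hypothetical helper written for this reply, not a plink2 feature; it only tests that the files plink2 is expected to open exist under the given prefix.

```shell
# check_pfile: hypothetical helper. $1 = fileset prefix, e.g. all_phase3.
# Prints one line per missing file and returns non-zero if anything is missing.
check_pfile() {
  status=0
  [ -f "$1.pgen" ] || { echo "missing $1.pgen (decompress $1.pgen.zst first)"; status=1; }
  [ -f "$1.psam" ] || { echo "missing $1.psam"; status=1; }
  [ -f "$1.pvar" ] || [ -f "$1.pvar.zst" ] || { echo "missing $1.pvar or $1.pvar.zst"; status=1; }
  return "$status"
}

# Typical sequence (not run here):
#   plink2 --zst-decompress all_phase3.pgen.zst > all_phase3.pgen
#   check_pfile all_phase3 && plink2 --version
```

If the check passes and --pfile is still rejected, the binary on your PATH is most likely not a 2.0 build.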

5.4 years ago
$ while read -r R ; do wget -q -O - "https://www.ncbi.nlm.nih.gov/snp/${R}?download=frequency" | grep -E '^(#Study|1000Genomes)' | sed "s/^/${R}\t/" ; done < rslist.txt

rs25    #Study  Population  Group   Samplesize  Ref Allele  Alt Allele  BioProject ID   BioSample ID
rs25    1000Genomes Global  Study-wide  5008    T=0.485 C=0.515 PRJEB6930   SAMN07490465
rs25    1000Genomes African Sub 1322    T=0.493 C=0.507     SAMN07486022
rs25    1000Genomes East Asian  Sub 1008    T=0.474 C=0.526     SAMN07486024
rs25    1000Genomes Europe  Sub 1006    T=0.521 C=0.479     SAMN07488239
rs25    1000Genomes South Asian Sub 978 T=0.52  C=0.48      SAMN07486027
rs25    1000Genomes American    Sub 694 T=0.38  C=0.62      SAMN07488242
rs26    #Study  Population  Group   Samplesize  Ref Allele  Alt Allele  BioProject ID   BioSample ID
rs26    1000Genomes Global  Study-wide  5008    T=0.335 C=0.665 PRJEB6930   SAMN07490465
rs26    1000Genomes African Sub 1322    T=0.404 C=0.596     SAMN07486022
rs26    1000Genomes East Asian  Sub 1008    T=0.291 C=0.709     SAMN07486024
rs26    1000Genomes Europe  Sub 1006    T=0.341 C=0.659     SAMN07488239
rs26    1000Genomes South Asian Sub 978 T=0.36  C=0.64      SAMN07486027
rs26    1000Genomes American    Sub 694 T=0.22  C=0.78      SAMN07488242
rs27    #Study  Population  Group   Samplesize  Ref Allele  Alt Allele  BioProject ID   BioSample ID
rs27    1000Genomes Global  Study-wide  5008    G=0.283 C=0.717 PRJEB6930   SAMN07490465
rs27    1000Genomes African Sub 1322    G=0.355 C=0.645     SAMN07486022
rs27    1000Genomes East Asian  Sub 1008    G=0.284 C=0.716     SAMN07486024
rs27    1000Genomes Europe  Sub 1006    G=0.261 C=0.739     SAMN07488239
rs27    1000Genomes South Asian Sub 978 G=0.29  C=0.71      SAMN07486027
rs27    1000Genomes American    Sub 694 G=0.16  C=0.84      SAMN07488242
rs28    #Study  Population  Group   Samplesize  Ref Allele  Alt Allele  BioProject ID   BioSample ID
rs28    1000Genomes Global  Study-wide  5008    C=0.517 T=0.483 PRJEB6930   SAMN07490465
rs28    1000Genomes African Sub 1322    C=0.601 T=0.399     SAMN07486022
rs28    1000Genomes East Asian  Sub 1008    C=0.476 T=0.524     SAMN07486024
rs28    1000Genomes Europe  Sub 1006    C=0.517 T=0.483     SAMN07488239
rs28    1000Genomes South Asian Sub 978 C=0.53  T=0.47      SAMN07486027
rs28    1000Genomes American    Sub 694 C=0.40  T=0.60      SAMN07488242

Thank you so much for this solution @Pierre. Could you please tell me how I can write this to a file instead of printing it on my screen? I need to get the frequencies for around 500 variants. Thank you so much!


how can i write this to a file versus printing it on my screen

https://www.tecmint.com/linux-io-input-output-redirection-operators/
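In short, you put a `>` redirect after the loop. A sketch of the same loop wrapped in functions so the redirect is easy to see; `fetch_page`, `collect_freqs` and the output name frequencies.tsv are names made up for this example, and the URL is the one from the answer above.

```shell
# fetch_page: hypothetical helper wrapping the download for a single rsID.
fetch_page() {
  wget -q -O - "https://www.ncbi.nlm.nih.gov/snp/${1}?download=frequency"
}

# collect_freqs: hypothetical helper. $1 = file with one rsID per line.
# Keeps the header and 1000Genomes rows and prefixes each with the rsID.
collect_freqs() {
  while read -r R; do
    [ -n "$R" ] || continue          # skip blank lines in the list
    fetch_page "$R" | awk -v id="$R" '/^(#Study|1000Genomes)/ { print id "\t" $0 }'
  done < "$1"
}
```

Then `collect_freqs rslist.txt > frequencies.tsv` writes everything to the file instead of the terminal; use `>>` instead of `>` if you want to append to an existing file. For around 500 variants it may be polite to add a short `sleep` inside the loop.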
