Question: Getting Allele Frequencies From 1000 Genomes
20
gravatar for Stephen
8.0 years ago by
Stephen2.7k
Charlottesville Virginia
Stephen2.7k wrote:

Hi folks.

I'm fine-mapping an association to a gene using next-generation sequencing data in African-Americans. I have a list of a few thousand variants on chromosome 12 listed by position based on UCSC hg19 build 36.

I want to retrieve the reference/variant alleles and minor allele frequency from 1000 genomes project for YRI samples for comparison to my own sequencing data. How might I best do this without downloading the 1000 genomes data and recomputing allele frequencies? Is there a way to query ensembl or UCSC for this information?

Thanks in advance.

genome hapmap ucsc maf ensembl • 17k views
ADD COMMENTlink modified 2.0 years ago by Isaac Joseph80 • written 8.0 years ago by Stephen2.7k

just to check do you mean ncbi36/hg18 or GRCh37/hg19

ADD REPLYlink written 8.0 years ago by Laura1.7k
24
gravatar for Stephen
8.0 years ago by
Stephen2.7k
Charlottesville Virginia
Stephen2.7k wrote:

Thanks for all the responses. I actually got a very helpful response from a friend of mine. It was dead simple to download and compile vcftools and tabix on my virtual linux system. It took about 30 seconds to download my genotype calls of interest from 1000 genomes using tabix:

tabix -f -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/AFR.2of4intersection_allele_freq.20100804.genotypes.vcf.gz 12:101266000-101422000 > afr.vcf

Then less than a second to analyze this data for allele frequency using vcftools:

vcftools --vcf afr.vcf --freq --out freq-afr

Which gives me exactly what I want.

CHROM    POS    N_ALLELES    N_CHR    {ALLELE:FREQ}
12    101266049    2    348    A:0.87069    G:0.12931
12    101266292    2    348    A:0.954023    T:0.045977
12    101266307    1    10    G:1
12    101266725    1    10    T:1
12    101266729    2    348    G:0.991379    A:0.00862069
12    101266756    2    348    T:0.985632    G:0.0143678

Now, I have frequency info for my data mapped to UCSC hg19/b36. Now I just need to figure out how to match these up to the 1000 genomes frequencies using their positions.

ADD COMMENTlink written 8.0 years ago by Stephen2.7k
2

+1. I usually recommend to directly look at the 1000g ftp rather than from a 3rd-party database. Nonetheless, you should read the policy when you use 1000g data in large scale.

ADD REPLYlink written 8.0 years ago by lh331k
6
gravatar for Bert Overduin
8.0 years ago by
Bert Overduin3.6k
Edinburgh Genomics, The University of Edinburgh
Bert Overduin3.6k wrote:

Using the Ensembl Perl API:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# get a variation adaptor
my $va = $reg->get_adaptor("human", "variation", "variation");

# fetch the variation by name
my $v = $va->fetch_by_name('rs123');

# get all the alleles
my @alleles = @{$v->get_all_Alleles()};

foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
        print
          $v->name, "\t",
          $allele->allele, "\t",
          (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
          $allele->population->name, "\n";
    }
}

Or starting out from the position of a variant instead of the rs number:

#!/usr/local/bin/perl
use strict;
use warnings;

use Bio::EnsEMBL::Registry;

my $reg = 'Bio::EnsEMBL::Registry';

$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

my $sa = $reg->get_adaptor("human", "core", "slice");
my $vfa = $reg->get_adaptor("human", "variation", "variationfeature");

my $chromosome = '7';
my $position = 24966446;

my $slice = $sa->fetch_by_region('chromosome', $chromosome, $position, $position);

my @vfs = @{$vfa->fetch_all_by_Slice($slice)};

foreach my $vf(@vfs){

  my @alleles = @{$vf->variation->get_all_Alleles()};

  foreach my $allele(@alleles) {
    if($allele->population && $allele->population->name =~ /1000GENOMES.*YRI/){
      print
        $vf->seq_region_name, "\t",
        $vf->seq_region_start, "\t",         
        $vf->seq_region_end, "\t",
        $vf->variation->name, "\t",
        $allele->allele, "\t",
        (defined($allele->frequency) ? $allele->frequency : "-"), "\t",
        $allele->population->name, "\n";
    }
  }
}

Ensembl questions can also be posted to the Ensembl Helpdesk or the Ensembl developers mailing list.

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by Bert Overduin3.6k

Thanks Bert. I've never used the Ensembl Perl API before. It looks like this is fetching by RS number. If all I have is a list of positions (UCSC hg19/b36) on chromosome 12, is there an easy modification to this perl script that wouldn't require figuring out how to get RS numbers for those positions?

ADD REPLYlink written 8.0 years ago by Stephen2.7k

Stephen, I added a second code example that starts with a genomic position instead of a rs number. Please note that in Ensembl we distinguish variations and variation features. A variation feature is the genomic mapping of a variation (one variation can in principle map to multiple positions in the genome!). For more information regarding the Ensembl API, please have a look at http://www.ensembl.org/info/data/api.html. If you have more questions, these can be directed to the Ensembl Helpdesk (helpdesk@ensembl.org).

ADD REPLYlink written 8.0 years ago by Bert Overduin3.6k
3
gravatar for David Quigley
8.0 years ago by
David Quigley11k
San Francisco
David Quigley11k wrote:

In Pierre's answer to this question 1000 genome allele frequencies are being returned (along frequencies from other data sources) for a query to ensembl. That suggests that you could adapt his SQL to get the answer you want without recomputing the allele frequencies.

ADD COMMENTlink written 8.0 years ago by David Quigley11k
3
gravatar for Jorge Amigo
8.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

in case you prefer an interfaced solution rather than an automated script or sql query, maybe you could be interested in using BioMart (as several times suggested here) by selecting the most up to date human database (ensembl variation 61) and dataset (human variation 132), filling in the "Multiple Chromosomal Regions" field with your variants' positions. note that these positions should be formatted as Chr:Start:End:Strand. you will then be able to select the attributes you are interested in, such as "allele", "population name" and "genotype frequency", and once you generate such query you'll be able to filter (now locally) by the YRI population.

[self-promoting mode on] or you may want to directly access SPSmart, since we've recently incorporated 1000 Genomes data (accessible through here, although we are waiting for a BMC Bioinformatics paper to be accepted in order to allow proper merging with the rest of the available datasets). all you'll have to do is to select the YRI population and then query either the chr12 region of interest or the chromosome positions with a "chr:pos" format. [self-promoting mode off]

ADD COMMENTlink written 8.0 years ago by Jorge Amigo11k
0
gravatar for Isaac Joseph
2.0 years ago by
Isaac Joseph80
Berkeley, CA USA
Isaac Joseph80 wrote:

Here's an example programmatically, using python and the handy-dandy myvariant package (thanks Cyrus Afrasiabi, Chunlei Wu):

import myvariant
myvariant.MyVariantInfo()
mv.query('dbsnp.rsid:rs58991260', fields='dbsnp')['hits'][0]['dbsnp']['gmaf'] # returns 0.02157

Note that this selects the first possible hit in dbSNP for the given query. Queries can be made based on base pair/sequence name in addition to by rsID as shown.

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Isaac Joseph80
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1220 users visited in the last hour