Question: Api Variation Tools And 1000Genomes Dbsnp132
gravatar for Krisr
9.8 years ago by
United States
Krisr460 wrote:

I am interested in generating the flanking sequences (10 nt from each side) from a list of ~500,000 SNPs.

I am using the perl API variation to extract this information, however the script throws an error when it encounters many of the recently discovered SNPs (from the 1000genomes data).


 use strict;
use warnings;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::Slice qw(split_Slices);

my $registry = 'Bio::EnsEMBL::Registry';

    -host => '',
    -user => 'anonymous'

my $va_adaptor = $registry->get_adaptor('human', 'variation', 'variation'); #get the different adaptors for the different objects needed
my $vf_adaptor = $registry->get_adaptor('human', 'variation', 'variationfeature');
my $gene_adaptor = $registry->get_adaptor('human', 'core', 'gene');

my @rsIds = qw(rs6688221 rs113821789 rs112155239 rs10147700 rs74917091);
foreach my $id (@rsIds){
# get Variation object
  my $var = $va_adaptor->fetch_by_name($id); #get the Variation from the database using the name

sub get_VariationFeatures{
  my $var = shift;
  # get all VariationFeature objects: might be more than 1 !!!
  foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}){
      print $vf->variation_name(),","; # print rsID
      print $vf->allele_string(),","; # print alleles
      #print join(",",@{$vf->get_consequence_type()}),","; # print consequenceType
      print substr($var->five_prime_flanking_seq,-10) , "[",$vf->allele_string,"]"; #print the allele string
      print substr($var->three_prime_flanking_seq,0,10), ","; # print RefSeq
      print $vf->seq_region_name, ":", $vf->start,"-",$vf->end, "\n"; # print position in Ref in format Chr:start-end
      #&get_TranscriptVariations($vf); # get Transcript information

OUTPUT: MSG: Bio::EnsEMBL::Variation::Variation arg expected STACK Bio::EnsEMBL::Variation::DBSQL::VariationFeatureAdaptor::fetchallbyVariation /Users/ensembl/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/DBSQL/ STACK main::getVariationFeatures STACK toplevel Ensembl API version = 60

Looking at the code in, it appears these new rs #s are not found in the database being referenced. Does anyone know if there is/or will be an update, or if this data can be accessed another way to get the sequences - in bulk?

genome perl api variation snp • 2.9k views
ADD COMMENTlink modified 9.8 years ago by Giulietta - Ensembl Helpdesk1.2k • written 9.8 years ago by Krisr460
gravatar for Giulietta - Ensembl Helpdesk
9.8 years ago by
Cambridge, UK

Hi Kris,

You saw this answer at Just to add it to the discussion:

The variation API is designed to work against the Ensembl variation database (only). We currently have dbSNP 131 in release 60 (live version). We will be getting dbSNP 132 in release 61 (due end of January, 2011). If you can wait until then, you can use the API against dbSNP 132 data.

ADD COMMENTlink written 9.8 years ago by Giulietta - Ensembl Helpdesk1.2k

FYI - this is now updates and works very well. I've been able to retrieve all SNPs and surrounding seq.

One other question; does anyone know how I might convert a transcript ID (ENSG#####) to a gene name (BRCA1, MYC, etc)? I'd like to add this into the script, so a perl solution would be ideal.

ADD REPLYlink written 9.7 years ago by Krisr460

I just sent this answer to you on helpdesk- here you are:

Slice->get_all_Genes(). You can get the gene name via the external_name() subroutine on a gene object.

Hope that helps!

ADD REPLYlink written 9.6 years ago by Giulietta - Ensembl Helpdesk1.2k
gravatar for Pierre Lindenbaum
9.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum130k wrote:

If all your snps are some rs### from dbsnp what about using the NCBI E-Utilities to fetch the flanking sequence ?

for S in 6688221 113821789 112155239 10147700 74917091
   echo -n "rs$S";
   curl -s "$S&rettype=fasta&retmode=text"|\
   egrep -v "^1" | tr -d " " |\
   awk '/^>/ {s=$0; i=index(s,"|pos=");i+=5; s=substr(s,i); i=index(s,"|"); s=substr(s,1,i-1); len=int(s);seq=""; next;}
END { printf("\t%s\t%s\t%s\n",substr(seq,len-10,10),substr(seq,len,1),substr(seq,len+1,10)); }'

rs6688221       GTGGCAGAGC      Y       TGAGCCGTTC
rs113821789     ATAGCATTAG      R       AGAAATACCT
rs112155239     CTAACCCTAA      M       CCCTAACCCT
rs10147700      TTTTTTTTTT      K       TTTGTGCATT
rs74917091      TTCTGGGATT      W       TTTTTTTTTT
ADD COMMENTlink modified 13 months ago by RamRS30k • written 9.8 years ago by Pierre Lindenbaum130k

Thanks. I'm new to Eutilities, is there a command to extract the major and minor alleles?

ADD REPLYlink written 9.8 years ago by Krisr460


ADD REPLYlink modified 13 months ago by RamRS30k • written 9.8 years ago by Pierre Lindenbaum130k

I am interested in generating sequences containing the allele which is not in the current human reference assembly. Can this data be extracted?

ADD REPLYlink written 9.8 years ago by Krisr460

That will be tricky this because you will have to handle the strands of the snp vs the genome. However, you will find the position of the SNP in the XML definition of the SNP ( see NCBI EFetch doc), and to get the reference base use a library (bioperl...) or see this other question (slower)

ADD REPLYlink modified 12 months ago by RamRS30k • written 9.8 years ago by Pierre Lindenbaum130k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2241 users visited in the last hour