Question: How To Retrieve Gene Variations From Ensembl Using The Perl Api?
2
gravatar for Jarretinha
8.5 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

Dear all,

A very practical question. I need to retrieve all variations (loose sense) of the MEN1 gene from the human genome at ensembl. Of course, I want to do it programmatically. But, I'm having a hard time using the perl API. I think I didn't grasp the adaptor idea (maybe the nomenclature scheme). Has someone a hint on how to use a gene ID to recover variations?

I've tried some examples from the tutorial. But, something went wrong.

perl ensembl api variation • 4.8k views
ADD COMMENTlink modified 16 months ago by Biostar ♦♦ 20 • written 8.5 years ago by Jarretinha3.3k
6
gravatar for Pontus Larsson
8.5 years ago by
Pontus Larsson60 wrote:

Hi,

The numbers you are looking at in the variation table on the gene page in Ensembl are the number of different consequences that affects the transcripts of the gene by variations that fall within it. In other words, one single variation can have multiple consequences on one transcript (i.e. 'intronic' and 'essential splice site') and it can also have different consequences depending on what transcript you're looking at (e.g. 'intronic' in one transcript but 'non-synonymous coding' in another). The table summarizes these numbers.

To get just the unique variations that falls within this gene, the API script provided above will work fine. If you need the variation consequences for each variation on each transcript, you can do something like this:

my $species='human';
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');

my $gene_adaptor =$reg->get_adaptor($species, 'core','Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000133895');
my $tv_adaptor = $reg->get_adaptor($species,'variation','transcriptvariation');

my $tvs = $tv_adaptor->fetch_all_by_Transcripts($gene->get_all_Transcripts());
foreach my $tv (@{$tvs}) {
   print join("\t",($tv->variation_feature->variation_name(),$tv->transcript_stable_id,join(",",@{$tv->consequence_type()}))) . "\n";
}

Hope this helps /Pontus Larsson - Ensembl Variation

PS. Note that in the Ensembl pipeline, we do some consistency checking on the data we import. Prior to release 61, we deleted data for variations that fail these checks from the database but starting from release 61, we just flag them and keep the data. By default, these flagged variations are not displayed in the web browser so they won't be included in the variation consequence table. These flagged variations can be viewed in the location view by turning on the 'Failed variations' track. The default API behaviour is also to not return these variations but this behaviour can be changed (see the variation API tutorial for details). The MySQL query given above will return these flagged variations and you would need to do a left join to the 'failed_variation' table to properly filter them out.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Pontus Larsson60

very interesting Pontus, thanks. (And welcome on Biostar)

ADD REPLYlink written 8.5 years ago by Pierre Lindenbaum122k

Wonderful! Now, everything makes sense!!! Gonna flag this as the right one for the excelent explanation. But, everyone contributed to made my day!!! YEAH! Variations in hand right now via perl API and SQL!!! Certainly will include a mention to Biostar in the paper acknowledgement!

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k
3
gravatar for User 6659
8.5 years ago by
User 6659970
User 6659970 wrote:

my $species='human'; my $reg = 'Bio::EnsEMBL::Registry'; $reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');

my $vfa = $reg->get_adaptor($species, 'variation', 'variationfeature');
my $slice_adaptor = $reg->get_adaptor($species, 'core', 'slice');
my $gene_adaptor =$reg->get_adaptor($species, 'core','Gene');
my $gene = $gene_adaptor->fetch_by_stable_id('ENSG00000133895');

my  $slice = $gene->feature_Slice();
my $counter = 0;
foreach my $vf (@{$vfa->fetch_all_by_Slice($slice)}) {
    $counter++;
    print $vf->seq_region_start(), '-', $vf->seq_region_end(), ' ', $vf->allele_string(), "\n";
  }

print "counter: $counterndonen";

ADD COMMENTlink written 8.5 years ago by User 6659970
1

the api and db return different numbers becase the api only gets dbSNP and hgmd whereas the db gets everything (cosmic too). I don't know where the website gets the figure 6348 from at all. I get 473 vfs in dbSNP and cosmic with 13699 transcript variants with the db, and 474 and 13700 with the api. One has gone missing somewhere though :)

ADD REPLYlink written 8.5 years ago by User 6659970
1

if you want somatic mutations as well i don't know how to get them with the api. if you just want inherited variants you can use the api code already give but recurse over transcripts (and you'll get 13700). if you do need somati variants then you can use this query in the db

ADD REPLYlink written 8.5 years ago by User 6659970

That's good too. I've tried smth very similar. Your script returned 474 results. But, I can't understand exactly what's happening. It returned HGMD mutations, but just part of them (6348 in total). Should I recurse over transcripts?

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

i don't know what information you want to be fair. Do you want just HGMD variants? Do you want the variation_features (i.e. the locations of the variations) or all the transcripts affected by the variants. Do you want SNPs and indels?

ADD REPLYlink written 8.5 years ago by Andrea_Bio2.5k

i don't know what information you want. Do you want just HGMD variants? Do you want the variation_features (i.e. the locations of the variations) or all the transcripts affected by the variants. Do you want SNPs and indels?

ADD REPLYlink written 8.5 years ago by User 6659970

In fact I want all variants of a gene. I need them to model how this gene is evolving in the human population. I don't understand Ensembl numbers. Check this numbers http://www.ensembl.org/Homo_sapiens/Gene/Variation_Gene/Table?g=ENSG00000133895;r=11:64570988-64578766

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

I need all of them!!! That's the problem. Check these numbers http://www.ensembl.org/Homo_sapiens/Gene/Variation_Gene/Table?g=ENSG00000133895;r=11:64570988-64578766

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

You want somatic variations too?

ADD REPLYlink written 8.5 years ago by User 6659970

mysql> select count(*) from variation_feature vf inner join seq_region sr on sr.seq_region_id = vf.seq_region_id inner join variation v on vf.variation_id = v.variation_id left outer join failed_variation fv on fv.variation_id = v.variation_id inner join transcript_variation tv on tv.variation_feature_id = vf.variation_feature_id inner join source s on s.source_id = vf.source_id where sr.name = '11' and vf.seq_region_start >=64570988 and vf.seq_region_end <= 64578766 and fv.variation_id is null ;

ADD REPLYlink written 8.5 years ago by User 6659970

I'll surely try this. The number differences are really puzzling me. Anyway, MEN1 is a much more complicated gene than I tought.

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

some bad luck with numbers here actually. I thought that because i got the same number of variants as the previous respondent who filtered by gene, there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl and sql given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene and get figures of about 9000 but not 6000

ADD REPLYlink written 8.5 years ago by User 6659970

i deleted some previous comments for clarity. some bad luck with numbers here actually. I thought that because i got the same number of variants with the sql as the previous respondent who filtered by gene, there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene

ADD REPLYlink written 8.5 years ago by User 6659970

i deleted some previous comments for clarity. some bad luck with numbers here actually. I thought that because i got the same number of variants with my sql without filtering by gene as the previous respondent (who filtered by gene), there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene

ADD REPLYlink written 8.5 years ago by User 6659970

i deleted some previous comments for clarity. some bad luck with numbers here actually. I thought that because i got the same number of variants with my sql without filtering by gene as the previous respondent (whose query seemed to filter by gene), there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene – unknown (google) 0 secs ago

ADD REPLYlink written 8.5 years ago by User 6659970

i deleted some previous comments for clarity. some bad luck with numbers here actually. I thought that because i got the same number of variants with my sql without filtering by gene as the previous respondent (whose query seemed to filter by gene), there was only one gene on that chromosome. I should have checked but there is > 1 gene there. The perl given assumes only one gene there. I've modified the perl api code and sql looking for only transcripts in that gene

ADD REPLYlink written 8.5 years ago by User 6659970
3
gravatar for Graham Ritchie
8.5 years ago by
Graham Ritchie30 wrote:

Just to add to Pontus' answer, if you are also interested in somatic mutations (which are included in the numbers in the table on the website) you can fetch these with the fetch_all_somatic_by_Transcripts() method on the TranscriptVariationAdaptor. The script above could be modified to include these as follows:

my $tvs = $tv_adaptor->fetch_all_by_Transcripts($gene->get_all_Transcripts());
push @$tvs, @{ $tv_adaptor->fetch_all_somatic_by_Transcripts($gene->get_all_Transcripts()) };

And then loop over $tvs as before.

Cheers,

Graham Ritchie - Ensembl Variation

ADD COMMENTlink written 8.5 years ago by Graham Ritchie30

Many thanks for the tip! Be welcome you too to Biostar! Now I've got the gist of the API. Let's see what can I do . . .

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k
1
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

I thought it would be interesting to find the mysql query for this request.

I'm not a specialist of the SQL schema for ENSEMBL and I would be interested if someone could improve and/or criticize this SQL query.

> mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306

use homo_sapiens_core_61_37f;

select distinct
  XREF.display_label,
  VAR.name
from
   external_db as EXTDB,
   xref as XREF,
   gene as GENE,
   homo_sapiens_variation_61_37f.variation_feature as VARFEAT,
   homo_sapiens_variation_61_37f.variation as VAR
where
   EXTDB.db_name="HGNC" and
   XREF.external_db_id = EXTDB.external_db_id and
   XREF.display_label="MEN1" and
   XREF.xref_id = GENE.display_xref_id and
   GENE.seq_region_id = VARFEAT.seq_region_id and
   VARFEAT.seq_region_start >= GENE.seq_region_start and 
   VARFEAT.seq_region_end <= GENE.seq_region_end and 
   VARFEAT.variation_id = VAR.variation_id
order by VARFEAT.seq_region_start
;

+---------------+-------------+
| display_label | name        |
+---------------+-------------+
| MEN1          | rs117705251 |
| MEN1          | rs34171410  |
| MEN1          | rs71581758  |
| MEN1          | rs1804850   |
| MEN1          | rs71581759  |
| MEN1          | rs71996491  |
| MEN1          | rs1804848   |
| MEN1          | rs1804849   |
| MEN1          | rs111895237 |
| MEN1          | rs71581749  |
(...)
ADD COMMENTlink written 8.5 years ago by Pierre Lindenbaum122k
1

mysql> select * from variation_feature vf inner join seq_region sr on sr.seq_reg ion_id = vf.seq_region_id inner join variation v on vf.variation_id = v.variatio n_id left outer join failed_variation fv on fv.variation_id = v.variation_id whe re sr.name = '11' and vf.seq_region_start >=64570988 and vf.seq_region_end <= 64 578766 and fv.variation_id is null;

ADD REPLYlink written 8.5 years ago by User 6659970
1

if you choose to go with the database schema directly and not use the api you are going to lose all of the 'protection' the api gives you such as sorting the failed variation features from the good ones

ADD REPLYlink written 8.5 years ago by User 6659970

Absolutely nice! A SQL approach would be equally suitable! And you're right about ensembl schemas. They're a complexity nightmare. So, let's see how this query behaves...

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

Hum... I think we hit the same problem with different approaches. When you enter MEN1 in the websearch, it returns 542 variants (mutations+variants in ensembl sense). But, when you browse the human gene there are 9751! Your query returned 621. I'm a bit lost right now. Time to tweak a little ...

ADD REPLYlink written 8.5 years ago by Jarretinha3.3k

That query returns 621 because it includes failed variants. There are 618 without failed variants.

ADD REPLYlink written 8.5 years ago by User 6659970

but in the web search ~600 (!) SNPs have been duplicated. e.g: I counted "rs653534" 19 times.

ADD REPLYlink written 8.5 years ago by Pierre Lindenbaum122k

-1 for inefficient query

ADD REPLYlink written 8.3 years ago by lh331k

-1 for inefficient query. Use Ensembl API please.

ADD REPLYlink written 8.3 years ago by lh331k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1620 users visited in the last hour