Question: Pre-computed list of genes in LD with a SNP with r2 from 1000Genomes / hg19
0
gravatar for Khader Shameer
5.0 years ago by
Manhattan, NY
Khader Shameer18k wrote:

Is there any database out there that have pre-computed LD data for hg19 using 1000Genomes. I need to map a given SNP to the genes that are in LD with it. I am ideally looking for something like scan-db, they provide nice data (see below) but the data is a bit outdated. 

Thanks, 

Shameer 

snp genomics ld • 4.1k views
ADD COMMENTlink modified 5.0 years ago by Emily_Ensembl18k • written 5.0 years ago by Khader Shameer18k
3
gravatar for Emily_Ensembl
5.0 years ago by
Emily_Ensembl18k
EMBL-EBI
Emily_Ensembl18k wrote:

We have LD for 1000 genomes pops in Ensembl. You can access it through the browser, but it is also possible to get the data via our Perl API.

ADD COMMENTlink written 5.0 years ago by Emily_Ensembl18k

Thanks Emily, can you please point me to an example that show the access to pre-computed data via Ensembl Perl API ? 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

There's some sample code that you can use here.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

Great Emily, 

I was able to run the example and get the the r2 information for the SNPs. See below: 
High LD between variations 1.000000    rs1185568-rs9393672
High LD between variations 0.913043    rs1184803-rs1185978
High LD between variations 0.955420    rs1184803-rs942379
High LD between variations 1.000000    rs6905614-rs1408273
High LD between variations 0.958109    rs1408273-rs942379
High LD between variations 1.000000    rs1165182-rs6905614
High LD between variations 1.000000    rs1165189-rs1165187
High LD between variations 1.000000    rs1185568-rs1408273

I have few questions: 

- Is this LD info computed on the fly or do you have a backend database with all pair-wise LD between all SNPs ? If you have a database, is it possible to access it ? 

- My input is list of SNPs from GWAS catalog (all rsIDs liftover to hg19) and I need to find all other SNPs and their bp/position for downstream analytics. From the code, it needs a start and end range within a chromosome to find the LD. I don't have this handy - is it possible to get LD information without this ? 

- Is there a way to map the genes mapped to the rsIDs as well ? Which modules should I include to map SNPs to HGNC symbols via Ensembl API ? 

Thoughts ? 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

1. We calculate it on the fly.

2. You can go from a variant to other variants in LD with it. You would need to:

3. You can map to genes by getting the TranscriptVariation from the VariationFeature. From there, you can get to an Ensembl Transcript. This will then get you into our Core API, where you can get the Gene, and get its external name.

I can help you out with any of this.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

Thanks Emily, 

I am looking for a way to integrate all of that modules you have mentioned to create a file that takes a list of SNP

Example here and generate an output in the following format 

QuerySNP    LDSNP    Gene-HGNC-Symbol    r2
ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Khader Shameer18k

You'll need to write a Perl script that reads through the list, one-by-one, then feeds the IDs in to all those steps I mentioned.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

Sure, that's what I was trying to do since last evening. With limited documentation and some of the methods are defined as under development, am a bit disappointed with the progress. Will you be able to provide an example that link those modules ? Appreciate your help! 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

I've written the following script for you that will go through the lines one by one. I tested it on just a single variant and it gave thousands of lines of data in response so you might want to add in some filtering. For example, at the moment it's giving all the 1000 genomes populations - you might want to filter down to your favourites. It's also giving all linked variants with r2 values - you could choose an r2 cut-off instead.

 

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

open (IN, "<SNP_list_sorted.txt");
open (OUT, ">SNPs_in_LD.tsv");

my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org',-user => 'anonymous');
my $va = $reg->get_adaptor( 'human', 'variation', 'variation' );

while (<IN>) {

    my $v = $va->fetch_by_name($_);
    my @vfs = @{ $v->get_all_VariationFeatures };
    
    foreach my $vf (@vfs){        
        my $ld = $vf->get_all_LD_values;
        my @pops = @{ $vf->get_all_LD_Populations };
        my @ldvs = @{ $ld->get_variations };
        
        foreach my $pop (@pops) {
            
            if ($pop->name =~ /1000GENOMES/) {
        
                foreach my $ldv (@ldvs) {            
                    if ($ldv->stable_id ne $_) {
                        my @ldvfs = @{ $ldv->get_all_VariationFeatures };
                
                        foreach my $ldvf (@ldvfs) {
                            my @tvs = @{ $ldvf->get_all_TranscriptVariations };
                            my $r2 = $ld->get_r_square($vf, $ldvf, $pop);                                        
                    
                            foreach my $tv (@tvs) {
                                my $gene = $tv->transcript->get_Gene;
                            
                                if ($r2) {            
                                    print OUT $v->stable_id, "\t", $ldv->stable_id, "\t", $gene->external_name, "\t", $r2, "\t", $pop->name, "\n";
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }    
ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k
1

Hi Emily, Thanks for the example code. I have tried running it and am having difficulty when fetching by name. I don't know perl, so working with the API has been painful.

I believe you previously answered my question on a related blog post. I have a few remaining questions regarding your script?

  • How can I specify using only 1000 Genomes phase 3 data?
  • How can I specify specific populations in the query?
  • How do I set the minimum r2 cutoff and an basepair distance restraint?

I am looking forward to any assistance you may be able to provide. Just FYI, I am using an open science platform for my project that pays people who provide assistance. If you help and leave a comment on the relevant discussion you can get rewarded for your efforts.

ADD REPLYlink written 4.2 years ago by Daniel Himmelstein10

I just came across the Ensembl REST API. We would much prefer to use this than the perl API. Does it support retrieving SNPs in LD?

ADD REPLYlink written 4.2 years ago by Daniel Himmelstein10

There is no LD endpoint on the REST API.

ADD REPLYlink written 4.2 years ago by Emily_Ensembl18k
  • The latest Ensembl works with Phase 3. Update to the latest API to access the latest data.
  • Specify population names by editing the following line to give specific names. At the moment it's just getting population names that include the phrase 1000GENOMES, but you can put in the full population names 
    if ($pop->name =~ /1000GENOMES/) {
  • Include a cutoff by editing the following line to include a > or <
    if ($r2) { 
  • For a distance cutoff you'll need another if loop. There's no distance value for LD pairs, so you'll need to get the location of each of the variation features and do a subtraction.

ADD REPLYlink written 4.2 years ago by Emily_Ensembl18k

Great, thanks a lot Emily ! 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

Tried the script, am getting an error. Do I have to use any other modules or pass additional parameters ? 

Can't call method "get_all_VariationFeatures" on an undefined value at ensembl_ld_script.pl line 16, <IN> line 1.
 

Ran diagnostics and getting the following: 

 

shameer@foo:~/compute/ensembl_ld_data$ perl ensembl_ld_script.pl
Can't call method "get_all_VariationFeatures" on an undefined value at
        ensembl_ld_script.pl line 16, <IN> line 1 (#1)
    (F) You used the syntax of a method call, but the slot filled by the
    object reference or package name contains an undefined value.  Something
    like this will reproduce the error:
    
        $BADREF = undef;
        process $BADREF 1,2,3;
        $BADREF->process(1,2,3);
    
Uncaught exception from user code:
        Can't call method "get_all_VariationFeatures" on an undefined value at ensembl_ld_script.pl line 16, <IN> line 1.

 

Thanks again !

 

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Khader Shameer18k

Looks like it's not getting a value for $v, so isn't able to call get_all_VariationFeatures on it. The script is assuming that every line of your input file is just rsID\n. If there's anything else on the line then you need to either edit the script to only read the ID, or edit your input file.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

Thanks for your help with code-review Emily, I was able to fix it. Will post my working version of the code here. 

My current concern is that the program is incredibly slow (100 SNPs in 48 hours; only CEU, r2>=0.5). Is that because of accessing the data directly from Ensembl ? Other thoughts to improve the speed ? 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

I'm not surprised it's slow. It's calculating all of these LDs on the fly and remember that when you're filtering by r2 you're still calculating r2 for all those that don't pass your filter - you're just not printing them.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

True that! 

Now the run is in progress but getting the following error for random SNPs:  Is this a known issue ? Any suggestions to solve this ? 

MSG: Detected an error whilst executing SQL 'SELECT  t.transcript_id, t.seq_region_id, t.seq_region_start, t.seq_region_end, t.seq_region_strand, t.analysis_id, t.gene_id, t.is_current, t.stable_id, t.version, UNIX_TIMESTAMP(created_date), UNIX_TIMESTAMP(modified_date), t.description, t.biotype, t.status, exdb.db_name, exdb.status, exdb.db_display_name, x.xref_id, x.display_label, x.dbprimary_acc, x.version, x.description, x.info_type, x.info_text, exdb.db_release, t.source
FROM (( (transcript t)
  LEFT JOIN xref x ON x.xref_id = t.display_xref_id )
  LEFT JOIN external_db exdb ON exdb.external_db_id = x.external_db_id )
 WHERE t.stable_id = ? AND t.is_current = 1
': DBD::mysql::st execute failed: Lost connection to MySQL server during query at /data1/software/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm line 482.

STACK Bio::EnsEMBL::DBSQL::BaseAdaptor::generic_fetch /data1/software/src/ensembl/modules/Bio/EnsEMBL/DBSQL/BaseAdaptor.pm:483
STACK Bio::EnsEMBL::DBSQL::TranscriptAdaptor::fetch_by_stable_id /data1/software/src/ensembl/modules/Bio/EnsEMBL/DBSQL/TranscriptAdaptor.pm:169
STACK Bio::EnsEMBL::Variation::BaseVariationFeatureOverlap::feature /data1/software/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/BaseVariationFeatureOverlap.pm:159
STACK Bio::EnsEMBL::Variation::BaseTranscriptVariation::transcript /data1/software/src/ensembl-variation/modules/Bio/EnsEMBL/Variation/BaseTranscriptVariation.pm:84
STACK toplevel ensembl_ld_script.pl:38
Date (localtime)    = Wed Sep  3 08:56:06 2014
Ensembl API version = 75
ADD REPLYlink written 5.0 years ago by Khader Shameer18k

Looks like you're losing connection to the database. We're not aware of any reason why this would be happening - could be a connection issue at your end.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k

Check on it. 

Alternatively, wondering I can do the same query using a local installation of ensembl-human + variation databases ? Any suggestions ? 

 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

You can. You'll need to install the MySQL core and variation databases for human on your system. There's more info here.

ADD REPLYlink written 5.0 years ago by Emily_Ensembl18k
1
gravatar for Katie D'Aco
5.0 years ago by
Katie D'Aco1000
Massachusetts
Katie D'Aco1000 wrote:

If you're looking at human genomes, have you tried HapMap? They have precomputed LD tables (about 5 years old). http://hapmap.ncbi.nlm.nih.gov/downloads/ld_data/?N=D

ADD COMMENTlink written 5.0 years ago by Katie D'Aco1000

Thanks Katie, looking specifically for data from 1000Genomes. 

ADD REPLYlink written 5.0 years ago by Khader Shameer18k

Ah. I missed that part of your question

ADD REPLYlink written 5.0 years ago by Katie D'Aco1000

That's alright - I added that part after you posted your answer. 

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Khader Shameer18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 665 users visited in the last hour