Question: From a list of gene symbols to a BED file with chromosome name and start/end positions

6.3 years ago

11yj3312 ▴ 20

Hi y'all, I have a list of gene symbols. How can I convert it to a .bed file with the chromosome name and start/end position of each gene?

gene genesymbols Bed Position chromosome • 2.4k views

updated 6.3 years ago by Alex Reynolds 35k • written 6.3 years ago by 11yj3312 ▴ 20

You should add more information, such as the genome build in which you want co-ordinates (hg19, hg38, mm9, mm10?). Also, are these HGNC gene symbols? Are you only interested in the co-ordinates of the canonical isoform?

You could quite easily download the GENCODE GTF annotation file for your genome build and then extract the records for your genes from it using grep.

There is most likely a more automated solution.

6.3 years ago by Kevin Blighe 87k
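The GTF route suggested in the comment above could be sketched roughly as follows in Python. This is only an illustration, not the commenter's own method: the function name `gtf_genes_to_bed` is made up here, and it assumes a GENCODE-style GTF in which gene-level records have `gene` in the third column and carry a `gene_name "..."` attribute. It keeps one interval per gene (the full gene span), so it says nothing about individual isoforms.

```python
import re

# Matches the gene_name attribute as written in GENCODE-style GTF files (assumed layout)
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_genes_to_bed(gtf_lines, symbols):
    """Yield BED6-style tuples for gene records whose gene_name is in `symbols`."""
    wanted = set(symbols)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only gene-level records
        match = GENE_NAME_RE.search(fields[8])
        if match is None or match.group(1) not in wanted:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # GTF is 1-based and inclusive; BED is 0-based and half-open,
        # so only the start shifts by one.
        yield (chrom, start - 1, end, match.group(1), 0, strand)
```

To use it, pass an open (uncompressed) GTF file handle and your list of symbols, then write each tuple out tab-separated. Unlike a plain grep, the match is on the whole gene name, so one symbol cannot pull in another gene whose name merely contains it.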

score 0 · Answer 1 · 2017-12-28

6.3 years ago

Devon Ryan 104k

A simple method would be to go to Ensembl biomart, select the relevant organism, select what you want (chromosome and start/end position) and then upload the list of gene symbols you have.
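A table exported this way still needs a small conversion before it is valid BED, because Ensembl coordinates are 1-based and inclusive while BED starts are 0-based. A minimal Python sketch of that step is below. The function name is hypothetical, and the default column headers are assumptions about what the export looks like; check them against the header line of your own file and pass different names if they differ.

```python
import csv


def biomart_tsv_to_bed(rows,
                       chrom_col="Chromosome/scaffold name",  # assumed header
                       start_col="Gene start (bp)",           # assumed header
                       end_col="Gene end (bp)",               # assumed header
                       name_col="Gene name",                  # assumed header
                       chrom_prefix="chr"):
    """Yield tab-separated BED4 lines from a tab-delimited export with a header row."""
    reader = csv.DictReader(rows, delimiter="\t")
    for record in reader:
        chrom = record[chrom_col]
        # Ensembl names chromosomes "1", "2", "X"; add a prefix for UCSC-style names.
        # Note that the mitochondrial chromosome is "MT" in Ensembl but "chrM" at UCSC.
        if chrom_prefix and not chrom.startswith(chrom_prefix):
            chrom = chrom_prefix + chrom
        start = int(record[start_col]) - 1  # 1-based inclusive -> 0-based start
        yield "\t".join([chrom, str(start), record[end_col], record[name_col]])
```

Pass it an open file handle for the exported table and write the yielded lines to your .bed file. Set `chrom_prefix=""` if you want to keep Ensembl-style chromosome names.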

score 0 · Answer 2 · 2017-12-28

If you want to do things in a more automated fashion, you could install the Ensembl Perl API and then run a Perl script (like the one posted below) to grab exons.

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my $host    = 'ensembldb.ensembl.org';
my $user    = 'anonymous';
my $dbname  = 'homo_sapiens_core_89_38';
my $port    = '3306';
my $species = 'homo_sapiens';
my $group   = 'core';
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host =>   $host,
                                            -user =>   $user,
                                            -dbname => $dbname,
                                            -port =>   $port);

my $slice_adaptor = $db->get_SliceAdaptor();

my $slices = $slice_adaptor->fetch_all('chromosome');
foreach my $slice (@{$slices}) {
    my $chr = "chr".$slice->seq_region_name();
    my $genes = $slice->get_all_Genes();
    foreach my $gene (@{$genes}) {
        my $exons = $gene->get_all_Exons();
        my $id = $gene->external_name();
        my $exon_index = 1;
        my $exon_number = $exon_index;
        my $exon_count = scalar(@{$exons});        
        foreach my $exon (@{$exons}) {
            my $start = $exon->start();
            my $end = $exon->end();
            if ($start < $end) {
                my $stable_id = $exon->stable_id();
                my $strand = $exon->strand();
                if ($strand == 1) { 
                    $strand = "+";
                    $exon_number = $exon_index;
                } 
                elsif ($strand == -1) { 
                    $strand = "-";
                    $exon_number = $exon_count - $exon_index + 1;
                } 
                else { 
                    die "unknown value for strand\n"; 
                }
                # Ensembl coordinates are 1-based and inclusive; BED starts are 0-based
                print STDOUT join("\t", ($chr, $start - 1, $end, $id, $exon_number, $strand))."\n";
                $exon_index++;
            }
        }
    }
}

Be sure to change the dbname and species variables depending on your needs.

Once you have exons with Ensembl names, you can use a Python script like the following to build a lookup table from MyGene.info. It reads one name per line from standard input and prints each matched HGNC symbol alongside its full gene name.

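#!/usr/bin/env python

import sys
from mygene import MyGeneInfo

# Read one gene name per line from standard input, skipping blank lines
hgnc_names = [line.strip() for line in sys.stdin if line.strip()]

mg = MyGeneInfo()
results = mg.querymany(hgnc_names, scopes='symbol', fields='symbol,name',
                       species='human', verbose=False)

for result in results:
    # Unmatched queries come back flagged 'notfound' and carry no 'symbol' or 'name'
    if result.get('notfound'):
        continue
    sys.stdout.write("%s\t%s\n" % (result.get('symbol', result['query']), result.get('name', '')))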

From here, if you're working with HGNC names, you can process the Perl script output to include HGNC symbols for all exons, and then use grep to find matches for your genes of interest.
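As an alternative to grep for that last step, here is a minimal Python sketch that keeps only the rows whose name column exactly matches one of your genes. The function name `filter_by_symbol` is made up for illustration, and it assumes the six tab-separated columns printed by the Perl script, with the gene name in the fourth column.

```python
def filter_by_symbol(bed_lines, symbols, name_column=3):
    """Yield BED lines whose name field (0-based column index) is in `symbols`."""
    wanted = {symbol.strip() for symbol in symbols if symbol.strip()}
    for line in bed_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        # Exact comparison on the name field, so a symbol never matches
        # a longer gene name that merely contains it.
        if len(fields) > name_column and fields[name_column] in wanted:
            yield stripped
```

The exact match is the main reason to prefer this over a bare grep, which matches substrings anywhere on the line. If you stay with grep, something like `grep -w -F -f genes.txt exons.bed` (file names hypothetical) restricts it to whole-word, fixed-string matches.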