Hi y'all, I have a list of gene symbols. How can I convert these gene symbols to a .bed file with the chromosome name and start/end positions?
A simple method would be to go to Ensembl BioMart, select the relevant organism, select the attributes you want (chromosome name and start/end position), and then upload your list of gene symbols as a filter.
If you want to do things in a more automated fashion, you could install the Ensembl Perl API and then run a Perl script (like the one posted below) to grab exons.
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

# Connection settings for the public Ensembl MySQL server
my $host    = 'ensembldb.ensembl.org';
my $user    = 'anonymous';
my $dbname  = 'homo_sapiens_core_89_38';
my $port    = '3306';
my $species = 'homo_sapiens';
my $group   = 'core';

my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host    => $host,
    -user    => $user,
    -dbname  => $dbname,
    -port    => $port,
    -species => $species,
    -group   => $group,
);

# Walk every chromosome and print one BED-style line per exon
my $slice_adaptor = $db->get_SliceAdaptor();
my $slices = $slice_adaptor->fetch_all('chromosome');
foreach my $slice (@{$slices}) {
    my $chr = "chr" . $slice->seq_region_name();
    my $genes = $slice->get_all_Genes();
    foreach my $gene (@{$genes}) {
        my $exons = $gene->get_all_Exons();
        my $id = $gene->external_name();
        my $exon_index = 1;
        my $exon_number = $exon_index;
        my $exon_count = scalar(@{$exons});
        foreach my $exon (@{$exons}) {
            my $start = $exon->start();
            my $end = $exon->end();
            # Only report exons whose start precedes their end
            if ($start < $end) {
                my $strand = $exon->strand();
                if ($strand == 1) {
                    $strand = "+";
                    $exon_number = $exon_index;
                }
                elsif ($strand == -1) {
                    $strand = "-";
                    $exon_number = $exon_count - $exon_index + 1;
                }
                else {
                    die "unknown value for strand\n";
                }
                # Ensembl coordinates are 1-based; BED start positions are 0-based
                print STDOUT join("\t", ($chr, $start - 1, $end, $id, $exon_number, $strand))."\n";
                $exon_index++;
            }
        }
    }
}
Be sure to change the dbname and species variables depending on your needs.
Once you have exons with Ensembl names, you can use a Python script like the following to make a translation table that maps Ensembl names to HGNC symbol names.
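#!/usr/bin/env python
import sys
from mygene import MyGeneInfo

# Read one gene symbol per line from standard input, skipping blank lines
hgnc_names = [line.strip() for line in sys.stdin if line.strip()]

mg = MyGeneInfo()
results = mg.querymany(hgnc_names, scopes='symbol', fields='symbol,name', species='human', verbose=False)
for result in results:
    # Queries with no match are flagged 'notfound' and carry no symbol or name
    if result.get('notfound'):
        continue
    sys.stdout.write("%s\t%s\n" % (result['symbol'], result.get('name', '')))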
From here, if you're working with HGNC names, you can process the Perl script output to include HGNC symbols for all exons, and then use grep to find matches for your genes of interest.
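If plain grep turns out to be too loose (for example, a pattern like TP53 would also hit TP53BP2), an exact match on the name column is one alternative. Here is a minimal sketch, assuming the exon file is tab-separated with the gene symbol in the fourth column, as in the Perl output above. The function names and the sample rows are made up for illustration.

```python
def load_symbols(lines):
    """Return the set of non-empty, whitespace-stripped gene symbols."""
    return {line.strip() for line in lines if line.strip()}


def filter_bed(bed_lines, symbols, name_column=3):
    """Yield BED lines whose name column exactly matches a wanted symbol."""
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_column and fields[name_column] in symbols:
            yield line.rstrip("\n")


# Made-up example rows; real input would be read from your own files
wanted = load_symbols(["TP53\n", "BRCA1\n"])
sample_bed = [
    "chr17\t100\t200\tTP53\t1\t-\n",
    "chr1\t5\t50\tTP53BP2\t1\t+\n",
    "chr17\t300\t400\tBRCA1\t2\t-\n",
]
for hit in filter_bed(sample_bed, wanted):
    print(hit)
```

To run this on real data, pass open file handles instead of the sample lists, e.g. load_symbols(open("genes.txt")) and filter_bed(open("exons.bed"), wanted), where both file names are placeholders.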
You should add more information, such as the genome build in which you want co-ordinates (hg19?; hg38?; mm9?; mm10?). Also, are these HGNC gene symbols? Are you only interested in the co-ordinates of the canonical isoform?
You could quite easily just download the GENCODE GTF annotation files from the GENCODE website and then extract the information from these using grep.
There is most likely a more automated solution.
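As a rough sketch of the GTF route in Python rather than grep: the snippet below assumes a GENCODE-style GTF in which gene-level records have "gene" in the third column and carry a gene_name attribute. The function name and the sample record (including its coordinates and gene_id) are made up for illustration. It converts GTF's 1-based inclusive coordinates to BED's 0-based half-open ones.

```python
import re

# Matches the gene_name attribute in the ninth GTF column
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gtf_genes_to_bed(gtf_lines, symbols):
    """Yield BED6 tuples for 'gene' features whose gene_name is in symbols."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        match = GENE_NAME.search(fields[8])
        if match is None or match.group(1) not in symbols:
            continue
        start, end = int(fields[3]), int(fields[4])
        # GTF is 1-based inclusive; BED is 0-based half-open, so shift the start
        yield (fields[0], start - 1, end, match.group(1), 0, fields[6])


# Made-up record for illustration; coordinates and gene_id are not real
sample_gtf = [
    "##description: example header\n",
    'chr17\tHAVANA\tgene\t1000\t2000\t.\t-\t.\t'
    'gene_id "ENSG_EXAMPLE"; gene_type "protein_coding"; gene_name "TP53";\n',
]
for record in gtf_genes_to_bed(sample_gtf, {"TP53"}):
    print("\t".join(str(value) for value in record))
```

This reports one interval per gene (the full gene span), so it sidesteps the isoform question; per-transcript or per-exon output would need the "transcript" or "exon" records instead.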