How to extract common exons of multiple transcripts of the same gene
2
0
Entering edit mode
9.3 years ago
simonjean434 ▴ 70

Hello,

I would like to extract common exons of multiple transcripts of the same gene from gencode.v19.annotation.gtf. Is there any easy way to do that?

Thanks in advance for your help

Jean

RNA-Seq • 3.2k views
ADD COMMENT
1
Entering edit mode
9.3 years ago

Perhaps make two BED files of genes and exons, and then map exons to genes:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
   | gunzip --stdout - \
   | awk '$3=="gene"' - \
   | convert2bed -i gff - \
   > genes.bed

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gff3.gz \
   | gunzip --stdout - \
   | awk '$3=="exon"' - \
   | convert2bed -i gff - \
   > exons.bed

$ bedmap --echo --echo-map genes.bed exons.bed > genes_with_mapped_exons.bed

Once you have this list of exon IDs that correspond to each gene, you could perhaps use a Perl or Python script to process each gene-line, parse out each exon's coordinates to a key-value pair in a hash table, and find the common exons based on a duplicate key. Print those exons or exon IDs out to standard output, along with the gene ID, for instance.

ADD COMMENT
1
Entering edit mode
9.3 years ago
Tariq Daouda ▴ 220

Hi,

I'm the developer of pyGeno. pyGeno is designed to simplify this kind of operations and allow you to go further if needed. Here's an example that works for the Gene TPST2. Of course it would work for any other gene, and you could also loop through all the genes of the genome if you would like to:

from pyGeno.Genome import *

ref = Genome(name = "GRCh37.75")
gene = ref.get(Gene, name = "TPST2")[0]
#you can also make a query using the Ensembl Id 
#gene = ref.get(Gene, id = "ENSG00000128294")[0]
exons = {}

for trans in gene.get(Transcript) :
   for exon in trans.exons :
     if exon.id not in exons :
        exons[exon.id] = exon
print "Exons in common"

for e in exons :
  print e
  print 'whole sequence', e.sequence
  print 'coding sequence', e.CDS

pyGeno uses Ensembl annotations.

Hope that helps.

Cheers

ADD COMMENT
0
Entering edit mode

Hi,

I tried importing the pyGeno module. I am getting the following error:

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    from pyGeno.Genome import *
  File "C:\Python27\lib\site-packages\pyGeno\__init__.py", line 3, in <module>
    from configuration import pyGeno_init
ImportError: No module named 'configuration'

Please help me. Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1837 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6