Question: How To Merge Isoforms For A Gene
gravatar for camelbbs
7.7 years ago by
camelbbs680 wrote:

Hi All,

I want to ask a tech question. Does anybody have scripts or software that can merge multiple isoforms of a gene into a reference transcript.

Ideally, that script will merge the overlapped exons between isoforms. And give a reference transcript that include the longest exons.



rnaseq • 9.2k views
ADD COMMENTlink modified 6.1 years ago by Tariq Daouda210 • written 7.7 years ago by camelbbs680

Can you provide a snippet of input, or what you expect to feed into a script? Just a few lines would help.

ADD REPLYlink written 7.7 years ago by Alex Reynolds31k

I just want to handle the refflat.gtf file: chrX hg19_refFlat exon 120073834 120073989 0.000000 - . gene_id "CT47A3"; transcript_id "CT47A3_dup2";

ADD REPLYlink written 7.7 years ago by camelbbs680

Hi camelbbs,Were you able to solve this problem?I also want to have overlapping exons merged for a gene from a gtf file.

ADD REPLYlink written 6.8 years ago by Ron1.0k

Hi Camelbbs. I'm adding this comment to all your questions: Please take some time, before you ask a question, to think more about your problems and most likely sources of answers (manuals, FAQs, Google!, etc.). When you ask a question, include some context, tell us why you ask that question, what result you need, etc. Most of your questions are vague, impossible to answer or you changed them following an answer because it became evident that it was not clear. Cheers.

ADD REPLYlink written 7.7 years ago by Eric Normandeau10k

Are you familiar with this area or I don't know why you say this.

ADD REPLYlink written 7.7 years ago by camelbbs680
gravatar for Alex Reynolds
7.7 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Perhaps the following might get you started.

For exploration purposes, I exported the refFlat table from the UCSC Genome Browser as a GTF file and saved it somewhere I can find it:

$ wget

I then extracted exons and converted the result to a BED file with BEDOPS gtf2bed‡, and passed it to an awk script that uses an associative array (hash table) to store results based on gene name:

$ gzcat refFlat.hg19.gtf.gz \
    | grep exon \
    | gtf2bed - \
    | awk '{ \
        name[$4]++; \
        if (name[$4] == 1) { \
            chr[$4] = $1; \
            start[$4] = $2; \
            stop[$4] = $3; \
            remainder[$4] = substr($0, index($0, $5)); \
        } \
        else { \
            stop[$4] = $3; \
        } \
    } \
    END { \
        for (id in name) { \
            printf("%s\t%s\t%s\t%s\t%s\n"), chr[id], start[id], stop[id], id, remainder[id]; \
        } \
    }' - \
    > mergedRefFlatExons.hg19.bed

At the first instance of an exon for a gene name, we assign values to elements of associative arrays for the gene name. Where we find two or more exons, the else condition of the if-else block changes the stop position for that gene to the stop position of the last current exon.

If this works for you, then perhaps you can extend it to meet the other condition (reference transcript with the longest exon) by keeping track of the longest current exon, which you might mark in the END block (perhaps with a custom GTF attribute printed at the end of the line).

‡ : Conversion to BED is not necessary. I did this because I am more familiar with handling BED data than GTF. If you are more familiar with GTF files and in which order attributes are stored, you can change the field assignments to elements of each associate array accordingly.

ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Alex Reynolds31k

Just a note here, this script doesn't work for the scenario that the same gene name locates on different chromosomes (eg ChrX and ChrY).

ADD REPLYlink written 12 weeks ago by Yu110
gravatar for Abhi
7.7 years ago by
United States
Abhi1.5k wrote:

Bedtools is good tool for playing with genomic features. It is fast and efficient. In your case bedtools merge should work. Look for example in the manual they are pretty helpful in getting a visual picture.

hth, -Abhi

ADD COMMENTlink written 7.7 years ago by Abhi1.5k

Thanks, but i think bedtools merge can merge all the overlapped exons, ignoring the gene. I want to merge the isoforms belong to a same gene, not from different genes.

ADD REPLYlink written 7.7 years ago by camelbbs680
gravatar for Malcolm.Cook
7.7 years ago by
kansas, usa
Malcolm.Cook1.1k wrote:

R's BioConductor gives you:

  • a object oriented re-packaging of ucsc gene model (named after entrez ids)
  • the function reduce to 'merge' the exons of the isoforms into a merged 'maximal' gene
  • a tool to map between entrez ids and gene names (providing names for the merged genes)
  • the rtracklayer package to write it back out in a gff3 file

run R and, first time only, install needed packages:

# Say 'yes' to all questions.  You may need a few more packages

Now, load the gene models, 'merge' the exons within each gene, give them a name, and write a gff of the results

exonsByGene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,'gene')
mergedGene <- reduce(exonsByGene) 
mergedGene <- 
  ## Remove a few hundred 'tricky cases' like:
  ##  * trans-spliced genes (?? are there any in hg19.knownGene ??)
  ##  * genes on multiple chromosomes (eg: alt. haplotypes "chr_ctg9_hap1" )
  ##    Arguably(?) these should not be considered the 'same gene'.
  mergedGene[1==elementLengths(runValue(strand(mergedGene))) &
`]]1`<-function(i,o)o[[i]][[1]] # utility
names(mergedGene)<- # assign gene symbol for export to GFF
     sapply(names(mergedGene),`]]1`,org.Hs.egSYMBOL) # TODO: faster?


  • 3' and 5' UTR features are lost
  • the 'tricky cases'
ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Malcolm.Cook1.1k

I have merged all overlapping CDSs from each gene into consensus CSDs (I only care about the protein coding parts). Then I duplicated all the consensus CDSs as exons as well, except for the first and last one, which I have extended to the start and end of the transcript to be able to specify the UTR coordinates. Then I designated the first 3 bases of the first merged CDS as start codon and 3 bases after the last CDS as stop codon. And the frame for each consensus CDS is the frame of the most upstream original CDS that overlaps it. The merge works, but I wonder if there is a solution for these particular cases:

1- reads that should map to junctions of original exons that are no longer distinguishable as junctions in the consensus (merged) exons (imagine alternative donor and acceptor sites).

2- gene with start or stop codons split between two exons (e.g. 1 base of the start codon at the end of the first exon and the next two bases at the beginning of 2nd exon. In this case, the single first base of the start codon is the first CDS by itself, so assigning that and the next two bases on the genomic coordinates as start codon will be wrong). In the Ensembl .gtf, there are ~60 genes like that.

3- dual frame genes: genes read in two different frames in overlapping regions. Two papers using different methods (one bioinformatically through extended ORFs, the other from ribo-profiling data) estimate ~1.25% of genes to show this behavior.


ADD REPLYlink written 3.2 years ago by hossein.asgharian0
gravatar for cdsouthan
7.7 years ago by
cdsouthan1.8k wrote:

This is what Swiss-Prot have been doing for years...... The canonical ORF is the longest and max-exons by default. Literature and TrEMBL supported alternative splices are annotated in the feature lines

ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by cdsouthan1.8k
gravatar for Puriney
7.3 years ago by
New York City
Puriney330 wrote:

Probably you could take a look at Though better move is to get a BED12 finally rather than BED6.

<script src="&lt;a href=" Puriney="" 6655074"="">Puriney/6655074"></script>

ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Puriney330
gravatar for Tariq Daouda
6.1 years ago by
Tariq Daouda210
IRIC | Institute for Research in Immunology and Cancer
Tariq Daouda210 wrote:


I'm the developer of pyGeno. Here's a little script that does just that for the Gene TPST2, by using segment trees:

from pyGeno.Genome import *
from import *

ref = Genome(name = "GRCh37.75")
gene = ref.get(Gene, name = "TPST2")[0]
seg = SegmentTree()

for trans in gene.get(Transcript) :
   for exon in trans.exons :
     #add the current exon position to the tree
     seg.insert(exon.start, exon.end)

#merge all exon positions

consensusSequence = ""
for exon in seg.children :
  consensusSequence += gene.chromosome.sequence[exon.x1:exon.x2]

print consensusSequence



ADD COMMENTlink written 6.1 years ago by Tariq Daouda210
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2023 users visited in the last hour