How To Merge Isoforms For A Gene
6
5
Entering edit mode
11.0 years ago
camelbbs ▴ 710

Hi All,

I want to ask a tech question. Does anybody have scripts or software that can merge multiple isoforms of a gene into a reference transcript.

Ideally, that script will merge the overlapped exons between isoforms. And give a reference transcript that include the longest exons.

Thanks,

Che

rnaseq • 13k views
ADD COMMENT
0
Entering edit mode

Can you provide a snippet of input, or what you expect to feed into a script? Just a few lines would help.

ADD REPLY
0
Entering edit mode

I just want to handle the refflat.gtf file: chrX hg19_refFlat exon 120073834 120073989 0.000000 - . gene_id "CT47A3"; transcript_id "CT47A3_dup2";

ADD REPLY
0
Entering edit mode

Hi camelbbs,Were you able to solve this problem?I also want to have overlapping exons merged for a gene from a gtf file.

ADD REPLY
0
Entering edit mode

Hi Camelbbs. I'm adding this comment to all your questions: Please take some time, before you ask a question, to think more about your problems and most likely sources of answers (manuals, FAQs, Google!, etc.). When you ask a question, include some context, tell us why you ask that question, what result you need, etc. Most of your questions are vague, impossible to answer or you changed them following an answer because it became evident that it was not clear. Cheers.

ADD REPLY
0
Entering edit mode

Are you familiar with this area or I don't know why you say this.

ADD REPLY
5
Entering edit mode
11.0 years ago

Perhaps the following might get you started.

For exploration purposes, I exported the refFlat table from the UCSC Genome Browser as a GTF file and saved it somewhere I can find it:

$ wget https://dl.dropboxusercontent.com/u/31495717/refFlat.hg19.gtf.gz

I then extracted exons and converted the result to a BED file with BEDOPS gtf2bed‡, and passed it to an awk script that uses an associative array (hash table) to store results based on gene name:

$ gzcat refFlat.hg19.gtf.gz \
    | grep exon \
    | gtf2bed - \
    | awk '{ \
        name[$4]++; \
        if (name[$4] == 1) { \
            chr[$4] = $1; \
            start[$4] = $2; \
            stop[$4] = $3; \
            remainder[$4] = substr($0, index($0, $5)); \
        } \
        else { \
            stop[$4] = $3; \
        } \
    } \
    END { \
        for (id in name) { \
            printf("%s\t%s\t%s\t%s\t%s\n"), chr[id], start[id], stop[id], id, remainder[id]; \
        } \
    }' - \
    > mergedRefFlatExons.hg19.bed

At the first instance of an exon for a gene name, we assign values to elements of associative arrays for the gene name. Where we find two or more exons, the else condition of the if-else block changes the stop position for that gene to the stop position of the last current exon.

If this works for you, then perhaps you can extend it to meet the other condition (reference transcript with the longest exon) by keeping track of the longest current exon, which you might mark in the END block (perhaps with a custom GTF attribute printed at the end of the line).

‡ : Conversion to BED is not necessary. I did this because I am more familiar with handling BED data than GTF. If you are more familiar with GTF files and in which order attributes are stored, you can change the field assignments to elements of each associate array accordingly.

ADD COMMENT
0
Entering edit mode

Just a note here, this script doesn't work for the scenario that the same gene name locates on different chromosomes (eg ChrX and ChrY).

ADD REPLY
2
Entering edit mode
11.0 years ago
Abhi ★ 1.6k

Bedtools is good tool for playing with genomic features. It is fast and efficient. In your case bedtools merge should work. Look for example in the manual they are pretty helpful in getting a visual picture.

hth, -Abhi

ADD COMMENT
3
Entering edit mode

Thanks, but i think bedtools merge can merge all the overlapped exons, ignoring the gene. I want to merge the isoforms belong to a same gene, not from different genes.

ADD REPLY
1
Entering edit mode
11.0 years ago
Malcolm.Cook ★ 1.5k

R's BioConductor gives you:

  • a object oriented re-packaging of ucsc gene model (named after entrez ids)
  • the function reduce to 'merge' the exons of the isoforms into a merged 'maximal' gene
  • a tool to map between entrez ids and gene names (providing names for the merged genes)
  • the rtracklayer package to write it back out in a gff3 file

run R and, first time only, install needed packages:

source("http://bioconductor.org/biocLite.R") 
biocLite(c("TxDb.Hsapiens.UCSC.hg19.knownGene","org.Hs.eg.db","rtracklayer","GenomicFeatures"))  
# Say 'yes' to all questions.  You may need a few more packages

Now, load the gene models, 'merge' the exons within each gene, give them a name, and write a gff of the results

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(rtracklayer)
library(org.Hs.eg.db)
exonsByGene <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,'gene')
mergedGene <- reduce(exonsByGene) 
mergedGene <- 
  ## Remove a few hundred 'tricky cases' like:
  ##  * trans-spliced genes (?? are there any in hg19.knownGene ??)
  ##  * genes on multiple chromosomes (eg: alt. haplotypes "chr_ctg9_hap1" )
  ##    Arguably(?) these should not be considered the 'same gene'.
  mergedGene[1==elementLengths(runValue(strand(mergedGene))) &
                     1==elementLengths(runValue(seqnames(mergedGene)))]  
`]]1`<-function(i,o)o[[i]][[1]] # utility
names(mergedGene)<- # assign gene symbol for export to GFF
     sapply(names(mergedGene),`]]1`,org.Hs.egSYMBOL) # TODO: faster?
export(mergedGene,'Hsapiens.UCSC.hg19.knownGene.merged.gff3')

caveats:

  • 3' and 5' UTR features are lost
  • the 'tricky cases'
ADD COMMENT
0
Entering edit mode

I have merged all overlapping CDSs from each gene into consensus CSDs (I only care about the protein coding parts). Then I duplicated all the consensus CDSs as exons as well, except for the first and last one, which I have extended to the start and end of the transcript to be able to specify the UTR coordinates. Then I designated the first 3 bases of the first merged CDS as start codon and 3 bases after the last CDS as stop codon. And the frame for each consensus CDS is the frame of the most upstream original CDS that overlaps it. The merge works, but I wonder if there is a solution for these particular cases:

1- reads that should map to junctions of original exons that are no longer distinguishable as junctions in the consensus (merged) exons (imagine alternative donor and acceptor sites).

2- gene with start or stop codons split between two exons (e.g. 1 base of the start codon at the end of the first exon and the next two bases at the beginning of 2nd exon. In this case, the single first base of the start codon is the first CDS by itself, so assigning that and the next two bases on the genomic coordinates as start codon will be wrong). In the Ensembl .gtf, there are ~60 genes like that.

3- dual frame genes: genes read in two different frames in overlapping regions. Two papers using different methods (one bioinformatically through extended ORFs, the other from ribo-profiling data) estimate ~1.25% of genes to show this behavior.

Thanks,

ADD REPLY
1
Entering edit mode
11.0 years ago
cdsouthan ★ 1.9k

This is what Swiss-Prot have been doing for years...... The canonical ORF is the longest and max-exons by default. Literature and TrEMBL supported alternative splices are annotated in the feature lines

ADD COMMENT
0
Entering edit mode
10.6 years ago
Puriney ▴ 330

Probably you could take a look at this gist. Though better move is to get a BED12 finally rather than BED6.

ADD COMMENT
0
Entering edit mode
9.3 years ago
Tariq Daouda ▴ 220

Hi,

I'm the developer of pyGeno. Here's a little script that does just that for the Gene TPST2, by using segment trees:

from pyGeno.Genome import *
from pyGeno.tools.SegmentTree import *

ref = Genome(name = "GRCh37.75")
gene = ref.get(Gene, name = "TPST2")[0]
seg = SegmentTree()

for trans in gene.get(Transcript) :
   for exon in trans.exons :
     #add the current exon position to the tree
     seg.insert(exon.start, exon.end)

#merge all exon positions
seg.flatten()

consensusSequence = ""
for exon in seg.children :
  consensusSequence += gene.chromosome.sequence[exon.x1:exon.x2]

print consensusSequence

Cheers

ADD COMMENT

Login before adding your answer.

Traffic: 2763 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6