Question

How To Remove Overlapping And Common Exons From Multiple Transcripts Of A Gene

2

12.4 years ago

Simran ▴ 40

Hi,

I have data from which I need to remove those transcripts of a gene whose exons overlap with, or are identical to, exons of another transcript. The third column of my file holds the start coordinates of all exons of a transcript, and the fourth column holds the exon end coordinates. In most cases a transcript has multiple exons, separated by semicolons. For example:

ENSG00000004399    ENST00000512744    129275460;129277271;129275926;129274061  129275534;129277364;129276066;129275271
ENSG00000004399    ENST00000393239    129275926;129274018;129277271;129275460     129276066;129275271;129277364;129275534
ENSG00000004399    ENST00000505665    129302968;129302474    129303067;129302512
ENSG00000001167    ENST00000353205    41065150    41065689
ENSG00000001167    ENST00000341376    41065150    41067715

The output after removing redundancy should contain only those transcripts that have the longest, non-redundant exons:

ENSG00000004399    ENST00000393239    129275926;129274018;129277271;129275460     129276066;129275271;129277364;129275534
ENSG00000004399    ENST00000505665    129302968;129302474     129303067;129302512
ENSG00000001167    ENST00000341376    41065150    41067715

Can anyone suggest how to get this result? Thanks in advance.

exon transcript • 8.7k views

ADD COMMENT • link updated 7.8 years ago by lmanchon • 0 • written 12.4 years ago by Simran ▴ 40

1

Given that you already have an example, is this homework? What did you try so far?

ADD REPLY • link 12.4 years ago by Michael Kuhn 5.0k

0

What exactly are you trying to accomplish? Merging isoforms into a gene structure?

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Michael, I have already removed the single-exon transcripts that had common or overlapping exons. I am now trying to deal with this list, where transcripts have multiple exons, and thought I would ask for help here. This is not homework, anyway.

ADD REPLY • link 12.4 years ago by Simran ▴ 40

0

Hi DK, I am trying to remove those transcripts from my list whose exons are already present in another transcript, e.g. entries like ENST00000512744 and ENST00000353205, since the exons in these transcripts are already covered by other transcripts in the list above. That removes the redundancy from my data.

ADD REPLY • link 12.4 years ago by Simran ▴ 40

0

In the example you posted, you removed the first transcript (512744) and kept the second transcript (393239). The first exon of the first transcript is: 129275460 - 129275534. The first exon of the second transcript is: 129275926 - 129276066. Why did you decide to remove the first one in that case? The first transcript exon has an extra ~500 bases upstream of the second transcript exon.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

The exons are not sorted in ascending order. As you can see, the exon starting at 129275460 is also present in the second transcript (393239). I kept that transcript because its second exon starts at 129274018 and ends at 129275271, whereas in the first transcript (512744) the corresponding exon starts at 129274061 and ends at the same position, 129275271.
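Because the exon lists are unsorted, pairing each start with its end and sorting by start makes two transcripts easier to compare. The following is only a minimal sketch; `sorted_exons` is a hypothetical helper name, and the input strings are the third and fourth columns of the two transcripts discussed here.

```python
def sorted_exons(starts_field, ends_field):
    """Pair semicolon-separated starts and ends, then sort the exons by start."""
    starts = [int(s) for s in starts_field.split(";")]
    ends = [int(e) for e in ends_field.split(";")]
    if len(starts) != len(ends):
        raise ValueError("start and end lists differ in length")
    return sorted(zip(starts, ends))


# ENST00000512744 and ENST00000393239 from the example data
t512744 = sorted_exons("129275460;129277271;129275926;129274061",
                       "129275534;129277364;129276066;129275271")
t393239 = sorted_exons("129275926;129274018;129277271;129275460",
                       "129276066;129275271;129277364;129275534")

# Print the exons side by side; only the first pair differs.
for exon_a, exon_b in zip(t512744, t393239):
    print(exon_a, exon_b, "same" if exon_a == exon_b else "different")
```

Once sorted, the two transcripts differ only in their first exon, which starts at 129274061 in 512744 and at 129274018 in 393239.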

ADD REPLY • link 12.4 years ago by Simran ▴ 40

0

Ahh, I see. My mistake. So you just want to remove transcripts that are completely within another transcript.

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

Yes, I think I should have explained my question a bit better :)

ADD REPLY • link 12.4 years ago by Simran ▴ 40

0

Is there data on what gene and reference contig the transcript belongs to? What are the first and second columns of your data?

ADD REPLY • link 12.4 years ago by Damian Kao 16k

0

The first column is the gene ID and the second is the transcript ID. Each transcript ID belongs to the gene ID on the same line. In the list above, the first three transcripts belong to the same gene.

ADD REPLY • link 12.4 years ago by Simran ▴ 40

score 4 · Answer 1 · 2011-11-25

I had a script I wrote a while back that found consensus sequences in a group of annotations from a gtf file. I modified it to work with your data. This script should remove transcripts that are completely within another transcript. I would spot check the data to see if this script worked:

import sys

def isWithin(query, coords):
    # True if position `query` lies inside the closed interval (start, end)
    start, end = coords
    return start <= query <= end

def overlap(coordA, coordB):
    # Number of bases shared by two closed intervals, or -1 if they do not overlap
    overlapLength = -1
    if isWithin(coordA[0], coordB) or isWithin(coordA[1], coordB) or isWithin(coordB[0], coordA) or isWithin(coordB[1], coordA):
        overlapLength = min(coordA[1], coordB[1]) - max(coordA[0], coordB[0]) + 1

    return overlapLength

def transcriptWithin(coordsA, coordsB):
    # True if every exon in coordsA lies completely inside some exon in coordsB
    for coordA in coordsA:
        within = False
        lenExonA = coordA[1] - coordA[0] + 1
        for coordB in coordsB:
            if overlap(coordA, coordB) == lenExonA:
                within = True
                break

        if not within:
            return False

    return True

# First pass: collect (transcript id, exon coordinates) per gene
genes = {}
with open(sys.argv[1]) as inFile:
    for line in inFile:
        # split() accepts tabs or spaces as the column separator
        data = line.split()
        if len(data) < 4:
            continue
        gid = data[0]
        tid = data[1]
        starts = data[2].split(';')
        ends = data[3].split(';')
        exonCoords = []
        for i in range(len(starts)):
            exonCoords.append((int(starts[i]), int(ends[i])))

        if gid not in genes:  # dict.has_key() no longer exists in Python 3
            genes[gid] = []

        genes[gid].append((tid, exonCoords))

# Flag every transcript that lies within another transcript of the same gene
remove = set()
for gid, transcripts in genes.items():
    for tA in transcripts:
        for tB in transcripts:
            if tA[0] != tB[0]:
                if transcriptWithin(tA[1], tB[1]):
                    remove.add(tA[0])
                    break

# Second pass: print only the lines whose transcript was not flagged
with open(sys.argv[1]) as inFile:
    for line in inFile:
        data = line.split()
        if len(data) < 2:
            continue
        tid = data[1]

        if tid not in remove:
            print(line.strip())

Save it as yourName.py and run it with: python yourName.py yourData.file > output.file
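One edge case worth spot-checking: the script flags a transcript whenever it lies within any other transcript of the same gene, so two transcripts with identical exon coordinates would each be flagged because of the other, and both would disappear. The sketch below shows one possible way to handle that, keeping the earlier of two transcripts that cover each other. The function names and the tie-break rule are assumptions for illustration and are not part of the script above.

```python
def exon_within(exon, other_exons):
    """True if the closed interval `exon` lies inside any interval in `other_exons`."""
    start, end = exon
    return any(s <= start and end <= e for s, e in other_exons)


def transcript_within(exons_a, exons_b):
    """True if every exon of A lies inside some exon of B."""
    return all(exon_within(exon, exons_b) for exon in exons_a)


def redundant_transcripts(transcripts):
    """Return the IDs to drop from one gene's list of (transcript_id, exons) pairs."""
    drop = set()
    for i, (tid_a, exons_a) in enumerate(transcripts):
        for j, (tid_b, exons_b) in enumerate(transcripts):
            if i == j or not transcript_within(exons_a, exons_b):
                continue
            # Assumed tie-break: if A and B cover each other, keep the earlier one.
            if transcript_within(exons_b, exons_a) and i < j:
                continue
            drop.add(tid_a)
            break
    return drop


# The three ENSG00000004399 transcripts from the question
gene = [
    ("ENST00000512744", [(129275460, 129275534), (129277271, 129277364),
                         (129275926, 129276066), (129274061, 129275271)]),
    ("ENST00000393239", [(129275926, 129276066), (129274018, 129275271),
                         (129277271, 129277364), (129275460, 129275534)]),
    ("ENST00000505665", [(129302968, 129303067), (129302474, 129302512)]),
]
print(sorted(redundant_transcripts(gene)))  # ['ENST00000512744']
```

On the question's data this drops ENST00000512744 only, which matches the expected output for that gene.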

score 1 · Answer 2 · 2011-11-25

Linearize the intervals using awk (the starts come from column 3 and the ends from column 4):

awk '{N=split($3,B,";");split($4,E,";"); for(i=1;i<=N;++i) printf("%s\t%s\t%s\t%s\n",$1,$2,B[i],E[i]);}' < file.txt
ENSG00000004399 ENST00000512744 129275460   129275534
ENSG00000004399 ENST00000512744 129277271   129277364
ENSG00000004399 ENST00000512744 129275926   129276066
ENSG00000004399 ENST00000512744 129274061   129275271
ENSG00000004399 ENST00000393239 129275926   129276066
ENSG00000004399 ENST00000393239 129274018   129275271
ENSG00000004399 ENST00000393239 129277271   129277364
ENSG00000004399 ENST00000393239 129275460   129275534
ENSG00000004399 ENST00000505665 129302968   129303067
ENSG00000004399 ENST00000505665 129302474   129302512
ENSG00000001167 ENST00000353205 41065150    41065689
ENSG00000001167 ENST00000341376 41065150    41067715

Then merge the intervals using bedtools mergeBed and regenerate the file using awk.
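For readers who want to see what the merge step does, here is a rough Python equivalent: sort the intervals, then fold each one into the previous interval when they overlap. This is a sketch of the general technique, not of bedtools itself. `merge_intervals` is a hypothetical name, coordinates are treated as closed intervals, and grouping by gene ID is an assumption made because the example data has no chromosome column.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping closed intervals and return them sorted by start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Linearized rows (gene, transcript, start, end) for ENSG00000001167
rows = [
    ("ENSG00000001167", "ENST00000353205", 41065150, 41065689),
    ("ENSG00000001167", "ENST00000341376", 41065150, 41067715),
]

by_gene = defaultdict(list)
for gene_id, _transcript_id, start, end in rows:
    by_gene[gene_id].append((start, end))

for gene_id, intervals in by_gene.items():
    print(gene_id, merge_intervals(intervals))
```

Note that merging collapses the exons of all transcripts into one set of intervals per gene, so the transcript IDs are lost unless they are carried along separately.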

score 1 · Answer 3 · 2011-11-26

You can also feed bedtools ('bed12ToBed6') a BED12 UCSC download of your transcriptome to generate a BED6 file (one exon per line), sort it (for example with sortBed), and then extract the unique lines using 'uniq'. You can then use mergeBed to merge all partially overlapping unique lines (differentially spliced exons due to alternative UTRs, ...) from the BED6 file into one common line with all IDs merged in the name field. Of course you lose some information by doing so, but you obtain a non-overlapping BED file, which is required for some applications. Hope this helps!
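To make the BED12 to BED6 step concrete, the sketch below expands one BED12 record into one tuple per exon. It relies on the standard BED12 layout, in which column 11 holds the block sizes and column 12 holds the block starts relative to chromStart. The function name and the example record are made up for illustration, and this is not how bedtools implements the conversion.

```python
def bed12_to_bed6(line):
    """Expand one BED12 line into a list of BED6 tuples, one per block (exon)."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    name, score, strand = fields[3], fields[4], fields[5]
    # blockSizes and blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in fields[10].split(",") if x]
    offsets = [int(x) for x in fields[11].split(",") if x]
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size, name, score, strand)
        for size, offset in zip(sizes, offsets)
    ]


# Hypothetical two-exon transcript, not taken from the thread
record = "chr1\t1000\t2000\ttxA\t0\t+\t1000\t2000\t0\t2\t100,200,\t0,800,"
for exon in bed12_to_bed6(record):
    print("\t".join(str(value) for value in exon))
```

Exons shared by several transcripts then appear as lines with identical coordinates, which is what the uniq and mergeBed steps operate on.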

score 0 · Answer 4 · 2016-07-14

Hello,

I tried the Python code above on my file:

ENSG00000157764 ENST00000496384 140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734617;140719327 140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770;140726516
ENSG00000157764 ENST00000288602 140924566;140850111;140834609;140808892;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734479 140924764;140850212;140834872;140808995;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764 ENST00000479537 140754187;140753275;140749287;140747366;140739812;140734521 140754211;140753393;140749418;140747447;140739946;140734770
ENSG00000157764 ENST00000497784 140924566;140850111;140834609;140808892;140808237;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734597 140924658;140850212;140834872;140808995;140808316;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764 ENST00000469930 140924566;140850111;140834061 140924709;140850212;140834872

and nothing changed:

ENSG00000157764 ENST00000496384 140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734617;140719327 140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770;140726516
ENSG00000157764 ENST00000288602 140924566;140850111;140834609;140808892;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734479 140924764;140850212;140834872;140808995;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764 ENST00000479537 140754187;140753275;140749287;140747366;140739812;140734521 140754211;140753393;140749418;140747447;140739946;140734770
ENSG00000157764 ENST00000497784 140924566;140850111;140834609;140808892;140808237;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734597 140924658;140850212;140834872;140808995;140808316;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764 ENST00000469930 140924566;140850111;140834061 140924709;140850212;140834872