How To Remove Overlapping And Common Exons From Multiple Transcripts Of A Gene
9.8 years ago
Simran ▴ 40

Hi,

I have data from which I need to remove those transcripts of a gene whose exons overlap with, or are shared with, another transcript of the same gene. The third column of my file holds the start coordinates of all exons of a transcript, and the fourth column holds the exon end coordinates. In most cases a transcript has multiple exons, separated by semicolons. For example:

ENSG00000004399    ENST00000512744    129275460;129277271;129275926;129274061  129275534;129277364;129276066;129275271
ENSG00000004399    ENST00000393239    129275926;129274018;129277271;129275460     129276066;129275271;129277364;129275534
ENSG00000004399    ENST00000505665    129302968;129302474    129303067;129302512
ENSG00000001167    ENST00000353205    41065150    41065689
ENSG00000001167    ENST00000341376    41065150    41067715

After removing redundancy, the output should contain only those transcripts that have the longest, non-redundant exons:

ENSG00000004399    ENST00000393239    129275926;129274018;129277271;129275460     129276066;129275271;129277364;129275534
ENSG00000004399    ENST00000505665    129302968;129302474     129303067;129302512
ENSG00000001167    ENST00000341376    41065150    41067715

Can anyone suggest how to get this result? Thanks in advance.

exon transcript • 6.7k views

Given that you already have an example, is this homework? What did you try so far?

What exactly are you trying to accomplish? Merging isoforms into a gene structure?

Micheal, I have already removed the single-exon transcripts that had common or overlapping exons. I am now trying to deal with this list, where transcripts have multiple exons, and thought of asking for help here. This is not homework, by the way.

Hi DK, I am trying to remove those transcripts from my list whose exons are already present in another transcript. For example, entries like ENST00000512744 and ENST00000353205 should be removed, because the exons in these transcripts are already covered by other transcripts in the list above. This removes the redundancy from my data.

In the example you posted, you removed the first transcript (512744) and kept the second transcript (393239). The first exon of the first transcript is: 129275460 - 129275534. The first exon of the second transcript is: 129275926 - 129276066. Why did you decide to remove the first one in that case? The first transcript exon has an extra ~500 bases upstream of the second transcript exon.

The exons are not sorted in ascending order. As you see, 129275460 is also present in second transcript (393239). I selected this one because the second exon 129274018 in this transcript ends at 129275271, whereas in first one (512744), it starts at 129274061 and ends at same position 129275271.

Ahh, I see. My mistake. So you just want to remove transcripts that are completely within another transcript.

Yes, I think I should have explained my question a bit better :)

Is there data on what gene and reference contig the transcript belongs to? What are the first and second columns of your data?

First column is the gene id and second is transcript id. The transcript ids belong to their respective geneids. In the above list, first three transcripts belong to same geneid.

9.8 years ago

I had a script I wrote a while back that found consensus sequences in a group of annotations from a gtf file. I modified it to work with your data. This script should remove transcripts that are completely within another transcript. I would spot check the data to see if this script worked:

import sys

def overlap(coordA, coordB):
    # Length of the overlap between two closed intervals, or -1 if they do not overlap
    overlapLength = -1
    if isWithin(coordA[0], coordB) or isWithin(coordA[1], coordB) or isWithin(coordB[0], coordA) or isWithin(coordB[1], coordA):
        overlapLength = min(coordA[1], coordB[1]) - max(coordA[0], coordB[0]) + 1

    return overlapLength

def isWithin(query, coords):
    # True if position `query` lies inside the closed interval coords = (start, end)
    start = coords[0]
    end = coords[1]
    return start <= query <= end

def transcriptWithin(coordsA, coordsB):
    # True if every exon of transcript A lies completely inside some exon of transcript B
    for coordA in coordsA:
        lenExonA = coordA[1] - coordA[0] + 1
        # Reset once per exon of A, so it is defined even if coordsB is empty
        within = False
        for coordB in coordsB:
            if overlap(coordA, coordB) == lenExonA:
                within = True
                break

        if not within:
            return False

    return True

# First pass: group transcripts by gene id
genes = {}
with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        # Split on any whitespace so both tab- and space-separated files work
        data = line.split()
        if not data:
            continue  # skip blank lines
        gid = data[0]
        tid = data[1]
        starts = data[2].split(';')
        ends = data[3].split(';')
        exonCoords = []
        for i in range(len(starts)):
            exonCoords.append((int(starts[i]), int(ends[i])))

        if gid not in genes:
            genes[gid] = []

        genes[gid].append((tid, exonCoords))

# Mark every transcript that is completely within another transcript of the same gene
remove = {}
for gid, transcripts in genes.items():
    for tA in transcripts:
        for tB in transcripts:
            # Ignore transcripts already marked, so that two identical
            # transcripts do not remove each other
            if tA[0] != tB[0] and tB[0] not in remove:
                if transcriptWithin(tA[1], tB[1]):
                    remove[tA[0]] = True
                    break

# Second pass: print every line whose transcript was not marked for removal
with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        data = line.split()
        if not data:
            continue
        tid = data[1]

        if tid not in remove:
            print(line.strip())

Save it as yourName.py and run it with: python yourName.py yourData.file > output.file
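
For the spot check suggested above, the containment rule can be tried on the three ENSG00000004399 rows from the question. This is a minimal sketch and not the script itself: the names `parse_exons`, `is_contained`, and `redundant_ids` are made up for this illustration, and it assumes closed (start, end) intervals and that all transcripts passed in belong to the same gene.

```python
def parse_exons(starts, ends):
    """Turn two ';'-separated strings into a sorted list of (start, end) tuples."""
    pairs = zip(starts.split(";"), ends.split(";"))
    return sorted((int(s), int(e)) for s, e in pairs)


def is_contained(exons_a, exons_b):
    """True if every exon in exons_a lies inside some exon in exons_b."""
    return all(
        any(b_start <= a_start and a_end <= b_end for b_start, b_end in exons_b)
        for a_start, a_end in exons_a
    )


def redundant_ids(transcripts):
    """Return the ids of transcripts fully covered by another transcript.

    `transcripts` maps transcript id -> exon list (one gene assumed).
    """
    removed = set()
    for tid_a, exons_a in transcripts.items():
        for tid_b, exons_b in transcripts.items():
            # Skip self-comparison and transcripts already dropped, so that
            # two identical transcripts do not eliminate each other.
            if tid_a == tid_b or tid_b in removed:
                continue
            if is_contained(exons_a, exons_b):
                removed.add(tid_a)
                break
    return removed


# Example rows for ENSG00000004399, copied from the question.
gene = {
    "ENST00000512744": parse_exons(
        "129275460;129277271;129275926;129274061",
        "129275534;129277364;129276066;129275271"),
    "ENST00000393239": parse_exons(
        "129275926;129274018;129277271;129275460",
        "129276066;129275271;129277364;129275534"),
    "ENST00000505665": parse_exons(
        "129302968;129302474",
        "129303067;129302512"),
}

print(sorted(redundant_ids(gene)))
# → ['ENST00000512744']
```

Only ENST00000512744 is flagged, which matches the expected output in the question: its exon 129274061-129275271 sits inside 129274018-129275271 of ENST00000393239, and its other three exons are identical.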

thanks a lot...

9.8 years ago

Linearize the intervals using awk (one exon per line, with starts taken from column 3 and ends from column 4):

awk '{N=split($3,B,";");split($4,E,";"); for(i=1;i<=N;++i) printf("%s\t%s\t%s\t%s\n",$1,$2,B[i],E[i]);}' < file.txt
ENSG00000004399 ENST00000512744 129275460   129275534
ENSG00000004399 ENST00000512744 129277271   129277364
ENSG00000004399 ENST00000512744 129275926   129276066
ENSG00000004399 ENST00000512744 129274061   129275271
ENSG00000004399 ENST00000393239 129275926   129276066
ENSG00000004399 ENST00000393239 129274018   129275271
ENSG00000004399 ENST00000393239 129277271   129277364
ENSG00000004399 ENST00000393239 129275460   129275534
ENSG00000004399 ENST00000505665 129302968   129303067
ENSG00000004399 ENST00000505665 129302474   129302512
ENSG00000001167 ENST00000353205 41065150    41065689
ENSG00000001167 ENST00000341376 41065150    41067715

Then merge the intervals using bedtools mergeBed and re-generate the file using awk.

Thanks for your reply, linearizing the intervals first is a good start. I think mergeBed will merge overlapping exons into a single entry, which means exons from all transcripts would be merged into one. However, I want to retain the actual exon coordinates for each transcript and remove only those transcripts whose exons are already covered by other, bigger transcripts.
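
To make this concern concrete, here is a minimal Python sketch of what interval merging does to the two ENSG00000001167 transcripts from the question. `merge_intervals` is a hypothetical helper written for this illustration; it only mimics the general idea of merging overlapping intervals and is not bedtools' implementation.

```python
def merge_intervals(intervals):
    """Merge overlapping closed (start, end) intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Single-exon transcripts of ENSG00000001167, copied from the question.
exons = {
    "ENST00000353205": (41065150, 41065689),
    "ENST00000341376": (41065150, 41067715),
}

print(merge_intervals(exons.values()))
# → [(41065150, 41067715)]
```

The merged interval has the right coordinates but no longer says which transcript it came from, so a per-transcript containment test is still needed to decide which transcript to drop.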

9.8 years ago

You can also feed bedtools ('bed12ToBed6') with a BED12 UCSC download of your transcriptome to generate a BED6 file (one exon per line) and then extract unique lines using 'uniq' or sortBed. You can then use mergeBed to merge all partially overlapping unique lines (differentially spliced exons due to alternative UTRs, etc.) from the BED6 file into one common line with all IDs merged in the name field. Of course you lose some information by doing so, but you obtain a non-overlapping BED file, which is required for some applications. Hope this helps!

5.2 years ago
lmanchon • 0

Hello,

I've tried the Python code above on my file:

ENSG00000157764    ENST00000496384    140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734617;140719327    140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770;140726516
ENSG00000157764    ENST00000288602    140924566;140850111;140834609;140808892;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734479    140924764;140850212;140834872;140808995;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764    ENST00000479537    140754187;140753275;140749287;140747366;140739812;140734521    140754211;140753393;140749418;140747447;140739946;140734770
ENSG00000157764    ENST00000497784    140924566;140850111;140834609;140808892;140808237;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734597    140924658;140850212;140834872;140808995;140808316;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764    ENST00000469930    140924566;140850111;140834061    140924709;140850212;140834872

and nothing changes:

ENSG00000157764    ENST00000496384    140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734617;140719327    140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770;140726516
ENSG00000157764    ENST00000288602    140924566;140850111;140834609;140808892;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734479    140924764;140850212;140834872;140808995;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764    ENST00000479537    140754187;140753275;140749287;140747366;140739812;140734521    140754211;140753393;140749418;140747447;140739946;140734770
ENSG00000157764    ENST00000497784    140924566;140850111;140834609;140808892;140808237;140807960;140801412;140800362;140794308;140787548;140783021;140781576;140777991;140776912;140754187;140753275;140749287;140739812;140734597    140924658;140850212;140834872;140808995;140808316;140808062;140801560;140800481;140794467;140787584;140783157;140781693;140778075;140777088;140754233;140753393;140749418;140739946;140734770
ENSG00000157764    ENST00000469930    140924566;140850111;140834061    140924709;140850212;140834872
