Remove overlap between Exon and UTR boundaries
2
0
Entering edit mode
19 months ago

I want to remove the overlapped region between exonic and UTR (5UTR and 3UTR) regions to keep the only exonic region.

Trial version of Input dataset is:

Chr Strand  Exon_ID Exon_start  Exon_end    5UTR_start  5UTR_end    3UTR_start  3UTR_end
1   1   AT1G01010.1.exon1   3631    3913    3631    3759    0   0
1   1   AT1G01010.1.exon2   3996    4276    0   0   0   0
1   1   AT1G01010.1.exon6   5439    5899    0   0   5631    5899
1   -1  AT1G01020.1.exon1   8571    9130    8667    9130    0   0
**1 -1  AT1G01060.7.exon8   33662   34327   0   0   33662   33991
1   -1  AT1G01030.1.exon2   11649   12940   12941   13173   11649   11863**


For Instance, for exon (AT1G01010.1.exon1), the Exon strat (3631) and 5UTR start (3631) both starts from the same position (3631), 5UTR region ends at 3759 while exon ends at 3913 since there is an overlap of 128 base pairs (3759-3631) so to keep only exonic region I want to change exonic start as 3760 and exonic end will remain same. But for exon (AT1G01020.1.exon1), the Exon strat (8571) and 5UTR start (8667) both ends at the same position (9130), but there is an overlap of 463 base pairs (9130-8667) so to keep only exonic region I want to change exonic end as 8666 and exonic start will remain same.

Final output should be like this:

Chr Strand  Exon_ID Exon_start  Exon_end    5UTR_start  5UTR_end    3UTR_start  3UTR_end
1   1   AT1G01010.1.exon1   3760    3913    3631    3759    0   0
1   1   AT1G01010.1.exon2   3996    4276    0   0   0   0
1   1   AT1G01010.1.exon6   5439    5630    0   0   5631    5899
1   -1  AT1G01020.1.exon1   8571    8666    8667    9130    0   0
**1 -1  AT1G01060.7.exon8   33992   34327   0   0   33662   33991
1   -1  AT1G01030.1.exon2   11864   12940   12941   13173   11649   11863**


I have tried a few awk commands but wasn't able to get the desired output, Any help will be highly appreciated.

RNA-Seq exon UTR • 496 views
4
Entering edit mode

UTRs are exonic. If you remove UTRs from the exonic regions, the regions you are left with are not exons. If you are doing this for standard DE analysis of RNAseq, it would be very bad practice to remove the UTRs, as UTRs make up on average 30% of a transcript, and often more than 50% of a transcript, so by removing them you are removing 30-50% of your data.

4
Entering edit mode
19 months ago
JC 12k

Actually your modifications are better to have in a script, here is my solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
if (/^Chr/) {
print;
next;
}
chomp;
my ($chr,$dir,$exid,$eini,$eend,$uini5,$uend5,$uini3,$uend3) = split (/\s+/,$_);
if ($dir == 1) { if ($uend5 > 0) {
print join "\t", $chr,$dir,$exid,$uend5+1,$eend,$uini5,$uend5,$uini3,$uend3; } elsif ($uini3 > 0) {
print join "\t", $chr,$dir,$exid,$eini,$uini3-1,$uini5,$uend5,$uini3,$uend3; } else { print; } } else { #$dir == -1
if ($uini5 > 0) { print join "\t",$chr,$dir,$exid,$eini,$uini5-1,$uini5,$uend5,$uini3,$uend3;
} elsif ($uend5 > 0) { print join "\t",$chr,$dir,$exid,$uend3+1,$eend,$uini5,$uend5,$uini3,$uend3;
} else {
print;
}
}
print "\n";
}


Testing it:

$perl exon.pl < coords.txt Chr Strand Exon_ID Exon_start Exon_end 5UTR_start 5UTR_end 3UTR_start 3UTR_end 1 1 AT1G01010.1.exon1 3760 3913 3631 3759 0 0 1 1 AT1G01010.1.exon2 3996 4276 0 0 0 0 1 1 AT1G01010.1.exon3 4486 4605 0 0 0 0 1 1 AT1G01010.1.exon4 4706 5095 0 0 0 0 1 1 AT1G01010.1.exon5 5174 5326 0 0 0 0 1 1 AT1G01010.1.exon6 5439 5630 0 0 5631 5899 1 -1 AT1G01020.1.exon1 8571 8666 8667 9130 0 0  And as a comment, Exon includes UTRs and CDS, so I guess what you want is to define your CDS and UTRs. ADD COMMENT 0 Entering edit mode @JC Many thanks for your help, the above solution worked very value for the trail dataset which I have provided but in the actual dataset I faced a problem where we need to change boundaries of CDS region with respect to UTR regdardless of strand specificness, I have modified the dataset and make it bold so that you can see it, I have also attached file containing values that don't change, I wasn't able to modify your script as I don't have much knowledge about perl, so I will highly appreciate if you please modify it. ADD REPLY 1 Entering edit mode I believe that error is because in line 25 it should be elsif ($uend3 > 0)

0
Entering edit mode

Sorry It was my mistake, i have modified the script but was running the old script, yes its a real data and now the only thing is to address this problem: 1 -1 AT1G01030.1.exon2 11649 12940 12941 13173 11649 11863

0
Entering edit mode

is this real data? this line:

1   -1  AT1G01030.1.exon2   11649   12940   12941   13173   11649   11863


means Exon #2 contains a 5' UTR and a 3' UTR which is unlikely

0
Entering edit mode

Just means its a one exon gene. Over 1000 genes in the human genome have only one exon, but the number is bigger in Arabidopsis, where splicing is simpler (or just less well annotated).

0
Entering edit mode

There are 5393 cases where we have both 3 and 5 UTR regions, please see the file

2
Entering edit mode
19 months ago
JC 12k

Under the new considerations (5'UTR and 3'UTR in exon) the script needs to be:

#!/usr/bin/perl

use strict;
use warnings;

while (<>) {
if (/^Chr/) {
print;
next;
}
chomp;
my ($chr,$dir,$exid,$eini,$eend,$uini5,$uend5,$uini3,$uend3) = split (/\s+/,$_);
if ($dir == 1) { if ($uend5 > 0) {
$eini =$uend5 + 1;
}
if ($uini3 > 0) {$eend = $uini3 - 1; } } else { #$dir == -1
if ($uini5 > 0) {$eend = $uini5 - 1; } if ($uend3 > 0) {
$eini =$uend3 + 1;
}
}
print join "\t", $chr,$dir, $exid,$eini, $eend,$uini5, $uend5,$uini3, $uend3; print "\n"; }  test: $ perl exon2.pl < coords.txt
Chr     Strand  Exon_ID Exon_start      Exon_end        5UTR_start      5UTR_end        3UTR_start      3UTR_end
1       1       AT1G01010.1.exon1       3760    3913    3631    3759    0       0
1       1       AT1G01010.1.exon2       3996    4276    0       0       0       0
1       1       AT1G01010.1.exon6       5439    5630    0       0       5631    5899
1       -1      AT1G01020.1.exon1       8571    8666    8667    9130    0       0
1       -1      AT1G01060.7.exon8       33992   34327   0       0       33662   33991
1       -1      AT1G01030.1.exon2       11864   12940   12941   13173   11649   11863


Please test it with your full dataset, hope this works for you.