I want to calculate the start and end positions of 5'UTR/CDS/3'UTR features in the TAIR10 Arabidopsis gene annotation gff file based on the transcript length. I thought of doing this by first calculating length of each feature, and then based on that getting feature start and end for each transcript in the gff file. I wrote a code using awk to get the feature length (based on the code here), but I am stuck with the second part. I will really appreciate if someone can help me to find my way!
Here is the test data for the first transcript in the gff annotation file:
#Chr source feature genomic_start genomic_end score strand frame attributes
Chr1 phytozomev10 gene 3631 5899 . + . ID=AT1G01010.TAIR10;Name=AT1G01010
Chr1 phytozomev10 mRNA 3631 5899 . + . ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
Chr1 phytozomev10 five_prime_UTR 3631 3759 . + . ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 3760 3913 . + 0 ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 3996 4276 . + 2 ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 4486 4605 . + 0 ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 4706 5095 . + 0 ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 5174 5326 . + 0 ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 5439 5630 . + 0 ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 three_prime_UTR 5631 5899 . + . ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
And my desired output should have 3 additional columns for feature length and positions at the end of original gff:
#Chr source feature genomic_start genomic_end score strand frame ID feature_length feature_start feature_end
Chr1 phytozomev10 five_prime_UTR 3631 3759 . + . AT1G01010.1 129 1 129
Chr1 phytozomev10 CDS 3760 3913 . + 0 AT1G01010.1 154 130 283
Chr1 phytozomev10 CDS 3996 4276 . + 2 AT1G01010.1 281 284 564
Chr1 phytozomev10 CDS 4486 4605 . + 0 AT1G01010.1 120 565 684
Chr1 phytozomev10 CDS 4706 5095 . + 0 AT1G01010.1 390 685 1074
Chr1 phytozomev10 CDS 5174 5326 . + 0 AT1G01010.1 153 1075 1227
Chr1 phytozomev10 CDS 5439 5630 . + 0 AT1G01010.1 192 1228 1419
Chr1 phytozomev10 three_prime_UTR 5631 5899 . + . AT1G01010.1 269 1420 1688
grep "AT1G01010" Athaliana_167_TAIR10.gene.gff3 |\
awk '$3!="gene" && $3!="mRNA"' |\
awk '$9~/Parent/{
sub(/.*Parent=/,"",$9);
sub(/.TAIR10;.*/,"",$9);
ID=$9;
L[ID]=$5-$4+1
} {for(i in L){print $0,L[i]}}'