Closed:Converting genomic coordinates in gff to transcript coordinates
1
0
Entering edit mode
9.5 years ago
Saima ▴ 10

I want to calculate the start and end positions of 5'UTR/CDS/3'UTR features in the TAIR10 Arabidopsis gene annotation gff file based on the transcript length. I thought of doing this by first calculating length of each feature, and then based on that getting feature start and end for each transcript in the gff file. I wrote a code using awk to get the feature length (based on the code here), but I am stuck with the second part. I will really appreciate if someone can help me to find my way!

Here is the test data for the first transcript in the gff annotation file:

#Chr     source           feature          genomic_start     genomic_end     score     strand     frame     attributes
Chr1     phytozomev10     gene             3631              5899            .         +          .         ID=AT1G01010.TAIR10;Name=AT1G01010
Chr1     phytozomev10     mRNA             3631              5899            .         +          .         ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
Chr1     phytozomev10     five_prime_UTR   3631              3759            .         +          .         ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              3760              3913            .         +          0         ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              3996              4276            .         +          2         ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              4486              4605            .         +          0         ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              4706              5095            .         +          0         ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              5174              5326            .         +          0         ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     CDS              5439              5630            .         +          0         ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1     phytozomev10     three_prime_UTR  5631              5899            .         +          .         ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964

And my desired output should have 3 additional columns for feature length and positions at the end of original gff:

#Chr     source           feature          genomic_start     genomic_end     score     strand     frame     ID              feature_length     feature_start     feature_end
Chr1     phytozomev10     five_prime_UTR   3631              3759            .         +          .         AT1G01010.1     129                1                 129
Chr1     phytozomev10     CDS              3760              3913            .         +          0         AT1G01010.1     154                130               283
Chr1     phytozomev10     CDS              3996              4276            .         +          2         AT1G01010.1     281                284               564
Chr1     phytozomev10     CDS              4486              4605            .         +          0         AT1G01010.1     120                565               684
Chr1     phytozomev10     CDS              4706              5095            .         +          0         AT1G01010.1     390                685               1074
Chr1     phytozomev10     CDS              5174              5326            .         +          0         AT1G01010.1     153                1075              1227
Chr1     phytozomev10     CDS              5439              5630            .         +          0         AT1G01010.1     192                1228              1419
Chr1     phytozomev10     three_prime_UTR  5631              5899            .         +          .         AT1G01010.1     269                1420              1688
grep "AT1G01010" Athaliana_167_TAIR10.gene.gff3 |\
  awk '$3!="gene" && $3!="mRNA"' |\
  awk  '$9~/Parent/{
    sub(/.*Parent=/,"",$9);
    sub(/.TAIR10;.*/,"",$9);
    ID=$9;
    L[ID]=$5-$4+1
    } {for(i in L){print $0,L[i]}}'
awk gene-feature • 715 views
ADD COMMENT
This thread is not open. No new answers may be added
Traffic: 2006 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6