How to parse GFF3 file to get introns number and length from CDS coordinates?
0
0
Entering edit mode
6.1 years ago

Hello! I am new into Python and I have a project - I have to extract introns of all genes from big gff3 file. So the input looks like this (i cut the last cell with adnotations to make it clearer):

Chr1    phytozome8_0    gene    11218   12435   .   +   .
Chr1    phytozome8_0    mRNA    11218   12435   .   +   .
Chr1    phytozome8_0    five_prime_UTR  11218   11797   .   +   .
Chr1    phytozome8_0    CDS 11798   12060   .   +   0
Chr1    phytozome8_0    CDS 12152   12317   .   +   1
Chr1    phytozome8_0    three_prime_UTR 12318   12435   .   +   .


and I want to get input like: whole first line with gene information, next the number of introns (which is number of CDS-1) and the lenght of introns which is difference between end of one CDS and start of another (in this case 12152-12060). Is there any possibility to achieve what I want with some Python script?

gene intron gff3 python • 3.2k views
1
Entering edit mode

What have you tried so far? My first attempt would be to implement or find a function gaps which returns the coordinates of all gaps in an interval (mRNA) given a set of intervals partially covering the interval. Further, you will need a library that can read and write valid GFF. Search for the BioPython packages that can do this.

0
Entering edit mode

Note, your task is not feasible with the GFF from your example. Strictly speaking, this is not even GFF because it lacks the last annotation column with ID and Parent annotation. Without knowing which parent mRNA a CDS or UTR derives from, you cannot assign them to their mRNA. If this is the full example given to you please ask your supervisor to provide you with a valid example.

0
Entering edit mode

In my file there is a last column with adnotation but I did not paste it in here, sorry. So far I found some libraries to parse GFF files like gffutils and I'm currently learning how to use it.

0
Entering edit mode

Another issue: introns is number of exons - 1, not number of CDS - 1. Unless perhaps you are working with a very primitive organism. CDS annotations are merely a subset of exons, so if your GFF has exon lines (it should), use these instead. Otherwise you will have to consider the UTRs, which are often separate exons.