Question: How to parse GFF3 file to get introns number and length from CDS coordinates?
gravatar for organizacja7
3.6 years ago by
organizacja70 wrote:

Hello! I am new into Python and I have a project - I have to extract introns of all genes from big gff3 file. So the input looks like this (i cut the last cell with adnotations to make it clearer):

Chr1    phytozome8_0    gene    11218   12435   .   +   .
Chr1    phytozome8_0    mRNA    11218   12435   .   +   .
Chr1    phytozome8_0    five_prime_UTR  11218   11797   .   +   .
Chr1    phytozome8_0    CDS 11798   12060   .   +   0
Chr1    phytozome8_0    CDS 12152   12317   .   +   1
Chr1    phytozome8_0    three_prime_UTR 12318   12435   .   +   .

and I want to get input like: whole first line with gene information, next the number of introns (which is number of CDS-1) and the lenght of introns which is difference between end of one CDS and start of another (in this case 12152-12060). Is there any possibility to achieve what I want with some Python script?

python intron gff3 gene • 2.2k views
ADD COMMENTlink written 3.6 years ago by organizacja70

What have you tried so far? My first attempt would be to implement or find a function gaps which returns the coordinates of all gaps in an interval (mRNA) given a set of intervals partially covering the interval. Further, you will need a library that can read and write valid GFF. Search for the BioPython packages that can do this.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Michael Dondrup47k

Note, your task is not feasible with the GFF from your example. Strictly speaking, this is not even GFF because it lacks the last annotation column with ID and Parent annotation. Without knowing which parent mRNA a CDS or UTR derives from, you cannot assign them to their mRNA. If this is the full example given to you please ask your supervisor to provide you with a valid example.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Michael Dondrup47k

In my file there is a last column with adnotation but I did not paste it in here, sorry. So far I found some libraries to parse GFF files like gffutils and I'm currently learning how to use it.

ADD REPLYlink written 3.6 years ago by organizacja70

Another issue: introns is number of exons - 1, not number of CDS - 1. Unless perhaps you are working with a very primitive organism. CDS annotations are merely a subset of exons, so if your GFF has exon lines (it should), use these instead. Otherwise you will have to consider the UTRs, which are often separate exons.

ADD REPLYlink written 3.6 years ago by apa@stowers480
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1307 users visited in the last hour