Entering edit mode
8.4 years ago
organizacja7
•
0
Hello! I am new into Python and I have a project - I have to extract introns of all genes from big gff3 file. So the input looks like this (i cut the last cell with adnotations to make it clearer):
Chr1 phytozome8_0 gene 11218 12435 . + .
Chr1 phytozome8_0 mRNA 11218 12435 . + .
Chr1 phytozome8_0 five_prime_UTR 11218 11797 . + .
Chr1 phytozome8_0 CDS 11798 12060 . + 0
Chr1 phytozome8_0 CDS 12152 12317 . + 1
Chr1 phytozome8_0 three_prime_UTR 12318 12435 . + .
and I want to get input like: whole first line with gene information, next the number of introns (which is number of CDS-1) and the lenght of introns which is difference between end of one CDS and start of another (in this case 12152-12060). Is there any possibility to achieve what I want with some Python script?
What have you tried so far? My first attempt would be to implement or find a function
gaps
which returns the coordinates of all gaps in an interval (mRNA) given a set of intervals partially covering the interval. Further, you will need a library that can read and write valid GFF. Search for the BioPython packages that can do this.Note, your task is not feasible with the GFF from your example. Strictly speaking, this is not even GFF because it lacks the last annotation column with ID and Parent annotation. Without knowing which parent mRNA a CDS or UTR derives from, you cannot assign them to their mRNA. If this is the full example given to you please ask your supervisor to provide you with a valid example.
In my file there is a last column with adnotation but I did not paste it in here, sorry. So far I found some libraries to parse GFF files like gffutils and I'm currently learning how to use it.
Another issue: introns is number of exons - 1, not number of CDS - 1. Unless perhaps you are working with a very primitive organism. CDS annotations are merely a subset of exons, so if your GFF has exon lines (it should), use these instead. Otherwise you will have to consider the UTRs, which are often separate exons.