Hi All,
I am looking for a script to obtain summary statistics such as transcript number, base number, length, intron length, etc., from GTF/GFF3 file(s) and a genome. Does anyone know of one?
Thank you for all your help!
There are several solutions for that (updated to put everything in one place):
In Perl I use agat_sp_statistics.pl from the GFF toolkit AGAT. See here for an example of the output. This solution has the advantage of working with any GTF/GFF flavor (even unsorted files or files containing errors).
In Python, GAG is a good solution for that purpose: http://genomeannotation.github.io/GAG. From a directory containing your genome (genome.fasta) and your annotation (genome.gff), launch GAG, load the files by typing "load" (by default it looks for genome.fasta and genome.gff), and finally type "info" to get complete summary statistics for your annotation. It works perfectly well with the GFF3 format.
In Perl+bash there is GFF-Ex. When I tried it, it didn't work for me (maybe because of the specific GFF flavor I was using).
In Bash, using awk or grep commands.
There are solutions in R, see here for an example.
Using GenomeTools with the command gt stat
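If none of the tools above fits, the basic numbers can also be computed by hand. Below is a minimal Python sketch, not taken from any of the tools listed. It assumes a GFF3 file in which exon features carry a Parent attribute pointing to their transcript, counts only transcripts that have at least one exon, and defines introns as the gaps between consecutive exons of the same transcript. The function name gff3_summary is just a placeholder for illustration.

```python
import sys
from collections import defaultdict


def gff3_summary(lines):
    """Summarise transcripts, exons and introns from GFF3 lines.

    Assumption: exons reference their transcript through Parent=<id>.
    """
    exons = defaultdict(list)  # transcript ID -> list of (start, end)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # An exon may be shared by several transcripts (comma-separated).
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4])))

    exon_bases = 0
    intron_lengths = []
    for spans in exons.values():
        spans.sort()
        # GFF3 coordinates are 1-based and inclusive.
        exon_bases += sum(end - start + 1 for start, end in spans)
        # Introns are the gaps between consecutive exons of one transcript.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            intron_lengths.append(next_start - prev_end - 1)

    n_tx = len(exons)
    n_introns = len(intron_lengths)
    return {
        "transcripts": n_tx,
        "exons": sum(len(spans) for spans in exons.values()),
        "exon_bases": exon_bases,
        "mean_transcript_length": exon_bases / n_tx if n_tx else 0.0,
        "introns": n_introns,
        "mean_intron_length": sum(intron_lengths) / n_introns if n_introns else 0.0,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for key, value in gff3_summary(handle).items():
            print(f"{key}\t{value}")
```

Saved under a name of your choice, it would be run with the annotation file as its only argument (for example genome.gff). Note that it does not handle GTF, whose attribute column uses a different syntax, and it does not need the genome sequence; for messy or unsorted files the dedicated tools above are the safer choice.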
Related posts:
A: Analysis gff3 file
Plot statistics from gtf/gff file
bedtools probably does a lot of what you want; have a look at its documentation and usage examples.
Dear h.mon,
BedTools is excellent in many respects. During my search for GFF3 parsers, I also came across some very useful tools;
It can also be useful for GFF parsing (and probably works for GTF with small changes).