Hello! I have a GFF3 file produced by AUGUSTUS and need to build a table with the names of the predicted genes, their sizes, and the number of introns and exons for each gene. But I have run into a problem because the file contains comments throughout its entire length. I have about 67,000 genes, so it would take too long to do it one by one. Does anyone have an idea of what I should do?
The question is about per-gene statistics, so my answer may not be pertinent, but here is a solution for global statistics:
GAG is really good for that: the code is on GitHub, and there is a publication describing it.
But as it was not exhaustive enough for my purpose, I wrote my own script, agat_sp_statistics.pl, available in AGAT (formerly gff3_sp_statistics.pl, which required a proper installation of the GAAS repository).
You can get this kind of result (and even more if the FASTA file is provided too):
Number of genes 27707
Number of mrnas 27707
Number of mrnas with utr both sides 9985
Number of mrnas with at least one utr 20693
Number of cdss 27707
Number of exons 131919
Number of five_prime_utrs 15301
Number of three_prime_utrs 15377
Number of exon in cds 119534
Number of exon in five_prime_utr 21204
Number of exon in three_prime_utr 21696
Number of intron in cds 91827
Number of intron in exon 104212
Number of intron in five_prime_utr 5903
Number of intron in three_prime_utr 6319
Number of single exon gene 1232
Number of single exon mrna 1232
mean mrnas per gene 1.0
mean cdss per mrna 1.0
mean exons per mrna 4.8
mean five_prime_utrs per mrna 0.6
mean three_prime_utrs per mrna 0.6
mean exons per cds 4.3
mean exons per five_prime_utr 1.4
mean exons per three_prime_utr 1.4
mean introns in cdss per mrna 3.3
mean introns in exons per mrna 3.8
mean introns in five_prime_utrs per mrna 0.2
mean introns in three_prime_utrs per mrna 0.2
Total gene length 346693759
Total mrna length 334573649
Total cds length 25184373
Total exon length 42796985
Total five_prime_utr length 3907368
Total three_prime_utr length 13705244
Total intron length per cds 270026348
Total intron length per exon 291880876
Total intron length per five_prime_utr 11456694
Total intron length per three_prime_utr 10085428
mean gene length 12512
mean mrna length 12075
mean cds length 908
mean exon length 324
mean five_prime_utr length 255
mean three_prime_utr length 891
mean cds piece length 210
mean five_prime_utr piece length 184
mean three_prime_utr piece length 631
mean intron in cds length 2940
mean intron in exon length 2800
mean intron in five_prime_utr length 1940
mean intron in three_prime_utr length 1596
Longest genes 330825
Longest mrnas 330825
Longest cdss 49575
Longest exons 26237
Longest five_prime_utrs 8910
Longest three_prime_utrs 22461
Longest cds piece 26237
Longest five_prime_utr piece 8273
Longest three_prime_utr piece 22461
Longest intron into cds part 189721
Longest intron into exon part 189721
Longest intron into five_prime_utr part 37945
Longest intron into three_prime_utr part 102332
Shortest genes 6
Shortest mrnas 6
Shortest cdss 6
Shortest exons 1
Shortest five_prime_utrs 1
Shortest three_prime_utrs 1
Shortest intron into cds part 5
Shortest intron into exon part 5
Shortest intron into five_prime_utr part 21
Shortest intron into three_prime_utr part 21
You can get some basic information with shell commands. For example, if you wanted the number of genes, you could run grep -c $'\tgene\t' augustus.gff3, of course replacing augustus.gff3 with your actual filename. If you want total exon or intron counts, just change the feature type in your command from gene to exon or intron.
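If you would rather get all the feature counts in one pass, the same idea can be sketched in Python. This is only a minimal sketch: the function name count_feature_types is made up for illustration, and it assumes the AUGUSTUS comments are lines starting with '#' and that feature lines have the usual nine tab-separated GFF3 columns.

```python
from collections import Counter


def count_feature_types(lines):
    """Count how often each feature type (GFF3 column 3) occurs.

    Hypothetical helper, not part of any existing tool.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: comments are lines starting with '#'; skip them and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip anything that is not a regular nine-column feature line.
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts
```

Calling count_feature_types(open("augustus.gff3")) would then give you something you can index by feature type, e.g. counts["gene"], which should match what the grep command reports for well-formed lines.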
However, what you're asking for is a little more involved. It's not terribly complicated, but will require a bit of programming in a language like Python, Ruby, or Perl.
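To give an idea of what that bit of programming might look like, here is a rough Python sketch. It rests on several assumptions you would need to check against your own file: comments start with '#', genes have an ID attribute, transcripts are typed mRNA or transcript and point to their gene through Parent, and exons are explicit exon features pointing to their transcript through Parent. The names parse_attributes and per_gene_table are made up for this example. Introns are not read from the file but derived as exons minus one per transcript, and the table has one row per transcript, which is the same as one row per gene when every gene has a single transcript.

```python
def parse_attributes(column9):
    """Turn 'ID=g1.t1;Parent=g1' into {'ID': 'g1.t1', 'Parent': 'g1'}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def per_gene_table(lines, transcript_types=("mRNA", "transcript"), exon_type="exon"):
    """Return rows of (gene ID, transcript ID, gene length, exons, introns)."""
    gene_length = {}      # gene ID -> length in bp
    transcript_gene = {}  # transcript ID -> gene ID
    exon_count = {}       # transcript ID -> number of exons
    for line in lines:
        # Assumption: comment lines start with '#'; skip them and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        attrs = parse_attributes(fields[8])
        if ftype == "gene" and "ID" in attrs:
            # GFF3 coordinates are 1-based and inclusive.
            gene_length[attrs["ID"]] = end - start + 1
        elif ftype in transcript_types and "ID" in attrs:
            transcript_gene[attrs["ID"]] = attrs.get("Parent")
        elif ftype == exon_type:
            # An exon may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exon_count[parent] = exon_count.get(parent, 0) + 1
    rows = []
    for tid, gid in transcript_gene.items():
        n_exons = exon_count.get(tid, 0)
        # Introns are derived, not read: one fewer than the number of exons.
        rows.append((gid, tid, gene_length.get(gid), n_exons, max(n_exons - 1, 0)))
    return rows
```

You could then loop over per_gene_table(open("augustus.gff3")) and print each row joined by tabs. If your AUGUSTUS output has no exon features (only CDS and intron lines, for instance), passing exon_type="CDS" would count coding segments instead, which is not quite the same thing when UTRs are annotated.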