How to extract summary statistics from GFF3 /GTF file?
1
0
Entering edit mode
11 months ago
BioinfoBee • 0

Hello,

Curious if anyone is aware of a tool that can generate a summary statistic of annotated genes from gff3/gtf file. I found agat_sp_statistics.pl very useful but it seems to generate overinflated value for gene length, specifically max gene/mRNA length (for. e.g., > 200,000bp). Kindly suggest!

Regards,
B

gff3 gene-annotation summary-statistics • 1.6k views
ADD COMMENT
1
Entering edit mode

Does not sound crazy to me, the longest human gene is 2 300 000 bp (dystrophin) You can try the basic statistics script (_sq_) to be sure there is no wrong modification made by AGAT with the script you mentioned (_sp_)

ADD REPLY
0
Entering edit mode

You can also easily check that using an awk command

ADD REPLY
0
Entering edit mode

I am going to tag developer of AGAT: Juke34

ADD REPLY
0
Entering edit mode

GenoMax Thanks. Juke34 . any suggestion on why agat_sp_statistics.pl is producing a inflated maximum gene length number?

ADD REPLY
1
Entering edit mode
11 months ago
Sasha ▴ 830

Hi!

You could try using the gffutils Python library as an alternative to the AGAT toolkit for extracting summary statistics from GFF3/GTF files. gffutils is a flexible and efficient library for working with GFF and GTF files in a variety of formats.

Here's an example of how to use gffutils to extract summary statistics from a GFF3 file:

import gffutils

# Create a database from the GFF3 file
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)

# Extract summary statistics
gene_count = db.count_features_of_type("gene")
exon_count = db.count_features_of_type("exon")
mRNA_count = db.count_features_of_type("mRNA")

print("Number of genes:", gene_count)
print("Number of exons:", exon_count)
print("Number of mRNAs:", mRNA_count)

This code snippet creates a database from the input GFF3 file, counts the number of genes, exons, and mRNAs, and prints the results. You can modify this example to extract other summary statistics as needed.

To calculate gene_lengths maybe try:

# Create a database from the GFF3/GTF file
db = gffutils.create_db("input.gff3", dbfn="input.db", force=True, keep_order=True, merge_strategy="merge", sort_attribute_values=True)

# Calculate gene lengths
gene_lengths = {}
for gene in db.features_of_type("gene"):
    gene_length = 0
    for exon in db.children(gene, featuretype="exon"):
        exon_length = exon.end - exon.start + 1
        gene_length += exon_length
    gene_lengths[gene.id] = gene_length

# Print gene lengths
for gene_id, length in gene_lengths.items():
    print(f"{gene_id}: {length}")

DISCLAIMER: I'm using my chatbot here (https://tinybio.cloud) to help generate this answer. This answer has not been tested and may be incorrect. You can download it from the website.

ADD COMMENT
0
Entering edit mode

@sasha While it is appreciated that you are tagging your answers as generated by your chatbot, are these answers tested and proven to be correct?

ADD REPLY
0
Entering edit mode

hi!

No, unfortunately; they are not as this would be difficult to totally ascertain per question. I'm doing a bit of manual curation here to see what makes sense and what does not. If you would like for me to tag the responses differently or hold off until a question has been active for longer than a couple of days - let me know what you think is best. Obviously want to keep the quality of the answers high.

ADD REPLY
0
Entering edit mode

In that case consider adding a disclaimer in your last line (that has info about the chatbot) that the answer is not tested and that you would appreciate receiving feedback if the answer works. Based on the feedback if the answer does not work then consider deleting it completely or moving it to a comment.

ADD REPLY
1
Entering edit mode

Will do this going forward and will monitor previous threads as well.

ADD REPLY

Login before adding your answer.

Traffic: 1431 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6