Why does my script to update a gffutils database take too long?
0
0
Entering edit mode
17 months ago

Hello,

I need to update a gtf file that was made by cuffcompare. Cuffcompare outputs exons only, but I need a transcript line for downstream processing. So I searched about and found that gffutils would automatically infer the transcript when entering it in the database, then you can just write it out from the database - so far so good; this bit works and finishes rapidly.

However, the transcript line also needs to report some 'attributes' information from the exons (namely, 'class_code', 'nearest_ref' and 'mutations' - I added the latter to the gtf). So I wrote the generatorModifyTranscripts() function below, to use gffutils.FeatureDB.update() to add these attributes from the exons to their parent transcript. The script works with very small gtf files, even when they include my custom 'mutations' attribute, so that doesn't seem to be the problem (i.e. cut down to less that 1000 lines). But when I input a more normal size gtf, even only ~100,000 lines, it takes forever (possibly literally, it's been running for hours). Since I have used 'merge_strategy="replace"' and the id (primary key) of the feature seems unchanged when I run the tiny files, I can't think what could be taking so long in updating to database.

I am new to Python, I usually use C or Java, so sorry if I've stuffed up something obvious! (But also I kinda hope so.) Hope you can help me, thanks! :)

import gffutils
def generatorModifyTranscripts(db):
    for featureFamily in db.iter_by_parent_childs(featuretype='transcript'):
        for exon in featureFamily[1:]:
            for exon_attr in exon.attributes:
                if exon_attr == "class_code" or exon_attr == "nearest_ref" and exon_attr not in featureFamily[0].attributes:
                    featureFamily[0][exon_attr] = exon[exon_attr]
                if exon_attr == "mutations":
                    if exon_attr not in featureFamily[0].attributes:
                        featureFamily[0][exon_attr] = exon[exon_attr]
                    else:
                        mutation_string = featureFamily[0]["mutations"][0] + "." + exon["mutations"][0]
                        featureFamily[0]["mutations"] = [mutation_string]
        yield featureFamily[0]

db = gffutils.create_db('input.gtf', 'gtf.db', force=True, disable_infer_genes=True)
db.update(generatorModifyTranscripts(db),  disable_infer_transcripts=True,disable_infer_genes=True,merge_strategy="replace")
with open('output.gtf', 'w') as fout:
    for feature in db.all_features():
        fout.write(str(feature) + '\n')
gffutils gtf • 315 views
ADD COMMENT

Login before adding your answer.

Traffic: 1761 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6