Question

gff summary file generating script

0

Entering edit mode

3.6 years ago

3335098459 ▴ 30

Hello,

I have little experience in python programming and I have just beginning to learn this scripting language. I have a script and a bunch of .gff genome files. I am using this script to summarize the information in my gff files.

#!/usr/bin/env python2

from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio.Seq import translate
from Bio.Seq import reverse_complement
import os
import string
import random
import sys

''' given all the gff files, summarise them to create a big CSV file with all the details of these genomes the details:
1. Number of CDSs
2. Number of pseudogene
3. Number of other elements in GFF files
4. Genome size After summary -> add ST, pathotype and metadata and see how stratifying the data changes anything and will give a general description of the pan-genome'''


def read_gff(gff_file):     categories = ["hypothetical protein", "transposase",
        "pseudogene", "conjuga", "phage", "fimbrial", "plasmid",
         "crispr", "resistance", "virulence", "secretion system"]   counts = {}


for cat in categories:
    counts[cat] = 0
try:
    f = open(gff_file)
except IOError:
    print("Could not read file:", gff_file)
    return counts

for line in f:
    line = line.lower()
    if line.startswith("##fasta"):
        break
    if line.startswith("#"):
        continue
    toks = line.strip().split()
    product = toks[2]
    if product not in counts:
        counts[product] = 0
    counts[product] += 1
    for cat in categories:
        if cat in line:
            counts[cat] += 1

f.close()

return counts

header = ["cds", "trna", "hypothetical protein", "transposase",
    "pseudogene", "conjuga", "phage", "fimbrial", "plasmid",
     "crispr", "resistance", "virulence", "secretion system"]

out = open("gff_summary.csv", "w")
out.write("ID, file_name," + ",".join(header) + "\n")

cnt = 0
with open(sys.argv[1]) as f:
    for line in f:
        toks = line.strip().split("\t")
        if line.startswith("ID"):
            annot_loc_index = toks.index("Annotation_Location")
            continue
        ID = toks[0]
        files = toks[annot_loc_index].split(",")
        for f1 in files:
            print(f1)
            counts = read_gff(f1)
            out.write(ID + "," + f1)
            for cat in header:
                out.write("," + str(counts[cat]))
            out.write("\n")
        cnt += 1


out.close()

I have run this script with the following command. I have tried with both wild card and single. but I didnot worked.

root@h:/home/fan/monas/script/gff_summaries# python summarise_gffs.py /home/fuan/monas/gff_combine/*.gff

But the following error keeps coming up.

Traceback (most recent call last):
  File "summarise_gffs.py", line 71, in <module>
    files = toks[annot_loc_index].split(",")
NameError: name 'annot_loc_index' is not defined

Somebody please help me in understanding this error. So that I can resolve this. Thanks

Assembly genome gene • 625 views

ADD COMMENT • link 3.6 years ago by 3335098459 ▴ 30