How can I amend the output of a DIAMOND python script?
12 weeks ago
eam-hmc • 0

Hello, please help a struggling wet-lab biologist. I am trying to identify Clusters of Orthologous Groups (COG) categories for a list of gene sequences.

I configured a COG database and used DIAMOND to search query sequences against this database. This produces a results.cog file that in my case is formatted like so:

AqS1v2_00030    WP_046860143.1  71.4    706 197 1   28  2145    11  711 0.0 1003
AqS1v2_00030    WP_119679565.1  72.1    709 192 2   31  2157    12  714 0.0 999
AqS1v2_00030    WP_059399293.1  70.8    715 202 3   16  2157    7   715 0.0 982

I am using a Python script (included at the end) to analyse this results.cog file; it generates an output file that looks like so:

1.4435847095507564  125  COG0683  | E
1.4435847095507564  125  COG0834  | ET  
1.3973899988451322  121  COG4663  | Q

The first column is the percentage abundance for that particular COG. The second column contains the number of raw hits for that particular COG. The third column contains both the COG ID and the COG category, represented by one or more letters.

So my question is: can I edit the script so that the output file includes the query ID from column 1 of the results.cog file, so that each COG corresponds to a specific gene? That would produce something like this:

1.4435847095507564  125  COG0683  | E    AqS1v2_03414
1.4435847095507564  125  COG0834  | ET   AqS1v2_05420  
1.3973899988451322  121  COG4663  | Q    AqS1v2_02733

The column position of the query ID isn't important, so long as it's in there somewhere.
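For illustration, this is the kind of bookkeeping I mean (a sketch with made-up rows; `queries_per_hit` and the sample lines are invented here, and in the real script one would presumably group by the COG hierarchy string rather than by raw hit ID):

```python
from collections import defaultdict

# stand-ins for tab-separated rows of results.cog (query ID, hit ID, ...)
results_lines = [
    "AqS1v2_00030\tWP_046860143.1\t71.4",
    "AqS1v2_00030\tWP_119679565.1\t72.1",
    "AqS1v2_00031\tWP_046860143.1\t70.8",
]

# collect the query IDs seen for each database hit
queries_per_hit = defaultdict(set)
for line in results_lines:
    query_id, hit_id = line.split("\t")[:2]
    queries_per_hit[hit_id].add(query_id)

# when writing an output row for a hit, append its query IDs as an extra column
for hit_id, queries in queries_per_hit.items():
    print(hit_id + "\t" + ",".join(sorted(queries)))
```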

Here is the Python script I used:

# imports (operator, gzip, and re were imported but unused, so they are dropped)
import sys, time, pickle


def string_find(usage_term):
    # return the command-line argument that follows the given flag
    for idx, elem in enumerate(sys.argv):
        if elem == usage_term and idx + 1 < len(sys.argv):
            return sys.argv[idx + 1]
    return None

t0 = time.perf_counter()

# loading starting file
if "-I" in sys.argv:
    infile_name = string_find("-I")
else:
    sys.exit ("WARNING: infile must be specified using '-I' flag.")

infile = open (infile_name, "r")

# setting up databases
hit_count_db = {}
unique_seq_db = {}
read_id_db = {}
line_counter = 0

# reading through the infile
for line in infile:
    line_counter += 1
    splitline = line.split("\t")
    if line_counter % 1000000 == 0:
        t99 = time.perf_counter()
        print (str(line_counter)[:-6] + "M lines processed so far in " + str(t99-t0) + " seconds.")

    # track unique query IDs
    unique_seq_db[splitline[0]] = 1

    # if a partial outfile was requested, remember the hit for each query
    # (later hits for the same query overwrite earlier ones)
    if "-P" in sys.argv:
        read_id_db[splitline[0]] = splitline[1]

    # tally the number of hits per database entry
    hit_count_db[splitline[1]] = hit_count_db.get(splitline[1], 0) + 1

t1 = time.perf_counter()

# results reporting
print ("\nAnalysis of " + infile_name + " complete.")
print ("Number of total lines: " + str(line_counter))
print ("Number of unique sequences: " + str(len(unique_seq_db)))
print ("Time elapsed: " + str(t1-t0) + " seconds.")

infile.close()

# time to search for these in the reference database
if "-D" in sys.argv:
    db_name = string_find("-D")
else:
    sys.exit("No database file indicated; use the '-D' flag to run the database search step.")

# IO
db = open (db_name, "r")
if "-P" in sys.argv:
    partial_outfile_name = string_find("-P")
    partial_outfile = open(partial_outfile_name, "w")

print ("\nStarting database analysis now.")

t2 = time.perf_counter()

# building a dictionary of the reference database
db_hier_dictionary = {}
db_line_counter = 0
db_error_counter = 0

for line in db:
    if line.startswith(">"):
        db_line_counter += 1
        splitline = line.split(" ")
        new_list = [s.replace(">", "") for s in splitline]
        # ID, the hit returned in DIAMOND results
        db_id = str(new_list[0])

        # name and functional description
        # ("NO COG FOUND" contains spaces, so check the whole line
        # rather than a single space-split token)
        if "NO COG FOUND" in line:
            db_hier = "NO HIERARCHY"
        else:
            hier_split = line.split("|")
            db_hier = hier_split[-2] + " | " + hier_split[-1].strip()

        # add to dictionaries
        db_hier_dictionary[db_id] = db_hier

        # line counter to show progress
        if db_line_counter % 1000000 == 0:                          # each million
            t95 = time.perf_counter()
            print (str(db_line_counter) + " lines processed so far in " + str(t95-t2) + " seconds.")

t3 = time.perf_counter()

print ("\nSuccess!")
print ("Time elapsed: " + str(t3-t2) + " seconds.")
print ("Number of lines: " + str(db_line_counter))
print ("Number of errors: " + str(db_error_counter))

# printing out the partial outfile
if "-P" in sys.argv:
    for entry in read_id_db.keys():
        # fall back gracefully if a hit is missing from the reference headers
        hier = db_hier_dictionary.get(read_id_db[entry], "NO HIERARCHY")
        partial_outfile.write(entry + "\t" + read_id_db[entry] + "\t" + hier + "\n")

# condensing down the identical matches
condensed_hit_db = {}

# debugging aid: dump the hierarchy dictionary for later inspection
a_file = open("data.pkl", "wb")
pickle.dump(db_hier_dictionary, a_file)
a_file.close()

for entry in hit_count_db.keys():
    # a hit missing from the reference headers counts as an error
    if entry not in db_hier_dictionary:
        db_error_counter += 1
    org = db_hier_dictionary.get(entry, "NO HIERARCHY")
    condensed_hit_db[org] = condensed_hit_db.get(org, 0) + hit_count_db[entry]

# dictionary output and summary
print ("\nDictionary database assembled.")
print ("Time elapsed: " + str(t3-t2) + " seconds.")
print ("Number of errors: " + str(db_error_counter))

print ("\nTop ten hierarchy matches:")
for k, v in sorted(condensed_hit_db.items(), key=lambda kv: -kv[1])[:10]:
    print (str(v) + "\t" + k)

# creating the outfiles
if "-O" in sys.argv:
    outfile_name = string_find("-O")
else:
    outfile_name = infile_name[:-44] + ".hierarchy"

outfile = open (outfile_name, "w")

# writing the output
error_counter = 0
for k, v in sorted(condensed_hit_db.items(), key=lambda kv: -kv[1]):
    q = v * 100 / float(line_counter)
    outfile.write (str(q) + "\t" + str(v) + "\t" + k + "\n")

print ("\nAnnotations saved to file: '" + outfile_name + "'.")
print ("Number of errors: " + str(error_counter))

db.close()
outfile.close()

Any help would be much appreciated! I've been battling all evening.

Python Diamond COG_analysis

Ooopppssss, it's a huge script there.

Why don't you try eggNOG-mapper for COG mapping? It also uses alignment in the backend and is specifically designed for this purpose.

Additionally, if you are using BLAST you can add different columns such as the query ID, percent identity, etc. directly in the command. This blog might help.
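DIAMOND supports the same idea: with `--outfmt 6` you can list the tabular fields yourself, and `qseqid` puts the query ID in the output. A sketch of the command (file names are placeholders):

```shell
# placeholder file names; 'qseqid' is the query ID column
diamond blastp \
    --query genes.faa \
    --db cog_db.dmnd \
    --out results.cog \
    --outfmt 6 qseqid sseqid pident length evalue bitscore
```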

Regards,

Nitin N.


wow, that was 1 million times easier than what I was attempting! Thanks Nitin.
