COG database merger python code update (part-2)
0
0
Entering edit mode
3.8 years ago
MSRS ▴ 590

Thank you very much for your patience. Thank you very much Ram. I want to merge GO data into fasta file for the DIAMOND database preparation. There is an existing code that works on the COG_2014 database. where the running code is ./merger.py prot2003-2014.fa cog2003-2014.csv cognames2003-2014.tab like that. Here the input files look

  1. prot2003-2014.fa
    >gi|103485500|ref|YP_615061.1| glutathione S-transferase-like protein [Sphingopyxis alaskensis RB2256]
MKLFIGNKAYSSWSLRGWLAARHSGLPFEEVTVPLYNEEWNQRREGDEFAPSGGKVPILWDGADIVVWDSLAIIDYLNEK
TGGTRGYWPDDMAARAMARSMAAEMHSSFAALRREHSMNIRRIYPAAELTPEVQADVIRILQIWAEARARFGGEGDYLFG
DWSAADMMFAPVVTRFITYSIPLPRFALPYAQAVISHPHMQEWIGGAQAEDWVIEKFEGPVEG
>gi|103485503|ref|YP_615064.1| hypothetical protein Sala_0005 [Sphingopyxis alaskensis RB2256]
MIARLLILIARAWQLGPSRVLPPTCRYAPSCSEYAIVALRRHGAIKGGWIATKRLLRCHPWGGHGYDPVP
>gi|103485501|ref|YP_615062.1| ribosome biogenesis GTP-binding protein YsxC [Sphingopyxis alaskensis RB2256]
MSEIELEPGADPERAERARKLFSGPIAFLKSAPALQHLPVPSVPEIAFAGRSNVGKSSLLNALTNRNGLARTSVTPGRTQ
ELNYFDVGEPPVFRLVDMPGYGFAKAPKDVVRKWRFLINDYLRGRQVLKRTLVLIDSRHGIKDVDRDVLEMLDTAAVSYR
LVLTKADKIKASALADVHAATEAEARKHPAAHPEVIATSSEKGMGIAELRTAVLEAVEL
  
  1. cog2003-2014.csv
domain-id genome-name protein-id  protein-length  domain-start    domain-end  COG-id  membership-class
103485501 Sphingopyxis_alaskensis_RB2256_uid58351 103485501   219 1   219 COG0218 0
103485500 Sphingopyxis_alaskensis_RB2256_uid58351 103485500   223 1   223 COG0625 0
103485503 Sphingopyxis_alaskensis_RB2256_uid58351 103485503   70  1   70  COG0759 0
  
  1. cognames2003-2014.tab
COG   func    name
COG0625   O   Glutathione S-transferase
COG0218   D   GTP-binding protein EngB required for normal cell division
COG0759   M   Membrane-anchored protein YidD, putatitve component of membrane protein insertase Oxa1/YidC/SpoIIIJ
  

for that existing python code is

    #!/usr/bin/env python

import sys, time

prot_file = open(sys.argv[1], "r")          # prot2003-2014.fa (note: unzipped)
cog_file = open(sys.argv[2], "r")           # cog2003-2014.csv
cog_trans_file = open(sys.argv[3], "r")     # cognames2003-2014.tab
outfile = open("merged_cogs.fa", "w")

# Building the dictionary to compare COGs to the GI ID
cog_db = {}
t0 = time.clock()

for line in cog_file:
    split = line.split(",")
    cog_db[split[0]] = split[6]     # GI ID == COG ID

cog_file.close()
t1 = time.clock()

print "Cog file read.  Time elapsed: " + str(t1-t0) + " seconds."

# Building the dictionary to add the function code to the COGs
trans_cog_db = {}

for line in cog_trans_file:
    trans_cog_db[line.split("\t")[0]] = line.split("\t")[1]
cog_trans_file.close()

# Using the dictionary to write to the outfile the protein sequences with COG info
error_count = 0

for line in prot_file:
    if line[0] == ">":
        gi_id = line.strip().split("|")[1]
        try:
            outfile.write(line.strip() + " | " + cog_db[gi_id] + " | " + trans_cog_db[cog_db[gi_id]] + "\n")
        except KeyError:
            error_count += 1
            outfile.write(line.strip() + " | NO COG FOUND | NA\n")
            continue
    else:
        outfile.write(line)

t2 = time.clock()
prot_file.close()
outfile.close()

print "Protein file analyzed.  Time elapsed: " + str(t2-t1) + " seconds."
print "Number of sequences without a COG: " + str(error_count)

The output file looks like

>gi|103485500|ref|YP_615061.1| glutathione S-transferase-like protein [Sphingopyxis alaskensis RB2256] | COG0625 | O
MKLFIGNKAYSSWSLRGWLAARHSGLPFEEVTVPLYNEEWNQRREGDEFAPSGGKVPILWDGADIVVWDSLAIIDYLNEK
TGGTRGYWPDDMAARAMARSMAAEMHSSFAALRREHSMNIRRIYPAAELTPEVQADVIRILQIWAEARARFGGEGDYLFG
DWSAADMMFAPVVTRFITYSIPLPRFALPYAQAVISHPHMQEWIGGAQAEDWVIEKFEGPVEG
>gi|103485503|ref|YP_615064.1| hypothetical protein Sala_0005 [Sphingopyxis alaskensis RB2256] | COG0759 | M
MIARLLILIARAWQLGPSRVLPPTCRYAPSCSEYAIVALRRHGAIKGGWIATKRLLRCHPWGGHGYDPVP
>gi|103485501|ref|YP_615062.1| ribosome biogenesis GTP-binding protein YsxC [Sphingopyxis alaskensis RB2256] | COG0218 | D
MSEIELEPGADPERAERARKLFSGPIAFLKSAPALQHLPVPSVPEIAFAGRSNVGKSSLLNALTNRNGLARTSVTPGRTQ
ELNYFDVGEPPVFRLVDMPGYGFAKAPKDVVRKWRFLINDYLRGRQVLKRTLVLIDSRHGIKDVDRDVLEMLDTAAVSYR
LVLTKADKIKASALADVHAATEAEARKHPAAHPEVIATSSEKGMGIAELRTAVLEAVEL

Now needs code modification for COG_2020

Here

  1. cog-20.fa
>AAM01497_1 Glutamate-1-semialdehyde aminotransferase [Methanopyrus kandleri AV19]
MGYEDEFPESLELFKRAERVMPGGVSSPVRRFDPYPFYVERAEGSRLYTVDGHVLIDYCLAFGPLILGHAHPEVVEAVVERVREGFHYGTPTLPELKLAEKVVELVPNVEKVRLVNTGTEATMSAIRLARAYTGREKIVKFEGCYHGAHDAVLVRAGSGASELGAPDSPGIPESVAENTLVCPFNDVEAFVETVERFDEEIGAVIVEPVLGNAGCVPPDEEFLKVLREYCDGTERLLIFDEVITGFRLELGGAQEYYGIDADLVCLGKILGGGLPIGAFGGPEEYMSRVAPEGKVYQAGTFNGNPVSATAGLVTLEVLERERPYDELSSKAERLASALEDGLEDRGIEGVVNRVESMFQVYFGIEEVRDYADVNSADHDAFKRFHRELLEHGVWIAASNYEAWFLSIAHTETDLERTEEAFEEALDRLTG
>AAM04025_1 glutamate-1-semialdehyde 2,1-aminomutase [Methanosarcina acetivorans C2A]
MVSEVTLDKSRQMYEKAKTLIPGGVSSPVRAIKPYPFYTASADGSKIRDLDGNEYIDYCLAYGPAVLGHNHPVIKAAIKEQLDKGWLYGTPTELEVTLAEKVAGYYPSIDMLRFVSTGTEATMSALRLARGFTRKNKFIKIEGGFHGAHDAVLVKAGSGATTLGEPDSLGIPADFTKYTLQAPYNDIETMTTLVEKNRDDLAAVIIEPVLGNIGPILPLPGYLKELRKLTKENDVLLIFDEVITGFRLAMGGAQEYFGVVPDMTTLGKIVGGGLPIGVFGGRREIMEMIAPSGAVYQAGTFSGSPCSVAAGIAVLDYLKKEDIHAKLNSTGDYMRAVVSEIVEDEGLDYTVCGIASMFKIFFGAEPHNYQEALKCDKEGYLSFFHRMLANGVFLPPSQFETNFISAAHSEEDIEKTLEAYVENL
>ABK77793_1 glutamate-1-semialdehyde aminotransferase [Cenarchaeum symbiosum A]
MDLEREYRAKTGGSARIFARSKKYHVGGVSHNIRFYEPYPFVTRSASGKHLVDVDGNKYVDYWMGHWSLILGHAPAPVRSAVEGQLRRGWIHGTVNEQTMNLSEIIRGAVSVAEKTRYVTSGTEAVMYAARLARAHTGRKIIAKADGGWHGYASGLLKSVNWPYDVPESGGLVDEEHSISIPYNDLEGSLDVLGRAGDDLACVIIEPLLGGGGCIPADEDYLRGIQEFVHSRGALLVLDEIVTGFRFRFGCAYAAAGLDPDIVALGKIVGGGFPIGVICGKDEVMEISNTISHAKSDRAYIGGGTFSANPATMTAGAAALGELKKRKGTIYPRINSMGDDARDKLSKIFGNRVSVTGRGSLFMTHFVQDGAGRVSNAADAAACDVELLHRYHLDMITRDGIFFLPGKLGAISAAHSKADLKTMYSASERFAEGL
  
  1. cog-20.cog.csv
Gene_ID   NCBI_Assembly_ID    Protein_ID  Protein_length  COGfootprint    Lengthfrootprint    COG_ID  reserved    COG_membership_class    PSI-BLAST_bit_score PSI-BLAST_e_value   COG_length  Protein_footprint_COG
CENSYa_1168   GCA_000200715.1 ABK77793.1  434 1-434   434 COG0001 COG0001 0   342 1.86E-114   432 1-429
MK0280    GCA_000007185.1 AAM01497.1  430 1-430   430 COG0001 COG0001 0   570 1E-200  432 3-431
MA_0581   GCA_000007345.1 AAM04025.1  424 1-424   424 COG0001 COG0001 0   574 1E-200  432 3-426
  
  1. cog-20.def.tab

     COG  func    name Gene_associated Functional_pathway_COG PubMed_ID PDB_ID
    COG0001   H   Glutamate-1-semialdehyde aminotransferase   HemL    Heme biosynthesis       2CFB
      

Want the output like

>AAM01497_1 Glutamate-1-semialdehyde aminotransferase [Methanopyrus kandleri AV19] | COG0001 | H
MGYEDEFPESLELFKRAERVMPGGVSSPVRRFDPYPFYVERAEGSRLYTVDGHVLIDYCLAFGPLILGHAHPEVVEAVVERVREGFHYGTPTLPELKLAEKVVELVPNVEKVRLVNTGTEATMSAIRLARAYTGREKIVKFEGCYHGAHDAVLVRAGSGASELGAPDSPGIPESVAENTLVCPFNDVEAFVETVERFDEEIGAVIVEPVLGNAGCVPPDEEFLKVLREYCDGTERLLIFDEVITGFRLELGGAQEYYGIDADLVCLGKILGGGLPIGAFGGPEEYMSRVAPEGKVYQAGTFNGNPVSATAGLVTLEVLERERPYDELSSKAERLASALEDGLEDRGIEGVVNRVESMFQVYFGIEEVRDYADVNSADHDAFKRFHRELLEHGVWIAASNYEAWFLSIAHTETDLERTEEAFEEALDRLTG
>AAM04025_1 glutamate-1-semialdehyde 2,1-aminomutase [Methanosarcina acetivorans C2A] | COG0001 | H
MVSEVTLDKSRQMYEKAKTLIPGGVSSPVRAIKPYPFYTASADGSKIRDLDGNEYIDYCLAYGPAVLGHNHPVIKAAIKEQLDKGWLYGTPTELEVTLAEKVAGYYPSIDMLRFVSTGTEATMSALRLARGFTRKNKFIKIEGGFHGAHDAVLVKAGSGATTLGEPDSLGIPADFTKYTLQAPYNDIETMTTLVEKNRDDLAAVIIEPVLGNIGPILPLPGYLKELRKLTKENDVLLIFDEVITGFRLAMGGAQEYFGVVPDMTTLGKIVGGGLPIGVFGGRREIMEMIAPSGAVYQAGTFSGSPCSVAAGIAVLDYLKKEDIHAKLNSTGDYMRAVVSEIVEDEGLDYTVCGIASMFKIFFGAEPHNYQEALKCDKEGYLSFFHRMLANGVFLPPSQFETNFISAAHSEEDIEKTLEAYVENL
>ABK77793_1 glutamate-1-semialdehyde aminotransferase [Cenarchaeum symbiosum A] | COG0001 | H
MDLEREYRAKTGGSARIFARSKKYHVGGVSHNIRFYEPYPFVTRSASGKHLVDVDGNKYVDYWMGHWSLILGHAPAPVRSAVEGQLRRGWIHGTVNEQTMNLSEIIRGAVSVAEKTRYVTSGTEAVMYAARLARAHTGRKIIAKADGGWHGYASGLLKSVNWPYDVPESGGLVDEEHSISIPYNDLEGSLDVLGRAGDDLACVIIEPLLGGGGCIPADEDYLRGIQEFVHSRGALLVLDEIVTGFRFRFGCAYAAAGLDPDIVALGKIVGGGFPIGVICGKDEVMEISNTISHAKSDRAYIGGGTFSANPATMTAGAAALGELKKRKGTIYPRINSMGDDARDKLSKIFGNRVSVTGRGSLFMTHFVQDGAGRVSNAADAAACDVELLHRYHLDMITRDGIFFLPGKLGAISAAHSKADLKTMYSASERFAEGL

Here, the Protein_Id is the identifier of fa file to CSV file and from CSV file to .tab file GO id is the identifier. Thank you very much.SR

python COG database • 1.4k views
ADD COMMENT
1
Entering edit mode

What have you tried? You'd need to change a few lines in the python script based on which IDs map to the other inputs. Maybe try switching to pandas as you're dealing with a couple of tabular files?

ADD REPLY
0
Entering edit mode

Thank you very much Ram. I want to merge GO data into fasta file for the DIAMOND database preparation. Thank you.

ADD REPLY
1
Entering edit mode

The only difference I see is the change from the gi to the AA* identifier between the two FASTA files. Change the gi_id based indexing logic in the loop a little and test it. This is a good learning exercise, you should be able to solve this on your own.

ADD REPLY
0
Entering edit mode

Thank you very much Ram. There are a few differences like from cog-20.fa file has the value like AAM01497_1 but in CSV file contains the same character in the third column like that AAM01497.1. cog_db[split[3]] = split[7]. I am stack and confused being a very beginning learner of python.

ADD REPLY
1
Entering edit mode

Use a lot of print statements to understand what is going on. I am confident that you can figure this out.

ADD REPLY
0
Entering edit mode

I am really sorry, I can not figure out the

if line[0] == ">":
        gi_id = line.strip().split("|")[1] part.

Need expertise assistance.

ADD REPLY
1
Entering edit mode

That statement picks the gi identifier from the FASTA header.

if line[0] == ">":    #if the first character in the line is a ">" .i.e the line is a header line in FASTA
    gi_id = line.strip().split("|")[1]    # the gi_id is obtained by 
                                                     # 1. removing all white space characters from both ends of the line (strip())
                                                     # 2. splitting the line into multiple chunks using "|" as the delimiter (split("))
                                                     # 3. picking the second chunk from the above split ([1])

There is no gi in the new FASTA, but you can repurpose the code and logic to pick a different ID instead.

ADD REPLY
0
Entering edit mode

Thank you so much. Beg pardon if any mistake.

ADD REPLY

Login before adding your answer.

Traffic: 1583 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6