Question: How to extract sequence ID and COG identifiers from a file?
0
gravatar for Anuj Tyagi
2.5 years ago by
Anuj Tyagi0
Anuj Tyagi0 wrote:

Hi, I have a text file containing sequence IDs and COG identifiers in following format:

ID=ANJKHBIP_00005;eC_number=6.3.4.2;Name=pyrG_1;dbxref=COG:COG0504;gene=pyrG_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0A7E5;locus_tag=ANJKHBIP_00005;product=CTP synthase

ID=ANJKHBIP_00037;eC_number=1.2.1.5;Name=puuC_1;dbxref=COG:COG1012;gene=puuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P23883;locus_tag=ANJKHBIP_00037;product=NADP/NAD-dependent aldehyde dehydrogenase PuuC

ID=ANJKHBIP_00038;eC_number=3.6.3.34;Name=fhuC_1;dbxref=COG:COG1120;gene=fhuC_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P07821;locus_tag=ANJKHBIP_00038;product=Iron(3+)-hydroxamate import ATP-binding protein FhuC

ID=ANJKHBIP_00043;Name=rpsJ_1;dbxref=COG:COG0051;gene=rpsJ_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:Q6N4T6;locus_tag=ANJKHBIP_00043;product=30S ribosomal protein S10

From this file I want to extract sequence IDs and COGs in tab separated format something like below:

ANJKHBIP_00005   COG0504

ANJKHBIP_00037   COG1012

and so on...........

I will highly appreciate if anyone can please suggest me a simple code to do this. I tried to use grep and cut options but could not get the things in proper format required above.

next-gen annotation • 1.3k views
ADD COMMENTlink modified 2.5 years ago by Bastien Hervé4.9k • written 2.5 years ago by Anuj Tyagi0

Did you try something in Perl or Python ?

ADD REPLYlink written 2.5 years ago by Bastien Hervé4.9k
3
gravatar for Pierre Lindenbaum
2.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum133k wrote:
 tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | paste - -
ADD COMMENTlink written 2.5 years ago by Pierre Lindenbaum133k
1

I don't think OP wants to keep COG: information

tr ";" "\n" < input.txt | grep -E '^(ID|dbxref=COG:)' | cut -d '=' -f 2 | cut -d ':' -f 2 | paste - -
ADD REPLYlink written 2.5 years ago by Bastien Hervé4.9k

Thanks a lot. it was keeping COG: information earlier. Now it is in exactly my required format.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Anuj Tyagi0
3
gravatar for Bastien Hervé
2.5 years ago by
Bastien Hervé4.9k
Karolinska Institutet, Sweden
Bastien Hervé4.9k wrote:

I think you can do this trick with a single Unix command line but as you are actually learning (I guess) how to parse a text file, it's more interesting to understand the step by step process to parse any file at your convinience.

This is a python script.

#Make your new file ready
new_file = open("your_new_file.csv", "a")
#Open txt file
with open("your_file.txt", 'r') as cog_file:
    #For each line of this file
    for line in cog_file:
        #If the line start with ID
        if line.startswith('ID'):
            #Split the line on ; and put the result in an array
            items = line.split(";")
            #For each item in array
            for i in items:
                #You can get a key/value splitting on =
                key = i.split("=")[0]
                value = i.split("=")[1]
                #If the key is ID
                if key == "ID":
                    #Write it in your new file
                    new_file.write(value)
                #If the key is dbxref
                if key == "dbxref":
                    new_file.write("\t")
                    #Split the COG term on : to remove COG:
                    just_cog_code = value.split(":")[1]
                    #Write it in your new file
                    new_file.write(just_cog_code)
            #Line back
            new_file.write("\n")
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Bastien Hervé4.9k

Hi Bastien, thanks for this script. Yes, I have been learning python but still not very familiar with file parsing. Interesting!

ADD REPLYlink written 2.5 years ago by Anuj Tyagi0
1
gravatar for cpad0112
2.5 years ago by
cpad011214k
Hyderabad India
cpad011214k wrote:

Assuming that gene is always after cog id:

$ grep -Po -i '(?<=ID=).*(?=;gene)' ids.text | sed 's/;.*:/\t/g'
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051

with sed:

$ sed 's/ID=\(\w\+\).*\(COG[0-9]\+\).*/\1\t\2/g' ids.text 
ANJKHBIP_00005  COG0504
ANJKHBIP_00037  COG1012
ANJKHBIP_00038  COG1120
ANJKHBIP_00043  COG0051
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by cpad011214k

Thanks. yes it does the job.

ADD REPLYlink written 2.5 years ago by Anuj Tyagi0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2370 users visited in the last hour
_