Question

Change headers of hundreds of FASTA files

2

2.2 years ago

Oscar ▴ 30

I'm having some problems changing the headers of hundreds of FASTA files. Each FASTA file holds one gene's sequence for several species, but for some species the header is different for each gene, for example:

>EOG7B0H7N|Ectop_sp|C227106_a_3_0_l_281|.|.|Zoractes_sp
>EOG7B0H7P|Ectop_sp|s6255_L_9215_0_a_59_2_l_2007|.|.|Zoractes_sp
>EOG7B0H87|Ectop_sp|C242868_a_14_0_l_390|.|.|Zoractes_sp
>EOG7B3CGS|Ectop_sp|C272142_a_50_0_l_1449|.|.|Zootermopsis_sp
>EOG7B67Q7|Ectop_sp|C265168_a_16_0_l_886|.|.|Zoractes_sp

The structure of the FASTA file for the first gene is something like:

>sp1
>sp2
>sp3
>EOG7B67Q7|Ectop_sp|C265168_a_16_0_l_886|.|.|Zoractes_sp 
>sp4

I want to rename the header for this species so that it contains only the species name, for example Ectop_sp:

>sp1 
>sp2 
>sp3 
>Ectop_sp
>sp4

Thanks for the help.

fasta header • 1.2k views

ADD COMMENT • link updated 2.2 years ago by caleb_dume ▴ 60 • written 2.2 years ago by Oscar ▴ 30

0

To summarize: for every description line in your FASTA files that contains Ectop_sp somewhere, you'd like the entire description line reduced to just the following?

>Ectop_sp

Are your FASTA files alone in a directory? What extension do they have? Do their names all end the same way, so that you can easily iterate over them for the replacement without touching any files you'd want to leave alone?

ADD REPLY • link 2.2 years ago by Wayne ★ 2.0k

0

1. Yes, the .fasta files are in a directory called "header".
2. The header structure is the same, but every FASTA file contains different species whose headers have the same structure as the one for Ectop_sp:

>EOG7B0H7N|Lach_sp|C365643_4_0|.|.|Pediculus_humanus
AKVPPLCAANYPMEPWRCFFGYRIYPFLSAPLFVFQWLFDEAQMTADNVGTPVTKHQWDYIHKMGDSLRNSFQNVSAVFAPSCIAHCVLTMKEWHSVKINDVSLPEAMRCWELSMDPPERSLQDLYLLTHPNSSSSPTTMLADAPLLTGSADVYKTVDELTRKSLEKRRRRKHHKRKLKNEQRGQGRKRRKGQGKKNKNR

>EOG7B0H7N|Eliop_mexicanus|C139724_a_4_0_l_346|.|.|Pediculus_humanus
GMADSGWFLDREPYSVDQHAPLAADAIMLGVPLWHGKVPTLCAAHYPFEPWRCFFGYRIYPFLTAPLFVFQWLFDEAQMTADNVGTPVTKHQWDYIHRMGDSLRNSFQNVSAVFA

>EOG7B0H7N|Ectop_sp|C227106_a_3_0_l_281|.|.|Pediculus_humanus
EKSKEMIKQHLSKRSITCNDGSPSGFYHRPSDGSNRWIVFLEGGWYCYDQQSCHDRWMNQRHLMSSKLWPPV

>EOG7B0H7N|Val_badio|C46245_a_3_0_l_554|.|.|Pediculus_humanus
HHIGRCSWPQCNPSCPKLHNPYTGEEMDFIDLLKSFGLDMESVANALGVDIITLNNMEHEELLKMLTQ

I am using a .py script that I found somewhere. It works for one of my files, but I don't know how to use it with all 2160 genes I have:

#!/usr/bin/env python3

# Open the input FASTA file for reading.
fasta_handle = open("EOG7B0H7K.aa.summarized.fa", "r")
# Open the output file that will receive the modified FASTA.
output_handle = open("EOG7B0H7K.fasta", "w")

# For each line of the file
for line in fasta_handle:
    # A FASTA header (description line) starts with ">".
    if line.startswith(">"):
        # Split the header on "|", the separator used in these headers.
        header_split = line.strip().split("|")
        print(header_split)

        # Keep only the second field (the species name), prefixed with ">".
        # A header without "|" (e.g. ">sp1") has no second field and is kept as is.
        if len(header_split) > 1:
            new_header = ">" + header_split[1]
        else:
            new_header = header_split[0]
        print(new_header)

        # Write the new header, followed by a newline.
        output_handle.write(new_header + '\n')
    # Sequence lines are written without modification.
    else:
        output_handle.write(line)

# Close both files
fasta_handle.close()
output_handle.close()

ADD REPLY • link updated 2.2 years ago by finswimmer 16k • written 2.2 years ago by Oscar ▴ 30

1

The reason I asked for clarification about the first part is that I think you can do it with a regular expression, which is essentially a fancy find-and-replace: look for any line that begins with >, followed by any number of characters, then Ectop_sp, then any number of further characters, and replace the whole line. It looks like you have already solved that part.

I've been meaning to try sd, which is supposed to be easier to use and faster than sed for this sort of thing, and faster than Python. For a task like this, though, speed usually isn't critical: under a minute vs. 20 minutes isn't a big difference if you are only doing it once. (By the way, I was going to point you to a temporary MyBinder session where you'd be able to run sd, since I didn't want to make you install it on your machine. You would have needed to archive/compress your directory, upload it to the session, and then download your results afterwards.)
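The find-and-replace described above could look roughly like this in Python. This is only a minimal sketch: the helper name simplify_header is made up for illustration, and it assumes that any description line containing the species label should be reduced to that label alone. Note that it matches the label as a plain substring, so a label such as Ectop_sp2 would also match.

```python
import re


def simplify_header(text, species="Ectop_sp"):
    """Reduce every description line that contains `species` to '>species'.

    `simplify_header` is a hypothetical helper name; "Ectop_sp" is the
    species label used in the question.
    """
    # ^> anchors at the start of a line, .* runs up to the end of that line
    # (the dot does not match a newline), and MULTILINE makes ^ and $ work
    # line by line instead of on the whole text.
    pattern = re.compile(r"^>.*" + re.escape(species) + r".*$", flags=re.MULTILINE)
    return pattern.sub(lambda match: ">" + species, text)


example = (
    ">sp1\nAAAA\n"
    ">EOG7B67Q7|Ectop_sp|C265168_a_16_0_l_886|.|.|Zoractes_sp\nCCCC\n"
    ">sp4\nGGGG\n"
)
print(simplify_header(example))
```

Sequence lines and the other headers pass through unchanged, because only lines that start with > and contain the label match the pattern.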

My follow-up question was about looping over the files and applying the main processing step to each one. For that I was going to suggest Python, using the glob or fnmatch module to find every FASTA file in the directory and run the find-and-replace on each. I'm just more used to doing that in Python; however, it looks like caleb solved that too with a nice bash loop.
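For reference, the glob-based loop mentioned above might look something like the sketch below. It is not the version from the MyBinder session, just an illustration under a few assumptions: the input files end in .fa, they sit in the directory called "header" mentioned earlier in the thread, each output is named after the part of the file name before the first dot, and the function names second_field and rename_all are invented for this example.

```python
import glob
import os


def second_field(header_line):
    """Return '>' plus the second '|'-separated field of a header line.

    Headers without a '|' (e.g. '>sp1') are returned unchanged.
    """
    stripped = header_line.rstrip("\r\n")
    fields = stripped.split("|")
    if len(fields) < 2:
        return stripped
    return ">" + fields[1]


def rename_all(directory, pattern="*.fa"):
    """Write a .fasta copy with shortened headers for every matching file."""
    written = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # EOG7B0H7K.aa.summarized.fa -> EOG7B0H7K.fasta
        base = os.path.basename(path).split(".", 1)[0]
        out_path = os.path.join(directory, base + ".fasta")
        with open(path) as src, open(out_path, "w") as dst:
            for line in src:
                if line.startswith(">"):
                    dst.write(second_field(line) + "\n")
                else:
                    dst.write(line)
        written.append(out_path)
    return written


if __name__ == "__main__":
    # "header" is the directory name given in the thread; adjust as needed.
    for out_path in rename_all("header"):
        print(out_path)
```

Because the pattern is *.fa, the .fasta files written by an earlier run are not picked up as inputs on a second run.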

So if you get stuck, let me know if you'd like my version, which would work in a remote temporary session.

ADD REPLY • link 2.2 years ago by Wayne ★ 2.0k

score 2 · Answer 1 · 2022-02-25

Building on what you already have, you could slightly modify your Python file to accept command-line arguments. It would look something like this:

#!/usr/bin/env python3

import sys

# First argument: input FASTA file; second argument: output FASTA file.
fasta_in = sys.argv[1]
fasta_out = sys.argv[2]

with open(fasta_in, 'r') as fasta_handle, open(fasta_out, 'w') as output_handle:
    # For each line of the input file
    for line in fasta_handle:
        # A FASTA header (description line) starts with ">".
        if line.startswith(">"):
            # Split the header on "|", the separator used in these headers.
            header_split = line.strip().split("|")
            print(header_split)

            # Keep only the second field (the species name), prefixed with ">".
            # A header without "|" (e.g. ">sp1") has no second field and is kept as is.
            if len(header_split) > 1:
                new_header = ">" + header_split[1]
            else:
                new_header = header_split[0]
            print(new_header)

            # Write the new header, followed by a newline.
            output_handle.write(new_header + '\n')
        # Sequence lines are written without modification.
        else:
            output_handle.write(line)

Then, while in the directory where the FASTA files are stored, run a bash for loop like this one, inserting the name of the Python script in place of <script.py>:

for fastaFile in *.fa; do python <script.py> "$fastaFile" "${fastaFile%%.*}.fasta"; done
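In the loop, ${fastaFile%%.*} strips everything from the first dot onwards, so EOG7B0H7K.aa.summarized.fa is written to EOG7B0H7K.fasta. With a couple of thousand files it may be worth checking first that no two inputs map to the same output name, since the later one would silently overwrite the earlier one. The sketch below is one way to do that check; the function names are made up for illustration, and the third file name in the example (the .nt. one) is hypothetical, included only to show a collision.

```python
from collections import defaultdict


def output_name(input_name):
    """Mimic bash ${name%%.*}.fasta: cut at the first dot, then add .fasta."""
    return input_name.split(".", 1)[0] + ".fasta"


def find_collisions(input_names):
    """Return {output_name: [inputs]} for outputs claimed by several inputs."""
    by_output = defaultdict(list)
    for name in input_names:
        by_output[output_name(name)].append(name)
    return {out: ins for out, ins in by_output.items() if len(ins) > 1}


names = [
    "EOG7B0H7K.aa.summarized.fa",
    "EOG7B0H7N.aa.summarized.fa",
    "EOG7B0H7N.nt.summarized.fa",  # hypothetical name, to demonstrate a clash
]
print(output_name(names[0]))
print(find_collisions(names))
```

An empty dictionary from find_collisions means every input gets its own output file.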