Question: Customize fasta headers
0
3.3 years ago by
bkvijay.jayaraman0 wrote:

Hi, I have a FASTA file with 300 protein sequences, and I intend to construct a phylogenetic tree from it. I want to keep only the accession number and the organism name in each FASTA header and remove the rest of the information. Can anybody suggest how to do this? I have a Linux-based system with Perl and Python installed.

For example, I want to convert a header like this:

 >gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]

to a header like this:

>Leishmania panamensis| AIN98665.1

Some sequences have multiple headers. Would that be a problem?

Regards, Vijay

sequence edit • 1.7k views
modified 3.3 years ago by moranr250 • written 3.3 years ago by bkvijay.jayaraman0
4

Strictly speaking, yours is not a proper FASTA header. Anything following the first space or tab is not part of the sequence name, so renaming FASTA records like this may confuse other tools.
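To illustrate the point about whitespace, here is a minimal Python sketch of how many tools derive the sequence name from a header line. The helper `sequence_name` and the underscore variant of the header are illustrative assumptions, not something from the thread or from any particular tool.

```python
def sequence_name(header_line):
    """Return the part of a FASTA header that is usually treated as the name."""
    # Drop the leading ">" and keep everything up to the first space or tab.
    fields = header_line.lstrip(">").split(None, 1)
    return fields[0] if fields else ""


original = ">gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]"
renamed = ">Leishmania panamensis| AIN98665.1"
# Hypothetical whitespace-free alternative (an assumption, not the OP's format).
underscored = ">Leishmania_panamensis|AIN98665.1"

print(sequence_name(original))     # gi|685204428|gb|AIN98665.1|
print(sequence_name(renamed))      # Leishmania
print(sequence_name(underscored))  # Leishmania_panamensis|AIN98665.1
```

With the requested format, every sequence from the same genus would end up with the same name ("Leishmania") in such tools, which is why replacing the spaces is worth considering.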

written 3.3 years ago by lh331k
2
3.3 years ago by
RamRS23k
Houston, TX
RamRS23k wrote:

Sequences cannot have multiple headers, AFAIK. Are you sure those are not just headers with empty sequences?

You can use a combination of bioawk and sed for this. The sed command would be something like:

echo "$header" | sed -re 's/^>.*gb[|]([A-Z0-9.]+)[|].*[[]([A-Za-z ]+)[]]/>\2| \1/'
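Since the OP also has Python available, the same substitution can be sketched with the standard `re` module. This is only a rough equivalent under the assumption that every header contains a `gb|ACCESSION|` field and ends with the organism in square brackets; the names `HEADER_RE` and `rewrite_header` are made up for illustration.

```python
import re

# Group 1: the accession after "gb|"; group 2: the organism inside the
# final [...] of the header line.
HEADER_RE = re.compile(r"^>.*gb\|([A-Z0-9_.]+)\|.*\[([^\[\]]+)\]\s*$")


def rewrite_header(line):
    """Return '>Organism| ACCESSION', or the line unchanged if it does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return line  # sequence lines and unexpected headers pass through as-is
    accession, organism = match.groups()
    return ">{}| {}".format(organism, accession)


header = ">gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]"
print(rewrite_header(header))  # >Leishmania panamensis| AIN98665.1
```

Headers from other databases (for example `ref|` or `emb|` entries) would not match this pattern and would be left untouched, so it is worth checking the output for any headers that were not rewritten.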
modified 3.3 years ago • written 3.3 years ago by RamRS23k
1

The "block quotes" formatting in the post editor used by the OP was messing up the display. The headers are regular FASTA headers (see above).

written 3.3 years ago by genomax70k

I ignored that part - my regex includes the > sign in the header. It's the OP's statement about "multiple headers" that has me curious.

written 3.3 years ago by RamRS23k
0
3.3 years ago by
moranr250
Ireland
moranr250 wrote:

Hi,

You could use Biopython, something like this (I've done no testing):

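from Bio import SeqIO

input_file = "proteins.fa"  # placeholder: path to your FASTA file
output_file = "editedHeaders.fa"
myRecords = []
for fasta in SeqIO.parse(input_file, "fasta"):
    # description holds the full header line, without the leading ">"
    newIDList = fasta.description.split("[")
    # the organism is the text inside the last [...] of the header
    species = newIDList[-1].replace("]", "").strip()
    # the accession is the 4th "|"-separated field: gi|...|gb|ACCESSION|
    newID = species + "|" + newIDList[0].split("|")[3]
    fasta.id = newID
    fasta.description = ""  # drop the rest of the header
    myRecords.append(fasta)

SeqIO.write(myRecords, output_file, "fasta")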

I would also add a dict that maps each new header to its original header and pickle it, so you can look up the original information later if needed.
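A minimal sketch of that lookup idea, using only the standard library. The helper names `save_header_map` and `load_header_map` and the file name `originalHeaders.pkl` are assumptions for illustration; in practice the dict would be filled inside the loop that rewrites the headers.

```python
import os
import pickle
import tempfile


def save_header_map(header_map, path):
    """Pickle a dict of {new header: original header} to disk."""
    with open(path, "wb") as handle:
        pickle.dump(header_map, handle)


def load_header_map(path):
    """Load a previously pickled header dict."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# One example entry; real code would add one entry per record.
header_map = {
    "Leishmania panamensis|AIN98665.1":
        "gi|685204428|gb|AIN98665.1| fumarate hydratase, putative [Leishmania panamensis]",
}

# Written to the temp directory here only to keep the demo tidy.
map_path = os.path.join(tempfile.gettempdir(), "originalHeaders.pkl")
save_header_map(header_map, map_path)

restored = load_header_map(map_path)
print(restored["Leishmania panamensis|AIN98665.1"])
```

A tab-separated text file would serve the same purpose and is easier to inspect by hand; pickle is used here only because it is what the answer suggests.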

Good luck

written 3.3 years ago by moranr250