Question: Renaming FASTA headers while keeping some previous information?
16 days ago, Alec Watanabe wrote:

Hi community,

I know this is a common type of question, since there are lots of posts about FASTA headers, but I don't have a good background in informatics or in sed's syntax, so I don't really know how to write the right command. I have a FASTA file that looks like this:

>CDS CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399
MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE
VLGVNIGSFFEEVTGGMAQEIERNEQHIPSHFDSKALSIF
>CDS CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554
MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY
SKAIRFWHWGNALLFILLLLSGFLGHFSIGNVTSMVLLHKICGFVLIAFWIGFILINLTT
SNGVHYKVRFSGLIGRCIKQARFYLYGIMKGEPHPFAATETDKFNPLQQLAYLGVMFGLV
PLLLVTGLLCLYPEVLGYGYWMLKAHLVLGIVALMFICAHFYLCTLGDTFTQTFRSMVDG
HHRHQKHDNHRSANEKVEH
>CDS CDS Thiosulfate reductase electron transport protein phsB 4769:5344 reverse MW:21362
MNNNKQFVMLHDEKRCIGCQACTVACKVINDIPEGFSRLQVQIQGPHNDEAGNPHYQFFR
VSCQHCEDAPCVSVCPTGASFIDENGIVQVKKELCIGCDYCVGACPYHVRYINPMTHIAD
KCNFCSDTRLTEGELPACVSVCPTDALAFGRIDSPEIQAWIKQKSVYQYQLDNVGKPSLF
RRKEIHQGDKA
>CDS CDS Thiosulfate reductase precursor (EC 1.-.-.-) 5359:7638 reverse MW:83512
MSISRRSFIKGMGVGCVGCTVSSLPPGALAFNPVDSLKGQSTLTPSLCEMCSYRCPIEAQ
VVNNKTVFIQGNRNAEHQSSRVCARGGSGVSLVNDPNRIVKPMKHKGPRGAGEWEVISWE
QAYKEIAEKMNAIKQNYGAESISFSSKSGSLSSHLFHLAAAFGSPNTFTHASTCPAGKAI
AASVMMGGDLKMDLANSKYILSFGHNLYEGIEVAETHELMTAQERGAKLVSFDPRLSVVS
SKADEWFAIRPGGDLPVLMAMCHILIKEDLYDKEFVEKFTVGFPQLKDVLQETTPEWAQA
HSDVPAKDIVRIAREIAAKAPHALIMPGHRATFNKEEINMRRMIFTFNALLGNIEREGGL
YQKKAATKYNKLAGIAVAPELAKPSVKGMPEITAKRIDATAPQFKYINKGGGIVQSIIDS
TLEGVPYQTKAWIMSRHNPFQTVSCRPDLEKAAQKLDLIVSCDVYLSESAAYADYLLPEC
TYLERDEEVADVSGLNPAYALRQQVVEPIGDTKPSWLIWMELGKALGLEACFPWENMGVR
QLYQVNGSEELYKEMHKKGYISYGVPLLLREPSYVKAFVDQYPDAIKQVDSNNTMEKALS
FKSPSGLIEIYSEELESRLENYGIPRFHNFPLKEKDELYFIQGKVAVHTNGATQYVPLLA
ELMWENPVWLHPETAKNHGIKHGDEIILENSVGKEKARALITEGIRPDTVFVYMGSGAKA
GAKTAATTTGVHCGNLLPHEISPVSGTDVHTSGVRISRA

I want to rename all FASTA headers so that each contains a number after the first CDS. The output would look like this:

>CDS1 CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399
MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE
VLGVNIGSFFEEVTGGMAQEIERNEQHIPSHFDSKALSIF
>CDS2 CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554
MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY
SKAIRFWHWGNALLFILLLLSGFLGHFSIGNVTSMVLLHKICGFVLIAFWIGFILINLTT
SNGVHYKVRFSGLIGRCIKQARFYLYGIMKGEPHPFAATETDKFNPLQQLAYLGVMFGLV
PLLLVTGLLCLYPEVLGYGYWMLKAHLVLGIVALMFICAHFYLCTLGDTFTQTFRSMVDG
HHRHQKHDNHRSANEKVEH
>CDS3 CDS Thiosulfate reductase electron transport protein phsB 4769:5344 reverse MW:21362
MNNNKQFVMLHDEKRCIGCQACTVACKVINDIPEGFSRLQVQIQGPHNDEAGNPHYQFFR
VSCQHCEDAPCVSVCPTGASFIDENGIVQVKKELCIGCDYCVGACPYHVRYINPMTHIAD
KCNFCSDTRLTEGELPACVSVCPTDALAFGRIDSPEIQAWIKQKSVYQYQLDNVGKPSLF
RRKEIHQGDKA
>CDS4 CDS Thiosulfate reductase precursor (EC 1.-.-.-) 5359:7638 reverse MW:83512
MSISRRSFIKGMGVGCVGCTVSSLPPGALAFNPVDSLKGQSTLTPSLCEMCSYRCPIEAQ
VVNNKTVFIQGNRNAEHQSSRVCARGGSGVSLVNDPNRIVKPMKHKGPRGAGEWEVISWE
QAYKEIAEKMNAIKQNYGAESISFSSKSGSLSSHLFHLAAAFGSPNTFTHASTCPAGKAI
AASVMMGGDLKMDLANSKYILSFGHNLYEGIEVAETHELMTAQERGAKLVSFDPRLSVVS
SKADEWFAIRPGGDLPVLMAMCHILIKEDLYDKEFVEKFTVGFPQLKDVLQETTPEWAQA
HSDVPAKDIVRIAREIAAKAPHALIMPGHRATFNKEEINMRRMIFTFNALLGNIEREGGL
YQKKAATKYNKLAGIAVAPELAKPSVKGMPEITAKRIDATAPQFKYINKGGGIVQSIIDS
TLEGVPYQTKAWIMSRHNPFQTVSCRPDLEKAAQKLDLIVSCDVYLSESAAYADYLLPEC
TYLERDEEVADVSGLNPAYALRQQVVEPIGDTKPSWLIWMELGKALGLEACFPWENMGVR
QLYQVNGSEELYKEMHKKGYISYGVPLLLREPSYVKAFVDQYPDAIKQVDSNNTMEKALS
FKSPSGLIEIYSEELESRLENYGIPRFHNFPLKEKDELYFIQGKVAVHTNGATQYVPLLA
ELMWENPVWLHPETAKNHGIKHGDEIILENSVGKEKARALITEGIRPDTVFVYMGSGAKA
GAKTAATTTGVHCGNLLPHEISPVSGTDVHTSGVRISRA

The motivation is simple. I'll be submitting this FASTA file to software such as SurfG+, MEDpipe and inmembrane. Since the CDSs in this file have no identification or numbering, the software's output wouldn't tell me which one is which. Thank you all in advance.
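For readers who prefer a script, the renaming described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any tool mentioned in this thread; the function name `number_fasta_headers` and the in-memory example are assumptions made for the demonstration.

```python
import io

def number_fasta_headers(lines):
    """Yield lines with a running index appended to the first word of each header."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            # Split the header at the first space: ">CDS" | " " | "CDS FIG0094..."
            first, sep, rest = line.rstrip("\n").partition(" ")
            yield f"{first}{count}{sep}{rest}\n"
        else:
            # Sequence lines are passed through unchanged.
            yield line

# Two records taken from the question, shortened to one sequence line each.
example = io.StringIO(
    ">CDS CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399\n"
    "MSSKYPVAYAVGQKIKSLRKSQGYTVFQLAKEIDISEQQLFRYERGVNRIDIDCLVRVLE\n"
    ">CDS CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554\n"
    "MTSIWGAELHYTPDYWPVWLMAAGLLIVAMIAVLVIHGLLRYALAPKHTGHYEEERVYLY\n"
)
renamed = list(number_fasta_headers(example))
print("".join(renamed), end="")
```

To run it on a real file, an open file handle can be passed in place of `example`, with the yielded lines written to a new file rather than over the original.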


This is a commonly asked question here, and you should find multiple threads that help with it. Use Google to run an external search against Biostars; the internal Biostars search engine is not the best.

Here is one: How To Rename FASTA Headers

written 16 days ago by GenoMax
15 days ago, hlfzeus wrote:

Dear Alec, I suggest you try the "Rename header" option of SEDA (http://www.sing-group.org/seda/). Section 3.8.4, "Add prefix/suffix", of the manual explains how to easily achieve what you want: add an index after the header id. Do not hesitate to contact me if you need some help.

Regards,

Hugo.


Dear Hugo,

Thank you! SEDA works fantastically! I was able to rename all the headers correctly. Unfortunately, when I submitted the file to the pipeline I'm working with, it didn't accept the format: it says the file is not in FASTA format. Maybe I did something wrong when editing the file (I made this FASTA file manually). I'm going to use a FASTA file generated automatically from an EMBL one instead. It will take more time because I'll have to check the sequences manually (the CDSs have no locus_tag information in this specific file, so I need to identify each sequence by its product). The problem will be if I find any hypothetical proteins. After analysis in the pipeline, the number of sequences I'll need to check will drop to around 80, so it's not that bad. It's strange, though. I paid close attention when I made the FASTA file: I checked that the translation from nucleotide to amino acid was correct, that the column size is OK, and that the headers are in the same format as in other files. Anyway, for this specific question your answer is really helpful, so I'll choose it as the final answer. Thank you!
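When a hand-edited file is rejected as "not FASTA", a quick scan for common formatting slips can narrow things down. The sketch below is only an assumption about what might go wrong; the thread does not say what the pipeline actually checks. The function name `find_fasta_problems`, the set of accepted sequence characters, and the sample text are all hypothetical.

```python
import re

# Assumption: protein sequence lines contain only letters, '*' (stop) and '-' (gap).
VALID_SEQ = re.compile(r"^[A-Za-z*\-]+$")

def find_fasta_problems(text):
    """Return a list of (line_number, message) for common formatting problems."""
    problems = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single trailing newline is normal
    if not lines or not lines[0].startswith(">"):
        problems.append((1, "file does not start with a '>' header"))
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if line.endswith("\r"):
            problems.append((number, "Windows line ending (CRLF)"))
            line = line[:-1]
        if line == "":
            problems.append((number, "blank line"))
        elif line.startswith(">"):
            # The identifier is the text between '>' and the first space.
            seq_id = line[1:].split(" ", 1)[0]
            if seq_id == "":
                problems.append((number, "header has no identifier"))
            elif seq_id in seen_ids:
                problems.append((number, f"duplicate identifier '{seq_id}'"))
            seen_ids.add(seq_id)
        elif not VALID_SEQ.match(line):
            problems.append((number, "unexpected character in sequence line"))
    return problems

# Hypothetical sample with a blank line, a repeated identifier and a stray digit.
sample = ">CDS a\nMSSK\n\n>CDS b\nMT5I\n"
for number, message in find_fasta_problems(sample):
    print(number, message)
```

An empty result does not guarantee that a given pipeline will accept the file; it only rules out the specific problems listed in the function.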

written 15 days ago by Alec Watanabe
15 days ago, Vijay Lakhujani (India) wrote:

I suggest you start familiarizing yourself with seqkit, a very fast FASTA/Q manipulation toolkit. It's really easy to set up (you just need a single file; technically no installation is required) and the documentation is very good.

I know many non-technical people who are fairly comfortable using it. See this post for example

seqkit replace -p "^(.+?) " -r "\${1}{nr} " seqs.fa
written 15 days ago by shenwei356
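To make the seqkit command easier to read: `-p` gives a regular expression that lazily captures everything up to the first space of the header, and `-r` puts the captured text back followed by `{nr}`, seqkit's placeholder for the record number (the `\$` only stops the shell from expanding `${1}` inside double quotes). The Python snippet below mimics that substitution on two headers from the question. It is an illustration of the pattern, not seqkit's implementation, and the helper name `rename` is an assumption.

```python
import re

# Same pattern as in the seqkit command: lazily capture everything up to the first space.
PATTERN = re.compile(r"^(.+?) ")

def rename(header, record_number):
    """Re-insert the captured first word followed by the record number."""
    return PATTERN.sub(lambda m: f"{m.group(1)}{record_number} ", header, count=1)

# Headers are given without the leading '>', which is how seqkit sees the name.
headers = [
    "CDS CDS FIG00947432: hypothetical protein 2322:2624 forward MW:11399",
    "CDS CDS Thiosulfate reductase cytochrome B subunit 3993:4772 reverse MW:29554",
]
for nr, header in enumerate(headers, start=1):
    print(rename(header, nr))
```

Because the pattern requires a space, a header consisting of a single word would be left unchanged by this substitution.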