How to change FASTA headers?
3.1 years ago

Dear all, I have a file containing multiple FASTA sequences like:

>lcl|NZ_CP018664.1_prot_WP_000637306.1_3741 [locus_tag=AUO97_RS19225] [protein=hypothetical protein] [protein_id=WP_000637306.1] [location=complement(4001198..4001389)] [gbkey=CDS]
MIVSNNFAVPYYLNVRKEKGMTAYYWATHQSQLALFDSYELAYRFYFPSRHILIRSEIKAFAQ
>lcl|NZ_CP018664.1_prot_WP_000572517.1_3742 [locus_tag=AUO97_RS19230] [protein=FadR family transcriptional regulator] [protein_id=WP_000572517.1] [location=complement(4001417..4002115)] [gbkey=CDS]
MIEQIQKRSLVDEVIHVIRQNIKNDIWKVDEKIPTEPELVQGLGVGRNTIREAIKILEYLGVLEVKQGLG

I want to change the headers of this file to:

>WP_000637306.1
MIVSNNFAVPYYLNVRKEKGMTAYYWATHQSQLALFDSYELAYRFYFPSRHILIRSEIKAFAQ
>WP_000572517.1
MIEQIQKRSLVDEVIHVIRQNIKNDIWKVDEKIPTEPELVQGLGVGRNTIREAIKILEYLGVLEVKQGLG
3.1 years ago

An awk solution (it needs GNU awk, because the three-argument form of match() is a gawk extension):

$ awk '$0 ~ "^>" {match($0, /protein_id=([0-9A-Z_\.]+)/, protein); print ">"protein[1]; next;}1' input.fa > output.fa
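If gawk is not available, the same idea can be sketched with only POSIX awk features (`match()` with `RSTART`/`RLENGTH`, plus `substr()`). This is a minimal sketch, not a definitive implementation: the function name `rename_by_protein_id` is made up for illustration, and it assumes the accession after `protein_id=` consists only of letters, digits, underscores and dots.

```shell
# Hypothetical helper: reduce each FASTA header to its protein_id value.
rename_by_protein_id() {
    awk '
        /^>/ {
            # Find "protein_id=" followed by the accession characters.
            if (match($0, /protein_id=[A-Za-z0-9_.]+/)) {
                # "protein_id=" is 11 characters long; skip it and keep the accession.
                print ">" substr($0, RSTART + 11, RLENGTH - 11)
            } else {
                print    # header without a protein_id tag: leave it unchanged
            }
            next
        }
        { print }        # sequence lines pass through untouched
    ' "$@"
}
```

It can be called as `rename_by_protein_id input.fa > output.fa`, or fed from a pipe.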

fin swimmer

3.1 years ago
$ sed '/^>/ s/.*_\([A-Z]\{2\}_[0-9]\+\.[0-9]\+\).*/>\1/' test.fa

or

$ sed '/^>/ s/.*=\([A-Z]\+_[0-9]\+\.[0-9]\+\).*/>\1/' test.fa

>WP_000637306.1
MIVSNNFAVPYYLNVRKEKGMTAYYWATHQSQLALFDSYELAYRFYFPSRHILIRSEIKAFAQ
>WP_000572517.1
MIEQIQKRSLVDEVIHVIRQNIKNDIWKVDEKIPTEPELVQGLGVGRNTIREAIKILEYLGVLEVKQGLG

For case-insensitive IDs:

$ sed '/^>/ s/.*_\([A-Za-z]\{2\}_[0-9]\+\.[0-9]\+\).*/>\1/' test.fa
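One thing that may be worth checking after renaming: once the locus-specific part of the header is gone, two records can end up with the same ID if the same protein accession is annotated at more than one location, and many downstream tools expect unique IDs. A small sketch of such a check follows; the function name `duplicate_headers` is just an illustration, and it compares whole header lines.

```shell
# Hypothetical helper: print every header line that occurs more than once.
duplicate_headers() {
    awk '
        # Count each header; report it once, on its second occurrence.
        /^>/ { if (seen[$0]++ == 1) print }
    ' "$@"
}
```

Running `duplicate_headers output.fa` prints nothing when all headers are unique.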

Thank you so much, it really works!


And what if the sequences are like:

>WP_000580188.1hypotheticalprotein[Acinetobacterbaumannii]
MIGQQRNILATLGIDVWIPRTQVCQKNNAHTLWRDQVVEPHESITVPTIDVPAFEQKNTQPQVLEIPKVVEEPPIVVAEVSQPEILVEKPKVIEQETITPFELQAYCLEKCVIFVDVTALETEEKQLWANIQKAKVGQYSELRWPFPLAAYQDQRGVGSYIQGFLDAVAAEKKILCLGKCTYIQHANIIHLASLKEMLDKPLLKKRLWQLMQDNNE 
>WP_000807316.1MULTISPECIES:glycerol-3-phosphatedehydrogenase(NAD(P)(+))[Acinetobactercalcoaceticus/baumanniicomplex]
 MAEFKFTDLVEPVAV

How can I change them to:

>WP_000580188.1
MIGQQRNILATLGIDVWIPRTQVCQKNNAHTLWRDQVVEPHESITVPTIDVPAFEQKNTQPQVLEIPKVVEEPPIVVAEVSQPEILVEKPKVIEQETITPFELQAYCLEKCVIFVDVTALETEEKQLWANIQKAKVGQYSELRWPFPLAAYQDQRGVGSYIQGFLDAVAAEKKILCLGKCTYIQHANIIHLASLKEMLDKPLLKKRLWQLMQDNNE 

>WP_000807316.1
MAEFKFTDLVEPVAV

Hello sharmatina189059,

  • Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
  • There is no connection between your input and output example. Please review this and provide a correct one.

Thanks.

fin swimmer


I have edited my post. Thanks!!

$ sed '/>/ s/\(^>[0-9A-Z_]\+\.[0-9]\+\).*/\1/g' input.fa > output.fa
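The `\+` above is a GNU sed extension to basic regular expressions. For systems with a different sed, the same truncation can be sketched with extended regular expressions (`sed -E`, accepted by both GNU and BSD sed). The function name `trim_to_accession` is only illustrative. Because these headers have no space after the version number, a description that starts with a digit would be absorbed into the version, so this is a best-effort sketch for headers like the ones shown.

```shell
# Hypothetical helper: cut each header right after ACCESSION.VERSION.
trim_to_accession() {
    # On header lines only: keep ">", the accession, the dot and the
    # version digits; drop everything that follows.
    sed -E '/^>/ s/^(>[A-Za-z0-9_]+\.[0-9]+).*/\1/' "$@"
}
```

Usage would be `trim_to_accession input.fa > output.fa`.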

I wonder how you ended up with (or where you got) those kinds of FASTA headers? They really violate all possible rules for FASTA header formatting ...


But I retrieved it from the NCBI FTP site and the format is the same. Am I doing something wrong?


I don't think it's you.

The ones in your original post are OK, but the ones you posted above are missing all the spaces. If you really retrieve them this way from NCBI, I would let them know that something is off.


Try this, sharmatina189059:

$ sed '/^>/ s/^\(>\w\+\.[0-9]\+\).*/\1/' file.fa

or

$ sed '/^>/ s/^\(>[^.]*\.[0-9]\+\).*/\1/' file.fa

>WP_000580188.1
MIGQQRNILATLGIDVWIPRTQVCQKNNAHTLWRDQVVEPHESITVPTIDVPAFEQKNTQPQVLEIPKVVEEPPIVVAEVSQPEILVEKPKVIEQETITPFELQAYCLEKCVIFVDVTALETEEKQLWANIQKAKVGQYSELRWPFPLAAYQDQRGVGSYIQGFLDAVAAEKKILCLGKCTYIQHANIIHLASLKEMLDKPLLKKRLWQLMQDNNE 
>WP_000807316.1
 MAEFKFTDLVEPVAV
3.1 years ago
Hugo ▴ 340

Dear colleague, I think you may also find the SEDA software useful (http://sing-group.org/seda/).

As you can see in the manual (https://www.sing-group.org/seda/manual), it has many customizable operations for processing FASTA files, including operations to reformat sequence headers as you need (see the Rename header operation).

It also includes a specific NCBI Rename operation (https://www.sing-group.org/seda/manual/operations.html#ncbi-rename) aimed at replacing accession codes with species names and taxonomy information.

With best regards,

Hugo.
