Question

Edit FASTA header to add organism name after the accession number using perl or sed.

5.9 years ago

MB ▴ 50

I have multiple FASTA files, each containing more than a thousand sequences, with headers like these:

>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein

I want to edit these headers as follows:

>KXL50728Acidomycesrichmondensis
>KXL50729Acidomycesrichmondensis

Could anybody please tell me how to do this using Perl or, preferably, sed?

Perl FASTA sed • 2.3k views


Thanks to all, it worked!

5.9 years ago by MB ▴ 50

You're welcome.

Please be so kind as to mark all answers as accepted. That way everyone can see that they solved your problem.

fin swimmer

5.9 years ago by finswimmer 16k

Please use ADD COMMENT or ADD REPLY to respond to previous reactions, so that this thread remains logically structured and easy to follow. I have now moved your reaction, but as you can see the result is not optimal. Adding an answer should only be used for providing a solution to the question asked.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

5.9 years ago by WouterDeCoster 47k

score 2 · Accepted Answer · 2018-05-20

Hello MB,

try this:

$ sed 's/^\(>\S*\).*/\1Acidomycesrichmondensis/' your.fasta > new.fasta

What we are asking sed to do is: in every line that starts with >, keep everything up to the first whitespace character and replace the rest of the line with Acidomycesrichmondensis.

In the regex:

^ matches the start of the line
\(...\) builds a group, so we can output it later
\S* matches as many non-whitespace characters as possible
.* matches the rest of the line

In the substitution:

\1 prints the first group we defined in the regex
Acidomycesrichmondensis replaces the rest of the line
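
To see the pattern at work, here is a minimal sketch on a two-record sample. The file name sample.fa, the shortened headers and the short sequences are made up for illustration. It also assumes you may want a portable spelling: \S is a GNU sed extension, and [^[:space:]] is the POSIX equivalent.

```shell
# Build a small two-record sample (hypothetical file, shortened headers).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '>KXL50728 pep supercontig:ASM157207v1 description:hypothetical protein' \
  'MSTA' \
  '>KXL50729 pep supercontig:ASM157207v1 description:hypothetical protein' \
  'MKLV' > "$tmpdir/sample.fa"

# Keep ">" plus the run of non-whitespace characters (the accession),
# drop the rest of the header line, and append the organism name.
# [^[:space:]] is the POSIX spelling of GNU sed's \S.
result=$(sed 's/^\(>[^[:space:]]*\).*/\1Acidomycesrichmondensis/' "$tmpdir/sample.fa")
printf '%s\n' "$result"
rm -r "$tmpdir"
```

Sequence lines do not start with >, so the substitution never matches them and they pass through unchanged.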

Another way is to use awk:

$ awk -F " " '{if($0 ~ "^>") {print $1"Acidomycesrichmondensis"} else {print $0}}' your.fasta > new.fasta
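
If the organism name differs between files, it may be handy to pass it in as a variable instead of hard-coding it. A small sketch of that idea, where the variable name org and the sample record are my own assumptions:

```shell
# Organism name passed to awk with -v (variable name "org" is hypothetical).
org="Acidomycesrichmondensis"

# Header lines: print the first whitespace-separated field (">accession")
# followed directly by the organism name. All other lines: print unchanged.
result=$(printf '%s\n' '>KXL50728 pep description:hypothetical protein' 'MSTA' |
  awk -v org="$org" '/^>/ { print $1 org; next } { print }')
printf '%s\n' "$result"
```

With awk's default field separator, $1 is everything before the first run of blanks or tabs, so no -F option is needed here.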

fin swimmer

score 2 · Accepted Answer · 2018-05-20

$ sed '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa
$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa

input:

$ cat test.fa 
>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc

output:

$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa 
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc
$ sed '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc
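
Since the question mentions multiple FASTA files, the same sed command could be wrapped in a loop. This is only a sketch under assumptions: the .fa extension, the .renamed.fasta output suffix and the sample files are invented for the example, and the organism name must not contain / or &, which are special in a sed replacement.

```shell
# Two tiny sample files in a scratch directory (hypothetical names and content).
workdir=$(mktemp -d)
printf '%s\n' '>KXL50728 pep description:hypothetical protein' 'MSTA' > "$workdir/a.fa"
printf '%s\n' '>KXL50729 pep description:hypothetical protein' 'MKLV' > "$workdir/b.fa"

org="Acidomycesrichmondensis"
for f in "$workdir"/*.fa; do
  # Write next to the input instead of editing in place.
  out="${f%.fa}.renamed.fasta"
  # On header lines only, replace everything from the first whitespace
  # character onward with the organism name.
  sed "/^>/ s/[[:space:]].*/$org/" "$f" > "$out"
done

result_a=$(cat "$workdir/a.renamed.fasta")
result_b=$(cat "$workdir/b.renamed.fasta")
printf '%s\n' "$result_a" "$result_b"
rm -r "$workdir"
```

The address is anchored as /^>/ here so that only lines starting with > are touched. Note that a header with no whitespace after the accession is left unchanged by this form, whereas the capture-group form from the first answer appends the name in that case too.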