Question: Edit FASTA header to add organism name after the accession number using perl or sed.
MB wrote (10 months ago):

I have multiple FASTA files containing more than a thousand sequences, with headers like the following:

>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein

I want to edit these headers as follows:

>KXL50728Acidomycesrichmondensis
>KXL50729Acidomycesrichmondensis

Could anybody please tell me how to do this with Perl or, preferably, with sed?


Thanks to all, it worked!

written 10 months ago by MB

You're welcome.

Please be so kind as to mark all answers as accepted. That way everyone can see that they solved your problem.

fin swimmer

written 10 months ago by finswimmer

Please use ADD COMMENT or ADD REPLY to respond to previous comments, so that the thread remains logically structured and easy to follow. I have now moved your reaction, but as you can see it's not optimal. Adding an answer should only be used to provide a solution to the question asked.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 10 months ago by WouterDeCoster

finswimmer wrote (10 months ago):

Hello MB,

try this:

$ sed 's/^\(>\S*\).*/\1Acidomycesrichmondensis/' your.fasta > new.fasta

What we are asking sed to do is: in every line that starts with >, keep everything up to the first whitespace character and replace the rest of the line with Acidomycesrichmondensis.

In the regex:

  • ^ matches the start of the line
  • \(...\) builds a group, so we can reuse it in the replacement
  • \S* matches as many non-whitespace characters as possible
  • .* matches the rest of the line

In the substitution:

  • \1 prints the first group we defined in the regex
  • Acidomycesrichmondensis then replaces the rest of the line
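
To see the pieces of the regex working together, here is a small sketch on an abbreviated two-record sample. The file names sample.fasta and sample.renamed.fasta are made up for illustration, and the headers are shortened. Since \S is a GNU sed extension, the sketch uses the POSIX class [^[:space:]] instead, which should behave the same way on these headers.

```shell
# Build a tiny sample file (hypothetical name, shortened headers).
printf '%s\n' \
  '>KXL50728 pep supercontig:ASM157207v1 description:hypothetical protein' \
  'MKT' \
  '>KXL50729 pep supercontig:ASM157207v1 description:hypothetical protein' \
  'MAG' > sample.fasta

# Group 1 captures ">" plus the accession (everything up to the first
# whitespace); ".*" swallows the rest of the header, which is replaced
# by the organism name. Sequence lines do not start with ">" and are
# therefore left untouched.
sed 's/^\(>[^[:space:]]*\).*/\1Acidomycesrichmondensis/' sample.fasta > sample.renamed.fasta

cat sample.renamed.fasta
```

The sequence lines (MKT, MAG) should come through unchanged, because the pattern is anchored to lines beginning with >.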

Another way is to use awk:

$ awk -F " " '{if($0 ~ "^>") {print $1"Acidomycesrichmondensis"} else {print $0}}' your.fasta > new.fasta

fin swimmer

cpad0112 wrote (10 months ago):
$ sed  '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa 
$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa

input:

$ cat test.fa 
>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc

output:

$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa 
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc
$ sed  '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa 
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc
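
Because the question mentions multiple FASTA files, the same sed substitution could be wrapped in a loop. This is only a sketch under assumptions: the directory names fasta_in and fasta_out and the file names are invented, and it assumes every file belongs to the same organism.

```shell
# Hypothetical layout: inputs in fasta_in/, edited copies in fasta_out/.
mkdir -p fasta_in fasta_out
printf '%s\n' '>KXL50728 pep description:hypothetical protein' 'atgc' > fasta_in/a.fasta
printf '%s\n' '>KXL50729 pep description:hypothetical protein' 'atgc' > fasta_in/b.fasta

organism='Acidomycesrichmondensis'   # assumed to be the same for every file

for f in fasta_in/*.fasta; do
  # Only header lines (starting with ">") are edited: everything from the
  # first whitespace character to the end of the line becomes the organism name.
  sed "/^>/ s/[[:space:]].*/$organism/" "$f" > "fasta_out/$(basename "$f")"
done

cat fasta_out/a.fasta fasta_out/b.fasta
```

Note that this form only changes headers that contain whitespace after the accession; a header consisting of the accession alone would be left as it is.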