How To Remove The Comments From A List Of Fasta Sequences
Hello,

I would like to know how to remove the comments from a list of FASTA sequences. I think that awk could provide a good solution, but I have not managed to make it work for this purpose. I welcome all possible solutions, but Bash-based ones are preferred.

A little bit of history for those who are interested... The fasta/Pearson sequence format, as described in the FASTA documentation, treats both the contents of the commonly used header line ('>') and the additional comment lines (starting with ';') as "comments". In common usage only the header lines are used, and most programs don't support the comment lines. See the Wikipedia article (http://en.wikipedia.org/wiki/FASTA_format) for a description of the full format.

To my knowledge, FASTA format doesn't include "comments": it has a header (ID + description), then sequence. Can you give an example of what you want to remove?

Maybe I call "comment" what you call "description", sorry. An example could be:

It could be also useful to know how to remove just the parts " 5'pad=0 3'pad=0 strand=- repeatMasking=none" and the ID (but I agree if you prefer to talk about this in a separate question).

PS: Thanks for the edit, neilfws: in effect it is very hard to deal with a Phocoenidae using awk :D.

Use sed:

$ curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" | head -n 3
>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT

For example, to keep only the gi (for the lines starting with '>', keep only the word after '>gi|' and print it with the prefix '>gi_'):

$ curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" |\
sed '/^>/s/^>gi|\([0-9]*\)|.*/>gi_\1/' | head -n 3
>gi_25
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT
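
To try the same substitution without hitting the network, here is a minimal sketch that feeds the sed expression a record copied from the efetch output above. The names `fasta_sample` and `keep_gi` are made up for this illustration and are not part of any tool.

```shell
# Sample record copied from the efetch output above (no network needed).
fasta_sample='>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC'

# keep_gi is a hypothetical helper name. On header lines only, it captures
# the digits after '>gi|' and rewrites the whole line as '>gi_<digits>'.
# Sequence lines do not match /^>/ and pass through unchanged.
keep_gi() {
  sed '/^>/s/^>gi|\([0-9]*\)|.*/>gi_\1/'
}

printf '%s\n' "$fasta_sample" | keep_gi
```

The `\(...\)` pair is what makes `\1` available in the replacement; in sed's basic regular expressions an unescaped `|` is just a literal pipe character, which is why the header delimiters need no escaping.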
If you want to remove the description and your headers are structured like this: "> + id + space + description", then sed can help:

sed -e 's/^\(>[^[:space:]]*\).*/\1/' my.fasta > mymodified.fasta
I chose this answer because it is the one I used, but I also tried the others and they work perfectly. Thanks to you all!
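
As a quick check of what the `[^[:space:]]` version does, the sketch below runs it on a small invented sample in which one header separates the ID from the description with a space and the other with a tab. The sequence names, descriptions, and the helper name `strip_description` are assumptions made up for the example.

```shell
# Hypothetical two-record sample: seq1 uses a space after the ID,
# seq2 uses a tab. printf expands \n and \t in its format string.
sample=$(printf '>seq1 first description\nACGT\n>seq2\tsecond description\nTTGA')

# strip_description is a hypothetical helper name. It keeps '>' plus the
# run of non-whitespace characters that follows and drops the rest of the
# line. Lines that do not start with '>' do not match and are left alone.
strip_description() {
  sed -e 's/^\(>[^[:space:]]*\).*/\1/'
}

printf '%s\n' "$sample" | strip_description
```

Because the character class excludes any whitespace, a tab-delimited description is removed as well, and sequence lines are never touched since the pattern is anchored on `>`.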

Just to add to the sed comments here, the command I would use is:

sed 's/ .*//' myfile.fasta

If I were to use awk, I'd do

awk '{print $1}' myfile.fasta

Both of these assume you don't have any spaces in your sequences.

I like to err on the side of simplicity when dealing with regexes - if it's hard for me to read/understand, it's hard for me to make sure it's working correctly. Pierre's solution is certainly easier to adapt to removing/modifying different parts, though.
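
If the no-spaces-in-sequences assumption is a worry, one possible variant (a sketch, not something proposed in the thread) is to restrict both commands to header lines. The helper names and the sample record below are invented for illustration.

```shell
# Hypothetical sample whose sequence line contains a space.
spaced_sample=$(printf '>seq1 some description\nACGT ACGT')

# Hypothetical helper: the /^>/ address applies the substitution to
# header lines only, so other lines are printed as they are.
headers_only_sed() {
  sed '/^>/ s/ .*//'
}

# Hypothetical helper: print the first field of header lines and
# every other line unchanged.
headers_only_awk() {
  awk '/^>/ { print $1; next } { print }'
}

printf '%s\n' "$spaced_sample" | headers_only_sed
printf '%s\n' "$spaced_sample" | headers_only_awk
```

This costs a few extra characters compared with the one-liners above, so whether it is worth it depends on how much you trust the input file.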