How To Remove The Comments From A List Of Fasta Sequences
12.2 years ago
Anima Mundi ★ 2.9k

Hello,

I would like to know how to remove the comments from a list of FASTA sequences. I think awk could provide a good solution, but I have not managed to make it work for this purpose. All solutions are welcome, but Bash-based ones are preferred.

fasta awk bash • 9.3k views

A little bit of history for those who are interested... The FASTA/Pearson sequence format, as described in the FASTA documentation, refers to both the contents of the commonly used header line ('>') and additional comment lines (starting with ';') as "comments". In common usage only the header lines are used, and most programs don't support the comment lines. See the Wikipedia article (http://en.wikipedia.org/wiki/FASTA_format) for a description of the full format.


To my knowledge, FASTA format doesn't include "comments": it has a header (ID + description), then sequence. Can you give an example of what you want to remove?


Maybe I call "comment" what you call "description", sorry. An example could be:

>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none

I would like to remove the part " range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none". It would also be useful to know how to remove just the parts " 5'pad=0 3'pad=0 strand=- repeatMasking=none" and the ID (but I agree if you prefer to talk about this in a separate question).


PS: Thanks for the edit, neilfws: in effect it is very hard to deal with a Phocoenidae using awk :D.

12.2 years ago

Use sed:

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" | head -n 3

>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT

For example, to keep only the GI number: on lines starting with '>', keep the number after '>gi|' and print it with the prefix '>gi_'.

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" |\
  sed '/^>/s/^>gi|\([0-9]*\)|.*/>gi_\1/' |head -n 3
>gi_25
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT
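
The same pattern (select the header lines with /^>/, then capture the part you want to keep) can probably be adapted to the UCSC-style header from the question. Here is a minimal sketch, assuming the fields are separated by single spaces and that the goal is to keep the ID and the range while dropping the rest. keep_id_and_range is a hypothetical helper name, and the sequence line is made up for the demo:

```shell
# keep_id_and_range: hypothetical helper (the name is an assumption).
# On header lines, keep the first two space-separated fields
# (the ID and the range=... field) and drop everything after them.
# Non-header lines do not match /^>/ and pass through unchanged.
keep_id_and_range() {
  sed '/^>/s/^\(>[^ ]* [^ ]*\) .*/\1/' "$@"
}

# Header taken from the question; the sequence line is invented.
keep_id_and_range <<'EOF'
>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none
ACGTACGTACGT
EOF
```

With this input the header should come out as ">mm9knownGeneuc007aet.1 range=chr1:3815714-3825863". Headers with fewer than three fields do not match the pattern and are left as they are.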
12.2 years ago

If you want to remove the description and your headers are structured as ">" + ID + space + description, then sed can help:

sed -e 's/^\(>[^[:space:]]*\).*/\1/' my.fasta > mymodified.fasta
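
To check the command before running it on a real file, you can try it on a tiny input first. A small sketch using the header from the question (the sequence line is made up, and strip_description is just a hypothetical wrapper name around the same sed expression):

```shell
# strip_description: hypothetical wrapper (the name is an assumption).
# Keeps '>' plus everything up to the first whitespace character on
# header lines; lines not starting with '>' do not match and are kept.
strip_description() {
  sed -e 's/^\(>[^[:space:]]*\).*/\1/' "$@"
}

# Header taken from the question; the sequence line is invented.
strip_description <<'EOF'
>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none
ACGTACGTACGT
EOF
```

The header should come out as ">mm9knownGeneuc007aet.1". Because [^[:space:]] also stops at tabs, the same expression should handle tab-separated descriptions.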

I chose this answer because it is the one I used, but I also tried the others and they work perfectly. Thanks to you all!

12.2 years ago
Fwip ▴ 500

Just to add to the sed comments here, the command I would use is:

sed 's/ .*//' myfile.fasta

If I were to use awk, I'd do

awk '{print $1}' myfile.fasta

Both of these assume you don't have any spaces in your sequences.

I like to err on the side of simplicity when dealing with regexes - if it's hard for me to read/understand, it's hard for me to make sure it's working correctly. Pierre's solution is certainly easier to adapt to removing/modifying different parts, though.
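
If your sequence lines might contain spaces, both commands can be restricted to header lines. A minimal sketch, assuming headers start with '>' and the ID itself contains no spaces (the function names are hypothetical, and the sample input is made up):

```shell
# sed variant: apply the substitution only to lines starting with '>'.
strip_headers_sed() {
  sed '/^>/s/ .*//' "$@"
}

# awk variant: print the first field of header lines,
# and print every other line unchanged.
strip_headers_awk() {
  awk '/^>/ {print $1; next} {print}' "$@"
}

# Invented input with a space inside the sequence line.
printf '%s\n' '>seq1 some description' 'ACGT ACGT' | strip_headers_sed
printf '%s\n' '>seq1 some description' 'ACGT ACGT' | strip_headers_awk
```

Both calls should print ">seq1" followed by the untouched "ACGT ACGT" line. The cost is a slightly longer expression, so the shorter versions are still fine when you know the sequences contain no spaces.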
