Question: How To Remove The Comments From A List Of Fasta Sequences
Anima Mundi (2.8k; Italy) wrote 9.0 years ago:

Hello,

I would like to know how to remove the comments from a list of FASTA sequences. I think that awk could provide a good solution, but I have not managed to make it work for this purpose. All solutions are welcome, but Bash-based ones are preferred.

fasta awk bash • 7.3k views
modified 9.0 years ago by Frédéric Mahé (3.1k) • written 9.0 years ago by Anima Mundi (2.8k)

A little bit of history for those who are interested... The FASTA/Pearson sequence format, as described in the FASTA documentation, refers to both the contents of the commonly used header line ('>') and the additional comment lines (starting with ';') as "comments". In common usage only the header lines are used, and most programs don't support the comment lines. See the Wikipedia article (http://en.wikipedia.org/wiki/FASTA_format) for a description of the full format.

written 9.0 years ago by Hamish (3.2k)

To my knowledge, FASTA format doesn't include "comments": it has a header (ID + description), then sequence. Can you give an example of what you want to remove?

written 9.0 years ago by Neilfws (49k)

Maybe I call "comment" what you call "description", sorry. An example could be:

>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none

I would like to remove the part " range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none". It would also be useful to know how to remove just the part " 5'pad=0 3'pad=0 strand=- repeatMasking=none" and the ID (though I understand if you would prefer to discuss this in a separate question).

written 9.0 years ago by Anima Mundi (2.8k)

PS: Thanks for the edit, neilfws: in effect it is very hard to deal with a Phocoenidae using awk :D.

written 9.0 years ago by Anima Mundi (2.8k)
Pierre Lindenbaum (134k; France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 9.0 years ago:

Use sed:

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" | head -n 3

>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT

For example, to keep only the gi number (on lines starting with '>', keep only the number after '>gi|' and print it with the prefix '>gi_'):

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" |\
  sed '/^>/s/^>gi|\([0-9]*\)|.*/>gi_\1/' |head -n 3
>gi_25
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT
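
The same header-only substitution can be adapted to keep a different field. As a minimal sketch, assuming headers of the form '>gi|NUMBER|DB|ACCESSION| description' like the one shown above, the following keeps only the accession. The function name keep_accession and the shortened sample record are made up for illustration, and the sample is inlined so that no network access is needed.

```shell
# Hypothetical sample record mimicking the NCBI header shown above.
fasta='>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACC'

# On header lines only, capture the field between the third and fourth '|'
# and print it after '>'. In basic regular expressions '|' is a literal.
keep_accession() {
  sed '/^>/s/^>gi|[0-9]*|[a-z]*|\([^|]*\)|.*/>\1/'
}

printf '%s\n' "$fasta" | keep_accession
```

Headers that do not match the assumed layout are left unchanged, so it is worth checking the result with grep '^>' before relying on it.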
written 9.0 years ago by Pierre Lindenbaum (134k)
Frédéric Mahé (3.1k; France, Montpellier, CIRAD) wrote 9.0 years ago:

If you want to remove the description and your headers are structured like this: "> + id + space + description", then sed can help:

sed -e 's/^\(>[^[:space:]]*\).*/\1/' my.fasta > mymodified.fasta
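
The asker also mentioned wanting to drop only the trailing attributes while keeping the range. A possible variation on the command above, assuming the header is "> + id + space + range + space + rest", keeps the first two whitespace-separated fields. The names make_demo and keep_id_and_range and the short sequence line are hypothetical, used only to make the sketch self-contained.

```shell
# Hypothetical one-record file built from the header quoted in the question.
make_demo() {
  cat <<'EOF'
>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none
ACGTACGT
EOF
}

# Keep '>' + id + one whitespace character + the next field; drop the rest.
# Sequence lines do not start with '>' and pass through untouched.
keep_id_and_range() {
  sed -e 's/^\(>[^[:space:]]*[[:space:]][^[:space:]]*\).*/\1/'
}

make_demo | keep_id_and_range
```

This relies on the range being the first field after the ID; headers with a different field order would need a different pattern.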
modified 9.0 years ago • written 9.0 years ago by Frédéric Mahé (3.1k)

I chose this answer because it is the one I used, but I also tried the others and they work perfectly. Thanks to you all!

written 9.0 years ago by Anima Mundi (2.8k)
Fwip (490; United States) wrote 9.0 years ago:

Just to add to the sed comments here, the command I would use is:

sed 's/ .*//' myfile.fasta

If I were to use awk, I'd do

awk '{print $1}' myfile.fasta

Both of these assume you don't have any spaces in your sequences.
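
If that assumption might not hold, one hedge is to restrict both commands to header lines. The sketch below uses a made-up two-line record (with a space deliberately placed in the sequence line) and hypothetical function names to show the idea.

```shell
# Hypothetical record: a header with a description and a sequence line
# that contains a space.
demo() { printf '%s\n' '>seq1 some description here' 'ACGT ACGT'; }

# sed: delete from the first space onward, but only on lines starting with '>'.
headers_only_sed() { sed '/^>/s/ .*//'; }

# awk: print the first field of header lines, and other lines unchanged.
headers_only_awk() { awk '/^>/ {print $1; next} {print}'; }

demo | headers_only_sed
demo | headers_only_awk
```

Both variants leave the sequence line as it is and shorten the header to '>seq1'.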

I like to err on the side of simplicity when dealing with regexes - if it's hard for me to read/understand, it's hard for me to make sure it's working correctly. Pierre's solution is certainly easier to adapt to removing/modifying different parts, though.

written 9.0 years ago by Fwip (490)