Question: How to remove the header line with ">" in a fasta file
bright602 wrote:

Hi,

I am a beginner in bioinformatics. I have a fasta file like below (plz ignore the "|")

> Chr3:183153228-183153246
TGGAAAGGACGAAACACCGCG
> Chr3:183286843-183286861
CTAGAAATAGCAAGTTAAA

How do I remove the headers so that I can extract the sequences?

TGGAAAGGACGAAACACCGCG
CTAGAAATAGCAAGTTAAA

Thank you for your help.

sequencing genome • 4.2k views
modified 2.9 years ago by genomax • written 2.9 years ago by bright602

Asaf (Israel) wrote:

On Linux: grep -v ">" filename

written 2.9 years ago by Asaf

Just as an alternative

sed '/^>/d' foo.fa > out.fa
written 2.9 years ago by venu

ablanchetcohen (Canada) wrote:
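As a minimal sketch, the two commands above can be tried on the records from the question. The file names example.fa, seqs_grep.txt and seqs_sed.txt are made up for illustration, and the grep pattern is anchored with ^ here (an addition to the original command) so that only lines starting with ">" are dropped.

```shell
# Recreate the two records from the question in a hypothetical file.
printf '%s\n' \
  '> Chr3:183153228-183153246' \
  'TGGAAAGGACGAAACACCGCG' \
  '> Chr3:183286843-183286861' \
  'CTAGAAATAGCAAGTTAAA' > example.fa

# grep -v inverts the match: keep every line that does NOT start with ">".
grep -v '^>' example.fa > seqs_grep.txt

# Same idea with sed: delete the lines that start with ">".
sed '/^>/d' example.fa > seqs_sed.txt

# Show the extracted sequences.
cat seqs_grep.txt
```

Both output files should contain only the two sequence lines.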

If you just want to remove all lines starting with ">", you could just use grep, among other options.

grep -v ">" file.fasta > file_without_header.txt
modified 2.9 years ago by RamRS • written 2.9 years ago by ablanchetcohen

Antonio R. Franco (Universidad de Córdoba, Spain) wrote:

I am wondering why you want to do that. If you erase that line, all of your sequences will be mixed together with nothing to distinguish one from another, unless each of them sits on a single line. In addition, most programs can handle that line and use it to maintain the identity of each sequence.
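To illustrate the single-line caveat: when sequences are wrapped over several lines, deleting the headers leaves fragments that can no longer be assigned to a record. One possible workaround, sketched here with awk (not something proposed in this thread), is to join the lines of each record before discarding its header. The file names wrapped.fa and one_per_line.txt and the toy records are hypothetical.

```shell
# Hypothetical FASTA file in which seq1 is wrapped over two lines.
printf '%s\n' \
  '>seq1' \
  'ACGT' \
  'TTGA' \
  '>seq2' \
  'GGCC' > wrapped.fa

# Plain header removal leaves three lines, so the record boundary is lost.
grep -v '^>' wrapped.fa | wc -l

# Concatenate the lines of each record and print one sequence per line.
awk '/^>/ { if (seq != "") print seq; seq = ""; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' wrapped.fa > one_per_line.txt

cat one_per_line.txt
```

With this input, one_per_line.txt holds one line per record, so line N still corresponds to the Nth header of the original file.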

modified 2.9 years ago • written 2.9 years ago by Antonio R. Franco

This was recently discussed in a similar thread on Biostars and I had posted the reason below, which I will reproduce here.

Reason I do this sometimes is to cluster (sort|uniq) and/or count number of unique sequences.

written 2.9 years ago by genomax

It makes sense this way.

written 2.9 years ago by Antonio R. Franco

If you want to do that, you can pipe everything through sort and uniq: grep -v '>' file.fasta | sort | uniq | wc -l gives the number of unique sequences, and grep -v '>' file.fasta | sort | uniq -c gives the number of times each sequence appears. However, clustering is different from counting. The examples here get you counts; if you want to cluster (you'll want to change the headers), then you would need a script to do so. I recommend Biopython for parsing your sequences, for ease of use. Again, this all assumes the OP is on Linux and has no coding experience. Alternatively, you can use the collapser from fastx_toolkit.
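A small worked example of the counting pipelines, assuming every sequence sits on a single line. The file names dup.fa and counts.txt and the three toy reads are invented for illustration, and the final sort by count is an extra step that is not part of the commands above.

```shell
# Hypothetical input: three records, two of which share the same sequence.
printf '%s\n' \
  '>read1' 'ACGT' \
  '>read2' 'TTGA' \
  '>read3' 'ACGT' > dup.fa

# Number of distinct sequences (uniq only collapses adjacent lines, hence the sort).
n_unique=$(grep -v '^>' dup.fa | sort | uniq | wc -l)
echo "unique sequences: $n_unique"

# Occurrences of each sequence, most frequent first.
grep -v '^>' dup.fa | sort | uniq -c | sort -k1,1nr > counts.txt
cat counts.txt
```

Here ACGT is counted twice and TTGA once. The exact whitespace in the output of wc -l and uniq -c differs between systems, so it is safer to parse the columns than to compare whole lines.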

modified 2.9 years ago • written 2.9 years ago by st.ph.n

Can you clarify what is the meaning of clustering in this thread?

written 2.9 years ago by Antonio R. Franco

I agree, what's your reasoning for doing this?

written 2.9 years ago by st.ph.n

In fact, that is my point as well: why would one want to do that? What is the motivation behind it?

written 2.9 years ago by ivivek_ngs