Question: How to delete the last 5 characters of a FASTA header?
angela1 wrote (6 months ago):

Hi,

I am trying to remove the last 5 characters from the FASTA headers in my sequencing data. I have ≈400,000 sequences and have tried to use the sed command in the terminal to do this.

Input text:

>1-4-8.45  
TAGGGAGA

Expected Output:

>1-4           
TAGGGAGA

How can I use the sed command to remove the last 5 characters from my FASTA headers?

wm wrote (6 months ago):

Using sed. Note that this solution does not account for trailing whitespace in the header, so any trailing spaces count toward the 5 characters removed.

$ sed '/^>/s/.\{5\}$//' in.fa
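
If the headers do carry trailing whitespace, as in the example input, one option is to strip it first and then remove the 5 characters. This is a minimal sketch, assuming a sed that supports -E (GNU and BSD sed both do); the temporary file and its contents are made up for illustration.

```shell
# Hypothetical sample file: the header ends in two trailing spaces.
tmp=$(mktemp)
printf '>1-4-8.45  \nTAGGGAGA\n' > "$tmp"

# Expression 1: strip trailing whitespace from header lines only.
# Expression 2: remove the last 5 remaining characters from header lines.
trimmed=$(sed -E -e '/^>/ s/[[:space:]]+$//' -e '/^>/ s/.{5}$//' "$tmp")
printf '%s\n' "$trimmed"

rm -f "$tmp"
```

Running the whitespace strip as its own expression keeps .{5}$ from counting spaces (or a stray carriage return) among the five characters.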

For FASTA and FASTQ files, bioawk (https://github.com/lh3/bioawk) is also a good option; it separates the header into $name and $comment.

$ bioawk -cfastx '{id=substr($name, 1, length($name) - 5); print ">"id"\n"$seq}' in.fa
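
If bioawk is not installed, roughly the same idea can be sketched with plain awk. This is only an approximation under stated assumptions: the name is taken to be the first whitespace-delimited token of the header, anything after it (the comment) is dropped, and sequence lines are passed through unchanged. The sample record is hypothetical.

```shell
# Hypothetical sample record with a comment after the name.
tmp=$(mktemp)
printf '>1-4-8.45 some comment\nTAGGGAGA\n' > "$tmp"

awk_out=$(awk '
  /^>/ {
    name = $1                      # first field, e.g. ">1-4-8.45"
    if (length(name) > 6)          # keep at least one character after ">"
      name = substr(name, 1, length(name) - 5)
    print name                     # the comment is intentionally dropped
    next
  }
  { print }                        # sequence lines pass through untouched
' "$tmp")
printf '%s\n' "$awk_out"

rm -f "$tmp"
```

Note that awk's substr() counts positions from 1, which is why the start index is 1 rather than 0.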
ADD COMMENTlink written 6 months ago by wm490
RamRS wrote (6 months ago):

What have you tried? This sort of problem has been addressed on the site multiple times.

sed can match the first character of each line to select the lines on which an operation is performed; you can use that to restrict the operation to header lines. You can also capture the last five characters with the regex (.{5})$ (this is extended regex syntax, so sed needs the -E option).
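
As a small illustration of these two hints, the sketch below only previews what would be captured rather than deleting anything. It assumes a sed with -E support and a made-up one-record input without trailing whitespace; the bracket formatting is just for display.

```shell
# The /^>/ address restricts the substitution to header lines.
# Group 1 is everything before the last five characters, group 2 is those five.
# -n plus the p flag prints only the lines where the substitution happened.
preview=$(printf '>1-4-8.45\nTAGGGAGA\n' | sed -nE '/^>/ s/^(.*)(.{5})$/\1 [\2]/p')
printf '%s\n' "$preview"
```

Checking the bracketed part on a few headers before editing the file is a cheap way to confirm that the five characters are the ones you expect.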

Please use these hints to get to the solution.
