Question: How To Rename FASTA Headers
0
gravatar for mollysil
2.7 years ago by
mollysil0
mollysil0 wrote:

Hi,

I have very little experience with scripts. I want to change my FASTA sequence headers (I have 100's of FASTA sequences per file) from very long headers to headers with the sample name (CM1) and then ascending numbers. Basically, I want to go from this:

>M00505:63:000000000AWRE0:1:1101:11224:1105_1:N:0:NCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1101:11224:1105_2:N:0:NCTACGCT+CTAAGCCT;size=290797;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>M00505:63:000000000AWRE0:1:1107:23836:6960_1:N:0:GCTACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1107:23836:6960_2:N:0:GCTACGCT+CTAAGCCT;size=2;
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>M00505:63:000000000AWRE0:1:1103:16981:11028_1:N:0:GATACGCT+CTAAGCCT/M00505:63:000000000AWRE0:1:1103:16981:11028_2:N:0:GATACGCT+CTAAGCCT;size=1;
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT

To this:

>CM1_1
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGGTCTCACTGATTCCTAACCCCTGATGAACCGTAATGCCATTAATTTGGTGTTGCGGGGAATTTGGACTGAAGGGGGAAAAAATTAGAGTGTTTAAAGCAAGCTA
>CM1_2
GGGTTAGTAGGTTGGTCAGCCCTCTGGTATGTACTGGTCTCACTGATTCCTCCTTTTCCATGAACCGTAATGCCATTAATTTTGAATTGCGGGAAATTTGAACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT
>CM1_3
GGGTTAGTAGGTTGGTCATGCCTCTGGTATGTACTGRTCTCACTGATTCCTCCTTCCTGACGAACTGTAATGCCATTAATTTGGTGTTGCAGGRAATTTGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAAGCT

I am able to do this with two separate scripts using:

sed 's/>.*/&CM1/' file.fa > output.fa  
cat output.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > output2.fa

But I would like to do it all in one step. Any ideas?
THANK YOU!! Molly

header fasta • 3.3k views
ADD COMMENTlink modified 20 months ago by h.mon27k • written 2.7 years ago by mollysil0
3
gravatar for venu
2.7 years ago by
venu6.2k
Germany
venu6.2k wrote:

You can do following

sed '/^>/d' file.fa | wc -l # say you got 100

Then

for i in {1..100}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa) | sed -e 's/^/>/' -e 's/\t/\n/' > new_file.fa

Update: Note that this assumes each sequences in the file is single line without any newline characters.

cat file.fa

>chr1
AGGATGACAGA
>chr2
GACTTTAAGAC

number of sequences in file.fa

sed '/^>/d' file.fa | wc -l
# 2

for loop to generate headers like CM1_1, CM1_2

for i in {1..2}; do echo CM1_$i; done
# CM1_1
# CM1_2

Paste these new headers to sequences whose headers are removed

for i in {1..2}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa)

# CM1_1   AGGATGACAGA
# CM1_2   GACTTTAAGAC

Finally arrange it to a standard fasta format

for i in {1..2}; do echo CM1_$i; done | paste - <(sed '/^>/d' file.fa) |sed -e 's/^/>/' -e 's/\t/\n/'

# >CM1_1
# AGGATGACAGA
# >CM1_2
# GACTTTAAGAC
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by venu6.2k

im bit new to scripting so can you explain what does the for loop after initialization does from (1..100) i understood since the total no of lines perhaps , so what after that loop does if you can explain it would be really good....

ADD REPLYlink written 2.7 years ago by krushnach80570

Check the updated answer.

ADD REPLYlink written 2.7 years ago by venu6.2k

This was a cool approach! When I did your suggestion, it got the header name correct (finally!) but it cut huge portions of my sequences... How do I get it to not do that? (I first linearized all the fasta files and then ran your scripts...but I still get this issue of cutting sequences). Now it looks like:

CM1_1 GGGTCAGCAGGTTGGTCGTGCCACTGGTATGAACTGGCCTTGCTGATTCCTCCTTCTTGAAGAACCGTGATGCCATTAAT CM1_2 TTGGTGTTGCGGGGAAACAGGACTGTTACTTTGAAAAAATTAGAGTGTTTAAAGCAGGCTAACGCTTGAATACATTAGCA CM1_3 TGGAATAATAAAATAGGACGTTTAATCCTATTTTGTTGGTTTCTAGGGTTGACGTA

But my sequences are supposed to be a lot longer than that. Thanks so much for your input! Any suggestion for this problem?

ADD REPLYlink written 2.7 years ago by mollysil0
1
gravatar for Brian Bushnell
2.7 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can do that with the BBMap package:

rename.sh in=x.fasta out=y.fasta prefix=CM1
ADD COMMENTlink written 2.7 years ago by Brian Bushnell16k

I am using Linux. rename doesn't work. Is there a linux version of this?

ADD REPLYlink written 2.7 years ago by mollysil0
1

It does work in linux. Could you post how you've used?

ADD REPLYlink written 2.7 years ago by venu6.2k
1

Its only dependency is having Java installed.

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k
1

I think @mollysil 's confusion is that BBMap package must be installed in order for this command to work. It's true that rename.sh is not a standard Unix tool...

ADD REPLYlink written 2.7 years ago by novice920
1

True! Unfortunate, but true :)

I'm kidding, of course. BBMap is irrelevant to at least 99.9% of Linux users.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Brian Bushnell16k

Can you provide me with step by step instructions on how to get BBMap? I have java but the tar script on the bbmap website didn't work. (BBMap_35.74.tar.gz: Cannot open: No such file or directory). My apologies for my ignorance of this stuff. Thanks for all your help!!

ADD REPLYlink written 2.7 years ago by mollysil0

If you are in the same directory as BBMap_35.74.tar.gz (which is not the latest version, by the way; I recommend using the latest version, which is 36.67), you would execute:

tar -xzf BBMap_35.74.tar.gz

It sounds like you are in a different directory. What do you get when you run "ls"?

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Brian Bushnell16k

Yes you are correct. I hadn't downloaded the tar.gz file into my working directory. Did that. Now your script ran just fine. After installing with that code, do I still need to download certain tools within BBMap? Because the rename script doesn't work still. If so, what is the code for that?

ADD REPLYlink written 2.7 years ago by mollysil0

As long as Java is installed and you are using the bash shell, everything should work fine... Can you post the exact command and screen output?

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k

Also, will your original rename script do both of the two tasks I want (adding the CM1 sample number AND add ascending OTU numbers, in that order)? Or will it just put the CM1 in the header? Which part of the code is for adding the ascending OTU numbers?

ADD REPLYlink written 2.7 years ago by mollysil0

The command I posted will rename the sequences like this:

CM1_1

CM1_2

...etc. The default is to use ascending numbers. You can get more information by running rename.sh with no parameters or just looking at it in a text editor.

ADD REPLYlink written 2.7 years ago by Brian Bushnell16k

I got it all to work. Thank you so much for your assistance!!

ADD REPLYlink written 2.7 years ago by mollysil0
1
gravatar for shenwei356
2.7 years ago by
shenwei3564.7k
China
shenwei3564.7k wrote:

Here's a solution using SeqKit.

It provides executable binary files for Windows/Linux/Mac OS, so just download and run, no any dependencies :)

for f in *.fasta *.fa; do
    seqkit replace -p '.+' -r 'CM1_{nr}' $f > $f.rename.fa
done
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by shenwei3564.7k

Can you provide me with the code for downloading SeqKit on the Linux platform?

ADD REPLYlink written 2.7 years ago by mollysil0

well

wget https://github.com/shenwei356/seqkit/releases/download/v0.4.0/seqkit_linux_amd64.tar.gz
tar -zxvf seqkit_linux_amd64.tar.gz
mkdir -p ~/bin
mv seqkit ~/bin
ADD REPLYlink written 2.7 years ago by shenwei3564.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1496 users visited in the last hour