Rename multiple FASTA headers for multiple FASTA files
7.2 years ago

Hi,

I have multiple FASTA files and I want to change their headers. For a single file I am using:

awk '/^>/{print ">C1_" ++i; next}{print}' C1_pandaseq.fasta > C1_pandaseq_new.fasta

Input FASTA:

>M03419:60:656544:1:1101:25150:3877:1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>M03419:60:656544:1:1101:8498:4267:1
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>M03419:60:656544:1:1101:7884:4445:1
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

Output FASTA:

>C1_1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>C1_2
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>C1_3
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

Similarly, I have multiple FASTA files, named like this:

C2_pandaseq.fasta
C4_pandaseq.fasta
C5_pandaseq.fasta
C8_pandaseq.fasta
T2_pandaseq.fasta
T7_pandaseq.fasta

So I need to rename the headers in all of the FASTA files, e.g.:

for fasta file C2_pandaseq.fasta
>C2_1
AAAAAAAAAAAAAAAAAAAAAAA
>C2_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

for fasta file C4_pandaseq.fasta
>C4_1
AAAAAAAAAAAAAAAAAAAAAAA
>C4_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

and so on...

For each FASTA file, I need to rename the headers according to the file name only. I think I need a for loop for this, but I don't know how to write it.

Any help would be appreciated. Thanks.

genome next-gen sequencing sequence • 5.0k views

Hello bioinformaticssrm2011!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=74234

This is typically not recommended as it runs the risk of annoying people in both communities.


I understand, but I was not able to use the script mentioned there for my work, though I appreciate their help.

7.2 years ago
ls *_pandaseq.fasta | cut -d "_" -f 1 | while read -r PREFIX; do awk -v P="${PREFIX}" '/^>/{print ">" P "_" ++i; next}{print}' "${PREFIX}_pandaseq.fasta" > "${PREFIX}_pandaseq_new.fasta"; done

Thanks, Pierre. I learned something new here since I don't typically use awk. I first tried to reference your answer on this post, having a similar approach with cut, but couldn't figure out how to pass the variable to the awk statement for the headers. Now I know to use -v.


Thank you Pierre. It works.

7.2 years ago

Combining seqkit and rush:

ls *_pandaseq.fasta \
    | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa'

Explanation:

  • seqkit replace rewrites the FASTA headers. {nr} is the record number, i.e. 1, 2, 3, ...
  • rush is a GNU parallel-like tool.
    • {} is the input. e.g., C1_pandaseq.fasta.
    • {^_pandaseq.fasta} is used to remove suffix _pandaseq.fasta. e.g., C1_pandaseq.fasta becomes C1.
    • {.} removes last file extension. e.g., C1_pandaseq.fasta becomes C1_pandaseq.
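
For readers who do not use rush, the two filename transformations described above can be mimicked in a few lines of Python. This is only an illustration of what the placeholders produce, not how rush implements them; the helper names strip_suffix and strip_extension are made up for this sketch.

```python
import os


def strip_suffix(name, suffix):
    """Mimic {^suffix}: drop the given suffix if the name ends with it."""
    if suffix and name.endswith(suffix):
        return name[: -len(suffix)]
    return name


def strip_extension(name):
    """Mimic {.}: drop only the last file extension."""
    return os.path.splitext(name)[0]


name = "C1_pandaseq.fasta"
prefix = strip_suffix(name, "_pandaseq.fasta")   # used in the new headers
out_name = strip_extension(name) + ".fa"         # used as the output file name
print(prefix, out_name)
```

With the file name from the example, the prefix is C1 and the output file is C1_pandaseq.fa, which matches the dry run.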

A dry run example:

$ ls *_pandaseq.fasta
C1_pandaseq.fasta  C4_pandaseq.fasta

$ ls *_pandaseq.fasta \
      | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa' --dry-run
cat C4_pandaseq.fasta | seqkit replace -p ".+" -r "C4_{nr}" > C4_pandaseq.fa
cat C1_pandaseq.fasta | seqkit replace -p ".+" -r "C1_{nr}" > C1_pandaseq.fa
7.2 years ago
st.ph.n ★ 2.7k

Looks like you almost have it. See this post for using awk.

If you want to use python (2.7):

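#!/usr/bin/env python
# Python 2.7. Assumes single-line FASTA: each header is followed by one sequence line.

import os
import sys

inp = sys.argv[1]

# Prefix is the part of the file name before the first underscore, e.g. C1
prefix = os.path.basename(inp).split('_')[0]

outfile = open(inp.split('.fasta')[0] + '_new.fasta', 'w')

with open(inp, 'r') as f:
    numb = 0
    for line in f:
        if line.startswith('>'):
            numb += 1
            # New header, then the sequence line that follows it
            print >> outfile, '>' + prefix + '_' + str(numb) + '\n' + next(f).strip()

outfile.close()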

To run, save the script as rename_headers.py (or any name you like). List your files in a text file with ls -1 *_pandaseq.fasta > files.txt, then run cat files.txt | xargs -n 1 python rename_headers.py

This assumes all your FASTA files are single-line. If you have multi-line FASTA files, you can linearize them with an awk statement from Pierre.
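
If the files might contain wrapped (multi-line) sequences, one alternative to linearizing first is to rewrite only the header lines and copy every other line through untouched. Below is a minimal Python 3 sketch of that idea. The function names rename_headers and prefix_from_filename are made up for this example, and it assumes the prefix is whatever precedes the first underscore in the file name.

```python
import glob
import os


def prefix_from_filename(path):
    """Return the part of the file name before the first underscore."""
    return os.path.basename(path).split("_")[0]


def rename_headers(in_path, out_path, prefix):
    """Copy in_path to out_path, replacing each header with >prefix_N.

    Returns the number of records written.
    """
    count = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                count += 1
                dst.write(">{}_{}\n".format(prefix, count))
            else:
                # Sequence lines pass through unchanged, wrapped or not.
                dst.write(line)
    return count


if __name__ == "__main__":
    # Process every matching file in the current directory.
    for path in sorted(glob.glob("*_pandaseq.fasta")):
        out_path = path[: -len(".fasta")] + "_new.fasta"
        n = rename_headers(path, out_path, prefix_from_filename(path))
        print("{}: {} records -> {}".format(path, n, out_path))
```

Run it from the directory holding the *_pandaseq.fasta files; because the loop is inside the script, no files.txt or xargs step is needed.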


No matter how much I love Python, for simple jobs like this it's best to use the available GNU/command-line tools. It's quite pointless to write a Python script every time you need to get something done :p

Oh, and avoid the print >> outfile syntax, which is old and shouldn't be used anymore. Instead, use outfile.write("yourtexthere")
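
As a rough illustration of that suggestion in Python 3, here are two ways to write a record to a file handle: an explicit write() call, and the print() function with file=. The helper names write_record and print_record are invented for this sketch, and an in-memory io.StringIO buffer stands in for a real output file.

```python
import io


def write_record(handle, prefix, number, sequence):
    """Write one FASTA record with an explicit newline after each line."""
    handle.write(">{}_{}\n{}\n".format(prefix, number, sequence))


def print_record(handle, prefix, number, sequence):
    """Produce the same output using print() with file=."""
    print(">{}_{}".format(prefix, number), sequence, sep="\n", file=handle)


# In-memory buffers stand in for open file handles.
buf_a, buf_b = io.StringIO(), io.StringIO()
write_record(buf_a, "C1", 1, "CCTACGGGTGGC")
print_record(buf_b, "C1", 1, "CCTACGGGTGGC")
```

Both buffers end up with identical contents. Unlike print, write() does not append a newline, so it has to be included in the string.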


@WouterDeCoster - I first approached this with awk, see comment on Pierre's answer. In regards to syntax, I noted using Python 2.7, and still need to get used to 3+.
