Question: rename multiple FASTA headers for multiple FASTA files
2.6 years ago by
India
bioinformaticssrm2011 wrote:

Hi,

I have multiple FASTA files and I want to change their headers. For a single file I am using:

awk '/^>/{print ">C1_" ++i; next}{print}' C1_pandaseq.fasta > C1_pandaseq_new.fasta

Input FASTA:

>M03419:60:656544:1:1101:25150:3877:1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>M03419:60:656544:1:1101:8498:4267:1
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>M03419:60:656544:1:1101:7884:4445:1
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

Output FASTA:

>C1_1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>C1_2
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>C1_3
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

Similarly, I have multiple FASTA files, named like this:

C2_pandaseq.fasta
C4_pandaseq.fasta
C5_pandaseq.fasta
C8_pandaseq.fasta
T2_pandaseq.fasta
T7_pandaseq.fasta

So I need to rename the headers in every FASTA file, e.g.,

for fasta file C2_pandaseq.fasta
>C2_1
AAAAAAAAAAAAAAAAAAAAAAA
>C2_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

for fasta file C4_pandaseq.fasta
>C4_1
AAAAAAAAAAAAAAAAAAAAAAA
>C4_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

and so on...

For each FASTA file, I need to rename the headers according to the file name only. I think I need to write a for loop for this, but I don't know how to do that.

Any help would be appreciated. Thanks.


Hello bioinformaticssrm2011!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=74234

This is typically not recommended as it runs the risk of annoying people in both communities.

Reply written 2.6 years ago by genomax

I understand, but I was not able to use the script mentioned there for my work, though I appreciate their help.

Reply written 2.6 years ago by bioinformaticssrm2011
2.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:
 ls *_pandaseq.fasta | cut -d "_" -f 1 | while read PREFIX; do awk -v P=${PREFIX} '/^>/{print ">" P "_" ++i; next}{print}' ${PREFIX}_pandaseq.fasta > ${PREFIX}_pandaseq_new.fasta ; done
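Since the question asked for a for loop, here is a minimal sketch of the same idea written as one. It assumes the files sit in the current directory and that the prefix is everything before the first underscore. The function name rename_headers is made up for illustration; the awk program and the _new.fasta output naming are taken from the one-liner above.

```shell
# Hypothetical helper: renumber the headers of one FASTA file using its file-name prefix.
rename_headers() {
    in=$1
    prefix=${in%%_*}               # C2_pandaseq.fasta -> C2 (assumes no directory part)
    out=${in%.fasta}_new.fasta     # C2_pandaseq.fasta -> C2_pandaseq_new.fasta
    # -v passes the shell variable into awk as P; i counts header lines.
    awk -v P="$prefix" '/^>/{print ">" P "_" ++i; next}{print}' "$in" > "$out"
}

for f in *_pandaseq.fasta; do
    [ -e "$f" ] || continue        # skip the literal pattern if nothing matched
    rename_headers "$f"
done
```

Globbing directly in the for loop avoids parsing the output of ls, and quoting the variables keeps the loop safe for file names containing spaces.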

Thanks, Pierre. I learned something new here since I don't typically use awk. I first tried to reference your answer on this post, having a similar approach with cut, but couldn't figure out how to pass the variable to the awk statement for the headers. Now I know to use -v.

Reply written 2.6 years ago by st.ph.n

Thank you Pierre. It works.

Reply written 2.6 years ago by bioinformaticssrm2011
2.6 years ago by
China
shenwei356 wrote:

Combining seqkit and rush:

ls *_pandaseq.fasta \
    | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa'

Explanation:

  • seqkit renames the FASTA headers. {nr} is the record number, i.e. 1, 2, 3, ...
  • rush is a GNU parallel-like tool.
    • {} is the input, e.g. C1_pandaseq.fasta.
    • {^_pandaseq.fasta} removes the suffix _pandaseq.fasta, e.g. C1_pandaseq.fasta becomes C1.
    • {.} removes the last file extension, e.g. C1_pandaseq.fasta becomes C1_pandaseq.

A dry run example:

$ ls *_pandaseq.fasta
C1_pandaseq.fasta  C4_pandaseq.fasta

$ ls *_pandaseq.fasta \
      | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa' --dry-run
cat C4_pandaseq.fasta | seqkit replace -p ".+" -r "C4_{nr}" > C4_pandaseq.fa
cat C1_pandaseq.fasta | seqkit replace -p ".+" -r "C1_{nr}" > C1_pandaseq.fa
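For readers who do not have rush installed, the two file-name placeholders above correspond roughly to POSIX shell parameter expansion. This is only a sketch of the string handling, not a replacement for the parallel execution rush provides. The variable names f, prefix and stem are made up for illustration.

```shell
f=C1_pandaseq.fasta
prefix=${f%_pandaseq.fasta}    # like {^_pandaseq.fasta}: strips the given suffix
stem=${f%.*}                   # like {.}: strips the last file extension
echo "$prefix $stem"           # prints: C1 C1_pandaseq
```

Inside a plain for loop over *_pandaseq.fasta, "$prefix" would supply the header prefix and "$stem.fa" the output name used in the command above.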
2.6 years ago by
Philadelphia, PA
st.ph.n wrote:

Looks like you almost have it. See this post for using awk.

If you want to use python (2.7):

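#!/usr/bin/env python
# Python 2.7: renumber FASTA headers using the file-name prefix (C2_1, C2_2, ...)

import sys

inp = sys.argv[1]
prefix = inp.split('_')[0]

with open(inp, 'r') as f, open(inp.split('.fasta')[0] + '_new.fasta', 'w') as outfile:
    numb = 0
    for line in f:
        if line.startswith('>'):
            numb += 1
            # Single-line FASTA only: the sequence is the line right after the header
            print >> outfile, '>' + prefix + '_' + str(numb) + '\n' + next(f).strip()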

To run, save the script as rename_headers.py (or whatever you want). List your files in a text file with ls -1 *_pandaseq.fasta > files.txt, then run cat files.txt | xargs -n 1 python rename_headers.py

This assumes all your FASTA files have single-line sequences. If you have multi-line FASTA files, you can linearize them with an awk statement from Pierre.
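The awk statement referred to is not reproduced in this thread, so the following is only a sketch of one common way to linearize a multi-line FASTA file, and not necessarily Pierre's exact command. The function name linearize_fasta is hypothetical. It assumes Unix line endings and that every record has at least one sequence line.

```shell
# Hypothetical helper: join wrapped sequence lines so each record is header + one line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }   # flush previous record
         { seq = seq $0 }                                           # append sequence line
         END { if (seq != "") print seq }' "$1"                     # flush last record
}
```

Something like linearize_fasta C2_pandaseq.fasta > C2_linear.fasta (output name chosen here only as an example) would then produce a file the script above can handle.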


No matter how much I love Python, for simple jobs like this it's best to use the available GNU/command-line tools. It's quite pointless to write a Python script every time you need to get something done :p

Oh, and avoid the print >> outfile syntax, which is old and shouldn't be used anymore. Instead, use outfile.write("yourtexthere")

Reply written 2.6 years ago by WouterDeCoster

@WouterDeCoster - I first approached this with awk, see comment on Pierre's answer. In regards to syntax, I noted using Python 2.7, and still need to get used to 3+.

Reply written 2.6 years ago by st.ph.n