Question

Rename multiple FASTA headers for multiple FASTA files

0

7.2 years ago

bioinformaticssrm2011 ▴ 90

Hi,

I have multiple FASTA files and I want to change the headers. For one file I am using:

awk '/^>/{print ">C1_" ++i; next}{print}' C1_pandaseq.fasta > C1_pandaseq_new.fasta

input fasta-

>M03419:60:656544:1:1101:25150:3877:1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>M03419:60:656544:1:1101:8498:4267:1
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>M03419:60:656544:1:1101:7884:4445:1
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

output fasta-

>C1_1
CCTACGGGTGGCTGCAGTGGGGAATTTTGGACAA
>C1_2
ACTACGGGAGGCAGCAGGGGGGAATTTTGGACAATGGAAACCAGGG
>C1_3
CCTACGGGTGGCAGCAGTGGGGAATATTGGACAATGCAACCCTGATCCAGC

Similarly, I have multiple FASTA files, named like this:

C2_pandaseq.fasta
C4_pandaseq.fasta
C5_pandaseq.fasta
C8_pandaseq.fasta
T2_pandaseq.fasta
T7_pandaseq.fasta

So I need to rename the headers in all the FASTA files, e.g.,

for fasta file C2_pandaseq.fasta
>C2_1
AAAAAAAAAAAAAAAAAAAAAAA
>C2_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

for fasta file C4_pandaseq.fasta
>C4_1
AAAAAAAAAAAAAAAAAAAAAAA
>C4_2
ATTTTGGGGGGGGCCCCCCCGGGGGGGA

and so on...

For each FASTA file, I need to rename the headers according to the file name only. I think I need a for loop for this, but I don't know how to write it.

Any help. Thanks

genome next-gen sequencing sequence • 5.0k views

ADD COMMENT • link updated 7.2 years ago by shenwei356 8.4k • written 7.2 years ago by bioinformaticssrm2011 ▴ 90

0

Hello bioinformaticssrm2011!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=74234

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

I understand, but I was not able to use the script mentioned there for my work, though I appreciate their help.

ADD REPLY • link 7.2 years ago by bioinformaticssrm2011 ▴ 90

1

7.2 years ago

shenwei356 8.4k

Combining seqkit and rush:

ls *_pandaseq.fasta \
    | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa'

Explanation:

SeqKit is used to rename the FASTA headers. {nr} is the record number, i.e. 1, 2, 3, ...
rush is a GNU parallel-like tool.
- {} is the input. e.g., C1_pandaseq.fasta.
- {^_pandaseq.fasta} is used to remove suffix _pandaseq.fasta. e.g., C1_pandaseq.fasta becomes C1.
- {.} removes last file extension. e.g., C1_pandaseq.fasta becomes C1_pandaseq.
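
To make the placeholder expansion concrete, here is a minimal Python sketch of the string manipulation described above. It does not call rush; the function name `expand_placeholders` is hypothetical, and it assumes the input names end in `_pandaseq.fasta`.

```python
import os


def expand_placeholders(filename, suffix="_pandaseq.fasta"):
    """Mimic the two rush placeholders described above (illustration only).

    Returns (prefix, stem):
      prefix ~ {^_pandaseq.fasta}: the file name with the given suffix removed
      stem   ~ {.}: the file name with its last extension removed
    """
    # Remove the suffix only if the name really ends with it.
    prefix = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    # os.path.splitext drops only the last extension (".fasta" here).
    stem = os.path.splitext(filename)[0]
    return prefix, stem


prefix, stem = expand_placeholders("C1_pandaseq.fasta")
print(prefix)        # C1
print(stem + ".fa")  # C1_pandaseq.fa
```

The prefix is what ends up in the new headers (C1_1, C1_2, ...), while the stem only determines the output file name.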

A dry run example:

$ ls *_pandaseq.fasta
C1_pandaseq.fasta  C4_pandaseq.fasta

$ ls *_pandaseq.fasta \
      | rush 'cat {} | seqkit replace -p ".+" -r "{^_pandaseq.fasta}_{nr}" > {.}.fa' --dry-run
cat C4_pandaseq.fasta | seqkit replace -p ".+" -r "C4_{nr}" > C4_pandaseq.fa
cat C1_pandaseq.fasta | seqkit replace -p ".+" -r "C1_{nr}" > C1_pandaseq.fa

ADD COMMENT • link 7.2 years ago by shenwei356 8.4k

0

7.2 years ago

st.ph.n ★ 2.7k

Looks like you almost have it. See this post for using awk.

If you want to use python (2.7):

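#!/usr/bin/env python
# Python 2.7: rename FASTA headers to <prefix>_<n>, where <prefix> is the
# part of the file name before the first underscore (e.g. C1).

import sys

inp = sys.argv[1]
prefix = inp.split('_')[0]

outfile = open(inp.split('.fasta')[0] + '_new.fasta', 'w')

with open(inp, 'r') as f:
    numb = 0
    for line in f:
        if line.startswith('>'):
            numb += 1
            # Write the new header, then the single sequence line after it.
            print >> outfile, '>' + prefix + '_' + str(numb)
            print >> outfile, next(f).strip()

outfile.close()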

To run it, save the script as rename_headers.py (or any name you like). List your files in a text file with ls -1 *_pandaseq.fasta > files.txt, then run cat files.txt | xargs -n 1 python rename_headers.py

This assumes every sequence in your FASTA files is on a single line. If you have multi-line FASTA files, you can linearize them first with an awk statement from Pierre.
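
If linearizing first is inconvenient, one alternative is to rename only the header lines and pass every other line through unchanged, which works for both single-line and multi-line FASTA. The following is a minimal Python 3 sketch under that assumption; `rename_headers` is a hypothetical helper name and is not part of the script above.

```python
def rename_headers(lines, prefix):
    """Yield FASTA lines with each header replaced by >prefix_N.

    `lines` is any iterable of text lines (for example an open file).
    Sequence lines are passed through untouched, so wrapped (multi-line)
    records keep their original layout.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield ">{}_{}\n".format(prefix, count)
        else:
            yield line


# Small in-memory example with one wrapped record.
example = [">read_a\n", "ACGT\n", "ACGT\n", ">read_b\n", "GGCC\n"]
print("".join(rename_headers(example, "C1")), end="")
```

To apply it to a file, open the input for reading and the output for writing, then call `dst.writelines(rename_headers(src, "C1"))`.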

ADD COMMENT • link 7.2 years ago by st.ph.n ★ 2.7k

1

No matter how much I love Python, for simple jobs like this it's best to use the available GNU/command-line tools. It's quite pointless to write a Python script every time you need to get something done :p

Oh, and avoid the print >> outfile syntax, which is old syntax that shouldn't be used anymore. Instead, use outfile.write("yourtexthere")

ADD REPLY • link 7.2 years ago by WouterDeCoster 47k

0

@WouterDeCoster - I first approached this with awk, see comment on Pierre's answer. In regards to syntax, I noted using Python 2.7, and still need to get used to 3+.

ADD REPLY • link 7.2 years ago by st.ph.n ★ 2.7k

score 4 · Accepted Answer · 2017-02-16

4

7.2 years ago

Pierre Lindenbaum 161k

 ls *_pandaseq.fasta | cut -d "_" -f 1 | while read PREFIX; do awk -v P=${PREFIX} '/^>/{print ">" P "_" ++i; next}{print}' ${PREFIX}_pandaseq.fasta > ${PREFIX}_pandaseq_new.fasta ; done
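
For readers who prefer Python to a shell loop, the same idea can be sketched as below: take the text before the first underscore of each file name as the prefix, and number the headers separately for each file. This is an illustrative Python 3 equivalent, not Pierre's code; the function name `rename_all` and the `directory` parameter are assumptions made for the example.

```python
from pathlib import Path


def rename_all(directory="."):
    """Write <prefix>_pandaseq_new.fasta for every *_pandaseq.fasta file.

    Returns the list of output paths that were written.
    """
    written = []
    for src in sorted(Path(directory).glob("*_pandaseq.fasta")):
        # Same as `cut -d "_" -f 1`: keep the text before the first underscore.
        prefix = src.name.split("_")[0]
        dst = src.with_name(prefix + "_pandaseq_new.fasta")
        count = 0
        with src.open() as fin, dst.open("w") as fout:
            for line in fin:
                if line.startswith(">"):
                    count += 1  # the counter restarts for every file
                    fout.write(">{}_{}\n".format(prefix, count))
                else:
                    fout.write(line)
        written.append(dst)
    return written
```

Calling `rename_all()` in the directory that holds the files mirrors the one-liner: the outputs end in `_pandaseq_new.fasta`, so they do not match the `*_pandaseq.fasta` pattern and are not picked up again on a second run.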

ADD COMMENT • link 7.2 years ago by Pierre Lindenbaum 161k

1

Thanks, Pierre. I learned something new here since I don't typically use awk. I first tried to reference your answer on this post, having a similar approach with cut, but couldn't figure out how to pass the variable to the awk statement for the headers. Now I know to use -v.