Question: Rename FASTA files according to FASTA file header
cerulean (0) wrote 3 months ago:

In a typical FASTA file, how can the header be used as the filename (i.e., how can the current filename be replaced with the header ID)?

I have multiple such FASTA files. I have been scouring the internet for a simple script that I can run on Linux to get this output, but to no avail. I am not well versed in programming or in any scripting language for that matter, which is why this task is proving to be quite an obstacle for me! Please help!

sequence • 410 views
modified 3 months ago by Matt Shirley (8.4k) • written 3 months ago by cerulean (0)

What does cat *.fasta | grep -e '>' | head -n 10 give you? It would be helpful to see the structure of the sequence headers, in order to provide a proper solution.

written 3 months ago by st.ph.n (2.2k)

> AAA64362/A/Japan/305+/1957

> AAA64363/A/RI/5-/1957

> AAA64364/A/Japan/305-/1957

...and so on

modified 3 months ago • written 3 months ago by cerulean (0)

Ok, so how do you want the files named according to these headers?

written 3 months ago by st.ph.n (2.2k)

The full header name as the filename with underscore as separator will be ideal.

modified 3 months ago • written 3 months ago by cerulean (0)

There are no underscores in the header. Do you want the slashes (/) to be replaced with underscores?

written 3 months ago by Sej Modha (2.8k)

I agree, using forward slash in filenames is not the best idea

written 3 months ago by lieven.sterck (1.4k)

lieven.sterck (1.4k; Belgium, Ghent, VIB) wrote 3 months ago:

If you extend it a little you should be safe:

mv seq.fasta "$(head -1 seq.fasta | cut -f1 -d ' ' | tr -d '>').fasta"

This will also take the header only up to the first space.

And if you want to run it on a bunch of files:

for i in *.fasta; do
    mv "$i" "$(head -1 "$i" | cut -f1 -d ' ' | tr -d '>').fasta"
done
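
For readers who would rather preview the result before anything is moved, here is a minimal Python sketch of the same loop. It is not part of the answer above: the helper names `plan_renames` and `apply_renames` are made up for illustration, and replacing `/` with `_` is an assumption based on the headers and the naming wish posted earlier in this thread.

```python
import os
import tempfile

def plan_renames(directory, suffix=".fasta"):
    """Return (old, new) name pairs; nothing on disk is changed."""
    plan = []
    for old in sorted(os.listdir(directory)):
        if not old.endswith(suffix):
            continue
        with open(os.path.join(directory, old)) as handle:
            first_line = handle.readline()
        # Roughly what `cut -f1 -d ' ' | tr -d '>'` does: keep the ID up to the first space.
        fields = first_line.lstrip(">").split()
        if not fields:
            continue  # empty file or empty header: leave it alone
        # Assumption: '/' becomes '_' because a filename cannot contain '/'.
        plan.append((old, fields[0].replace("/", "_") + suffix))
    return plan

def apply_renames(directory, plan):
    """Rename files, refusing to overwrite an existing target."""
    for old, new in plan:
        target = os.path.join(directory, new)
        if old != new and not os.path.exists(target):
            os.rename(os.path.join(directory, old), target)

# Demonstration in a throwaway directory, so no real data is touched.
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "seq1.fasta"), "w") as handle:
        handle.write(">AAA64362/A/Japan/305+/1957 some description\nACGT\n")
    plan = plan_renames(workdir)
    print(plan)
    apply_renames(workdir, plan)
    remaining = os.listdir(workdir)
    print(remaining)
```

Printing the plan first makes it easy to spot two files that would end up with the same name before running the actual rename.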
modified 3 months ago • written 3 months ago by lieven.sterck (1.4k)

I still wouldn't advise this, because it doesn't catch all the potential special characters. You've got to deal with ampersands, pipe symbols, colons, and all sorts of other things (which is why I didn't bother extending it).

If you know your sequence headers very well, then you might be OK!
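
One way to handle unknown headers is to whitelist the characters you trust and replace everything else. The sketch below is only an illustration of that idea: `safe_filename` is a hypothetical helper, and the set of allowed characters (letters, digits, `_`, `.`, `+`, `-`) is an assumption you may want to tighten.

```python
import re

def safe_filename(header, extension=".fasta"):
    """Turn a FASTA header line into a conservative filename (hypothetical helper)."""
    # Drop the leading '>' and any surrounding whitespace.
    name = header.strip().lstrip(">").strip()
    # Replace every run of characters outside the whitelist with one underscore.
    name = re.sub(r"[^A-Za-z0-9_.+-]+", "_", name)
    # Trim separators at the ends and never return an empty name.
    name = name.strip("_.") or "unnamed"
    return name + extension

print(safe_filename("> AAA64362/A/Japan/305+/1957"))
# AAA64362_A_Japan_305+_1957.fasta
print(safe_filename(">sp|P12345|NAME_HUMAN some protein & friends: v2"))
# sp_P12345_NAME_HUMAN_some_protein_friends_v2.fasta
```

Different headers can still collapse to the same filename after replacement, so it is worth checking for duplicates before renaming anything.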

written 3 months ago by jrj.healey (4.2k)

True.

But let's assume people are becoming aware of the fact that they should not use any special characters in their FASTA header IDs ;-). OK for the pipe symbol, but then just add

 | cut -f1 -d '|'

to it. I think it will nonetheless be much faster and more efficient than processing the files with Python or other tools.
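
To see what "field 1" means for a given header style, here is a small Python sketch of the same cut. `header_id` is a hypothetical name, and the second header is an invented example of a pipe-delimited ID, not one taken from the question.

```python
import re

def header_id(header_line):
    """Keep the part of the header before the first space or pipe."""
    # Remove the newline and the leading '>', then split once on ' ' or '|'.
    return re.split(r"[ |]", header_line.strip().lstrip(">"), maxsplit=1)[0]

print(header_id(">AAA64362/A/Japan/305+/1957 segment 4"))
# AAA64362/A/Japan/305+/1957
print(header_id(">sp|P12345|NAME_HUMAN"))
# sp
```

For headers where the first pipe-delimited field is a database tag, as in the second example, every file would receive the same name, so a different field may be the better choice there.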

modified 3 months ago • written 3 months ago by lieven.sterck (1.4k)

This provides the perfect output! Thanks!

I was able to set the special character as '/'.

modified 3 months ago • written 3 months ago by cerulean (0)

Thanks, this really did it!

Is there an online compendium or some collection of such useful commands using awk, sed, grep etc that will help me become a better bioinformatician? I can of course Google! But any specific suggestion/advice will expedite my search!

written 3 months ago by cerulean (0)

Not really, but I keep a personal list of reminders here https://github.com/jrjhealey/bioinfo-tools

I steal them from around the internet when I spot them.

written 3 months ago by jrj.healey (4.2k)

Thank you so much!!!

written 3 months ago by cerulean (0)

Same here. I take a similar approach, collecting them on our lab's wiki page as I come across them.

Nice page of useful one-liners, jrj.healey, thx.

written 3 months ago by lieven.sterck (1.4k)

You're welcome, though I can't take that much credit as they're mostly if not entirely stolen from others XD

written 3 months ago by jrj.healey (4.2k)

Sej Modha (2.8k; Glasgow, UK) wrote 3 months ago:

A Python3 solution: It assumes that all fasta files are in the present working directory.

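#!/usr/bin/env python3
import os
from Bio import SeqIO

pwd = os.getcwd()

for file in os.listdir(pwd):
    # Only process FASTA files (.fa or .fasta), not every name containing '.fa'.
    if file.endswith(('.fa', '.fasta')):
        # Read all records first, in case an output name matches the input file.
        myfastalist = list(SeqIO.parse(file, 'fasta'))
        for record in myfastalist:
            header = record.id
            # A filename cannot contain '/', so replace it with '_'.
            outfasta = header.replace('/', '_') + '.fa'
            # The with-block closes each output file once it is written.
            with open(outfasta, 'w') as outfile:
                outfile.write('>' + header + '\n' + str(record.seq) + '\n')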
modified 3 months ago • written 3 months ago by Sej Modha (2.8k)

jrj.healey (4.2k; United Kingdom) wrote 3 months ago:

A really easy way to do it would be (for single, not multifastas - unless you want the multifasta named according to the first entry):

mv seq.fasta $(head -1 seq.fasta).fasta

I wouldn't advise it though, as having the > in the file name can cause issues and any spaces or special characters etc will be a bit of a nightmare.
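
Since this only makes sense for single-sequence files, it can help to check first whether a file is a multi-FASTA. The following is a minimal sketch of such a check; `count_records` is a hypothetical helper, and it simply counts the lines that start with `>`.

```python
import io

def count_records(handle):
    """Count FASTA records by counting the lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))

# In-memory stand-ins for a single-sequence file and a multi-FASTA file.
single = io.StringIO(">seq1\nACGT\nACGT\n")
multi = io.StringIO(">seq1\nACGT\n>seq2\nTTGA\n")

print(count_records(single))  # 1
print(count_records(multi))   # 2
```

With a real file you would pass `open("seq.fasta")` instead of the `io.StringIO` objects and rename only when the count is 1.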

modified 3 months ago • written 3 months ago by jrj.healey (4.2k)

Anima Mundi (2.3k; Italy) wrote 3 months ago:

Hello,

assuming you are working in a UNIX environment, open a terminal window and type nano foo.py

Paste there the following:

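filename = None
for line in open('seq.fasta'):
    if line.startswith('>'):
        # Build the output name from the header: no '>', no spaces, '/' replaced by '_'.
        filename = line.strip().replace('>', '').replace(' ', '').replace('/', '_') + '.fasta'
        # Start a new file with the header line.
        with open(filename, 'w') as text_file:
            text_file.write(line)
    elif line.strip() and filename:
        # Append, so that multi-line sequences are kept in full.
        with open(filename, 'a') as text_file:
            text_file.write(line)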

Press CTRL + O, then CTRL + X

Type pwd in terminal. Copy all your FASTA files and put them in that folder.

Type cat *.fasta >> seq.fasta, and finally (requires Python 2.7) python foo.py

Hope this helps!

written 3 months ago by Anima Mundi (2.3k)

Matt Shirley (8.4k; Cambridge, MA) wrote 3 months ago:

You can use pyfaidx for this:

pip install pyfaidx
faidx -x input.fasta

See here for detailed usage.

written 3 months ago by Matt Shirley (8.4k)