Question

Fasta File Vs Fa File

6

12.8 years ago

Zhshqzyc ▴ 520

Hello, I am going to download fa files by chromosome from UCSC.

Eventually I want a single fasta file, so I have to merge them together.

How do I merge fa files into one fa file?
How do I convert a fa file to a fasta file?

Thanks,

fasta sequence • 41k views

ADD COMMENT • link updated 12.8 years ago by brentp 24k • written 12.8 years ago by Zhshqzyc ▴ 520

0

As noted below .fasta = .fa = .fsa, see http://en.wikipedia.org/wiki/FASTA_format#File_extension

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Casey Bergman 18k

score 7 · Answer 1 · 2011-06-30

7

12.8 years ago

Neilfws 49k

cat *.fa > newfile.fa
Files with suffix ".fa" are already in fasta format; no conversion required
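
For anyone who would rather do the merge from Python than from the shell, here is a minimal sketch of roughly what the cat command above does. The function name merge_fasta and its parameters are made up for illustration; it copies bytes verbatim, so every header line is kept.

```python
import glob
import os
import shutil

def merge_fasta(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`, headers included."""
    out_abs = os.path.abspath(out_path)
    # Sort for a deterministic order; skip the output file in case it matches the pattern.
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.abspath(p) != out_abs]
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so large chromosomes are fine
    return paths
```

Calling merge_fasta("*.fa", "newfile.fa") would be the rough equivalent. Note that the sort is alphabetical (chr1, chr10, chr11, ...), which is also what the shell glob gives you.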

ADD COMMENT • link 12.8 years ago by Neilfws 49k

3

You do not want to do that. Each chromosome has its own sequence and header for a reason. If you merge 2 chromosome sequences and give them one header, the resulting sequence is meaningless. It would imply that the 2 chromosomes were contiguous, which is not the case.
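
A quick way to check that a merged file still has one header per chromosome is to list its header lines. This is only a sketch; fasta_headers is a hypothetical helper name, not part of any library.

```python
def fasta_headers(path):
    """Return the header lines (without the leading '>') found in a FASTA file."""
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers
```

If the merge kept everything, the list should contain one entry per input chromosome file.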

ADD REPLY • link 12.8 years ago by Neilfws 49k

2

You asked how to merge fasta files. You did not specify that they need to be uncompressed first. The link in your question describes the use of gunzip for uncompression, so I assumed that you had read the information and were comfortable with that part. Anyway, the answer is to first run "gunzip *.fa.gz".

ADD REPLY • link 12.8 years ago by Neilfws 49k

0

But http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html uses a different command. So which one is correct?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Zhshqzyc ▴ 520

0

zcat decompresses gzip-compressed files and writes the plain text to standard output, so use it for .fa.gz files; if the files are already uncompressed, use cat. Also, if you want to merge only specific files rather than all of them, use cat name.fa >> newfile.fa to append name.fa to the end of newfile.fa
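
The same append step can be sketched in Python with the standard gzip module, which handles the decompression. The helper name append_fasta is hypothetical, and the choice to detect compression from the .gz suffix is an assumption made to keep the example short.

```python
import gzip
import shutil

def append_fasta(src_path, dest_path):
    """Append src_path to dest_path, decompressing first if the name ends in .gz."""
    opener = gzip.open if src_path.endswith(".gz") else open
    # "ab" appends, like >> in the shell; the destination is created if missing.
    with opener(src_path, "rb") as src, open(dest_path, "ab") as dest:
        shutil.copyfileobj(src, dest)
```

Calling it once per file, in the order you want, builds up the merged file one chromosome at a time.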

ADD REPLY • link 12.8 years ago by Burlappsack ▴ 690

0

Yes, what I meant was: do they all have the same header? When merging, I would keep only one header (the first line), but cat *.fa seems to keep all the headers. The link seems to use sed to remove the headers. Hence my question.

ADD REPLY • link 12.8 years ago by Zhshqzyc ▴ 520

0

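I agree with neilfws. But if you still want to, use $ sed '/^>/d' chr*.fa > file.fa (this deletes every header line; sed '1d' would remove only the first line of the combined input) ... and then edit the file and add the ID that you want.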

ADD REPLY • link 12.8 years ago by Geparada ★ 1.5k

0

So can we say the code in the link http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html is wrong?

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Zhshqzyc ▴ 520

0

No, it is not; it is just another sed line that does the same job. There are often multiple ways to do the same thing.

ADD REPLY • link 12.8 years ago by Geparada ★ 1.5k

score 7 · Answer 2 · 2011-06-30

7

12.8 years ago

brentp 24k

I use something like this shell script to get a single fasta (which, as @Neil says, is the same as .fa):

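# Base URL of the per-chromosome files (no trailing slash)
URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes
rm -f hg18.fa
for chrom in $(seq 1 22) X Y
do
    # Stream each compressed chromosome, decompress it, and append it to hg18.fa
    wget -O - "$URL/chr${chrom}.fa.gz" | zcat >> hg18.fa
done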

ADD COMMENT • link 12.8 years ago by brentp 24k

0

Small nitpick: $URL/${chrom}.fa.gz should be: $URL/chr${chrom}.fa.gz (each filename begins with chr). Also, you have a superfluous slash at the end of the URL parameter.

ADD REPLY • link 12.8 years ago by Jeroen Van Goey 2.3k

0

edited. thanks.

ADD REPLY • link 12.8 years ago by brentp 24k

0

Very simple, yet very useful. Just one question: is there any particular reason why you don't include the _hap, _random and MT chromosomes? Wouldn't they be needed for mapping purposes, for instance?

ADD REPLY • link 12.8 years ago by Jorge Amigo 14k

0

@Jorge. No specific reason, but the hap and randoms are easily added as extra rows inside the for loop. And the MT after X Y.
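
As a rough illustration of extending the list, here is a small sketch that builds the file names to fetch. The function chromosome_files and its extras parameter are hypothetical, and any extra names you pass in (for the mitochondrial, _hap or _random sequences) are assumptions that should be checked against the actual directory listing.

```python
def chromosome_files(extras=()):
    """Build the per-chromosome file names: chr1..chr22, chrX, chrY, then any extras."""
    names = [str(n) for n in range(1, 23)] + ["X", "Y"] + list(extras)
    return ["chr%s.fa.gz" % name for name in names]
```

For example, chromosome_files(extras=["M"]) would add a chrM.fa.gz entry after chrY.fa.gz, assuming that is what the mitochondrial file is called on the server.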

ADD REPLY • link 12.8 years ago by brentp 24k

0

yes, I can see that adding extra rows is fairly simple. I was just curious about the reasons why anyone would consider (or not) the extra information of the human genome. thanks for the reply.

ADD REPLY • link 12.8 years ago by Jorge Amigo 14k

score 3 · Answer 3 · 2011-07-01

-2- fa and fasta are the same, but if you mean extracting individual sequences from a multifasta file, you can use Biopython. For example, I made this script:

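import sys
from Bio import SeqIO

def IDfinder(fasta, ID):
    # Scan the multifasta file and print the record whose ID matches
    with open(fasta) as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            if seq_record.id == ID:
                # print() with str() works on both Python 2.7 and Python 3
                print(">" + seq_record.id + "\n" + str(seq_record.seq))

if __name__ == '__main__':
    # Usage: python IDextractor.py <multifasta file> <sequence ID>
    IDfinder(sys.argv[1], sys.argv[2])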

To use it, just copy and paste it into a text file and save it as IDextractor.py (or any name you like). Then you can use it to select a sequence by its ID from a multifasta file. For example:

get a multifasta file:

$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/refMrna.fa.gz
$ gzip -d refMrna.fa.gz

then use the script

$ python IDextractor.py refMrna.fa NM_018864 
>NM_018864
ggggttttaatggggcgggacttcctgtcggagcaatccccgttacctccggaagagccgaagaaccgagccctcggacgccggcggttgagcatcgatcgcggtgcgctcgcgcgagataatggcagacccttggcaggagtgcatggactatgcagtaatcctcgcgaggcaagctggagagatgattcgtgaagctttaaaaaatgagatggatgtcatgattaaaagttctccagccgacttggtaacagttactgaccaaaaagttgaaaaaatgctcatgtcttctataaaggaaaagtatccatgtcacagcttcattggtgaagagtctgtggcagctggggagaagacggtcttcacagagcagcccacgtgggtcattgaccccattgatggaacgactaacttcgtgcatcggtttccctttgtagctgtttcaattggcttccttgtgaataaagagatggagtttggaattgtgtacagctgtgtggaagataagatgtacaccggcaggaaagggaaaggtgccttttgtaacggtcagaagcttcaggtgtcccagcaggaagacattaccaagtcactcttggtgaccgagttgggctcgtccagaaagcccgagactttacggatcgttctctccaacatggaaaagctgtgttccatccccatccatggaatccggagtgttggaacagctgctgttaatatgtgccttgtggcaacgggaggagcagatgcctattatgagatgggaatccactgctgggacatggcgggagctggcatcattgtcaccgaggcaggcggagtgctcatggatgtcacgggtggaccgttcgatctgatgtctcggagaataattgccgcaaatagtataacattagccaaaagaatagccaaagaaattgagataatacctttgcaaagagacgacgaaagctagtcacagagaacagtgtccagctccagtgtcatccttgctgtccctggggtgtttcagatggatggtgtcactgatttagactgaactttgaggtcctgattttaaaatggaaactttttttttacagatgacatattcaaaattagatggaatatttgattattgaaagaaaatttgcatgtagtaatattcttggggaaaatatacaaaaagtatacttaatgaactagccattgaaattgtccctagtccttatgatccccttcaacttaatgtactgtttatatgcataattctcaattacaaagtttctttttgtaagtggctttctctatgttccagaagccatatttgattaagtctaaaggctgtaacaagctggctctccctgtgcagagggcctttgtgttttattaatcactgtaagatagtgcctggcccagtgcctgtcagacagtaggcagtctgaagtccacacctgacaatgcgtgctcgaagctgcagctgctgcctctaatgcgtcacagtaagataaccaccctcctgttgcgaggtagaagttacttcactgtcctttttatatttcttattgctatgccatttcacaggatcgtgctgccagagacgactgcttctagtggacatttctgcagttagtacactgctgtatgttgtaggttctgcttaaagctgccgtgctaaagagattttcacagacatcttccaggtacctggtctagttagtggcagggatatgttttacaaaaggcagctttctcattcagatccgtaccctggtgctgacctgtgtactgtggtgtaatggtgaactttttgatttctttccagacttgctgaatttcatcactgctaactctagatgctctctctataaggtcttgggcctctcaaactcaagaaaatttaatggctcctattcctttgttaaagggttaattcattgtctagccttggcccttggcatatgaacagatgttttgctcttagtatgtttgaaccttgcatttgatacaatgaagtgttt
ttgtaagtttcaaggcagttatcttgattttggggggatttaatatattaaagctatataatactcagatttgggcactgtaatgactatatctgtgctgttaattacatgtatttaaaacgtcacgtaccatgtaaattctattacaagacaggttgctttgcaattaaatttattttagttaagacttaggaataccattttctttcattgtattcatttgcgtatcccaggctgccctcagaattgttgcatacccgaggatgaacttgaacttgtgacggctctgcttttctctcttaagttctgggatgcagagaagatggccacaggccaccacacacagtttctgtggtgctggagactgcacagggccacacgtgtacttagcgtaagcactctgctgcccaagctgcgctccagcccatgaacacacgtggaattaaaggagtaattaatgatatcttatcaaagttaatagcctcagccctttttaggggttttgagtttagttacagatatttgaagctaatattggttatgaatattcactttttgcatatagattttcccactatagataaacacttaatactttccc

As you can see, you call python, then write the name of the script, the name of your multifasta file and the ID of the sequence you are looking for.

$ python IDextractor.py file.fa ID

And you MUST have biopython installed to run this script

$ sudo apt-get install python-biopython
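
If installing Biopython is not an option, the same lookup can be sketched with the standard library only. This is a simplified stand-in, not the script above: find_record is a hypothetical name, and it assumes the ID is the first whitespace-delimited token on the header line.

```python
def find_record(path, wanted_id):
    """Return (header, sequence) for the first record whose ID equals wanted_id, else None."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    # Reached the next record, so the wanted one is complete.
                    return header, "".join(chunks)
                parts = line[1:].split()
                rec_id = parts[0] if parts else ""
                header = line[1:] if rec_id == wanted_id else None
                chunks = []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        return header, "".join(chunks)
    return None
```

It stops reading as soon as the wanted record ends, so it does not need to hold the whole multifasta file in memory.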

-1-

To concatenate multiple fastas, as neilfws said, you can use cat. For example:

Get the fastas:

$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*

Unzip:

$ gzip -d chr*.fa.gz

Concatenate into one fasta:

$ cat chr*.fa > allgenome.fa

Note that > redirects the screen output into a file.

I hope this helps. Cheers!