Fasta File Vs Fa File
10.4 years ago
Zhshqzyc ▴ 490

Hello, I am going to download .fa files by chromosome from UCSC.

Eventually I want to get a single fasta file. So I have to merge them together.

  1. How to merge fa files into one fa file?
  2. How to convert a fa file to a fasta file?

Thanks,

fasta sequence • 30k views

As noted below .fasta = .fa = .fsa, see http://en.wikipedia.org/wiki/FASTA_format#File_extension

10.4 years ago
Neilfws 49k
  1. cat *.fa > newfile.fa
  2. Files with suffix ".fa" are already in fasta format; no conversion required
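
For anyone who prefers a script over the shell, the same concatenation can be sketched in Python. This is only a minimal illustration: the function name merge_fasta and the file names are made up for the example, and every header line is kept, exactly as with cat.

```python
import glob
import os
import shutil


def merge_fasta(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`, headers included."""
    out_abs = os.path.abspath(out_path)
    # Sort for a reproducible order and skip the output file in case it
    # matches the pattern itself (e.g. "*.fa" and "newfile.fa").
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                # Copy the bytes unchanged, like cat does.
                shutil.copyfileobj(src, out)
    return paths
```

Calling merge_fasta("*.fa", "newfile.fa") in the download directory would then play the role of the cat command above.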

You do not want to do that. Each chromosome has its own sequence and header for a reason. If you merge 2 chromosome sequences and give them one header, the resulting sequence is meaningless. It would imply that the 2 chromosomes were contiguous, which is not the case.


You asked how to merge fasta files. You did not specify that they need to be uncompressed first. The link in your question describes the use of gunzip for uncompression, so I assumed that you had read the information and were comfortable with that part. Anyway, the answer is to first run "gunzip *.fa.gz".


But http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html uses a different command. So which one is correct?


zcat decompresses a gzipped file and writes the text to standard output, so it is the tool for the .fa.gz files; once the files are uncompressed, use cat. Also, if you want to merge only specific files rather than all of them, use cat name.fa >> newfile.fa to append name.fa to the end of newfile.fa.
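
The cat versus zcat distinction carries over to scripts: a compressed file has to be opened with a decompressing reader. A small sketch, assuming the file name ends in .gz when the file is gzipped (the helper names open_fasta and first_header are made up for this example):

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, decompressing on the fly for .gz names."""
    if path.endswith(".gz"):
        # "rt" asks gzip for decoded text rather than raw bytes.
        return gzip.open(path, "rt")
    return open(path, "r")


def first_header(path):
    """Return the first line of the file (the header of the first record)."""
    with open_fasta(path) as handle:
        return handle.readline().rstrip()
```

With this, first_header("chr1.fa") and first_header("chr1.fa.gz") should return the same line.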


Yes, what I meant was: do they all have the same header? When merging, I want to keep only one header (the first line), but cat *.fa seems to keep all the headers. The link seems to use sed to remove the headers. That is my question.


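I agree with neilfws. But if you still want to, use $ sed -s '1d' chr*.fa > file.fa (the -s option of GNU sed treats each file separately; without it, only the first line of the first file is deleted), and then edit the file and add the ID that you want.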
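
If a single-header file really is what you want, despite the caveat above that the merged sequence is biologically meaningless, the idea can be sketched in Python as follows. The function name and the new ID are hypothetical. Note that this drops every line starting with ">", not just line 1 of each file, which gives the same result when each file holds one record.

```python
def merge_as_single_record(paths, out_path, new_id):
    """Write one header, then the sequence lines of every input file."""
    with open(out_path, "w") as out:
        out.write(">" + new_id + "\n")
        for path in paths:
            with open(path) as src:
                for line in src:
                    if line.startswith(">"):
                        continue  # drop the per-file header
                    # Make sure a missing final newline does not glue lines together.
                    out.write(line if line.endswith("\n") else line + "\n")
```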


So can we say the code in the link http://www.mail-archive.com/genome@soe.ucsc.edu/msg02192.html is wrong?


No, it is not; it is just another sed command that does the same job. There are often multiple ways to do the same thing.

10.4 years ago
brentp 23k

I use something like this shell script to get a single fasta (which, as @Neil says, is the same as .fa):

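# Base URL without a trailing slash; the slash is added when the file name is appended
URL=http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes
rm -f hg18.fa
for chrom in $(seq 1 22) X Y
do
    # Stream each compressed chromosome and append the decompressed text
    wget -O - "$URL/chr${chrom}.fa.gz" | zcat >> hg18.fa
done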
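
The same loop can be sketched in Python with only the standard library. Treat it as an illustration rather than a tested downloader: the function names are invented for the example, and it assumes the server still provides one chrN.fa.gz file per chromosome under the URL used in the shell script.

```python
import gzip
import shutil
import urllib.request

# Same directory as in the shell script, without a trailing slash.
BASE_URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes"


def chromosome_names():
    """Return chr1 .. chr22, chrX, chrY, matching `seq 1 22` X Y."""
    return ["chr%s" % c for c in list(range(1, 23)) + ["X", "Y"]]


def append_gunzipped(compressed_stream, out_handle):
    """Decompress a gzip byte stream and append the result to out_handle."""
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as unzipped:
        shutil.copyfileobj(unzipped, out_handle)


def download_genome(out_path="hg18.fa", base_url=BASE_URL):
    """Fetch every chromosome and write them, decompressed, into one file."""
    with open(out_path, "wb") as out:
        for name in chromosome_names():
            url = "%s/%s.fa.gz" % (base_url, name)
            with urllib.request.urlopen(url) as response:
                append_gunzipped(response, out)
```

Nothing is downloaded until download_genome() is called.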

Small nitpick: $URL/${chrom}.fa.gz should be: $URL/chr${chrom}.fa.gz (each filename begins with chr). Also, you have a superfluous slash at the end of the URL parameter.


edited. thanks.


Very simple, yet very useful. Just one question: is there any particular reason why you don't include the _hap, _random and MT chromosomes? Wouldn't they be needed for mapping, for instance?


@Jorge. No specific reason, but the hap and randoms are easily added as extra rows inside the for loop. And the MT after X Y.


yes, I can see that adding extra rows is fairly simple. I was just curious about the reasons why anyone would consider (or not) the extra information of the human genome. thanks for the reply.

10.4 years ago
Geparada ★ 1.4k

-2- fa and fasta are the same, but if you mean extracting individual sequences from a multi-FASTA file, you can use Biopython. For example, I made this script:

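import sys
from Bio import SeqIO

def IDfinder(fasta, ID):
    """Print the record whose identifier equals ID, in FASTA format."""
    with open(fasta) as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            if seq_record.id == ID:
                # str() is needed because seq_record.seq is a Seq object, not a string
                print(">" + seq_record.id + "\n" + str(seq_record.seq))

if __name__ == '__main__':
    IDfinder(sys.argv[1], sys.argv[2])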

To use it, copy and paste it into a text file and save it as IDextractor.py (or any name you want). Then you can use it to select a sequence by its ID from a multi-FASTA file. For example:

get a multifasta file:

$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/refMrna.fa.gz
$ gzip -d refMrna.fa.gz

then use the script

$ python IDextractor.py refMrna.fa NM_018864 
>NM_018864
ggggttttaatggggcgggacttcctgtcggagcaatccccgttacctccggaagagccgaagaaccgagccctcggacgccggcggttgagcatcgatcgcggtgcgctcgcgcgagataatggcagacccttggcaggagtgcatggactatgcagtaatcctcgcgaggcaagctggagagatgattcgtgaagctttaaaaaatgagatggatgtcatgattaaaagttctccagccgacttggtaacagttactgaccaaaaagttgaaaaaatgctcatgtcttctataaaggaaaagtatccatgtcacagcttcattggtgaagagtctgtggcagctggggagaagacggtcttcacagagcagcccacgtgggtcattgaccccattgatggaacgactaacttcgtgcatcggtttccctttgtagctgtttcaattggcttccttgtgaataaagagatggagtttggaattgtgtacagctgtgtggaagataagatgtacaccggcaggaaagggaaaggtgccttttgtaacggtcagaagcttcaggtgtcccagcaggaagacattaccaagtcactcttggtgaccgagttgggctcgtccagaaagcccgagactttacggatcgttctctccaacatggaaaagctgtgttccatccccatccatggaatccggagtgttggaacagctgctgttaatatgtgccttgtggcaacgggaggagcagatgcctattatgagatgggaatccactgctgggacatggcgggagctggcatcattgtcaccgaggcaggcggagtgctcatggatgtcacgggtggaccgttcgatctgatgtctcggagaataattgccgcaaatagtataacattagccaaaagaatagccaaagaaattgagataatacctttgcaaagagacgacgaaagctagtcacagagaacagtgtccagctccagtgtcatccttgctgtccctggggtgtttcagatggatggtgtcactgatttagactgaactttgaggtcctgattttaaaatggaaactttttttttacagatgacatattcaaaattagatggaatatttgattattgaaagaaaatttgcatgtagtaatattcttggggaaaatatacaaaaagtatacttaatgaactagccattgaaattgtccctagtccttatgatccccttcaacttaatgtactgtttatatgcataattctcaattacaaagtttctttttgtaagtggctttctctatgttccagaagccatatttgattaagtctaaaggctgtaacaagctggctctccctgtgcagagggcctttgtgttttattaatcactgtaagatagtgcctggcccagtgcctgtcagacagtaggcagtctgaagtccacacctgacaatgcgtgctcgaagctgcagctgctgcctctaatgcgtcacagtaagataaccaccctcctgttgcgaggtagaagttacttcactgtcctttttatatttcttattgctatgccatttcacaggatcgtgctgccagagacgactgcttctagtggacatttctgcagttagtacactgctgtatgttgtaggttctgcttaaagctgccgtgctaaagagattttcacagacatcttccaggtacctggtctagttagtggcagggatatgttttacaaaaggcagctttctcattcagatccgtaccctggtgctgacctgtgtactgtggtgtaatggtgaactttttgatttctttccagacttgctgaatttcatcactgctaactctagatgctctctctataaggtcttgggcctctcaaactcaagaaaatttaatggctcctattcctttgttaaagggttaattcattgtctagccttggcccttggcatatgaacagatgttttgctcttagtatgtttgaaccttgcatttgatacaatgaagtgttt
ttgtaagtttcaaggcagttatcttgattttggggggatttaatatattaaagctatataatactcagatttgggcactgtaatgactatatctgtgctgttaattacatgtatttaaaacgtcacgtaccatgtaaattctattacaagacaggttgctttgcaattaaatttattttagttaagacttaggaataccattttctttcattgtattcatttgcgtatcccaggctgccctcagaattgttgcatacccgaggatgaacttgaacttgtgacggctctgcttttctctcttaagttctgggatgcagagaagatggccacaggccaccacacacagtttctgtggtgctggagactgcacagggccacacgtgtacttagcgtaagcactctgctgcccaagctgcgctccagcccatgaacacacgtggaattaaaggagtaattaatgatatcttatcaaagttaatagcctcagccctttttaggggttttgagtttagttacagatatttgaagctaatattggttatgaatattcactttttgcatatagattttcccactatagataaacacttaatactttccc

As you can see, you call python, followed by the name of the script, the name of your multi-FASTA file and the ID of the sequence you are looking for.

$ python IDextractor.py file.fa ID

You must have Biopython installed to run this script. On Debian/Ubuntu, for example (newer releases name the package python3-biopython):

$ sudo apt-get install python-biopython
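
If installing Biopython is not an option, the same lookup can be sketched with the standard library alone. This is a simplified reader written for this example (the names read_fasta and find_by_id are invented here); it takes the ID to be the first word after ">" and ignores any edge cases that Biopython handles.

```python
def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA text file."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            # The identifier is the first word of the header line.
            seq_id = fields[0] if fields else ""
            chunks = []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def find_by_id(path, wanted_id):
    """Return the sequence stored under wanted_id, or None if it is absent."""
    with open(path) as handle:
        for seq_id, sequence in read_fasta(handle):
            if seq_id == wanted_id:
                return sequence
    return None
```

For the example above, find_by_id("refMrna.fa", "NM_018864") would return the sequence as a single string.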

-1-

To concatenate multiple fastas, as neilfws said, you can use cat. For example:

Get the fastas:

$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*

Decompress:

$ gzip -d chr*.fa.gz

Concatenate into one fasta:

$ cat chr*.fa > allgenome.fa

Note that > redirects the output, which would otherwise be printed to the screen, into a file.
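
A quick way to check the merged file is to list its header lines and confirm there is one per chromosome. A minimal sketch (list_headers is a name chosen for this example):

```python
def list_headers(path):
    """Return the text after '>' for every header line in a FASTA file."""
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers
```

For instance, len(list_headers("allgenome.fa")) should equal the number of chr*.fa files that were concatenated, assuming each of them holds a single record.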

I hope it helps. Cheers!
