Is there a better way to download the human genome reference sequence in FASTA format than getting it from the UCSC site? The BWA protocol asks for an index to be built from the human reference multi-FASTA, so I need to get that file. Thanks
[Edited for clarification in response to answers and comments:]
You can get the fasta sequences for each chromosome here (human genome build 37): http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes
The version used by the 1000 Genomes Project is recommended. The mitochondrial genome in the g1k version is the rCRS, which is the most widely used one. The chromosomes and contigs are already concatenated, so you are less likely to make mistakes (people frequently concatenate all sequences, including different haplotypes from the same region).
We have seen a lot of complications caused by different chromosome names (chr1 vs. 1) or different ordering (chr2 before or after chr10). It is true that which b37 version you use does not matter too much, but converging on something close to a standard would save everyone a lot of unnecessary work.
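A quick way to spot the naming and ordering mismatches described above is to list the sequence names of each reference and compare them. This is only a minimal sketch: the function names `fasta_names` and `same_names_and_order` are made up for illustration, and the file arguments stand in for whatever reference files you have.

```shell
# Print the sequence names of a FASTA file, one per line, in file order.
# Everything after the first whitespace in a header line is dropped.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Report whether two references use the same names in the same order.
# Both arguments are placeholders for your own reference files.
same_names_and_order() {
    a=$(fasta_names "$1")
    b=$(fasta_names "$2")
    if [ "$a" = "$b" ]; then
        echo "same"
    else
        echo "different"
    fi
}
```

For example, `same_names_and_order ucsc.fa g1k.fa` (hypothetical file names) should print `different` if one file uses chr1 and the other uses 1. Note that matching names alone do not guarantee matching sequences: as mentioned above, the mitochondrial sequence can differ between versions.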
Just for the record (since I'm always searching for these links myself)...
The canonical source for GRCh37, which hg19 is based upon (and should be identical to) is:
1000 Genomes also has a pre-concatenated multi-fasta reference suitable for use with most next-gen aligners out of the box at:
This file does have an "alternate" chrM, and includes all the "random" contigs. There's a README explaining the method of construction in that folder. YMMV.
For those in Europe (they now have a US mirror, too), try Ensembl for a local snapshot of the reference assembly:
So you can anticipate the download time and storage space required, the total size for each of these variations is ~3GB uncompressed, ~750MB compressed.
Using an rsync command to download the entire directory:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .
This directory holds the FASTA files, one gzipped (.gz) file per chromosome, plus other useful files for the human reference genome dataset.
Unix-specific: gunzip the files, then concatenate them:
$ cat file1.fa file2.fa ... > multifastafile.fa
This gives you the human reference genome as a single multi-FASTA file.
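The gunzip-and-cat step above can be wrapped in a small function that fixes the chromosome order explicitly instead of relying on shell glob order (where chr10 sorts before chr2). This is a sketch under some assumptions: the helper name `build_multifasta` is made up, the files are assumed to be named chrN.fa.gz as in the UCSC directory, and the order 1..22, X, Y, M is just one common choice. It deliberately skips any other files in the directory, such as random contigs or alternate haplotypes.

```shell
# Concatenate per-chromosome gzipped FASTA files from directory $1
# into a single multi-FASTA file $2, in the order 1..22, X, Y, M.
# Stops with an error if an expected file is missing.
build_multifasta() {
    src_dir=$1
    out=$2
    : > "$out"    # start from an empty output file
    for c in $(seq 1 22) X Y M; do
        f="$src_dir/chr$c.fa.gz"
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
        # Decompress to stdout so the .gz originals are left untouched.
        gunzip -c "$f" >> "$out"
    done
}
```

Called as `build_multifasta . multifastafile.fa` from the download directory, this should leave you with a file you can pass to `bwa index`. If your pipeline expects a different order (some put chrM first), change the loop list accordingly.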
Also see this discussion of the same topic: http://seqanswers.com/forums/showthread.php?t=5996