Where Can I Download Human Reference Genome In Fasta Format? Hgref.Fa File
12.2 years ago
Biomed 4.8k

Is there a better way to download the human reference genome sequence in FASTA format than getting it from the UCSC site? The BWA protocol asks for an index to be built from the human reference multi-FASTA file, so I want to get that file. Thanks

[Edited for clarification in response to answers and comments:]

human fasta sequence bwa • 115k views

Please consider taking minimal effort finding the answer yourself before posting a question.


Please consider doing something more useful than posting answers like this. I just waited a minute but I feel better. Thanks


As a further extension to this question refer to this question.

11.8 years ago
lh3 33k

The version used by the 1000 Genomes Project is recommended. The mitochondrial genome in the g1k version is the most widely used one, rCRS. The chromosomes and contigs are already concatenated, so you are less likely to make mistakes (people frequently concatenate all sequences, including different haplotypes of the same region).

We have seen a lot of complications caused by different chromosome names (chr1 vs. 1) or different ordering (chr2 before or after chr10). It is true that which b37 version you use does not matter too much, but converging on something close to a standard would save everyone a lot of unnecessary work.
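A quick way to see which naming and ordering convention a given file follows is to list its sequence names in file order. This is only a minimal sketch, assuming an uncompressed FASTA file; the function names `fasta_names` and `naming_style` and the file name in the usage line are placeholders, not part of any tool mentioned in this thread.

```shell
# fasta_names FILE
# Print the name of every sequence in FILE, in file order.
# The name is the header text after '>' up to the first whitespace.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# naming_style FILE
# Print "chr-prefixed" if the first sequence name starts with "chr"
# (names such as chr1, chrX), otherwise "plain" (names such as 1, X).
naming_style() {
    case "$(fasta_names "$1" | head -n 1)" in
        chr*) echo "chr-prefixed" ;;
        *)    echo "plain" ;;
    esac
}
```

For example, `fasta_names reference.fa | head -n 30` shows at a glance whether chr2 comes before or after chr10. Running the same check on every reference and annotation file in a project should catch most name and order mismatches before alignment.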


Using the g1k version is highly recommended.


The random and Un contigs are already in the g1k version. Usually you would not want to map to the haplotypes, as you will lose most of the variants.


I'm very interested in this opinion, since we have moved from the hg18 reference that came with our SOLiD sequencer to the hg19 we manually built by concatenating all chromosome chunks from UCSC.

Did I understand you correctly that we would be less error-prone using the g1k reference genome rather than UCSC's? The main problem we see is how to deal efficiently with chrUn, random and haplotypes. The chrUn contigs should definitely be kept, but are you saying that random chunks and/or different haplotypes shouldn't be concatenated into that single multi-FASTA file?


If you download the single hg19 file from UCSC and convert it to FASTA using twoBitToFa, you end up with a multi-FASTA file containing all chromosomes, including the haplotypes, random and chrUn sequences. Since g1k seems to include only the latter unmapped supercontigs, is there any reason or recommendation to leave the rest aside?
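If you start from the UCSC file and decide to leave the haplotypes out, one option is to filter records by name. The sketch below assumes the hg19 naming scheme, in which alternate haplotypes carry `_hap` in their names (for example chr6_cox_hap2); the function name `drop_haplotypes` and the file names in the usage line are made up for illustration.

```shell
# drop_haplotypes FILE
# Copy FILE to standard output, skipping every FASTA record whose
# header contains "_hap". All other records, including the random and
# chrUn sequences, are passed through unchanged.
drop_haplotypes() {
    awk '/^>/ { keep = ($1 !~ /_hap/) } keep' "$1"
}
```

Usage would look like `drop_haplotypes hg19.fa > hg19.nohap.fa`. Checking the headers of the result with `grep '^>'` before building an index is a cheap way to confirm that only the intended sequences remain.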


thanks a lot for the advice

12.2 years ago

You can get the fasta sequences for each chromosome here (human genome build 37)


No, you can just concatenate those files into a single file.


I used

$ cat file1 file2 filen > hg18.mfa

to create the multi-FASTA file, but I was not sure about the ordering of chrX, chrY and chrM. My current order is chr1-22, chrX, chrY and chrM. Will this ordering have any effect downstream in the analysis? Is there a standard order that is different from this? Thanks

Thank you for your help. I elaborated a little on your initial input.

The files come in a one-file-per-chromosome format; I want to use them in one multi-FASTA file as input to BWA. Do I simply concatenate these chr FASTA files into one big FASTA file to get the multi-FASTA file? Or is there something else to it? Any ideas?

Thank you, I guess I will have more questions on this as I go, but this site and people like you are a great help.

"Will this ordering have any effect downstream in the analysis?": no

"Chromosome M" is the mitochondrial DNA sequence. Depending on the analysis you're doing, you may not want to include it.

12.2 years ago
Biomed 4.8k

Using an rsync command to download the entire directory:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

This directory holds all the FASTA files, one file per chromosome, in gzip-compressed (.gz) format, plus other useful files for the human reference genome dataset. Original web site: ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/chromosomes/README.txt

Unix-specific: gunzip the files, then

$ cat file1.fa file2.fa etc > multifastafile.fa

will get you the reference human genome.
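Because the shell expands `chr*.fa` alphabetically (chr10 before chr2), spelling the order out avoids surprises. The following is a minimal sketch of the concatenation step, assuming the per-chromosome files are named chr1.fa.gz through chr22.fa.gz plus chrX.fa.gz, chrY.fa.gz and chrM.fa.gz and sit in one directory; the function name `concat_chromosomes` and the output file name are placeholders.

```shell
# concat_chromosomes DIR OUT
# Decompress DIR/chr1.fa.gz ... DIR/chr22.fa.gz, then chrX, chrY and
# chrM, appending each to the single multi-FASTA file OUT in that order.
# If a file is missing, the function stops and returns a non-zero
# status, leaving OUT incomplete.
concat_chromosomes() {
    dir=$1
    out=$2
    : > "$out"    # start from an empty output file
    for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M; do
        f="$dir/chr$c.fa.gz"
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
        gunzip -c "$f" >> "$out"
    done
}
```

Called as `concat_chromosomes . hg19.fa`, this writes the chromosomes in the order chr1-22, chrX, chrY, chrM discussed above. Drop `M` from the list if the mitochondrial sequence is not wanted for your analysis.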


11.8 years ago

Just for the record (since I'm always searching for these links myself)...

This is the canonical source for GRCh37, which hg19 is based upon (and should be identical to).

1000 Genomes also has a pre-concatenated multi-fasta reference suitable for use with most next-gen aligners out of the box here.

This file does have an "alternate" chrM, and includes all the "random" contigs. There's a README explaining the method of construction in that folder. YMMV.

For those in Europe (they now have a US mirror, too), try Ensembl for a local snapshot of the reference assembly.

So you can anticipate the download time and storage space required, the total size for each of these variations is ~3GB uncompressed, ~750MB compressed.

9.6 years ago
Tulip Nandu ▴ 90

5.6 years ago

I know that this question is already 6 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve the human reference genome from several database sources one can simply type:

# download human reference genome from NCBI RefSeq
biomartr::getGenome(db  = "refseq", organism = "Homo sapiens")


or

# download human reference genome from NCBI Genbank
biomartr::getGenome(db  = "genbank", organism = "Homo sapiens")


or

# download human reference genome from ENSEMBL
biomartr::getGenome(db  = "ensembl", organism = "Homo sapiens")


This way, users can use the same command to retrieve reference genomes from different databases. Each database has its own custom gene identifier and thus, it should always be clear which reference genome has been used to perform subsequent analyses.

For more detailed information please consult the Genomic Sequence Retrieval vignette.

The getGenome() function will then generate a log file that stores the following information:

File Name: Homo_sapiens_genomic_refseq.fna.gz

Organism Name: Homo_sapiens

Database: NCBI refseq

refseq_category: reference

genome assembly_accession: GCF_000001405.35

bioproject: PRJNA168

biosample: NA

taxid: 9606

infraspecific_name: NA

version_status: latest

release_type: Patch

genome_rep: Full

seq_rel_date: 2016-09-26

submitter: Genome Reference Consortium

Thus, you will always know with which reference genome and with which genome version you are working.

I hope that this will help to improve the reproducibility of many studies.

Alternatively, the biomartr package also provides functions for retrieving corresponding coding sequence - getCDS(), protein sequence - getProteome(), and annotation files - getGFF().