Question: Where Can I Download Human Reference Genome In Fasta Format? Hgref.Fa File
23
6.6 years ago by
Biomed4.1k
Bethesda, MD, USA
Biomed4.1k wrote:

Is there a better way of downloading the human genome reference sequence in FASTA format than downloading it from the UCSC site? The BWA protocol asks for an index to be created from the human genome reference as a multi-FASTA file, so I want to get this file. Thanks.

[Edited for clarification in response to answers and comments:]

fasta sequence human bwa • 71k views
ADD COMMENTlink modified 15 days ago by Hajk-Georg Drost80 • written 6.6 years ago by Biomed4.1k
8

Please consider taking minimal effort finding the answer yourself before posting a question.

ADD REPLYlink written 6.6 years ago by Michael Schubert6.6k

As a further extension to this question refer to this question : http://biostar.stackexchange.com/questions/6319/input-files-in-fasta-sequence-are-required-for-comparing-various-dna-sequencing-t

ADD REPLYlink modified 4.3 years ago by Istvan Albert ♦♦ 69k • written 6.0 years ago by Higherdefender90

Relevant post: How do experienced people look for full reference genomes?

ADD REPLYlink written 2.7 years ago by Malachi Griffith14k
20
6.2 years ago by
lh328k
United States
lh328k wrote:

The version used by the 1000 Genomes Project is recommended. The mitochondrial genome in the g1k version is rCRS, the most widely used one. The chromosomes and contigs are already concatenated, so you are less likely to make mistakes (people frequently concatenate all sequences, including different haplotypes from the same region).

We have seen a lot of complications caused by different chromosome names (chr1 vs. 1) or different ordering (chr2 before chr10 or after). It is true that which b37 version to use does not matter too much, but converging on something close to a standard would save everyone a lot of unnecessary work.
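
The naming and ordering differences described above are easy to check before comparing results built on different references. The following is a minimal sketch, not part of any tool mentioned in this thread: fasta_names is a hypothetical helper, and the two tiny FASTA files are made-up demo data rather than real genome sequence.

```shell
# Hypothetical helper: print the sequence names of a FASTA file, in file order.
# The name is taken to be the header text up to the first whitespace.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Made-up demo data (not real genome sequence).
demo_dir=$(mktemp -d)
printf '>chr1 demo\nACGT\n>chr2\nGGCC\n>chr10\nTTAA\n' > "$demo_dir/with_prefix.fa"
printf '>1\nACGT\n>10\nTTAA\n>2\nGGCC\n' > "$demo_dir/no_prefix.fa"

# Collect names on one line so the two conventions can be compared at a glance.
names_a=$(fasta_names "$demo_dir/with_prefix.fa" | tr '\n' ' ')
names_b=$(fasta_names "$demo_dir/no_prefix.fa" | tr '\n' ' ')
echo "with_prefix: $names_a"
echo "no_prefix:   $names_b"
```

Running fasta_names on your own reference and on the reference used to produce an existing alignment or variant file shows at once whether the names and the order agree. Note that renaming headers only changes labels; it does not make two different assemblies equivalent.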

ADD COMMENTlink written 6.2 years ago by lh328k
2

Using the g1k version is highly recommended.

ADD REPLYlink written 5.8 years ago by lh328k
1

The random and Un contigs are already in the g1k version. Usually you would not want to map to haplotypes, as you will lose most of the variants.
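
If you start from a UCSC-style multi-FASTA that still contains the alternate haplotypes, one way to leave them out is to filter on sequence name. This is a minimal sketch, assuming the haplotype sequences are the ones whose names contain _hap (as in UCSC hg19 names such as chr6_cox_hap2); drop_matching is a hypothetical helper and the input is made-up demo data.

```shell
# Hypothetical helper: print a multi-FASTA without the records whose name
# (header text up to the first whitespace) matches an extended regex.
# Usage: drop_matching REGEX FILE
drop_matching() {
    awk -v pat="$1" '
        /^>/ { keep = (substr($1, 2) !~ pat) }   # decide once per record
        keep { print }                            # header and sequence lines
    ' "$2"
}

# Made-up demo data (not real genome sequence).
demo_dir=$(mktemp -d)
printf '>chr6\nACGT\nACGT\n>chr6_cox_hap2\nTTTT\n>chrUn_gl000211\nGGCC\n' > "$demo_dir/demo.fa"

# Keep chr6 and the unplaced contig, drop the haplotype record.
filtered=$(drop_matching '_hap' "$demo_dir/demo.fa")
echo "$filtered"
```

The same helper with the pattern '^chrM$' would drop a sequence named chrM. Check the names in your own file first, since other sources may label these sequences differently.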

ADD REPLYlink written 5.8 years ago by lh328k

I'm very interested in this opinion, since we have moved from the hg18 reference that came with our SOLiD sequencer to an hg19 that we built manually by concatenating all chromosome chunks from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/. Did I understand you correctly that we would be less error-prone if we used the g1k reference genome rather than UCSC's? The main problem we see is how to deal efficiently with chrUn, random and haplotypes. chrUns should definitely be kept, but are you saying that random chunks and/or different haplotypes shouldn't be concatenated into that single multifasta file?

ADD REPLYlink written 5.8 years ago by Jorge Amigo9.4k

Because if you download the single hg19 file from UCSC at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit and convert it to FASTA using twoBitToFa, you end up with a multifasta file containing all chromosomes, including the haplotypes, random and chrUn sequences. Since g1k seems to include only the latter unmapped supercontigs, is there any reason or recommendation to leave the rest of the files aside?

ADD REPLYlink written 5.8 years ago by Jorge Amigo9.4k

thanks a lot for the advice

ADD REPLYlink written 5.8 years ago by Jorge Amigo9.4k
13
6.6 years ago by
France
Pierre Lindenbaum89k wrote:

You can get the fasta sequences for each chromosome here (human genome build 37): http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes

ADD COMMENTlink written 6.6 years ago by Pierre Lindenbaum89k
2

No, you can just concatenate those files into a single file.

ADD REPLYlink written 6.6 years ago by Pierre Lindenbaum89k
2

I used $ cat file1 file2 filen > hg18.mfa to create the multifasta file, but I was not sure about the ordering of chrX, chrY and chrM. My current order is chr1-22, chrX, chrY and chrM. Will this ordering have any effect downstream in the analysis? Is there a standard order that differs from this? Thanks.

ADD REPLYlink written 6.6 years ago by Biomed4.1k

Thank you for your help. I elaborated a little on your initial input.

ADD REPLYlink written 6.6 years ago by Biomed4.1k

The files come in one-file-per-chromosome format, and I want to use them as one multifasta file as input to BWA. Do I simply concatenate these chr FASTA files into one big FASTA file to get the multifasta file? Or is there something else to it? Any ideas?

ADD REPLYlink written 6.6 years ago by Biomed4.1k

Thank you. I guess I will have more questions on this as I go, but this site and people like you are a great help.

ADD REPLYlink written 6.6 years ago by Biomed4.1k

Will this ordering have any effect downstream in the analysis?: no

ADD REPLYlink written 6.6 years ago by Pierre Lindenbaum89k

"Chromosome M" is the mitochondrial DNA sequence. Depending on the analysis you're doing you should not include it.

ADD REPLYlink written 6.6 years ago by Paulo Nuin3.7k
8
6.6 years ago by
Biomed4.1k
Bethesda, MD, USA
Biomed4.1k wrote:

You can use an rsync command to download the entire directory:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

This directory holds the FASTA files, one file per chromosome, in gzip-compressed (.gz) format, plus other useful files for the human reference genome dataset. README from the original web site:

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/chromosomes/README.txt

On Unix, gunzip the files, then concatenate them:

$ cat file1.fa file2.fa ... > multifastafile.fa

This will get you the reference human genome as a single multifasta file.

Also see this discussion of the very same topic: http://seqanswers.com/forums/showthread.php?t=5996
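
The concatenation step can be sketched so that the chromosome order is explicit rather than left to the shell. This is only an illustration under stated assumptions: the unpacked files are assumed to be named chr1.fa, chr2.fa and so on, the order 1-22, X, Y, M is the one mentioned in the comments above and not an official standard, concat_chromosomes is a hypothetical helper, and the demo files are made-up data.

```shell
# Hypothetical helper: concatenate per-chromosome FASTA files in a fixed,
# explicit order (1-22, X, Y, M) instead of relying on shell glob order,
# which would sort chr10 before chr2.
# Usage: concat_chromosomes DIR OUT
concat_chromosomes() {
    : > "$2"                                   # start with an empty output file
    for c in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M; do
        if [ -f "$1/chr$c.fa" ]; then
            cat "$1/chr$c.fa" >> "$2"
        else
            echo "warning: $1/chr$c.fa not found, skipping" >&2
        fi
    done
}

# Made-up demo data: only a few tiny "chromosomes".
demo_dir=$(mktemp -d)
for c in 1 2 10 X M; do
    printf '>chr%s\nACGT\n' "$c" > "$demo_dir/chr$c.fa"
done

concat_chromosomes "$demo_dir" "$demo_dir/demo.mfa"
order=$(grep '^>' "$demo_dir/demo.mfa" | tr -d '>' | tr '\n' ' ')
echo "order in output: $order"
```

With real data you would point the helper at the directory holding the unpacked chromosome files. The warnings on standard error make it obvious when an expected file is missing, which a bare cat chr*.fa would silently hide.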

ADD COMMENTlink modified 4.3 years ago by Istvan Albert ♦♦ 69k • written 6.6 years ago by Biomed4.1k
6
6.2 years ago by
Near Boston, MA
Jonathan Manning590 wrote:

Just for the record (since I'm always searching for these links myself)...

The canonical source for GRCh37, which hg19 is based upon (and should be identical to), is:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/

1000 Genomes also has a pre-concatenated multi-fasta reference suitable for use with most next-gen aligners out of the box at:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/

This file does have an "alternate" chrM, and includes all the "random" contigs. There's a README explaining the method of construction in that folder. YMMV.

For those in Europe (they now have a US mirror, too), try Ensembl for a local snapshot of the reference assembly:

ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

So you can anticipate the download time and storage space required, the total size for each of these variations is ~3GB uncompressed, ~750MB compressed.
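
These figures are consistent with simple arithmetic. As a rough sketch, assuming a genome of about 3.1 billion bases (an approximation, not an exact count), one byte per base in uncompressed FASTA, and 2 bits per base as a crude stand-in for what compression can achieve on four-letter DNA:

```shell
# Back-of-the-envelope size estimate; all inputs are approximations.
bases=3100000000                      # assumed genome length in bases
uncompressed_mb=$((bases / 1000000))  # 1 byte per base, ignoring headers and newlines
packed_mb=$((bases / 4 / 1000000))    # 2 bits per base = 4 bases per byte
echo "uncompressed: ~${uncompressed_mb} MB"
echo "2 bits/base:  ~${packed_mb} MB"
```

Real compressed sizes will differ somewhat, since gzip is a general-purpose compressor rather than a 2-bit packer.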

ADD COMMENTlink modified 4.3 years ago by Istvan Albert ♦♦ 69k • written 6.2 years ago by Jonathan Manning590
1
4.0 years ago by
Tulip Nandu30
United States
Tulip Nandu30 wrote:

I would recommend downloading from the Ensembl database. Here is the link: http://www.ensembl.org/info/data/ftp/index.html

ADD COMMENTlink written 4.0 years ago by Tulip Nandu30
1
15 days ago by
Cambridge
Hajk-Georg Drost80 wrote:

I know that this question is already 6 years old, but I hope that my answer might be useful to others anyway.

I implemented a standardized way to automate the genome retrieval process in R (see biomartr package).

To retrieve the human reference genome from several database sources one can simply type:

# download human reference genome from NCBI RefSeq
biomartr::getGenome(db  = "refseq", organism = "Homo sapiens")

or

# download human reference genome from NCBI Genbank
biomartr::getGenome(db  = "genbank", organism = "Homo sapiens")

or

# download human reference genome from ENSEMBL
biomartr::getGenome(db  = "ensembl", organism = "Homo sapiens")

This way, users can use the same command to retrieve reference genomes from different databases. Each database has its own custom gene identifier and thus, it should always be clear which reference genome has been used to perform subsequent analyses.

For more detailed information please consult the Genomic Sequence Retrieval vignette.

The getGenome() function will then generate a log file that stores the following information:

File Name: Homo_sapiens_genomic_refseq.fna.gz

Organism Name: Homo_sapiens

Database: NCBI refseq

URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.35_GRCh38.p9/GCF_000001405.35_GRCh38.p9_genomic.fna.gz

Download_Date: Sat Oct 22 12:41:07 2016

refseq_category: reference

genome assembly_accession: GCF_000001405.35

bioproject: PRJNA168

biosample: NA

taxid: 9606

infraspecific_name: NA

version_status: latest

release_type: Patch

genome_rep: Full

seq_rel_date: 2016-09-26

submitter: Genome Reference Consortium

Thus, you will always know with which reference genome and with which genome version you are working.

I hope that this will help to improve the reproducibility of many studies.

Alternatively, the biomartr package also provides functions for retrieving corresponding coding sequence - getCDS(), protein sequence - getProteome(), and annotation files - getGFF().

ADD COMMENTlink written 15 days ago by Hajk-Georg Drost80