Question: Where Can I Download Human Reference Genome In Fasta Format? Hgref.Fa File
8
3.8 years ago by
Biomed3.0k
Bethesda, MD, USA
Biomed3.0k wrote:

Is there a better way of downloading the human genome reference sequence in FASTA format than downloading it from the UCSC site? The BWA protocol asks for an index to be created from a multi-FASTA file of the human reference genome, so I want to get this file. Thanks.

[Edited for clarification in response to answers and comments:]

ADD COMMENTlink modified 13 months ago by Tulip Nandu0 • written 3.8 years ago by Biomed3.0k
7

Please consider taking minimal effort finding the answer yourself before posting a question.

ADD REPLYlink written 3.8 years ago by Michael Schubert5.9k

As a further extension to this question, see this related question: http://biostar.stackexchange.com/questions/6319/input-files-in-fasta-sequence-are-required-for-comparing-various-dna-sequencing-t

ADD REPLYlink modified 18 months ago by Istvan Albert ♦♦ 39k • written 3.1 years ago by Higherdefender90
13
3.4 years ago by
lh320k
lh320k wrote:

The version used by the 1000 Genomes Project is recommended. The mitochondrial genome in the g1k version is rCRS, the most widely used one. The chromosomes and contigs are already concatenated, so you are less likely to make mistakes (people frequently concatenate all sequences, including different haplotypes from the same region).

We have seen a lot of complications caused by different chromosome names (chr1 vs. 1) or different ordering (chr2 before or after chr10). It is true that which b37 version you use does not matter too much, but converging on something close to a standard would save everyone a lot of unnecessary work.
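Before mixing files from different sources, it can help to check which naming and ordering a given reference actually uses. The following is only a minimal sketch: the function names `fasta_names` and `uses_chr_prefix` are made up for illustration, and it assumes an uncompressed FASTA file whose sequence name is the first whitespace-delimited token of each header line.

```shell
# List the sequence names of a FASTA file in file order, one per line.
# Only the first whitespace-delimited token of each header is printed.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Succeed if any sequence name starts with "chr" (UCSC-style naming).
uses_chr_prefix() {
    fasta_names "$1" | grep -q '^chr'
}
```

For example, `fasta_names reference.fa | head -30` shows at a glance whether chr2 comes before or after chr10, and `uses_chr_prefix reference.fa && echo "chr-style names"` reports the naming style (`reference.fa` is a placeholder file name).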

ADD COMMENTlink written 3.4 years ago by lh320k
2

Using the g1k version is highly recommended.

ADD REPLYlink written 3.0 years ago by lh320k
1

The random and Un sequences are already in the g1k version. Usually you would not want to map to the haplotypes, as you will lose most of the variants.

ADD REPLYlink written 3.0 years ago by lh320k

I'm very interested in this opinion, since we have moved from the hg18 reference that came with our SOLiD sequencer to an hg19 reference we built manually by concatenating all chromosome chunks from hgdownload.cse.ucsc.edu/goldenPath/hg19/… Did I understand you correctly that we would be less error-prone using the g1k reference genome rather than UCSC's? The main problem we see is how to deal efficiently with the chrUn, random and haplotype sequences. The chrUn sequences should definitely be kept, but are you saying that random chunks and/or different haplotypes shouldn't be concatenated into that single multifasta file?

ADD REPLYlink written 3.0 years ago by Jorge Amigo6.2k

Because if you download the single hg19 file from UCSC at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit and convert it to FASTA using twoBitToFa, you end up with a multifasta file containing all chromosomes, including the haplotype, random and chrUn sequences. Since g1k seems to include only the latter, unmapped supercontigs, is there any reason or recommendation to leave the rest out?
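If one did want to drop the haplotype sequences from such a multifasta file while keeping the random and chrUn ones, a filter along these lines might do. This is a sketch under an assumption that should be checked against the actual headers: it treats any record whose header contains `_hap` as a haplotype sequence. The function name `drop_haplotypes` is hypothetical.

```shell
# Copy a multi-FASTA from stdin to stdout, skipping every record whose
# header line contains "_hap" (an assumed naming pattern, see above).
# "keep" is re-evaluated at each header and applies to the sequence
# lines that follow it.
drop_haplotypes() {
    awk '/^>/ { keep = ($0 !~ /_hap/) } keep'
}
```

Usage would look like `drop_haplotypes < hg19.fa > hg19.nohap.fa`, where both file names are placeholders. Comparing `grep -c '^>'` on the input and the output shows how many records were removed.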

ADD REPLYlink written 3.0 years ago by Jorge Amigo6.2k

Thanks a lot for the advice.

ADD REPLYlink written 3.0 years ago by Jorge Amigo6.2k
10
3.8 years ago by
France
Pierre Lindenbaum58k wrote:

You can get the fasta sequences for each chromosome here (human genome build 37): http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes

ADD COMMENTlink written 3.8 years ago by Pierre Lindenbaum58k
2

I used $ cat file1 file2 filen > hg18.mfa to create the multifasta file, but I was not sure about the ordering of chrX, chrY and chrM. My current order is chr1-22, chrX, chrY and chrM. Will this ordering have any effect downstream in the analysis? Is there a standard order that differs from this? Thanks.

ADD REPLYlink written 3.8 years ago by Biomed3.0k
1

No, you can just concatenate those files into one single file.

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

Thank you for your help. I elaborated a little on your initial input.

ADD REPLYlink written 3.8 years ago by Biomed3.0k

The files come in a one-file-per-chromosome format, and I want to use them as a single multifasta file as input to BWA. Do I simply concatenate these per-chromosome FASTA files into one big FASTA file to get the multifasta file, or is there something else to it? Any ideas?

ADD REPLYlink written 3.8 years ago by Biomed3.0k

Thank you. I guess I will have more questions on this as I go, but this site and people like you are a great help.

ADD REPLYlink written 3.8 years ago by Biomed3.0k

"Will this ordering have any effect downstream in the analysis?": no.

ADD REPLYlink written 3.8 years ago by Pierre Lindenbaum58k

"Chromosome M" is the mitochondrial DNA sequence. Depending on the analysis you're doing, you may want to leave it out.

ADD REPLYlink written 3.8 years ago by Paulo Nuin3.5k
4
3.4 years ago by
Near Boston, MA
Jonathan Manning550 wrote:

Just for the record (since I'm always searching for these links myself)...

The canonical source for GRCh37, which hg19 is based upon (and should be identical to), is:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37/Primary_Assembly/assembled_chromosomes/FASTA/

1000 Genomes also has a pre-concatenated multi-fasta reference suitable for use with most next-gen aligners out of the box at:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/

This file does have an "alternate" chrM, and includes all the "random" contigs. There's a README explaining the method of construction in that folder. YMMV.

For those in Europe (they now have a US mirror, too), try Ensembl for a local snapshot of the reference assembly:

ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/

To help you anticipate the download time and storage space required: the total size of each of these variants is ~3 GB uncompressed, ~750 MB compressed.

ADD COMMENTlink modified 18 months ago by Istvan Albert ♦♦ 39k • written 3.4 years ago by Jonathan Manning550
3
3.8 years ago by
Biomed3.0k
Bethesda, MD, USA
Biomed3.0k wrote:

Using an rsync command to download the entire directory:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

This directory holds all the FASTA files, one gzipped (.gz) file per chromosome, plus other useful files for the human reference genome dataset. Original web site:

ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/chromosomes/README.txt

On Unix, gunzip the files; then

$ cat file1.fa file2.fa etc > multifastafile.fa

will get you the human reference genome as a single multifasta file.
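The gunzip-and-cat steps can also be wrapped so that the chromosome order is written down explicitly rather than left to shell globbing (where chr10 sorts before chr2). What follows is a rough sketch, not a definitive recipe: the function name `build_multifasta` is made up, the files are assumed to be named `chrN.fa.gz` as in the UCSC chromosomes directory, and the order chr1 to chr22, chrX, chrY, chrM is simply the one mentioned in this thread, not an official standard.

```shell
# Concatenate per-chromosome files (chrN.fa.gz) from directory $1 into
# the single multi-FASTA $2, in the order chr1..chr22, chrX, chrY, chrM.
# Files that are missing are skipped with a warning on stderr.
build_multifasta() {
    dir=$1
    out=$2
    : > "$out"    # create or truncate the output file
    for c in $(seq 1 22) X Y M; do
        f="$dir/chr$c.fa.gz"
        if [ -f "$f" ]; then
            gunzip -c "$f" >> "$out"    # decompress to stdout and append
        else
            echo "warning: $f not found, skipped" >&2
        fi
    done
}
```

Called as `build_multifasta . hg19.fa` from the download directory, it leaves the .gz files in place and writes the combined file, which could then be handed to `bwa index`. The random, chrUn and haplotype files in that directory are deliberately not picked up by this loop.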

Also see this discussion of the very same topic: http://seqanswers.com/forums/showthread.php?t=5996

ADD COMMENTlink modified 18 months ago by Istvan Albert ♦♦ 39k • written 3.8 years ago by Biomed3.0k
0
13 months ago by
Tulip Nandu0 wrote:

I would recommend downloading from the Ensembl database. Here is the link: http://www.ensembl.org/info/data/ftp/index.html

ADD COMMENTlink written 13 months ago by Tulip Nandu0