Which is a good source to download a reference genome?
3
0
6.5 years ago
Arindam Ghosh ▴ 510

Which one of the following is a good source to download a reference genome to be used for RNA-seq analysis?

  • UCSC
  • NCBI/GRC
  • iGenomes
  • Ensembl

What should I keep in mind when downloading one?

rna-seq genome • 2.9k views
3
6.5 years ago

Choosing a reference genome is not easy, because you are also choosing an ecosystem of annotations and additional information. Start with this paper by Zhao & Zhang.

Personally, I prefer UCSC for human, simply because of the ENCODE annotations. For other species I prefer Ensembl because it is the easiest to use: one page with all downloads (.fa, .gtf, .gff) and an easy-to-use data warehouse, BioMart.

2

FYI, Ensembl has ENCODE data, including the GENCODE annotations. The advantage of Ensembl over other resources is that the data is better organized and integrated, and the combination of a local MySQL database and the Perl API is very powerful.

2
6.5 years ago

No matter which source you choose, try genomepy to download your genomes. It will include chromosome sizes, a BED file with gaps and, optionally, the gene annotation. It works for Ensembl, UCSC and NCBI. Automated, scriptable and reproducible!

genomepy example
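
For illustration, a download along these lines could be scripted roughly as follows. This is only a sketch: the genome name and provider are example choices, the RUN_DOWNLOAD switch is a hypothetical guard added here, and the options shown assume a recent genomepy release (check genomepy install --help for your version).

```shell
# Example choices (assumptions): hg38 from UCSC, with gene annotation.
genome="hg38"
provider="UCSC"

# Show the command that would be run.
echo "genomepy install $genome --provider $provider --annotation"

# RUN_DOWNLOAD is a hypothetical switch for this sketch: the download only
# starts when it is set to 1 and genomepy is actually installed.
if [ "${RUN_DOWNLOAD:-0}" = "1" ] && command -v genomepy >/dev/null 2>&1; then
    genomepy install "$genome" --provider "$provider" --annotation
fi
```

Keeping the genome name and provider in variables makes it easy to record exactly what was downloaded alongside the analysis.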

1
6.5 years ago
ATpoint 81k

Since the reference genome itself comes from the GRC, it should not matter where you download it from. I assume you are working with human. What I do is the following: download the entire genome, i.e. the primary chromosomes plus the unplaced and random contigs, but exclude the alternative haplotypes for standard analysis. For human hg38, download hg38.fa.gz and the chromSizes file from here, decompress the FASTA, index it with samtools faidx, and then use this command to get your final reference genome:

grep -v '_alt' hg38.chrom.sizes | cut -f1 | xargs samtools faidx hg38.fa > hg38_noALT.fa

This will exclude the alternative haplotypes. From there on, index the fasta with the downstream tool of choice.
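
If samtools is not at hand, the same kind of filtering can be sketched with plain awk. Note that this is a different technique from the samtools faidx command: it streams through the FASTA and drops every record whose name ends in _alt (a slightly stricter match than grep -v '_alt'). The toy file below is invented for the example and only mimics UCSC-style sequence names.

```shell
# Toy FASTA standing in for hg38.fa (sequences are invented).
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>chr1
ACGTACGT
ACGT
>chr1_KI270706v1_random
GGGG
>chr1_KI270762v1_alt
TTTT
TT
>chrUn_KI270302v1
CCCC
EOF

# On every header line decide whether to keep the record;
# then print each line for as long as "keep" is set.
awk '/^>/ { keep = ($1 !~ /_alt$/) } keep' "$workdir/toy.fa" > "$workdir/toy_noALT.fa"

# List the sequence names that survived the filter.
grep '^>' "$workdir/toy_noALT.fa"
```

The random and unplaced contigs stay in the output; only the record with the _alt suffix is dropped.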

0

Yes, though the sequences come from GRC, I guess the annotations are done by different groups. So is there any difference?

0

There are some, for example the contig naming scheme. hg19 from UCSC uses the chr1, chr2, ... scheme, while GRCh37 from Ensembl uses 1, 2, ...
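
A quick way to catch such a mismatch before aligning is to compare the sequence names in the FASTA with those in the annotation. A minimal sketch, using invented toy files (a UCSC-style FASTA and an Ensembl-style GTF); the file names are placeholders:

```shell
workdir=$(mktemp -d)

# Toy inputs (invented): FASTA with chr-prefixed names, GTF without the prefix.
printf '>chr1\nACGT\n>chr2\nACGT\n' > "$workdir/genome.fa"
printf '1\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n2\ttoy\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$workdir/genes.gtf"

# Sequence names in the FASTA: first word of each header, without the ">".
grep '^>' "$workdir/genome.fa" | sed 's/^>//; s/[[:space:]].*//' | sort -u > "$workdir/fa.names"

# Sequence names in the GTF: column 1 of all non-comment lines.
grep -v '^#' "$workdir/genes.gtf" | cut -f1 | sort -u > "$workdir/gtf.names"

# Count names present in both files; zero means the naming schemes do not match.
shared=$(comm -12 "$workdir/fa.names" "$workdir/gtf.names" | wc -l | tr -d ' ')
echo "shared sequence names: $shared"
```

With these toy files no name is shared, which is exactly the situation you get when mixing a UCSC genome with an Ensembl annotation.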

0

Yes, the annotations can differ, and the choice can impact the outcome. Have a look here. Which one is better is an ongoing discussion, probably with the usual answer: "it depends on your task". I use GENCODE, simply because some genes I was interested in were not included in RefSeq.

