Question: Which is a good source to download a reference genome?
2.7 years ago by
Arindam Ghosh300
India
Arindam Ghosh300 wrote:

Which one of the following is a good source to download a reference genome to be used for RNA-seq analysis?

  • UCSC
  • NCBI/GRC
  • iGenomes
  • Ensembl

What are the things to be kept in mind while downloading one?

2.7 years ago by
European Union
piechota.marcin70 wrote:

Selecting a reference genome is not easy, because you are also choosing an ecosystem of annotations and additional information. First, start with this paper by Zhao & Zhang.

Personally, I prefer UCSC for human, just because of the ENCODE annotations. For other species I prefer Ensembl, because it is the easiest to use (one page with all downloads, including .fa, .gtf and .gff, plus an easy-to-use data warehouse, BioMart).


FYI, Ensembl has ENCODE data, including GENCODE annotations. The advantage of Ensembl over other resources is that the data is better organized and integrated, and the combination of a local MySQL database and the Perl API is very powerful.

Reply written 2.7 years ago by Jean-Karim Heriche
2.7 years ago by
simon.vanheeringen200 wrote:

No matter what source you choose, try genomepy to download your genomes. It will include chromosome sizes, a BED file with gaps and, optionally, gene annotation. It works for Ensembl, UCSC and NCBI. Automated, scriptable and reproducible!
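A minimal sketch of what a scripted download might look like through genomepy's Python interface. The wrapper name `install_reference` and its defaults (hg38 from UCSC, with annotation) are illustrative assumptions, not part of the answer above; genomepy is a third-party package that has to be installed separately, and the call needs network access.

```python
def install_reference(name="hg38", provider="UCSC", with_annotation=True):
    """Download a genome (and optionally its gene annotation) with genomepy.

    `install_reference` is a hypothetical helper for illustration only.
    """
    # Imported lazily so that this file can be loaded without genomepy.
    import genomepy

    # install_genome fetches the FASTA and supporting files into
    # genomepy's configured genomes directory.
    return genomepy.install_genome(
        name, provider=provider, annotation=with_annotation
    )
```

Calling `install_reference("GRCh38.p13", provider="Ensembl")` would be the analogous request for an Ensembl assembly, assuming that assembly name is available from the provider.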


2.7 years ago by
ATpoint36k
Germany
ATpoint36k wrote:

As the reference genome comes from the GRC, it should not matter where you get your genome from. I assume you are working with human. What I do is the following: download the entire genome, meaning the primary chromosomes plus the unplaced and random contigs, but exclude the alternative haplotypes for standard analysis. In the case of human hg38, download hg38.fa.gz and the chromSizes file from here, decompress the FASTA, index it with samtools faidx, and then use this command to get your final reference genome:

cut -f1 hg38.chrom.sizes | grep -v '_alt' | xargs samtools faidx hg38.fa > hg38_noALT.fa

This will exclude the alternative haplotypes. From there on, index the fasta with the downstream tool of choice.
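For anyone who wants to see what that filtering step does without samtools, here is a rough pure-Python equivalent. It is a swapped-in technique, not the command above: it streams the FASTA and skips every record whose name ends in `_alt`. The function name `drop_alt_contigs`, the suffix check and the toy sequences are assumptions made for illustration.

```python
import io


def drop_alt_contigs(fasta_in, fasta_out, suffix="_alt"):
    """Copy FASTA records from fasta_in to fasta_out, skipping records
    whose name ends with `suffix`. Returns the list of kept names."""
    kept = []
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token.
            header = line[1:].split()
            name = header[0] if header else ""
            keep = not name.endswith(suffix)
            if keep:
                kept.append(name)
        if keep:
            fasta_out.write(line)
    return kept


# Toy input with made-up sequences, only to show the behaviour.
demo = io.StringIO(
    ">chr1\nACGT\n>chr1_KI270762v1_alt\nGGCC\n>chrUn_GL000220v1\nTTAA\n"
)
out = io.StringIO()
kept_names = drop_alt_contigs(demo, out)
```

On a real genome you would pass open file handles instead of the `StringIO` objects, for example `open("hg38.fa")` and `open("hg38_noALT.fa", "w")`.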


Yes, though the sequences come from GRC, I guess the annotations are done by different groups. So is there any difference?

Reply written 2.7 years ago by Arindam Ghosh

There are some, for example the contig naming scheme. hg19 from UCSC names its chromosomes chr1, chr2, and so on, while GRCh37 from Ensembl names them 1, 2, and so on.

Reply written 2.7 years ago by piechota.marcin
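To make the naming difference concrete, here is a small sketch that maps UCSC-style names to Ensembl-style names for the primary human chromosomes. The function name `ucsc_to_ensembl` is hypothetical, and the rule deliberately covers only chr1 to chr22, chrX, chrY and chrM; unplaced and random contigs follow different conventions and are returned as `None` rather than guessed. Renaming only changes the labels and does not guarantee that the underlying sequences are identical.

```python
def ucsc_to_ensembl(name):
    """Map a UCSC-style primary human chromosome name to Ensembl style.

    Returns None for any name this simple rule does not cover.
    """
    if name == "chrM":
        # Ensembl labels the mitochondrial genome MT.
        return "MT"
    core = name[3:] if name.startswith("chr") else ""
    if core in {"X", "Y"} or (core.isdigit() and 1 <= int(core) <= 22):
        return core
    return None


# A few sample names, including one that is intentionally not mapped.
examples = {n: ucsc_to_ensembl(n) for n in ["chr1", "chrX", "chrM", "chrUn_gl000220"]}
```

Returning `None` instead of stripping the prefix blindly makes it obvious which contigs still need a proper lookup table.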

Yes, the annotations can differ, and the choice can affect the outcome. Have a look here. Which one is better is an ongoing discussion, probably with the usual answer: "it depends on your task". I use GENCODE, simply because some genes I was interested in were not included in RefSeq.

Reply written 2.7 years ago by ATpoint