Difference between the Fasta files from UCSC and Gencode/Ensembl
6.4 years ago
Ar ★ 1.1k

I would like to know whether there is any difference between the genome build FASTA files from UCSC and those from GENCODE/Ensembl. For example, does GRCh38/hg38 from UCSC differ from the GENCODE/Ensembl version, and likewise for mm10/GRCm38?

If not, why do the genomic coordinates of genes differ? For example, why are the MECP2 coordinates for GRCh38/hg38 chrX:154,021,813-154,097,731 in UCSC but Chromosome X: 154,021,573-154,137,103 in Ensembl?

Can I use a FASTA file downloaded from NCBI/UCSC together with an annotation file downloaded from GENCODE for alignment and other downstream bioinformatics analyses?

Thanks!

genome-build RNA-Seq ChIP-Seq • 11k views

Ensembl and UCSC each do some de novo gene prediction, so the longest transcript identified for a gene may differ slightly between them. See this and this.


Thanks a lot. That was helpful!

6.4 years ago
Denise CS ★ 5.2k

The genome is the same regardless of the genome browser you use. The Genome Reference Consortium (GRC) maintains the human (and other) assemblies, so GRCh38 = Genome Reference Consortium Human Build 38. However, that sequence can exist in different statuses and versions. In Ensembl, you have GCA_000001405.22 (this is the INSDC assembly ID), whereas in UCSC you have GCA_000001405.15. The latest version is GCA_000001405.23 and is available on the GRC site. Usually the different versions of GCA_000001405 have to do with the addition of patches (those that fix the sequence and the novel patches; see "What types of patches are there").

Although the assembly is essentially the same (GCA_000001405, but remember the patches), the mode of annotation is completely different between UCSC and Ensembl. This explains what you've seen for MECP2. UCSC has it as X:154,021,573-154,097,755 (76,183 bp), whereas Ensembl has it as X:154,021,573-154,137,103. It is the start of the gene that differs: the gene is on the reverse strand, so its 5' end is the higher coordinate. Transcript MECP2-015 (i.e. ENST00000631210.1) is the culprit for the extended 5' end. This transcript was manually annotated by HAVANA and incorporated by Ensembl during the merge between the automatic and manual annotation pipelines, which gives rise to the GENCODE set of genes.

So you can use the FASTA file downloaded from NCBI/UCSC and the annotation file downloaded from GENCODE, but do make sure the version of the FASTA sequence is the same version the annotation was carried out on. If you download the FASTA from Ensembl, you need not worry; the version will be the same.
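The MECP2 numbers quoted in this answer can be checked with a little arithmetic. The sketch below assumes the spans are 1-based and inclusive (an assumption about how the browsers display them, though it does reproduce the 76,183 bp figure); the helper name `span_length` is made up for illustration.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive genomic interval."""
    return end - start + 1

# Coordinates as quoted in the answer above (chrX, GRCh38).
ucsc = (154_021_573, 154_097_755)
ensembl = (154_021_573, 154_137_103)

ucsc_len = span_length(*ucsc)
ensembl_len = span_length(*ensembl)

# MECP2 is on the reverse strand, so the 5' end is the larger coordinate.
# The two spans share the smaller coordinate (the 3' end) and differ only
# at the larger one, which is where the extra transcript extends the gene.
five_prime_extension = ensembl[1] - ucsc[1]

print(ucsc_len, ensembl_len, five_prime_extension)
```

The smaller coordinate is identical in both sources, so the whole discrepancy sits at the 5' end, consistent with one additional upstream-extending transcript in the Ensembl/GENCODE set.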


Thanks and that was an awesome answer! Actually you made my stupid question look good! :D


Hello Denise,

According to your answer, can I then use as the reference genome the one located here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz ; and use the annotation files provided by Ensembl here: ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/ (transcripts FASTA) and ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/ (GTF)?

Or do I need to select the GTF/transcript files that match the GCA_000001405.15 reference genome? I ask because the header of the GTF file in the "current" folder shows genome version 25, and I want to use reference version 15:

#!genome-build GRCh38.p10
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.25
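For what it's worth, header lines like these can be read programmatically and compared with the accession of the FASTA. This is only a minimal sketch: the function names `gtf_pragmas` and `split_accession` are made up for illustration, and it assumes the `#!key value` layout shown above. A differing version suffix is a prompt to check which patches are involved, not automatically an error.

```python
def gtf_pragmas(lines):
    """Collect leading '#!key value' header lines of a GTF into a dict."""
    pragmas = {}
    for line in lines:
        if not line.startswith("#!"):
            break  # stop at the first non-header line
        key, _, value = line[2:].strip().partition(" ")
        pragmas[key] = value
    return pragmas

def split_accession(accession):
    """Split e.g. 'NCBI:GCA_000001405.25' into ('GCA_000001405', 25)."""
    bare = accession.split(":")[-1]          # drop an optional 'NCBI:' prefix
    base, _, version = bare.partition(".")
    return base, int(version)

# The header quoted in this post.
header = [
    "#!genome-build GRCh38.p10",
    "#!genome-version GRCh38",
    "#!genome-date 2013-12",
    "#!genome-build-accession NCBI:GCA_000001405.25",
]

gtf_base, gtf_version = split_accession(gtf_pragmas(header)["genome-build-accession"])
fasta_base, fasta_version = split_accession("GCA_000001405.15")

same_assembly = gtf_base == fasta_base       # same GCA_ base accession
same_version = gtf_version == fasta_version  # same assembly version suffix
print(same_assembly, same_version)
```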


On the other hand, I guess I will need to modify the chromosome names in the GTF and add the "chr" prefix, right? Thank you in advance :)
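A rough sketch of that renaming step, assuming a human GTF where the primary chromosomes are named 1-22, X, Y and MT on the Ensembl side and chr1-chr22, chrX, chrY and chrM on the UCSC side. The function names are hypothetical, and anything outside that set (scaffolds, patches, alternate loci) is deliberately left unmapped because those names have to be matched by hand.

```python
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}

def ensembl_to_ucsc(name):
    """Return the UCSC-style name for a primary chromosome, else None."""
    if name in PRIMARY:
        return "chr" + name
    if name == "MT":
        return "chrM"  # mitochondrion is named differently, not just prefixed
    return None        # scaffold/patch/alt: needs a manual lookup table

def rename_gtf_line(line):
    """Rewrite column 1 of a GTF feature line; None if it cannot be mapped."""
    if line.startswith("#"):
        return line  # header and comment lines pass through unchanged
    fields = line.split("\t")
    new_name = ensembl_to_ucsc(fields[0])
    if new_name is None:
        return None
    fields[0] = new_name
    return "\t".join(fields)

# Illustrative feature line; the gene_id value is a placeholder.
example = 'X\thavana\tgene\t154021573\t154137103\t.\t-\t.\tgene_id "EXAMPLE";'
print(rename_gtf_line(example))
```

Returning `None` for unmapped names makes it easy to count how many features were dropped, rather than silently writing names the FASTA does not contain.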

6.1 years ago
apa@stowers ▴ 580

Excellent question! We regularly use modified Ensembl annotations with UCSC genome builds. As long as you ensure the underlying assemblies are the same, you are MOSTLY going to be OK. But there are a few key things to watch out for:

1. UCSC names the chromosomes differently from Ensembl, mainly just adding "chr", but for scaffolds it is more complex, and names must be matched manually.

2. There can be scaffolds that Ensembl incorporates into its build, but UCSC does not, or vice versa. Mostly this is patches or alternate assemblies. However, Ensembl will annotate all the genes from all sequences it incorporates, so you might get dozens to hundreds of Ensembl genes which cannot be mapped to UCSC, because UCSC didn't include those sequences.

3. Watch out for the mitochondrion, which often is taken from somewhere else. I know this is a problem with hg19 -- Ensembl and UCSC use slightly different sequences -- but I haven't seen others with this issue.

4. For some organisms, most notably C. elegans, finding an assembly that was used by both UCSC and Ensembl is hard. The two institutions basically interleave their genome versions.

5. As far as I know, UCSC does not import other providers' gene coordinates. UCSC takes cDNA sequences and does the mapping themselves. So BLAT might have a different opinion of some transcripts than Ensembl, including not mapping a transcript at all, or mapping it to locations that Ensembl did not. This does not invalidate Ensembl's coords -- we still use them as-is with virtually no issues -- but just be aware, there can be a few discrepancies here and there.

6. For some older UCSC builds, scaffolds are reassembled into "random" chromosomes by stitching them together with 50kb N linkers. Ensembl leaves scaffolds as-is, so, in addition to changing the chrom names in the Ensembl GTF, the random-chrom offsets must also be added. Usually the offsets are in UCSC's ctgPos.txt file, but this doesn't always exist. UCSC doesn't reassemble scaffolds nowadays, but for mm9 / hg19 and before, this is an issue when importing Ensembl.
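The offset step in point 6 could look roughly like the sketch below. It assumes a ctgPos-style table with tab-separated columns contig, size, chrom, chromStart, chromEnd, where chromStart is a 0-based offset (check the table's .sql schema for your build), that scaffolds are placed in forward orientation, and that the scaffold names have already been matched to the names in the GTF. The scaffold names and sizes here are invented purely for illustration.

```python
def load_placements(lines):
    """Parse ctgPos-style rows into {contig: (chrom, chrom_start)}."""
    placements = {}
    for line in lines:
        contig, _size, chrom, chrom_start, _chrom_end = line.rstrip("\n").split("\t")
        placements[contig] = (chrom, int(chrom_start))
    return placements

def shift_feature(seqname, start, end, placements):
    """Move 1-based scaffold coordinates onto the stitched chromosome.

    Adding a 0-based chromStart to a 1-based position gives a 1-based
    position on the stitched chromosome. Returns None if the scaffold
    has no known placement.
    """
    if seqname not in placements:
        return None
    chrom, offset = placements[seqname]
    return chrom, start + offset, end + offset

# Hypothetical table: two scaffolds joined by a 50 kb N linker.
table = [
    "scafA\t1000\tchr1_random\t0\t1000",
    "scafB\t500\tchr1_random\t51000\t51500",
]
placements = load_placements(table)
print(shift_feature("scafB", 10, 20, placements))
```

Features on scaffolds missing from the table come back as `None`, which corresponds to the unmappable genes described in point 2.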


Hi!

Current UCSC Genome Browser staff member here. For number 5, we do actually import coordinates for some gene tracks, such as GENCODE for hg38 and mm10, or the Ensembl Genes tracks for assemblies other than those two. You are correct, though, that for a long time we realigned RNA sequences provided by RefSeq to produce what was formerly known as our "RefSeq Genes" track. However, we recently released an "NCBI RefSeq" track that is based entirely on coordinates and alignments provided by the RefSeq group. You can read more about it on our website: https://genome.ucsc.edu/goldenPath/newsarch.html#030317.