Question

Difference between the Fasta files from UCSC and Gencode/Ensembl

11

7.9 years ago

Ar ★ 1.1k

I would like to know whether there is any difference between the genome build FASTA files from UCSC and GENCODE/Ensembl. For example, is there any difference between GRCh38/hg38 from UCSC and from GENCODE/Ensembl, and similarly between mm10/GRCm38 from UCSC and from GENCODE/Ensembl?

If not, why is there a difference in the genomic coordinates of some genes? For example, why are the MECP2 coordinates chrX:154,021,813-154,097,731 in UCSC (GRCh38/hg38) but Chromosome X: 154,021,573-154,137,103 in Ensembl?

Can I use the FASTA file downloaded from NCBI/UCSC together with the annotation file downloaded from GENCODE for alignment and other downstream bioinformatics purposes?

Thanks!

genome-build RNA-Seq ChIP-Seq • 13k views

ADD COMMENT • link updated 7.5 years ago by apa@stowers ▴ 600 • written 7.9 years ago by Ar ★ 1.1k

2

See these posts for human genome:

GRCh37/38(NCBI) vs hg19/hg38(UCSC)

Is there any differences between Human Genome downloaded from UCSC website and the on from Ensembl

Resources for converting between UCSC <-> Gencode <-> Ensembl chromosome names

The problem is that these posts are rather old, and there have been new releases since.

ADD REPLY • link 7.9 years ago by natasha.sernova ★ 4.0k

2

Ensembl and UCSC each do some de novo gene prediction, so the longest transcript identified for a gene may differ slightly between them. See this and this.

ADD REPLY • link 7.9 years ago by GenoMax 144k

0

Thanks a lot. That was helpful!

ADD REPLY • link 7.9 years ago by Ar ★ 1.1k

score 22 · Answer 1 · 2016-08-30

The genome is the same regardless of the genome browser you use. The Genome Reference Consortium (GRC) maintains the human (and other) assemblies, so GRCh38 = Genome Reference Consortium Human Build 38. However, there may be different versions of that sequence. Ensembl has GCA_000001405.22 (this is the INSDC assembly ID), whereas UCSC has GCA_000001405.15. The latest version is GCA_000001405.23 and is available on the GRC site. The different versions of GCA_000001405 usually reflect the addition of patches: fix patches, which correct the sequence, and novel patches (see "What types of patches are there").

Although the assembly is essentially the same (GCA_000001405, patches aside), the mode of annotation is completely different between UCSC and Ensembl. This explains what you've seen for MECP2. UCSC has it as X:154,021,573-154,097,755 (76,183 bp), whereas Ensembl has it as X:154,021,573-154,137,103. The lower coordinate is identical in both; the higher coordinate differs, and because the gene is on the reverse strand, that is the start (5' end) of the gene. Transcript MECP2-015 (i.e. ENST00000631210.1) is the culprit for the extended 5' end. This transcript was manually annotated by HAVANA and incorporated by Ensembl during the merge of the automatic and manual annotation pipelines, which gives rise to the GENCODE gene set.

So you can use the FASTA file downloaded from NCBI/UCSC with the annotation file downloaded from GENCODE, but do make sure the FASTA sequence is the same version that the annotation was carried out on. If you download the FASTA from Ensembl, you need not worry: the version will be the same.
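
As a first sanity check before pairing a FASTA with an annotation, one can compare the sequence names used in the two files. The following is a minimal Python sketch, not part of any UCSC, Ensembl or GENCODE tool; the function names and the toy records are made up for illustration. Matching names do not prove that the sequence versions match, so this only catches the obvious mismatches.

```python
import io


def fasta_sequence_names(handle):
    """Return the set of sequence names (first word of each '>' header line)."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_sequence_names(handle):
    """Return the set of seqname values (column 1) from GTF lines."""
    names = set()
    for line in handle:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_handle, gtf_handle):
    """Return (names only in the GTF, names only in the FASTA), both sorted."""
    fasta = fasta_sequence_names(fasta_handle)
    gtf = gtf_sequence_names(gtf_handle)
    return sorted(gtf - fasta), sorted(fasta - gtf)


# Toy data (hypothetical records, not real annotation).
toy_fasta = io.StringIO(">chrX toy description\nACGT\n>chrM\nACGT\n")
toy_gtf = io.StringIO(
    "#!toy header\n"
    'chrX\tHAVANA\tgene\t1\t4\t.\t-\t.\tgene_id "g1";\n'
    'scaffold_1\tENSEMBL\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)

missing_in_fasta, unannotated = compare_names(toy_fasta, toy_gtf)
print("Annotated but absent from the FASTA:", missing_in_fasta)
print("In the FASTA but never annotated:", unannotated)
```

For real files, pass open file handles instead of the `io.StringIO` objects. Anything reported as annotated but absent from the FASTA will be unusable by an aligner or counting tool that relies on that FASTA.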

score 6 · Answer 2 · 2017-01-11

Excellent question! We regularly use modified Ensembl annotations with UCSC genome builds. As long as you ensure the underlying assemblies are the same, you are MOSTLY going to be OK. But there are a few key things to watch out for:

UCSC names the chromosomes differently from Ensembl, mainly just adding "chr", but for scaffolds it is more complex, and names must be matched manually.
There can be scaffolds that Ensembl incorporates into its build but UCSC does not, or vice versa. Mostly these are patches or alternate assemblies. However, Ensembl annotates all the genes on all sequences it incorporates, so you might get dozens to hundreds of Ensembl genes that cannot be mapped to UCSC because UCSC didn't include those sequences.
Watch out for the mitochondrion, which often is taken from somewhere else. I know this is a problem with hg19 -- Ensembl and UCSC use slightly different sequences -- but I haven't seen others with this issue.
For some organisms, most notably C. elegans, finding an assembly that was used by both UCSC and Ensembl is hard. The two institutions basically interleave their genome versions.
As far as I know, UCSC does not import other providers' gene coordinates. UCSC takes cDNA sequences and does the mapping itself, so BLAT might have a different opinion of some transcripts than Ensembl, including not mapping a transcript at all or mapping it to locations that Ensembl did not. This does not invalidate Ensembl's coordinates (we still use them as-is with virtually no issues), but be aware that there can be a few discrepancies here and there.
For some older UCSC builds, scaffolds are reassembled into "random" chromosomes by stitching them together with 50kb N linkers. Ensembl leaves scaffolds as-is, so, in addition to changing the chrom names in the Ensembl GTF, the random-chrom offsets must also be added. Usually the offsets are in UCSC's ctgPos.txt file, but this doesn't always exist. UCSC doesn't reassemble scaffolds nowadays, but for mm9 / hg19 and before, this is an issue when importing Ensembl.
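
The renaming, the unmappable sequences and the random-chromosome offsets described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline we actually run: `name_map` and `offsets` are hypothetical dictionaries you would have to build yourself (for example from UCSC's ctgPos.txt where it exists), the offset is assumed to be the 0-based position of the scaffold's first base within the stitched chromosome, and the records are toy data. The mitochondrion is deliberately left out of the toy map, since renaming alone does not help when the two sequences differ.

```python
from collections import Counter


def convert_gtf_line(line, name_map, offsets=None):
    """Rename column 1 and shift start/end; return None if the name is unmapped."""
    fields = line.rstrip("\n").split("\t")
    seq = fields[0]
    if seq not in name_map:
        return None
    # Assumed convention: offset = 0-based start of the scaffold inside the
    # stitched "random" chromosome, so it is simply added to both coordinates.
    shift = 0 if offsets is None else offsets.get(seq, 0)
    fields[0] = name_map[seq]
    fields[3] = str(int(fields[3]) + shift)
    fields[4] = str(int(fields[4]) + shift)
    return "\t".join(fields) + "\n"


def convert_gtf(lines, name_map, offsets=None):
    """Convert all records; return (converted lines, Counter of unmapped names)."""
    converted = []
    unmapped = Counter()
    for line in lines:
        # Pass header/comment lines and blank lines through untouched.
        if line.startswith("#") or not line.strip():
            converted.append(line)
            continue
        new_line = convert_gtf_line(line, name_map, offsets)
        if new_line is None:
            unmapped[line.split("\t", 1)[0]] += 1
        else:
            converted.append(new_line)
    return converted, unmapped


# Hypothetical mapping and offsets, for illustration only.
name_map = {"X": "chrX", "scaffold_7": "chr1_random"}
offsets = {"scaffold_7": 50000}

toy_gtf = [
    "#!toy header\n",
    'X\thavana\tgene\t100\t200\t.\t-\t.\tgene_id "g1";\n',
    'scaffold_7\tensembl\tgene\t10\t20\t.\t+\t.\tgene_id "g2";\n',
    'patch_1\tensembl\tgene\t5\t9\t.\t+\t.\tgene_id "g3";\n',
]

converted, unmapped = convert_gtf(toy_gtf, name_map, offsets)
print("".join(converted), end="")
print("Records dropped per unmapped sequence:", dict(unmapped))
```

Counting the dropped records per sequence, rather than discarding them silently, makes it easy to see how many Ensembl genes were lost because UCSC did not include the sequence they sit on.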