Question: Results differ when using different ensembl versions
gravatar for gudraephouto
2.5 years ago by
gudraephouto10 wrote:

Please take a look at Gencode. There are different versions of Ensemble for GRCh37.

My question may be very basic but I'm confused. What is the difference between them? How do you decide which one to use?

To obtain read counts, I just mapped my fastq files to GRCh37 (fasta file) from gencode. But I'm wondering which version of GTF file to choose among all of those ensembl versions. I tried 2 of them: "Ensembl 74, 75" and "Ensembl 73" but they give me different results which make me confused.

Is it ok if choose my reference genome from UCSC (hg19) instead of GRCh37 from Gencode and choose the GTF file from Gencode? Or Should they both be chosen from one place?

ensembl • 2.0k views
ADD COMMENTlink modified 2.5 years ago by i.sudbery4.2k • written 2.5 years ago by gudraephouto10
gravatar for harold.smith.tarheel
2.5 years ago by
United States
harold.smith.tarheel4.3k wrote:

It's absolutely essential that you use the matching versions of the reference genome and annotation (GTF/GFF) for all of your analyses, or the results will be incorrect. UCSC hg19 is not identical to GRCh37. Different versions of GRCh37 represent minor improvements in the sequence and annotations.

Note that GRCh38 has been available for nearly three years now, and it's probably the one you should be using.

ADD COMMENTlink written 2.5 years ago by harold.smith.tarheel4.3k

Thank you so much.

I read on Wikipedia that

Human reference genome:

Release name /Date of release/Equivalent UCSC version

GRCh38 /Dec 2013/hg38

GRCh37 / Feb 2009 / hg19

NCBI Build 36.1 /Mar 2006/hg18

NCBI Build 35 /May 2004 /hg17

NCBI Build 34 / Jul 2003/hg16

That's why I considered both hg19 and GRCh37 equivalent.

About using GRCh38, once I read somewhere on biostars that hg38 (If we consider it equivalent to GRCh38) has some disadvantages which I don't remember what they were. So I decided to use GRCh37. Now If you think hg38 is just as fine as or better than hg19, I can use it.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

If you're doing bioinformatics, Wikipedia is probably not the best resource for detailed information about the human genome…

Just like software, there are major versions (e.g., GRCh38) with minor intermediate patches that are not identical. And different databases include/exclude different data for the "same" version/patch (so UCSC hg19 ≠ ENSEMBL hg19 ≠ Broad Institute hg19).

The choice of reference version is largely dictated by downstream applications (e.g., if you intend to use a specific dataset of dbSNP or 1000Genomes, then select the identical reference used in that analysis). If not constrained by that consideration, then I would recommend GRCh38.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by harold.smith.tarheel4.3k

Patches are not supposed to change chromosome coordinates so I am not sure if what you are saying to strictly correct. It is possible that the annotation provided by different providers is different.

We don't know what the OP meant when saying "they give me different results which make me confused". We need more information. But I do agree that to avoid any possibility of confusion it is best to stick with sequence/annotation from one source/build.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax64k

I believe you're correct that coordinates are not changed by patches, but annotations can change and sequences added, which will affect alignments, read counts, browser tracks, variant effect predictions, etc. And, if memory serves, some but not all of the hg19s I mentioned included the mitochondrial genome and some of the unmapped contigs.

But we're both making the same point: use one source/build for the entire pipeline, or be prepared for surprises.

ADD REPLYlink written 2.5 years ago by harold.smith.tarheel4.3k

I just repeated the process of aligning fastq files but this time to hg38 (Hisat). But it lead to a lower overall alignment rate.

The overall alignment rate by hg38: 81%

The overall alignment rate by hg19: 94%

I haven't tried it using GRCh38 yet.

Now what do you think?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

According to what Devon Ryan says here: "GRCh37/hg19 and GRCh38 are genome builds rather than annotations, which describe where features are in a given genome build. The actual sequences you'll get from NCBI/UCSC/Ensembl will be identical, but their annotations will be different and (importantly) updated at different frequencies."

There mustn't be any difference in the result of alignment by whatever reference genome you use but the type of gene annotation matters to obtain read counts. Right?

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

There are sequence differences between GRCh37 and GRCh38, it's a newer genome build. But these genome data (fasta) should be the same whether you get it from NCBI/UCSC/Ensembl. The gtf files contain the annotation and those do vary and are updated more frequently than the sequence information.

Mapping to hg38 will not be the same as mapping to hg19.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by WouterDeCoster37k

To cast light on it, I once aligned my fastq file using hisat to hg19, then I used the newest GTF file from gencode: "Version 19 (July 2013 freeze, GRCh37) - Ensembl 74, 75" to count reads.

I repeated the process again but this time I used the oldest GTF file from gencode: "Version 3c (July 2009 freeze, GRCh37) - Ensembl 56" .

The confusing thing was that the first GTF file could result in a higher library size for the sample but the second GTF file lead to more genes being expressed.

I mean while the counts for some genes obtained by GTF Version 19 were zero, the same genes had high counts (for example 400 or 1200) by GTF Version 3c.

I, first, thought this may be caused because I used hg19 for alignment, so I agained aligned my fastq files using GRCh37 and compared the results with both versions of GTF files but the same thing happened again: The number and type of expressed genes changed.

Another thing is that hg19 with GTF files from gencode could lead to higher library size than GRCh37 and GTF files from gencode.

This is why I asked:

1- how do you decide which ensembl to choose.
2- Is it necessary to use GTF file and reference genome from one place (GRCH37 and GTF both from gencode/hg19 and GTF both from UCSC)? Or is it fine to use hg19 from UCSC and GTF file from gencode?

GTF by UCSC doesn't work properly with htseq, so I have to choose my GTF file from gencode.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

Have you tried to use featureCounts to do the counting?

ADD REPLYlink written 2.5 years ago by genomax64k

No I haven't tried it yet but do you think it is robust? I feel there must not be a big difference between featurecounts and htseq.

I think the reference genome and gene annotation play a big role and I'm wondering what other people do and what reference genome and ensembl they choose.

Maybe I'm being too scrupulous and I have to finally decide to choose between "GRCh38.p7 as my reference genome and its corresponding annotation GTF" and "GRCh37.p13 and its corresponding GTF".

Or maybe I just align my reads to hg19 UCSC and obtain counts using the last version of GTF from gencode: "Version 19 (July 2013 freeze, GRCh37).

What's your final suggestion? Suppose you are given a set of RNA-seq fastq files, to what reference genome will you align them and which version of GTF file will you use? Please take it into account that I'm using htseq and subsequently, GTF files from UCSC are out of my choice.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

I use the sequence/annotation/index bundles from Illumina iGenomes site, where possible, to avoid these sorts of issues. The bundles are curated so everything is consistent. featureCounts allows summarization at gene level (not sure what kind you were using with HTSeq-count) and it will produce a matrix of counts (you can feed it all sample files at the same time) that is ready for use with DE programs.

I don't know why you saw wide differences when using different GTF files that differed only at patch level. One thing to keep in mind is that counting programs will ignore multi-mapped reads (unless you ask them to be included). How did you handle multi-mappers in your alignments (e.g. did you discard them, allow them to map at one best site or map at all possible sites).

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by genomax64k

To compare the results with each other, I did all of the process on galaxy using default settings... And multi-mapped reads were discarded. I'm going to align my data using tophat instead of hisat to see what happens next.

ADD REPLYlink written 2.5 years ago by gudraephouto10
gravatar for i.sudbery
2.5 years ago by
Sheffield, UK
i.sudbery4.2k wrote:

Both Ensembl version 74/75 and version 73 are based on GRCh37, and should also work fine with hg19 - the coordinates should correspond.

Each version of ensembl, which is a set of gene annotations, is based on a particular version of the genome sequence. But ensembl is updated far more regularly than the genome sequence is. There are new versions of Ensembl every six months if I remember correctly. Thus there is no "corresponding GTF" for a particular genome sequence version.

As for your inconsistent results - thats just what happens when annotation versions get updated. Transcript models are retired or added - thus transcripts that were expressed in 73 simply don't exist in 75. Not every change that is made is correct - some transcripts for which you might have very good evidence in your RNAseq, might be removed from one version to the next, but we have to assume the newest version is the best. In the course of a three year project you are going to go through 6 versions of ensembl, and things are going to change quite a bit - thats just the nature of the beast.

The current newest version of Ensembl annotations is version 85, which is based on the GRCh38 version of the sequence. Many people will tell you that you should be moving over to using GRCh38/hg38. However, as yet there aren't any RNA-seq tools that I am aware of that will cope with some of the new features of hg38, in particular the alternative contigs. Thus, if you want to use hg38, you will need to filter out the _alt contigs before you map (or use the "analysis set" from the UCSC website, which already has these filtered out).

You would then be able to use Ensembl 85.

ADD COMMENTlink written 2.5 years ago by i.sudbery4.2k

Thanks a lot... The explanation was clear... do you have a link to the "analysis set"? I found something here but I'm not sure which one to download.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by gudraephouto10

I would recommend this file: hg38.analysisSet.chroms.tar.gz. It is a tar file that contains a seperate fasta for each chromosome. In order to build indexes for mapping etc, you'll want to extract it and concatenate the resulting files. You can then get Ensembl85 GTFs from here, or use GENCODE version 25 here.

ADD REPLYlink written 2.5 years ago by i.sudbery4.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1227 users visited in the last hour