Question: ENSEMBL annotation file for quantification: which file to use?
psm wrote:

Hello, question regarding gene quantification for RNAseq: I've used HISAT2 to align my reads against the hg38 genome, and used UCSC annotation for this.

I now want to perform gene-level quantification using featureCounts. On the Ensembl website (ftp://ftp.ensembl.org/pub/release-93/gtf/homo_sapiens), there are many options for GTFs:

Homo_sapiens.GRCh38.93.chr.gtf.gz
Homo_sapiens.GRCh38.93.chr_patch_hapl_scaff.gtf.gz
Homo_sapiens.GRCh38.93.gtf.gz
Homo_sapiens.GRCh38.93.abinitio.gtf

What is the difference between these and which should I choose?

Also, the original alignment was done using the UCSC GTF. Would it be acceptable to then count using the Ensembl annotation? I want to switch because of this paper.

Many thanks in advance for any help.

annotation rna-seq ensembl • 703 views
modified 13 months ago by Ben_Ensembl • written 13 months ago by psm

It is important to make sure the chromosome identifiers are the same between your fasta reference and your gtf annotation. If one uses chr1 and the other just 1 then you have a problem.
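A quick way to check this before counting is to compare the sequence names in the two files. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a standard FASTA (names after `>`) and a tab-separated GTF with the sequence name in column 1.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def fasta_seqnames(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">") and len(line.strip()) > 1:
            names.add(line[1:].split()[0])
    return names


def gtf_seqnames(lines):
    """Collect column-1 values from GTF lines, skipping comments and blanks."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_seqnames(fasta_lines, gtf_lines):
    """Report which sequence names are shared and which appear in one file only."""
    fa = fasta_seqnames(fasta_lines)
    gtf = gtf_seqnames(gtf_lines)
    return {"shared": fa & gtf, "only_fasta": fa - gtf, "only_gtf": gtf - fa}


# Usage with real files (paths here are placeholders):
# with open_text("genome.fa") as fa, open_text("annotation.gtf.gz") as gtf:
#     print(compare_seqnames(fa, gtf))
```

If `shared` comes back empty while both of the other sets are populated, the two files are most likely using different naming schemes (for example chr1 versus 1).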

written 13 months ago by WouterDeCoster

Thank you for that pointer, noted. Thankfully, if I understand correctly, Devon Ryan has pointed out that featureCounts is not affected by this difference between UCSC and Ensembl chromosome names.

written 13 months ago by psm

Ben_Ensembl (EMBL-EBI) wrote:

Just to complement Devon Ryan's answer:

.gtf: This is the default file. It should contain the full annotation for all species except human and mouse. For human and mouse, it contains all annotation on the primary assembly, i.e. excluding patch and haplotype regions. All species have one.

.chr.gtf: Contains only annotation on chromosomes, so toplevel scaffolds are excluded (patch and haplotypes are not included).

.chr_patch_hapl_scaff: Contains all annotation on all toplevel sequences, including patch and haplotype regions. It should only exist for human and mouse.

Species with no chromosomes will have a single file, .gtf
Species with only chromosomes but no scaffolds will have a single file, .gtf
Species with chromosomes and scaffolds will have two files, .gtf and .chr.gtf

Further information can be found in the README file: http://ftp.ensembl.org/pub/current_gtf/homo_sapiens/README

written 13 months ago by Ben_Ensembl

Would it be possible to add that to the README file? The ab initio file is mentioned, but the other two aren't.

written 13 months ago by Devon Ryan

Sure, I'll talk with my colleagues who are responsible for the README file and see whether we can update it to make it more comprehensive.

written 13 months ago by Ben_Ensembl

Very helpful! Thanks for clarifying the different formats.

written 13 months ago by psm

Devon Ryan (Freiburg, Germany) wrote:

You're lucky that featureCounts can translate between UCSC and Ensembl chromosome names; most tools can't. So you should use Homo_sapiens.GRCh38.93.gtf.gz (using the chr_patch_hapl_scaff file won't hurt; it just contains contigs absent from your reference genome).
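For tools that cannot do this translation themselves, one option is to rename the sequences in the GTF beforehand. The sketch below is only an illustration with made-up function names. It assumes the usual convention for the human primary chromosomes (Ensembl 1-22, X, Y, MT versus UCSC chr1-chr22, chrX, chrY, chrM) and deliberately leaves scaffolds and patches untouched, since their names do not follow a simple prefix rule and would need a proper lookup table.

```python
# Ensembl names of the human primary chromosomes, excluding the mitochondrion.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def ensembl_to_ucsc(name):
    """Return the UCSC-style name for a primary chromosome, else None."""
    if name == "MT":
        return "chrM"  # the mitochondrial genome is named differently
    if name in PRIMARY:
        return "chr" + name
    return None  # scaffold, patch, or something unexpected


def rename_gtf_line(line):
    """Rewrite column 1 of a GTF line; leave comments and unmapped names as-is."""
    if line.startswith("#"):
        return line
    seqname, sep, rest = line.partition("\t")
    ucsc = ensembl_to_ucsc(seqname)
    return line if ucsc is None else ucsc + sep + rest
```

Applying `rename_gtf_line` to every line of the Ensembl GTF would produce a file whose primary chromosomes match a UCSC-named reference; any lines left unchanged are worth inspecting by hand.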

written 13 months ago by Devon Ryan

Thanks for breaking it down for me - that's exactly what I wanted to know.

written 13 months ago by psm