Question: featureCount vs GENCODE gene lengths
0
gravatar for user31888
4 months ago by
user3188820
United States
user3188820 wrote:

I used featureCount for counting reads in a bam file.

featureCount notably outputs the length of each gene.

However, I noticed huge differences in number of gene exons and length between the 'hg38_RefSeq_exon.txt" annotation file provided with featureCount and the last version of GENCODE gtf. Why is that?

Maybe the following approach is not correct:

(1) keeping exon features only in GENCODE v28 GRCh38.p12 gtf

(2) sort the gtf subset by chromosome and start position:

chr1    11869   12227   ENSG00000223972
chr1    12010   12057   ENSG00000223972
chr1    12179   12227   ENSG00000223972
chr1    12613   12697   ENSG00000223972
chr1    12613   12721   ENSG00000223972
chr1    12975   13052   ENSG00000223972
...

(3) get the union of overlapping exons from the same genes:

> bedtools merge -i <sorted_gtf_subset> -d 0 -c 4 -o distinct
chr1    11869   12227   ENSG00000223972
chr1    12613   12721   ENSG00000223972
chr1    12975   13052   ENSG00000223972
chr1    13221   14501   ENSG00000223972,ENSG00000227232
chr1    15005   15038   ENSG00000227232
chr1    15796   15947   ENSG00000227232
chr1    16607   16765   ENSG00000227232
chr1    16858   17055   ENSG00000227232
...

(4) for each gene, sum the difference between start / end position of merged exons.

For example, featureCount outputs a length of 20,654 for ENSG00000237298 (RefSeq ID: 100506866) with 22 exons in its annotation file 'hg38_RefSeq_exon.txt'.

Using the method above with GENCODE gtf, I get 63 exons for this same gene and a length of 62,661 (difference of 42,007 bases !!!). I also noticed that 50 of these exons are also shared with other genes (50 exons with ENSG00000155657 and 1 with ENSG00000270956)

Could someone explain this difference to me?

gencode featurecount • 217 views
ADD COMMENTlink modified 4 months ago by Emily_Ensembl16k • written 4 months ago by user3188820

Have you examined the GENCODE GTF file carefully? It is possible that there are overlapping annotations in that regions that is leading to the much higher count using your method.

As you point out the RefSeq entry is clear with 22 exons marked in it and the longest transcript included in ENSG00000237298 also points to that same entry.

ADD REPLYlink written 4 months ago by genomax59k

I have just used the same approach with the latest RefSeq gff3. It gives the exact same 22 exons as in featureCount. I don't see what other filter I could apply to the GENCODE gtf (the source maybe?)

ADD REPLYlink written 4 months ago by user3188820
3
gravatar for Emily_Ensembl
4 months ago by
Emily_Ensembl16k
EMBL-EBI
Emily_Ensembl16k wrote:
  1. It's an antisense gene to Titin. It's a whopper. As an antisense gene, it, of course, shares exonic regions (though on the opposite strand) with Titin (ENSG00000155657), the gene it is antisense to because that's what an antisense gene is. It doesn't share exons with ENSG00000270956 from what I can see.

  2. RefSeq and Ensembl annotation is fundamentally different. The manual gene annotation done in Ensembl aims to be as comprehensive as possible, annotating all transcripts for which there is evidence. This particularly affects non-coding transcripts, for which there is evidence that transcription is occurring pretty much everywhere. RefSeq annotation is more conservative, so they miss lots of non-coding transcripts.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Emily_Ensembl16k

Thanks Emily ! I understand now.

ADD REPLYlink written 4 months ago by user3188820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1294 users visited in the last hour