I used featureCount for counting reads in a bam file.
featureCount notably outputs the length of each gene.
However, I noticed huge differences in number of gene exons and length between the 'hg38_RefSeq_exon.txt" annotation file provided with featureCount and the last version of GENCODE gtf. Why is that?
Maybe the following approach is not correct:
(1) keeping exon features only in GENCODE v28 GRCh38.p12 gtf
(2) sort the gtf subset by chromosome and start position:
chr1 11869 12227 ENSG00000223972 chr1 12010 12057 ENSG00000223972 chr1 12179 12227 ENSG00000223972 chr1 12613 12697 ENSG00000223972 chr1 12613 12721 ENSG00000223972 chr1 12975 13052 ENSG00000223972 ...
(3) get the union of overlapping exons from the same genes:
> bedtools merge -i <sorted_gtf_subset> -d 0 -c 4 -o distinct chr1 11869 12227 ENSG00000223972 chr1 12613 12721 ENSG00000223972 chr1 12975 13052 ENSG00000223972 chr1 13221 14501 ENSG00000223972,ENSG00000227232 chr1 15005 15038 ENSG00000227232 chr1 15796 15947 ENSG00000227232 chr1 16607 16765 ENSG00000227232 chr1 16858 17055 ENSG00000227232 ...
(4) for each gene, sum the difference between start / end position of merged exons.
For example, featureCount outputs a length of 20,654 for ENSG00000237298 (RefSeq ID: 100506866) with 22 exons in its annotation file 'hg38_RefSeq_exon.txt'.
Using the method above with GENCODE gtf, I get 63 exons for this same gene and a length of 62,661 (difference of 42,007 bases !!!). I also noticed that 50 of these exons are also shared with other genes (50 exons with ENSG00000155657 and 1 with ENSG00000270956)
Could someone explain this difference to me?