Recently I thought I was finally settling on a pipeline for human & mouse RNA-Seq using STAR for alignment, Gencode as an annotation source, FeatureCounts for read counting, and RSEM for transcript quantification. However, just yesterday we noticed that a common marker gene, SOX10, was showing up as having zero counts in all human samples but was well above zero when looking at the RSEM TPM, and definitely had mapped reads (which were not multimapped) when looking at the BAM files in IGV and some of the reads in the SAM files. It turns out that the Gencode annotation for hg38 (at least v21 and v22, "Comprehensive gene annotation" GTF) has an overlapping putative not-well-annotated feature that overlaps the entire length of the SOX10 gene, preventing reads from being counted to SOX10 when using default settings in HTSeq-Count and FeatureCounts. This is kind of scary, since we don't know how many other "real" genes this might be happening to.
What are some suggestions for how to do deal with this? I imagine default settings for the counting programs have a very good reason for being set to exclude counting ambiguous reads by default, so I'm on the fence about just enabling counting of overlapping features and running with it. I could also take a subset of the GTF and remove this one overlapping feature that's causing a problem, or remove all the "overlapping features" that don't appear to be well supported.