I'm running a differential expression analysis for my species of interest using Cuffdiff (3 samples, 5 biological replicates each). While checking the output files I found a pair of duplicated genes (they are neighbours, and the only difference between them is their length) where only one has estimated expression values (FPKM); the second has zeroes in all samples. In the read_group_tracking file, raw_frags is also 0 for the second gene. I've inspected the BAM files produced by TopHat, and RNA-seq reads are mapped to the locations of both genes. I've tested Cuffdiff with several sets of parameters (default, with -b/--frag-bias-correct, with -u/--multi-read-correct, etc.) and all give the same output. Is this the normal behaviour of Cuffdiff for duplicated, highly similar genes? Should I care or not? I've seen that in some studies people first cluster the genes by sequence similarity and then estimate expression only for a representative gene from each cluster. Thanks for any advice on this!
If you're looking at gene-level expression (XLOC), Cuffdiff creates a "window" of sorts and collapses anything within that window into a single quantification. Often this window encompasses more than one gene, which is odd in my opinion and doesn't really make sense. For questions like "how exactly does this work in Tuxedo?", there are three ways to find out: the manual, the paper, or the source code.
While this might not directly answer your question, I've been down the Tuxedo rabbit hole and wasn't satisfied with the answers I was getting, so I'd recommend some alternatives. For gene-level differential expression, I'd recommend alignment (using whatever you want: TopHat, HISAT2, STAR, etc.) -> counting with htseq-count or featureCounts -> differential expression testing with DESeq2. For transcript-level differential expression, I'd recommend quantifying with kallisto or Salmon -> differential transcript expression with sleuth.
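If you want to see how widespread the zero-fragment problem is before switching tools, something like the following might help. It is only a rough sketch: it assumes the genes.read_group_tracking file is tab-separated with a header containing `tracking_id` and `raw_frags` columns (check this against your own file), and the function name and the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def genes_with_no_fragments(handle):
    """Return sorted tracking IDs whose raw_frags sum to 0 over all rows.

    Assumes a tab-separated table with 'tracking_id' and 'raw_frags'
    columns (an assumption about the Cuffdiff output layout).
    """
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["tracking_id"]] += float(row["raw_frags"])
    return sorted(gene for gene, total in totals.items() if total == 0)


# Made-up rows standing in for a real read_group_tracking file.
example = (
    "tracking_id\tcondition\treplicate\traw_frags\tFPKM\n"
    "XLOC_000001\tA\t0\t120\t15.2\n"
    "XLOC_000001\tB\t0\t98\t12.7\n"
    "XLOC_000002\tA\t0\t0\t0\n"
    "XLOC_000002\tB\t0\t0\t0\n"
)

print(genes_with_no_fragments(io.StringIO(example)))

# With a real file (path is a placeholder):
# with open("genes.read_group_tracking") as fh:
#     print(genes_with_no_fragments(fh))
```

If only a handful of loci come back, and they are all members of near-identical pairs like yours, that at least tells you the issue is limited to duplicated genes rather than something broader in the run.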