I have an insect genome with no annotation, so I mapped annotated NCBI protein sequences from the closest relative to the genome and built my transcriptome data (FASTA and GFF files) from them. After analyzing the data for differential gene expression with edgeR, I found that featureCounts does not seem to count reads for genes that overlap or share the same reading frame. There was one such case in my data: a gene that is supposed to be expressed showed no expression, but I later found that there are reads corresponding to that transcript.
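To make the overlap problem concrete, here is a minimal sketch of how overlapping gene features in a GFF file could be listed. It assumes a standard nine-column, tab-separated GFF3 with an `ID=` attribute and `gene` as the feature type. The function names and the example records are made up for illustration and are not from my real data.

```python
from collections import defaultdict


def parse_gff_features(lines, feature_type="gene"):
    """Yield (seqid, strand, start, end, feature_id) for matching GFF rows."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        # Column 9 holds key=value pairs separated by semicolons (GFF3 style).
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        yield cols[0], cols[6], int(cols[3]), int(cols[4]), attrs.get("ID", "unknown")


def find_overlapping_features(lines, feature_type="gene"):
    """Return pairs of feature IDs that overlap on the same sequence and strand."""
    by_location = defaultdict(list)
    for seqid, strand, start, end, fid in parse_gff_features(lines, feature_type):
        by_location[(seqid, strand)].append((start, end, fid))

    overlaps = []
    for features in by_location.values():
        features.sort()  # sort by start coordinate
        for i, (start_a, end_a, id_a) in enumerate(features):
            for start_b, end_b, id_b in features[i + 1:]:
                if start_b > end_a:
                    break  # sorted by start, so no later feature can overlap
                overlaps.append((id_a, id_b))
    return overlaps


# Hypothetical records, for illustration only.
EXAMPLE_GFF = [
    "scaffold_1\texample\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "scaffold_1\texample\tgene\t400\t900\t.\t+\t.\tID=geneB",
    "scaffold_1\texample\tgene\t1000\t1500\t.\t+\t.\tID=geneC",
    "scaffold_1\texample\tgene\t450\t800\t.\t-\t.\tID=geneD",
]

print(find_overlapping_features(EXAMPLE_GFF))
```

On a real file, the lines could be read with `open("annotation.gff")` (a placeholder file name). This sketch only compares features on the same strand; whether opposite-strand overlaps matter depends on whether the library and the counting step are strand-specific.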
Then I checked my whole transcript set, which was designed from the closest relative's transcript data, using the CD-HIT program. Even with a similarity cut-off of 100%, I still see at least 4000-5000 genes that are 100% identical to another sequence within the same database. The number of sequences flagged as similar increases as I relax the cut-off.
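For anyone who wants to reproduce this kind of count, here is a rough sketch of how the duplicated sequences could be tallied from a CD-HIT `.clstr` file. It assumes the usual layout, where each cluster starts with a `>Cluster N` line and each member line contains the sequence name between `>` and `...`. The function names and example records are hypothetical.

```python
import re

# Member lines look like: "0\t1500nt, >seq_name... *"
MEMBER_NAME = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return a list of clusters, each a list of member sequence names."""
    clusters = []
    for line in lines:
        if line.startswith(">Cluster"):
            clusters.append([])  # start a new cluster
        else:
            match = MEMBER_NAME.search(line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def multi_member_clusters(clusters):
    """Keep only clusters with more than one sequence (candidate duplicates)."""
    return [members for members in clusters if len(members) > 1]


# Hypothetical cluster file content, for illustration only.
EXAMPLE_CLSTR = [
    ">Cluster 0",
    "0\t1500nt, >transcript_1... *",
    "1\t1500nt, >transcript_7... at +/100.00%",
    ">Cluster 1",
    "0\t900nt, >transcript_2... *",
]

clusters = parse_clstr(EXAMPLE_CLSTR)
duplicated = multi_member_clusters(clusters)
print(len(clusters), "clusters,", len(duplicated), "with more than one member")
print(duplicated)
```

Listing the members of each multi-member cluster makes it easier to see whether the identical sequences come from the same locus, from different loci (possible duplications), or from redundant entries in the source database.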
Given this, I am not sure whether my analysis is missing something or giving me wrong information.
If the similar sequences represent gene duplication events, how should I handle them in the differential expression analysis?
Can anyone explain whether I am doing something wrong, or what I should do instead?
Thank you for your time and consideration.