Context: Sometimes, I get badly mapped RNA-Seq reads due to pseudo-genes (here close duplicates with similar sequences). Lots of software have an option to set the maximal size introns can take, but it often seems unreasonable (Tophat sets it at 500000bp I think). I would like to set it slightly below the maximal distance between any two homologous genes.
Question: I have as input a list of genes of interest in .gtf format. I want to find out what is the minimum distance between any of these two genes.
Is "bedtools closest -io -a foo.gtf -b foo.gtf" a correct way to obtain this value?
General: I have reads which are split in two: the left part aligns on a gene, the right one on another gene (close by and with similar sequences). How to stop this?
As a final note, transcriptome assembly usually ignores any aberrant reads and follow the consensus.
Thanks for your advice!