We are looking at TCGA level 3 RNA-seq data to examine expression levels for a series of homologs with very high sequence identity between their exons. The identity is high enough that BLAT of an individual exon returns near-perfect matches to several of the homologs, and the exon junctions match nucleotide for nucleotide. On average, only about 2 bp in 100 differ between corresponding exons. We know which individuals carry a homozygous deletion of the gene, yet we still see RPKM values for its exons in those individuals. Those values must be wrong, and the most likely cause is that reads from the homologs are being aligned to the deleted gene's exons.
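To give a rough sense of why misassignment seems plausible here, this is a back-of-the-envelope sketch in Python. It assumes the ~2% differences are spread independently and uniformly along the exons (a simplification, since real differences may cluster), and the read lengths are hypothetical because I have not checked what the TCGA runs actually used.

```python
def fraction_without_difference(read_length, divergence):
    """Probability that a read covers no position that differs between two homologs.

    Assumes each base differs independently with probability `divergence`,
    which is a simplification of how real homologs diverge.
    """
    return (1.0 - divergence) ** read_length


# Hypothetical read lengths; 0.02 corresponds to about 2 bp in 100 differing.
for read_length in (50, 75, 100):
    frac = fraction_without_difference(read_length, 0.02)
    print(f"{read_length} bp reads: {frac:.1%} carry no distinguishing base")
```

Under these assumptions roughly a third of 50 bp reads would be identical between two homologs, so an aligner has no sequence-based way to place them on the correct copy.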
What is the best way to use TCGA level 1 data (BAM files, presumably) to get more exact matches for our gene(s) of interest? I guess we'd want to trim or filter to keep only high-quality reads and allow few or no mismatches. Do the standard RNA-seq pipelines handle cases like this well, or is there a better method? To be clear, we know the gene model and the transcripts that should arise in the presence of this deletion. If I missed a post that already answers this kind of question, please let me know. Any help is appreciated.
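To make the kind of filtering I have in mind concrete, here is a minimal sketch in Python that works on SAM text records (for example, the output of `samtools view -h` on one of the BAM files). It is not a tested pipeline. It assumes the aligner wrote the standard `NM` (edit distance) tag and that a high MAPQ indicates a unique placement, which depends on the aligner. The function names and the thresholds (`min_mapq=30`, `max_mismatches=0`) are placeholders I made up for illustration.

```python
def parse_tags(fields):
    """Parse SAM optional fields of the form TAG:TYPE:VALUE into a dict."""
    tags = {}
    for field in fields:
        name, type_code, value = field.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value
    return tags


def is_exact_match(sam_line, min_mapq=30, max_mismatches=0):
    """Return True for a primary alignment with high MAPQ and few or no edits."""
    cols = sam_line.rstrip("\n").split("\t")
    flag, mapq, cigar = int(cols[1]), int(cols[4]), cols[5]
    # Drop unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
    if flag & 0x4 or flag & 0x100 or flag & 0x800:
        return False
    if mapq < min_mapq:
        return False
    # Reject clipping and indels; M and N (spliced gap) operations are allowed.
    if any(op in cigar for op in "SHID"):
        return False
    # NM is the edit distance to the reference; skip reads that lack it.
    nm = parse_tags(cols[11:]).get("NM")
    return nm is not None and nm <= max_mismatches


def filter_sam(lines, **kwargs):
    """Yield header lines unchanged and only the alignments that pass the filter."""
    for line in lines:
        if line.startswith("@") or is_exact_match(line, **kwargs):
            yield line
```

Calling `filter_sam(sys.stdin)` on the SAM stream and writing the result back out would leave only reads that match the reference exactly, which could then be recounted per exon. This does not help with reads that are identical between homologs, so it may only reduce the problem and not remove it.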