Many of my RNAseq reads are mapping to the pseudogene FTH1P2 instead of the protein-coding gene FTH1.
I visualized FTH1P2 to see where the reads mapped:
and BLASTed the highlighted region, which showed 98.62% identity with FTH1:
Note that there is an exon-exon junction at position 470 in FTH1, which is right in the middle of the BLAST alignment. This leads me to believe that Rsubread is not picking up the alternative mapping within FTH1 because it's missing the exon-exon junction.
Why is Rsubread not identifying alternative mapping sites within FTH1? How can I modify my inputs to help Rsubread know where the exon-exon junctions are located?
I tried running 'align' using a GTF file that includes FTH1 and its exons/junctions while excluding the pseudogenes, but it still only recognizes the mapping site on chr1 in FTH1P2. Note that if I build my reference index only include the protein-coding regions (gencode.v41.pc_transcripts.fa), then the reads are properly assigned to FTH1. The only problem with this is that there's no GTF file to use for featureCounts in the next step, though counting can be done using 'samtools idxstats'.
Here's my align function, done using Gencode's primary genome assembly:
align.stat <- align(index=<reference_index from primary_assembly.fa>, readfile1=<myreadfile>.fastq.gz, type="rna", input_format="gzFASTQ", output_format="BAM", nthreads = 12, useAnnotation = TRUE, annot.ext = gencode.v41.primary_assembly.annotation.gtf', isGTF = TRUE, GTF.featureType = 'exon', GTF.attrType = 'gene_id')
Thank you in advance to anyone with ideas.