Many of my RNAseq reads are mapping to the pseudogene FTH1P2 instead of the protein-coding gene FTH1.
I visualized FTH1P2 to see where the reads mapped:
and BLASTed the highlighted region, which showed 98.62% identity with FTH1:
Note that there is an exon-exon junction at position 470 in FTH1, which is right in the middle of the BLAST alignment. This leads me to believe that Rsubread is not picking up the alternative mapping within FTH1 because it's missing the exon-exon junction.
Why is Rsubread not identifying alternative mapping sites within FTH1? How can I modify my inputs to help Rsubread know where the exon-exon junctions are located?
I tried running 'align' using a GTF file that includes FTH1 and its exons/junctions while excluding the pseudogenes, but it still only recognizes the mapping site on chr1 in FTH1P2. Note that if I build my reference index only include the protein-coding regions (gencode.v41.pc_transcripts.fa), then the reads are properly assigned to FTH1. The only problem with this is that there's no GTF file to use for featureCounts in the next step, though counting can be done using 'samtools idxstats'.
Here's my align function, done using Gencode's primary genome assembly:
align.stat <- align(index=<reference_index from primary_assembly.fa>, readfile1=<myreadfile>.fastq.gz, type="rna", input_format="gzFASTQ", output_format="BAM", nthreads = 12, useAnnotation = TRUE, annot.ext = gencode.v41.primary_assembly.annotation.gtf', isGTF = TRUE, GTF.featureType = 'exon', GTF.attrType = 'gene_id')
Thank you in advance to anyone with ideas.
UPDATE: So I think I found out what's going on. I took the region from the pseudogene that all of these reads are mapping to and compared it against the FTH1 gene (from Gencode's primary_assembly.fa) as well as against the FTH1 transcript itself (from Gencode's pc_transcripts.fa).
The pseudogene aligns very nicely with the protein-coding transcript:
But it doesn't line up well with the primary assembly sequence on chr11 (FTH1):
Therefore, the reads are artifactually aligning to the pseudogene on the primary assembly because the primary assembly FTH1 gene does not correspond to the protein-coding sequence.
Takeaway: From now on, I'm going to map reads using Gencode's 'pc_transcripts.fa' file, as it seems these sequences are more reliable. Gencode doesn't provide a GTF file to use with featureCounts on the BAM files, but this can be done with samtools with the following:
and then summarized at the gene level in Pandas using this:
You could also consider using the new complete human CHM13 genome available from:
I have constructed featureCounts SAF files for CHM13 based on RefSeq gene annotation and placed them here:
The
chm13v2.0_RefSeqStrict_27Sep2022.saf.gz
file might particularly be of interest to you because it excludes computationally generated gene models and only includes curated RefSeq genes.