Question

Reads mapping to pseudogenes using RSubread

0

Entering edit mode

19 months ago

cstashko • 0

Many of my RNAseq reads are mapping to the pseudogene FTH1P2 instead of the protein-coding gene FTH1.

I visualized FTH1P2 to see where the reads mapped: Reads mapping to FTH1P2

and BLASTed the highlighted region, which showed 98.62% identity with FTH1: BLAST results

Note that there is an exon-exon junction at position 470 in FTH1, which is right in the middle of the BLAST alignment. This leads me to believe that Rsubread is not picking up the alternative mapping within FTH1 because it's missing the exon-exon junction.

Why is Rsubread not identifying alternative mapping sites within FTH1? How can I modify my inputs to help Rsubread know where the exon-exon junctions are located?

I tried running 'align' using a GTF file that includes FTH1 and its exons/junctions while excluding the pseudogenes, but it still only recognizes the mapping site on chr1 in FTH1P2. Note that if I build my reference index only include the protein-coding regions (gencode.v41.pc_transcripts.fa), then the reads are properly assigned to FTH1. The only problem with this is that there's no GTF file to use for featureCounts in the next step, though counting can be done using 'samtools idxstats'.

Here's my align function, done using Gencode's primary genome assembly:

align.stat <- align(index=<reference_index from primary_assembly.fa>, readfile1=<myreadfile>.fastq.gz, type="rna", input_format="gzFASTQ", output_format="BAM", nthreads = 12, useAnnotation = TRUE, annot.ext = gencode.v41.primary_assembly.annotation.gtf', isGTF = TRUE, GTF.featureType = 'exon', GTF.attrType = 'gene_id')

Thank you in advance to anyone with ideas.

Pseudogenes Read RNAseq Rsubread Mapping Alignment • 912 views

ADD COMMENT • link 19 months ago by cstashko • 0

0

Entering edit mode

UPDATE: So I think I found out what's going on. I took the region from the pseudogene that all of these reads are mapping to and compared it against the FTH1 gene (from Gencode's primary_assembly.fa) as well as against the FTH1 transcript itself (from Gencode's pc_transcripts.fa).

The pseudogene aligns very nicely with the protein-coding transcript:

enter image description here

But it doesn't line up well with the primary assembly sequence on chr11 (FTH1):

enter image description here

Therefore, the reads are artifactually aligning to the pseudogene on the primary assembly because the primary assembly FTH1 gene does not correspond to the protein-coding sequence.

Takeaway: From now on, I'm going to map reads using Gencode's 'pc_transcripts.fa' file, as it seems these sequences are more reliable. Gencode doesn't provide a GTF file to use with featureCounts on the BAM files, but this can be done with samtools with the following:

samtools sort input.BAM >sorted.BAM
samtools index sorted.BAM
samtools idxstats sorted.BAM | cut -f 1,3 >counts.txt

and then summarized at the gene level in Pandas using this:

counts['name'] = counts.index
counts['name'] = counts['name'].str.split('|', expand=True)[5]
counts = counts.groupby('name').sum()

ADD REPLY • link 19 months ago by cstashko • 0

1

Entering edit mode

You could also consider using the new complete human CHM13 genome available from:

https://github.com/marbl/CHM13

I have constructed featureCounts SAF files for CHM13 based on RefSeq gene annotation and placed them here:

https://bioinf.wehi.edu.au/Rsubread/annot

The chm13v2.0_RefSeqStrict_27Sep2022.saf.gz file might particularly be of interest to you because it excludes computationally generated gene models and only includes curated RefSeq genes.

ADD REPLY • link 19 months ago by Gordon Smyth ★ 7.0k

score 1 · Answer 1 · 2022-09-28

There isn't anything in your post to suggest that subread is missing FTH1 or it's exons. If these reads match the pseudogene region even slightly better than the protein-coding region, then they will be aligned there. That doesn't mean that subread isn't finding the mapping sites within FTH1, just that the sites within the pseudo-gene appear at least as good.

You seem to be running Rsubread::align correctly however you might try annot.inbuilt="hg38" instead of Gencode because pseudo-genes are less prominent in the inbuilt RefSeq annotation compared to Gencode. If you are curious about exon-exon junctions then you could try Rsubread::subjunc instead of align because the latter function identifies exon-exon junctions de novo, but I doubt that it would make any difference to the issue that you report.