Question

featureCount with high multimapping fragments

0

Entering edit mode

3.8 years ago

bertb ▴ 20

Hello,

I am using featureCount to complete a gene level count of my RNAseq reads. From the results summary, it looks like I have a very high number multimapping reads, resulting in a low assingment rate (less than 15%).

I have paired end reads, with an unstranded library

The options I have chosen are as follows:

featureCounts -T 8 -p -a ~/file/path/sacCer3.ncbiRefSeq.gtf -t exon -g gene_id -o ~/file/path/fcresults/groseq2.txt UWN.bam UMN.bam

after processing the reads, this is what my summary looks like:

Status  UWN.bam UMN.bam
Assigned        7608607 10093518
Unassigned_Unmapped     4743940 3388412
Unassigned_Read_Type    0       0
Unassigned_Singleton    0       0
Unassigned_MappingQuality       0       0
Unassigned_Chimera      0       0
Unassigned_FragmentLength       0       0
Unassigned_Duplicate    0       0
Unassigned_MultiMapping 66981505        58121996
Unassigned_Secondary    0       0
Unassigned_NonSplit     0       0
Unassigned_NoFeatures   1927159 2654380
Unassigned_Overlapping_Length   0       0
Unassigned_Ambiguity    10983810        8513544

From what I understand, this means I have reads (fragments) that are mapping to more than one feature.

My question is, if the SAM/BAM file I created was able to successfully map fragments to a reference genome, how is it possible that these fragments are assigned to more than one gene (meta feature) in my GTF file? Do I need to use a different annotation file, or is this a typical assignment rate for the count I am after?

Any help would appreciated! Thanks,

RNA-Seq sequencing • 2.0k views

ADD COMMENT • link updated 3.8 years ago by swbarnes2 14k • written 3.8 years ago by bertb ▴ 20

score 0 · Answer 1 · 2020-10-07

0

Entering edit mode

3.8 years ago

swbarnes2 14k

From what I understand, this means I have reads (fragments) that are mapping to more than one feature.

I think the 67M read are aligning to multiple places in the genome, and another 11M are aligning to multiple features.

That's not normal. In the RNA preps I analyze, I typically get STAR telling me that about 70% of reads mapped to the genome uniquely.

Here are some stats from a recent experiment; Total are all the reads in the fastqs, mapped means they were aligned to the genome, assigned means they were assigned to a gene (I used RSEM for the last step, it is smarter about handling reads that are ambiguous that featureCounts is)

                          Assigned_reads Mapped_reads Total_reads
           Sample_1       22858584     26435275    27173101
           Sample_2       15934143     18462699    18732273
           Sample_3       20612644     23995078    24387303

These might be a little nicer than usual; the RNA was commercially-made control RNA.

ADD COMMENT • link 3.8 years ago by swbarnes2 14k

0

Entering edit mode

Hey - thanks for replying.

Yes, I used Hisat2 to map my reads, and got a good alignment rate (also ~70%). This is why I was so confused to get a poor assignment rate. I was wondering if anybody had come across discrepancies of this kind.

Thanks for your help.

ADD REPLY • link 3.8 years ago by bertb ▴ 20

0

Entering edit mode

I don't see a discrepancy. Your data is telling you that most of your reads are mapping to multiple places. I'd investigate what exactly those reads are all aligning to.

ADD REPLY • link 3.8 years ago by swbarnes2 14k

0

Entering edit mode

Thanks for your help. Can you tell me how I can parse which reads are aligning or mapping to multiple places? From what I've read there are NH tags in the .bam files, but I'm not sure how I can separate these, or see which features they are mapping to. Thanks

ADD REPLY • link 3.8 years ago by bertb ▴ 20