featureCount with high multimapping fragments
1
0
Entering edit mode
4.0 years ago
bertb ▴ 20

Hello,

I am using featureCount to complete a gene level count of my RNAseq reads. From the results summary, it looks like I have a very high number multimapping reads, resulting in a low assingment rate (less than 15%).

I have paired end reads, with an unstranded library

The options I have chosen are as follows:

featureCounts -T 8 -p -a ~/file/path/sacCer3.ncbiRefSeq.gtf -t exon -g gene_id -o ~/file/path/fcresults/groseq2.txt UWN.bam UMN.bam

after processing the reads, this is what my summary looks like:

Status  UWN.bam UMN.bam
Assigned        7608607 10093518
Unassigned_Unmapped     4743940 3388412
Unassigned_Read_Type    0       0
Unassigned_Singleton    0       0
Unassigned_MappingQuality       0       0
Unassigned_Chimera      0       0
Unassigned_FragmentLength       0       0
Unassigned_Duplicate    0       0
Unassigned_MultiMapping 66981505        58121996
Unassigned_Secondary    0       0
Unassigned_NonSplit     0       0
Unassigned_NoFeatures   1927159 2654380
Unassigned_Overlapping_Length   0       0
Unassigned_Ambiguity    10983810        8513544

From what I understand, this means I have reads (fragments) that are mapping to more than one feature.

My question is, if the SAM/BAM file I created was able to successfully map fragments to a reference genome, how is it possible that these fragments are assigned to more than one gene (meta feature) in my GTF file? Do I need to use a different annotation file, or is this a typical assignment rate for the count I am after?

Any help would appreciated! Thanks,

RNA-Seq sequencing • 2.0k views
ADD COMMENT
0
Entering edit mode
4.0 years ago

From what I understand, this means I have reads (fragments) that are mapping to more than one feature.

I think the 67M read are aligning to multiple places in the genome, and another 11M are aligning to multiple features.

That's not normal. In the RNA preps I analyze, I typically get STAR telling me that about 70% of reads mapped to the genome uniquely.

Here are some stats from a recent experiment; Total are all the reads in the fastqs, mapped means they were aligned to the genome, assigned means they were assigned to a gene (I used RSEM for the last step, it is smarter about handling reads that are ambiguous that featureCounts is)

                          Assigned_reads Mapped_reads Total_reads
           Sample_1       22858584     26435275    27173101
           Sample_2       15934143     18462699    18732273
           Sample_3       20612644     23995078    24387303

These might be a little nicer than usual; the RNA was commercially-made control RNA.

ADD COMMENT
0
Entering edit mode

Hey - thanks for replying.

Yes, I used Hisat2 to map my reads, and got a good alignment rate (also ~70%). This is why I was so confused to get a poor assignment rate. I was wondering if anybody had come across discrepancies of this kind.

Thanks for your help.

ADD REPLY
0
Entering edit mode

I don't see a discrepancy. Your data is telling you that most of your reads are mapping to multiple places. I'd investigate what exactly those reads are all aligning to.

ADD REPLY
0
Entering edit mode

Thanks for your help. Can you tell me how I can parse which reads are aligning or mapping to multiple places? From what I've read there are NH tags in the .bam files, but I'm not sure how I can separate these, or see which features they are mapping to. Thanks

ADD REPLY

Login before adding your answer.

Traffic: 952 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6