Question

FeatureCounts results: does it look correct? Why so many unassigned unmapped reads?

4.0 years ago

tanya_fiskur ▴ 70

Hello everyone!

I can't understand why my featureCounts summary differs so much from the RNA STAR one. Both were run in Galaxy. I used default parameters (which I can't add here right now because Galaxy returns Bad Gateway, but I will add them ASAP). The subsequent PCA plot is not that good, so I suspect that something is wrong with the counting.

featureCounts found 3681407 unassigned unmapped reads, whereas RNA STAR reported only about 663296 unmapped reads.

What could be the reason?

An example featureCounts summary:

Status  RNA STAR on data 56, data 38, and data 37: mapped.bam (8,5 weeks, sample 3)
Assigned    4705542
Unassigned_Unmapped 3681407
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 1557399
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   5861961
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    193083

The RNA STAR result for the same sample:

                      Number of input reads |   14855335
                  Average input read length |   478
                                UNIQUE READS:
               Uniquely mapped reads number |   10732876
                    Uniquely mapped reads % |   72.25%
                      Average mapped length |   472.17
                   Number of splices: Total |   8059241
        Number of splices: Annotated (sjdb) |   8025445
                   Number of splices: GT/AG |   7965722
                   Number of splices: GC/AG |   72373
                   Number of splices: AT/AC |   4974
           Number of splices: Non-canonical |   16172
                  Mismatch rate per base, % |   0.44%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.24
                    Insertion rate per base |   0.01%
                   Insertion average length |   2.05
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   441052
         % of reads mapped to multiple loci |   2.97%
    Number of reads mapped to too many loci |   67940
         % of reads mapped to too many loci |   0.46%
                              UNMAPPED READS:
Number of reads unmapped: too many mismatches | 663296
   % of reads unmapped: too many mismatches |   4.47%
        Number of reads unmapped: too short |   2921806
             % of reads unmapped: too short |   19.67%
            Number of reads unmapped: other |   28365
                 % of reads unmapped: other |   0.19%
                              CHIMERIC READS:
                   Number of chimeric reads |   467140
                        % of chimeric reads |   3.14%

Any help is highly appreciated!

RNA-Seq featureCounts RNA STAR • 4.0k views

updated 4.0 years ago by Chyguc • 0 • written 4.0 years ago by tanya_fiskur ▴ 70

STAR maps to the genome, whereas featureCounts assigns alignments to the features in the GTF. You are comparing apples with pears. I do not see anything suspicious. Even if your reads map to the genome, you can have, for example, genomic DNA contamination, which will result in a lower assignment rate in the featureCounts output. If you make statements about PCA plots etc., then please add an image. "Not that good" is not a very informative description.

4.0 years ago by ATpoint 82k

Sure. The PCA: https://ibb.co/hRMpy5B The samples are just not well grouped.

Thank you very much for your comment. Indeed, I was looking in the wrong direction.

It is also rather strange that featureCounts reports a much higher Unassigned_MultiMapping number. I understand that the GTF file can cover far fewer sequences than the reference genome, so many reads are just unmapped. But why is there such an increase in multimapped reads?

4.0 years ago by tanya_fiskur ▴ 70

score 0 · Answer 1 · 2020-05-08

In the STAR result posted above, the 663296 reads you refer to are only the ones in the "too many mismatches" category. Reads were also reported as unmapped for other reasons ("too short": 2921806, "other": 28365), so STAR itself already reports far more than 663296 unmapped reads.
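As a quick check of that arithmetic, here is a minimal sketch that adds up the unmapped categories from the STAR log. The numbers are copied from the question; the dictionary keys and variable names are just labels chosen for this example. Counting the "too many loci" reads as unmapped is an assumption about how they ended up in the BAM, but with them included the total matches Unassigned_Unmapped exactly for this sample.

```python
# Read counts copied from the STAR log posted in the question.
star_unmapped = {
    "too many mismatches": 663296,
    "too short": 2921806,
    "other": 28365,
}

# Assumption: reads mapped to too many loci were also written as unmapped.
too_many_loci = 67940

# Unassigned_Unmapped from the featureCounts summary in the question.
featurecounts_unassigned_unmapped = 3681407

unmapped_total = sum(star_unmapped.values())
unmapped_plus_too_many = unmapped_total + too_many_loci

print(unmapped_total)          # 3613467
print(unmapped_plus_too_many)  # 3681407
print(unmapped_plus_too_many == featurecounts_unassigned_unmapped)  # True
```

So the two tools do not disagree here; the featureCounts number appears to be the sum of several STAR categories rather than the single one quoted in the question.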

featureCounts reports the assignment of alignments to genomic features. One read can give rise to more than one alignment, for example when it maps to several loci, and spliced reads that are split to match exon-exon junctions during alignment add further complexity. This is why the number of alignments summarized by featureCounts is usually larger than the number of reads: what featureCounts quantifies is alignments, not reads. See the Subread Users Guide for details: https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf
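To make the alignments-versus-reads point concrete for this sample, the sketch below compares the featureCounts total with the STAR read counts. All numbers are copied from the question (categories with a count of 0 are left out); the variable names are only illustrative. The per-read ratio assumes that every Unassigned_MultiMapping record comes from a read STAR lists as "mapped to multiple loci", which is a simplification.

```python
# Non-zero rows copied from the featureCounts summary in the question.
featurecounts_summary = {
    "Assigned": 4705542,
    "Unassigned_Unmapped": 3681407,
    "Unassigned_MultiMapping": 1557399,
    "Unassigned_NoFeatures": 5861961,
    "Unassigned_Ambiguity": 193083,
}

# Values copied from the STAR log in the question.
star_input_reads = 14855335
star_multimapped_reads = 441052

# featureCounts tallies alignment records, so its total can exceed the read count.
total_records = sum(featurecounts_summary.values())
extra_records = total_records - star_input_reads

# Rough average number of alignment records per multi-mapped read (simplifying assumption).
records_per_multimapper = (
    featurecounts_summary["Unassigned_MultiMapping"] / star_multimapped_reads
)

print(total_records)                      # 15999392
print(extra_records)                      # 1144057
print(round(records_per_multimapper, 2))  # 3.53
```

Under that assumption, each multi-mapped read contributes roughly 3.5 alignment records on average, which would account for Unassigned_MultiMapping being much larger than the number of multi-mapped reads STAR reports.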

I hope this helps you.