featureCounts - Low Assigned rate - Locations of reads
8 months ago
chrys ▴ 60

Well hello there,

I am using featureCounts from the Subread package to count third-generation reads produced by Nanopore sequencing (MinION) and mapped to a reference genome. Although the overall basecall quality of our reads was high and the mapping rate was also very good (94%), featureCounts only produced assignment rates in the 50% to 60% range.

The largest unassigned group is "NoFeatures", which made me wonder where those reads mapped.

Assigned    1057725
Unassigned_Unmapped 62207
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 0
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   457608
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    283748
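
To put the table above into proportions, here is a minimal Python sketch that parses the text of a featureCounts summary and reports each non-zero category as a fraction of all reads. The function name `summarize` is just an illustrative choice, and the fractions are relative to every read in the BAM, unmapped reads included. With the numbers above this gives roughly 57% assigned, 25% NoFeatures and 15% Ambiguity.

```python
def summarize(summary_text):
    """Return {category: fraction of all reads} for non-zero categories.

    Expects the whitespace-separated "category count" lines of a
    featureCounts summary; any line whose second field is not an
    integer (such as the "Status <bam>" header) is skipped.
    """
    counts = {}
    for line in summary_text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        counts[parts[0]] = int(parts[1])
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items() if n > 0}


# The non-zero rows from the summary posted above.
SUMMARY = """
Assigned    1057725
Unassigned_Unmapped 62207
Unassigned_NoFeatures   457608
Unassigned_Ambiguity    283748
"""

for name, fraction in summarize(SUMMARY).items():
    print(f"{name}\t{fraction:.1%}")
```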


I used a custom annotation GFF (Gencode + custom features) to count the mappings. Does anybody know a tool or a straightforward way (other than checking IGV visually) to find out where those reads are?

I am especially interested in whether we might have some contamination by genomic DNA.

Any suggestions for QC, tools or procedures are welcome. Thanks!

If a read is mapped but does not overlap any feature in the GTF, it falls in an intron or an intergenic region. You can make a custom SAF file for featureCounts (see the manual) to count the reads in these regions. Intergenic is the complement, across the entire genome, of the GTF entries of type "gene", and intronic is the entire genome minus the intergenic and exonic regions.
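
As a rough illustration of the interval arithmetic described above, here is a minimal Python sketch that derives intergenic and intronic intervals from the "gene" and "exon" lines of a GTF and formats them as SAF (tab-separated GeneID, Chr, Start, End, Strand, with 1-based inclusive coordinates). The function names, the `chrom_sizes` dictionary (which could be filled from a FASTA index, for example) and the two labels "intergenic" and "intronic" used as GeneID are all assumptions made for the example. The strand column is a "+" placeholder, so this only makes sense for unstranded counting.

```python
from collections import defaultdict


def merge(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def subtract(outer, inner):
    """Return the parts of `outer` that are not covered by `inner`."""
    inner = merge(inner)
    result = []
    for start, end in merge(outer):
        pos = start  # first position not yet accounted for
        for i_start, i_end in inner:
            if i_end < pos:
                continue
            if i_start > end:
                break
            if i_start > pos:
                result.append((pos, i_start - 1))
            pos = max(pos, i_end + 1)
        if pos <= end:
            result.append((pos, end))
    return result


def build_saf_rows(gtf_lines, chrom_sizes):
    """Build (GeneID, Chr, Start, End, Strand) rows for intergenic and intronic regions."""
    genes, exons = defaultdict(list), defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, ftype = fields[0], fields[2]
        start, end = int(fields[3]), int(fields[4])
        if ftype == "gene":
            genes[chrom].append((start, end))
        elif ftype == "exon":
            exons[chrom].append((start, end))
    rows = []
    for chrom, size in chrom_sizes.items():
        # Intergenic: whole chromosome minus every gene body.
        for s, e in subtract([(1, size)], genes[chrom]):
            rows.append(("intergenic", chrom, s, e, "+"))
        # Intronic: gene bodies minus every exon (of any gene).
        for s, e in subtract(genes[chrom], exons[chrom]):
            rows.append(("intronic", chrom, s, e, "+"))
    return rows


def to_saf(rows):
    """Format rows as SAF text, including the header line."""
    lines = ["GeneID\tChr\tStart\tEnd\tStrand"]
    lines += ["\t".join(str(x) for x in row) for row in rows]
    return "\n".join(lines) + "\n"


# Toy annotation: one gene with two exons on a 1000 bp chromosome.
TOY_GTF = [
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n',
    'chr1\ttoy\texon\t400\t500\t.\t+\t.\tgene_id "g1";\n',
]
print(to_saf(build_saf_rows(TOY_GTF, {"chr1": 1000})), end="")
```

The resulting file can then be passed to featureCounts as the annotation with `-F SAF`. Because every interval shares one of just two GeneID values, the counts collapse into two rows, one for intergenic and one for intronic reads.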

You can also use the Qualimap RNA-seq QC tool to get the number and percentage of reads mapping to exonic, intronic and intergenic regions: http://qualimap.conesalab.org/doc_html/analysis.html#rna-seq-qc.

I believe you only need the BAM and GTF files (if I remember correctly). Since you have a GFF file, you can convert it to GTF with gffread: https://github.com/gpertea/gffread

Thanks to you both !

Qualimap was an excellent suggestion, exactly what I was looking for. The GFF to GTF conversion should be no problem either.

I found it puzzling that with ultra-long reads one would get so many unassigned counts. Thank you.