Question

Why there are low assigned reads in featureCounts output and different outputs for quality checks?

0

Entering edit mode

3.6 years ago

newbie ▴ 120

I'm using rnaseq data (paired-end and strand-specific) which is rRNA depleted with Ribo-Zero Gold rRNA Removal Kit. I observed that most number of reads are intronic (genomic origin), which is due to Ribo-Zero.

I did the alignment with Hisat2. Quality check of the alignment was done with RNA-seqQC (qualimap) and samtools flagstat stats.

Below I have given the results From RNA-seq QC and samtools. I don't get why the results from both are not the same?

And when I tried extracting read counts with featureCounts.

featureCounts -a /documents/annotation_data/GRCh38/gencode.v27.annotation.gtf -F GTF -p -s 2 -T 8 -o sample01_RNAseq_cutadapt.txt sample01_RNAseq_cutadapt.sorted.bam

FeatureCounts output:

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P sample01_RNAseq_cutadapt.sorted.bam        ||
||                                                                            ||
||             Output file : sample01_RNAseq_cutadapt.txt                 ||
||                 Summary : sample01_RNAseq_cutadapt.txt.summary         ||
||              Annotation : gencode.v27.annotation.gtf (GTF)                 ||
||      Dir for temp files : /documents/data/ ... ||
||                                                                            ||
||                 Threads : 8                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file gencode.v27.annotation.gtf ...                        ||
||    Features : 1200453                                                      ||
||    Meta-features : 58288                                                   ||
||    Chromosomes/contigs : 25                                                ||
||                                                                            ||
|| Process BAM file sample01_RNAseq_cutadapt.sorted.bam...                    ||
||    Paired-end reads are included.                                          ||
||    Strand specific : reversely stranded                                    ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Pairing up the read pairs.                                              ||
||                                                                            ||
||    Total alignments : 85057355                                             ||
||    Successfully assigned alignments : 15837729 (18.6%)                     ||
||    Running time : 6.59 minutes                                             ||

RNA-Seq QC report (Qualimap) after the alignment:

RNA-Seq QC report
-----------------------------------

>>>>>>> Input

    bam file = /documents/Hisat2_bamOutputs/sample01_RNAseq_cutadapt.sorted.bam
    gff file = /documents/annotation_data/GRCh38/gencode.v27.annotation.gtf
    counting algorithm = uniquely-mapped-reads
    protocol = strand-specific-reverse


>>>>>>> Reads alignment

    reads aligned (left/right) = 40,475,469 / 39,127,135
    read pairs aligned  = 38,373,732
    total alignments = 164,740,444
    secondary alignments = 85,137,840
    non-unique alignments = 56,300,532
    aligned to genes  = 2,341,222
    ambiguous alignments = 143,275
    no feature assigned = 10,530,260
    not aligned = 3,496,915


>>>>>>> Reads genomic origin

    exonic =  2,341,222 (18.19%)
    intronic = 9,502,317 (73.82%)
    intergenic = 1,027,943 (7.99%)
    overlapping exon = 344,632 (2.68%)


>>>>>>> Transcript coverage profile

    5' bias = 0.79
    3' bias = 0.84
    5'-3' bias = 0.92

And this is the result from samtools stats

# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      raw total sequences:    83099498
SN      filtered sequences:     0
SN      sequences:      83099498
SN      is sorted:      1
SN      1st fragments:  41549807
SN      last fragments: 41549691
SN      reads mapped:   79602782
SN      reads mapped and paired:        77446534        # paired-end technology bit set + both mates mapped
SN      reads unmapped: 3496716
SN      reads properly paired:  76747534        # proper-pair bit set
SN      reads paired:   83099498        # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      496791  # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 85137861
SN      total length:   10934613644     # ignores clipping
SN      total first fragment length:    5462252543      # ignores clipping
SN      total last fragment length:     5472361101      # ignores clipping
SN      bases mapped:   10441500710     # ignores clipping
SN      bases mapped (cigar):   10424391662     # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     26253838        # from NM fields
SN      error rate:     2.518501e-03    # mismatches / bases mapped (cigar)
SN      average length: 131
SN      average first fragment length:  131
SN      average last fragment length:   132
SN      maximum length: 147
SN      maximum first fragment length:  147
SN      maximum last fragment length:   147
SN      average quality:        38.4
SN      insert size average:    682.5
SN      insert size standard deviation: 1667.3
SN      inward oriented pairs:  38345988
SN      outward oriented pairs: 118451

1) I see that only 18.6% reads were successfully assigned in featureCounts output. Why there are very low in number?

2) Why the RNA-seq QC and samtools results are different?

Any help is appreciated. thanq

RNA-Seq featureCounts alignment qualimap reads • 2.8k views

ADD COMMENT • link updated 3.6 years ago by rpolicastro 13k • written 3.6 years ago by newbie ▴ 120

score 0 · Answer 1 · 2020-10-01

0

Entering edit mode

3.6 years ago

GenoMax 142k

RNAseq QC and featureCount reports are in alignment since they are looking are "where" reads are aligning according to gene models. Samtools report is for general alignments, so that is a different kind of report.

Have you looked at these alignment in IGV to see why there are so many reads in "intronic" regions? If there is DNA contamination in this sample you may see alignments that are scattered (not restricted to exons/UTR). If that is not the case then does this data have rRNA contamination?

ADD COMMENT • link 3.6 years ago by GenoMax 142k

0

Entering edit mode

I see that higher intronic mapping rate is see with rRNA removal. Here wee used Ribo Zero tool kit. Equal distribution of reads mapping to intronic, exonic and intergenic regions suggests that there is DNA contamination. I got this information Fromm here rna-seq QC tutorial same seen here too [https://link.springer.com/article/10.1186/1471-2164-15-675]

ADD REPLY • link 3.6 years ago by newbie ▴ 120

0

Entering edit mode

I have data from 300 tumor samples. And for all I see that genomic origin of reads are higher in intronic region.

ADD REPLY • link 3.6 years ago by newbie ▴ 120

1

Entering edit mode

Have you visually inspected the alignments? Just because a paper said something it may not be applicable to your data. It is certainly possible that the rRNA removal did not work as expected. If that is the case then you should explicitly use the rDNA repeat for human (or something like SortMeRNA) to see if your data has rRNA contamination by doing those alignments.

ADD REPLY • link 3.6 years ago by GenoMax 142k

0

Entering edit mode

I have 600 fastq.gz files. Could you please show me one example how to use SortMeRNA on .fastq.gz file please. thanq

ADD REPLY • link 3.6 years ago by newbie ▴ 120

0

Entering edit mode

Program manual for SortMeRNA is detailed and gives necessary examples. Start with a sample or two first. After you understand the usage think of analyzing remaining data.

ADD REPLY • link 3.6 years ago by GenoMax 142k

score 0 · Answer 2 · 2020-10-01

0

Entering edit mode

3.6 years ago

rpolicastro 13k

By default featureCounts only counts reads over exons (this is controlled by the -t flag). In your report you have about 74% of your reads over introns, and another 8% intergenic, meaning about 82% of your reads wouldn't be considered when counting features.

What might be troubling is the prevalence of intronic reads. Is this normal RNA-seq? If so, genomax brings up good points about DNA contamination or possible inefficiency of rRNA depletion.

ADD COMMENT • link 3.6 years ago by rpolicastro 13k

0

Entering edit mode

In my command I haven't used -t. So it should also count all the lncRNAs also.