Question

Standard QC and EDA for Bulk RNA-seq

5 days ago

AlexStar ▴ 200

In single-cell RNA-seq, there are several standard QC steps to ensure data quality, such as quantifying mitochondrial (MT) gene expression, filtering cells with a very low number of expressed genes, and examining total counts.

I'm wondering if there are equivalent, standard QC procedures for bulk RNA-seq. I know to check if deposited data is already normalized or if it consists of raw counts, but beyond that, what Exploratory Data Analysis (EDA) and QC steps are typically performed for bulk RNA-seq datasets?

single-cell RNA-seq python anndata Bulk • 310 views

updated 5 days ago by i.sudbery 22k • written 5 days ago by AlexStar ▴ 200

score 1 · Answer 1 · 2025-11-12

There are myriad QC metrics you could compute for bulk RNA-seq. You might like to look at this package: https://cran.r-project.org/web/packages/RNAseqQC/vignettes/introduction.html

You might like to start by QCing the reads. Pay particular attention to the levels of over-represented sequences and the GC content. What you are looking for with GC content is not a specific distribution, but that the distribution a) is relatively smooth with a single mode and b) is similar between samples. You can also look at the duplication levels, but as this is necessarily single-ended duplication (prior to alignment), it's only an indication rather than completely diagnostic.
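As a rough illustration of point (b), here is a minimal sketch that builds a per-read GC histogram for each sample and compares two samples. The function names and the choice of total variation distance are my own assumptions, not something a specific QC tool does; in practice you would read the equivalent plots off a read-level QC report.

```python
def gc_fraction(seq):
    """Fraction of G/C among the unambiguous (A/C/G/T) bases of one read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # read made up entirely of N or other ambiguity codes
    return (seq.count("G") + seq.count("C")) / acgt


def gc_histogram(reads, n_bins=20):
    """Normalised histogram of per-read GC fraction (bins sum to 1)."""
    counts = [0] * n_bins
    for read in reads:
        # A GC fraction of exactly 1.0 goes into the last bin.
        idx = min(int(gc_fraction(read) * n_bins), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def histogram_distance(h1, h2):
    """Total variation distance: 0 means identical, 1 means no overlap."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))
```

Computing `histogram_distance` for every pair of samples and looking for a sample that sits far from all the others is one simple way to spot a library with an unusual GC profile. It says nothing about smoothness or the number of modes, which are easier to judge by eye.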

Post-alignment you might want to look at how many reads align (expect >80%) and how many align to exonic sequence (>50% for rRNA-depleted libraries, >66% for polyA-selected libraries).
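A minimal sketch of applying those rules of thumb to per-sample counts. The function name, the library-type labels, and the choice to express the exonic fraction relative to aligned reads are assumptions made for illustration; the read counts themselves would come from your aligner and counting tool.

```python
# Rule-of-thumb thresholds from the text above; not hard limits.
MIN_ALIGNED = 0.80
MIN_EXONIC = {"rRNA_depleted": 0.50, "polyA": 0.66}


def flag_sample(total_reads, aligned_reads, exonic_reads, library_type):
    """Return a list of warnings for one sample (empty list = nothing flagged).

    Assumption: the exonic fraction is taken relative to aligned reads.
    """
    warnings = []
    aligned_frac = aligned_reads / total_reads if total_reads else 0.0
    exonic_frac = exonic_reads / aligned_reads if aligned_reads else 0.0
    if aligned_frac <= MIN_ALIGNED:
        warnings.append(f"low alignment rate: {aligned_frac:.1%}")
    if exonic_frac <= MIN_EXONIC[library_type]:
        warnings.append(f"low exonic fraction: {exonic_frac:.1%}")
    return warnings
```

For example, `flag_sample(1000, 900, 700, "polyA")` returns an empty list, while a sample with 70% of reads aligned would be flagged.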

Once the reads are counted/quantified, you can look at normalisation factors (check they are not orders of magnitude different between samples), and PCAs/clustering of the samples to make sure they cluster by condition.
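To make the normalisation-factor check concrete, here is a small sketch that computes median-of-ratios size factors (the general approach used by DESeq-style normalisation, reimplemented here in plain Python rather than calling any package) and reports how far apart the largest and smallest factors are. The input layout, a dict of sample name to a list of raw counts in the same gene order, is an assumption.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    keep = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    if not keep:
        raise ValueError("no gene has a non-zero count in every sample")
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples) for g in keep
    }
    return {
        s: math.exp(median(math.log(counts[s][g]) - log_geo_mean[g] for g in keep))
        for s in samples
    }


def factor_spread(factors):
    """Ratio of the largest to the smallest size factor."""
    values = list(factors.values())
    return max(values) / min(values)
```

A spread of 2 or 3 is unremarkable; a spread of 100 or more means one sample has orders of magnitude fewer usable reads than another and deserves a closer look before any PCA or clustering.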

This is more or less where we'd leave it for a normal experiment. But further things you might do if you wanted to be particularly thorough, or suspected there was something wrong:

How many reads align to repeats or rRNA sequences? If it's supposed to be polyA selected, what fraction map to non-polyadenylated transcripts?

If you've done alignment with a read aligner, rather than a quantifier, you can look at the distribution of reads along genes using some sort of metagene plot: do you have a 3' or 5' bias, and importantly, is it the same between samples?

You can also check the paired-end duplication levels using something like Picard MarkDuplicates or EstimateLibraryComplexity.

What fraction of reads are spliced?

Is expression evenly distributed across chromosomes?
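The last check is easy to sketch once you have a count table and a gene-to-chromosome annotation. The function below is a minimal illustration; the gene and chromosome names in the usage example are made up, and the dict-based inputs are an assumption about how you have the data loaded.

```python
from collections import defaultdict


def chromosome_fractions(gene_counts, gene_to_chrom):
    """Fraction of one sample's counts assigned to each chromosome.

    gene_counts: dict gene -> count for a single sample.
    gene_to_chrom: dict gene -> chromosome name; genes missing from it are skipped.
    """
    totals = defaultdict(float)
    for gene, count in gene_counts.items():
        chrom = gene_to_chrom.get(gene)
        if chrom is None:
            continue
        totals[chrom] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {chrom: value / grand_total for chrom, value in totals.items()}


# Hypothetical toy data, for illustration only.
example_counts = {"geneA": 30, "geneB": 10, "geneC": 60}
example_annotation = {"geneA": "chr1", "geneB": "chr1", "geneC": "chrM"}
example_fractions = chromosome_fractions(example_counts, example_annotation)
```

Comparing these fractions between samples also gives you the bulk analogue of the single-cell mitochondrial check from the question: a sample where the mitochondrial chromosome takes a much larger share than in its replicates is worth investigating.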