Question

How to decide the quality of RNA-seq analysis

2

Entering edit mode

8.3 years ago

farui.huihui ▴ 20

Hi all,

I am a newbie in this forum. Probably my question has been answered somewhere, if so, please give me a link. Thanks.

My question basically is, how to judge the quality of RNA-seq analysis. Actually, this question can be rephrased in two sub-questions:

Given the quality control tool FastQC, there are many outputs in terms of quality, which ones are critical, i.e. if those quality checks fail, the fastq file should be abandoned;
The alignment quality, or mapping quality. When I use bioconductor package Rsubread to count the reads, I found that some bam files give more than 70% successfully assigned reads rate, but some only provide less than 30%, lowest even less than 5%. Does this mean those datasets with low successfully assigned reads rate are poor quality? I know there are few tools, like RSeQC, again, which outputs are critical.

I'll appreciate any help you offer. Thanks!

RNA-Seq • 5.1k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.3 years ago by farui.huihui ▴ 20

Ram · Answer 1 · 2016-01-13

10

Entering edit mode

8.3 years ago

Amitm ★ 2.2k

hi,

Assuming that this is human poly+ve RNA-seq (Illumina), first up see the number of reads you have. 40-50million reads is min. number if you are aiming to do differential expression.

2) Look at the box plot of qual. distrib. in FastQC. A good data would have the majority of plots well above phred score 30 (green region). Depending on read length, boxes for few bases at the end falling below 30 is normal.

3) Then check the plot for "Sequence dupli. level". Don't worry if its shown to be failed as per FastQC. RNA-seq is bound to have duplicates because mRNAs are present in multiple copies. A peak/ hump for x-axis value ">10" or more is good. It means you have sampled the transcriptome well.

4) Check the "Per seq. GC content". Ideally it should be unimodal: a single hump in the centre with the theoretical and observed more or less overlapping. I would be worried if I see two humps.

5) Check "Adapter content" plot

If you have adapter content warning, remove them using any suitable tool (Cutadapt, Trimmomatic etc.). Use a base quality-based filter to clip reads when quality falls below a level. Like "SLIDINGWINDOW" in Trimmomatic.

After quality filtering if you still have good number of reads left, box plots look OK for most of the read length, no mean looking 2nd hump in GC content (a shoulder is OK) and majority of read lengths are towards the higher end (like if you have trimmed 5 bases, then the peak in Seq. len. distrib plot is over 95/96), then your data is good to go (for algn.)

Post algn. a >70% algn. rate is decent. Though I routinely get >85-90% using STAR.

All the above is theory, ultimately it would depend how much you value a particular sample. If you have many and there is one low quality sample you could remove it. If its clinical sample on other hand, you try to salvage what you have.

Check the sample with 5% algn. rate for its FastQC. How many reads it had to start with? Use bedtools to overlap the BAM with exons to find where reads aligned.

Also if you want to dig deep for that sample, check the diagnostic metrics of the sequencing machine.

ADD COMMENT • link 8.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Thank you very much, Amitm! It's a great tutorial!

I am little confused about the mapped/aligned rate (same thing?). I looked into the .bam file with low assigned rate, using samtools flagstat, the output is shown below:

29549418 + 0 in total (QC-passed reads + QC-failed reads)
20783699 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
29549418 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

It seems that 100% reads were mapped, but less than 5% were counted/assigned to genes, how this happened?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by farui.huihui ▴ 20

0

Entering edit mode

BTW, have you ever used fastx_toolkit to trim and clip the fastq before analysis?

ADD REPLY • link 8.3 years ago by farui.huihui ▴ 20

1

Entering edit mode

hi,

depending on what aligner you used, all the reads provided might not end up in the BAM i.e. if only aligned reads are written into the BAM. Hence samtools flagstat has no way of knowing how many reads you started with, other than just looking at the BAM. More informative would be to look at the stats generated by the aligner where you can get the actual align. rate. Like TopHat & STAR, both give useful stats of how much aligned (unique or multi-mapping).

Also, open the BAM in IGV and check how the alignment looks. If you don't have idea of how a RNA-seq aligned BAM might look then you can import Bodymap/ ENCODE datasets through IGV menu and then you can make out if the alignment looks OK.

I am hoping that you used a splice-aware aligner (TopHat, STAR etc.).

One way to find where the alignments are is to convert your BAM to BED and then overlap the BED with BED of exons (from UCSC or Ensembl).

I have used Fastx-toolkit. Last time I used it, mult-threading wasn't available. If speed is not an issue, you can use that. BTW, any good (well used in community) trimmer would be OK.

ADD REPLY • link 8.3 years ago by Amitm ★ 2.2k

0

Entering edit mode

Yes, I am using STAR. Because I am doing large scale analysis, I just realised, it turns out that those low algnd rate datasets are miRNA/ncRNA ......

Thank you very much.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by farui.huihui ▴ 20