Question

Sequence duplication levels-RNA Seq

2

6.1 years ago

makwana.kd ▴ 50

I have RNA-seq data (Illumina 1.9). After removing overrepresented sequences and adapters, I ran QC with FastQC and saw failures in the Kmer content, GC content, and sequence duplication modules. Several blog posts suggested this is a normal occurrence, so I aligned the reads to the reference genome with STAR, counted reads with HTSeq, and ran DESeq2 for differential expression. The RNA was isolated from the heavy polysomal fraction, so the samples essentially contain ribosome-bound messenger transcripts. The company that did the sequencing used poly(A) selection. After analyzing the RNA-seq data I ran RT-qPCR, which validated the DESeq2 findings with results very similar to the RNA-seq data.

For some of my samples, the percentage of unique reads remaining after deduplication, as estimated by FastQC, is as low as 8%. My validation suggests that the libraries were fine. I have read conflicting opinions online and am now confused: some recommend removing duplicates before proceeding, whereas others say this should not be done.

Is it normal for RNA-seq data to have such a low percentage of unique reads, as reported by FastQC?

RNA-Seq • 11k views

updated 3.6 years ago by joshua.theisen ▴ 30 • written 6.1 years ago by makwana.kd ▴ 50

3

High sequence duplication levels in RNA-seq are normal and expected. Do not remove duplicates: doing so would artificially scale down the counts of highly expressed genes and therefore underestimate their true expression.
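To illustrate the point, here is a minimal simulation sketch. It is a toy model, not a real pipeline: it assumes single-end reads sampled uniformly from one transcript, no PCR duplicates at all, and position-based deduplication that keeps one read per start position. The function name `simulate_counts` and all the numbers are made up for illustration.

```python
import random

def simulate_counts(n_reads, transcript_len, read_len=50, seed=0):
    """Return (raw_count, dedup_count) for reads sampled uniformly from a
    single transcript. Every read is a genuine fragment (no PCR duplicates)."""
    rng = random.Random(seed)
    # Number of distinct positions at which a read of this length can start.
    n_starts = transcript_len - read_len + 1
    starts = [rng.randrange(n_starts) for _ in range(n_reads)]
    # Position-based deduplication keeps one read per distinct start position.
    return len(starts), len(set(starts))

# Hypothetical genes on 1,000 bp transcripts: one highly and one lowly expressed.
high_raw, high_dedup = simulate_counts(100_000, 1_000)
low_raw, low_dedup = simulate_counts(200, 1_000, seed=1)

print("raw fold change:  ", high_raw / low_raw)
print("dedup fold change:", high_dedup / low_dedup)
```

In this toy model the highly expressed gene cannot keep more reads than there are possible start positions, so deduplication shrinks the apparent difference between the two genes even though no read was a technical artifact.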

6.1 years ago by ATpoint 81k

1

"In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files."

FastQC Documentation - Duplicate Sequences module
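One possible way to look at the distribution of duplicates within a region, once reads are aligned, is to summarize how the read start positions are spread. The sketch below is only a heuristic under stated assumptions: the start positions are toy data rather than real alignments, and `duplication_profile` is a hypothetical helper, not part of FastQC or any aligner. The idea is that over-sequencing tends to fill most possible start positions with moderate depth, whereas technical duplication tends to leave few positions occupied by tall stacks.

```python
from collections import Counter

def duplication_profile(starts, n_possible_starts):
    """Summarize how read start positions are distributed in one region."""
    depth = Counter(starts)          # reads per distinct start position
    occupied = len(depth)
    return {
        # Fraction of reads that would be dropped by position-based dedup.
        "duplicate_fraction": 1 - occupied / len(starts),
        # Fraction of possible start positions that have at least one read.
        "occupied_fraction": occupied / n_possible_starts,
        # Height of the tallest stack of identical start positions.
        "max_stack": max(depth.values()),
    }

# Toy region with 100 possible start positions and 1,000 reads in each case.
# Over-sequencing-like: every position is hit about equally often.
oversequenced = [pos for pos in range(100) for _ in range(10)]
# Technical-duplication-like: only 10 positions, each a stack of 100 copies.
pcr_like = [pos for pos in range(0, 100, 10) for _ in range(100)]

print(duplication_profile(oversequenced, 100))
print(duplication_profile(pcr_like, 100))
```

Both toy regions show a high duplicate fraction, which is why the raw FASTQ alone cannot separate them; the occupied fraction and stack height are what differ.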

6.1 years ago by said3427 ▴ 120

0

What does it mean that "examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication"? What should I be looking for? Is it that over-sequencing will appear as a large number of overlapping reads, some of which are exact duplicates by chance, while technical (PCR) duplication will appear as individual stacks of multiple copies of the exact same read?

3.6 years ago by joshua.theisen ▴ 30