Question: Sequence duplication levels-RNA Seq
2.6 years ago by
makwana.kd30 wrote:

I have RNA-seq data (Illumina 1.9). I ran QC with FastQC after removing overrepresented sequences and adapters. FastQC reported failures for the Kmer content, GC content and sequence duplication modules. Several blog posts suggested this is a normal occurrence, so I aligned the reads to the reference genome with STAR, counted reads with HTSeq and tested for differential expression with DESeq2. The RNA was isolated from the heavy polysomal fraction, so the samples essentially contained ribosome-bound messenger transcripts. The company that did the sequencing used poly-A selection. After analyzing the RNA-seq data I ran RT-qPCR, which validated the DESeq2 findings and gave results very similar to the RNA-seq data.

The percentage of reads that would remain after deduplication, as estimated by FastQC, is as low as 8% for some of my samples. My validation suggests that the libraries were fine. I have read conflicting opinions online and am now confused: some suggest removing duplicates before proceeding, whereas others advise strongly against it.

Is it normal for RNA-seq data to have such a low fraction of unique reads, as reported by FastQC?

rna-seq • 4.3k views
modified 25 days ago by joshua.theisen20 • written 2.6 years ago by makwana.kd30

High sequence duplication levels in RNA-seq are normal and expected. Do not remove duplicates. Removing them would underestimate the true expression of highly expressed genes, because it artificially scales down the counts of exactly those genes.

written 2.6 years ago by ATpoint40k
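
To put rough numbers on the point above, here is a minimal sketch of how position-based deduplication compresses counts. It is an illustration under simplifying assumptions, not a model of any particular tool: reads are single-end, start positions are drawn uniformly, and a "duplicate" is any read sharing a start position with another read. The gene sizes and read counts are hypothetical.

```python
def expected_unique(n_reads: int, n_positions: int) -> float:
    """Expected number of distinct start positions hit when n_reads
    are drawn uniformly from n_positions possible start positions."""
    return n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)


# Hypothetical genes with the same number of possible start positions
# but very different expression levels.
n_positions = 1000
low_reads, high_reads = 100, 100_000

# Counts that would survive deduplication (one read kept per start position).
low_dedup = expected_unique(low_reads, n_positions)
high_dedup = expected_unique(high_reads, n_positions)

true_fold = high_reads / low_reads
dedup_fold = high_dedup / low_dedup

print(f"true fold change:         {true_fold:.1f}")
print(f"fold change after dedup:  {dedup_fold:.1f}")
```

Under these assumptions the lowly expressed gene keeps nearly all of its reads, while the highly expressed gene saturates at the number of available start positions, so the apparent difference between the two shrinks sharply.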

"In RNA-Seq libraries sequences from different transcripts will be present at wildly different levels in the starting population. In order to be able to observe lowly expressed transcripts it is therefore common to greatly over sequence high expressed transcripts, and this will potentially create large set of duplicates. This will result in high overall duplication in this test, and will often produce peaks in the higher duplication bins. This duplication will come from physically connected regions, and an examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication, but these distinctions are not possible from raw fastq files."

Source: FastQC Documentation, Duplicate Sequences module

modified 2.6 years ago • written 2.6 years ago by said342790

What does it mean that "examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication"? What should I be looking for? Is it that over-sequencing will appear as a large number of overlapping reads, some of which are exact duplicates by chance, while technical (PCR) duplication will appear as individual stacks of multiple copies of the exact same read?

written 25 days ago by joshua.theisen20
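
One possible way to look at this after alignment, offered as a rough sketch rather than an established procedure: compare how many distinct read start positions a region actually has with how many random sampling at that depth would be expected to hit. The function name `duplication_summary`, the toy regions and the uniform-start assumption are all hypothetical; real RNA-seq start positions are not uniform, so the ratio is only indicative.

```python
from collections import Counter


def duplication_summary(starts, n_positions):
    """Summarise read start positions within one region.

    starts: start coordinate of every aligned read in the region.
    n_positions: number of possible start positions in the region.
    """
    n_reads = len(starts)
    stacks = Counter(starts)  # start position -> number of reads starting there
    observed_unique = len(stacks)
    # Distinct starts expected if reads were sampled uniformly (assumption).
    expected = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return {
        "reads": n_reads,
        "observed_unique": observed_unique,
        "expected_unique": expected,
        "ratio": observed_unique / expected,
        "max_stack": max(stacks.values()),
    }


# Toy region A: 1000 reads spread over all 200 positions (deep sequencing).
deep = [p for p in range(200) for _ in range(5)]
# Toy region B: 1000 reads piled onto only 10 positions (isolated stacks).
stacked = [p for p in range(0, 200, 20) for _ in range(100)]

deep_summary = duplication_summary(deep, 200)
stacked_summary = duplication_summary(stacked, 200)
print(deep_summary)
print(stacked_summary)
```

In this toy setup a ratio near 1 means the duplicates are about what depth alone would produce, which matches the "many overlapping reads, some identical by chance" picture. A ratio far below 1 with a few very tall stacks matches the "individual stacks of the same read" picture.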