Question: Trim/Remove Reads With Adapters From Illumina RNA-Seq Experiment
7.1 years ago by
United States
dfernan690 wrote:


I have data from a single-end 59 bp RNA-Seq experiment on mouse cells. Each experiment corresponds to one flow cell, meaning each experiment has between 100,000,000 and 200,000,000 reads. The protocol was a poly(A) RNA pull-down.

When I run fastqc I obtain some concerning results:

1) Overrepresented sequences corresponding to the Illumina adapter. Is this common?

Overrepresented sequences

Sequence Count Percentage Possible Source


The index of the specific adapter in this experiment is CGATGT, which makes sense. So the adapter is 59 bp, the same length as the single-end reads? Isn't that an issue? Should I trim all the adapters? Does anyone have experience trimming adapters with Trim Galore? Does the following Trim Galore command make sense? Should I use the whole adapter sequence after -a, or just the index sequence as I did below?

I.e., Should I do this?

trim_galore -a CGATGT -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

or this?

trim_galore -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

2) High duplicate numbers:

Sequence duplicate levels >= 75 % or so.

What should I do about the high duplicate levels? - My goal is to do differential expression between experiments.

(1) Leave it as it is and map the data

(2) Collapse duplicates into one

(3) No solution: duplicate levels this high mean something went wrong with the experiment.

Please let me know your suggestions regarding these issues. Thanks!

rnaseq fastq rna rna-seq fastqc • 17k views
ADD COMMENTlink modified 7.1 years ago by Istvan Albert ♦♦ 85k • written 7.1 years ago by dfernan690
7.1 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

The advantage of listing the longer version of the adapter is that the tool can then recognize adapters with sequencing errors in them. I would list the adapter up to the variable region but not the variable region itself.
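As a minimal sketch of that suggestion, the adapter can be split at the index so that only the constant part is passed to -a. The variable names here are illustrative, the sequences and options are the ones from the question, and the final command is only printed, not run:

```shell
#!/bin/sh
# Sketch: keep the adapter up to (but not including) the index.
# FULL_ADAPTER and INDEX are copied from the question; PREFIX is derived.
FULL_ADAPTER="GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG"
INDEX="CGATGT"

# Strip the first occurrence of the index and everything after it.
PREFIX=${FULL_ADAPTER%%"$INDEX"*}

echo "adapter prefix: $PREFIX"
# Print (do not execute) the command with the question's other options.
echo "trim_galore -a $PREFIX -q 15 -s 5 -e 0.05 --length 48 <fastq_file>"
```

This assumes the index does not also occur earlier in the adapter, which holds for the sequence above.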

On duplication levels:

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Istvan Albert ♦♦ 85k

@Istvan thanks a lot.

ADD REPLYlink written 7.1 years ago by dfernan690

By variable region, do you mean the actual index sequence? So you wouldn't actually trim the ATCTCGTATGCCGTCTTCTGCTTG part? Thanks a lot :)

ADD REPLYlink written 7.1 years ago by Biomonika (Noolean)3.1k

Yes, because all the tool needs is to match the adapter; once it does, it removes the adapter and everything after it, so the initial GATCGGAAGAGCACACGTCTGAACTCCAGTCAC suffices.
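To illustrate the idea, here is a rough sketch of clipping at the adapter prefix with plain awk. It is not how Trim Galore or cutadapt are implemented: those tools also tolerate mismatches and partial adapter matches at the read end, while this version only handles exact, complete matches. The function name trim_at_adapter is made up for this example.

```shell
#!/bin/sh
# Sketch: clip each read at the first exact occurrence of the adapter prefix.
# Exact matching only; real trimmers allow errors and partial end matches.
ADAPTER="GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

trim_at_adapter() {
    # stdin: 4-line FASTQ records; stdout: the same records, clipped.
    awk -v adapter="$ADAPTER" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0; cut = index(seq, adapter) }  # 0 = no match
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            qual = $0
            if (cut > 0) {
                # Drop the adapter and everything after it, and keep the
                # quality string the same length as the sequence.
                seq  = substr(seq, 1, cut - 1)
                qual = substr(qual, 1, cut - 1)
            }
            print header; print seq; print plus; print qual
        }
    '
}

# Hypothetical usage:
#   trim_at_adapter < reads.fastq > trimmed.fastq
```

A read that is adapter from its first base comes out empty here; a minimum-length filter such as --length in the question's command is what discards those.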

ADD REPLYlink written 7.1 years ago by Istvan Albert ♦♦ 85k

Yes, I now realize that you are discussing adapters at the read ends. I have found the Index 2 adapter at the beginning of reads in many of my strand-specific datasets, hence the question.

ADD REPLYlink written 7.1 years ago by Biomonika (Noolean)3.1k
7.1 years ago by
Irsan7.2k wrote:

About Illumina adapters in your reads: yes, that's normal. I have no experience with Trim Galore. I use cutadapt / trimmomatic to remove adapters and trim low-quality bases at the ends of reads.

About the high duplication level: that's also normal in RNA-seq. Usually a few of the most highly expressed transcripts consume a very large proportion of the flow cell. This means a few sequences account for many of the reads, which results in a high duplication level.
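For a quick sanity check of that number, the fraction of exact-duplicate reads can be computed directly from a FASTQ file. This is only a sketch: the function name duplication_rate is made up, it holds every distinct sequence in memory (heavy for 100 to 200 million reads), and FastQC estimates its figure from a subset of the file, so the two values need not agree exactly.

```shell
#!/bin/sh
# Sketch: percentage of reads whose sequence exactly repeats an earlier read.
duplication_rate() {
    # stdin: 4-line FASTQ records; stdout: one number such as "75.0"
    LC_ALL=C awk '
        NR % 4 == 2 {
            total++
            if (seen[$0]++) dup++   # sequence already seen: a duplicate
        }
        END { if (total) printf "%.1f\n", 100 * dup / total }
    '
}

# Hypothetical usage:
#   duplication_rate < reads.fastq
```

With a toy input in which three of four reads come from the same dominant transcript position, two reads count as duplicates and the function reports 50.0.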

ADD COMMENTlink written 7.1 years ago by Irsan7.2k

@irsan, thank you good answer

ADD REPLYlink written 7.1 years ago by dfernan690