Question

Trim/Remove Reads With Adapters From Illumina RNA-Seq Experiment

5

10.5 years ago

dfernan ▴ 760

Hi,

I have data from a single-end, 59 bp RNA-Seq experiment on mouse cells. Each experiment corresponds to one flow cell, so each has between 100,000,000 and 200,000,000 reads. The protocol was a poly(A) RNA pull-down.

When I run FastQC I get some concerning results:

1) Overrepresented sequences corresponding to the Illumina adapter. Is this common?

Overrepresented sequences

Sequence Count Percentage Possible Source

GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTG 389676 0.29 TruSeq Adapter, Index 2 (100% over 59bp)

The index of the specific adapter in this experiment is CGATGT, which makes sense. So the adapter is 59 bp, the same length as the single-end reads? Isn't that an issue? Should I trim all the adapters? Does anyone have experience trimming adapters with Trim Galore? Does the following trim_galore command make sense? Should I pass the whole adapter sequence after -a, or just the index sequence as I did below?

I.e., should I do this?

trim_galore -a CGATGT -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

or this?

trim_galore -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

2) High duplicate numbers:

Sequence duplication levels are >= 75% or so.

What should I do about the high duplication levels? My goal is to do differential expression between experiments.

(1) Leave it as it is and map the data

(2) Collapse duplicates into one

(3) No solution; duplication levels this high mean something went wrong with the experiment.

Please let me know your suggestions regarding these issues. Thanks!

rnaseq rna-seq rna fastqc fastq • 19k views

Updated 10.5 years ago by Istvan Albert 100k • written 10.5 years ago by dfernan ▴ 760

score 7 · Answer 1 · 2013-10-23

10.5 years ago

Istvan Albert 100k

The advantage of listing the longer version of the adapter is that the tool can then recognize adapters with sequencing errors in them. I would list the adapter up to the variable region but not the variable region itself.
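To make the point about longer adapters concrete, here is a rough sketch in Python. It assumes a cutadapt-style rule in which the number of tolerated mismatches is the floor of the matched length times the maximum error rate (the -e 0.05 from the question), and it treats read bases as uniformly random, which real reads are not. The function names are made up for illustration and are not part of any tool.

```python
import math

# Sequences taken from the question: the 6-base index alone, and the
# adapter up to (but not including) the index.
INDEX_ONLY = "CGATGT"
ADAPTER_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def allowed_errors(adapter: str, max_error_rate: float) -> int:
    """Mismatches tolerated for a full-length match (assumed floor rule)."""
    return math.floor(len(adapter) * max_error_rate)


def expected_chance_hits(adapter: str, read_length: int) -> float:
    """Expected exact occurrences of `adapter` in one uniformly random read."""
    positions = max(read_length - len(adapter) + 1, 0)
    return positions / 4 ** len(adapter)


if __name__ == "__main__":
    for name, seq in [("index only", INDEX_ONLY), ("adapter prefix", ADAPTER_PREFIX)]:
        print(
            f"{name}: {len(seq)} bases, "
            f"{allowed_errors(seq, 0.05)} mismatch(es) allowed, "
            f"{expected_chance_hits(seq, 59):.3g} chance hits per 59 bp read"
        )
```

Under these assumptions the six-base index tolerates no mismatches yet would turn up by chance in roughly 1.3% of random 59 bp reads, while the 33-base prefix tolerates one mismatch and essentially never matches by chance.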

On duplication levels:

0

@Istvan thanks a lot.

10.5 years ago by dfernan ▴ 760

0

By variable region, do you mean the actual index sequence? So you wouldn't actually trim the ATCTCGTATGCCGTCTTCTGCTTG part? Thanks a lot :)

10.5 years ago by Biomonika (Noolean) 3.2k

0

Yes, because all the tool needs is to match the adapter; once it does, it removes the adapter and everything after it. So the initial GATCGGAAGAGCACACGTCTGAACTCCAGTCAC suffices.

10.5 years ago by Istvan Albert 100k
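The behaviour described in the reply above (find the adapter, then drop it and everything after it) can be sketched as follows. This is a simplified illustration with exact matching only, not how Trim Galore or cutadapt are implemented; the function name is hypothetical, and the `min_overlap` parameter only loosely mirrors the -s 5 stringency from the question.

```python
ADAPTER_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def trim_3prime_adapter(read: str, adapter: str = ADAPTER_PREFIX,
                        min_overlap: int = 5) -> str:
    """Remove a 3' adapter and everything after it (exact matches only)."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so look for the
    # longest read suffix that equals a prefix of the adapter.
    max_len = min(len(adapter) - 1, len(read))
    for k in range(max_len, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    insert = "ACGTTGCAAGGCTTAACCGGTATTCC"  # made-up cDNA fragment
    print(trim_3prime_adapter(insert + ADAPTER_PREFIX + "CGATGT"))
```

A read that consists entirely of adapter, like the overrepresented sequence in the question, comes back empty in this sketch; a length filter such as --length 48 would then discard it.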

0

Yes, I now realize that you are discussing adapters at the read ends. I have found the Index 2 adapter at the beginning of reads in many of my strand-specific datasets, hence the question.

10.5 years ago by Biomonika (Noolean) 3.2k

score 4 · Answer 2 · 2013-10-23

10.5 years ago

Irsan ★ 7.8k

About Illumina adapters in your reads: yes, that's normal. I have no experience with Trim Galore. I use cutadapt or Trimmomatic to remove adapters and trim low-quality bases at the ends of reads.

About the high duplication level: that's also normal in RNA-seq. Usually a few of the most highly expressed transcripts consume a very large proportion of the flow cell. This means a few sequences account for many of the reads, resulting in a high duplication level.
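A toy simulation can illustrate why skewed expression alone produces many duplicate reads. Everything here is an assumption made for illustration: 2,000 transcripts, 1,000 possible read start positions each, 200,000 single-end reads, and a 1/rank expression profile. None of these numbers come from the poster's data, and a read counts as a duplicate whenever another read has the same transcript and start position.

```python
import random


def duplicate_fraction(weights, n_reads, starts_per_transcript=1000, seed=0):
    """Fraction of reads that repeat an already-seen (transcript, start) pair."""
    rng = random.Random(seed)
    # Pick a source transcript for each read in proportion to its expression.
    transcripts = rng.choices(range(len(weights)), weights=weights, k=n_reads)
    # Each read starts at a uniformly random position within its transcript.
    distinct = {(t, rng.randrange(starts_per_transcript)) for t in transcripts}
    return 1 - len(distinct) / n_reads


if __name__ == "__main__":
    n_transcripts, n_reads = 2000, 200_000
    uniform = [1.0] * n_transcripts
    skewed = [1.0 / rank for rank in range(1, n_transcripts + 1)]
    print("uniform expression:", round(duplicate_fraction(uniform, n_reads), 3))
    print("skewed expression: ", round(duplicate_fraction(skewed, n_reads), 3))
```

In this toy model the skewed profile yields a far higher duplicate fraction than the uniform one at the same sequencing depth, because the top few transcripts receive more reads than they have distinct start positions. Collapsing duplicates in such a model would flatten the counts for exactly those transcripts.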