Trim/Remove Reads With Adapters From Illumina RNA-Seq Experiment
8.6 years ago
dfernan ▴ 710

Hi,

I have data from a single-end 59 bp RNA-Seq experiment on mouse cells. Each experiment corresponds to one flow cell, meaning each experiment has between 100,000,000 and 200,000,000 reads. The protocol was a polyA RNA pull-down.

When I run fastqc I obtain some concerning results:

1) Overrepresented sequences corresponding to the Illumina adapter. Is this common?

Overrepresented sequences

Sequence Count Percentage Possible Source

GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTG 389676 0.29 TruSeq Adapter, Index 2 (100% over 59bp)

The index of the specific adapter in this experiment is CGATGT, which makes sense. So the adapter match is 59 bp, the same length as the single-end reads. Isn't that an issue? Should I trim all the adapters? Does anyone have experience trimming adapters with Trim Galore? Does the following trim_galore command make sense? Should I pass the whole adapter sequence after -a, or just the index sequence as in the first command below?

I.e., should I do this?

trim_galore -a CGATGT -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

or this?

trim_galore -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG -q 15 -s 5 -e 0.05 --length 48 <fastq_file>

2) High duplicate numbers:

Sequence duplicate levels >= 75 % or so.

What should I do about the high duplicate levels? - My goal is to do differential expression between experiments.

(1) Leave it as it is and map the data

(2) Collapse duplicates into one

(3) No solution; duplicate levels this high mean something went wrong with the experiment.

Please let me know your suggestions regarding these issues. Thanks!

8.6 years ago

The advantage of listing the longer version of the adapter is that the tool can then recognize adapters with sequencing errors in them. I would list the adapter up to the variable region but not the variable region itself.
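To make the point about sequencing errors concrete, here is a minimal sketch. It assumes the cutadapt-style rule (Trim Galore calls cutadapt underneath) in which the number of tolerated errors is the error rate times the length of the matched adapter portion, rounded down. The function name `allowed_errors` is made up for illustration, and 0.05 is the `-e` value from the question.

```python
# Sketch only: assumes allowed errors = floor(error_rate * matched adapter length),
# which is my understanding of how cutadapt applies its error rate.

def allowed_errors(match_length, error_rate=0.05):
    """Number of errors tolerated for a match of the given length."""
    return int(match_length * error_rate)

INDEX_ONLY = "CGATGT"                              # the 6-base index from the question
COMMON_PART = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # adapter up to the index

for name, seq in [("index only", INDEX_ONLY), ("common part", COMMON_PART)]:
    print(name, len(seq), "bases ->", allowed_errors(len(seq)), "error(s) tolerated")
```

Under that assumption the 6-base index tolerates no errors at all at `-e 0.05`, while the 33-base common part tolerates one, so an adapter copy with a single miscalled base is still recognized.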

On duplication levels:


@Istvan thanks a lot.


By variable region, do you mean the actual index sequence? So you wouldn't need to list the ATCTCGTATGCCGTCTTCTGCTTG part at all? Thanks a lot :)


Yes, because all the tool needs is to match the adapter; once it does, it removes the adapter and everything after it. So the initial GATCGGAAGAGCACACGTCTGAACTCCAGTCAC suffices.
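A rough sketch of that "match, then cut everything after" behaviour, in case it helps. This is not cutadapt's actual algorithm: it only counts substitutions (no indels), and the function name `trim_3prime_adapter`, the parameter names, and the example insert sequence are all made up for illustration. The defaults mirror the `-e 0.05` and `-s 5` values from the question.

```python
# Simplified 3' adapter trimming: substitutions only, no indels.
# Not the real cutadapt implementation, just an illustration of the idea.

ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # common part, without the index

def trim_3prime_adapter(read, adapter=ADAPTER, max_error_rate=0.05, min_overlap=5):
    """Return the read with the adapter and everything after it removed."""
    for start in range(len(read)):
        # The adapter may run off the end of the read, so compare only the overlap.
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break
        mismatches = sum(a != b for a, b in zip(read[start:start + overlap], adapter))
        if mismatches <= int(overlap * max_error_rate):
            return read[:start]  # cut at the first acceptable match
    return read

# Hypothetical 59 bp read: a 30-base insert followed by the start of the adapter.
insert = "ACGTTGCAACGTAGCTAGGCTAACGTTAGC"
read = insert + ADAPTER[:29]
print(trim_3prime_adapter(read))
```

Because the match is anchored on the common part, the index and the bases after it never need to be listed: they sit downstream of the cut point and are removed anyway.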


Yes, I now realize that you are discussing adapters at the read ends. I have found the Index 2 adapter at the beginning of reads in many of my strand-specific datasets, hence the question.

8.6 years ago
Irsan ★ 7.5k

About Illumina adapters in your reads: yes, that's normal. I have no experience with Trim Galore. I use cutadapt / trimmomatic to remove adapters and trim low-quality bases at the ends of reads.

About the high duplication level, that's also normal in rnaseq. Usually a few of the most expressed transcripts consume a very big proportion of the flow cell. This means only a few sequences result in many reads resulting in a high duplication level.
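A back-of-the-envelope sketch of why this happens. It assumes a deliberately simple model that is not from the thread: the reads from one transcript start at positions chosen uniformly at random, and two single-end reads count as duplicates when they share a start position. The function name and the transcript length and read counts below are hypothetical.

```python
# Toy model: n_reads drawn uniformly from n_positions possible start sites.
# Expected distinct reads = m * (1 - (1 - 1/m)^n); everything else is a duplicate.

def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that duplicate an earlier read."""
    distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - distinct / n_reads

READ_LENGTH = 59                      # single-end read length from the question
transcript_length = 2000              # hypothetical highly expressed transcript
positions = transcript_length - READ_LENGTH + 1

# Hypothetical: this one transcript attracts a million reads.
print(expected_duplicate_fraction(1_000_000, positions))
# Hypothetical: a lowly expressed transcript of the same length gets 100 reads.
print(expected_duplicate_fraction(100, positions))
```

With only about 1,900 possible start positions, a transcript that draws a million reads cannot avoid duplicates, whereas one that draws a hundred reads has few. A handful of such transcripts is enough to push the overall duplication level up without anything having gone wrong in the library.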


@irsan, thank you good answer
