8.8 years ago
dfernan ▴ 710

Hi,

I have data from a single-end, 59 bp RNA-Seq experiment on mouse cells. Each experiment corresponds to one flow cell, so each one has between 100,000,000 and 200,000,000 reads. The protocol was a polyA RNA pull-down.

When I run FastQC I get some concerning results:

1) Overrepresented sequences corresponding to the Illumina adapter. Is this common?

Overrepresented sequences

Sequence Count Percentage Possible Source

GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTG 389676 0.29 TruSeq Adapter, Index 2 (100% over 59bp)

The index of the adapter used in this experiment is CGATGT, so the match makes sense. So the adapter is 59 bp, the same length as the reads in the single-end library? Isn't that an issue? Should I trim all the adapters? Does anyone have experience trimming adapters with Trim Galore? Does the following Trim Galore command make sense? Should I give the whole adapter sequence after -a, or just the index sequence as I did below?

That is, should I do this?

trim_galore -a CGATGT -q 15 -s 5 -e 0.05 --length 48 <fastq_file>


or this?

trim_galore -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG -q 15 -s 5 -e 0.05 --length 48 <fastq_file>


2) High duplicate numbers:

Sequence duplicate levels >= 75 % or so.

What should I do about the high duplication levels? My goal is differential expression analysis between experiments.

(1) Leave it as it is and map the data

(2) Collapse duplicates into one

(3) No solution: duplication levels this high mean something went wrong with the experiments.

rnaseq rna-seq rna fastqc fastq • 18k views
8.8 years ago

The advantage of listing the longer version of the adapter is that the tool can then recognize adapters with sequencing errors in them. I would list the adapter up to the variable region but not the variable region itself.

On duplication levels:


@Istvan thanks a lot.


By variable region, do you mean the actual index sequence? So you wouldn't actually trim the ATCTCGTATGCCGTCTTCTGCTTG part? Thanks a lot :)


Yes, because all the tool needs is to match the adapter; once it does, it removes the match and everything after it. So the initial GATCGGAAGAGCACACGTCTGAACTCCAGTCAC suffices.
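To illustrate the idea, here is a minimal Python sketch of 3' adapter trimming. It is not how Trim Galore or Cutadapt work internally (Cutadapt uses an alignment that can also allow indels); this version only counts mismatches. The function name `trim_3prime_adapter` and its parameters are made up for the example, with `max_error_rate` and `min_overlap` loosely playing the roles of `-e` and `-s`.

```python
# Constant part of the adapter quoted in the thread (everything before the index).
ADAPTER_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def trim_3prime_adapter(read, adapter=ADAPTER_PREFIX,
                        max_error_rate=0.05, min_overlap=5):
    """Cut the read at the first position where the adapter starts.

    Simplified, hypothetical model: mismatches only, no indels.
    The adapter may run off the end of the read (partial adapter).
    """
    for start in range(len(read) - min_overlap + 1):
        # Number of adapter bases that fit between `start` and the read end.
        overlap = min(len(adapter), len(read) - start)
        allowed = int(max_error_rate * overlap)
        window = read[start:start + overlap]
        mismatches = sum(1 for r, a in zip(window, adapter) if r != a)
        if mismatches <= allowed:
            # Adapter found: drop it and everything after it.
            return read[:start]
    return read  # no adapter detected


if __name__ == "__main__":
    insert = "T" * 20  # toy insert, shorter than the 59 bp read length
    read = (insert + ADAPTER_PREFIX + "CGATGTATCTCG")[:59]
    print(trim_3prime_adapter(read))
```

In this sketch the 33 bp prefix tolerates one sequencing error at a 5% error rate, whereas a 6 bp sequence such as the index alone tolerates none and would also cut any read that happens to contain that hexamer by chance.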


Yes, I now realize that you are discussing adapters at the read ends. I have found the Index 2 adapter at the beginning of reads in many of my strand-specific datasets, hence the question.

8.8 years ago
Irsan ★ 7.6k

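About the high duplication level: that's also normal in RNA-Seq. Usually a few of the most highly expressed transcripts take up a very large proportion of the flow cell. A handful of transcripts therefore account for most of the reads, and since each transcript offers only a limited number of distinct read start positions, the same sequences get read many times, which shows up as a high duplication level.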
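A toy simulation can make this concrete. The sketch below is only an illustration under made-up assumptions: the transcript names, abundances, transcript length and read count are hypothetical, a read is identified by its transcript and start position, and duplication is simply the fraction of reads that are not the first copy of their sequence. FastQC estimates duplication differently (from a subset of the file), so the numbers are not comparable to its report.

```python
import random
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that are repeats of an already-seen read."""
    return 1 - len(Counter(reads)) / len(reads)


def simulate_reads(abundances, transcript_length=1000, read_length=59,
                   n_reads=20000, seed=0):
    """Sample single-end reads as (transcript, start) pairs.

    Transcripts are picked in proportion to their abundance, and the
    start position is uniform along the transcript.
    """
    rng = random.Random(seed)
    names = list(abundances)
    weights = [abundances[name] for name in names]
    n_starts = transcript_length - read_length + 1
    picks = rng.choices(names, weights=weights, k=n_reads)
    return [(name, rng.randrange(n_starts)) for name in picks]


if __name__ == "__main__":
    # Hypothetical expression profiles: 100 transcripts each.
    uniform = {f"tx{i}": 1 for i in range(100)}
    skewed = dict(uniform, tx0=1000)  # one transcript dominates
    print("uniform:", round(duplication_rate(simulate_reads(uniform)), 2))
    print("skewed: ", round(duplication_rate(simulate_reads(skewed)), 2))
```

With the same sequencing depth, the skewed profile gives a much higher duplication rate than the uniform one, even though no PCR or library artifact is modelled.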
