I have data from a single-end, 59 bp RNA-Seq experiment on mouse cells. Each experiment corresponds to one flow cell, so each experiment has between 100,000,000 and 200,000,000 reads. The protocol was a polyA RNA pull-down.
When I run FastQC, I get some concerning results:
1) Overrepresented sequences corresponding to the Illumina adapter. Is this common?
Sequence Count Percentage Possible Source
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTG 389676 0.29 TruSeq Adapter, Index 2 (100% over 59bp)
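As a quick consistency check, the count and percentage reported for this sequence imply a total read count, which can be compared with the 100 to 200 million reads expected per experiment. A minimal sketch, assuming the reported percentage is rounded to two decimals (so the true value lies somewhere between 0.285 and 0.295); the function name `implied_total_reads` is made up for illustration:

```shell
# Hypothetical helper: total reads implied by a FastQC count and its percentage.
implied_total_reads() {
    # $1 = sequence count, $2 = percentage of all reads (e.g. 0.29)
    awk -v count="$1" -v pct="$2" 'BEGIN { printf "%d\n", count / (pct / 100) }'
}

implied_total_reads 389676 0.29    # point estimate
implied_total_reads 389676 0.295   # lower bound, allowing for rounding
implied_total_reads 389676 0.285   # upper bound, allowing for rounding
```

This gives roughly 134 million reads (about 132 to 137 million allowing for rounding), which is consistent with the expected library size.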
The index of the adapter used in this experiment is CGATGT, so the match makes sense. But the adapter sequence is 59 bp, the same length as my single-end reads. Isn't that an issue? Should I trim all the adapters? Does anyone have experience trimming adapters with Trim Galore? Does the following Trim Galore command make sense? Should I pass the whole adapter sequence after -a, or just the index sequence, as in the first command below?
That is, should I do this (index only)?
trim_galore -a CGATGT -q 15 -s 5 -e 0.05 --length 48 <fastq_file>
Or this (adapter sequence)?
trim_galore -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCG -q 15 -s 5 -e 0.05 --length 48 <fastq_file>
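One way to see how the two choices differ before trimming is to count how many reads contain each candidate sequence. A rough sketch, assuming an uncompressed FASTQ with four lines per record; `count_reads_with` and `reads.fastq` are placeholder names, and plain substring matching is only an approximation of what an adapter trimmer does (no mismatches, no partial matches at the read end):

```shell
# Hypothetical helper: count reads whose sequence line contains a given string.
count_reads_with() {
    # $1 = FASTQ file, $2 = sequence to look for
    # NR % 4 == 2 selects the sequence line of each four-line FASTQ record.
    awk -v pat="$2" 'NR % 4 == 2 && index($0, pat) > 0 { n++ } END { print n + 0 }' "$1"
}

# Example calls (reads.fastq is a placeholder file name):
# count_reads_with reads.fastq CGATGT          # 6 bp index only
# count_reads_with reads.fastq GATCGGAAGAGC    # first 12 bp of the adapter
```

A 6 bp sequence is expected to turn up by chance in some reads that carry no adapter at all, so the first count is likely to overstate adapter contamination.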
2) High duplicate numbers:
Sequence duplication levels are >= 75% or so.
What should I do about the high duplication levels? My goal is to do differential expression analysis between experiments.
(1) Leave it as it is and map the data
(2) Collapse duplicates into one
(3) No solution: duplication levels this high mean something went wrong with the experiments.
Please let me know your suggestions regarding these issues. Thanks!
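For reference, option (2) amounts to keeping one copy of each distinct read sequence. A minimal sketch of how many reads that would remove, assuming an uncompressed FASTQ with four lines per record and exact-sequence matching; `duplication_percent` is a made-up helper, and FastQC's own figure is an estimate that may differ from this number:

```shell
# Hypothetical helper: percentage of reads that would be removed
# if exact duplicate sequences were collapsed into one copy each.
duplication_percent() {
    # $1 = FASTQ file
    awk 'NR % 4 == 2 { total++; if (!seen[$0]++) distinct++ }
         END {
             if (total > 0) printf "%.1f\n", 100 * (1 - distinct / total)
             else print "NA"
         }' "$1"
}

# Example call (reads.fastq is a placeholder file name):
# duplication_percent reads.fastq
```

This holds every distinct sequence in memory, so with 100 to 200 million reads it may be more practical to run it on a subsample.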