I have human bulk RNA-seq paired-end reads (R1, R2), and the FastQC reports show multiple overrepresented sequences that are not adapters. The per base sequence content module also shows a warning. I used BLAT to check the overrepresented sequences, and they all map to either chrUn_GL000220v1 or chr14, except the sequence GGGGGG... from R2.
a) I need to trim the last 5 bases from both R1 and R2. I have read that the first 12 bases are fine and do not need to be trimmed for RNA-seq analysis (correct me if I am wrong). b) I also need to trim the overrepresented sequences, since they are contamination, except for GGGG.., which did not align to any sequence in the human genome.
Below is the link to the reports: https://hmaryam0.wixsite.com/fastqc-reps
What should the order of trimming be? Should I trim A) everything in one run, B) 1. the ends, then 2. the overrepresented sequences, or C) 1. the overrepresented sequences, then 2. the ends? I have tried all three, and each gives a different result.
A) cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq R1.fastq R2.fastq
B) 1. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_ends_R1.fastq -p tr_ends_R2.fastq R1.fastq R2.fastq
   2. cutadapt -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq tr_ends_R1.fastq tr_ends_R2.fastq
C) 1. cutadapt -a (different overreps) -A (different overreps) -o tr_overreps_R1.fastq -p tr_overreps_R2.fastq R1.fastq R2.fastq
   2. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -o tr_R1.fastq -p tr_R2.fastq tr_overreps_R1.fastq tr_overreps_R2.fastq
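
To illustrate why the order of the two operations can change the output, here is a minimal toy sketch in Python. It is not cutadapt's algorithm: it only does exact matching of a 3' sequence (full or partial at the read end), and the read, the "overrepresented" sequence, and the function names are all made up for the example. The point is only that cutting a fixed number of bases first can shorten a partial 3' match, so fewer bases are recognized and removed afterwards.

```python
def cut_3prime(read: str, n: int) -> str:
    """Remove n bases from the 3' end (toy stand-in for a fixed-length cut)."""
    return read[:-n] if n > 0 else read


def trim_3prime_seq(read: str, seq: str, min_overlap: int = 3) -> str:
    """Remove seq and everything after it, or a partial match at the read end.

    Toy model: exact matches only, no error tolerance (an assumption made
    for simplicity; real trimmers allow mismatches).
    """
    idx = read.find(seq)
    if idx != -1:
        return read[:idx]
    # Look for the longest prefix of seq that sits at the very end of the read.
    for k in range(len(seq) - 1, min_overlap - 1, -1):
        if read.endswith(seq[:k]):
            return read[:-k]
    return read


# Made-up data: an 11-base insert followed by the first 9 bases of the
# sequence to remove.
overrep = "GATTACAGGCC"
read = "ACGTACGTACG" + "GATTACAGG"

# Order as in B: cut 5 bases from the 3' end, then remove the sequence.
ends_then_seq = trim_3prime_seq(cut_3prime(read, 5), overrep)

# Order as in C: remove the sequence, then cut 5 bases from the 3' end.
seq_then_ends = cut_3prime(trim_3prime_seq(read, overrep), 5)

print(ends_then_seq)  # ACGTACGTACG (11 bases)
print(seq_then_ends)  # ACGTAC (6 bases)
```

In this toy case the two orders leave reads of 11 and 6 bases, so a minimum-length filter of 10 would keep one and discard the other. When comparing A against B, it is also worth checking the order of operations that the cutadapt documentation describes for a single run.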