Question: order of trimming for RNA-seq QC
Asked 29 days ago by maria2019:

I have human bulk RNA-seq paired-end reads (R1, R2), and FastQC reports multiple overrepresented sequences that are not adapters. The per base sequence content module also shows a warning. I used BLAT to check the overrepresented sequences: they all map to either chrUn_GL000220v1 or chr14, except for the poly-G sequence (GGGGGG...) from R2.

a) I need to trim the last 5 bases from both R1 and R2. I have read that the first 12 bases are fine and do not need to be trimmed for RNA-seq analysis (correct me if I am wrong).
b) I also need to trim the overrepresented sequences, since they are contamination, except for the GGGG... sequence, which did not align to any sequence in the human genome.

Below is the link to the reports: https://hmaryam0.wixsite.com/fastqc-reps

What should the order of trimming be? Should I trim A) everything in one run, B) 1. ends, 2. overrepresented sequences, or C) 1. overrepresented sequences, 2. ends? I have tried all three and they all give different results.

A) cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq R1.fastq R2.fastq

B) 1. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -a (overreps) -A (overreps) -o tr_ends_R1.fastq -p tr_ends_R2.fastq R1.fastq R2.fastq
   2. cutadapt -a (overreps) -A (overreps) -o tr_R1.fastq -p tr_R2.fastq tr_ends_R1.fastq tr_ends_R2.fastq

C) 1. cutadapt -a (different overreps) -A (different overreps) -o tr_overreps_R1.fastq -p tr_overreps_R2.fastq R1.fastq R2.fastq
   2. cutadapt -u -5 -U -5 --pair-filter any --minimum-length 10 -o tr_R1.fastq -p tr_R2.fastq tr_overreps_R1.fastq tr_overreps_R2.fastq
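
One likely reason the orders disagree is that fixed-length end trimming and removal of a 3' sequence do not commute. The cutadapt documentation states that -u/--cut is applied before adapter trimming, so a single run should behave like "ends first". The toy Python sketch below illustrates the effect; the function names and the example read are hypothetical, and it uses exact matching only, whereas cutadapt also allows partial and error-tolerant matches.

```python
def cut_end(read, n):
    """Remove the last n bases, like a negative value passed to cutadapt -u."""
    return read[:-n] if n > 0 else read


def trim_3prime_seq(read, seq):
    """Remove the first exact occurrence of seq and everything after it.

    Simplified stand-in for 3' adapter trimming (-a): exact matches only.
    """
    idx = read.find(seq)
    return read if idx == -1 else read[:idx]


def ends_then_seq(read, seq, n):
    return trim_3prime_seq(cut_end(read, n), seq)


def seq_then_ends(read, seq, n):
    return cut_end(trim_3prime_seq(read, seq), n)


# Hypothetical read: 10 bases of insert, the sequence GGCTA, then 3 more bases.
read = "ACGTACGTTTGGCTAAAC"
seq = "GGCTA"

print(ends_then_seq(read, seq, 5))  # ACGTACGTTTGGC (sequence was cut in half, no exact match left)
print(seq_then_ends(read, seq, 5))  # ACGTA (sequence removed first, then 5 insert bases lost)
```

In the second order the 5 bases are removed from the already shortened read, so more of the insert is lost and more reads can fall below --minimum-length.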

Tags: fastqc, rna-seq, qc, cutadapt

a) Correct: do not trim the initial 10-15 bases.
b) Do not do anything to over-represented sequences if they are not adapters. Check whether they are rRNA sequences; otherwise you may end up throwing away good data.
c) The poly-G stretches are most likely an artifact of 2-color chemistry, where no signal is called as G. You can remove those stretches.
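
As a rough illustration of point c), the sketch below strips a trailing run of G bases from a read. The function name and the min_run threshold of 10 are assumptions made for the example. In practice, dedicated options such as cutadapt's --nextseq-trim or fastp's --trim_poly_g are the usual route, and they do not work exactly like this simple string operation.

```python
def trim_poly_g_tail(seq, min_run=10):
    """Strip a trailing run of G bases if it is at least min_run long.

    min_run is an arbitrary illustrative threshold, not a recommended value.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq


print(trim_poly_g_tail("ACGTTCAT" + "G" * 15))  # ACGTTCAT
print(trim_poly_g_tail("ACGTTCATGG"))           # ACGTTCATGG (run too short, left unchanged)
```

A read that consists entirely of G comes back as an empty string, so a minimum-length filter afterwards is still needed.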

See these informative blog posts:
https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/
https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

written 29 days ago by genomax (modified 29 days ago)

Thank you very much for your response. The reads are not rRNA, but they do map to human chromosomes. Are they still not considered contamination?

written 28 days ago by maria2019

If they are aligning to the correct genome then they are not contamination. It is possible that some genes may be highly expressed and sequences from them may show up as "over-represented".
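
To see why abundance alone can trigger the flag, here is a minimal sketch of a frequency-based check in the spirit of FastQC's module, which warns when a sequence makes up more than 0.1% of the reads it tracks. The function name, the 50-base prefix, and the toy reads are assumptions for illustration, not FastQC's actual implementation.

```python
from collections import Counter


def overrepresented(reads, threshold=0.001, prefix_len=50):
    """Return {sequence: fraction} for sequences above the given fraction of reads.

    Reads are compared on their first prefix_len bases (an assumption here).
    """
    counts = Counter(read[:prefix_len] for read in reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items() if n / total > threshold}


# Toy data: one transcript fragment dominates, as a highly expressed gene might.
reads = ["ACGTACGT"] * 6 + ["TTTTCCCC", "GGGGAAAA", "CCCCTTTT", "AAAAGGGG"]

# A high threshold is used only because the toy data set is tiny.
print(overrepresented(reads, threshold=0.25))  # {'ACGTACGT': 0.6}
```

The check says nothing about where a sequence comes from, which is why mapping the flagged sequences back to the genome is the step that distinguishes real signal from contamination.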

written 28 days ago by genomax