Question

Removing overrepresented sequences in paired end RNA-seq

9 months ago

Kermit ▴ 90

After adapter trimming and QC of paired-end RNA-seq reads with trim_galore (cutadapt + FastQC), should I also remove the over-represented sequences that FastQC flags as a possible adapter/primer source?

[Is there a high chance that over-represented seqs will mess up my downstream gene quantification data; do I even care?]

If so, is there a way to do this automatically, similar to the following cutadapt option?

--action {trim,retain,mask,lowercase,none} (default: trim)
Specify what to do if an adapter match was found

Paired End 1

Sequence:        [50 base sequence]
Count:           46619
Percentage:      0.11
Possible Source: TruSeq Adapter, Index 13 (97% over 39bp)

TruSeq Adapter, Index 13: 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTG
Source: https://dnatech.genomecenter.ucdavis.edu/wp-content/uploads/2013/06/illumina-adapter-sequences_1000000002694-00.pdf

Paired End 2

Sequence:        [50 base sequence] 
Count:           50457
Percentage:      0.12
Possible Source: Illumina Single End PCR Primer 1 (100% over 50bp)

Original Command

trim_galore --illumina --fastqc --paired file_1.fq.gz file_2.fq.gz

If it's a 100% match then why isn't it removed?
Both of the identified sequences start with the same 11 bases ATCGGAAGAGC
When I tested a different paired-end sample, it had no over-represented sequences.
Counts of 46K and 50K seem really high. Is that only possible if these are mitochondrial RNA or rRNA?

cutadapt rna-seq trimgalore • 778 views

ADD COMMENT • link updated 9 months ago by Ram 43k • written 9 months ago by Kermit ▴ 90

There is a core sequence present in all Illumina indexed adapters. Once that sequence is found, you should remove it and everything 3' of it in the read. I am not sure what you are asking here. I am not a trim_galore user, but if it understands Illumina adapters then the action to use is "trim".
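One way to check whether adapter-like sequence is still present after trimming is to count the reads that contain it. The following is only a minimal sketch: `count_reads_with_motif` is a hypothetical helper name, the motif in the usage line is the 11-base prefix quoted in the question, and the function assumes standard 4-line FASTQ records with unwrapped sequences.

```shell
#!/bin/sh
# Hypothetical helper: count reads whose sequence line contains a motif.
# Assumes a gzipped FASTQ with exactly 4 lines per record.
# Prints "<reads containing motif> <total reads>".
count_reads_with_motif() {
    local fastq_gz="$1" motif="$2"
    gzip -dc "$fastq_gz" | awk -v motif="$motif" '
        NR % 4 == 2 {                      # line 2 of each record is the sequence
            total++
            if (index($0, motif) > 0) hits++
        }
        END { printf "%d %d\n", hits, total }'
}
```

For example, `count_reads_with_motif file_1.fq.gz ATCGGAAGAGC` could be run on the raw and on the trimmed file to compare the two counts. This is an exact substring match only, so reads carrying the motif with a mismatch are not counted.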

Over-represented sequences will not necessarily be identified as a matter of course in RNA-seq data. FastQC uses the following limit when scanning for these sequences:

To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

If the over-represented sequences are not identified as library adapters then leave them in the dataset for further analysis.
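Because of the 100,000-read limit quoted above, it can be useful to tally sequences over the whole file independently of FastQC. The sketch below is an assumption-laden illustration, not a reimplementation of FastQC's module: `top_sequences` is a hypothetical helper name, it counts exact full-length read sequences only, and it holds every distinct sequence in memory, so it may be impractical on very large files.

```shell
#!/bin/sh
# Hypothetical helper: list the N most frequent read sequences in a gzipped FASTQ.
# Assumes exactly 4 lines per record. Prints "<count> <sequence>" per line,
# highest count first (ties broken by sequence).
top_sequences() {
    local fastq_gz="$1" n="${2:-10}"
    gzip -dc "$fastq_gz" \
        | awk 'NR % 4 == 2 { count[$0]++ }
               END { for (s in count) printf "%d %s\n", count[s], s }' \
        | sort -k1,1nr -k2,2 \
        | head -n "$n"
}
```

For example, `top_sequences file_1.fq.gz 5` prints the five most common reads in the file, which can then be compared against the adapter list.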

ADD REPLY • link 9 months ago by GenoMax 141k

In case you are sequencing for counting applications like differential gene expression (DGE) RNA-seq analysis, ChIP-seq, ATAC-seq, read trimming is generally not required anymore when using modern aligners. For such studies local aligners or pseudo-aligners should be used. Modern “local aligners” like STAR, BWA-MEM, HISAT2, will “soft-clip” non-matching sequences.

https://dnatech.genomecenter.ucdavis.edu/faqs/when-should-i-trim-my-illumina-reads-and-how-should-i-do-it/

ADD REPLY • link 9 months ago by Kermit ▴ 90

score 0 · Answer 1 · 2023-07-04

One way to exclude these overrepresented sequences is to use a different trimmer that catches them in the first place:

bbmap/bbduk.sh in1=file_1.fq.gz in2=file_2.fq.gz out1=file_1_trim.fq.gz out2=file_2_trim.fq.gz ref=adapters ktrim=r k=23 mink=11 hdist=1 tpe tbo

fastqc -t 2 file_1_trim.fq.gz file_2_trim.fq.gz

For bbduk I just used the options from the first paragraph of their docs. It was also about 20x faster than cutadapt.

Neither of the originally overrepresented sequences is found in the FastQC results.