Question

Illumina index adapter trimming of FASTQ files using Cutadapt/TrimGalore

1

4.1 years ago

arctic ▴ 40

Dear all,

I am new to the field. I am trying to analyze single-end 100 bp FASTQ files with ~70 million reads per sample. I want to determine whether adapter sequences are present and, if so, how to handle them. I ran FastQC on the files, and the reports show that each has an "overrepresented sequence" identified as an "Illumina index adapter".

I have the following questions:

1. Does sample1 look like it has already been trimmed, or does it require adapter trimming?
2. If further trimming is recommended, what would be the best sequence/adapter option to use with cutadapt/TrimGalore? [See below for my thoughts so far]
3. Based on the FastQC report, do I need to worry about the presence of any adapter sequences besides the index adapter?

My thoughts on question 2: The Illumina index adapter format appears to be:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG

These are the adapter sequences found in my FastQC report for sample 1:

GATCGGAAGAGCACACGTCTGAACTCCAGTCACCATGGCATCTCGTATGC 
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCATGGCATCTCGTATG

I am thinking of using the options below with cutadapt/TrimGalore to remove the adapter(s):

trim_galore sample1.fastq.gz -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG -q 20 --length 20 --fastqc

However, it seems that Trimmomatic, for instance, only uses the initial part of the index adapter (up to the Ns, not after): https://github.com/timflutre/trimmomatic/blob/master/adapters/TruSeq3-SE.fa

Many thanks in advance for your time and replies.

cutadapt TrimGalore trimming RNA-Seq index adapter • 7.3k views

ADD COMMENT • link updated 2.2 years ago by ccfpwll • 0 • written 4.1 years ago by arctic ▴ 40

1

For your reference, most trimming programs trim everything to the right once they find the core sequence that is common to the adapters. Finding the core sequence indicates that the read ran out of insert and continued into the adapter at the 3' end (i.e., the insert is shorter than the read length).
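As a rough illustration of the behavior described above (not how any particular trimmer is implemented), the sketch below cuts each read, and its quality string, at the first exact occurrence of the adapter core. The function name `trim_at_core` is hypothetical, and exact matching is a simplifying assumption: real trimmers such as cutadapt also tolerate mismatches and partial adapter matches at the 3' end.

```shell
# Hypothetical helper: trim_at_core ADAPTER_CORE < in.fastq > out.fastq
# Assumes 4-line FASTQ records and exact (error-free) adapter matches.
trim_at_core() {
  awk -v core="$1" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 {
      seq = $0
      pos = index(seq, core)                 # 1-based start of the core, 0 if absent
      keep = (pos > 0) ? pos - 1 : length(seq)
    }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      # Cut the sequence and its qualities at the same position.
      print header
      print substr(seq, 1, keep)
      print plus
      print substr($0, 1, keep)
    }
  '
}
```

It could be used as `gzip -dc sample1.fastq.gz | trim_at_core AGATCGGAAGAGC > sample1.trimmed.fastq`. Reads that consist entirely of adapter come out empty here, whereas a real trimmer would normally discard reads below a minimum length.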

ADD REPLY • link 4.1 years ago by GenoMax 141k

0

Thanks a lot for pointing me in the right direction. So, since the Illumina universal adapter sequence ("AGATCGGAAGAGC") is already at the "left end" (5' end) of the index adapter sequence, trimming that sequence alone will also remove the rest of the index adapter to its right. That would also explain why Trimmomatic uses only the "left end" of the index adapter sequence for trimming. I tried trimming with the universal sequence, and it appears to have removed the index adapter sequences accordingly.

Based on your reply, I also found the following, which was helpful: https://support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-from-only-the--ends-of-reads.html

Best

ADD REPLY • link 4.1 years ago by arctic ▴ 40

0

2.5 years ago

Jiacheng ▴ 60

In your case, you do not need to worry about 0.5% adapter content. However, if you want optimal results, that 0.5% can be removed while leaving non-adapter sequence intact as trimming algorithms improve. Also, trimming is not only about adapters: removing Ns and low-quality tails is also important.

I'd recommend Atria to detect and trim the adapter sequences. It is a newly published, cutting-edge trimmer with exceptional precision and speed. If you do not know which adapter sequence to use, Atria can detect adapters when the adapter content is higher than 0.04%. (Below 0.04%, there is no need for adapter trimming.)

E.g., to find adapters:

atria --detect-adapter -r reads.fastq [...]

Do N trimming and low-quality trimming:

atria --no-adapter-trim -r read1.fastq [-R read2.fastq]

ADD COMMENT • link 2.5 years ago by Jiacheng ▴ 60

0

Hi Jiacheng, I checked out Atria. It seems the --detect-adapter feature is not ready for macOS, right?

ADD REPLY • link 2.2 years ago by ccfpwll • 0

score 3 · Accepted Answer · 2020-03-13

3

4.1 years ago

swbarnes2 14k

0.5%? You really don't have to worry about that if you don't want to.

The N's stand for the variable index region. You know what the index is; you can see it in the FastQC report. Why would you put N's in?

ADD COMMENT • link 4.1 years ago by swbarnes2 14k

0

Thank you for your reply.

About the Ns: I have several files whose index adapters are the same except for those 6 bases, so I was planning to include Ns so that I could trim all the samples in parallel with one command. However, after GenoMax's comment, I understand that trimming the "universal adapter sequence" would also remove the rest of the index adapter sequence. Considering both your and ATpoint's feedback, I think I will skip the trimming step altogether. Best.

ADD REPLY • link 4.1 years ago by arctic ▴ 40

score 3 · Accepted Answer · 2020-03-13

3

4.1 years ago

ATpoint 82k

The sequence to trim if you have Universal Adapter contamination is AGATCGGAAGAGC. In your case (0.5%) I would not even bother and would align the files directly without any manipulation.
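For a rough cross-check of a figure like the 0.5% above, one could count the reads that contain this core sequence. This is only a minimal sketch: the function name `adapter_fraction` is hypothetical, it assumes 4-line FASTQ records on standard input, and it counts exact, full-length matches only, so it is not the same metric as FastQC's overrepresented-sequence percentage and will miss adapters with sequencing errors or partial adapters at the read end.

```shell
# Hypothetical helper: adapter_fraction ADAPTER_CORE < in.fastq
# Prints: reads_with_core total_reads percent
adapter_fraction() {
  awk -v core="$1" '
    NR % 4 == 2 {                      # sequence line of each 4-line record
      total++
      if (index($0, core) > 0) hits++
    }
    END {
      pct = (total > 0) ? 100 * hits / total : 0
      printf "%d %d %.2f\n", hits, total, pct
    }
  '
}
```

For example, `gzip -dc sample1.fastq.gz | head -n 4000000 | adapter_fraction AGATCGGAAGAGC` would look at the first million reads only, which is usually enough for an estimate.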

ADD COMMENT • link 4.1 years ago by ATpoint 82k

0

Thank you for your comment ATpoint. I think I will skip trimming and proceed directly with alignment as you and swbarnes2 recommended. Best

ADD REPLY • link 4.1 years ago by arctic ▴ 40