Question

How can I deal with adapter contamination in next-gen sequencing reads?

8.6 years ago

gbdias ▴ 150

Hey guys,

After browsing similar questions and trying the "friendly" tools available, I concluded that adapter removal is not trivial at all for non-expert users, at least not for some datasets. So I have a few questions. If you could help me with any of them, it would be really nice.

How do I know which adapters are present in my reads? (The FastQC report shows several hits to Illumina Multiplexing PCR primer 2.0.1, but clipping its sequence doesn't clean all reads, and the reports keep showing this contamination.) Shouldn't I know the adapter just by knowing the library prep kit used?
Why don't all reads have adapters?
If I use Cutadapt with the first 13 bp of the Illumina universal adapter (AGATCGGAAGAGC), over half of my dataset is lost in clipping (20 Gb to 9 Gb). Also, FastQC still shows adapter contamination. Can I trust this clipping?
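
A rough way to see what fraction of reads actually carries that 13 bp prefix, before trusting any clipping, is to count exact matches directly. This is only a sketch: `count_adapter_reads` is a made-up helper name, it assumes plain 4-line FASTQ records on stdin, and it does exact matching only, so reads with sequencing errors in the adapter or with only a few adapter bases at the 3' end are missed.

```shell
# Hypothetical helper: count reads containing an exact adapter substring.
# Reads uncompressed 4-line FASTQ from stdin; the adapter defaults to the
# 13 bp prefix mentioned in the question.
count_adapter_reads() {
    adapter="${1:-AGATCGGAAGAGC}"
    awk -v a="$adapter" '
        NR % 4 == 2 {                 # sequence line of each record
            total++
            if (index($0, a) > 0) hits++
        }
        END { printf "%d of %d reads contain %s\n", hits + 0, total + 0, a }'
}
```

For gzipped input, something like `gzip -cd reads.fastq.gz | count_adapter_reads` should work.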

illumina adapter next-gen-sequencing • 5.1k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.6 years ago by gbdias ▴ 150

I am using Adapter Removal. It identifies adapters on its own. Also add a quality filter; it's worth it.

Why don't all reads have adapters? Because clipping them is part of the instrument software before you get your FASTQs.

You can also run PRINSEQ before and after Adapter Removal. Looking at the sequence lengths, you should be left with only one peak. Looking at the duplication section also gives insight into any adapters that may be present.
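
If PRINSEQ isn't at hand, the same before/after length check can be approximated with a few lines of awk. This is a sketch, not what PRINSEQ does internally: `length_hist` is a made-up name, and it assumes uncompressed 4-line FASTQ on stdin.

```shell
# Hypothetical helper: print "read_length count" pairs, sorted by length.
# Run it on the FASTQ before and after trimming and compare the two tables.
length_hist() {
    awk '
        NR % 4 == 2 { n[length($0)]++ }      # tally sequence-line lengths
        END { for (l in n) print l, n[l] }' | sort -k1,1n
}
```

Untrimmed Illumina reads usually all have the same length, so the "before" table should be a single row, and the "after" table shows how many reads were shortened and by how much.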

ADD REPLY • link 8.6 years ago by stolarek.ir ▴ 700

Ram · Answer 1 · 2015-10-16

8.6 years ago

5heikki 11k

Try Trim Galore. It's really a wrapper for Cutadapt and FastQC, but IMO it does the job very nicely.

ADD COMMENT • link 8.6 years ago by 5heikki 11k

I've just used trim_galore (arguments below) to trim the adapter sequences off FASTQ files from an Illumina HiSeq 4000 run with TruSeq prep. It seemed to work well, running in the default mode to auto-detect and remove adapters, as well as to remove any bases with Phred score < 5, but my FastQC reports for some files show that Illumina Single End PCR primer or TruSeq Adapter, Index 7, remain in certain samples (0.15% and 0.53%, respectively).

Do I have to run cutadapt again and feed it these specific sequences to remove? I have many samples and searching through each report for specific adapters to remove in a second cutadapt run is not ideal.

Was I not stringent enough in trimming?

Do I need to get rid of the remaining contaminants to perform differential gene expression analysis?

trim_galore --paired -q 5 -o /output/path/ --fastqc_args "--outdir /fastqc/output/path/" sample_R1.fastq.gz sample_R2.fastq.gz
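
To avoid opening each report by hand, the overrepresented-sequence hits can be pulled out of the FastQC text output for all samples at once. A sketch only: `list_overrepresented` is a made-up name, and it assumes the `fastqc_data.txt` files have been extracted from the FastQC zip archives and that the module is delimited by `>>Overrepresented sequences` and `>>END_MODULE` lines, as in the FastQC versions I have seen.

```shell
# Hypothetical helper: print the rows of the "Overrepresented sequences"
# module from one or more fastqc_data.txt files, prefixed by the file name.
list_overrepresented() {
    awk '
        FNR == 1                        { on = 0 }        # reset for each file
        /^>>Overrepresented sequences/  { on = 1; next }  # module starts
        /^>>END_MODULE/                 { on = 0 }        # module ends
        on && !/^#/                     { print FILENAME "\t" $0 }
    ' "$@"
}
```

Something like `list_overrepresented */fastqc_data.txt` then gives one table of sequence, count, percentage and possible source across all samples.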

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.5 years ago by robvanner ▴ 20

Ram · Answer 2 · 2015-10-16

I suggest you try BBDuk. It's both more sensitive and more specific than other adapter trimmers, as it can trim by overlap detection in addition to sequence matching, to remove even 1bp of adapter at the very end. It comes with all of the standard Illumina adapter sequences in /resources/adapters.fa

Usage:

bbduk.sh in1=r1.fq in2=r2.fq out=trimmed#.fq ref=adapters.fa tbo tpe k=23 mink=11 hdist=1 ktrim=r ftm=5

If you run BBMerge (also included) like this:

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt reads=1m xloose

...then you will see the insert size distribution of your reads. Reads with insert sizes less than the read length contain adapter sequence. So, that will show you the amount of data you should expect to lose via adapter-trimming, not including adapter-dimers, which will be totally eliminated but don't show up on an insert size plot.
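
To turn that histogram into a single number, the counts below the read length can be summed. This is a sketch under assumptions: `frac_short_inserts` is a made-up name, and it assumes `ihist.txt` has `#`-prefixed header lines followed by tab-separated insert size and count columns. The histogram only covers pairs that BBMerge could merge, so the percentage is relative to merged pairs, not to all reads.

```shell
# Hypothetical helper: percentage of merged pairs whose insert size is
# shorter than the read length, i.e. pairs expected to contain adapter.
# Usage: frac_short_inserts ihist.txt READ_LENGTH
frac_short_inserts() {
    awk -v rl="$2" '
        !/^#/ && NF >= 2 {
            total += $2                       # all merged pairs
            if ($1 < rl) shorter += $2        # insert shorter than the read
        }
        END {
            if (total > 0)
                printf "%.2f%% of merged pairs have insert size < %d\n",
                       100 * shorter / total, rl
        }' "$1"
}
```

For example, `frac_short_inserts ihist.txt 101` for 101 bp reads.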