How To Best Deal With Adapter Contamination (Illumina)?
12.4 years ago
Gaffa ▴ 500

I've got non-trivial amounts of adapter contamination in a paired-end >100 bp read Illumina run (i.e. the machine reading technical adapters/primers rather than biological sequence). How would I best go about identifying such contaminated reads?

I know the sequences of the adapters used, but because of sequencing errors a straightforward regular-expression pattern match won't work. The adapter sequences are about 75 bp and seem to always begin at the very 5' end of the affected reads (though I can't be 100% sure that this always holds). The remaining 3' parts of those reads seem to be nonsense low-complexity sequence with lots of homopolymers.

adaptor next-gen sequencing • 19k views
12.4 years ago
brentp 24k

Brad Chapman has a nice post on doing this here. It checks for exact match first then does a global sequence alignment allowing a specified number of mismatches.

I believe that vectorstrip in the EMBOSS toolkit can also do this.

If you prefer BioPerl, here is one more possible solution.
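
To make the "exact match first, then allow some mismatches" idea concrete, here is a minimal Python sketch. It is not the code from the linked post: it replaces the global alignment with a simple position-by-position comparison, so it only tolerates substitutions (not indels), and it assumes the adapter starts at the very 5' end of the read, as described in the question. The function names, `max_mismatch_frac` and `min_overlap` are made up for this illustration.

```python
def prefix_mismatches(read, adapter):
    """Return (mismatches, compared_length) for the adapter vs. the read's 5' end."""
    n = min(len(read), len(adapter))
    mismatches = sum(1 for r, a in zip(read[:n], adapter[:n]) if r != a)
    return mismatches, n


def has_5prime_adapter(read, adapter, max_mismatch_frac=0.1, min_overlap=20):
    """Guess whether a read starts with the adapter.

    Hypothetical helper: substitutions only, adapter assumed to begin at
    position 0 of the read. Thresholds are illustrative, not tuned.
    """
    # Fast path: the adapter is present without any sequencing errors.
    if read.startswith(adapter):
        return True
    mismatches, n = prefix_mismatches(read, adapter)
    # Too little overlap to say anything meaningful.
    if n < min_overlap:
        return False
    # Tolerant path: accept a limited fraction of mismatching positions.
    return mismatches <= max_mismatch_frac * n
```

Calling `has_5prime_adapter(read_sequence, adapter_sequence)` on each read gives a yes/no flag that you could use to filter a FASTQ file. If your adapters can start at other positions or the reads contain indels, a real alignment (or one of the dedicated tools mentioned in this thread) is the safer choice.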

12.4 years ago
Bio_X2Y ★ 4.2k


Thanks - I have now used this program with great success.

12.2 years ago
Ketil 4.1k

Another option might be the FastX toolkit, available from http://hannonlab.cshl.edu/fastx_toolkit. Specifically fastx_clipper should do what you want (although I haven't yet tested this).

11.8 years ago
Weronika ▴ 300

I've been very happy with cutadapt - it has a lot of useful options like filtering resulting reads by length, and a straightforward way of controlling the number of mismatches allowed. It's still developed (or at least the author checks the issue-tracker regularly). I hope it'll work for paired-end reads someday!

I also tested fastx_clipper in the FASTX-Toolkit. It works well; my only quibble is that there is no direct way to require an exact match or to control the number of mismatches allowed. The other tools in that toolkit are very useful as well.

If you're writing your own pipeline in python, you could consider the HTSeq package - it has a huge amount of useful functions such as fastq parsing and quality de-coding, and trim_right_end/trim_left_end functions for adapter stripping. I haven't used that particular one, but I've been happy with HTSeq in general.

If you're using R, the Biostrings package has a trimLRPatterns function. I haven't tried it.

12.4 years ago

For our software we also ended up using an alignment approach. You can read more about it in the manual section: http://www.clcbio.com/index.php?id=1330&manual=Adapter_trimming.html (clicking the individual subsections will bring you to the details).

Cheers

Roald

***Disclaimer - I work at CLC bio***

12.4 years ago
Bach ▴ 550

To search for adaptor contamination, I routinely use SSAHA2 and parse the (pretty simple) result files. Some of my users have proposed SMALT (the successor to SSAHA2) because they think it is more sensitive, but I have not confirmed that yet.

Both tools are available from the Sanger Centre.

11.6 years ago
Rna-Seq • 0

Tools such as cutadapt are useful for single-end reads, but there are not many tools for trimming adapters from paired-end reads. If trimming removes some reads from one file, the two paired-end files fall out of sync, because a read's mate no longer sits at the same position in the other file.


That's true, but it's fairly easy to re-sync paired FASTQ files by going through them and removing those reads that don't have a matching pair in the other file.
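
A rough Python sketch of that re-sync step, in case it helps. It assumes plain four-line FASTQ records, read names that match between the two files once a trailing `/1` or `/2` is stripped, seekable input handles, and enough memory to hold all read names from one file. The function names (`read_fastq`, `read_id`, `resync`) are just illustrative.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def read_id(header):
    """Read name without '@', description, or a trailing /1 or /2 (assumed naming)."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def resync(in1, in2, out1, out2):
    """Write only reads present in both inputs; return the number of pairs kept."""
    ids2 = {read_id(rec[0]) for rec in read_fastq(in2)}
    kept = set()
    for rec in read_fastq(in1):
        rid = read_id(rec[0])
        if rid in ids2:
            out1.write("\n".join(rec) + "\n")
            kept.add(rid)
    in2.seek(0)  # second pass over file 2; requires a seekable handle
    pairs = 0
    for rec in read_fastq(in2):
        if read_id(rec[0]) in kept:
            out2.write("\n".join(rec) + "\n")
            pairs += 1
    return pairs


if __name__ == "__main__":
    # Tiny in-memory demo: r2 has lost its mate in file 1.
    f1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r3/1\nGGGG\n+\nIIII\n")
    f2 = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r2/2\nCCCC\n+\nIIII\n@r3/2\nAAAA\n+\nIIII\n")
    o1, o2 = io.StringIO(), io.StringIO()
    print(resync(f1, f2, o1, o2), "pairs kept")
```

Because each file is filtered in its original order, the surviving mates end up at the same positions in both outputs, provided the two inputs were in the same order to begin with.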