understanding how cutadapt trims
1
0
Entering edit mode
2.2 years ago
dsull ★ 5.8k

I have the following read stored in trim.fq

@read
NGTTTGATTTGGGAAAGAAGGAAAGGAGAGGAGGGGAGGGGAGGGGAAGGGGGAGGGGAAATGCGCAAAAGATCGGAAGACATGACAAGTCAAGC
+
#4:BDFFFHHHHHIJJJIJJJIIJIJHIGHIGHJJJDFHII8AHFDAABDDDD5<BDD<@BCCCB@BDDDC@ACDD@B<A<AAA@@CDA>ADDC3

The following causes trimming to occur successfully:

cutadapt -a AGATCGGAAGAGC -e 0.1 trim.fq

However, the following does NOT result in the trimming:

cutadapt -a AGATCGGAAGATT -e 0.1 trim.fq

Why is this? In both cases, only the sequence AGATCGGAAGA is matched...

cutadapt • 1.4k views
ADD COMMENT
0
Entering edit mode

try raising error rate to 0.2 and it works. Try calculating allowed mismatches in your index sequences (e of 0.1 vs 0.2) . Trimming alone with AGATCGGAAGA also works.

ADD REPLY
0
Entering edit mode

Honestly that is kinda alarming that it clips in only one of the two cases with the same parameters and what seems to me to be two mismatches at the 3' end in both cases.

Does it trim the whole "AGATCGGAAGACATGACAAGTCAAGC" fragment in the former case?

EDIT: See my answer below but "errors" for cutadapt are not just mismatches which explains the behavior

ADD REPLY
0
Entering edit mode

Not two mismatches. One mismatch. In first case (AGATCGGAAGAGC) only one mismatch, which is why it works. In later case (AGATCGGAAGATT), there are two mismatches which is why the adapter is not getting trimmed. If you raise allowed mismatches to two, second one works.

Target sequence: AGATCGGAAGACA 
Sequence 1: AGATCGGAAGAGC
Sequence 2: AGATCGGAAGATT

Mismatch calculation:

Allowed Error rate: 0.1
Calculated mismatches: 0.1 x 13 (length of Target)= 1.3
Floor of 1.3 = 1

Remove common sequence:

Target sequence: CA 
Sequence 1: GC (only one mismatch between remaining target bases and sequence1 )
Sequence 2: TT (Two mismatches between remaining target bases and sequence1)

Because of this, with e of 0.1 or 1, sequence gets trimmed with sequence 1, not with sequence 2. If one increases error rate to 0..2 (0.2 x 13 = 2.6 and floor is 2) or 2, sequence gets trimmed with sequence 2 (and sequence 1) as well. If we take another sequence two non-matching bases like sequence 2 at the end (AGATCGGAAGAGT and sequence 2: AGATCGGAAGATT), cutadapt needs e 0.2 like sequence 2.

$ cutadapt --quiet  -a AGATCGGAAGAGC -e 1 test.fq

@read
NGTTTGATTTGGGAAAGAAGGAAAGGAGAGGAGGGGAGGGGAGGGGAAGGGGGAGGGGAAATGCGCAAA
+
#4:BDFFFHHHHHIJJJIJJJIIJIJHIGHIGHJJJDFHII8AHFDAABDDDD5<BDD<@BCCCB@BDD

$ cutadapt --quiet  -a AGATCGGAAGATT -e 1 test.fq

@read
NGTTTGATTTGGGAAAGAAGGAAAGGAGAGGAGGGGAGGGGAGGGGAAGGGGGAGGGGAAATGCGCAAAAGATCGGAAGACATGACAAGTCAAGC
+
#4:BDFFFHHHHHIJJJIJJJIIJIJHIGHIGHJJJDFHII8AHFDAABDDDD5<BDD<@BCCCB@BDDDC@ACDD@B<A<AAA@@CDA>ADDC3

$ cutadapt --quiet  -a AGATCGGAAGATT -e 2 test.fq

@read
NGTTTGATTTGGGAAAGAAGGAAAGGAGAGGAGGGGAGGGGAGGGGAAGGGGGAGGGGAAATGCGCAAA
+
#4:BDFFFHHHHHIJJJIJJJIIJIJHIGHIGHJJJDFHII8AHFDAABDDDD5<BDD<@BCCCB@BDD

However, cutadapt doesn't allow mismatches in short subsequences of the primer sequence.

ADD REPLY
4
Entering edit mode
2.2 years ago

Actually I think I figured it out.

The errors for cutadapt are apparently mismatches, insertions and deletions. In this case with -e 0.1 you are allowed one error to trim.

So if you assume there is a deletion of a G from your read the first adapter would still work

The new real read assumption: N*AGATCGGAAGAGCA

Which allows for your first adapter to be trimmed because the terminal GC are now perfect matches: AGATCGGAAGAGC

So I would assume trimming any adapter with a C at the terminal position should work. (And it does - just checked).

With the adapter ending in TT no possible single insertion, deletion, or mismatch would allow it to match to the reference.

ADD COMMENT

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6