1
0
Entering edit mode
4.3 years ago
tlorin ▴ 330

Hello all,

Please apologies if this is a trivial question regarding cutadapt but it is the first time I am running this program.

I have 2 PE Illumina Hiseq4000 files (100nt) with many many reads. I take a subset of 100,000 sequences for left and right reads. The sequencing platform told me the forward adapter is AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC and the reverse is AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.

If I do a grep AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC subset.R1.fastq I get 7 results (subset of 100,000 sequences):

AGGGTTGCATTGGCCCTCTGGCTCTTGAAGTTCAAAAAGCCAGTCTGAATGATCCTCTTCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCG TCGGTATGGTGATGCATTTCGTGTTGACACTCTGGGTGGTGATGGCTTTCTCCAGTTCGTCCAGCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACG CTGGGCTCAACCTCCAGGGTGATGGTCTTACCAGTGAGTGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGT GGTGTCACTTGGCTCCACCTCCAGGGTGATGGTCTTACCAGTGAGGGTCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGT AGGGCCTGGAGCCTGGGGCCTGGGGTCCGAGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACACCTCGTATGCCGGCCTCTGCTTGAA GTTACAGTGTGTGTGTGTTACAGTGTGTGTGTGTTACTGTGTGTGTGTGTGTTACAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACA GGTGGTTGCTGTTGTCCCAGATCTGAGTGTTGTGCCACAGCACCATCTTCAGCTTGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACAC

If I do run grep AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT subset.R2.fastq I get 4 results:

CCTCGGACCCCAGGCCCCAGGCTCCAGGCCCTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAA AGGTGATGTGGACGTGTCAGTTCCAACAGTTGAGGATGACAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAA AAATGCCATCCTTTGGGGTTAAAGGTCCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAA CAAGGAGGGCATTCCCCCTGACCAGCAGAGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAC

I want to trim these reads so I run cutadapt (according to the documentation and to this post).

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o subset.R1.trim.fastq -p subset.R2.trim.fastq subset.R1.fastq subset.R2.fastq

I would expect cutadapt to trim at least 7 sequences in the left reads file and 4 in the right reads file (and even more including that the adapter sequence does not need to be full in the read).

Trimming 2 adapters with at most 10.0% errors in paired-end legacy mode ...

=== Summary ===

Pairs written (passing filters):       100,000 (100.0%)

Total basepairs processed:    20,000,000 bp
Total written (filtered):     19,987,225 bp (99.9%)

Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC; Type: regular 3'; Length: 34; Trimmed: 2425 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-34 bp: 3

A: 21.4%
C: 38.6%
G: 23.8%
T: 16.2%
none/other: 0.0%

Overview of removed sequences
length  count   expect  max.err error counts
3   1739    1562.5  0   1739
4   415 390.6   0   415
5   142 97.7    0   142
6   34  24.4    0   34
7   4   6.1 0   4
8   9   1.5 0   9
9   10  0.4 0   7 3
10  6   0.1 1   3 3
11  4   0.0 1   4
12  5   0.0 1   5
13  7   0.0 1   7
14  5   0.0 1   5
15  1   0.0 1   1
16  2   0.0 1   2
17  6   0.0 1   6
18  1   0.0 1   1
19  1   0.0 1   1
20  3   0.0 2   2 1
21  2   0.0 2   1 0 1
22  2   0.0 2   2
23  2   0.0 2   2
24  2   0.0 2   2
25  1   0.0 2   1
27  1   0.0 2   1
28  1   0.0 2   1
29  3   0.0 2   3
30  3   0.0 3   1 2
32  4   0.0 3   3 0 0 1
35  1   0.0 3   1
39  1   0.0 3   1
43  1   0.0 3   1
44  2   0.0 3   1 1
45  1   0.0 3   0 1
49  1   0.0 3   1
53  1   0.0 3   0 0 1
56  1   0.0 3   1
68  1   0.0 3   1

Sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT; Type: regular 5'; Length: 58; Trimmed: 938 times.

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3; 40-49 bp: 4; 50-58 bp: 5

Overview of removed sequences
length  count   expect  max.err error counts
3   686 1562.5  0   686
4   210 390.6   0   210
5   15  97.7    0   15
6   13  24.4    0   13
7   3   6.1 0   3
8   1   1.5 0   1
9   1   0.4 0   0 1
10  6   0.1 1   0 6
11  3   0.0 1   0 3


So I understand that at least 2425 reads should be trimmed in the left read file and 938 in the right read file. This seems plausible. However, if I count reads in the output files I get 100,000 reads for both output files, so it seems that no read has been trimmed!

Any idea of what is occurring? Is the command line correct?

1
Entering edit mode

Trimming does not mean removing. A trimmed read will still be present in the output, it will just be shorter. That said, it's weird that read 2 never got trimmed. For normal paired libraries, read 1 and read 2 will have adapter sequence at the same position, so in this case the adapter trimming is not working correctly. I suggest you try this approach instead.

0
Entering edit mode

Thanks @Brian! It's much clearer. So if I understand correctly, I should run bbduk.sh in1=subset.R1.fastq in2=subset.R2.fastq out1=subset.R1.trim.fastq out2=subset.R2.trim.fastq literal=AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT ktrim=r k=23 mink=11 hdist=1 tpe tbo.

This gives the following output.

BBDuk version 37.31
Initial:
Memory: max=12875m, free=12674m, used=201m

Added 5667 kmers; time:     0.071 seconds.
Memory: max=12875m, free=12203m, used=672m

Input is being processed as paired
Started output streams: 0.019 seconds.
Processing time:        0.857 seconds.

KTrimmed:                   212 reads (0.11%)   6266 bases (0.03%)
Trimmed by overlap:         242 reads (0.12%)   2058 bases (0.01%)
Total Removed:              8 reads (0.00%)     8324 bases (0.04%)
Result:                     199992 reads (100.00%)  19991676 bases (99.96%)

Time:               0.959 seconds.
Bases Processed:      20000k    20.86m bases/sec


awk 'NR%4 == 2 {lengths[length(\$0)]++} END {for (l in lengths) {print l, lengths[l]}}'  subset.R1.trim.fastq | sort -k1 -n -r | head
100 99773
99 20
98 17
97 12
96 13
94 11
89 11
95 7
93 6
92 9


1
Entering edit mode

It's an old post, but I think the correct options for R1 and R2 are -a and -A ... not -a and -g ... both would be applied to R1... capitalised options -A or -G are applied to R2.

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT -o subset.R1.trim.fastq -p subset.R2.trim.fastq subset.R1.fastq subset.R2.fastq

1
Entering edit mode
4.3 years ago
h.mon 33k

Check the range of read length before and after cutadapt: trimming means the adapter sequences will be removed from within the reads, leaving the remaining bit of the read shorter.

I would be concerned about the asymmetrical trimming between R1 and R2, you should try Trimmomatic or BBDuk, as they will trim both reads symmetrically even if the adapter is recognized only in one read.

edit: rephrased trimming explanation, after reading a clearer one just above.