Question

Why mapping rate is low after adapter trimming paired-end data?

0

Entering edit mode

7.2 years ago

gundalav ▴ 380

I have the pipeline with the following trimming and mapping parameter:

 trim_galore --stringency 5 --dont_gzip --trim1 --length 30 -q 0 -a CTGTCTCTTATACACATCT  -o $TRIMGALORE_DIR $READ1 $READ2

 bowtie $Bowtie_Index_File -1 $TRIMGALORE_R1 -2 $TRIMGALORE_R2 --threads 5 -m 1 -v 2 -S -I 0 -X 2000

Raw paired-end mapping (without adapter trimming) gives this result

**********************************************
Stats for BAM file(s):
**********************************************

Total reads:       169555726
Mapped reads:      97870984    (57.722%)
Forward strand:    120620234    (71.139%)
Reverse strand:    48935492    (28.861%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads:  169555726    (100%)
'Proper-pairs':    97870984    (57.722%)
Both pairs mapped: 97870984    (57.722%)
Read 1:            84777863
Read 2:            84777863
Singletons:        0    (0%)
Average insert size (absolute value): 125.862
Median insert size (absolute value): 99

But after running it with Trim_Galore I get this:

**********************************************
Stats for BAM file(s):
**********************************************

Total reads:       168972000
Mapped reads:      314228    (0.185965%)
Forward strand:    168814886    (99.907%)
Reverse strand:    157114    (0.0929823%)
Failed QC:         0    (0%)
Duplicates:        0    (0%)
Paired-end reads:  168972000    (100%)
'Proper-pairs':    314228    (0.185965%)
Both pairs mapped: 314228    (0.185965%)
Read 1:            84486000
Read 2:            84486000
Singletons:        0    (0%)
Average insert size (absolute value): 571.141
Median insert size (absolute value): 297

So the mapping rates drop radically from 57% to 0.18%.

I find it strange because the trim galore only use trim away very little sequence. Why was it and what's the right parameter should I use for both Bowtie and Cutadapt/trim_galore?

 SUMMARISING RUN PARAMETERS
==========================
Input filename: /home/ubuntu/R1_001.fastq
Trimming mode: single-end
Trim Galore version: 0.4.1
Cutadapt version: 1.12
Quality Phred score cutoff: 0
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATACACATCT' ()
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 5 bp
Minimum required sequence length before a sequence gets removed: 30 bp
All sequences will be trimmed by 1 bp on their 3' end to avoid problems with invalid paired-end alignments with Bowtie 1
Writing final adapter and quality trimmed output to BB_89_S1_R1_001_trimmed.fq
  >>> Now performing quality (cutoff 0) and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATACACATCT' from file /home/ubuntu/R1_001.fastq <<< 
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
60000000 sequences processed
70000000 sequences processed
80000000 sequences processed
This is cutadapt 1.12 with Python 2.7.13
Command line parameters: -f fastq -e 0.1 -q 0 -O 5 -a CTGTCTCTTATACACATCT /home/ubuntu/R1_001.fastq
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 804.75 s (9 us/read; 6.32 M reads/minute).
=== Summary ===
Total reads processed:              84,777,863
Reads with adapters:                   424,374 (0.5%)
Reads written (passing filters):    84,777,863 (100.0%)
Total basepairs processed: 3,052,003,068 bp
Quality-trimmed:                       0 bp (0.0%)
Total written (filtered):  3,048,935,811 bp (99.9%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATACACATCT; Type: regular 3'; Length: 19; Trimmed: 424374 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Bases preceding removed adapters:
  A: 13.7%
  C: 36.7%
  G: 25.3%
  T: 24.2%
  none/other: 0.0%
Overview of removed sequences
length  count   expect  max.err error counts
5   132511  82790.9 0   132511
6   72231   20697.7 0   72231
7   50748   5174.4  0   50748
8   43106   1293.6  0   43106
9   53661   323.4   0   50630 3031
10  50660   80.9    1   44674 5986
11  8616    20.2    1   6320 2296
12  2894    5.1 1   1601 1293
13  2928    1.3 1   2706 222
14  995 0.3 1   906 89
15  2681    0.1 1   2570 111
16  620 0.0 1   598 22
17  1622    0.0 1   1578 44
18  263 0.0 1   255 8
19  460 0.0 1   446 14
20  86  0.0 1   84 2
21  107 0.0 1   103 4
22  49  0.0 1   42 7
23  28  0.0 1   23 5
24  11  0.0 1   10 1
25  8   0.0 1   7 1
26  7   0.0 1   6 1
27  23  0.0 1   22 1
28  8   0.0 1   8
29  1   0.0 1   1
33  2   0.0 1   1 1
35  1   0.0 1   0 1
36  47  0.0 1   45 2
RUN STATISTICS FOR INPUT FILE: /home/ubuntu/R1_001.fastq
=============================================
84777863 sequences processed in total
Sequences removed because they became shorter than the length cutoff of 30 bp:  291863 (0.3%)
Writing report to 'trim_galore_dir/R2_001.fastq_trimming_report.txt'
SUMMARISING RUN PARAMETERS
==========================
Input filename: /home/ubuntu/R2_001.fastq
Trimming mode: single-end
Trim Galore version: 0.4.1
Cutadapt version: 1.12
Quality Phred score cutoff: 0
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATACACATCT' ()
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 5 bp
Minimum required sequence length before a sequence gets removed: 30 bp
All sequences will be trimmed by 1 bp on their 3' end to avoid problems with invalid paired-end alignments with Bowtie 1
Writing final adapter and quality trimmed output to BB_89_S1_R2_001_trimmed.fq
  >>> Now performing quality (cutoff 0) and adapter trimming in a single pass for the adapter sequence: 'CTGTCTCTTATACACATCT' from file /home/ubuntu/R2_001.fastq <<< 
10000000 sequences processed
20000000 sequences processed
30000000 sequences processed
40000000 sequences processed
50000000 sequences processed
60000000 sequences processed
70000000 sequences processed
80000000 sequences processed
This is cutadapt 1.12 with Python 2.7.13
Command line parameters: -f fastq -e 0.1 -q 0 -O 5 -a CTGTCTCTTATACACATCT /home/ubuntu/R2_001.fastq
Trimming 1 adapter with at most 10.0% errors in single-end mode ...
Finished in 814.43 s (10 us/read; 6.25 M reads/minute).
=== Summary ===
Total reads processed:              84,777,863
Reads with adapters:                   423,690 (0.5%)
Reads written (passing filters):    84,777,863 (100.0%)
Total basepairs processed: 3,052,003,068 bp
Quality-trimmed:                       0 bp (0.0%)
Total written (filtered):  3,048,945,359 bp (99.9%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATACACATCT; Type: regular 3'; Length: 19; Trimmed: 423690 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1
Bases preceding removed adapters:
  A: 13.7%
  C: 36.1%
  G: 25.4%
  T: 24.8%
  none/other: 0.0%
Overview of removed sequences
length  count   expect  max.err error counts
5   132783  82790.9 0   132783
6   72585   20697.7 0   72585
7   50354   5174.4  0   50354
8   43191   1293.6  0   43191
9   53096   323.4   0   49979 3117
10  50491   80.9    1   44467 6024
11  8598    20.2    1   6158 2440
12  2915    5.1 1   1566 1349
13  2880    1.3 1   2625 255
14  959 0.3 1   865 94
15  2597    0.1 1   2493 104
16  601 0.0 1   570 31
17  1573    0.0 1   1520 53
18  256 0.0 1   240 16
19  447 0.0 1   430 17
20  83  0.0 1   77 6
21  101 0.0 1   87 14
22  46  0.0 1   40 6
23  28  0.0 1   26 2
24  12  0.0 1   10 2
25  8   0.0 1   7 1
26  7   0.0 1   6 1
27  23  0.0 1   20 3
28  8   0.0 1   7 1
29  1   0.0 1   1
33  2   0.0 1   1 1
36  45  0.0 1   44 1

sequencing alignment adapter cutadapt trim_galore • 4.5k views

ADD COMMENT • link updated 7.2 years ago by dariober 14k • written 7.2 years ago by gundalav ▴ 380

0

Entering edit mode

Have you checked a random selection of reads by blasting them at NCBI to make sure they align to the right genome?

ADD REPLY • link 7.2 years ago by GenoMax 141k

0

Entering edit mode

Initially 100% of your reads that mapped, mapped as proper pairs. That's very odd. How are you doing the mapping? It appears that you are excluding unpaired reads from being considered mapped, and that the order got messed up during trimming so that the reads are no longer paired.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

@BrianBushnell do you mean mapping before trimming? bowtie $Bowtie_Index_File -1 $READ1 -2 $READ2 --threads 5 -m 1 -v 2 -S -I 0 -X 2000 that's READ before trimming.

ADD REPLY • link 7.2 years ago by gundalav ▴ 380

score 2 · Answer 1 · 2017-02-11

2

Entering edit mode

7.2 years ago

dariober 14k

Shouldn't trim_galore be run with the --paired option for paired end reads? This may be the reason why reads went out of sync, as Brain pointed out.

ADD COMMENT • link 7.2 years ago by dariober 14k

0

Entering edit mode

I tried with your suggestion, doesn't change:

trim_galore --stringency 5 --dont_gzip --paired --trim1 --length 30 -q 0 -a CTGTCTCTTATACACATCT -o $TRIMGALORE_DIR $READ1 $READ2

ADD REPLY • link 7.2 years ago by gundalav ▴ 380

score 1 · Answer 2 · 2017-02-11

Looks like the problem is indeed that your reads became disordered, so pairs violating your insert size range of 0 to 2000 (which would be virtually all of them) are considered unmapped.

I don't know why the reads became disordered, but you can reorder them with repair.sh from the BBMap package. Overall, though, unless you are restricted to TrimGalore and bowtie, I recommend you use this pipeline instead (command adjusted for your 36bp reads):

bbduk.sh in1=r1.fq in2=r2.fq out1=trimmed1.fq out2=trimmed2.fq literal=CTGTCTCTTATACACATCT ktrim=r k=19 mink=6 hdist=1 tbo tpe mininsert=20 qtrim=r trimq=12
bbmap.sh ref=ref.fa in1=trimmed1.fq in2=trimmed2.fq out=mapped.sam