Question: Trim Illumina PE data with trimmomatic
0
gravatar for mschmid
3.8 years ago by
mschmid90
Switzerland
mschmid90 wrote:

I have Illumina MiSeq PE 2X250 data

I got the following info from the sequencing (some irrelevant or private data omitted):

[Header]
IEMFileVersion,4
Investigator Name,xxxx
Experiment Name,xxxxxx
Date,x/x/xxxx
Workflow,Assembly
Application,Assembly
Assay,TruSeq HT
Description,
Chemistry,Amplicon

[Reads]
301
301

[Settings]
ReverseComplement,0
kmer,31
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

[Data]
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,GenomeFolder,Sample_Project,Description
1,AB1,,,D701,ATTACTCG,D501,TATAGCCT,,,
2,AB2,,,D701,ATTACTCG,D502,ATAGAGGC,,,
3,AB3,,,D701,ATTACTCG,D503,CCTATCCT,,,
4,AB4,,,D701,ATTACTCG,D504,GGCTCTGA,,,
5,AB5,,,D701,ATTACTCG,D505,AGGCGAAG,,,
6,AB6,,,D701,ATTACTCG,D506,TAATCTTA,,,
7,AB7,,,D701,ATTACTCG,D507,CAGGACGT,,,

I want to trim with timmomatic and then assemble with Spades (and maybe other assemblers). I do not have the data yet. So I do not know if the adapters and indexes are trimmed away. But let's assume they are still there.

Now some questions:

1. Is it necessary to add the indexes to the adapter file of trimmomatic (so that they get removed as well)? Or do the short sequences in general not interfere with assembly? Or could it even happen that trimmomatic finds "false positives", because those sequences are so short?

2. Even if the adapters and indexes would have been trimmed away would it still be necessary to provide the sequences because of read-throughs? (For example R1 reading through adaptor of R2)

3. How do I have to configure the adapter file of trimmomatic? For which sequences do I have to provide the reverse-complement? And what about the /1 and /2 option of trimmomatic? I did not realy understand...

ADD COMMENTlink modified 3.8 years ago by Brian Bushnell16k • written 3.8 years ago by mschmid90
3
gravatar for Brian Bushnell
3.8 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

1) I like to include the indexes, for maximal sensitivity, but it's not strictly necessary in general.  BBDuk is distributed with a file "adapters.fa" that includes all the standard Illumina Truseq and Nextera adapter sequences, including the indexes.  Adapter trimming prior to assembly is always a good idea, as it increases speed, reduces memory consumption, and can lead to a better assembly (especially with RNA-seq, single cell, or metagenomes, which have highly variable coverage).

2) No, that's what trimming removes.

3) I'm not sure about Trimmomatic, but BBDuk looks for both the forward sequence and reverse-complement by default.  For paired reads I recommend running it like this:

bbduk.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe

Though as long as the names are in the proper format (identical except for the number 1 and 2) it's shorter to do this:

bbduk.sh in=read#.fq out=trimmed#.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tbo tpe
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Brian Bushnell16k

Hi Brain, this is an old post, I'm not sure if you can still see this, but I'm going to try my luck! I wanted to know what would happen if I omit k=23 mink=11 parameters in PE adapter trimming process, so I tried both:


with both parameters

Added 216529 kmers; time:   0.399 seconds.
Memory: max=47667m, free=43439m, used=4228m
Input is being processed as paired
Started output streams: 2.320 seconds.
Processing time:        88.703 seconds.
Input:                      29908174 reads      4516134274 bases.
KTrimmed:                   6290478 reads (21.03%)  386234224 bases (8.55%)
Trimmed by overlap:         648240 reads (2.17%)    4720504 bases (0.10%)
Total Removed:              6396 reads (0.02%)  390954728 bases (8.66%)
Result:                     29901778 reads (99.98%)     4125179546 bases (91.34%)
Time:               91.444 seconds.
Reads Processed:      29908k    327.06k reads/sec
Bases Processed:       4516m    49.39m bases/sec

without the parameters

Added 229571 kmers; time:   0.143 seconds.
Memory: max=47941m, free=43188m, used=4753m
Input is being processed as paired
Started output streams: 0.015 seconds.
Processing time:        89.504 seconds.
Input:                      29908174 reads      4516134274 bases.
KTrimmed:                   5237954 reads (17.51%)  366522888 bases (8.12%)
Trimmed by overlap:         1650678 reads (5.52%)   23351468 bases (0.52%)
Total Removed:              5208 reads (0.02%)  389874356 bases (8.63%)
Result:                     29902966 reads (99.98%)     4126259918 bases (91.37%)
Time:               89.684 seconds.
Reads Processed:      29908k    333.48k reads/sec
Bases Processed:       4516m    50.36m bases/sec

It seems like with the parameters more bases get trimmed off. Let's say at this point I only want to remove adapter sequence, is it safe to leave those parameters out?

Thank you!

Yang

ADD REPLYlink modified 22 months ago • written 22 months ago by zhangya30
1

Hi Yang,

If you don't specify "k=23 mink=11" it will use the default of "k=27" and no mink. "k=23 mink=11" is more sensitive, so I recommend that. Since you have PE reads, most of the time, even when adapters are not detected from kmers, they will still be detected based on overlap, which is why the overall amount of data removed (8.66% of bases compared to 8.63%) only changed very slightly.

But I do recommend keeping "k=23 mink=11" because the reads that don't get trimmed when you leave those parameters off almost certainly do have adapters, that should get trimmed. In other words, the 0.03% additional bases you gain by removing those flags are virtually all adapter sequence that you don't want.

ADD REPLYlink written 22 months ago by Brian Bushnell16k

Hi Brain,

Thank you for the explanation, very helpful.

There is something else I'm confused about. The workflow I tried was:

bbmerge-auto.sh in1=read1.fastq in2=read2.fastq out=merged.fastq outu=unmerged.fastq ihist=ihist.txt ecct extend2=20 iteration=5 to merge non-overlapping PE reads (read is 151 bp, fragment size is larger than 400), bbmerge.sh in=merged.fastq outa=adapters.fa to discover adapter sequences, at the end use bbduk.sh to filter low quality reads. What I found was after bbmerge-auto.sh, I couldn't detect any adapter sequences in the merged file (instead, they were all in unmerged file).

  1. the total size of read 1 and read 2 are 11 Gb, after merging, the merged file is 5.1 Gb. Is this 5.9 Gb size reduction expected? what does that say about my data?
  2. when I moved on using bbduk, nothing got trimmed/filtered if I set Q=10, should I increase Q to 20? What is the difference between quality trimming and quality filtering?

Look forward hearing back from you!

Yang

ADD REPLYlink modified 22 months ago • written 22 months ago by zhangya30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1499 users visited in the last hour