I have paired end libraries (prepared by sureselect target enrichments hs protocol) with insert sizes of approximately 150 - 200 bp. A 600 cycle v3 kit was used for sequencing on a MiSeq with the intention of setting the cycle number to 101 x 101 however it was mistakenly set to 301 x 301. When I attempt to correct the read length by adapter trimming to remove the read through there is still a sharp peak of reads close to 300 bp in size. FastQC adapter module tells me there is no adapter left. Is it poor practice to force trim reads down to a certain length?
Hi h.mon, Thanks for your reply. There is a reference genome available. I'll try what you suggested and follow up.
Hello,
I carried out trimming with bbduk.sh using the tbo and tpe flags, as suggested + quality trimmed reads on right to q20 (qtrim=r trimq=20) and chose a minimum length of 50. A peak remained close to 300 bp (now 293 bp after trimming).
I then mapped the reads to my reference sequence using BBMap and plotted the mhist, which is below. I did not include Match1 and 2 in the plot but they maintain 0.99 across the positions until position 293 where they fall to 0.95325 and 0.96665, respectively.
When I remove position 293, I get the following:
As I'm quite new to this type of analysis do you have any recommendations on how to proceed?
You should trim using
ktrim=r
which would actually remove the adapter contamination (qtrim is not going to do that). Please post results of what happens with that option. You can leave theqtrim
off. There is no need to worry about that upfront. Also post the entirebbduk.sh
command line you use.Hi genomax,
Sorry I was not clear. I did trim with ktrim=r also before the above plots.
This is the command I used:
bbduk.sh in1=in_R1.fq in2=in_R2.fq out1=out_R1.fq out2=out_R2.fq ref=adapters.fa ktrim=r k=23 tbo tpe qtrim=r trimq=20 minlen=50
adapters are Truseq:
adapter read 1= AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
adapter read 2 = AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Can you provide stats of trimming? How many reads were trimmed and how many were completely removed?
If you are sure that the inserts are definitely under 200 bp then almost every read should have adapter read through and will get trimmed. Based on the
mhist
plot it looks like there may be inserts that are much longer than what you are expecting.Can you also try
bbmerge.sh
to see how many reads are able to merge? That should give us an idea of what the insert size actually is.The above command without quality trimming:
With quality trimming:
Your insert sizes appear to be longer than 200 bp for sure and you have low adapter dimers (so most fragments are real inserts). I am not sure how you estimated the insert sizes but the reality seems to be different. If you absolutely expect the inserts to be 150-200bp, then you may want to investigate what is going on. Otherwise at this point (after the trimming) you should just go ahead and proceed with your analysis (without any hard trimming).
BBMerge results:
This is a lot clearer now. I was getting confused by sequence length distribution of individual r1 and r2 reads in the fastqc module.
Thanks for your help genomax and h.mon.
Great. You should be able to move forward with the analysis.