Question: Why 50bp Illumina run produces 51bp long sequencs?
gravatar for Tori
3.5 years ago by
Tori80 wrote:

Why 50bp Illumina run produces 51bp reads? Do I have to remove one extra base before starting the analysis?

next-gen illumina • 2.8k views
ADD COMMENTlink modified 3.5 years ago by Brian Bushnell16k • written 3.5 years ago by Tori80
gravatar for Daniel Swan
3.5 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

It depends on the base mask used. See: for a good, but brief discussion on why you might want to drop the n+1th base

ADD COMMENTlink written 3.5 years ago by Daniel Swan13k
gravatar for Brian Bushnell
3.5 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I highly recommend removing the extra base.  In all of my tests, the extra base (101 on a 100bp run, 151 on a 150bp run, etc) had an extremely high error rate, typically 5x that of the previous base - and the quality score of the last base is not very accurate, so quality-trimming will not reliably remove it - and fastQC will not reliably indicate whether or not it needs to be trimmed, either.  For that reason, I put a feature in BBDuk, "forcetrimmodulo", just to remove that last base.  For example: in=reads.fq out=trimmed.fq forcetrimmodulo=5

That trims all of the reads so that the length is equal to zero modulo 5, meaning that 51bp reads will be trimmed to 50bp; 76 will be trimmed to 75; 101 will be trimmed to 100; etc.  If the reads are already (for example) 100bp long, then nothing will happen.  In the processing order internal to bbduk, this "forcetrimmodulo" (or ftm for short) trimming happens before adapter trimming (if that is enabled), which is important because trimming the last extra base makes it easier to identify adapter sequence due to the lower error rate.  As a result I always enable ftm when doing adapter-trimming.

I've posted various graphs showing the huge drop in quality for the extra base, but you can generate them yourself if you want, like this: in=reads.fq ref=reference.fa mhist=mhist.txt qist=qhist.txt reads=10m

Then plot the mhist and qhist to see what the true accuracy of the bases is, by location.

Edit: Actually, I should have just followed Daniel's link :)

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Brian Bushnell16k

reviving an old post to quickly add a "For future reference/users" message:

If your data has been processed with bcf2fastq (with adapter trimming activated, which for some reason seems to be quite common recently) , you will get reads of all different lengths so not only eg 151.

To compensate this (and for me personally the more general approach) I prefer to use the ftr=<read-length> in stead of ftm=5 . The latter will for instance also trim back reads of 78 to 75 and then you unnecessarily trim those back as well. Using ftr will trim back to the specified length but leave the shorter ones untouched.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by lieven.sterck4.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 924 users visited in the last hour