16S amplicon sequencing unequal read lengths
0
0
Entering edit mode
3.5 years ago
Earendil ▴ 50

I have downloaded fastq files from two completely unrelated projects that are using 16s amplicon sequencing.

Both have a weird read length distribution. Forward reads are mostly 301 bp long, as seen from awk 'NR%4==2{print length}' fw_1.fastq | sort | uniq -c | less

Reverse reads are mostly 300bp long, as seen from awk 'NR%4==2{print length}' rv_1.fastq | sort | uniq -c | less

Why would that be so? I want to analyse the datasets with DADA2 and FIGARO (FIGARO detects optimal trim parameters for DADA2) and FIGARO requires that all reads are of the same length. If I filter reads based on their length and decide to keep only reads and their pair that are, for instance, 300 bp, I will loose the vast majority of reads.

I am at a complete loss as to why would the forward and reverse reads be of mostly unequal length.

sequencing DADA2 FIGARO 16s amplicon sequencing • 1.5k views
ADD COMMENT
2
Entering edit mode

Hi,

I do not recall why the Illumina add one base at the end of the read but you could use the Force-trim modulo (ftm) of bbduk to remove the very last base. Here is an explanation of how this works:

The right end so that the read’s length is equal to zero modulo 5 (ftm=5). The reason for this is that with Illumina sequencing, normal runs are usually a multiple of 5 in length (50bp, 75bp, 100bp, etc), but sometimes they are generated with an extra base (51bp, 76bp, 151bp, etc). This last base is very inaccurate and has badly calibrated quality as well, so it’s best to trim it before doing anything else. But you don’t want to simply always trim the last base, because sometimes the last base will already be clipped by Illumina’s software. “ftm=5” will, for example, convert a 151bp read to 150bp, but leave a 150bp read alone.

source

ADD REPLY
0
Entering edit mode

Thanks for that, it would be interesting to know why this happens, but if it's just a low quality base then I think I'll simply trim it as you say and move forward from there.

ADD REPLY
0
Entering edit mode

Force-trim modulo (ftm) will work if all reads are 300bp or 300bp, but if there are shorter reads (e.g. of 299bp), then ftm alone won't work, as a 299bp read will be trimmed to 295bp (ftm=5) or 290bp (ftm=10), and so on. Using the minlength (ml) parameter would help then: ftm=10 ml=300.

Another option is to use the force-trim right parameter (zero-based), again in conjunction with ml: ftr=299 ml=300.

Maybe it is worth opening an issue at the FIGARO repository. DADA2 doesn't have this read length limitation, it would make more sense if FIGARO didn't have too.

ADD REPLY
0
Entering edit mode

Thanks, I might try to do something similar with cutadapt. Actually they are already working for variable read length support in FIGARO, I think there was especially request for that feature to be able to run reads from ITS sequencing. I need to trim though for now, since it is unknown when that feature branch is going to be ready.

ADD REPLY

Login before adding your answer.

Traffic: 2773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6