Question

Unusual sequence quality plots - DADA2 pipeline

4.0 years ago

cs1878 • 0

Hi all, thanks for taking the time to read and answer my question. I'm processing some sequencing data for a colleague. It's 16S amplicons from soil DNA extractions, run on an Illumina MiSeq at 2 x 250 bp. These samples were commercially processed and I just received the .fq.gz files. My machine has 16 GB RAM and 45 GB swap memory in Ubuntu.

I started to run my analysis pipeline. I use DADA2 with cutadapt to remove primers. (I understand some only use cutadapt for sequences where you want to keep true biological sequence length variation, and with 16S this is not so important, but I had some issues with the F and R reads being incorrectly labelled, so cutadapt helped me visualise this.)

Problems: 1) The sequence quality profiles are very unusual. The quality heatmaps show considerable banding, which I've never seen before. Does anyone know if this is an issue, or whether it could point to the problems I have downstream? I've added the error plots here too just for reference, in case it helps.

quality plots

DADA2 error rates

2) My R session aborts during the dereplication process. This might be a simple memory issue with the computer, as it runs relatively far into the reverse reads before it fails each time. Approximately 75% of my reads are unique sequences, e.g. 'Encountered 73757 unique sequences from 103827 total sequences read'. I understand memory requirements in this pipeline scale with the number of unique sequences, so this could be where some of my issues lie. I could try the big data pipeline to get around this and process samples individually if necessary. However, I wondered whether this number of unique sequences could have anything to do with the unusual sequence quality plots above?

Thanks in advance for any tips. Chris

sequencing dada2 software error R • 1.9k views

ADD COMMENT • link updated 2.6 years ago by epb5360 • 0 • written 4.0 years ago by cs1878 • 0

score 1 · Answer 1 · 2020-04-17

4.0 years ago

JC 13k

1) I think you are referring to the drop in quality values after 200 bp. That seems to be an issue from when the machine changed reagents, but it should not matter much for the downstream steps.

2) 16 GB of RAM is very little for analyzing this in R; better to preprocess on the command line or get a bigger machine.
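
As a rough illustration of checking things outside R, the proportion of unique reads in a file can be counted with very little memory. This is a minimal sketch, not DADA2's own dereplication: it only collapses exact-duplicate reads, and it assumes standard four-line FASTQ records (no wrapped sequence lines). The function names `dereplicate` and `unique_fraction` are made up for this example.

```python
import gzip
from collections import Counter


def dereplicate(fastq_path):
    """Collapse identical reads; return a Counter mapping sequence -> abundance.

    Assumes four-line FASTQ records (header, sequence, '+', quality).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    abundance = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The sequence is the second line of every four-line record.
            if line_number % 4 == 1:
                abundance[line.strip()] += 1
    return abundance


def unique_fraction(abundance):
    """Number of distinct sequences divided by the total number of reads."""
    total = sum(abundance.values())
    return len(abundance) / total if total else 0.0
```

For the figures quoted in the question, 73757 unique out of 103827 reads works out to about 71%. Running this per file before and after filtering would show whether trimming brings that fraction down.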

Hi JC, thank you for your reply. I am happy to learn what caused the drop in values after 200 bp; that was one of my concerns. However, my main concern is that the quality plot overall looks very structured, with the grayscale heatmap appearing in distinct bands. There seems to be a very clear structure to the quality of the sequence data, and I have not seen quality plots like this before.
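
One way to put a number on the banding is to tally which quality scores actually occur in a file. The sketch below is only a diagnostic idea, not something confirmed in this thread: a heatmap can look banded when the reads contain just a handful of distinct quality values (as with binned quality scores), but whether that applies to this dataset is an assumption. It assumes four-line FASTQ records and Phred+33 encoding; `quality_value_counts` is a hypothetical helper name.

```python
import gzip
from collections import Counter


def quality_value_counts(fastq_path, phred_offset=33, max_reads=None):
    """Count how often each Phred quality score occurs in a FASTQ file.

    Assumes four-line records and Phred+33 encoding by default.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # The quality string is the fourth line of every record.
            if line_number % 4 != 3:
                continue
            # Optionally stop after the first max_reads records.
            if max_reads is not None and line_number // 4 >= max_reads:
                break
            counts.update(ord(ch) - phred_offset for ch in line.rstrip("\r\n"))
    return counts
```

If `sorted(quality_value_counts(path))` returns only a few values for one of the .fq.gz files, the bands in the plot reflect how the scores were written rather than a per-cycle quality pattern; a wide spread of values would point elsewhere.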

ADD REPLY • link 4.0 years ago by cs1878 • 0

score 0 · Answer 2 · 2021-08-30

2.6 years ago

epb5360 • 0

I am actually having the exact same issue, also with commercial 16S data. My quality plots look exactly the same, and dada2 fails because it assigns 70,000+ ASVs to this weird dataset. Did you get any answers?
