Hi all, thanks for taking the time to read and answer my question. I'm processing some sequencing data for a colleague. It's 16S amplicons from soil DNA extractions, run on an Illumina MiSeq at 2 x 250 bp. The samples were commercially processed and I just received the .fq.gz files. My machine has 16 GB of RAM and 45 GB of swap, running Ubuntu.
I started to run my analysis pipeline. I use DADA2, with cutadapt to remove primers. (I understand some people only use cutadapt for sequences where you want to keep true biological sequence length variation, and with 16S this is not so important, but I had some issues with the F and R reads being incorrectly labelled, so cutadapt helped me visualise this.)
Problems: 1) The sequence quality profiles are very unusual. There is considerable banding in the quality heat maps, which I've never seen before. Does anyone know if this is a problem in itself, or whether it could be an indication of the issues I have downstream? I've added the error plots here too for reference, in case they help.
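In case it helps to put numbers on the banding, here is a minimal sketch (Python, standard library only) of how the distinct quality values in one file could be tallied. This is not part of my DADA2 pipeline. The function name `tally_quality_scores`, the `max_reads` limit and the Phred+33 offset are my own assumptions, and it assumes plain four-line FASTQ records. If only a handful of distinct scores come back, that would at least be consistent with the bands in the plots.

```python
import gzip
import sys
from collections import Counter


def tally_quality_scores(fastq_path, max_reads=10000, phred_offset=33):
    """Count how often each Phred score occurs in the first max_reads reads.

    Assumes four-line FASTQ records and Phred+33 encoding (both assumptions).
    """
    counts = Counter()
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # fourth line of each record holds the qualities
                counts.update(ord(ch) - phred_offset for ch in line.rstrip("\r\n"))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally_quality.py path/to/reads.fq.gz
    for score, n in sorted(tally_quality_scores(sys.argv[1]).items()):
        print(f"Q{score}\t{n}")
```

Running it on one forward and one reverse file should be enough to see whether the scores fall into a few discrete values or cover the full range.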
2) My R session aborts during the dereplication step. This might be a simple memory issue with the computer, as it runs relatively far into the reverse reads before it fails each time. Approximately 75% of my reads are unique sequences, e.g. 'Encountered 73757 unique sequences from 103827 total sequences read'. I understand that memory requirements in this pipeline scale with the number of unique sequences, so this could be where some of my issues lie. I could try the big data pipeline to get around this and process samples individually if necessary. However, I wondered whether this number of unique sequences could have anything to do with the unusual sequence quality plots above?
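For the example message above, 73757 / 103827 is about 0.71. To check that figure outside R, one file at a time, something like the following sketch could be used (Python, standard library only). It is not how DADA2 dereplicates. The function name `count_unique_sequences` is hypothetical, it assumes plain four-line FASTQ records, and because it reads whatever file it is given, the numbers will only match DADA2's if it is run on the same filtered and trimmed files.

```python
import gzip
import sys


def count_unique_sequences(fastq_path):
    """Return (unique, total) sequence counts for one FASTQ file.

    Streams the file line by line and keeps only the set of distinct
    sequences in memory. Assumes four-line FASTQ records (an assumption).
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    seen = set()
    total = 0
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record holds the sequence
                seen.add(line.strip())
                total += 1
    return len(seen), total


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_unique.py sample1_R1.fq.gz sample1_R2.fq.gz ...
    for path in sys.argv[1:]:
        unique, total = count_unique_sequences(path)
        fraction = unique / total if total else 0.0
        print(f"{path}\t{unique}\t{total}\t{fraction:.2f}")
```

Comparing the fraction between forward and reverse files, and between samples, might show whether the high uniqueness is specific to the reverse reads where the session fails.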
Thanks in advance for any tips. Chris