Question

Low merging rate of 16S PE reads, possibly overrepresented seqs?

0

Entering edit mode

2.9 years ago

A • 0

Hi all, I'm trying to do 16S diversity analysis using Illumina HiSeq3000 reads, but most of the reads (~99%) are marked as 'ambiguous' during the merging step. This happened whether I did trimming/quality filtering with Trimmomatic or BBDuk, and also when merging the raw files.

I think it might be related to overrepresented sequences. According to the FastQC, there's sequences with 18%, 12%, and 7% overrepresentation, and a bunch of others with <1%. They don't appear to be adapter content, at least based on adapter trimming in Trimmomatic and BBDuk files, and FastQC doesn't recognize them as adapter sequences either. Lastly, the same sample was sequenced a few times and thus I have 4 sets of fwd/reverse. Overrepresented sequences are the same in all file pairs, and the % are also similar. I also blasted some sequences and didn't get anything back.

So my questions are: 1) Is it valid to remove the problem sequences, at least the ones over 1%? and 2) Can this be achieved using the 'literal' flag in BBDuk, or is there a better way to do this?

Additional info:

Link to download FastQC of two sets of files, after trimming with Trimmomatic

Examples of overrepresented sequences (single nucleotide difference in bold)

'AGAGTTTGATCCTGGCTCAGAGTGAACGCTGGCGGCGTGCTTAACACATG' 'AGAGTTTGATCCTGGCTCAGAGTGAACGCTGGCGGCGTGCCTAACACATG'

Commands used for trimming and filtering, ${i} is the sample name

Trimmomatic

java -jar trimmomatic-0.32.jar PE -threads 12 \
  ${my_data}raw/${i}R1_001.fastq.gz \
  ${my_data}raw/${i}R2_001.fastq.gz \
  ${my_data}interim/${i}/trimmomatic/${i}R1_paired.fq.gz \
  ${my_data}interim/${i}/trimmomatic/${i}R1_unpaired.fq.gz \
  ${my_data}interim/${i}/trimmomatic/${i}R2_paired.fq.gz \
  ${my_data}interim/${i}/trimmomatic/${i}R2_unpaired.fq.gz \
  ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 \
  LEADING:30 TRAILING:30 SLIDINGWINDOW:5:20 MINLEN:50

BBDuk

bbduk2.sh threads=12  ktrim=r k=23 mink=11 hdist=1 tpe tbo \
ref=adapters.fa tpe tbo \
in=${my_data}raw/${i}R1_001.fastq.gz \
in2=${my_data}raw/${i}R2_001.fastq.gz \
out=${my_data}interim/${i}/trimmomatic/${i}R1_paired.fq.gz \
out2=${my_data}interim/${i}/trimmomatic/${i}R2_paired.fq.gz

Command for merging

bbmerge.sh \
in1=${my_data}interim/${i}/trimmomatic/${i}R1_paired.fq.gz \
in2=${my_data}interim/${i}/trimmomatic/${i}R2_paired.fq.gz \
out=${my_data}interim/${i}/merged_02/${i}merged.fastq \
outu1=${my_data}interim/${i}/merged_02/${i}unmerged1.fasta \
outu2=${my_data}interim/${i}/merged_02/${i}unmerged2.fasta

FastQC Illumina 16S overrepresented paired-end • 984 views

ADD COMMENT • link 2.9 years ago by A • 0

1

Entering edit mode

You are working with 2 × 150 bp reads. With this setup, the R1 and R2 will not overlap unless you are working with a very small insert.

ps. Having duplicated sequence in Illumina 16S sequencing data is perfectly fine