I need help addressing overrepresented sequences in my data. I tried Trimmomatic first and then Trim Galore. I am not sure whether these overrepresented sequences simply reflect highly abundant bacteria in the sample. I ran one of the sequences through BLAST, but I could not identify it from the result.
Background: the data come from a sewage system and were generated by 16S rRNA sequencing on an Illumina MiniSeq. The aim is to understand the bacterial community distribution and, if possible, do further analysis.
I ran FastQC on one sample before (1st image) and after (2nd image) trimming; the results are below.
Trim Galore parameters and summary for file 1 of 2 (the data are paired-end):
AUTO-DETECTING ADAPTER TYPE
===========================
Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> s-1_S1_R1_001.fastq.gz <<)
Found perfect matches for the following adapter sequences:
Adapter type   Count   Sequence        Sequences analysed   Percentage
Nextera        961     CTGTCTCTTATA    383581               0.25
smallRNA       1       TGGAATTCTCGG    383581               0.00
Illumina       0       AGATCGGAAGAGC   383581               0.00
Using Nextera adapter for trimming (count: 961). Second best hit was smallRNA (count: 1)
SUMMARISING RUN PARAMETERS
==========================
Input filename: s-1_S1_R1_001.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.10
Cutadapt version: 4.6
Python version: 3.10.12
Number of cores used for trimming: 5
Quality Phred score cutoff: 30
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp
=== Summary ===
Total reads processed: 383,581
Reads with adapters: 81,987 (21.4%)
Reads written (passing filters): 383,581 (100.0%)
Total basepairs processed: 57,920,731 bp
Quality-trimmed: 5,863,623 bp (10.1%)
Total written (filtered): 51,831,944 bp (89.5%)
=== Adapter 1 ===
Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 81987 times
Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-12 bp: 1
Bases preceding removed adapters:
A: 32.0%
C: 32.4%
G: 12.5%
T: 23.0%
none/other: 0.1%
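
The log reports two quite different adapter figures: 961 perfect matches (0.25%) during auto-detection, but 81,987 reads (21.4%) counted as "with adapters" during trimming. With a minimum overlap of 1 bp, a read that merely ends in the first base of the adapter (C) is also counted, which may account for much of the gap. Below is a rough Python sketch for separating the two cases in a FASTQ file. It is not how Trim Galore or Cutadapt work internally: it uses exact matching only (no error tolerance), and the function names are made up for illustration.

```python
import gzip
from typing import Dict, Iterable, Iterator


def read_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of each 4-line FASTQ record (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def end_overlap(read: str, adapter: str) -> int:
    """Length of the longest adapter prefix that the read ends with (exact match)."""
    for k in range(min(len(read), len(adapter)), 0, -1):
        if read.endswith(adapter[:k]):
            return k
    return 0


def adapter_stats(reads: Iterable[str], adapter: str, min_overlap: int = 1) -> Dict[str, int]:
    """Count reads containing the full adapter vs. reads that only end in an adapter prefix."""
    total = full = partial = 0
    for read in reads:
        total += 1
        if adapter in read:
            full += 1
        elif end_overlap(read, adapter) >= min_overlap:
            partial += 1
    return {"total": total, "full_match": full, "end_overlap_only": partial}


if __name__ == "__main__":
    # Tiny made-up reads, only to show the output shape.
    demo = ["ACGTCTGTCTCTTATAGG", "ACGTACGTC", "ACGTACGTCTG", "ACGTACGA"]
    print(adapter_stats(demo, "CTGTCTCTTATA", min_overlap=1))
    print(adapter_stats(demo, "CTGTCTCTTATA", min_overlap=3))
```

On real data this would be called as adapter_stats(read_fastq_sequences("s-1_S1_R1_001.fastq.gz"), "CTGTCTCTTATA"). If most hits fall into end_overlap_only at 1 to 2 bp, that would suggest chance matches rather than real adapter contamination.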
Thank you for the reassurance. I checked one overrepresented sequence against my dataset and BLASTed some of the sequences on NCBI, and I was able to confirm my suspicion.
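
For anyone repeating this kind of check, here is a minimal Python sketch of how the most frequent sequences could be tallied and written out as FASTA for BLAST. It is an illustration only, not what FastQC does internally. The function names are hypothetical, the 0.1% default threshold is chosen to resemble FastQC's warning level, and the 50 bp prefix is an assumption made so that reads differing only at the 3' end are grouped together.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Hit = Tuple[str, int, float]  # (sequence prefix, count, fraction of all reads)


def overrepresented(reads: Iterable[str], min_fraction: float = 0.001,
                    prefix_len: int = 50) -> List[Hit]:
    """Return sequence prefixes that make up at least min_fraction of all reads."""
    counts = Counter(read[:prefix_len] for read in reads)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(seq, n, n / total)
            for seq, n in counts.most_common()  # most frequent first
            if n / total >= min_fraction]


def to_fasta(hits: List[Hit]) -> str:
    """Format hits as FASTA records that can be pasted into NCBI BLAST."""
    return "\n".join(f">overrep_{i} count={n}\n{seq}"
                     for i, (seq, n, _fraction) in enumerate(hits, start=1))


if __name__ == "__main__":
    # Made-up reads, only to show the output shape.
    demo = ["AAAA"] * 3 + ["CCCC"]
    print(to_fasta(overrepresented(demo, min_fraction=0.5)))
```

In a 16S dataset the primer-proximal region is highly conserved and a few taxa often dominate, so a handful of very abundant sequences is not by itself a sign of contamination. BLASTing the top hits, as described above, is a reasonable way to confirm that.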