Question

Overrepresented Sequences persistence

21 months ago

Luna • 0

I need help addressing overrepresented sequences in my data. I first tried Trimmomatic and then Trim Galore, but the overrepresented sequences persist after trimming. I am not sure whether they reflect the population densities of the bacteria in the sample. I ran one of the sequences through BLAST but could not identify it from the result.

Background: the data come from a sewage system and were generated by 16S rRNA sequencing on an Illumina MiniSeq. The aim is to understand the bacterial community distribution and, if possible, do further analysis.

I ran FastQC on one sample before (1st image) and after (2nd image) trimming; the results are below.

Trim Galore parameters and summary (report for file 1 of 2, since the data are paired-end):

AUTO-DETECTING ADAPTER TYPE
===========================

Attempting to auto-detect adapter type from the first 1 million sequences of the first file (>> s-1_S1_R1_001.fastq.gz <<)

Found perfect matches for the following adapter sequences:
Adapter type    Count   Sequence        Sequences analysed      Percentage
Nextera         961     CTGTCTCTTATA    383581                  0.25
smallRNA        1       TGGAATTCTCGG    383581                  0.00
Illumina        0       AGATCGGAAGAGC   383581                  0.00
Using Nextera adapter for trimming (count: 961). Second best hit was smallRNA (count: 1)

SUMMARISING RUN PARAMETERS
==========================

Input filename: s-1_S1_R1_001.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.10
Cutadapt version: 4.6
Python version: 3.10.12
Number of cores used for trimming: 5
Quality Phred score cutoff: 30
Quality encoding type selected: ASCII+33
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 30 bp

=== Summary ===

Total reads processed:                 383,581
Reads with adapters:                    81,987 (21.4%)
Reads written (passing filters):       383,581 (100.0%)

Total basepairs processed:    57,920,731 bp
Quality-trimmed:               5,863,623 bp (10.1%)
Total written (filtered):     51,831,944 bp (89.5%)

=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 81987 times

Minimum overlap: 1
No. of allowed errors:
1-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters:
  A: 32.0%
  C: 32.4%
  G: 12.5%
  T: 23.0%
  none/other: 0.1%
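
As a sanity check on the report above, the percentages can be recomputed from the raw counts. The following is only a minimal Python sketch: the numbers are copied from the log, the helper names `pct` and `allowed_errors` are made up for illustration, and the rule that the allowed mismatches equal the floor of overlap length times error rate is an assumption that happens to reproduce the "1-9 bp: 0; 10-12 bp: 1" table.

```python
import math

# Counts copied verbatim from the Trim Galore / Cutadapt report above.
total_reads = 383_581
reads_with_adapters = 81_987
total_bp = 57_920_731
quality_trimmed_bp = 5_863_623
written_bp = 51_831_944
nextera_hits = 961


def pct(part: int, whole: int, digits: int = 1) -> float:
    """Return part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


def allowed_errors(overlap: int, error_rate: float = 0.1) -> int:
    """Assumed rule: floor(overlap * error_rate) mismatches are tolerated."""
    return math.floor(overlap * error_rate)


adapter_pct = pct(reads_with_adapters, total_reads)      # reads with adapters
trimmed_pct = pct(quality_trimmed_bp, total_bp)          # quality-trimmed bases
written_pct = pct(written_bp, total_bp)                  # bases written
nextera_pct = pct(nextera_hits, total_reads, digits=2)   # auto-detection hit rate

# Bases that are neither quality-trimmed nor written (presumably adapter bases).
other_removed_bp = total_bp - quality_trimmed_bp - written_bp

# Allowed mismatches for every possible overlap with the 12 bp adapter.
errors_by_overlap = {n: allowed_errors(n) for n in range(1, 13)}

print(adapter_pct, trimmed_pct, written_pct, nextera_pct)
print(other_removed_bp)
print(errors_by_overlap)
```

Note that the 21.4% of reads "with adapters" is counted at a minimum overlap of 1 bp with no mismatches allowed below 10 bp, so many of those hits may be a single matching base at the 3' end rather than real Nextera adapter; the auto-detection step found the full 12 bp sequence in only 0.25% of reads.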

trimmomatic Fastqc trim_galore • 1.2k views

score 0 · Answer 1 · 2023-12-24

You can confirm this yourself, but the sequence you flag as overrepresented aligns to 16S rRNA when you BLAST it at NCBI, which is logical given that 16S is what you are sequencing. (NOTE: because you posted screenshots, I typed the sequence in manually for the BLAST search, so there may be an error or two, but the general result should stand.)

Variovorax sp. H2R19_pA 16S ribosomal RNA gene, partial sequence
Sequence ID: KX023727.1    Length: 910    Number of Matches: 1
Range 1: 327 to 375

Alignment statistics for match #1
Score           Expect  Identities  Gaps        Strand
80.5 bits(43)   1e-11   47/49(96%)  0/49(0%)    Plus/Plus

Query  1    CCTACGGGCGGCTGCAGTGGGGATTTTGGACAATGGGCGCAAGCCTGAT  49
            |||||||| ||| ||||||||||||||||||||||||||||||||||||
Sbjct  327  CCTACGGGAGGCAGCAGTGGGGATTTTGGACAATGGGCGCAAGCCTGAT  375
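
To double-check the identity figure without rerunning BLAST, the two aligned strings can be compared position by position. This is just a small Python sketch: the sequences are copied from the alignment above (so they inherit any typos from my manual entry), and the function name `compare_aligned` is made up for illustration. It only handles gap-free alignments of equal length, which is the case here (Gaps 0/49).

```python
# Aligned sequences copied from the BLAST output above (no gaps).
query = "CCTACGGGCGGCTGCAGTGGGGATTTTGGACAATGGGCGCAAGCCTGAT"
sbjct = "CCTACGGGAGGCAGCAGTGGGGATTTTGGACAATGGGCGCAAGCCTGAT"


def compare_aligned(a: str, b: str) -> tuple[int, list[int]]:
    """Return (number of identical positions, 1-based mismatch positions)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    mismatches = [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]
    return len(a) - len(mismatches), mismatches


identities, mismatch_positions = compare_aligned(query, sbjct)
identity_pct = round(100 * identities / len(query))

print(f"{identities}/{len(query)} ({identity_pct}%)")
print("mismatch positions in query:", mismatch_positions)
```

Both mismatches sit near the 5' end of the query (positions 9 and 13); the remaining 36 bases match the Variovorax 16S sequence exactly.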

Nothing here seems to be of concern. Proceed with the rest of your analysis.