Question

Increasing per base G content in QC sequencing files

6 months ago

robertsr • 0

Hello,

I conducted some metagenomic sequencing as follows:

Metagenomic sequencing from human stool samples
PCR-free library prep using NEB kit (450bp insert)
Illumina NovaSeq X Plus sequencing (150bp paired-end), 96 samples multiplexed on 1 lane

I have just received the QC results, and the Q30 scores look good (87-90% for all samples). However, the per-base content along the reads looks strange for some samples: the G content begins to increase towards the end of both reads, while the C content (and sometimes the A/T content) begins to decrease. See the attached images for a mixture of different samples. This occurs only for some samples; others remain relatively stable in GC/AT content.

Should I be worried about this? Any advice would be helpful.

Thanks!

[Image: per-base sequence content plots for a mixture of different samples]

FastQC GC-content QC NovaSeq • 614 views

updated 6 months ago by GenoMax 145k • written 6 months ago by robertsr • 0

Should you be worried? Likely not. Keep it at the back of your mind and proceed with the rest of the analysis. If something turns out to be amiss, backtrack to figure it out.

6 months ago by GenoMax 145k

score 0 · Answer 1 · 2024-02-21

Keep in mind that with the Illumina NovaSeq X Plus chemistry, "absence of signal" is base-called as G. The increasing G signal towards the end of the read is therefore presumably indicative of a biological/technical issue that resulted in clusters being lost to detection. After n cycles (~100), these clusters likely dropped out entirely, resulting in reads that exhibit G-homopolymers at the 3' end.
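To get a rough number for how many reads are affected, something like the sketch below could help. It is not part of any toolkit: `count_polyg_tails` is a hypothetical helper name, the threshold of 20 consecutive G bases is an arbitrary assumption, and it expects a gzipped four-line-per-record FASTQ. It prints total reads, reads ending in a G run of at least that length, and the percentage, tab-separated.

```shell
# Count reads whose 3' end is a run of at least MIN_G consecutive G bases.
# Usage: count_polyg_tails reads.fastq.gz [MIN_G]
# Assumes a gzipped FASTQ with exactly four lines per record.
count_polyg_tails() (
    fastq="$1"
    min_g="${2:-20}"   # arbitrary default threshold, adjust as needed
    gzip -cd "$fastq" | awk -v min_g="$min_g" '
        NR % 4 == 2 {                       # sequence lines only
            total++
            n = length($0)
            tail_len = 0
            # walk backwards from the end of the read while the base is G
            while (tail_len < n && substr($0, n - tail_len, 1) == "G") { tail_len++ }
            if (tail_len >= min_g) { hits++ }
        }
        END {
            # columns: total reads, reads with a G tail, percent affected
            printf "%d\t%d\t%.2f\n", total, hits, (total ? 100 * hits / total : 0)
        }'
)
```

Running it per sample, e.g. `count_polyg_tails read1.fastq.gz 20`, and comparing the percentages between the samples that look fine in FastQC and those that do not should show whether the G increase comes from a distinct subset of reads.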

This could be due to (wild speculation) adapter dimers in your library, particularly if low-input samples are predominantly affected. To better understand what the issue might be, I think you should look specifically at the affected reads and at the sequences preceding the dropout. Not tested, but BBDuk should be able to extract those:

bbduk.sh in="read1.fastq.gz" in2="read2.fastq.gz" \
         outm="read1_polyG.fastq.gz" outm2="read2_polyG.fastq.gz" \
         stats="sample.stats" \
         k=23 \
         literal="GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" \
         hammingdistance=1 \
         removeifeitherbad=t \
         pratio=G,C \
         plen=30

Afterwards, you can use clumpify.sh to order the extracted reads by similarity for easier inspection.
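As a complement to eyeballing the reads, a small tally of what sits directly upstream of the G run might make a recurring sequence (an adapter, for instance) stand out. This is only a sketch with plain awk/sort: `upstream_of_polyg` is a hypothetical helper name, and the defaults (G tail of at least 20, 20 bases of context, top 10) are arbitrary assumptions. Note that a genuine G at the end of the insert is indistinguishable from the tail and gets stripped with it.

```shell
# Print the most frequent sequences found directly upstream of a 3' poly-G tail.
# Usage: upstream_of_polyg reads.fastq.gz [MIN_G] [CONTEXT] [TOP_N]
# Assumes a gzipped FASTQ with exactly four lines per record.
upstream_of_polyg() (
    fastq="$1"
    min_g="${2:-20}"      # minimum length of the trailing G run
    context="${3:-20}"    # number of bases to report before the G run
    top_n="${4:-10}"      # how many of the most common sequences to show
    gzip -cd "$fastq" | awk -v min_g="$min_g" -v context="$context" '
        NR % 4 == 2 {                       # sequence lines only
            n = length($0)
            tail_len = 0
            # measure the run of G bases at the end of the read
            while (tail_len < n && substr($0, n - tail_len, 1) == "G") { tail_len++ }
            if (tail_len >= min_g) {
                body_len = n - tail_len     # read length without the G tail
                start = body_len - context + 1
                if (start < 1) { start = 1 }
                print substr($0, start, body_len - start + 1)
            }
        }' | sort | uniq -c | sort -k1,1nr | head -n "$top_n"
)
```

For example, `upstream_of_polyg read1_polyG.fastq.gz 20 20 10` lists counts and sequences, most frequent first. If one sequence dominates, comparing it against the adapter sequences of the library prep kit would be a sensible next step.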

You can also use BBDuk to filter/trim/mask the affected reads before proceeding to downstream analysis. Just use out= and out2= instead of outm= and outm2= to save the cleaned reads instead of the affected ones.