Hi all,
I am new to bioinformatics, and I have come across an unexpected pattern in the GC content across the reads.
I have 48 samples (including two positive controls and one negative control). I am using ddRAD data with the enzyme pair SbfI and MspI. Libraries were paired-end sequenced on an Illumina NovaSeq 6000.
After receiving the sequencing data, I demultiplexed it using process_radtags in Stacks (with flags -r -c -q -D). To assess read quality after demultiplexing, I used fastp (with --trim_poly_g and --cut_right, and supplying the adapter sequence AGATCGGAAGAG). I then used MultiQC to summarize the fastp outputs. MultiQC produced a graph titled GC Content (average GC content over each base of all reads). Each line in these graphs represents a different sample. The prominent red line is the negative control. I have attached images of these graphs for read 1 and read 2, before and after filtering.
My concerns are 1) the pattern of many peaks along the read (and the large range of these values), and 2) that all the samples (including the negative control) follow the same pattern.
I am wondering 1) whether this pattern is a problem with the reads or whether my data is okay to use (I do have a reference genome to align to), 2) what could be causing it, and 3) whether there are any suggestions to fix it (if it even is an issue).
The MultiQC summary also provided an average GC content for each individual sample. This value ranged from 50% to 53%, which from my understanding is pretty normal.
Thank you in advance for any help!
Since this is ddRAD data, perhaps this pattern is to be expected? Do you have reason to believe that it should not be the case?
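If it helps to see what a per-position GC plot is measuring, here is a minimal Python sketch. It is not how fastp or MultiQC compute the graph internally; the function name `per_position_gc`, the choice to skip `N` bases, and the toy reads are all assumptions made for illustration. The toy reads all start with the same six bases, standing in for a shared cut-site remnant, to show how identical bases at a position push the average GC there to 0% or 100%.

```python
def per_position_gc(reads):
    """Return the GC percentage at each read position across all reads.

    Assumption for this sketch: 'N' bases are ignored at their position.
    """
    if not reads:
        return []
    max_len = max(len(r) for r in reads)
    gc = [0] * max_len      # G/C count per position
    total = [0] * max_len   # called (non-N) bases per position
    for read in reads:
        for i, base in enumerate(read.upper()):
            if base == "N":
                continue
            total[i] += 1
            if base in "GC":
                gc[i] += 1
    # Positions with no called bases are reported as 0.0
    return [100.0 * g / t if t else 0.0 for g, t in zip(gc, total)]


# Toy reads (made up): identical first six bases, variable afterwards
toy_reads = ["TGCAGGATTA", "TGCAGGCCGT", "TGCAGGTTAC"]
for pos, pct in enumerate(per_position_gc(toy_reads), start=1):
    print(f"position {pos}: {pct:.1f}% GC")
```

In the toy output the first six positions sit at exactly 0% or 100%, while the later positions average out. If your real reads still begin with the enzyme cut-site remnant after demultiplexing, the same effect would apply to every sample at those positions, regardless of the sample's overall GC content.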