Hi all,
I am hoping to get your thoughts on dealing with some severe sequence duplication in my data. For context, I have single-end 100bp 3RAD data from a few catfish species, none of which have reference genomes. My goal is to use stacks denovo to call SNPs for population genomic analyses.
I've run FastQC on these samples (after trimming adapters and standardizing read length to 85 bp with Trimmomatic), and they still show an abnormal GC curve, which I believe is due to overrepresented sequences. The most common overrepresented sequence BLASTs to a related catfish's ribosomal DNA. I have read that removing duplicates from single-end data is unreliable, but I am unsure how to proceed with this data: should I clean it up before Stacks, filter it out during ustacks, or do something else?
I have added the FastQC files for some of my samples to this repo: https://github.com/SarahBabaei/FastQC
Any advice would be appreciated, thank you in advance!!
With a specialized technique like 3RAD sequencing, FastQC is of limited value. It is designed for "normal" genomic sequencing, and its pass/fail thresholds are calibrated for that kind of data. You should probably go ahead with your Stacks workflow and not worry about the FastQC results.
If you are interested in de-duplicating the data, you can use clumpify.sh from the BBMap suite, which does this purely based on sequence, without needing alignments. See: "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."
Thank you for your reply! Do you think this might be due to trimming? Some of my samples have what FastQC categorizes as Illumina adapters in the overrepresented sequences, so I was thinking maybe I'm not preprocessing my sequences enough.
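Before deciding between sequence-based de-duplication and more trimming, it may help to see how much of a sample is made up of exact-duplicate reads and which sequences dominate. Below is a minimal Python sketch of that check, assuming a standard four-line-per-record FASTQ file (plain or gzipped). The function names `count_sequences` and `duplication_summary` are made up for illustration and are not part of FastQC, Stacks, or BBMap.

```python
import gzip
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_sequences(path):
    """Count how often each exact read sequence occurs.

    Assumes four lines per record, so the sequence is the 2nd line
    of every record (file lines 2, 6, 10, ...).
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for seq in islice(handle, 1, None, 4):
            counts[seq.strip()] += 1
    return counts


def duplication_summary(counts, top_n=5):
    """Summarize exact duplication and list the most common sequences."""
    total = sum(counts.values())
    unique = len(counts)
    # Fraction of reads that are extra copies of an already-seen sequence.
    duplicate_fraction = 1 - unique / total if total else 0.0
    top = [(seq, n, n / total) for seq, n in counts.most_common(top_n)]
    return {
        "total_reads": total,
        "unique_sequences": unique,
        "duplicate_fraction": duplicate_fraction,
        "top": top,
    }
```

Calling `duplication_summary(count_sequences("sample.fastq.gz"))` (the file name is a placeholder) returns the read total, the number of distinct sequences, and the top sequences with their share of reads. The top sequences can then be checked by hand against the rDNA hit or adapter sequences. Keep in mind that RAD-style reads start at restriction cut sites, so many identical reads are expected and a high duplicate fraction does not by itself indicate PCR duplicates.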