Question

Sequence Duplication for Single-End 3RAD Data

9 weeks ago

Sarah • 0

Hi all,

I am hoping to get your thoughts on dealing with some severe sequence duplication in my data. For context, I have single-end 100 bp 3RAD data from a few catfish species, none of which have reference genomes. My goal is to use the Stacks de novo pipeline to call SNPs for population genomic analyses.

I've run FastQC on these samples (after trimming adapters and standardizing read length to 85 bp with Trimmomatic), and they still show an abnormal GC curve, which I believe is due to overrepresented sequences. The most common overrepresented sequence BLASTs to a related catfish's ribosomal DNA. I have read that removing duplicates from single-end data is unreliable, but I am unsure how to proceed with this data: should I clean it up before running Stacks, or filter it out during ustacks?

I have added the FastQC reports for some of my samples to this repo: https://github.com/SarahBabaei/FastQC

Any advice would be appreciated, thank you in advance!!

duplication reduced-representation • 486 views

ADD COMMENT • link 9 weeks ago by Sarah • 0

With a specialized technique like 3RAD sequencing, FastQC is of limited value. It is designed for "normal" genomic sequencing, and its pass/warn/fail thresholds are calibrated against that reference. You should probably go ahead with your Stacks workflow and not worry about the FastQC results.

"I have read that removing duplicates from single-end data is unreliable"

If you are interested in de-duplicating the data, you can use clumpify.sh from the BBMap suite, which can do this purely based on sequence, without any alignments. See "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."
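
To make "purely based on sequence" concrete, here is a minimal Python sketch of exact-match de-duplication on a FASTQ stream. It is not how clumpify.sh works internally, only an illustration of the idea; the function names `read_fastq` and `dedupe_exact` and the file name in the usage comment are made up for this example, and it assumes plain four-line FASTQ records.

```python
def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines.

    Assumes every record spans exactly four lines (no wrapped sequences).
    """
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records
    for header, seq, plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, plus, qual


def dedupe_exact(records):
    """Keep only the first record seen for each distinct read sequence."""
    seen = set()
    for record in records:
        seq = record[1]
        if seq not in seen:
            seen.add(seq)
            yield record


# Usage sketch (hypothetical file name):
#   import gzip
#   with gzip.open("sample.fq.gz", "rt") as fh:
#       for header, seq, plus, qual in dedupe_exact(read_fastq(fh)):
#           print(header, seq, plus, qual, sep="\n")
```

One caveat worth keeping in mind: single-end RAD reads from the same locus all start at the cut site, so identical sequences are expected even without PCR duplicates. Sequence-only de-duplication like this cannot tell the two apart, which is probably what the "unreliable" advice refers to.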

ADD REPLY • link 9 weeks ago by GenoMax 152k

Thank you for your reply! Do you think this might be due to trimming? Some of my samples have what FastQC categorizes as Illumina adapters in the overrepresented sequences, so I was thinking maybe I'm not preprocessing my sequences enough.
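
One way to check whether leftover adapter is a real problem, rather than relying on the FastQC flag alone, is to count it directly in the trimmed reads. The sketch below is a rough illustration under some assumptions: `ADAPTER_PREFIX` is the 13-base start shared by the standard Illumina adapters, which may not match the adapters in a given 3RAD design, and the function names are invented for this example.

```python
from collections import Counter
from itertools import islice

# Assumption: common Illumina adapter prefix; swap in your own adapter sequence.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fastq_sequences(lines):
    """Yield only the sequence line (2nd of every 4) from four-line FASTQ records."""
    for line in islice(lines, 1, None, 4):
        yield line.rstrip("\n")


def adapter_fraction(seqs, adapter=ADAPTER_PREFIX):
    """Return the fraction of reads that contain the adapter string anywhere."""
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


def top_sequences(seqs, n=5):
    """Return the n most frequent read sequences with their counts."""
    return Counter(seqs).most_common(n)
```

If the adapter fraction is small after trimming, the overrepresented-sequence warning is more likely driven by something else, such as the ribosomal DNA hit mentioned in the question.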

ADD REPLY • link 9 weeks ago by Sarah • 0