Hi folks,
I have some whole-genome bisulfite sequencing (WGBS) data where the alignment rate is about 75% against the human genome plus the known controls that were included. I have trimmed out adapter dimers and clipped adapter sequence from the ends of the reads, but the alignment rate remains just above 75%. I've tried pulling out the unaligned reads, assembling them, and then pulling out the contigs with the highest representation. However, when I BLAST these contigs I get either no hit, or only a piece of the query contig matches some reference genome.
So my question is: how do people identify contaminants in bisulfite-treated genome data?
I understand that I could create a Kraken reference from in silico bisulfite-converted sequence for a large number of reference genomes, but I'm hoping there is something a little more accessible.
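
For reference, the conversion step behind that idea is small. Here is a minimal sketch in Python, assuming every cytosine is unmethylated (fully converted) and the input is plain FASTA. The function names are made up for illustration, and building the actual Kraken database from the output is not shown.

```python
# In silico bisulfite conversion of reference sequences (illustrative sketch).
# Assumption: all cytosines are treated as unmethylated, so every C becomes T.
# The G->A version represents conversion on the opposite strand.

C_TO_T = str.maketrans("Cc", "Tt")
G_TO_A = str.maketrans("Gg", "Aa")


def bisulfite_convert(seq):
    """Return the (C->T, G->A) converted versions of a sequence."""
    return seq.translate(C_TO_T), seq.translate(G_TO_A)


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def convert_records(lines):
    """Yield FASTA-formatted strings for both converted versions of each record."""
    for name, seq in read_fasta(lines):
        ct, ga = bisulfite_convert(seq)
        yield f">{name}_CT\n{ct}"
        yield f">{name}_GA\n{ga}"
```

With an open file handle, `for rec in convert_records(handle): print(rec)` would write out the converted reference. The reads would need the same kind of conversion before classification so that both sides use the same reduced alphabet, which is presumably where most of the extra effort comes in.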