Question

classify contaminants in bisulfite treated whole genome data

3.5 years ago

Richard ▴ 590

Hi folks,

I have some whole-genome bisulfite sequencing (WGBS) data where the alignment rate to the human genome plus the included known controls is about 75%. I have removed adapter dimers and clipped adapter sequence from the ends of the reads, but the alignment rate remains just above 75%. I've tried pulling out the unaligned reads, assembling them, and then pulling out the contigs with the highest representation. However, when I BLAST these contigs I get either no hit, or only a piece of the query contig matches some reference genome.

So my question is: how do people identify contaminants in bisulfite-treated genome data?

I understand that I could create a Kraken reference of bisulfite-treated genome sequence data for a large number of reference genomes, but I'm hoping there is something a little more accessible.
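For context, one way to read "bisulfite-treated reference" is an in silico conversion of each reference FASTA, in the spirit of what three-letter aligners do. The following is only a minimal sketch under that assumption: the function names and the `_CT`/`_GA` header suffixes are hypothetical, it assumes complete conversion (methylated cytosines are ignored), and the reads would need the same conversion before classification. Building the actual classifier database from the converted files is not shown.

```python
import io


def convert_c_to_t(seq: str) -> str:
    """Replace every C with T (forward-strand view after full conversion)."""
    return seq.upper().replace("C", "T")


def convert_g_to_a(seq: str) -> str:
    """Replace every G with A (how conversion of the opposite strand appears)."""
    return seq.upper().replace("G", "A")


def convert_fasta(handle, out, converter, suffix):
    """Stream a FASTA from `handle` to `out`, converting sequence lines only.

    `suffix` is appended to each record name so that the two converted
    versions of a genome stay distinguishable (a hypothetical convention).
    """
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name, _, rest = line[1:].partition(" ")
            header = f">{name}{suffix}"
            if rest:
                header += f" {rest}"
            out.write(header + "\n")
        elif line:
            out.write(converter(line) + "\n")


if __name__ == "__main__":
    # Tiny in-memory demo; real use would open reference FASTA files instead.
    src = io.StringIO(">contig1 demo\nACGTCCGG\n")
    dst = io.StringIO()
    convert_fasta(src, dst, convert_c_to_t, "_CT")
    print(dst.getvalue(), end="")
```

Running `convert_fasta` twice per genome, once with `convert_c_to_t` and once with `convert_g_to_a`, gives the two converted versions that reads from either strand could match.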

wgbs • 685 views

updated 3.5 years ago by Friederike 8.9k • written 3.5 years ago by Richard ▴ 590

score 0 · Answer 1 · 2020-10-21

That is a very normal mapping efficiency for WGBS. After all, you're mapping to a reduced alphabet (effectively 3 bases instead of 4) and you've treated the DNA rather harshly. The fact that nothing comes up in your BLAST query supports the notion that the 25% of reads that couldn't be mapped are most likely not contamination, but artefacts of the experiment.
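To make the reduced-alphabet point concrete, here is a rough sketch that counts distinct k-mers in a sequence before and after full C-to-T conversion. The random "genome", the seed, and the choice of k are illustrative assumptions only, and a random sequence is not a realistic model of a human genome. The converted sequence can never have more distinct k-mers than the original, so more reads end up with ambiguous placements.

```python
import random


def bisulfite_convert(seq: str) -> str:
    """Full in silico C-to-T conversion (assumes no methylated cytosines)."""
    return seq.upper().replace("C", "T")


def distinct_kmers(seq: str, k: int) -> int:
    """Number of distinct substrings of length k in seq."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    genome = "".join(rng.choice("ACGT") for _ in range(100_000))
    k = 8  # at most 4**8 distinct k-mers before conversion, 3**8 after
    print("4-letter alphabet:", distinct_kmers(genome, k))
    print("3-letter alphabet:", distinct_kmers(bisulfite_convert(genome), k))
```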