Question: Removing spike-in sequences from oxidative bisulfite (OXBS) fastq files
3.0 years ago by
pawat10 wrote:

Hello all, I am processing sequences from a BS/oxBS (bisulfite/oxidative bisulfite) sequencing run and have observed some contamination from short (~60 bp) reads. I suspect these are the spike-in sequences, because those are also 60 bp long. I added them as a control during library prep to estimate how well the oxidation and conversion went. I would like to remove these reads before alignment. The problem is that the spike-in reads are also bisulfite converted, at various positions and to various degrees.

For example, the SQ6hmC spike-in is:

TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG

Within this sequence, some Cs are unmethylated, some are methylated (5mC), and some are hydroxymethylated (5hmC). After the BS reaction, all unmethylated Cs are converted to Ts, while 5mC and 5hmC are left as C. After the oxBS reaction, both unmethylated Cs and 5hmCs are converted to Ts, so only 5mC is left as C.
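To make the conversion rules above concrete, here is a minimal Python sketch of an in silico conversion. The function name `convert` and the modification positions are hypothetical (the real positions are in the image, which is not reproduced here), so treat this as an illustration of the rule rather than the actual spike-in design.

```python
# Sketch only: the positions below are made up for illustration.
SQ6HMC = "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG"


def convert(seq, protected=frozenset()):
    """Convert every C to T, except at the 0-based positions in `protected`."""
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


# Hypothetical 0-based positions of modified cytosines.
hmc_sites = {2}  # assumed 5hmC
mc_sites = {6}   # assumed 5mC

# BS: 5mC and 5hmC are both protected from conversion.
bs_seq = convert(SQ6HMC, protected=mc_sites | hmc_sites)

# oxBS: only 5mC is protected, 5hmC is converted like an unmethylated C.
oxbs_seq = convert(SQ6HMC, protected=mc_sites)

print(bs_seq)
print(oxbs_seq)
```

Because conversion is rarely 100% efficient, real reads can fall anywhere between the original and the fully converted sequence, which is why exact matching against a single pre-converted string is fragile.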

I've attached the pic here for all spike-in sequences. Green=5hmC; Red=5mC; Grey=C.

What would be the best way to go about removing these spike-in reads? Thank you!

BS/oxBS spike-in sequences

5mc bisulfite 5hmc oxidative oxbs • 960 views
modified 3.0 years ago by dariober11k • written 3.0 years ago by pawat10
3.0 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

If I'm not mistaken, the spike-in sequences are designed to be different enough from the mouse or human genome so that they will not align. In other words, you may just proceed to genome alignment with these sequences in the fastq files.

Alternatively, create a reference genome (i.e. a fasta file) containing your genome plus the spike-in sequences, index it, and align your reads. This way, the spike-in reads will be captured by the spike-in contigs. I'm not 100% sure, but depending on the aligner you use, you may need to pad the spike-in contigs in the reference fasta with a short run of Ns on each side (say 5 or 10). Otherwise, some aligners could have problems with reads overhanging the end of the contig (this would affect only those few reads with insertions).
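As a rough sketch of the padding idea, the snippet below formats spike-in sequences as N-padded fasta records that could be appended to a genome fasta before indexing. The function name `padded_record`, the pad length of 10, and the contig name are assumptions for illustration; only the SQ6hmC sequence comes from the question.

```python
# Sketch only: names and pad length are illustrative assumptions.
SPIKE_INS = {
    "SQ6hmC": "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG",
}


def padded_record(name, seq, pad=10, width=60):
    """Return one fasta record with `pad` Ns added to each end of `seq`."""
    padded = "N" * pad + seq.upper() + "N" * pad
    # Wrap the sequence at `width` characters per line.
    lines = [padded[i:i + width] for i in range(0, len(padded), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


spike_fasta = "".join(padded_record(n, s) for n, s in SPIKE_INS.items())
print(spike_fasta, end="")

# To build the combined reference, this text would be appended to a copy
# of the genome fasta (file name is hypothetical), e.g.:
#   with open("genome_plus_spikes.fa", "a") as out:
#       out.write(spike_fasta)
```

Note that padding shifts coordinates: a cytosine at position p in the original spike-in sits at p + pad in the padded contig, which matters if you later read conversion levels off the spike-in alignments.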

(PS the image you linked is not visible)

modified 3.0 years ago • written 3.0 years ago by dariober11k

Thank you for your answer! FastQ_Screen indeed showed that the spike-in sequences are distinct from the zebrafish genome I'm working with. The program estimated about 30% zebrafish sequences and 70% something else (not zebrafish, mammal, E. coli, or phiX).

One idea I've been considering is to pre-convert the spike-in sequences in silico and then remove any reads in the fastq files that match them. I just don't know of any program that can remove reads containing certain sequences from a fastq file...
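One possible way to do this without enumerating every partially converted variant is to collapse both the reads and the spike-ins to a three-letter alphabet (all C to T, or all G to A for the complementary read orientation) and compare those. The sketch below is a minimal plain-Python version with exact matching only; all function names are hypothetical, the 20 bp minimum is an arbitrary choice, and reads with sequencing errors would slip through.

```python
import io

SQ6HMC = "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG"


def collapse(seq, frm, to):
    """Collapse one base into another, e.g. C->T, ignoring conversion state."""
    return seq.upper().replace(frm, to)


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_targets(spikes):
    """Collapsed versions of each spike-in, both strands, both conversions."""
    targets = []
    for spike in spikes:
        for strand in (spike.upper(), revcomp(spike)):
            targets.append(("C", "T", collapse(strand, "C", "T")))
            targets.append(("G", "A", collapse(strand, "G", "A")))
    return targets


def is_spike_in(read, targets, min_len=20):
    """True if the collapsed read lies within a collapsed spike-in, or vice versa."""
    if len(read) < min_len:  # too short to call reliably, so keep it
        return False
    for frm, to, target in targets:
        collapsed = collapse(read, frm, to)
        if collapsed in target or target in collapsed:
            return True
    return False


def filter_fastq(handle_in, handle_out, targets):
    """Copy non-spike-in records (4 lines each) to handle_out; return counts."""
    kept = removed = 0
    while True:
        record = [handle_in.readline() for _ in range(4)]
        if not record[0]:
            break
        if is_spike_in(record[1].strip(), targets):
            removed += 1
        else:
            handle_out.write("".join(record))
            kept += 1
    return kept, removed


# Tiny in-memory example: one partially converted spike-in read, one other read.
targets = build_targets([SQ6HMC])
spike_read = SQ6HMC.replace("C", "T", 5)  # first five Cs converted
other_read = "A" * 25
fastq_in = io.StringIO(
    "@spike\n" + spike_read + "\n+\n" + "I" * 60 + "\n"
    "@genomic\n" + other_read + "\n+\n" + "I" * 25 + "\n"
)
fastq_out = io.StringIO()
kept, removed = filter_fastq(fastq_in, fastq_out, targets)
print(kept, removed)
```

For real data the handles would be opened files (for example via `gzip.open(path, "rt")`), and for paired-end data both mates would need to be dropped together. Aligning to a reference that includes the spike-in contigs, as suggested above, is more tolerant of sequencing errors than this exact-match filter.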

ps. I'm not sure why the image isn't showing here, but here is the link (

written 3.0 years ago by pawat10