Question

Removing spike-in sequences from oxidative bisulfite (OXBS) fastq files

1

6.1 years ago

pawat ▴ 10

Hello all, I am processing reads from BS/oxBS (bisulfite / oxidative bisulfite) sequencing runs and see some contamination from short (~60 bp) reads. I suspect these are the spike-in sequences, which are also 60 bp long. I added them as controls during library prep to estimate how well the oxidation and conversion worked. I would like to remove these reads before alignment. The problem is that the spike-in reads are themselves bisulfite converted, at various positions and to varying degrees.

For example, the SQ6hmC spike-in is:

TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG

Within this sequence, some Cs are unmethylated, some are methylated (5mC), and some are hydroxymethylated (5hmC). After the BS reaction, all unmethylated Cs are converted to Ts. After the oxBS reaction, both unmethylated Cs and 5hmCs are converted to Ts.
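To make the conversion rules above concrete, here is a minimal Python sketch. The per-base state encoding (`C` = unmethylated, `M` = 5mC, `H` = 5hmC), the function name `convert`, and the toy sequence are all made up for illustration; the toy sequence is not one of the real spike-ins and the marks do not reflect the real modification positions.

```python
def convert(seq, states, treatment):
    """Return the sequence as it would read after BS or oxBS treatment.

    seq       : original sequence (A/C/G/T), one strand only
    states    : string of the same length; at each C position one of
                'C' (unmethylated), 'M' (5mC) or 'H' (5hmC).
                Characters at non-C positions are ignored.
                (Hypothetical encoding, for illustration only.)
    treatment : 'BS' or 'oxBS'
    """
    if len(seq) != len(states):
        raise ValueError("seq and states must have the same length")
    if treatment == "BS":
        converted = {"C"}        # only unmethylated C reads as T
    elif treatment == "oxBS":
        converted = {"C", "H"}   # unmethylated C and 5hmC read as T
    else:
        raise ValueError("treatment must be 'BS' or 'oxBS'")

    out = []
    for base, state in zip(seq, states):
        if base == "C" and state in converted:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


# Toy example: three Cs, one in each state (hypothetical).
toy_seq = "ACGCTCGA"
toy_states = ".C.M.H.."
print(convert(toy_seq, toy_states, "BS"))    # ATGCTCGA
print(convert(toy_seq, toy_states, "oxBS"))  # ATGCTTGA
```

In real data the conversion is not 100% efficient, so each C in a spike-in read may or may not appear as T, which is why an exact match against the original sequence will miss most of these reads.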

I've attached the pic here for all spike-in sequences. Green=5hmC; Red=5mC; Grey=C.

What would be the best way to go about removing these spike-in reads? Thank you!

BS/oxBS spike-in sequences

oxbs oxidative bisulfite 5mC 5hmC • 1.7k views

score 2 · Accepted Answer · 2018-03-01

2

6.1 years ago

dariober 14k

If I'm not mistaken, the spike-in sequences are designed to be different enough from the mouse or human genome so that they will not align. In other words, you may just proceed to genome alignment with these sequences in the fastq files.

Alternatively, create a reference genome (i.e. a fasta file) containing your genome plus the spike-in sequences, index it, and align your reads to it. This way, the spike-in reads will be captured by the spike-in contigs. I'm not 100% sure, but depending on the aligner you use, you may need to pad each spike-in contig in the reference fasta with a short run of Ns on both sides (say 5 or 10). Otherwise, some aligners could have problems with reads overhanging the end of the contig (this would affect only the few reads with insertions).
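A rough sketch of how the padded spike-in contigs could be appended to a genome fasta, in plain Python. The function names, the file paths passed in, and the default pad of 10 are assumptions for illustration; only the SQ6hmC sequence comes from the question, and the other spike-ins would need to be added to the dictionary. It assumes an uncompressed genome fasta.

```python
def pad_contig(seq, pad=10):
    """Flank a sequence with `pad` Ns on each side."""
    return "N" * pad + seq + "N" * pad


def fasta_record(name, seq, width=60):
    """Format one fasta record, wrapping the sequence at `width` columns."""
    lines = [">" + name]
    for i in range(0, len(seq), width):
        lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"


# Only SQ6hmC is given in the question; add the other spike-ins here.
spike_ins = {
    "SQ6hmC": "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG",
}


def append_spike_ins(genome_fasta, out_fasta, spike_ins, pad=10):
    """Copy the genome fasta and append N-padded spike-in contigs."""
    with open(genome_fasta) as src, open(out_fasta, "w") as dst:
        last = ""
        for line in src:
            dst.write(line)
            last = line
        # Make sure the genome ends with a newline before adding records.
        if last and not last.endswith("\n"):
            dst.write("\n")
        for name, seq in spike_ins.items():
            dst.write(fasta_record(name, pad_contig(seq, pad)))
```

The combined fasta then has to be indexed with whatever bisulfite aligner you use. After alignment, reads mapping to the spike-in contigs can be separated out by contig name, and they remain available for estimating conversion efficiency.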

(PS the image you linked is not visible)

0

Thank you for your answer! Running FastQ_Screen indeed showed that the spike-in sequences are distinct from the zebrafish genome I'm working with. The program estimated about 30% zebrafish sequences and 70% something else (not zebrafish, mammal, E. coli, or phiX).

One idea I've been considering is to pre-convert the spike-in sequences in silico and then remove any read in the fastq files that matches them. I just don't know of any program that can remove reads containing certain sequences from a fastq file...
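A minimal sketch of that idea in plain Python, in case it is useful. Instead of enumerating every partially converted variant, it collapses C to T in both the read and the spike-in before comparing, so a read matches regardless of which Cs were converted. All function names and the `min_len` cutoff are made up for illustration. It assumes single-end or read-1 data from a directional library, exact matches only (no sequencing errors or indels), and a standard four-line fastq; read 2 of a pair would need the equivalent G-to-A collapse.

```python
import gzip


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def c2t(seq):
    """Collapse C to T, i.e. treat every C as if it had been converted."""
    return seq.upper().replace("C", "T")


def build_targets(spike_ins):
    """Collapsed versions of each spike-in and of its reverse complement."""
    targets = []
    for seq in spike_ins:
        seq = seq.upper()
        targets.append(c2t(seq))           # reads from the given strand
        targets.append(c2t(revcomp(seq)))  # reads from the opposite strand
    return targets


def is_spike_in(read_seq, targets, min_len=30):
    """True if the collapsed read contains, or is contained in, a target."""
    collapsed = c2t(read_seq)
    if len(collapsed) < min_len:
        return False  # too short to call; such reads are kept
    return any(collapsed in t or t in collapsed for t in targets)


def read_fastq(handle):
    """Yield (header, seq, plus, qual) lines from a four-line fastq."""
    while True:
        header = handle.readline()
        if not header:
            return
        yield header, handle.readline(), handle.readline(), handle.readline()


def filter_fastq(in_path, out_path, spike_ins, min_len=30):
    """Write reads that do not look like spike-ins; return (kept, removed)."""
    opener = gzip.open if in_path.endswith(".gz") else open
    targets = build_targets(spike_ins)
    kept = removed = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for header, seq, plus, qual in read_fastq(src):
            if is_spike_in(seq.strip(), targets, min_len):
                removed += 1
            else:
                dst.write(header + seq + plus + qual)
                kept += 1
    return kept, removed
```

Because matching is exact, reads with sequencing errors will slip through, so it is worth checking the removed and kept counts against the FastQ_Screen estimate. Aligning to a reference that includes the spike-in contigs, as suggested above, tolerates mismatches and is probably the more robust route.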

ps. I'm not sure why the image isn't showing here, but here is the link (https://ibb.co/bwW8sc)

Reply • 6.1 years ago by pawat ▴ 10