Question

How can I remove reads matching a specific DNA sequence from my fastq.gz files?

2.1 years ago

janny.lau ▴ 10

Hello all,

I have been doing some sequencing with Nanopore. During library prep we add a DNA Control Strand (DCS) to each sample as a QC, to confirm that the prep worked. Now that I have my fastq.gz files, I assume the DCS reads are still in them. What tools can I use to remove the DCS from my fastq.gz files? (The DCS is provided in FASTA format on the Nanopore website.)

Thank you so much for the help.

DCS Nanopore fastq.gz • 1.4k views

updated 2.1 years ago by GenoMax 153k • written 2.1 years ago by janny.lau ▴ 10

score 0 · Answer 1 · 2023-08-02

Two ways:

a) Run the alignment as usual, but add the sequence as an extra chromosome to the reference genome FASTA. You can then easily remove the reads aligning to that extra chromosome from the BAM file. That's probably the best way (imo), as it doesn't really require any custom code beyond adding the sequence to the genome and indexing it.

b) Use something like https://bioinf.shenwei.me/seqkit/usage/#grep to find the reads with a match for this sequence. Extract the read names for the matches, and then do some Unix-fu to remove them. This can probably also be done with some seqkit magic; it's quite a powerful toolkit.
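
Both routes end with the same step: you have a set of read names that belong to the control, and you want a FASTQ without them. As a rough illustration only, here is that step in plain Python instead of Unix tools. The function names, the contig name "DCS", and the file names are all made up for the example, and it assumes standard four-line FASTQ records (no wrapped sequence lines). For route (a), the names are taken from SAM text lines whose reference column is the control contig; for route (b), you would pass in whatever names your sequence search reported.

```python
import gzip


def control_read_names(sam_lines, control_name):
    """Collect names of reads with an alignment to the control contig.

    `sam_lines` is any iterable of SAM text lines; `control_name` is the
    (hypothetical) name given to the control sequence in the reference.
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        # Column 1 is the read name, column 3 the reference name (RNAME).
        if len(fields) > 2 and fields[2] == control_name:
            names.add(fields[0])
    return names


def remove_reads_by_name(in_path, out_path, names_to_remove):
    """Copy a gzipped FASTQ, skipping reads whose ID is in `names_to_remove`.

    Returns a (kept, removed) tuple of read counts.
    """
    names = set(names_to_remove)
    kept = removed = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0].strip():
                break  # end of file (or trailing blank line)
            # The read ID is the header up to the first whitespace, minus "@".
            read_id = record[0][1:].split()[0]
            if read_id in names:
                removed += 1
            else:
                dst.writelines(record)
                kept += 1
    return kept, removed
```

Note that a read is dropped here if any of its alignments hits the control contig, including supplementary ones, so check whether that is what you want before using something like this on real data.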

score 0 · Answer 2 · 2023-08-02

2.1 years ago

GenoMax 153k

You may be able to use bbduk.sh from the BBMap suite in filter mode. How long is your control sequence? You could provide it as a reference file as follows (the literal= option expects the sequence string itself, not a file name):

bbduk.sh -Xmx4g in=input.fq.gz out=clean.fq.gz ref=DCS.fa
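
For anyone wondering what k-mer based filtering means in practice, here is a toy Python sketch of the general idea: index every k-mer of the control on both strands, then flag any read that shares at least one k-mer with it. This is not BBDuk's implementation, the function names are invented for the example, and the value of k is an arbitrary choice here. Keep in mind that exact k-mer matching is sensitive to sequencing errors, so with noisy reads a shorter k may be needed.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_control_index(control_seq, k):
    """Return the set of all k-mers of the control, on both strands."""
    index = set()
    for strand in (control_seq.upper(), reverse_complement(control_seq.upper())):
        for i in range(len(strand) - k + 1):
            index.add(strand[i:i + k])
    return index


def shares_kmer(read_seq, index, k):
    """True if the read contains at least one k-mer present in the index."""
    read_seq = read_seq.upper()
    return any(read_seq[i:i + k] in index for i in range(len(read_seq) - k + 1))
```

A filter would then keep only the reads for which shares_kmer(...) returns False.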
