How can I remove reads matching specific DNA sequences from my fastq.gz files?
2.1 years ago
janny.lau ▴ 10

Hello all,

I have been doing some sequencing with Nanopore. During the library prep, we add a DNA Control strand (DCS) to each of our samples as a QC to check that the library prep was done correctly. Now that I have my fastq.gz files, I assume the DCS reads are still among my sequences. What tools can I use to remove the DCS from my fastq.gz files? (The DCS is given as a fasta sequence on the Nanopore website.)

Thank you so much for the help.

DCS Nanopore fastq.gz • 1.4k views
2.1 years ago
ATpoint 89k

Two ways:

a) Run the alignment as usual, but add the sequence as an extra chromosome to the reference genome fasta. You can then easily remove the reads aligning to that extra chromosome from the bam file. That's probably (imo) the best way, as it does not really require much custom code other than adding the sequence to the genome and indexing that.
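A minimal sketch of option (a), not a tested pipeline. It assumes minimap2 and samtools are installed, that the DCS fasta holds a single record with its header on the first line, and that all file names are placeholders passed in by you. The helper names `drop_contig` and `remove_dcs_by_alignment` are made up for this illustration.

```shell
# Drop SAM records aligned to the contig named in $1; header lines are kept.
# Reads SAM on stdin, writes SAM on stdout.
drop_contig() {
    awk -F '\t' -v ctg="$1" '/^@/ || $3 != ctg'
}

# Hypothetical wrapper: $1 genome fasta, $2 DCS fasta, $3 reads (fastq.gz), $4 output bam
remove_dcs_by_alignment() {
    # Contig name = first word of the DCS fasta header (assumed to be on line 1)
    dcs_name=$(sed -n '1s/^>\([^[:space:]]*\).*/\1/p' "$2")
    # Add the control sequence to the reference as an extra "chromosome"
    cat "$1" "$2" > genome_plus_dcs.fa
    # Align, drop records on the DCS contig, then sort and index
    minimap2 -ax map-ont genome_plus_dcs.fa "$3" \
        | drop_contig "$dcs_name" \
        | samtools sort -o "$4" -
    samtools index "$4"
}
```

Unmapped reads (contig `*`) are kept by this filter, and the `@SQ` header line for the DCS contig stays in the output, which is harmless for most downstream tools.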

b) Use something like https://bioinf.shenwei.me/seqkit/usage/#grep to find the reads with a match for this sequence. Extract the read names for the matches, and then do some Unix-fu to remove them. This can probably also be done with some seqkit magic, as it's quite a powerful toolkit.
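The "Unix-fu" part of option (b) could look roughly like this. It is a sketch that assumes you already have a file with one matching read name per line (here called `dcs_ids.txt`, a placeholder) and that the FASTQ uses strict 4-line records, which is the usual case for Nanopore output. The function name `filter_fastq_by_id` is made up for this example.

```shell
# Remove reads whose ID is listed in the file given as $1.
# FASTQ is read from stdin and the filtered FASTQ is written to stdout.
filter_fastq_by_id() {
    awk -v idfile="$1" '
        BEGIN {
            # Load the IDs to drop; only the first word of each line is used
            while ((getline line < idfile) > 0) {
                split(line, f, " ")
                drop[f[1]]
            }
        }
        # Header line of each 4-line record: strip the leading "@" to get the ID
        NR % 4 == 1 { id = substr($1, 2); keep = !(id in drop) }
        keep
    '
}

# Example usage (file names are placeholders):
#   gzip -dc reads.fastq.gz | filter_fastq_by_id dcs_ids.txt | gzip > clean.fastq.gz
```

As far as I know, seqkit grep can do the same in one step with its invert-match (-v) and pattern-file (-f) options, so check its usage page before rolling your own.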

2.1 years ago
GenoMax 153k

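You may be able to use bbduk.sh from the BBMap suite in filter mode. How long is your control sequence? You could provide the fasta file via ref= as follows (literal= expects the sequence itself on the command line rather than a file):

bbduk.sh -Xmx4g in=input.fq.gz out=clean.fq.gz ref=DCS.fa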
