Question

Remove reads containing a specific sequence.

0

Entering edit mode

3.5 years ago

agtbeeman • 0

Hello guys,

I have a sample.fastq file containing 20,000,000 reads. After doing a fastqc I notice an over-represented sequence "GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA" with a count of 50,000 according to fastqc.

For some reasons I am interested to persue an analysis without the reads containing this sequence.

However, grep -c show a higher count...

grep -c  'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' sample.fastq
167498

Anyway I have ran the following using fastx toolkit in the purpose to delete these reads (http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) :
```
fastx_clipper -a GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA -C -i sample.fq -o clipped.fastq
```

grep -c 'GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA' clipped.fq give 0 which is good. BUT when I count the number of reads it s only about 12,000,000 meaning 8,000,000 reads have been discarded or I expected a maximum of 167498 reads being discarded. (maybie the tools also deleted reads containing subsequences of this sequence ?).

Does anyone know how to fix it or know another tool more adapted to my problem ?

rna-seq sequence software error • 1.2k views

ADD COMMENT • link 3.5 years ago by agtbeeman • 0

score 1 · Answer 1 · 2020-11-06

1

Entering edit mode

3.5 years ago

GenoMax 141k

You can use bbduk.sh from BBMap suite in filter mode to eliminate reads containing these sequences. A guide is available for bbduk.

Basically something like following will work:

bbduk.sh -Xmx2g in=your.fastq outu=clean.fastq literal=GGTCATCTGCAAGTGATTTTACTTCCGCCTAGTGAGAAGTGGAACAATTA k=15

Add rcomp=f if you want to remove only exact sequence and not its reverse complement.

ADD COMMENT • link 3.5 years ago by GenoMax 141k

0

Entering edit mode

Thank you genomax ! it deleted all the reads containing the sequence. However it has deleted 400,000 reads instead of 167,498. I am not sure bu I think it is relative to k-mer size I set it to maximum and it has deleted 300,000 reads. (which is better than fastx !)

But if it has deleted also reads with very close sequence too or with subsequences not too short it's fine to me !

ADD REPLY • link 3.5 years ago by agtbeeman • 0

1

Entering edit mode

Take a look at the inline options for bbduk.sh to adjust the parameters to suite you needs. You can set hdist=0 to allow only perfect matches. Also consider the reverse complement situation I described above. grep only looks at direct matches so it is not going to understand reverse complement of sequence of interest, if present in your reads. So bbduk.sh is doing the right thing.

ADD REPLY • link 3.5 years ago by GenoMax 141k