Question

Find number of reads that contain at least 1 sequence from a list of sequences

7.9 years ago

joltex • 0

I have MiSeq FASTQ files and would like to know how many reads in each file contain at least one sequence from a list of ~15 short (15 bp) sequences or, conversely, how many reads contain none of the sequences in the list. I want to allow for a couple of mismatches and potentially an indel.

I've spent a while reading posts related to this question, but there seem to be many different answers. I've considered writing my own code and have also tried EMBOSS fuzznuc (suggested in one answer), but its output is massive when searching for multiple sequences in millions of reads. It seems to me that this could be done most efficiently with an aligner such as bowtie or bwa, but I'm not sure exactly how to go about it. Ideally I would align the reads against the 15 motifs and just determine the percentage aligned. However, because the motifs are much shorter than the reads, it seems the alignment would need to go the other way around, with the motifs aligned against the reads.

If anyone has some insight into this it would be greatly appreciated!

sequencing alignment NGS • 1.7k views

7.9 years ago

Jorge Amigo 14k

A very quick (maybe not very elegant) way to solve this would be:

1. hard-code all the allowed variants of each motif
2. store them all in allmotifs.txt, one per line
3. count the matching lines: grep -F -f allmotifs.txt file.fastq | wc -l
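The "hard-code all possibilities" step can be generated rather than typed by hand. Below is a minimal Python sketch, assuming only substitutions are allowed (no indels) and that the FASTQ uses the usual four lines per record. The function names `mismatch_variants` and `count_reads_with_any` are made up for illustration. Instead of grep, it slides a window over each read and looks the window up in a set, which stays fast even with thousands of variants.

```python
from itertools import combinations, product


def mismatch_variants(motif, max_mismatches=1, alphabet="ACGT"):
    """Return every sequence within max_mismatches substitutions of motif."""
    variants = {motif}
    for k in range(1, max_mismatches + 1):
        # Choose which k positions to change, then try every letter there.
        for positions in combinations(range(len(motif)), k):
            for letters in product(alphabet, repeat=k):
                chars = list(motif)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                variants.add("".join(chars))
    return variants


def count_reads_with_any(fastq_lines, patterns):
    """Count reads whose sequence line contains at least one pattern.

    Assumes four lines per FASTQ record, with the sequence on the 2nd line.
    Returns (reads_with_hit, total_reads).
    """
    # Group patterns by length so each read window needs one set lookup.
    by_len = {}
    for p in patterns:
        by_len.setdefault(len(p), set()).add(p)

    hits = total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:
            continue  # skip header, '+' and quality lines
        total += 1
        seq = line.strip()
        if any(
            seq[start:start + k] in pats
            for k, pats in by_len.items()
            for start in range(len(seq) - k + 1)
        ):
            hits += 1
    return hits, total
```

For a real file, something like `with open("file.fastq") as fh: hits, total = count_reads_with_any(fh, all_variants)` should work, where `all_variants` is the union of `mismatch_variants(m, 2)` over the motifs. With 15 bp motifs and two mismatches that is 991 variants per motif, so the full set stays small. Indels are not covered by this approach.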

score 1 · Accepted Answer · 2016-06-21

In case anyone else is interested in doing something similar, I ended up using a program called 'cutadapt'. It is meant as a trimmer, but it lets you supply a FASTA file of sequences to search for in your reads and specify a maximum error rate. It outputs some really nice stats, including the percentage of reads that contained at least one of the sequences, how many times each sequence was found, and with how many errors. You can also write to a new FASTQ file either only the reads that matched something in the list or only the reads that had no matches.
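For anyone curious what "search with a maximum error rate" amounts to, here is a rough Python sketch of the idea. It is not cutadapt's implementation. It uses a plain semi-global edit-distance search (Sellers' algorithm), so substitutions and indels both count as errors, and it assumes the error budget is the motif length times the error rate, rounded down. The names `min_edit_distance_in_read` and `summarize` are hypothetical.

```python
def min_edit_distance_in_read(motif, read):
    """Smallest edit distance between motif and any substring of read."""
    # Row 0 is all zeros, so a match may start at any read position for free.
    prev = [0] * (len(read) + 1)
    for i, m in enumerate(motif, start=1):
        curr = [i] + [0] * len(read)
        for j, r in enumerate(read, start=1):
            curr[j] = min(
                prev[j - 1] + (m != r),  # match or substitution
                prev[j] + 1,             # motif base absent from the read
                curr[j - 1] + 1,         # extra base in the read
            )
        prev = curr
    # The match may also end anywhere, so take the best end position.
    return min(prev)


def summarize(reads, motifs, max_error_rate=0.15):
    """Return (reads_with_hit, total_reads, hits_per_motif).

    reads: iterable of read sequences; motifs: dict of name -> sequence.
    Assumption: allowed errors = int(len(motif) * max_error_rate).
    """
    per_motif = {name: 0 for name in motifs}
    reads_with_hit = total = 0
    for read in reads:
        total += 1
        hit = False
        for name, motif in motifs.items():
            budget = int(len(motif) * max_error_rate)
            if min_edit_distance_in_read(motif, read) <= budget:
                per_motif[name] += 1
                hit = True
        reads_with_hit += hit
    return reads_with_hit, total, per_motif
```

With 15 bp motifs, an error rate of 0.15 gives a budget of two errors under this assumption. Pure Python will be slow on millions of reads, so this is mainly useful for checking a subsample or for understanding the numbers a dedicated tool reports.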