Find number of reads that contain at least 1 sequence from a list of sequences
2
0
Entering edit mode
7.9 years ago
joltex • 0

I have MiSeq fastq's and I would like to know how many reads in each fastq contain at least one sequence from a list of ~15 short (15bp) sequences. Or, conversely, how many reads don't contain any of the sequences in the list. I want to allow for a couple of mismatches and potentially an indel.

I've spent a while reading posts related to this question but there seem to be many different answers. I've considered writing my own code and have also tried using EMBOSS fuzznuc (which was suggested in one answer), however the output is massive when searching for multiple sequences in millions of reads. It seems to me that this could probably be accomplished most efficiently using an aligner such as bowtie or bwa, but I'm not sure exactly how to go about this. Ideally I would want to align the reads against the 15 motifs and just determine the percent aligned, however because the motifs are much shorter than the reads it would seem that the alignment needs to go the other way around.

If anyone has some insight into this it would be greatly appreciated!

sequencing alignment NGS • 1.7k views
ADD COMMENT
1
Entering edit mode
7.8 years ago
joltex • 0

In case anyone else is interested in doing something similar, I ended up using a program called 'cutadapt' which is meant as a trimmer but allows you to input a fasta file of sequences to search for in your reads and to specify a max percent errors. It outputs some really nice stats including what percentage of your reads contained at least one of the sequences, as well as how many times each sequence was found and with how many errors. You can also choose to output only those reads that matched something in the list or only those that had no matches to a new fastq file.

ADD COMMENT
1
Entering edit mode
7.9 years ago

a very quick (maybe not very elegant) way to solve this would be:

  1. hard-code all possibilities allowed for each motif

  2. store them all in allmotifs.txt

  3. grep -f allmotifs.txt file.fastq | wc -l

ADD COMMENT

Login before adding your answer.

Traffic: 3443 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6