Question: Find number of reads that contain at least 1 sequence from a list of sequences
0
gravatar for joltex
3.5 years ago by
joltex0
Canada
joltex0 wrote:

I have MiSeq fastq's and I would like to know how many reads in each fastq contain at least one sequence from a list of ~15 short (15bp) sequences. Or, conversely, how many reads don't contain any of the sequences in the list. I want to allow for a couple of mismatches and potentially an indel.

I've spent a while reading posts related to this question but there seem to be many different answers. I've considered writing my own code and have also tried using EMBOSS fuzznuc (which was suggested in one answer), however the output is massive when searching for multiple sequences in millions of reads. It seems to me that this could probably be accomplished most efficiently using an aligner such as bowtie or bwa, but I'm not sure exactly how to go about this. Ideally I would want to align the reads against the 15 motifs and just determine the percent aligned, however because the motifs are much shorter than the reads it would seem that the alignment needs to go the other way around.

If anyone has some insight into this it would be greatly appreciated!

sequencing alignment ngs • 944 views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by joltex0
1
gravatar for joltex
3.5 years ago by
joltex0
Canada
joltex0 wrote:

In case anyone else is interested in doing something similar, I ended up using a program called 'cutadapt' which is meant as a trimmer but allows you to input a fasta file of sequences to search for in your reads and to specify a max percent errors. It outputs some really nice stats including what percentage of your reads contained at least one of the sequences, as well as how many times each sequence was found and with how many errors. You can also choose to output only those reads that matched something in the list or only those that had no matches to a new fastq file.

ADD COMMENTlink written 3.5 years ago by joltex0
1
gravatar for Jorge Amigo
3.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

a very quick (maybe not very elegant) way to solve this would be:

  1. hard-code all possibilities allowed for each motif

  2. store them all in allmotifs.txt

  3. grep -f allmotifs.txt file.fastq | wc -l

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Jorge Amigo11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 755 users visited in the last hour