Question: Filter a large fasta or fastq by a query sequence, with parameterized fuzzy matching
12 months ago, ovon wrote:

I'm looking for a practical option for filtering a large fasta or fastq file. I'm dealing with a MiSeq read set where there was a problem with our Index reads, but we can separate the sequences by primer type. So far I have tried this approach:

  1. Use fuzzy matching (agrep) to query the entire sequence set, allowing 2 mismatches, and save the matching sequences to a file.
  2. Figure out which read names correspond to the matching sequences identified in step 1 (grep -f).
  3. Filter the original fastq with that list of read names.

This works fine for subsets of the sequencing run, but if I want to do the entire run, it will take ages (many days, or perhaps weeks, depending on the size of the sequence set). This isn't practical.
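Steps 2 and 3 above can probably be collapsed into one pass over the fastq if the read names are held in a hash set rather than passed to grep -f as a long pattern list. The following is only a minimal Python sketch of that idea. The function names are hypothetical, and it assumes plain 4-line fastq records where the read name is the first whitespace-delimited token of the header line.

```python
from itertools import islice


def load_read_names(handle):
    """Collect read names (one per line) into a set for fast lookups.

    A leading '@' or '>' and anything after the first whitespace are dropped,
    so names copied from fastq or fasta headers are accepted as well.
    """
    names = set()
    for line in handle:
        fields = line.strip().lstrip("@>").split()
        if fields:
            names.add(fields[0])
    return names


def filter_fastq_by_name(in_handle, out_handle, wanted):
    """Copy 4-line fastq records whose read name is in `wanted`.

    Returns the number of records written. Assumes every record spans
    exactly four lines (no wrapped sequence or quality lines).
    """
    kept = 0
    while True:
        record = list(islice(in_handle, 4))
        if len(record) < 4:
            break  # end of file (or a truncated final record)
        fields = record[0][1:].split()  # header without the leading '@'
        if fields and fields[0] in wanted:
            out_handle.writelines(record)
            kept += 1
    return kept
```

Called with open file handles, for example filter_fastq_by_name(open("reads.fastq"), open("subset.fastq", "w"), names), this reads the input once and costs one set lookup per read.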

I'm looking for an existing tool (or even a series of bash commands) that can take a query sequence (my primer) and filter the entire fasta or fastq based on a fuzzy match where I can set the number of allowed mismatches. It should be able to handle a full MiSeq run worth of reads (in my case, 5-10GB in fastq format). It doesn't need to work for fastq, since I can always filter my fastq files using the read names in a hypothetical fasta output.
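To make the request concrete, here is a rough Python sketch of such a filter: it streams a fastq once and keeps every read that contains the query with at most a given number of mismatches. All names in it are hypothetical. It counts substitutions only (agrep's error count can also include insertions and deletions), it searches the forward strand only, and it assumes 4-line fastq records. Pure Python like this may well still be too slow for a full run, so treat it as an illustration of the logic rather than a recommended tool.

```python
from itertools import islice


def within_mismatches(a, b, max_mismatches):
    """True if equal-length strings a and b differ at no more than max_mismatches positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # stop early once the budget is exceeded
    return True


def has_fuzzy_match(seq, query, max_mismatches, search_window=None):
    """Slide query along seq and report whether any window matches.

    search_window (optional) restricts the search to the first N bases,
    which helps if the primer is expected near the start of the read.
    """
    limit = len(seq) if search_window is None else min(len(seq), search_window)
    qlen = len(query)
    for start in range(0, limit - qlen + 1):
        if within_mismatches(seq[start:start + qlen], query, max_mismatches):
            return True
    return False


def filter_fastq(in_handle, out_handle, query, max_mismatches, search_window=None):
    """Write matching 4-line fastq records to out_handle; return how many were kept."""
    query = query.upper()
    kept = 0
    while True:
        record = list(islice(in_handle, 4))
        if len(record) < 4:
            break  # end of file (or a truncated final record)
        seq = record[1].strip().upper()
        if has_fuzzy_match(seq, query, max_mismatches, search_window):
            out_handle.writelines(record)
            kept += 1
    return kept
```

For gzipped input, the handle could be opened with gzip.open(path, "rt") instead of open(path).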

It seems like something of this nature would already exist, but I'm having trouble finding anything that works for datasets of the size I'm dealing with. This web tool works on small sequence sets: http://www.bioinformatics.org/sms2/fuzzy_search_dna.html However, after I downloaded it and modified the scripts and HTML files so it could handle my inputs, it just crashes. I suspect the HTML frontend is the issue. I don't have the expertise to modify the .js files beyond changing parameters. I am more familiar with awk, sed, and other bash commands, Perl, and Python.

Thanks in advance for any tips/answers!


http://emboss.sourceforge.net/apps/cvs/emboss/apps/fuzznuc.html

Comment written 12 months ago by cpad0112
12 months ago, finswimmer (Germany) wrote:

Hello,

have a look at seqkit grep.

fin swimmer
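For readers who want a starting point, a seqkit grep call for this use case might look like the sketch below, wrapped in Python so it can be scripted. The helper names, file paths, and primer are hypothetical. The flags used (--by-seq, --max-mismatch, --pattern, --out-file) are seqkit grep options as far as I know, but it is worth checking them against seqkit grep --help for the installed version.

```python
import shutil
import subprocess


def build_seqkit_grep_cmd(in_path, out_path, primer, max_mismatches=2):
    """Build the argument list for a seqkit grep search by sequence.

    Roughly equivalent to:
        seqkit grep -s -m 2 -p PRIMER -o out.fastq in.fastq
    """
    return [
        "seqkit", "grep",
        "--by-seq",                            # match against the sequence, not the read name
        "--max-mismatch", str(max_mismatches),  # number of mismatches allowed
        "--pattern", primer,                   # the query sequence (e.g. the primer)
        "--out-file", out_path,
        in_path,
    ]


def run_seqkit_grep(in_path, out_path, primer, max_mismatches=2):
    """Run seqkit grep, raising if seqkit is missing or exits with an error."""
    if shutil.which("seqkit") is None:
        raise RuntimeError("seqkit was not found on PATH")
    cmd = build_seqkit_grep_cmd(in_path, out_path, primer, max_mismatches)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and check=True surfaces a failed run instead of silently producing an empty output file.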


Thank you very much, I tested this just now and it works very well. Much faster than my agrep-based method. A very useful tool to know about.

Reply written 12 months ago by ovon