Question: Filter a large fasta or fastq by a query sequence, with parameterized fuzzy matching
23 months ago, ovon wrote:

I'm looking for a practical option for filtering a large fasta or fastq file. I'm dealing with a MiSeq read set where there was a problem with our Index reads, but we can separate the sequences by primer type. So far I have tried this approach:

  1. Use fuzzy matching (agrep) to query the entire sequence set, allowing 2 mismatches, and save the matching sequences to a file.
  2. Work out which read names correspond to the matching sequences identified in step 1 (grep -f).
  3. Filter the original fastq with that list of read names.

This works fine for subsets of the sequencing run, but if I want to do the entire run, it will take ages (many days, or perhaps weeks, depending on the size of the sequence set). This isn't practical.

I'm looking for an existing tool (or even a series of bash commands) that can take a query sequence (my primer) and filter the entire fasta or fastq based on a fuzzy match where I can set the number of allowed mismatches. It should be able to handle a full MiSeq run worth of reads (in my case, 5-10GB in fastq format). It doesn't need to work for fastq, since I can always filter my fastq files using the read names in a hypothetical fasta output.
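To make the requirement concrete, here is a minimal Python sketch of the kind of filter described: a single streaming pass over the fastq that keeps every read containing the query within a set number of mismatches. It is an illustration rather than a tuned tool, and it rests on several assumptions: records are strict 4-line fastq, only substitutions count as mismatches (agrep's error count can also include insertions and deletions), only the forward strand is searched, and the function names are made up for this example. Pure Python will be slow on a full run, so treat it as a statement of the logic.

```python
import io
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str, str]


def within_mismatches(window: str, query: str, max_mm: int) -> bool:
    """Return True if two equal-length strings differ at no more than max_mm positions."""
    mismatches = 0
    for a, b in zip(window, query):
        if a != b:
            mismatches += 1
            if mismatches > max_mm:
                return False  # stop early once the budget is exceeded
    return True


def fuzzy_find(seq: str, query: str, max_mm: int) -> bool:
    """Slide the query along seq and report whether any window matches within max_mm substitutions."""
    q = len(query)
    for start in range(len(seq) - q + 1):
        if within_mismatches(seq[start:start + q], query, max_mm):
            return True
    return False


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, separator, quality) tuples; assumes 4 lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        sep = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))


def filter_fastq(src: TextIO, dst: TextIO, query: str, max_mm: int) -> int:
    """Write matching records to dst in one pass and return how many were kept."""
    query = query.upper()
    kept = 0
    for record in read_fastq(src):
        if fuzzy_find(record[1].upper(), query, max_mm):
            dst.write("\n".join(record) + "\n")
            kept += 1
    return kept


# Tiny in-memory demo with a hypothetical primer; real use would pass open file handles.
demo_in = io.StringIO(
    "@r1\nTTACGTACGTAA\n+\nIIIIIIIIIIII\n"
    "@r2\nGGGGGGGGGGGG\n+\nIIIIIIIIIIII\n"
)
demo_out = io.StringIO()
n_kept = filter_fastq(demo_in, demo_out, "ACGTTCGT", max_mm=1)
print(n_kept, "read(s) kept")
```

Because matching and writing happen in the same pass, the intermediate file of sequences and the separate read-name lookup from steps 1 and 2 are no longer needed.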

It seems like something of this nature would exist already, but I'm having trouble finding anything that works for datasets of this size. The one tool I did find works on small sequence sets, but after I downloaded it and modified the scripts and HTML files so it could handle my inputs, it now just crashes; I suspect the HTML frontend is the issue. I don't have the expertise to modify the .js files beyond changing parameters. I am more familiar with awk, sed, and other bash commands, Perl, and Python.

Thanks in advance for any tips/answers!


23 months ago, finswimmer wrote:


Have a look at seqkit grep.

fin swimmer
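As a rough illustration of how this suggestion might be applied, the Python sketch below assembles a `seqkit grep` call that searches by sequence with a mismatch limit. The primer, file names, and helper function names are placeholders invented for this example, and the flags used (`--by-seq`, `--max-mismatch`, `--pattern`, `--out-file`) should be checked against `seqkit grep --help` for the installed version.

```python
import shlex
import shutil
import subprocess
from typing import List


def build_seqkit_grep_cmd(primer: str, max_mismatch: int,
                          in_path: str, out_path: str) -> List[str]:
    """Assemble the argument list for a mismatch-tolerant search by sequence."""
    return [
        "seqkit", "grep",
        "--by-seq",                          # match against the sequence, not the read name
        "--max-mismatch", str(max_mismatch), # allowed mismatches
        "--pattern", primer,                 # the query (primer) sequence
        "--out-file", out_path,
        in_path,
    ]


def run_seqkit_grep(primer: str, max_mismatch: int,
                    in_path: str, out_path: str) -> None:
    """Run the command; raises if seqkit is not installed or exits with an error."""
    if shutil.which("seqkit") is None:
        raise RuntimeError("seqkit was not found on PATH")
    subprocess.run(
        build_seqkit_grep_cmd(primer, max_mismatch, in_path, out_path),
        check=True,
    )


# Hypothetical primer and file names, shown only to display the resulting command line.
cmd = build_seqkit_grep_cmd("ACGTACGTACGTACGTAC", 2,
                            "reads.fastq.gz", "primer_matched.fastq.gz")
print(shlex.join(cmd))
```

The printed line can be pasted straight into a shell, or `run_seqkit_grep` can be called once per primer to split the run by primer type.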


Thank you very much, I tested this just now and it works very well. Much faster than my agrep-based method. A very useful tool to know about.

Reply written 23 months ago by ovon