Question: Looking For Reliable Tools To Do Quality Filtering Of Fastq Files
dataminer89 • 6.1 years ago • USA wrote:

I am looking for programs that allow one to pre-process and filter large FASTQ files based on various quality measures.

I know of the FASTX-Toolkit, but it seems a little long in the tooth (released in 2009), and the documentation of what it actually does seems to be lacking. Plus, only one or two of its tools would be useful for me; the rest seem to be plotting helpers of some sort.

There are publications out there, such as the very recent NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data (PLoS ONE, 2012), but after reading it I am left scratching my head. It is a pure Perl QC tool developed to run on Windows, which means it has no internal core that could have been written in C to be fast. It makes me wonder how this even got accepted.

I need recommendations for tools that have been tried in practice and proven to be fast and reliable; ideally, I would like to hear about the tool you yourself use. Besides filtering by average quality and clipping and trimming back reads, I would like to be able to detect various artifacts that the data might have, for example duplication, preferential enrichment of subsequences, polyadenylation, etc.
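
To be concrete about the average-quality part, this is roughly the operation I have in mind, written as a rough Python sketch rather than anything I would run at scale (it assumes Phred+33 qualities, an uncompressed 4-line FASTQ, and an arbitrary threshold of 20):

    import sys

    MIN_MEAN_Q = 20  # arbitrary example threshold

    def records(handle):
        # yield (header, sequence, plus, quality) tuples from a 4-line FASTQ
        while True:
            header = handle.readline().rstrip()
            if not header:
                return
            seq = handle.readline().rstrip()
            plus = handle.readline().rstrip()
            qual = handle.readline().rstrip()
            yield header, seq, plus, qual

    with open(sys.argv[1]) as fq:
        for header, seq, plus, qual in records(fq):
            mean_q = sum(ord(c) - 33 for c in qual) / len(qual)  # Phred+33
            if mean_q >= MIN_MEAN_Q:
                print(header, seq, plus, qual, sep="\n")

Something like "python filter_mean_q.py reads.fastq > filtered.fastq" (the script name is just for illustration), but done by a tool that is fast and well tested.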

Thanks for any input!

fastq qc • 6.4k views

I don't agree with your comments about developing a tool that will run on Windows; writing portable software is a good thing. One of the strengths of Perl (or any scripting language) is the relative ease with which you can perform complex tasks like plotting, creating web pages, etc., and have them run on almost any OS. If you haven't found a program written in C that does everything you mention, there is probably a reason.

(SES, 6.1 years ago)

I think the rationale is that parsing and evaluating the FASTQ format is a surprisingly time-consuming operation in interpreted languages, due to the per-base operations needed to decode a quality character. In addition, many of the trimming algorithms may require various kinds of inner loops that are again a weakness of these languages. All in all, that makes them less appropriate for anyone who has large or numerous FASTQ files. Heng Li has posted a nice benchmark in this thread: How to efficiently parse a huge fastq file?
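
For example, even computing the mean quality of a read forces a per-base loop like the sketch below (Phred+33 assumed); in C this is a tight loop over bytes, while an interpreter pays for every ord() call and addition, once per base, across the tens of millions of reads in a typical run:

    # Sketch of the per-base work needed just to decode qualities (Phred+33
    # assumed). Every ord() call and addition runs through the interpreter,
    # once per base, per read.
    def mean_quality(qual, offset=33):
        total = 0
        for c in qual:           # one interpreted iteration per base
            total += ord(c) - offset
        return total / len(qual)

    print(mean_quality("IIIIHHHGG###"))  # toy Phred+33 quality string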

(Istvan Albert, 6.1 years ago)

I agree completely with you about parsing, and I understand the argument. For a lot of tasks I'll write things in C, but my understanding is that the OP wanted a universal tool to do trimming, plotting, etc. in C, and I just haven't seen one. Frankly, I haven't found a tool written in C that actually works even for trimming. They either use way too much memory or, in the case of seqtk, don't actually work. I used seqtk for trimming recently and it is fast, but it removed no reads and left a lot of reads that were almost all Ns under the default settings.
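
As a quick sanity check on any trimmer's output I now count the fraction of Ns per read with something along these lines (a rough sketch, assuming a plain 4-line FASTQ given as the first argument and an arbitrary cutoff of 50% Ns):

    # Quick sanity check, not a filter: report reads that are mostly Ns,
    # which a working quality trimmer should not leave behind.
    import sys

    MAX_N_FRACTION = 0.5  # arbitrary cutoff for "almost all Ns"

    with open(sys.argv[1]) as fq:
        while True:
            header = fq.readline().rstrip()
            if not header:
                break
            seq = fq.readline().rstrip()
            fq.readline()          # '+' separator line
            fq.readline()          # quality line, unused here
            if seq and seq.upper().count("N") / len(seq) > MAX_N_FRACTION:
                print(header, seq)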

(SES, 6.1 years ago)
SES • 6.1 years ago • Vancouver, BC wrote:

I find that PRINSEQ does everything I want, and it will do all the things you listed in your post. It is written in Perl, and while it would be cool to find something written in C that produces results of this quality, I don't know if it would be as portable, as easy to use, or worth the time to develop. But I'd like to know if you find such a tool!
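
For what it's worth, the duplicate detection you mention is, at its simplest, exact-sequence dereplication. A toy in-memory sketch of the idea looks like the following (illustrative only; it will not scale to a full lane, and real tools are far more careful about memory):

    # Toy sketch of exact-sequence dereplication: keep the first occurrence
    # of each distinct sequence. Everything is held in memory, so this is
    # illustrative only. Assumes a plain 4-line FASTQ on stdin, e.g.
    #   python derep_sketch.py < reads.fastq > unique.fastq   (name is made up)
    import sys

    seen = set()
    lines = sys.stdin.read().splitlines()
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if seq not in seen:
            seen.add(seq)
            print(header, seq, plus, qual, sep="\n")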


The site, the manual, and all the content look very professional; it greatly surprises me that I have never heard of it before.

(Istvan Albert, 6.1 years ago)

They write very nice software, including TagCleaner (http://tagcleaner.sourceforge.net/), and in my experience they are professional and responsive to their users.

(SES, 6.1 years ago)
Martin A Hansen • 6.1 years ago • Denmark wrote:

Try Biopieces (www.biopieces.org). There is a section on cleaning NGS data in the HowTo. It is simple to set up workflows, and with GNU Parallel you can easily distribute the tasks across multiple servers.

Sean Davis • 6.1 years ago • National Institutes of Health, Bethesda, MD wrote:

I don't think there is a single tool that does all that one needs to QC and filter data for all datasets. However, FastQC is one that does give a quick overview in a readable format.

Madelaine Gogol • 6.1 years ago • Kansas City wrote:

I still use the FASTX-Toolkit; I think it's fast, and I have no problem with it. I also recently tried Trimmomatic for a more complicated trimming situation, and I thought it worked nicely.
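
For reference, the sliding-window style of trimming it offers is, as I understand it, roughly the idea sketched below: scan from the 5' end and cut the read where the mean quality of a small window first drops below a threshold (window size 4 and threshold 20 are just example values, and Phred+33 is assumed):

    # Rough sketch of sliding-window quality trimming: scan from the 5' end
    # and cut the read where the mean quality of a window first drops below
    # a threshold. Window size and threshold are example values only;
    # qualities are assumed to be Phred+33.
    def sliding_window_trim(seq, qual, window=4, threshold=20.0):
        scores = [ord(c) - 33 for c in qual]
        for start in range(len(scores) - window + 1):
            if sum(scores[start:start + window]) / window < threshold:
                return seq[:start], qual[:start]
        return seq, qual

    print(sliding_window_trim("ACGTACGTACGT", "IIIIIIII####"))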


This is also something I've never seen before. Once this list gets longer, I will collect all the tools into a tutorial with a bake-off-type contest.

(Istvan Albert, 6.1 years ago)