Tool: FASTA and FASTQ tools
Kamil (Boston) wrote, 3.0 years ago:

Many developers have created tools for manipulating FASTA and FASTQ files. Here is a list of publicly available projects, grouped by implementation language:

Java

  • http://jgi.doe.gov/data-and-tools/bbtools/
    • BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data. BBTools can handle common sequencing file formats such as fastq, fasta, sam, scarf, fasta+qual, compressed or raw, with autodetection of quality encoding and interleaving. It is written in Java and works on any platform supporting Java, including Linux, MacOS, and Microsoft Windows; there are no dependencies other than Java (version 7 or higher). Program descriptions and options are shown when running the shell scripts with no parameters.
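
To make "autodetection of quality encoding" concrete, here is a minimal Python sketch of one common heuristic. It is not BBTools' actual algorithm, and the function name guess_phred_offset is made up for illustration. The assumption is that Phred+64 and Solexa+64 encodings never use characters below ';' (ASCII 59), so any lower character implies Phred+33.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from FASTQ quality strings.

    Heuristic only (an assumption, not BBTools' method): characters
    below ASCII 59 cannot occur in the +64 encodings, so seeing one
    implies Phred+33. Returns None if no quality characters were seen.
    """
    lowest = None
    for qual in quality_strings:
        for ch in qual:
            code = ord(ch)
            if lowest is None or code < lowest:
                lowest = code
    if lowest is None:
        return None
    return 33 if lowest < 59 else 64
```

The heuristic can misclassify Phred+33 data in which every base has a high quality score, so a real tool would probably look at more reads or let the user override the guess.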

Go

  • https://github.com/shenwei356/seqkit
    • SeqKit is a cross-platform, ultrafast, and practical toolkit for FASTA/Q file manipulation that helps researchers handle a wide range of FASTA/Q processing tasks. It supports plain or gzip-compressed input and output from either standard streams or files, so it can easily be used in command-line pipes.

C/C++

  • https://github.com/lh3/seqtk
    • Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.
  • https://github.com/dcjones/fastq-tools
    • This package provides a number of small and efficient programs to perform common tasks with high throughput sequencing data in the FASTQ format. All of the programs work with typical FASTQ files as well as gzipped FASTQ files.
  • https://github.com/lh3/bioawk
    • Bioawk is an extension to Brian Kernighan's awk, adding the support of several common biological data formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and TAB-delimited formats with column names. It also adds a few built-in functions and a command line option to use TAB as the input/output delimiter. When the new functionality is not used, bioawk is intended to behave exactly the same as the original BWK awk.
  • https://github.com/agordon/fastx_toolkit
    • The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
  • https://github.com/alastair-droop/fqtools
    • fqtools is a software suite for fast processing of FASTQ files.
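
Several of the tools above read FASTQ files whether or not they are gzipped. As a rough sketch of how that can work (not taken from any of these projects), the Python below checks the two gzip magic bytes and then reads four-line records. The names open_maybe_gzip and read_fastq are hypothetical, and the reader assumes each record is exactly four lines with no wrapped sequences.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, transparently handling gzip compression."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(path):
    """Yield (name, sequence, quality) tuples from a FASTQ file.

    Assumes strict four-line records: '@name', sequence, '+', quality.
    """
    with open_maybe_gzip(path) as fh:
        while True:
            header = fh.readline().rstrip("\n")
            if not header:  # end of file (or a trailing blank line)
                break
            seq = fh.readline().rstrip("\n")
            fh.readline()  # skip the '+' separator line
            qual = fh.readline().rstrip("\n")
            yield header[1:], seq, qual
```

For example, sum(1 for _ in read_fastq("reads.fastq.gz")) would count the reads in a file, where reads.fastq.gz is a placeholder file name.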

Python

Perl

  • https://code.google.com/p/biopieces/
    • The Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task.
  • https://github.com/tjparnell/biotoolbox
    • The Bio::ToolBox libraries provide an abstraction layer over a variety of different specialized BioPerl-style modules. For example, there is a special emphasis on the collection of data values for defined genomic coordinate regions, regardless of whether the values come from a GFF database, Bam file, BigWig file, etc.
  • https://code.google.com/p/ea-utils/
    • Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina-based pipeline, but should work with any FASTQs.
  • https://github.com/sjackman/fastascripts
    • Manipulate FASTA files.
  • https://github.com/tlawrence3/FAST
    • The FAST Analysis of Sequences Toolbox (FAST) is a set of Unix tools (for example fasgrep, fascut, fashead and fastr) for sequence bioinformatics modeled after the Unix textutils (such as grep, cut, head, tr, etc). FAST workflows are designed for "inline" (serial) processing of flatfile biological sequence record databases per-sequence, rather than per-line, through Unix command pipelines. The default data exchange format is multifasta (specifically, a restriction of BioPerl FastA format). FAST tools expose the power of Perl and BioPerl for sequence analysis to non-programmers in an easy-to-learn command-line paradigm.
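
The difference between per-sequence and per-line processing can be sketched in a few lines of Python. This is only an illustration of the idea, not how FAST itself is implemented; fasta_records and grep_records are hypothetical names. The first function joins wrapped sequence lines into whole records, and the second filters records by a regular expression on the header, roughly in the spirit of fasgrep.

```python
import re


def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Sequence lines that are wrapped over several lines are joined,
    so each record is handled as one unit.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def grep_records(lines, pattern):
    """Yield only the records whose header matches the regex pattern."""
    regex = re.compile(pattern)
    for header, seq in fasta_records(lines):
        if regex.search(header):
            yield header, seq
```

Because both functions take any iterable of lines, passing sys.stdin lets a script built on them sit in the middle of a Unix pipeline.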

tool c++ fastq python fasta • 4.9k views
Modified 15 months ago • written 3.0 years ago by Kamil

Also, fastx-toolkit, bioawk, and a gazillion other tools - it's crazy how many of these are around!

Modified 3.0 years ago • written 3.0 years ago by Ram

And FAST (perl). One is bound to fail when taking up such a task.

Written 3.0 years ago by h.mon

Java - BBMap needs to be added to this list.

Written 22 months ago by genomax

I'm trying to get bioawk on Ubuntu using "sudo apt-get install bioawk", but it says it can't find a package named bioawk. How could I install this?

Written 22 months ago by beneficii

You can download the code (or use git clone https://github.com/lh3/bioawk.git), then go into the source folder (bioawk if you cloned it, bioawk-master if you downloaded the archive) and type make. That will compile the program. You can then copy the bioawk executable to a directory in your $PATH (/usr/local/bin should work).

Modified 22 months ago • written 22 months ago by genomax
Tariq Daouda (IRIC | Institute for Research in Immunology and Cancer) wrote, 3.0 years ago:

Python

There's also the parsers module of pyGeno. It supports FASTA, FASTQ, VCF, GTF and CSV files, with an emphasis on simple and convenient interfaces.

Modified 3.0 years ago • written 3.0 years ago by Tariq Daouda