Tool: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang
9
gravatar for shenwei356
13 months ago by
shenwei3563.4k
China
shenwei3563.4k wrote:

Hi all,

I'd like to introduce you a cross-platform and every fast FASTA/Q toolkit, Seqkit, written in Golang.

Introduction

Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly.

SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.

I had used SeqKit to solved some problems raised by Biostars users in simple and efficient ways. For examples:

Benchmarks

SeqKit uses author's lightweight and high-performance bioinformatics packages bio for FASTA/Q parsing, which has high performance close to the famous C lib klib (kseq.h).

FASTA manipulations benchmark.5tests.tsv.png

FASTQ manipulations benchmark.5tests.tsv.C.png

Subcommands

Sequence and subsequence

  • seq transform sequences (revserse, complement, extract ID...)
  • subseq get subsequences by region/gtf/bed, including flanking sequences
  • sliding sliding sequences, circular genome supported
  • stat simple statistics of FASTA files
  • faidx create FASTA index file

Format conversion

  • fx2tab covert FASTA/Q to tabular format (and length/GC content/GC skew)
  • tab2fx covert tabular format to FASTA/Q format
  • fq2fa covert FASTQ to FASTA

Searching

  • grep search sequences by pattern(s) of name or sequence motifs
  • locate locate subsequences/motifs

Set operations

  • rmdup remove duplicated sequences by id/name/sequence
  • common find common sequences of multiple files by id/name/sequence
  • split split sequences into files by id/seq region/size/parts
  • sample sample sequences by number or proportion
  • head print first N FASTA/Q records

Edit

  • replace replace name/sequence by regular expression
  • rename rename duplicated IDs

Ordering

  • shuffle shuffle sequences
  • sort sort sequences by id/name/sequence

Misc

  • version print version information and check for update
tool seqkit fastq fasta • 1.1k views
ADD COMMENTlink modified 8 months ago • written 13 months ago by shenwei3563.4k
0
gravatar for John
13 months ago by
John12k
Germany
John12k wrote:

I've only tested it briefly, specifically just the stat module, but it seems to give odd results.

  1. Non-aminoacid/nucleiacid characters like spaces are included in all the lengths
  2. Comments aren't parsed out
  3. It doesn't recognize ";" as being a legitimate alternative to ">" when starting an entry

The FASTA format is really awful so I do not blame you - however, if anyone should get parsing FASTA right it's tools like this :) I've had big problems trying to linearize FASTA files with comments in the past, so this is something i'd be interested in seeing fixed :)

In terms of the User Interface you've created, I think it's the best i've seen so far - for ordered and logical! Great job!

ADD COMMENTlink written 13 months ago by John12k
1

Thanks for your comments John,

  1. Since SeqKit is modulized, the stat command does not clean spaces and other characters, which can be done by seqkit seq --remove-gaps. So you can use seqkit seq -g seqs.fa | seqkit stat. I can also let it work in the way you want.
  2. I don't know what kind of comments you mean? comment after the sequence identifier in the heading line or lines starting with #.
  3. I do not know ";" could be a alternative to ">" :(. ">" is the most common format.
ADD REPLYlink written 13 months ago by shenwei3563.4k

1) Ah, so that is the linearize function. Very cool :) Particularly since your binaries have no dependencies. I think i will use this a lot.

2) Comments look like:

>GENE1
AGCTGTCTC
ACAGTCTAG ;this is a comment
CAGTACCCA

You never see comments in anything produced by a NGS machine, but using comments used to be quite popular back in my old lab when FASTA was used to store SNPs. Comments stored the reference base (or variant) using the format above, but I guess VCF put an end to all that. Still, it was frustrating as hell to parse properly, which is why toolkits like this would have been so useful back when I had to do that.

3) Yes '>' is the most common format, and honestly should be the only format, however I have seen plenty of FASTA files over the years with ";" :( I thought maybe it would be a simple thing to add to your parser module.

Two more points: I just tried using "seq --validate-seq" on some strings with strange characters, but it didn't seem to do anything? I was expecting a sort of 'PASS/FAIL' type output. Second thing, and this just a request and maybe I should have put it on the github page, but it would be nice if the stat module could count the occurences of each character seen in the sequence, for example:

A  - 342342
C  - 435342
G  - 445231
T  - 331459
\n - 4000
F  - 2

This would help people know if there are any characters that shouldn't be in the sequence that could cause problems later on, for example the 'F' above. Or maybe you just look for odd characters in --validate-seq and it prints out the line number the odd characters are on? That would be one small feature i'd really appreciate, and maybe others would too..? I dont know.

ADD REPLYlink modified 13 months ago • written 13 months ago by John12k
1
  1. sed -r 's/;.+//g' could be used to delete the comments.
  2. Supporting ";" is easy, I'll add this.
  3. seq --validate-seq. Here's the mechanism.
    1. If the sequence type not explicitly given by -t (--seq-type), SeqKit will guess it according to the first record. Supported sequence types include: DNA, RNA, Protein and Unlimit. So if you give some records of which the first record has some strange characters, SeqKit thinks all the records are "unlimit" formats. Therefore, no warning came out.
    2. Alternatively, you can set the sequence type, so it will validate the sequences according to given seq type.
  4. For the character stats. Because a FASTA/Q file may contains multiple records, we choose to show bases occurences stats for EVERY record by seqkit fx2tab.

    1. seqkit fx2tab --alphabet converts FASTA/Q formats to tabular formats, and prints alphabet of every records at last column. e.g.

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab --header-line -a
      #name   seq     qual    alphabet
      dna     actg            acgt
      bad dna actgo           acgot
      

      You can locate the record with help of my another tool, csvtk

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab -a | csvtk --tabs --no-header-row  grep --fields 4 --use-regexp --ignore-case --pattern '[^ACTGN]+'
      bad dna actgo           acgot
      
    2. seqkit fx2tab --base-content or seqkit fx2tab -B can outputs base countents not occurences.

      $ echo -e ">dna\nactg\n>dna2\nACGNTC" | seqkit fx2tab --header-line  -B a -B c -B g -B t
      #name   seq     qual    a       c       g       t
      dna     actg            25.00   25.00   25.00   25.00
      dna2    ACGNTC          16.67   33.33   16.67   16.67
      
ADD REPLYlink modified 13 months ago • written 13 months ago by shenwei3563.4k
1

Best answer I could have possibly got - great work Wei! Thank you so much :)

ADD REPLYlink written 13 months ago by John12k
1

Bug of csvtk fixed in v0.3.8.1, John you can try it now.

ADD REPLYlink written 13 months ago by shenwei3563.4k
1

Hey Wei, i’m really sorry, I re-read the FASTA documentation today and i’m just flat out wrong about “;” being an alternative to “>”. I don't know where I got it into my head that ";" was an alternative to ">", probably Wikipedia. It’s actually only the very first line of the FASTA file, containing a header, than can start with “;”. From my reading of the docs, you can’t even have a header for each sequence entry, it’s literally just 1 per file.

Really sorry for the lost time. If you send me a paypal link or similar i’ll buy you a beer.

ADD REPLYlink written 13 months ago by John12k
1

Don't worry, I haven't changed the code yet.

ADD REPLYlink written 13 months ago by shenwei3563.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1356 users visited in the last hour