Tool: SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang
9
gravatar for shenwei356
2.0 years ago by
shenwei3564.1k
China
shenwei3564.1k wrote:

Hi all,

I'd like to introduce you a cross-platform and every fast FASTA/Q toolkit, Seqkit, written in Golang.

Introduction

Common manipulations of FASTA/Q file include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools only implement some of these manipulations, and not particularly efficiently, and some are only available for certain operating systems. Furthermore, the complicated installation process of required packages and running environments can render these programs less user friendly.

SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.

I had used SeqKit to solved some problems raised by Biostars users in simple and efficient ways. For examples:

Benchmarks

SeqKit uses author's lightweight and high-performance bioinformatics packages bio for FASTA/Q parsing, which has high performance close to the famous C lib klib (kseq.h).

FASTA manipulations benchmark.5tests.tsv.png

FASTQ manipulations benchmark.5tests.tsv.C.png

Subcommands

Sequence and subsequence

  • seq transform sequences (revserse, complement, extract ID...)
  • subseq get subsequences by region/gtf/bed, including flanking sequences
  • sliding sliding sequences, circular genome supported
  • stat simple statistics of FASTA files
  • faidx create FASTA index file

Format conversion

  • fx2tab covert FASTA/Q to tabular format (and length/GC content/GC skew)
  • tab2fx covert tabular format to FASTA/Q format
  • fq2fa covert FASTQ to FASTA

Searching

  • grep search sequences by pattern(s) of name or sequence motifs
  • locate locate subsequences/motifs

Set operations

  • rmdup remove duplicated sequences by id/name/sequence
  • common find common sequences of multiple files by id/name/sequence
  • split split sequences into files by id/seq region/size/parts
  • sample sample sequences by number or proportion
  • head print first N FASTA/Q records

Edit

  • replace replace name/sequence by regular expression
  • rename rename duplicated IDs

Ordering

  • shuffle shuffle sequences
  • sort sort sequences by id/name/sequence

Misc

  • version print version information and check for update
tool seqkit fastq fasta • 2.4k views
ADD COMMENTlink modified 9 days ago by bio_d0 • written 2.0 years ago by shenwei3564.1k

I just used seqkit to make a shell wrapper to take fasta length distribution that I wanted to share in order to let you know that how useful this (seqkit) could be.

Script

#shell wrapper around seqkit to calculate length distribution of fasta sequences

# Usage 
# [cmd_prompt]$ source length_dist.sh <fasta_file.fa>

# Requires
# seqkit: https://bioinf.shenwei.me/seqkit/

echo -ne "Length < 200:\t"
cat $1 |  seqkit seq -M 199 | seqkit stat -T | grep -v file | cut -f 4

echo -ne "Length >= 200 && <= 300:\t"
cat $1 |  seqkit seq -m 200 -M 300 | seqkit stat -T | grep -v file | cut -f 4

for i in $(seq 300 100 900); do 
  echo -ne "Length > $i && <= $(expr $i + 100):\t" 
  cat $1 |  seqkit seq -m $(expr $i + 1) -M $(expr $i + 100) | seqkit stat -T | grep -v file | cut -f 4
done

echo -ne "Length > 1000 && <=5000:\t"
cat $1 |  seqkit seq -m 1001 -M 5000 | seqkit stat -T | grep -v file | cut -f 4

echo -ne "Length >5000:\t" 
cat $1 |  seqkit seq -m 5000 | seqkit stat -T | grep -v file | cut -f 4

Output

[user]$ sh  length_range_stats.sh file.fasta 
Length < 200:   457
Length >= 200 && <= 300:    2458
Length > 300 && <= 400: 2746
Length > 400 && <= 500: 2815
Length > 500 && <= 600: 2537
Length > 600 && <= 700: 2366
Length > 700 && <= 800: 2123
Length > 800 && <= 900: 1953
Length > 900 && <= 1000:    1905
Length > 1000 && <=5000:    17672
Length >5000:   215

Ofcourse, there is scope for improvement and can be modified according to requirements. For me, that was required!

Thank you my friend Wei

ADD REPLYlink modified 4 months ago • written 4 months ago by Vijay Lakhujani3.1k
1

How about outputting sequence lengths and ploting using other tools

$ seqkit fx2tab hairpin.fa -l | csvtk cut -t -f 4 | csvtk plot hist -H > hist.png

image

Or

# https://github.com/derwiki/histogram
$ seqkit fx2tab hairpin.fa -l | csvtk cut -t -f 4 | python histogram.py                      
['histogram.py']
28645 items, 12 keys
0-99           17670================================================================================
100-199        9717 ===========================================
200-299        907  ====
300-399        210  
400-499        102  
500-599        20   
600-699        8    
700-799        3    
800-899        2    
900-999        3    
1000-1099      2    
1100-1199      1
ADD REPLYlink modified 4 months ago • written 4 months ago by shenwei3564.1k

That's even better !!

ADD REPLYlink written 4 months ago by Vijay Lakhujani3.1k

Hi,

I am trying to extract sequences from a gzipped fastq file(17GB) using sequence ID list in a text file (2.8GB) using the following:

seqkit grep --pattern-file id.txt raw-reads.fastq.gz > subset.fastq.gz

However, the resulting subset.fastq.gz file is empty. Could you please tell how to deal with such huge files? Or is the command is incorrect in the first place?

ADD REPLYlink written 9 days ago by bio_d0

Can you post the output of head -6 id.txt?

ADD REPLYlink written 9 days ago by genomax57k

head -6 id.txt

@D00723:299:CCRTLANXX:1:1101:1281:1987 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1301:1993 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1660:1986 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1769:1980 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:1755:1982 2:N:0:1

@D00723:299:CCRTLANXX:1:1101:2165:1989 2:N:0:1

ADD REPLYlink modified 8 days ago • written 8 days ago by bio_d0

you need remove the leading symbol @ by

sed 's/^@//' id.txt > id2.txt
ADD REPLYlink written 8 days ago by shenwei3564.1k
0
gravatar for John
2.0 years ago by
John12k
Germany
John12k wrote:

I've only tested it briefly, specifically just the stat module, but it seems to give odd results.

  1. Non-aminoacid/nucleiacid characters like spaces are included in all the lengths
  2. Comments aren't parsed out
  3. It doesn't recognize ";" as being a legitimate alternative to ">" when starting an entry

The FASTA format is really awful so I do not blame you - however, if anyone should get parsing FASTA right it's tools like this :) I've had big problems trying to linearize FASTA files with comments in the past, so this is something i'd be interested in seeing fixed :)

In terms of the User Interface you've created, I think it's the best i've seen so far - for ordered and logical! Great job!

ADD COMMENTlink written 2.0 years ago by John12k
1

Thanks for your comments John,

  1. Since SeqKit is modulized, the stat command does not clean spaces and other characters, which can be done by seqkit seq --remove-gaps. So you can use seqkit seq -g seqs.fa | seqkit stat. I can also let it work in the way you want.
  2. I don't know what kind of comments you mean? comment after the sequence identifier in the heading line or lines starting with #.
  3. I do not know ";" could be a alternative to ">" :(. ">" is the most common format.
ADD REPLYlink written 2.0 years ago by shenwei3564.1k

1) Ah, so that is the linearize function. Very cool :) Particularly since your binaries have no dependencies. I think i will use this a lot.

2) Comments look like:

>GENE1
AGCTGTCTC
ACAGTCTAG ;this is a comment
CAGTACCCA

You never see comments in anything produced by a NGS machine, but using comments used to be quite popular back in my old lab when FASTA was used to store SNPs. Comments stored the reference base (or variant) using the format above, but I guess VCF put an end to all that. Still, it was frustrating as hell to parse properly, which is why toolkits like this would have been so useful back when I had to do that.

3) Yes '>' is the most common format, and honestly should be the only format, however I have seen plenty of FASTA files over the years with ";" :( I thought maybe it would be a simple thing to add to your parser module.

Two more points: I just tried using "seq --validate-seq" on some strings with strange characters, but it didn't seem to do anything? I was expecting a sort of 'PASS/FAIL' type output. Second thing, and this just a request and maybe I should have put it on the github page, but it would be nice if the stat module could count the occurences of each character seen in the sequence, for example:

A  - 342342
C  - 435342
G  - 445231
T  - 331459
\n - 4000
F  - 2

This would help people know if there are any characters that shouldn't be in the sequence that could cause problems later on, for example the 'F' above. Or maybe you just look for odd characters in --validate-seq and it prints out the line number the odd characters are on? That would be one small feature i'd really appreciate, and maybe others would too..? I dont know.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by John12k
1
  1. sed -r 's/;.+//g' could be used to delete the comments.
  2. Supporting ";" is easy, I'll add this.
  3. seq --validate-seq. Here's the mechanism.
    1. If the sequence type not explicitly given by -t (--seq-type), SeqKit will guess it according to the first record. Supported sequence types include: DNA, RNA, Protein and Unlimit. So if you give some records of which the first record has some strange characters, SeqKit thinks all the records are "unlimit" formats. Therefore, no warning came out.
    2. Alternatively, you can set the sequence type, so it will validate the sequences according to given seq type.
  4. For the character stats. Because a FASTA/Q file may contains multiple records, we choose to show bases occurences stats for EVERY record by seqkit fx2tab.

    1. seqkit fx2tab --alphabet converts FASTA/Q formats to tabular formats, and prints alphabet of every records at last column. e.g.

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab --header-line -a
      #name   seq     qual    alphabet
      dna     actg            acgt
      bad dna actgo           acgot
      

      You can locate the record with help of my another tool, csvtk

      $ echo -e ">dna\nactg\n>bad dna\nactgo" | seqkit fx2tab -a | csvtk --tabs --no-header-row  grep --fields 4 --use-regexp --ignore-case --pattern '[^ACTGN]+'
      bad dna actgo           acgot
      
    2. seqkit fx2tab --base-content or seqkit fx2tab -B can outputs base countents not occurences.

      $ echo -e ">dna\nactg\n>dna2\nACGNTC" | seqkit fx2tab --header-line  -B a -B c -B g -B t
      #name   seq     qual    a       c       g       t
      dna     actg            25.00   25.00   25.00   25.00
      dna2    ACGNTC          16.67   33.33   16.67   16.67
      
ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by shenwei3564.1k
1

Best answer I could have possibly got - great work Wei! Thank you so much :)

ADD REPLYlink written 2.0 years ago by John12k
1

Bug of csvtk fixed in v0.3.8.1, John you can try it now.

ADD REPLYlink written 2.0 years ago by shenwei3564.1k
2

Hey Wei, i’m really sorry, I re-read the FASTA documentation today and i’m just flat out wrong about “;” being an alternative to “>”. I don't know where I got it into my head that ";" was an alternative to ">", probably Wikipedia. It’s actually only the very first line of the FASTA file, containing a header, than can start with “;”. From my reading of the docs, you can’t even have a header for each sequence entry, it’s literally just 1 per file.

Really sorry for the lost time. If you send me a paypal link or similar i’ll buy you a beer.

ADD REPLYlink written 24 months ago by John12k
1

Don't worry, I haven't changed the code yet.

ADD REPLYlink written 24 months ago by shenwei3564.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1806 users visited in the last hour