Hi all,
I'd like to introduce a cross-platform and very fast FASTA/Q toolkit, SeqKit, written in Golang.
- Documents: http://bioinf.shenwei.me/seqkit (Usage, FAQ (New!), Tutorial, Benchmark and Development Notes)
- Source code: https://github.com/shenwei356/seqkit
- Latest version:
- Citation:
Introduction
Common manipulations of FASTA/Q files include converting, searching, filtering, deduplication, splitting, shuffling, and sampling. Existing tools implement only some of these manipulations, often not particularly efficiently, and some are available only for certain operating systems. Furthermore, the complicated installation of required packages and runtime environments can make these programs less user-friendly.
SeqKit provides executable binary files for all major operating systems, including Windows, Linux, and Mac OS X, and can be directly used without any dependencies or pre-configurations. SeqKit demonstrates competitive performance in execution time and memory usage compared to similar tools. The efficiency and usability of SeqKit enable researchers to rapidly accomplish common FASTA/Q file manipulations.
I have used SeqKit to solve some problems raised by Biostars users in simple and efficient ways. For example:
- How to get contigs from scaffolds
- parsing fasta file
- How to append strings (from one file) to Fasta headers (in another file)
- Renaming fasta file according to a name list (blast output)
- Filter Fasta using regexp on header
Benchmarks
SeqKit uses the author's lightweight, high-performance bioinformatics package bio for FASTA/Q parsing, whose performance is close to that of the well-known C library klib (kseq.h).
FASTA manipulations
FASTQ manipulations
Subcommands
Sequence and subsequence
- seq: transform sequences (reverse, complement, extract ID...)
- subseq: get subsequences by region/gtf/bed, including flanking sequences
- sliding: sliding sequences, circular genome supported
- stat: simple statistics of FASTA files
- faidx: create FASTA index file
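As a rough illustration of the reverse-complement transform that seq covers, here is a minimal sketch using only standard Unix tools (rev and tr), not SeqKit itself. The function name revcomp is hypothetical, and the sketch assumes a single-line sequence containing only A/C/G/T; IUPAC ambiguity codes are passed through unchanged rather than complemented.

```shell
# Reverse-complement a single-line DNA string with standard Unix tools.
# Assumption: only A/C/G/T (either case) are complemented;
# any other symbol (e.g. N) is reversed but left as-is.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp "ACGTTG"   # prints CAACGT
```

For real data with wrapped lines, ambiguity codes, or FASTQ qualities, a dedicated tool is the safer choice.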
Format conversion
- fx2tab: convert FASTA/Q to tabular format (and length/GC content/GC skew)
- tab2fx: convert tabular format to FASTA/Q format
- fq2fa: convert FASTQ to FASTA
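To make the idea of a tabular view concrete, the following sketch prints one "ID, tab, length" row per FASTA record using plain awk. It is not SeqKit and does not reproduce fx2tab's actual output or options; the file demo.fa and the function name fasta_lengths are made up for the illustration.

```shell
# Hypothetical two-record FASTA file used only for this illustration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo.fa" <<'EOF'
>seq1 first record
ACGTAC
GT
>seq2
GGCC
EOF

# Print "ID<TAB>length" for each record; wrapped sequence lines are summed.
fasta_lengths() {
    awk '
        /^>/ { if (id != "") print id "\t" len   # flush the previous record
               id = substr($1, 2); len = 0; next }
        { len += length($0) }
        END  { if (id != "") print id "\t" len }  # flush the last record
    ' "$1"
}

fasta_lengths "$tmpdir/demo.fa"
```

A table like this is easy to feed into sort, cut, or a plotting tool, which is the main appeal of converting to tabular format.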
Searching
- grep: search sequences by pattern(s) of name or sequence motifs
- locate: locate subsequences/motifs
Set operations
- rmdup: remove duplicated sequences by id/name/sequence
- common: find common sequences of multiple files by id/name/sequence
- split: split sequences into files by id/seq region/size/parts
- sample: sample sequences by number or proportion
- head: print first N FASTA/Q records
Edit
- replace: replace name/sequence by regular expression
- rename: rename duplicated IDs
Ordering
- shuffle: shuffle sequences
- sort: sort sequences by id/name/sequence
Misc
- version: print version information and check for update
I just used SeqKit to write a shell wrapper that reports the length distribution of a FASTA file, and I wanted to share it to show how useful SeqKit can be.
Script
Output
Of course, there is scope for improvement, and it can be modified according to your requirements. For me, this was what I needed!
Thank you my friend Wei
How about outputting sequence lengths and plotting them with other tools?
Or
That's even better !!
Hi,
I am trying to extract sequences from a gzipped FASTQ file (17 GB) using a list of sequence IDs in a text file (2.8 GB) with the following command:
seqkit grep --pattern-file id.txt raw-reads.fastq.gz > subset.fastq.gz
However, the resulting subset.fastq.gz file is empty. Could you please tell me how to deal with such huge files? Or is the command incorrect in the first place?
Can you post the output of
head -6 id.txt
?
head -6 id.txt
@D00723:299:CCRTLANXX:1:1101:1281:1987 2:N:0:1
@D00723:299:CCRTLANXX:1:1101:1301:1993 2:N:0:1
@D00723:299:CCRTLANXX:1:1101:1660:1986 2:N:0:1
@D00723:299:CCRTLANXX:1:1101:1769:1980 2:N:0:1
@D00723:299:CCRTLANXX:1:1101:1755:1982 2:N:0:1
@D00723:299:CCRTLANXX:1:1101:2165:1989 2:N:0:1
You need to remove the leading symbol
@
by