7.3 years ago by
United States
The following list shows the time for converting 2 million 100bp sequences in fastq to fasta with different approaches (locale set to "C"):
================================================================================================
Real(s)  CPU(s)  Command line
------------------------------------------------------------------------------------------------
    1.8     1.8  seqtk seq -A t.fq > /dev/null
    3.1     3.1  sed -n '1~4s/^@/>/p;2~4p' t.fq > /dev/null
    5.8    12.4  paste - - - - < t.fq | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > /dev/null
    7.6     7.5  bioawk -c fastx '{print ">"$name"\n"$seq}' t.fq > /dev/null
   11.9    12.9  awk 'NR%4==1||NR%4==2' t.fq | tr "@" ">" > /dev/null
   22.2    22.2  seqret -sequence t.fq -out /dev/null  # 6.4.0
   26.5    25.4  fastq_to_fasta -Q32 -i t.fq -o /dev/null  # 0.0.13
================================================================================================
In the list, seqtk, bioawk and seqret work with multi-line fastq; the rest don't. If you just want to use the standard unix tools, rtliu's sed solution is preferred: it is both short and efficient. Note that the file t.fq is put in /dev/shm and the results are written to /dev/null. In real applications, I/O may take more wall-clock time than CPU. In addition, the sequence file is frequently gzip'd. For seqtk, decompression takes more CPU time than parsing fastq.
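For anyone unsure what the sed command in the table does, here is a minimal sketch on a tiny made-up file (the read names and the temporary path are hypothetical). It assumes GNU sed, since the `first~step` address form is a GNU extension, and strict 4-line fastq. Note that the second quality line starts with '@', which is why the command selects header lines by position rather than by their first character.

```shell
# Build a tiny 4-line FASTQ file (hypothetical records, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'TTGA' '+' '@@II' > "$tmpdir/t.fq"

# GNU sed address 'first~step' matches every step-th line starting at line 'first'.
#   1~4s/^@/>/p  on header lines (1, 5, 9, ...) replace the leading '@' with '>' and print
#   2~4p         print sequence lines (2, 6, 10, ...) unchanged
# -n suppresses the default output, so the '+' and quality lines are dropped.
fasta=$(sed -n '1~4s/^@/>/p;2~4p' "$tmpdir/t.fq")
printf '%s\n' "$fasta"

rm -rf "$tmpdir"
```

On multi-line fastq the line positions no longer line up, so this approach does not apply there.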
Additional comments:
SES observed that seqret was faster than Irsan's command. In my hands, seqret is always slower than most. Is it because of version, locale or something else?
I have not tried the native bioperl parser. It is probably much slower. Using bioperl on large fastq files is discouraged.
I do agree 4-line fastq is much more convenient for many people. However, fastq is never "defined" to consist of 4 lines. When it was first introduced at the Sanger Institute for ~1kb capillary reads, it allowed multiple lines.
Tools in many ancient unix distributions (e.g. older AIX) do not work with long lines. I was working with a couple of such unix systems even in 2005. I guess this is why older outputs/formats, such as fasta, blast and genbank/embl, used short lines. This is not a concern any more nowadays.
To convert multi-line fastq to 4-line (or multi-line fasta to 2-line fasta): seqtk seq -l0 multi-line.fq > 4-line.fq
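If seqtk is not at hand, the fasta half of this (multi-line fasta to 2-line fasta) can be approximated with plain awk. This is a rough sketch, not what seqtk does internally. It assumes every record has a '>' header followed by at least one sequence line, and the file name multi.fa is made up. It does not handle multi-line fastq, which needs a real parser.

```shell
# Hypothetical multi-line FASTA input, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' '>a' 'AC' 'GT' '>b' 'TT' > "$tmpdir/multi.fa"

# Header lines: end the previous sequence (if any) with a newline, then print the header.
# Other lines: append to the current sequence without a newline.
# END: terminate the last sequence.
two_line=$(awk '
    /^>/ { if (NR > 1) printf "\n"; print; next }
         { printf "%s", $0 }
    END  { if (NR > 0) printf "\n" }
' "$tmpdir/multi.fa")
printf '%s\n' "$two_line"

rm -rf "$tmpdir"
```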
written 7.3 years ago by lh3
Why so complicated? :)
seqkit fq2fa in.fastq.gz -o out.fasta
https://bioinf.shenwei.me/seqkit/usage/#fq2fa
and 2.4 years later...
Hi Pierre, I recently learned that the '@' from a fastq file causes trouble when it's left in the fasta file for alignment... https://github.com/samtools/samtools/issues/773
So maybe your oneliner could use another update almost two years later... :)
Hi Wouter,
Here's an update with '@' removal and multi-threaded unzipping thrown in as a bonus. https://zlib.net/pigz/
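A minimal sketch of the general idea, not necessarily the command meant above: decompress to a stream and replace the leading '@' on header lines with '>'. It assumes GNU sed and 4-line fastq, and the file name in.fastq.gz is made up. gzip is used here so the example runs anywhere; `pigz -dc` accepts the same flags if pigz is installed.

```shell
# Hypothetical gzip'd 4-line FASTQ, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' '@read1 sample' 'ACGT' '+' 'IIII' | gzip -c > "$tmpdir/in.fastq.gz"

# -d decompresses, -c writes to stdout so the file can be piped straight into sed.
# The sed step turns '@name' into '>name' on header lines and keeps the sequence lines.
gz_fasta=$(gzip -dc "$tmpdir/in.fastq.gz" | sed -n '1~4s/^@/>/p;2~4p')
printf '%s\n' "$gz_fasta"

rm -rf "$tmpdir"
```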
This is a common problem for newcomers. There is no need for a perl script to do this; built-in commands, like the awk one posted by Pierre, work well.
Or this might be a little easier to remember / type:
http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage
Can we also apply this to a whole folder, so that all files are converted at once?
Thanks
Sandeep
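One way to do a whole folder is a plain shell loop around any of the commands above. Here is a minimal sketch using the sed one-liner. It assumes GNU sed, 4-line fastq, and files ending in .fq; the folder and file names are made up, so point fastq_dir at your own directory instead.

```shell
# Hypothetical folder with two small FASTQ files, for illustration only.
fastq_dir=$(mktemp -d)
printf '%s\n' '@a1' 'ACGT' '+' 'IIII' > "$fastq_dir/sample_a.fq"
printf '%s\n' '@b1' 'GGCC' '+' 'IIII' > "$fastq_dir/sample_b.fq"

# Convert every .fq file; each output sits next to its input with a .fa suffix.
for fq in "$fastq_dir"/*.fq; do
    [ -e "$fq" ] || continue          # skip if the glob matched nothing
    sed -n '1~4s/^@/>/p;2~4p' "$fq" > "${fq%.fq}.fa"
done

ls "$fastq_dir"
```

The `${fq%.fq}` expansion strips the .fq suffix, so sample_a.fq becomes sample_a.fa.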