Fastq Convert To Fasta
11
13
Entering edit mode
8.0 years ago
biolab ★ 1.3k

Dear All, Am I right to use the following script to transform a fastq file (named test.fastq) to a fasta file? THANKS a lot!

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
my $in  = Bio::SeqIO->new(-file => "test.fastq",  -format => 'fastq');
my $out = Bio::SeqIO->new(-file => ">test.fasta", -format => 'fasta');
while(my $seq=$in->next_seq()){
$out->write_seq($seq);
}
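Yes, the Bio::SeqIO script above is a valid way to do the conversion. For readers who want the same result without any dependency, here is a minimal Python sketch (the function name and file paths are illustrative), assuming the common 4-line-per-record FASTQ layout:

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Convert 4-line-per-record FASTQ to FASTA, no external dependencies.

    Assumes every record is exactly 4 lines: header, sequence, '+', quality.
    """
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        for i, line in enumerate(fq):
            if i % 4 == 0:            # header line: '@id ...' -> '>id ...'
                fa.write(">" + line[1:])
            elif i % 4 == 1:          # sequence line: copy as-is
                fa.write(line)
            # lines 3 and 4 of each record ('+' and quality) are skipped
```

Like the awk/sed one-liners posted in the answers, this sketch assumes 4-line records and breaks on multi-line FASTQ.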

perl • 100k views
11
Entering edit mode

Why so complicated?

gunzip -c file.fq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa

22
Entering edit mode

Why so complicated? :)

seqtk seq -a in.fastq.gz > out.fasta

0
Entering edit mode

seqkit fq2fa in.fastq.gz -o out.fasta

https://bioinf.shenwei.me/seqkit/usage/#fq2fa

7
Entering edit mode

and 2.4 years later...

gunzip -c file.fq.gz | paste - - - - | cut -f 1,2 | sed 's/^/>/' | tr "\t" "\n" > file.fa

2
Entering edit mode

Hi Pierre, I recently learned that the '@' from a fastq file causes trouble when it's left in the fasta file for alignment... https://github.com/samtools/samtools/issues/773 So maybe your one-liner could use another update, almost two years later... :)

2
Entering edit mode

Hi Wouter,

unpigz -cp16 file.fq.gz | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa

Here's an update with '@' removal and multi-threaded unzipping thrown in as a bonus. https://zlib.net/pigz/

0
Entering edit mode

A common problem for new folk; there is no need for a Perl script to do this. Built-in commands like the awk one posted by Pierre are good.

0
Entering edit mode

Or this might be a little easier to remember / type: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage

0
Entering edit mode

Can we also apply this to a whole folder, so that all files can be converted at once?
Thanks Sandeep

27
Entering edit mode
8.0 years ago
lh3 32k

The following table shows the time for converting 2 million 100bp sequences from fastq to fasta with different approaches (locale set to "C"):

Real(s)   CPU(s)   Command line
------------------------------------------------------------------------------------------------
   1.8      1.8    seqtk seq -A t.fq > /dev/null
   3.1      3.1    sed -n '1~4s/^@/>/p;2~4p' t.fq > /dev/null
   5.8     12.4    paste - - - - < t.fq | sed 's/^@/>/g' | cut -f1-2 | tr '\t' '\n' > /dev/null
   7.6      7.5    bioawk -c fastx '{print ">"name"\n"$seq}' t.fq > /dev/null
  11.9     12.9    awk 'NR%4==1||NR%4==2' t.fq | tr "@" ">" > /dev/null
  22.2     22.2    seqret -sequence t.fq -out /dev/null   # 6.4.0
  26.5     25.4    fastq_to_fasta -Q32 -i t.fq -o /dev/null   # 0.0.13

In the table, seqtk, bioawk and seqret work with multi-line fastq; the rest don't. If you just want to use the standard unix tools, rtliu's sed solution is preferred: both short and efficient.

It should be noted that the file t.fq is kept in /dev/shm and the results are written to /dev/null. In real applications, I/O may take more wall-clock time than CPU. In addition, the sequence file is frequently gzip'd; for seqtk, decompression takes more CPU time than parsing the fastq itself.

Additional comments:

• SES observed that seqret was faster than Irsan's command. In my hands, seqret is always slower than most. Is it because of version, locale or something else?

• I have not tried the native bioperl parser. It is probably much slower; using bioperl on large fastq is discouraged.

• I do agree that 4-line fastq is much more convenient for many people. However, fastq was never "defined" to consist of 4 lines. When it was first introduced at the Sanger Institute for ~1kb capillary reads, it allowed multiple lines.

• Tools in many ancient unix distributions (e.g. older AIX) do not work with long lines. I was working with a couple of such unix systems even in 2005. I guess this is why older outputs/formats, such as fasta, blast and genbank/embl, used short lines. This is not a concern any more nowadays.

• To convert multi-line fastq to 4-line fastq (or multi-line fasta to 2-line fasta):

seqtk seq -l0 multi-line.fq > 4-line.fq

2
Entering edit mode

Thanks for taking the time and checking the different solutions!

2
Entering edit mode

Here are my observations with a file of 2 million 100 bp sequences (EMBOSS 6.5.7.0; Fedora 17 desktop):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m16.919s
user    0m19.930s
sys    0m1.439s

$ time seqret -sequence t.fq -outseq stdout > /dev/null

real    0m9.435s
user    0m9.329s
sys     0m0.091s

Here is the same analysis, but using EMBOSS 6.4.0 and a different machine (RHEL 5.9 server):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m6.902s
user    0m10.273s
sys    0m10.506s

$ time /usr/local/emboss/6.4.0/bin/seqret -sequence t.fq -outseq stdout > /dev/null

real    0m22.571s
user    0m20.679s
sys     0m1.888s

This would suggest there may be some differences between versions, which I had not considered until I saw your review.

0
Entering edit mode

Thanks for the evaluation. I have tried the latest EMBOSS on the same old CentOS server; its speed is pretty much the same. I guess these two command lines are sensitive to the system configuration. As for the "paste" command line, could you confirm that the locale is set to "C"?

0
Entering edit mode

I checked, and the locale on both machines is set to UTF-8, which I read is the default for most Linux systems. I also read somewhere that this affects the sort order, but I'm uncertain what that means exactly.

EDIT: Here's what the manpage of sort says: "** WARNING ** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values." I guess that means sort could behave differently on different systems. This is interesting; I'll keep this information in mind.

0
Entering edit mode

You may set "export LC_ALL=C" and then see if the speed changes. Sometimes the locale has a dramatic hit on text-processing performance. For example, regex matching on UTF-8 is much slower than on ASCII.

22
Entering edit mode
8.0 years ago
Irsan ★ 7.4k

If you want to convert fastq to fasta, try:

cat test.fastq | paste - - - - | sed 's/^@/>/g' | cut -f1-2 | tr '\t' '\n'

1
Entering edit mode

4 years later. This is amazing!

0
Entering edit mode

I like this one, nice.

0
Entering edit mode

Inspired by najim ameziane :-)

0
Entering edit mode

shame on me :-) Your solution is far more elegant :-)

0
Entering edit mode

@Irsan A small typo in your one-liner.
It should be test.fastq

1
Entering edit mode

OK, I have changed fasta to fastq

9
Entering edit mode
8.0 years ago
rtliu ★ 2.1k

This is the fastest bash one-liner to convert fastq to fasta, assuming 4 lines per FASTQ record (more on https://github.com/stephenturner/oneliners):

sed -n '1~4s/^@/>/p;2~4p' test.fastq > test.fasta

3
Entering edit mode

I can't beat this. However, this is extremely close (within 3% or so):

perl -ne 'y/@/>/;print($_.<>)&&<>&&<>' test.fastq > test.fasta


It is also possible to write a terser Perl one-liner, though a tad slower, or a faster version, though longer.

9
Entering edit mode
8.0 years ago
SES 8.5k

Don't use awk/shell to parse fastq. None of those solutions work for multi-line records (except seqtk). You can save yourself a lot of trouble by using a toolkit like EMBOSS for this:

seqret -sequence reads.fastq -outseq reads.fasta


Since you were trying for a BioPerl solution, the same conversion can also be done through the bioperl-run package, which wraps EMBOSS programs such as seqret.

If you are learning your way around NGS data, I recommend using the established toolkits. BioPerl is helpful, but not for parsing fastq data, because it's just too slow.

EDIT: fix link that broke due to conversion to the new website.

0
Entering edit mode

"Don't use awk/shell to parse fastq. None of those solutions work for multi-line records." I do agree. But (1) there is no dependency, and (2) most of the time the fastq has 4 lines.

0
Entering edit mode

I agree on the first point; BioPerl and EMBOSS are both really large. But if someone is learning to parse these files, I don't think handing them a long shell pipeline is the best way to learn. Learning these tools lets you spend less time writing and debugging basic parsing code so you can focus on higher-level problems. (And BioPerl was part of the question.)

0
Entering edit mode

(1) By definition one fastq entry consists of 4 lines, and (2) for simple tasks like converting fastq files to fasta files I absolutely recommend using command-line pipes for learning and understanding the formats. Using and blindly trusting huge packages is dangerous, especially when you are new to the field. BUT, of course, one needs to check, check, check and double-check the results. But that's bioinformatics!

0
Entering edit mode

1) In my experience, we can't assume this about formats. The first line of the FASTQ definition on Wikipedia says, "A FASTQ file normally uses four lines per sequence." Normally, but not always. The convention for FASTA is for the sequence part to be less than 80 columns wide, and that's what many applications (like EMBOSS) will generate for FASTQ as well. I wish it were always 4 lines; that would be so much easier.

2) I'd be interested in what others think, but I would say learn a scripting language, not shell, for parsing. It's almost 2014, and modern scripting languages are far more expressive and nearly as fast as compiled languages for many tasks. If you do everything with these pipes, the next time you use a command you will be debugging it, looking at it and wondering what in the world it does. I used to do that but wouldn't advise it. Also, that seqret command I posted is 2X faster than Irsan's answer, and much, much shorter. It's not as fast as Pierre's answer, which is 10X faster than Irsan's, because he has super-shell skills.
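To make the multi-line concern concrete, here is a minimal Python sketch of a FASTQ reader that tolerates wrapped sequence and quality lines (the function name is illustrative). The subtle part is that a quality line may itself begin with '@', so record boundaries have to be found by comparing lengths, not by scanning for the next '@':

```python
def read_fastq_multiline(path):
    """Yield (name, seq, qual) from FASTQ, allowing wrapped seq/qual lines.

    A quality line may start with '@', so the end of the quality block is
    detected by accumulated length, not by looking for the next header.
    """
    with open(path) as fh:
        line = fh.readline()
        while line:
            name = line.rstrip()[1:]              # drop leading '@'
            seq_parts = []
            line = fh.readline()
            while line and not line.startswith("+"):
                seq_parts.append(line.strip())    # sequence may span lines
                line = fh.readline()
            seq = "".join(seq_parts)
            qual_parts, qual_len = [], 0
            while qual_len < len(seq):            # read until lengths match
                q = fh.readline().strip()
                qual_parts.append(q)
                qual_len += len(q)
            yield name, seq, "".join(qual_parts)
            line = fh.readline()                  # next '@' header or EOF
```

seqtk and seqret handle this case natively; the sketch only illustrates why naive NR%4 arithmetic fails on such files.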

1
Entering edit mode

In your experience, you can't assume this about formats. :) But how did you learn that? I'm 100% with you... in bioinformatics the file formats are not really defined. But you will never learn that by using toolkits. Whether you learn it via shell pipes or a scripting language doesn't matter in my view. I use shell scripting to quickly check things, or for fastq-to-fasta conversions, and scripting languages for most other tasks... It depends on what you need. 10x faster is nice if you need a script often; just to check something, I don't care about speed.

0
Entering edit mode

I learned about multi-line fastq from getting data from collaborators. A lot of biologists rely on tools like Geneious, CLC, and other desktop applications. These often produce files that break all parsers, by inserting extra information or characters that only that application can read, or by doing things like breaking the lines of all sequence data at a certain length. I found this out because I kept getting errors from a sequence-reading class I wrote that was not recognizing the format as fastq. This is very frustrating, but I decided to keep it in mind and not assume, instead of wasting more time in the future. I don't think speed is most important either; I'd rather go with methods that are easy to remember and use. For doing simple things it does not matter what your approach is, but I don't see the point of making a procedure slower and more complicated.

0
Entering edit mode

Irsan's solution is neither slow, nor complicated. Nevertheless, I think we are just talking about different things. ;) Never mind...

0
Entering edit mode

That solution is slower (2-10X slower for ~10 million sequences) than all the other answers. That doesn't really matter unless you have, say, 100 files to convert; then it would be silly to use something that you know will take hours longer. By complicated, I mean the command is longer (than seqtk or seqret) and requires five programs as opposed to one. I'm not disagreeing about using shell for changing formats; those are crucial skills to have. Though I have a hard time saying shell is the better approach in this case, given those two points. Also, the original question was about how to get a bioperl script working, so I don't think giving a shell command is the most specific answer. I think the question has been answered about 10 times now, so we are getting off topic talking about what we would do, but hopefully that helps, too. :) Cheers.

3
Entering edit mode
8.0 years ago

Here is a solution for your question and some other NGS snippets you might find useful: http://www.ecseq.com/NGSsnippets.html

3
Entering edit mode
6.6 years ago
Paul ★ 1.4k

gunzip -c in.fastq.gz | paste - - - - | awk '{print ">"$1 ; print$3;}' > out.fasta

1
Entering edit mode

You should always set a tab as the field separator in awk; otherwise awk will split on any whitespace.

time gunzip -c file.fastq.gz | paste - - - - | awk -F '\t' '{print ">"$1 ; print$2;}' >file.fa

0
Entering edit mode

Thanks a lot!

0
Entering edit mode

You are welcome... that is probably the easiest way :-)

2
Entering edit mode
6.6 years ago
wpwupingwp ▴ 110

from Bio import SeqIO
SeqIO.convert('input', 'fastq', 'output', 'fasta')

0
Entering edit mode

Thanks a lot!

0
Entering edit mode

I used this method too, but some long reads in the fastq file were split into 2 lines in the fasta file, like this:

in fastq:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCAGTCCCCTTGTTCTCTC


in fasta:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCA
GTCCCCTTGTTCTCTC


Is this something that happens commonly?

thanks.

1
Entering edit mode

It's normal for fasta files to be wrapped, typically at 70bp per line.
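If a downstream tool needs each sequence on a single line, the wrapping can be undone with a few lines of dependency-free Python (the function name is illustrative):

```python
def unwrap_fasta(in_path, out_path):
    """Rewrite a FASTA file so each record's sequence is on one line."""
    with open(in_path) as inp, open(out_path, "w") as out:
        chunks = []
        for line in inp:
            if line.startswith(">"):
                if chunks:                        # flush previous record
                    out.write("".join(chunks) + "\n")
                    chunks = []
                out.write(line)                   # header passes through
            else:
                chunks.append(line.strip())       # collect wrapped sequence
        if chunks:                                # flush the last record
            out.write("".join(chunks) + "\n")
```

Equivalently, seqtk seq -l0 (shown in lh3's answer) does the same job for both fasta and fastq.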

0
Entering edit mode
5.8 years ago
Echo ▴ 70

https://github.com/swuecho/fastq_2_fasta

C, C++, Go, Perl and Python versions

0
Entering edit mode
2.6 years ago

paste - - - - < file.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa

1
Entering edit mode

Very late indeed, as there already is an almost identical answer, posted years ago (here is the direct link to the answer). The difference being your answer has sed 's/^@/>/', while the posted has sed 's/^@/>/g'.

0
Entering edit mode
24 months ago
brianpenghe ▴ 50

You can use my script

0
Entering edit mode

What does your script do that other, more well-tested tools don't? As in, why should someone use this script instead of, say, a simple awk or sed or bioawk (Or anything from lh3's benchmarked answer)? The script won't work with compressed files, there is no performance optimization - to me that seems like a list of downsides/risks with no gain.

0
Entering edit mode

The upsides: 1. it can do read trimming at the same time, which is frequently required in small-RNA work; 2. a working command is short; 3. it's in Python.
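For comparison, fixed-length trimming during conversion takes only one extra expression over a plain converter. This is a generic Python sketch (not the poster's script, whose options aren't shown here), assuming 4-line FASTQ records:

```python
def fastq_to_fasta_trim(fastq_path, fasta_path, keep=20):
    """Convert 4-line FASTQ to FASTA, keeping only the first `keep` bases."""
    with open(fastq_path) as fq, open(fasta_path, "w") as fa:
        for i, line in enumerate(fq):
            if i % 4 == 0:                        # header: '@id' -> '>id'
                fa.write(">" + line[1:])
            elif i % 4 == 1:                      # trim sequence to length
                fa.write(line.strip()[:keep] + "\n")
```

Proper adapter trimming (as done by dedicated trimmers) is more involved than this fixed-length cut.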

1
Entering edit mode

How well tested and benchmarked are these functionalities? How is the tool being in python an advantage when tools written in C/C++ are a lot more performant in File IO?

EDIT: I apologize for coming off a little too confrontational. I think it would help to test and benchmark your tool to showcase how fast and/or accurate it is.

0
Entering edit mode

Thanks for contributing! It's probably helpful to potential users if you can provide some more documentation, usage, installation, dependencies, advantages,... in your post here.