Fastq Convert To Fasta
10
16
Entering edit mode
11.0 years ago
biolab ★ 1.4k

Dear All, Am I right to use the following script to transform a fastq file (named test.fastq) to a fasta file? THANKS a lot!

#!/usr/bin/perl
use strict;
use Bio::SeqIO;
my $in=Bio::SeqIO->new(-file=>"test.fastq",-format=>'fastq');
my $out=Bio::SeqIO->new(-file=>">test.fasta",-format=>'fasta');
while(my $seq=$in->next_seq()){
      $out->write_seq($seq);
}
perl • 135k views
ADD COMMENT
11
Entering edit mode

why so complicated ?

gunzip -c file.fq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
ADD REPLY
25
Entering edit mode

Why so complicated? :)

seqtk seq -a in.fastq.gz > out.fasta
ADD REPLY
1
Entering edit mode

seqkit fq2fa in.fastq.gz -o out.fasta

https://bioinf.shenwei.me/seqkit/usage/#fq2fa

ADD REPLY
7
Entering edit mode

and 2.4 years later...

gunzip -c file.fq.gz | paste - - - -  | cut -f 1,2 | sed 's/^/>/'  | tr "\t" "\n" > file.fa
ADD REPLY
2
Entering edit mode

Hi Pierre, I recently learned that the '@' from a fastq file is trouble when it's left in the fasta file for alignment... https://github.com/samtools/samtools/issues/773

So maybe your oneliner could use another update almost two years later... :)

ADD REPLY
2
Entering edit mode

Hi Wouter,

unpigz -cp16 file.fq.gz | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa

Here's an update with '@' removal and multi-threaded unzipping thrown in as a bonus. https://zlib.net/pigz/

ADD REPLY
0
Entering edit mode

Common problem for new folk, no need for a perl script to do this, built in commands like that posted by Pierre to use awk are good.

ADD REPLY
0
Entering edit mode

Or this might be a little easier to remember / type:

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage

ADD REPLY
0
Entering edit mode

Can we also apply this for whole folder. so that all file can we converted at once..

Thanks
Sandeep

ADD REPLY
37
Entering edit mode
11.0 years ago
lh3 33k

The following list shows the time for converting 2 million 100bp sequences in fastq to fasta with different approaches (locale set to "C"):

================================================================================================
Real(s)    CPU(s)   Command line
------------------------------------------------------------------------------------------------
  1.8        1.8    seqtk seq -A t.fq > /dev/null
  3.1        3.1    sed -n '1~4s/^@/>/p;2~4p' t.fq > /dev/null
  5.8       12.4    paste - - - - < t.fq | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > /dev/null
  7.6        7.5    bioawk -c fastx '{print ">"$name"\n"$seq}' t.fq > /dev/null
 11.9       12.9    awk 'NR%4==1||NR%4==2' t.fq | tr "@" ">" > /dev/null
 22.2       22.2    seqret -sequence t.fq -out /dev/null        # 6.4.0
 26.5       25.4    fastq_to_fasta -Q32 -i t.fq -o /dev/null    # 0.0.13
================================================================================================

In the list, seqtk, bioawk and seqret work with multi-line fastq; the rest don't. If you just want to use the standard unix tools, rtliu's sed solution is preferred, both short and efficient. It should be noted that file t.fq is put in /dev/shm and the results are written to /dev/null. In real applications, I/O may take more wall-clock time than CPU. In addition, frequently the sequence file is gzip'd. For seqtk, decompression takes more CPU time than parsing fastq.

Additional comments:

  • SES observed that seqret was faster than Irsan's command. At my hand, seqret is always slower than most. Is it because of version, locale or something else?

  • I have not tried the native bioperl parser. Probably it is much slower. Using bioperl on large fastq is discouraged.

  • I do agree 4-line fastq is much more convenient for many people. However, fastq is never "defined" to consist of 4 lines. When it was first introduced at the Sanger Institute for ~1kb capillary reads, it allows multiple lines.

  • Tools in many ancient unix distributions (e.g. older AIX) do not work with long lines. I was working with a couple of such unix even in 2005. I guess this is why older outputs/formats, such as fasta, blast and genbank/embl, used short lines. This is not a concern any more nowadays.

  • To convert multi-line fastq to 4-line (or multi-line fasta to 2-line fasta): seqtk seq -l0 multi-line.fq > 4-line.fq

ADD COMMENT
2
Entering edit mode

Tranks for taking the time and checking the different solutions!

ADD REPLY
2
Entering edit mode

Here are my observations with a file of 2 million 100 bp sequences (EMBOSS 6.5.7.0; Fedora 17 desktop):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m16.919s
user    0m19.930s
sys    0m1.439s

$ time seqret -sequence t.fq -outseq stdout > /dev/null
real    0m9.435s
user    0m9.329s
sys    0m0.091s

Here is the same analysis, but using EMBOSS 6.4.0 and a different machine (RHEL 5.9 server):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m6.902s
user    0m10.273s
sys    0m10.506s

$ time /usr/local/emboss/6.4.0/bin/seqret -sequence t.fq -outseq stdout > /dev/null
real    0m22.571s
user    0m20.679s
sys    0m1.888s

This would suggest maybe there are some differences between versions, which I had not considered until I saw your review.

ADD REPLY
0
Entering edit mode

Thanks for the evaluation. I have tried the latest emboss on the same old CentOS server. Its speed is pretty much the same. I guess these two command lines are sensitive to the system configuration. As to the "paste" command line, could you confirm that the locale is set to "C"?

ADD REPLY
0
Entering edit mode

I checked and my locale on both machines is set to UTF-8, which I read is the default for most linux systems. I also read somewhere that this affects the sort order, but I'm uncertain about what that means exactly. EDIT: Here's what the manpage of sort says: "** WARNING ** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values." I guess that means sort could behave differently on different systems. This is interesting, I'll keep this information in mind.

ADD REPLY
0
Entering edit mode

You may set "export LC_ALL=C" and then see if the speed changes. Sometimes, locale has a dramatic hit to performance on text processing. For example, regex matching on UTF-8 is much slower than on ASCII.

ADD REPLY
24
Entering edit mode
11.0 years ago
Irsan ★ 7.8k

If you want to convert fastq to fasta try:

cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n'
ADD COMMENT
1
Entering edit mode

4 years later. This is amazing!

ADD REPLY
0
Entering edit mode

I like this one, nice.

ADD REPLY
0
Entering edit mode

Inspired by najim ameziane :-)

ADD REPLY
0
Entering edit mode

shame on me :-) Your solution is far more elegant :-)

ADD REPLY
0
Entering edit mode

@irsan A small typo in your one-liner. It should be test.fastq

ADD REPLY
1
Entering edit mode

OK, I have changed fasta to fastq

ADD REPLY
11
Entering edit mode
11.0 years ago
rtliu ★ 2.2k

This is the fastest bash one-liner to convert fastq to fasta, assuming 4 lines per FASTQ record, more on https://github.com/stephenturner/oneliners

sed -n '1~4s/^@/>/p;2~4p' test.fastq > test.fasta
ADD COMMENT
3
Entering edit mode

I can't beat this.

However, this is extremely close (within 3% or so):

perl -ne 'y/@/>/;print($_.<>)&&<>&&<>' test.fastq > test.fasta

It is also possible to write a terser Perl one-liner but a tad slower, or even a faster version, but longer.

ADD REPLY
10
Entering edit mode
11.0 years ago
SES 8.6k

Don't use awk/shell to parse fastq. None of those solutions work for multi-line records (except seqtk). You can save yourself a lot of trouble by using a toolkit like EMBOSS for this:

seqret -sequence reads.fastq -outseq reads.fasta

Since you were trying for a BioPerl solution, here is the same thing using the bioperl-run package:

If you are learning your way around NGS data, I recommend using the established tookits. BioPerl is helpful, but not for parsing fastq data because it's just too slow.

EDIT: fix link that broke due to conversion to the new website.

ADD COMMENT
0
Entering edit mode

"Don't use awk/shell to parse fastq. None of those solutions work for multi-line records " I do agree. But (1) no dependency (2)most of the time, the fastq have 4 lines.

ADD REPLY
0
Entering edit mode

I agree on the first point, BioPerl and EMBOSS are both really large. But, if someone is learning to parse these files, I don't think handing them a long shell pipeline is the best way to learn. Learning these tools allows you spend less time writing and debugging basic parsing code so you can focus on higher level problems. (And, BioPerl was a part of the question).

ADD REPLY
0
Entering edit mode

(1) By definition one fastq entry consists of 4 lines and (2) for simple tasks like converting fastq files to fasta files I absolutely recommend using command line pipes for learning and understanding the formats. Using and blindly trusting huge packages is dangerous, especially when you are new to the field. BUT, of course, one needs to check check check and double-check the results. But that's bioinformatics!

ADD REPLY
0
Entering edit mode

1) In my experience, we can't assume this about formats. The first line of the FASTQ definition on wikipedia says, "A FASTQ file normally uses four lines per sequence." Normally, but not always. The convention for FASTA is for the sequence part to be less than 80 columns wide, and that's what many applications (like EMBOSS) will generate with FASTQ, as well. I wish it was always 4 lines, that would be so much easier. 2) I'd be interested in what others think, but I would say learn a scripting language, not shell, for parsing. It's almost 2014 and modern scripting languages are way more expressive and nearly as fast as compiled languages for many tasks. If you do everything with these pipes you will be debugging the next time you use a command, and you will look at it and wonder what in the world it does. I used to do that but wouldn't advise it. Also, that seqret command I posted is 2X faster than Irsan's answer, and much, much shorter. It's not as fast as Pierre's answer, his is 10X faster than Irsan's, because he has super-shell skills.

ADD REPLY
1
Entering edit mode

In your experience, you can't assume this about formats. :) But how did you learn that? I'm 100% with you... in bioinformatics the file formats are not really defined. But you will never learn that by using toolkits. If you want to learn that by shell pipes, or a scripting language doesn't matter in my view. I use shell scripting to fastly check things, or for fastq to fasta conversions and scripting languages for mostly all other tasks... Depends on what you need. 10x faster is nice, if you need a script more often. Just to check something, I don't care about speed.

ADD REPLY
0
Entering edit mode

I learned about seeing multi-line fastq from getting data from collaborators. A lot of biologists rely on tools like Geneious, CLC, and other desktop applications. These often produce files that break all parsers by inserting extra information or characters that only that application can read, or by doing things like breaking the lines of all sequence data at a certain length. I found this because I kept getting errors from a sequence reading class I wrote that was not recognizing the format as fastq. This is very frustrating, but I decided to keep it in mind and not assume, instead of wasting more time the future. I don't think speed is most important either, I'd rather go with methods that are easy to remember and use. For doing simple things, it does not matter what your approach is, but I don't see the point of making a procedure slower and more complicated.

ADD REPLY
0
Entering edit mode

Irsan's solution is neither slow, nor complicated. Nevertheless, I think we are just talking about different things. ;) Never mind...

ADD REPLY
0
Entering edit mode

That solution is slower (2-10X slower for ~10 million sequences) than all the other answers. That doesn't really matter unless you have like 100 files to convert, then it would be silly to use something that you know will take hours longer. By complicated, I mean the command is longer (than seqtk or seqret) and requires five programs as opposed to one. I'm not disagreeing about using shell for changing formats, those are crucial skills to have. Though, I just have a hard time saying that shell is the better approach in this case, given those two points. Also, the original question was about how to get a bioperl script working, so I don't think giving a shell command is the most specific answer. I think the question has been answered about 10 times now, so we are getting off topic talking about what we would do, but hopefully that helps, too. :) Cheers.

ADD REPLY
3
Entering edit mode
11.0 years ago

Here is a solution for your question and some other NGS snippets you might find useful: http://www.ecseq.com/NGSsnippets.html

ADD COMMENT
3
Entering edit mode
9.6 years ago
Paul ★ 1.5k

what about

gunzip -c in.fastq.gz | paste - - - - | awk '{print ">"$1 ; print $3;}' > out.fasta
ADD COMMENT
1
Entering edit mode

you should always put a tab in awk, otherwise, awk might use space as delimiter.

time gunzip -c file.fastq.gz | paste - - - - | awk -F '\t' '{print ">"$1 ; print $2;}' >file.fa
ADD REPLY
0
Entering edit mode

Thanks a lot!

ADD REPLY
0
Entering edit mode

You are welcome.. that is probably most easy way :-)

ADD REPLY
3
Entering edit mode
9.6 years ago
wpwupingwp ▴ 120
from Bio import SeqIO
SeqIO.convert('input','fastq','output','fasta')
ADD COMMENT
0
Entering edit mode

Thanks a lot!

ADD REPLY
0
Entering edit mode

I used this method too, but some long reads in fastq file were truncated in to 2 lines in fasta file, like this:

in fastq:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCAGTCCCCTTGTTCTCTC

in fasta:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCA
GTCCCCTTGTTCTCTC

is the the case that happens commonly?

thanks.

ADD REPLY
2
Entering edit mode

It's normal for fasta files to be wrapped, typically at 70bp per line.

ADD REPLY
0
Entering edit mode
8.9 years ago
Echo ▴ 70

https://github.com/swuecho/fastq_2_fasta

c,c++,go,perl, python version

ADD COMMENT
0
Entering edit mode
5.6 years ago

Its probably late to reply but if anyone wondering about this issue they might use this simple command

paste - - - - < file.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa
ADD COMMENT
1
Entering edit mode

Very late indeed, as there already is an almost identical answer, posted years ago (here is the direct link to the answer). The difference being your answer has sed 's/^@/>/', while the posted has sed 's/^@/>/g'.

ADD REPLY
0
Entering edit mode
5.0 years ago
brianpenghe ▴ 80

You can use my script

ADD COMMENT
0
Entering edit mode

What does your script do that other, more well-tested tools don't? As in, why should someone use this script instead of, say, a simple awk or sed or bioawk (Or anything from lh3's benchmarked answer)? The script won't work with compressed files, there is no performance optimization - to me that seems like a list of downsides/risks with no gain.

ADD REPLY
0
Entering edit mode

The upsides: 1. it can do reads trimming at the same time, which is frequently required in small RNA fields 2. a functional command is short 3. it's in python

But I have to admit that Irsan's answer is good enough.

ADD REPLY
1
Entering edit mode

How well tested and benchmarked are these functionalities? How is the tool being in python an advantage when tools written in C/C++ are a lot more performant in File IO?

EDIT: I apologize for coming off a little too confrontational. I think it would help to test and benchmark your tool to showcase how fast and/or accurate it is.

ADD REPLY
0
Entering edit mode

Thanks for contributing! It's probably helpful to potential users if you can provide some more documentation, usage, installation, dependencies, advantages,... in your post here.

ADD REPLY

Login before adding your answer.

Traffic: 2309 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6