Question: Fastq Convert To Fasta
written 3.9 years ago by biolab (950):

Dear All, Am I right to use the following script to transform a fastq file (named test.fastq) to a fasta file? THANKS a lot!

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Read FASTQ records and write them back out as FASTA (qualities are dropped).
my $in  = Bio::SeqIO->new(-file => "test.fastq",  -format => 'fastq');
my $out = Bio::SeqIO->new(-file => ">test.fasta", -format => 'fasta');
while (my $seq = $in->next_seq()) {
    $out->write_seq($seq);
}
perl • 48k views
modified 21 months ago by Echo (70) • written 3.9 years ago by biolab (950)

Why so complicated?

  gunzip -c file.fq.gz | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
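For illustration, here is the same awk run on a tiny uncompressed file (the two reads below are made up); note that `substr($0,2)` keeps the entire header line, including any text after the first space:

```shell
# Hypothetical two-record FASTQ (made-up reads):
printf '@r1 desc\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' > demo.fq

# Same awk as above, minus the gunzip step:
awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' demo.fq
# -> >r1 desc
#    ACGT
#    >r2
#    TTGG
```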
written 3.9 years ago by Pierre Lindenbaum (98k)

Why so complicated? :)

seqtk seq -a in.fastq.gz > out.fasta
modified 3.9 years ago by Ido Tamir (4.7k) • written 3.9 years ago by matted (6.5k)

and 2.4 years later...

  gunzip -c file.fq.gz | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa

written 17 months ago by Pierre Lindenbaum (98k)

This is a common problem for newcomers; there's no need for a Perl script to do this. Built-in commands like the awk one-liner Pierre posted work well.

written 3.9 years ago by rob234king (530)

Or this might be a little easier to remember / type:

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastq_to_fasta_usage

written 3.9 years ago by Madelaine Gogol (4.8k)

Can we also apply this to a whole folder, so that all the files are converted at once?

Thanks

Sandeep

written 3.2 years ago by Sandy (0)
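In reply to the batch question above: a plain shell loop over a glob does it. A minimal sketch, assuming files named `*.fastq` with strict 4-line records (the sed address syntax requires GNU sed; swap in any converter from this thread):

```shell
# Convert every .fastq in the current directory to a matching .fasta.
for f in *.fastq; do
    sed -n '1~4s/^@/>/p;2~4p' "$f" > "${f%.fastq}.fasta"
done
```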
written 3.9 years ago by lh3 (30k), United States:

The following table shows the time for converting 2 million 100bp sequences from fastq to fasta with different approaches (locale set to "C"):

================================================================================================
Real(s)    CPU(s)   Command line
------------------------------------------------------------------------------------------------
  1.8        1.8    seqtk seq -A t.fq > /dev/null
  3.1        3.1    sed -n '1~4s/^@/>/p;2~4p' t.fq > /dev/null
  5.8       12.4    paste - - - - < t.fq | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n' > /dev/null
  7.6        7.5    bioawk -c fastx '{print ">"$name"\n"$seq}' t.fq > /dev/null
 11.9       12.9    awk 'NR%4==1||NR%4==2' t.fq | tr "@" ">" > /dev/null
 22.2       22.2    seqret -sequence t.fq -out /dev/null        # 6.4.0
 26.5       25.4    fastq_to_fasta -Q32 -i t.fq -o /dev/null    # 0.0.13
================================================================================================

In the table, seqtk, bioawk and seqret work with multi-line fastq; the rest don't. If you just want standard Unix tools, rtliu's sed solution is preferred: it is both short and efficient. Note that the file t.fq is placed in /dev/shm and the results are written to /dev/null; in real applications, I/O may take more wall-clock time than CPU. In addition, the sequence file is frequently gzip'd; for seqtk, decompression takes more CPU time than parsing the fastq itself.

Additional comments:

  • SES observed that seqret was faster than Irsan's command. On my machine, seqret is always slower than most. Is it because of version, locale or something else?

  • I have not tried the native BioPerl parser; it is probably much slower. Using BioPerl on large fastq files is discouraged.

  • I do agree that 4-line fastq is much more convenient for many people. However, fastq was never "defined" to consist of 4 lines. When it was first introduced at the Sanger Institute for ~1kb capillary reads, it allowed multiple lines.

  • Tools in many ancient Unix distributions (e.g. older AIX) do not work with long lines; I was still working with a couple of such systems in 2005. I guess this is why older outputs/formats, such as fasta, blast and genbank/embl, used short lines. This is no longer a concern nowadays.

  • To convert multi-line fastq to 4-line (or multi-line fasta to 2-line fasta): seqtk seq -l0 multi-line.fq > 4-line.fq

written 3.9 years ago by lh3 (30k)

Thanks for taking the time to check the different solutions!

written 3.9 years ago by David Langenberger (7.9k)

Here are my observations with a file of 2 million 100 bp sequences (EMBOSS 6.5.7.0; Fedora 17 desktop):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m16.919s
user    0m19.930s
sys    0m1.439s

$ time seqret -sequence t.fq -outseq stdout > /dev/null
real    0m9.435s
user    0m9.329s
sys    0m0.091s

Here is the same analysis, but using EMBOSS 6.4.0 and a different machine (RHEL 5.9 server):

$ time cat t.fq | paste - - - - | cut -f1-2 | sed 's/^@/>/g' | tr '\t' '\n' > /dev/null
real    0m6.902s
user    0m10.273s
sys    0m10.506s

$ time /usr/local/emboss/6.4.0/bin/seqret -sequence t.fq -outseq stdout > /dev/null
real    0m22.571s
user    0m20.679s
sys    0m1.888s

This suggests there may be some differences between versions, which I had not considered until I saw your review.

modified 3.9 years ago • written 3.9 years ago by SES (7.9k)

Thanks for the evaluation. I have tried the latest EMBOSS on the same old CentOS server; its speed is pretty much the same. I guess these two command lines are sensitive to the system configuration. As to the "paste" command line, could you confirm that the locale is set to "C"?

written 3.9 years ago by lh3 (30k)

I checked, and my locale on both machines is set to UTF-8, which I read is the default for most Linux systems. I also read somewhere that this affects the sort order, but I'm uncertain what that means exactly. EDIT: Here's what the manpage of sort says: "** WARNING ** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values." I guess that means sort could behave differently on different systems. This is interesting; I'll keep it in mind.

modified 3.9 years ago • written 3.9 years ago by SES (7.9k)

You may set "export LC_ALL=C" and then see if the speed changes. Sometimes the locale has a dramatic impact on text-processing performance; for example, regex matching on UTF-8 is much slower than on ASCII.
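The sort-order effect mentioned in the manpage is easy to demonstrate: in the C locale, collation is by raw byte value, so every uppercase letter sorts before every lowercase one.

```shell
# 'B' is byte 0x42, 'a' is 0x61, so the C locale puts B first;
# typical UTF-8 locales collate case-insensitively and would print a first.
printf 'a\nB\n' | LC_ALL=C sort
# -> B
#    a
```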

written 3.9 years ago by lh3 (30k)

written 3.9 years ago by Irsan (6.3k), Amsterdam:

If you want to convert fastq to fasta try:

cat test.fastq | paste - - - - | sed 's/^@/>/g'| cut -f1-2 | tr '\t' '\n'
modified 11 months ago • written 3.9 years ago by Irsan (6.3k)

I like this one, nice.

written 3.9 years ago by matted (6.5k)

Inspired by najim ameziane :-)

written 3.9 years ago by Irsan (6.3k)

shame on me :-) Your solution is far more elegant :-)

written 3.9 years ago by Pierre Lindenbaum (98k)

@Irsan A small typo in your one-liner: it should be test.fastq.

modified 11 months ago • written 11 months ago by Prakki Rama (2.0k)

OK, I have changed fasta to fastq.

written 11 months ago by Irsan (6.3k)

written 3.9 years ago by rtliu (2.0k), New Zealand:

This is the fastest bash one-liner to convert fastq to fasta, assuming 4 lines per FASTQ record (note that the 1~4 step addressing is a GNU sed extension); more at https://github.com/stephenturner/oneliners

sed -n '1~4s/^@/>/p;2~4p' test.fastq > test.fasta
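A quick sanity check on a hypothetical two-record file, plus a demonstration of why the 4-lines-per-record assumption matters (BSD/macOS sed does not support the 1~4 / 2~4 step addresses):

```shell
# Strict 4-line records convert cleanly:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n' | sed -n '1~4s/^@/>/p;2~4p'
# -> >r1
#    ACGT
#    >r2
#    TTGG

# But a record whose sequence is wrapped over two lines breaks the line
# arithmetic: the second half of the sequence is silently dropped.
printf '@r1\nACGT\nACGT\n+\nIIIIIIII\n' | sed -n '1~4s/^@/>/p;2~4p'
# -> >r1
#    ACGT
```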
modified 3.9 years ago • written 3.9 years ago by rtliu (2.0k)

I can't beat this. However, this is extremely close (within 3% or so):

perl -ne 'y/@/>/;print($_.<>)&&<>&&<>' test.fastq > test.fasta

A terser Perl one-liner is possible (but a tad slower), as is a faster one (but longer).

written 3.9 years ago by emazep (20)

written 3.9 years ago by SES (7.9k), Vancouver, BC:

Don't use awk/shell to parse fastq. None of those solutions work for multi-line records (except seqtk). You can save yourself a lot of trouble by using a toolkit like EMBOSS for this:

seqret -sequence reads.fastq -outseq reads.fasta

Since you were trying for a BioPerl solution, here is the same thing using the bioperl-run package:

If you are learning your way around NGS data, I recommend using established toolkits. BioPerl is helpful, but not for parsing fastq data; it's just too slow.

EDIT: fix link that broke due to conversion to the new website.

modified 3.0 years ago • written 3.9 years ago by SES (7.9k)

"Don't use awk/shell to parse fastq. None of those solutions work for multi-line records" — I do agree. But (1) there are no dependencies, and (2) most of the time the fastq has 4 lines per record.

written 3.9 years ago by Pierre Lindenbaum (98k)

I agree on the first point; BioPerl and EMBOSS are both really large. But if someone is learning to parse these files, I don't think handing them a long shell pipeline is the best way to learn. Learning these tools lets you spend less time writing and debugging basic parsing code so you can focus on higher-level problems. (And BioPerl was part of the question.)

modified 3.9 years ago • written 3.9 years ago by SES (7.9k)

(1) By definition, one fastq entry consists of 4 lines, and (2) for simple tasks like converting fastq files to fasta files I absolutely recommend using command-line pipes, both for learning and for understanding the formats. Using and blindly trusting huge packages is dangerous, especially when you are new to the field. BUT, of course, one needs to check, check, check and double-check the results. But that's bioinformatics!

written 3.9 years ago by David Langenberger (7.9k)

1) In my experience, we can't assume this about formats. The first line of the FASTQ definition on Wikipedia says, "A FASTQ file normally uses four lines per sequence." Normally, but not always. The convention for FASTA is for the sequence part to be wrapped at less than 80 columns, and that's what many applications (like EMBOSS) will generate for FASTQ as well. I wish it were always 4 lines; that would be so much easier.

2) I'd be interested in what others think, but I would say learn a scripting language, not shell, for parsing. It's almost 2014, and modern scripting languages are far more expressive and nearly as fast as compiled languages for many tasks. If you do everything with these pipes, you will be debugging the next time you use a command; you will look at it and wonder what in the world it does. I used to do that, but wouldn't advise it. Also, the seqret command I posted is 2X faster than Irsan's answer and much, much shorter. It's not as fast as Pierre's answer, which is 10X faster than Irsan's, because he has super-shell skills.

written 3.9 years ago by SES (7.9k)

In your experience, you can't assume this about formats. :) But how did you learn that? I'm 100% with you: in bioinformatics, file formats are not really defined. But you will never learn that by using toolkits. Whether you learn it through shell pipes or a scripting language doesn't matter, in my view. I use shell scripting to check things quickly, and for fastq-to-fasta conversions, and scripting languages for most other tasks; it depends on what you need. 10x faster is nice if you need a script often; just to check something, I don't care about speed.

written 3.9 years ago by David Langenberger (7.9k)

I learned about multi-line fastq from getting data from collaborators. A lot of biologists rely on tools like Geneious, CLC, and other desktop applications, which often produce files that break all parsers by inserting extra information or characters that only that application can read, or by breaking the lines of all sequence data at a certain length. I found this because I kept getting errors from a sequence-reading class I wrote that was not recognizing the format as fastq. It is very frustrating, but I decided to keep it in mind and not assume, instead of wasting more time in the future. I don't think speed is most important either; I'd rather go with methods that are easy to remember and use. For simple things it does not matter what your approach is, but I don't see the point of making a procedure slower and more complicated.

written 3.9 years ago by SES (7.9k)

Irsan's solution is neither slow nor complicated. Nevertheless, I think we are just talking about different things. ;) Never mind...

written 3.9 years ago by David Langenberger (7.9k)

That solution is slower (2-10X slower for ~10 million sequences) than all the other answers. That doesn't really matter unless you have like 100 files to convert, then it would be silly to use something that you know will take hours longer. By complicated, I mean the command is longer (than seqtk or seqret) and requires five programs as opposed to one. I'm not disagreeing about using shell for changing formats, those are crucial skills to have. Though, I just have a hard time saying that shell is the better approach in this case, given those two points. Also, the original question was about how to get a bioperl script working, so I don't think giving a shell command is the most specific answer. I think the question has been answered about 10 times now, so we are getting off topic talking about what we would do, but hopefully that helps, too. :) Cheers.

written 3.9 years ago by SES (7.9k)

written 2.5 years ago by Paul (700), European Union:

What about:

gunzip -c in.fastq.gz | paste - - - - | awk -F '\t' '{print ">" substr($1,2); print $2}' > out.fasta

(Splitting on tabs with -F '\t' keeps headers that contain spaces intact, and substr strips the leading @; printing $1 and $3 with the default field separator would emit ">@..." and break on such headers.)
written 2.5 years ago by Paul (700)

Thanks a lot!

written 2.5 years ago by biolab (950)

You are welcome... that is probably the easiest way :-)

written 2.2 years ago by Paul (700)

written 3.9 years ago by David Langenberger (7.9k), Deutschland:

Here is a solution for your question and some other NGS snippets you might find useful: http://www.ecseq.com/NGSsnippets.html

modified 3.9 years ago • written 3.9 years ago by David Langenberger (7.9k)

written 2.5 years ago by wpwupingwp (110), China:

from Bio import SeqIO

SeqIO.convert('input.fastq', 'fastq', 'output.fasta', 'fasta')

written 2.5 years ago by wpwupingwp (110)

Thanks a lot!

written 2.5 years ago by biolab (950)

I used this method too, but some long reads in the fastq file were split onto 2 lines in the fasta file, like this:

in fastq:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCAGTCCCCTTGTTCTCTC

in fasta:

GCGCCGCGACCGGCTCCGGTTCGGCTTGGAAAGTTCGGAAGGGCACCGGGTTAAATCCCA
GTCCCCTTGTTCTCTC

 

Is this something that happens commonly?

Thanks.

written 2.2 years ago by xiangwulu (20)

It's normal for fasta files to be wrapped, typically at 70bp per line.
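If a downstream tool insists on one sequence line per record, the wrapping can be undone without extra dependencies. A small awk sketch (seqtk seq -l0, mentioned above, does the same job):

```shell
# A wrapped FASTA record, as many tools emit it:
printf '>r1\nGCGCCGCGAC\nCGGCTCCGGT\n' > wrapped.fa

# Re-join the sequence lines of each record onto a single line.
awk '/^>/ { if (seq) print seq; print; seq = ""; next }
     { seq = seq $0 }
     END { if (seq) print seq }' wrapped.fa
# -> >r1
#    GCGCCGCGACCGGCTCCGGT
```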

written 2.2 years ago by Brian Bushnell (14k)

written 21 months ago by Echo (70), United States:

https://github.com/swuecho/fastq_2_fasta

C, C++, Go, Perl, and Python versions

written 21 months ago by Echo (70)
Powered by Biostar version 2.3.0