Question: BioPython: convert fasta to fastq without quality score input file
4
gravatar for Josh Herr
3.3 years ago by
Josh Herr5.4k
University of Nebraska
Josh Herr5.4k wrote:

Here's another beginner BioPython question from me...  

I'm running some genome assemblies for someone who has some new Illumina sequence data and also had done some sequencing a few years ago.  They have some Sanger and 454 sequences (a couple thousand sequences with a couple thousand base pairs for each) and everything is in fasta format.  They are trusted contigs and I'd like to use the data to help bridge assembly gaps of Illumina sequencing data (from the same extracted DNA that was kept frozen for a few years -- same bacterial strain).  There are no quality files for the older data and if they ever existed they are long gone from the data directories.

What I want to do is create FASTQ files for the old Sanger and 454 FASTA files with a phred score of 40 -- I think that would be a "I" under PHRED-33.

(I know there are assemblers out there that take fasta files inputs along with fastq -- but my in-house makefile and pipeline does not and I really would like to incorporate these sequences in our current workflow without a lot of changes).

There are lots of resources for converting from fasta to fastq WITH a qual file -- see here, here, here, here, and here; but I've looked all over but can't find anything for converting without a qual file.

Here is a public perl script that does exactly what I want it to do -- but I can't get it to work for me and I keep getting an error.  

As an exercise, I quickly wrote the script below, but I can't get it to provide the quality score as the length of the sequence.  The quality string needs to be the same length as the input sequence string.  Anyone have any hints for me?  Any help is appreciated.

Here's my code so far:

#!/usr/bin/env python

"""
Convert FASTA to FASTQ file with a static quality read

Usage:
$ ./fasta_to_fastq NAME.fasta NAME.fastq
"""

# import libraries
import sys, os
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator

# Get inputs
fa_path = sys.argv[1]
fq_path = sys.argv[2]
phred_quality = "I"

# Check inputs
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)

# make fastq
with open(fq_path, "w") as handle:
    records = PairedFastaQualIterator(open(fa_path), print("phred_quality"))
    count = SeqIO.write(records, handle, "fastq")

# print fastq
print "Converted %i records" % count

 

ADD COMMENTlink modified 10 months ago by Biostar ♦♦ 20 • written 3.3 years ago by Josh Herr5.4k

Pretty sure Q40 is an "I" (73).

ADD REPLYlink written 3.3 years ago by Matt Shirley7.7k

Thanks Matt Shirley, that was my first gut feeling, but I eventually thought wrong -- fixed my mistake.  

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Josh Herr5.4k
8
gravatar for Matt Shirley
3.3 years ago by
Matt Shirley7.7k
Cambridge, MA
Matt Shirley7.7k wrote:

ADD COMMENTlink written 3.3 years ago by Matt Shirley7.7k
1

this is a better solution as it runs as a command line tool

ADD REPLYlink written 3.3 years ago by Istvan Albert ♦♦ 73k
1

Thanks Matt Shirley, this is awesome!  Much appreciated!

ADD REPLYlink written 3.3 years ago by Josh Herr5.4k

Hello Matt Shirley,

I used your code and it worked great! I have a question though, I am trying to now import these files to Geneious and it is asking "What quality encoding is used by these files?" The three options are: 1. Sanger/ Illumina 1.8+, 2. Solexa/ Illumina 1.0-1.2 or 3. Illumina 1.3-1.7. Thank you for any help you can give me!

ADD REPLYlink modified 15 months ago • written 15 months ago by lizgzara10
1

I would choose Sanger since this is the standard for all new Illumina sequencing, and a Sanger scaled Phred quality score of 40 equals an error probability of 1 in 10,000. The Solexa and Illumina <1.8 quality score encoding uses a different range of ASCII characters, and so you'll want to stay away from those.

ADD REPLYlink written 15 months ago by Matt Shirley7.7k
4
gravatar for Istvan Albert
3.3 years ago by
Istvan Albert ♦♦ 73k
University Park, USA
Istvan Albert ♦♦ 73k wrote:

The solution is to create a record within each loop and assign qualities to that.  This has some explanations: http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html 

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("NACGTACGTA")
rec = SeqRecord(seq, id="foo")
rec.letter_annotations["phred_quality"] = [40] * len(seq)
recs = [ rec ]
handle = file("out.fq", "wt")
SeqIO.write(recs, handle, "fastq")

Better yet do it with bioawk:

$ more fasta.awk 
{ 
  qual = ""
  for (i=0; i<length($seq); i++){
    qual = qual "I"
  }
  printf ("@%s\n%s\n+\n%s\n", $name, $seq, qual)
}

then run it as

cat test.fa | awk -c fastx -f fasta.awk 
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Istvan Albert ♦♦ 73k
2

And for those of us who have only "vanilla" awk:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq

Command corrected as suggested by h.mom (see below).

ADD REPLYlink modified 23 months ago • written 3.3 years ago by Frédéric Mahé2.6k
1

very cool indeed, but I guess for a correct fastq output the command should be:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq
ADD REPLYlink written 23 months ago by h.mon8.0k

Oops, you are right h.mom, I used the wrong symbols (`>` and `@` instead of `@` and `+`).

ADD REPLYlink written 23 months ago by Frédéric Mahé2.6k

Thank you, Frédéric Mahé, once again your skills come to the rescue!

ADD REPLYlink written 3.3 years ago by Josh Herr5.4k

Thank you Istvan Albert, both the python solution and the bioawk solution worked.  I have to spend more time with both, but I keep forgetting to use bioawk.  Thanks again!

ADD REPLYlink written 3.3 years ago by Josh Herr5.4k

You've beaten me to the punch with an explanation, so I'm just trying out Gist embedding.

ADD REPLYlink written 3.3 years ago by Matt Shirley7.7k
4
gravatar for Peter
3.3 years ago by
Peter5.5k
Scotland, UK
Peter5.5k wrote:

Since Matt & Istvan have given you SeqRecord based solutions using SeqIO, here's a light weight solution using strings:

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open("input.fasta") as in_handle:
    with open("output.fastq", "w") as out_handle:
        for title, seq in SimpleFastaParser(in_handle):
            out_handle.write("@%s\n%s\n+\n%s\n" \
                             % (title, seq, "I" * len(seq)))

Note that faking quality scores like this is probably not a good idea, and to be used with caution.

ADD COMMENTlink modified 2.3 years ago • written 3.3 years ago by Peter5.5k

Thank you, Peter, much appreciated.  I need to spend some time with SeqIO.

I have thought long and hard about faking the quality scores and I'm not completely convinced it's the way to go.  I'm actually running the assemblies (as a test) with just the Illumina data, the Illumina data with Sanger/454 fasta input and with Illumina data with Sanger/454 in "faked" fastq.  I'm going to compare integrity to the reference material we have.  Maybe I will have more on that soon.  

Thanks again for your help and your open source tools.  Cheers.

ADD REPLYlink written 3.3 years ago by Josh Herr5.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 763 users visited in the last hour