BioPython: convert fasta to fastq without quality score input file
3
5
Entering edit mode
7.0 years ago
Josh Herr 5.7k

Here's another beginner BioPython question from me...  

I'm running some genome assemblies for someone who has some new Illumina sequence data and also had done some sequencing a few years ago.  They have some Sanger and 454 sequences (a couple thousand sequences with a couple thousand base pairs for each) and everything is in fasta format.  They are trusted contigs and I'd like to use the data to help bridge assembly gaps of Illumina sequencing data (from the same extracted DNA that was kept frozen for a few years -- same bacterial strain).  There are no quality files for the older data and if they ever existed they are long gone from the data directories.

What I want to do is create FASTQ files for the old Sanger and 454 FASTA files with a phred score of 40 -- I think that would be a "I" under PHRED-33.

(I know there are assemblers out there that take fasta files inputs along with fastq -- but my in-house makefile and pipeline does not and I really would like to incorporate these sequences in our current workflow without a lot of changes).

There are lots of resources for converting from fasta to fastq WITH a qual file -- see here, here, here, here, and here; but I've looked all over but can't find anything for converting without a qual file.

Here is a public perl script that does exactly what I want it to do -- but I can't get it to work for me and I keep getting an error.  

As an exercise, I quickly wrote the script below, but I can't get it to provide the quality score as the length of the sequence.  The quality string needs to be the same length as the input sequence string.  Anyone have any hints for me?  Any help is appreciated.

Here's my code so far:

#!/usr/bin/env python

"""
Convert FASTA to FASTQ file with a static quality read

Usage:
$ ./fasta_to_fastq NAME.fasta NAME.fastq
"""

# import libraries
import sys, os
from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator

# Get inputs
fa_path = sys.argv[1]
fq_path = sys.argv[2]
phred_quality = "I"

# Check inputs
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)

# make fastq
with open(fq_path, "w") as handle:
    records = PairedFastaQualIterator(open(fa_path), print("phred_quality"))
    count = SeqIO.write(records, handle, "fastq")

# print fastq
print "Converted %i records" % count

 

fasta fastq phred parsing BioPython • 16k views
ADD COMMENT
0
Entering edit mode

Pretty sure Q40 is an "I" (73).

ADD REPLY
0
Entering edit mode

Thanks Matt Shirley, that was my first gut feeling, but I eventually thought wrong -- fixed my mistake.

ADD REPLY
8
Entering edit mode
7.0 years ago

ADD COMMENT
1
Entering edit mode

this is a better solution as it runs as a command line tool

ADD REPLY
1
Entering edit mode

Thanks Matt Shirley, this is awesome! Much appreciated!

ADD REPLY
0
Entering edit mode

Hello Matt Shirley,

I used your code and it worked great! I have a question though, I am trying to now import these files to Geneious and it is asking "What quality encoding is used by these files?" The three options are: 1. Sanger/ Illumina 1.8+, 2. Solexa/ Illumina 1.0-1.2 or 3. Illumina 1.3-1.7. Thank you for any help you can give me!

ADD REPLY
1
Entering edit mode

I would choose Sanger since this is the standard for all new Illumina sequencing, and a Sanger scaled Phred quality score of 40 equals an error probability of 1 in 10,000. The Solexa and Illumina <1.8 quality score encoding uses a different range of ASCII characters, and so you'll want to stay away from those.

ADD REPLY
4
Entering edit mode
7.0 years ago

The solution is to create a record within each loop and assign qualities to that. This has some explanations: http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

seq = Seq("NACGTACGTA")
rec = SeqRecord(seq, id="foo")
rec.letter_annotations["phred_quality"] = [40] * len(seq)
recs = [ rec ]
handle = file("out.fq", "wt")
SeqIO.write(recs, handle, "fastq")

Better yet do it with bioawk:

$ more fasta.awk 
{ 
  qual = ""
  for (i=0; I<length($seq); i++){
    qual = qual "I"
  }
  printf ("@%s\n%s\n+\n%s\n", $name, $seq, qual)
}

then run it as

cat test.fa | awk -c fastx -f fasta.awk
ADD COMMENT
2
Entering edit mode

And for those of us who have only "vanilla" awk:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq

Command corrected as suggested by h.mon (see below).

ADD REPLY
1
Entering edit mode

very cool indeed, but I guess for a correct fastq output the command should be:

awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq
ADD REPLY
0
Entering edit mode

Oops, you are right h.mon, I used the wrong symbols (> and @ instead of @ and +).

ADD REPLY
0
Entering edit mode

Thank you, Frédéric Mahé, once again your skills come to the rescue!

ADD REPLY
0
Entering edit mode

Thank you Istvan Albert, both the python solution and the bioawk solution worked. I have to spend more time with both, but I keep forgetting to use bioawk. Thanks again!

ADD REPLY
0
Entering edit mode

You've beaten me to the punch with an explanation, so I'm just trying out Gist embedding.

ADD REPLY
4
Entering edit mode
7.0 years ago
Peter 5.9k

Since Matt & Istvan have given you SeqRecord based solutions using SeqIO, here's a light weight solution using strings:

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open("input.fasta") as in_handle:
    with open("output.fastq", "w") as out_handle:
        for title, seq in SimpleFastaParser(in_handle):
            out_handle.write("@%s\n%s\n+\n%s\n" \
                             % (title, seq, "I" * len(seq)))

Note that faking quality scores like this is probably not a good idea, and to be used with caution.

ADD COMMENT
0
Entering edit mode

Thank you, Peter, much appreciated. I need to spend some time with SeqIO.

I have thought long and hard about faking the quality scores and I'm not completely convinced it's the way to go. I'm actually running the assemblies (as a test) with just the Illumina data, the Illumina data with Sanger/454 fasta input and with Illumina data with Sanger/454 in "faked" fastq. I'm going to compare integrity to the reference material we have. Maybe I will have more on that soon.

Thanks again for your help and your open source tools. Cheers.

ADD REPLY

Login before adding your answer.

Traffic: 2978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6