BioPython: convert fasta to fastq without quality score input file
3
5
Entering edit mode
8.4 years ago
Josh Herr 5.7k

Here's another beginner BioPython question from me...

I'm running some genome assemblies for someone who has some new Illumina sequence data and also had done some sequencing a few years ago. They have some Sanger and 454 sequences (a couple thousand sequences with a couple thousand base pairs for each) and everything is in fasta format. They are trusted contigs and I'd like to use the data to help bridge assembly gaps of Illumina sequencing data (from the same extracted DNA that was kept frozen for a few years -- same bacterial strain). There are no quality files for the older data and if they ever existed they are long gone from the data directories.

What I want to do is create FASTQ files for the old Sanger and 454 FASTA files with a phred score of 40 -- I think that would be a I under PHRED-33.

(I know there are assemblers out there that take fasta files inputs along with fastq -- but my in-house makefile and pipeline does not and I really would like to incorporate these sequences in our current workflow without a lot of changes).

There are lots of resources for converting from fasta to fastq WITH a qual file -- see here, here, here, here, and here; but I've looked all over but can't find anything for converting without a qual file.

Here is a public perl script that does exactly what I want it to do -- but I can't get it to work for me and I keep getting an error.

As an exercise, I quickly wrote the script below, but I can't get it to provide the quality score as the length of the sequence. The quality string needs to be the same length as the input sequence string. Anyone have any hints for me? Any help is appreciated.

Here's my code so far:

#!/usr/bin/env python

"""
Convert FASTA to FASTQ file with a static quality read

Usage:
$./fasta_to_fastq NAME.fasta NAME.fastq """ # import libraries import sys, os from Bio import SeqIO from Bio.SeqIO.QualityIO import PairedFastaQualIterator # Get inputs fa_path = sys.argv[1] fq_path = sys.argv[2] phred_quality = "I" # Check inputs if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path) # make fastq with open(fq_path, "w") as handle: records = PairedFastaQualIterator(open(fa_path), print("phred_quality")) count = SeqIO.write(records, handle, "fastq") # print fastq print "Converted %i records" % count  parsing phred fasta BioPython fastq • 19k views ADD COMMENT 0 Entering edit mode Pretty sure Q40 is an "I" (73). ADD REPLY 0 Entering edit mode Thanks Matt Shirley, that was my first gut feeling, but I eventually thought wrong -- fixed my mistake. ADD REPLY 8 Entering edit mode 8.4 years ago ADD COMMENT 1 Entering edit mode this is a better solution as it runs as a command line tool ADD REPLY 1 Entering edit mode Thanks Matt Shirley, this is awesome! Much appreciated! ADD REPLY 0 Entering edit mode Hello Matt Shirley, I used your code and it worked great! I have a question though, I am trying to now import these files to Geneious and it is asking "What quality encoding is used by these files?" The three options are: 1. Sanger/ Illumina 1.8+, 2. Solexa/ Illumina 1.0-1.2 or 3. Illumina 1.3-1.7. Thank you for any help you can give me! ADD REPLY 1 Entering edit mode I would choose Sanger since this is the standard for all new Illumina sequencing, and a Sanger scaled Phred quality score of 40 equals an error probability of 1 in 10,000. The Solexa and Illumina <1.8 quality score encoding uses a different range of ASCII characters, and so you'll want to stay away from those. ADD REPLY 4 Entering edit mode 8.4 years ago The solution is to create a record within each loop and assign qualities to that. This has some explanations: http://biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio import SeqIO seq = Seq("NACGTACGTA") rec = SeqRecord(seq, id="foo") rec.letter_annotations["phred_quality"] = [40] * len(seq) recs = [ rec ] handle = file("out.fq", "wt") SeqIO.write(recs, handle, "fastq")  Better yet do it with bioawk: $ more fasta.awk
{
qual = ""
for (i=0; I<length($seq); i++){ qual = qual "I" } printf ("@%s\n%s\n+\n%s\n",$name, $seq, qual) }  then run it as cat test.fa | awk -c fastx -f fasta.awk  ADD COMMENT 2 Entering edit mode And for those of us who have only "vanilla" awk: awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", $2)}' test.fa > test.fastq  Command corrected as suggested by h.mon (see below). ADD REPLY 1 Entering edit mode very cool indeed, but I guess for a correct fastq output the command should be: awk 'BEGIN {RS = ">" ; FS = "\n"} NR > 1 {print "@"$1"\n"$2"\n+"$1"\n"gensub(/./, "I", "g", \$2)}' test.fa > test.fastq

0
Entering edit mode

Oops, you are right h.mon, I used the wrong symbols (> and @ instead of @ and +).

0
Entering edit mode

Thank you, Frédéric Mahé, once again your skills come to the rescue!

0
Entering edit mode

Thank you Istvan Albert, both the python solution and the bioawk solution worked. I have to spend more time with both, but I keep forgetting to use bioawk. Thanks again!

0
Entering edit mode

You've beaten me to the punch with an explanation, so I'm just trying out Gist embedding.

4
Entering edit mode
8.4 years ago
Peter 6.0k

Since Matt & Istvan have given you SeqRecord based solutions using SeqIO, here's a light weight solution using strings:

from Bio.SeqIO.FastaIO import SimpleFastaParser
with open("input.fasta") as in_handle:
with open("output.fastq", "w") as out_handle:
for title, seq in SimpleFastaParser(in_handle):
out_handle.write("@%s\n%s\n+\n%s\n" \
% (title, seq, "I" * len(seq)))

Note that faking quality scores like this is probably not a good idea, and to be used with caution.

0
Entering edit mode

Thank you, Peter, much appreciated. I need to spend some time with SeqIO.

I have thought long and hard about faking the quality scores and I'm not completely convinced it's the way to go. I'm actually running the assemblies (as a test) with just the Illumina data, the Illumina data with Sanger/454 fasta input and with Illumina data with Sanger/454 in "faked" fastq. I'm going to compare integrity to the reference material we have. Maybe I will have more on that soon.