Question: Phred/Phrap pipeline starting with FASTA file of paired-end reads and using a reference sequence
3.6 years ago by
Hello everyone,

This is my first question here, and I am still quite new to this topic. I need to assemble short reads with Phrap, both guided by a reference sequence and without one.

I have a FASTA file with 50bp paired-end reads (I also have it in SAM, BAM, and FASTQ formats) that map to a full reference sequence, which I also have in FASTA format. I obtained the read mappings with Bowtie2 and SAMtools.

I explicitly want to use Phrap to obtain a full-length assembly of the reads and compare it to the reference via a pairwise alignment with Needle. I want to do this twice: once using the reference as a guide, and once without it.

I have been sent the Phred and Phrap programs, but I am quite lost. I have tried Phrap alone with no quality file, but I get many short contigs instead of one long one.

I understand I should follow the whole Phred -> Phd2fasta -> CrossMatch -> Phrap protocol, but I do not seem to find my way around it. It seems Phred uses a chromatogram file as input, but I do not know how to obtain it.

So my question is: how should I run the Phred/Phrap protocol starting from a FASTA (or SAM, BAM, or FASTQ) file of 50bp reads that map to a reference FASTA file? I want to obtain a contig that spans the full length of the reference, both with and without the reference as input.

Thanks a lot!

3.6 years ago by
Walnut Creek, USA
50bp reads will not give you a good assembly no matter what you do, unless you are trying to assemble a tiny virus.

You might, possibly, get a better assembly using SPAdes, which is very easy to use. There's no point in using OLC/string graph assemblers on such tiny reads. But unless you are working on a virus (and often even then, as viruses can be hard to assemble), you will not get a 1-contig assembly from 50bp reads. Certainly never for a bacterium. You'd be lucky to get a 1000-contig assembly of a bacterium using 50bp reads.

What kind of organism are you trying to assemble?  And why are you using 50bp reads?

I am just trying to assemble VDJ combinations, not a whole genome; I ran Bowtie2 with all the cell reads against a certain combination. I wasn't getting totally bad results with Velvet, but I was getting a perfect assembly with CodonCode, which uses Phred and Phrap. That's why I wanted to use Phred and Phrap to automate the process. How should I use Phred and Phrap? I will look into SPAdes too.

How can I run Phred/Phrap with a FASTQ/FASTA/BAM/SAM file as input and with a reference FASTA sequence?

I am trying to convert my input FASTA/FASTQ file into a chromatogram (SCF or ABI) file using BioPerl, as indicated in this other thread ("C: Converting A Dna Sequence To Abi Or Scf Format"), but this approach does not work. Any clue?

Can anyone help with this? Thanks!
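
One route that may avoid the chromatogram step altogether: Phred is a base caller for trace files, so when the reads already carry quality values in a FASTQ file, it should be possible to hand Phrap a FASTA file plus a matching quality file directly. The sketch below is only an illustration under assumptions that are not from this thread: that the FASTQ uses plain 4-line records with Phred+33 encoding, and that Phrap picks up a quality file named after the FASTA file with `.qual` appended. The function name `fastq_to_fasta_qual` and the file names are hypothetical.

```python
import io


def fastq_to_fasta_qual(fastq_handle, fasta_handle, qual_handle, offset=33):
    """Split 4-line FASTQ records into a FASTA file and a .qual file.

    Assumes Phred+33 quality encoding (offset=33) and unwrapped records.
    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = fastq_handle.readline().strip()
        plus = fastq_handle.readline().strip()
        qual = fastq_handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        name = header[1:]
        # Decode each quality character to its integer Phred score.
        scores = [ord(ch) - offset for ch in qual]
        fasta_handle.write(f">{name}\n{seq}\n")
        qual_handle.write(f">{name}\n{' '.join(str(s) for s in scores)}\n")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demo; real use would open reads.fastq,
    # reads.fasta and reads.fasta.qual instead (hypothetical names).
    demo = io.StringIO("@read1\nACGT\n+\nII5!\n")
    fasta_out, qual_out = io.StringIO(), io.StringIO()
    fastq_to_fasta_qual(demo, fasta_out, qual_out)
    print(fasta_out.getvalue(), end="")
    print(qual_out.getvalue(), end="")
```

With the two files written side by side as `reads.fasta` and `reads.fasta.qual`, running `phrap reads.fasta` would then be the next thing to try. Whether 50bp reads assemble into a single contig this way is a separate matter, and Phrap's defaults may need checking against its documentation for reads this short.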

