Question: What can I do after my Pacbio genome assembly ?
2
gravatar for Picasa
23 months ago by
Picasa390
Picasa390 wrote:

Hi

I have just assembled my genome with CANU. I am a beginner with Pacbio data and I am wondering now if there is something I can do more ?

1) I heard about Quiver for polishing the assembly: For this I think I need a sam file where my reads have been mapped to my genome. Can I use the mapper blasr for that case?

pacbio • 2.0k views
ADD COMMENTlink modified 23 months ago by Roxane Boyer880 • written 23 months ago by Picasa390
6
gravatar for Roxane Boyer
23 months ago by
Roxane Boyer880
France / Toulouse / GeT-Plage
Roxane Boyer880 wrote:

Hi Picasa !

Indeed, you will need to perform a polishing step after assembly, this will lower your error rate from ~1% (after assembly with canu) to ~0.001-0.0000001% (depending of the coverage used for polishing, you can find information in this link : https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/FAQ.rst section "What is the expected quiver accuracy? ")

So, you indeed need a alignement file, but with PacBio, everything is a bit more complicated. Considering their bas/bax.h5 format, you will need a particular aligner (blasr you said it right) that will produce an alignement. However, this alignment will be not in bam format, but in a specific format called .cmp.h5 (cmp strand for comparison I guess).

Blasr align file by file. In order to align all your .bas/bax.h5, you are going to use an other tool called pbalign. This tool can take as an input a .fofn file (file of file ... forget about the n) which is a list of all the path that lead to your raw data, and will literally call blasr for all your raw data file and then compile their result.

Here a small example of how a .fofn file should look (sorry for horrible paths) :

/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.1.bax.h5
/media/DATAPART1/pacbio_reads/run6/H01_1/Analysis_Results/m151220_023946_42237_c100942502550000001823214006101687_s1_p0.2.bax.h5

As it was really hard for me to gather this informations, here how I used pbalign :

pbalign --forQuiver raw_reads.fofn assembly_to_polish.fasta your_output.cmp.h5

Then, you can use quiver using something like this :

quiver your_output.cmp.h5 -r assembly_to_polish.fasta -o polished_assembly.fasta

For installing pacbio tools, I redirect you to this github thread : https://github.com/PacificBiosciences/pbalign/issues/67

Don't try to install all PacBio tools by yourself, don't even try pitchfork, you will lose a lot of time just as I do.

Also, a related Biostar post here : A: Choosing de novo genome assembly

Good luck with your polishing ! This way worked fine for me after weeks or research on this subject.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Roxane Boyer880
1

If he needs bam file I think he can use bwa mem -x pacbio ref.fa reads.fq | samtools view -F 4 -Sbh - | samtools sort - my_result_sorted.bam
instead of blasr

ADD REPLYlink modified 23 months ago • written 23 months ago by Medhat8.1k
2

Yes but the problem after that is that quiver will not necesserly work well with bam files. Indeed, the particular format for PacBio files take account of various chemistery parameters and quality values. As I said in my previous post, I think that soon or later, PacBio is going to fully abandon theses format and fully use bam/sam. For now, I will not recommand to use anything else but PacBio format because I had a lot of struggle trying to use classic bam/sam format. You can see my struggle on github here : https://github.com/PacificBiosciences/pbalign/issues/48 . I also in the past tried to use quiver using a bam/sam file. It gaves me errors (I don't have wrote them to transcribe it there) because quiver wasn't taking care of sam/bam file (even if it is written on their wikipage...)

ADD REPLYlink written 23 months ago by Roxane Boyer880

Hi Roxane and many thanks for your help !

However I don't have .bas/bax.h5 files but only fastq (my raw reads are fastq files) so can I still use pbalign with your command ? I mean, does it recognize fastq ?

ADD REPLYlink written 23 months ago by Picasa390

It may be better overall if you can get the *.h5 files from your source.

ADD REPLYlink written 23 months ago by genomax62k

Hi Picasa,

I agree with Genomax, I had a lot of trouble and could never achieve the final polishing step by using fastq files and pâcbio tools. You have to find your raw reads, they are essentials. Also, try to acknowledge what your sequencing coverage was, it is a very essential information !

ADD REPLYlink written 23 months ago by Roxane Boyer880

I have the sequencing coverage information; but why is it important ?

ADD REPLYlink modified 23 months ago • written 23 months ago by Picasa390

Coverage is a major parameter with PacBio long read sequencing, because as their raw reads have bigger error rate (~15%), your strategy to lower down this error rate will be different depending on the number of PacBio reads you have.

If you are new to PacBio sequencing, I advise you to read this manual wrote by Mr Chakraborty : https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkw654

It show the importance of sequencing coverage and how to obtain good assembly metrics depending on your coverage.

You have different strategy to polish your genome, one is called Self corrected, because you use raw PacBio reads to correct your assembly, and other method are called hybrid, because, in certain cases, you can use shorter Illumina reads to correct your assembly.

Self correction is used when you have enough coverage, Hybrid correction is used when you have low coverage.

I hope I was clear with that, if you have more question about that, don't hesitate to ask me ! But I'm sure you will find all you need in this really great manual !

ADD REPLYlink written 23 months ago by Roxane Boyer880

Hi Roxane,

I was reading this for quiver

https://github.com/PacificBiosciences/GenomicConsensus

and I would like to know if you have some benchmark using the different algorithm proposed by variantCaller (quiver,arrow,plurality,poa,best).

I mean, does it change a lot ? The link above said that apparently arrow is better than quiver

ADD REPLYlink modified 21 months ago • written 21 months ago by Picasa390

Hi Picasa !

Sadly, I never managed to get arrow to work on my computer (installation issue), same thing with the others algorithm... I will try to test it again, I also really want to know how different the two algorithm work, and especially want to know if one of them is parallelizable (as quiver is not)

ADD REPLYlink written 21 months ago by Roxane Boyer880
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 996 users visited in the last hour