Alternative to BLASR ?
3
2
Entering edit mode
4.9 years ago
Rox ★ 1.3k

Hi everyone,

In order to polish my Falcon assembly of my D. suzukii genome, from PacBio long reads, I need to align the raw reads to my genome and create a .bam or a .sam file.

Such a tool already exist at PacBio, called blasr, but is currently not working (the 5.3 version I've installed do not provide a .bam or a .sam output... Which is VERY annoying).

Does an alternative of such aligner exist ? It has to be a aligner aware of the higher error rate of PacBio reads (that's why I can't use a classic aligner).

Any ideas ? I'm truely desesperate, I absolutely need to polish my assembly before going further in my analysis.

Roxane

PacBio assembly alignement • 4.1k views
5
Entering edit mode
4.9 years ago
Medhat 9.0k

bwa and BBmap (as far as I know limited to 6kb)

bwa mem -x pacbio ref.fa reads.fq > aln.sam

more here

Update.

Minimap2

is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP.

1
Entering edit mode

Roxane Boyer : With BBMap PacBio reads needed to be in fasta format and those above 6kb were broken into 6kb pieces during search automagically (with an _1, _2 etc). Since Brian Bushnell does not remove functionality from BBMap suite it should still work. Also see this original post. Try and let us know.

0
Entering edit mode

Okay thanks you a lot for the advices ! I'm going to try this !

0
Entering edit mode

Hi Genomax,

I'm coming back to you again about BBmap. After having some issues with bwa-mem (depending on the number of threads used, I didn't get the same results), I want to try bbmap. I understood that the large reads from my pacbio sequencing will have to be sliced in smaller pieces. However, as it's my first time using this tool, I was looking for some advanced parameters for PacBio reads, here what I found ( http://seqanswers.com/forums/showthread.php?t=58221 ) :

\$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400


In this post, it is written that :

The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.

But this section was "Oxford Nanopore" section, I assume that the author meant long reads generally, but as I'm not really sure, I wanted to have your opinion.

Should I use 6kb or 1kb for the reads slicing for PacBio data ?

0
Entering edit mode

What do you mean limited to 6kb ? The read length ? Because some of my reads are much longer than 6kb. Anyway, thanks for the tools I'm going to have a look on it

0
Entering edit mode

Okay, I've tried the following lines :

./bwa index /media/loutre/SUZUKII/assembly_tries/pb_268_p_ctg.fasta


Then I wanted to visualize my alignements on IGV with the decompressed sam file, but no reads where shown, IGV said that no reads was corresponding to my genome (but it was actually the right file).

Here the head of my sam file :

@SQ SN:000000F  LN:20279970
@SQ SN:000001F  LN:15392955
@SQ SN:000002F  LN:12197397
@SQ SN:000003F  LN:8269814
@SQ SN:000004F  LN:7009319
@SQ SN:000005F  LN:6740481
@SQ SN:000006F  LN:5550981
@SQ SN:000007F  LN:4663438
@SQ SN:000008F  LN:4675199
@SQ SN:000009F  LN:4040179


I'm not very familiar with the SAM format, but it looks like something is wrong. Did I missed something ? Do I have to reformat the output so it look like more at something like this :

@HD VN:1.5 SO:coordinate
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

0
Entering edit mode

Why do you pipe bwa mem to gzip? I haven't seen something like that before and would suggest this:

bwa mem ref.fasta input.fastq | samtools view -b - | samtools sort - > output.bam


After alignment the sam file is converted to bam and sorted without intermediate files.

0
Entering edit mode

Well, I've just copied what is written on the README file of bwa mem ! I'm not a original person, I've just followed the example they gived :

./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz


But well, I'm going to try without compressing it.

To go back on my first file output, we agree that this sam file is empty right ? There is no informations inside it ?

0
Entering edit mode

samtools idxstats yourfile.bam
samtools flagstat yourfile.bam

0
Entering edit mode

Well, it says that it failed to load the index, I guess it means it's not a real sam file ? The construction of my file look weird for me. I'm not sure if this is because I didn't saw a lot of Samf ile, or if it's because something really went wrong.

0
Entering edit mode

Did you create an index of the file using the following command?

samtools index yourfile.bam


I've no experience with creating sam.gz files, so it's possible that this is the source of the problem

2
Entering edit mode
4.9 years ago
brent_wilson ▴ 130

I have had much better luck for PACBIO alignment with BLAT, followed by some filtering for "true" alignments (alignment length, number of matches, etc.). It took a bit of work to set the parameters, and I haven't dealt with it in awhile, but it worked well for finding alignments (even for partial hits when I was working with draft assemblies).

Brent Wilson, PhD | Project Scientist | Cofactor Genomics

4044 Clayton Ave. | St. Louis, MO 63110 | tel. 314.531.4647

Catch the latest from Cofactor on our blog.

0
Entering edit mode

It would be useful to list those parameters, if you can dig them up. Will save people having to go through the same effort again (or at least serve as a starting point).

0
Entering edit mode

I would have, but everything is saved over at my previous position where I can't access it anymore. I do remember looking at the identity and coverage of each hit and using dot plots to train the parameters.

0
Entering edit mode

Hi Brent, thanks for your indication ! So you don't have any memory of how we can parameter BLAT for such an use ? it will be really useful for me to have such an information ! If you remember of something, to hesitate to write it down here !

2
Entering edit mode
3.3 years ago
harish ▴ 370

Hi!

Sorry to bring up the thread, but I wanted to put it down in case someone else faced the issue. I have recently used minimap2 to map the PacBio reads and then use pbbamify to put the alignment in the requisite format for Arrow.

I had about 100x data for a genome I am assembling and blasr was taking ages.

Please do give this a try!

Update: You can now simply use PBMM2 from pb-bioconda to create the aligned bams and later user pbindex to generate the BAM index if needed.

1
Entering edit mode

Hello and thanks for all these suggestions. These further comments might help someone else.

I have used minimap2 to map PacBio raw reads (converted to fastq as required for minimap2) against my de novo assembly. I recommend this approach over blasr. Use samtools to get the aligned.bam. Minimap2 does not output bam files.

I then used smrt tools to create a mydata.xml file using the following:

Then use pbbamify to convert the aligned.bam to correct pb format for submission to genomicconsensus arrow. Took a while to work through this process - thought it might help someone else.

pbbamify --input aligned.bam --output=aligned.pb.bam ref.fasta mydata.xml

1
Entering edit mode

Use samtools to get the aligned.bam. Minimap2 does not output bam files.

Just FYI (unclear how you did it) that you can do

minimap2 -t 8 -a genome.fa reads.fastq.gz | samtools sort -@8 -o my_sorted_alignment.bam


and as such avoid intermediate files and directly create a bam file (indeed using samtools).

0
Entering edit mode

Yes, you are correct. I should have been clearer. This is the script I used.

minimap2 -ax map-pb genome.fasta *.fastq \ | samtools view -hF 256 - \ | samtools sort -@ 8 -o genome_aligned.bam

I should also say that I am still having some trouble with pbbamify. So I haven't got this whole step working seamlessly yet.

1
Entering edit mode

Minimap doesn't really care for fastq files tbh. As such the Sequel runs have series of '!!!!!!' representing the qualities anyways.

You can definitely use Minimap's "-a" parameter to get a sam file and then convert it into Arrow specific format using pbbamify.

Also take a look at pbbioconda repos. If you have the Bams from Sequel, you can directly use Pbmm2, which uses minimap api to align the reads to the genome.

Plus these days you don't really need SMRTlink tools unless you are really looking at specific analysis. Most genome analysis these days are covered in pbbioconda!

0
Entering edit mode

Thanks for these tips @harish. I did not know that I could directly use subreads with Pbmm2. Actually this all looks more promising as a way forward.

Pbbamify worked after 130 hours of walltime. It output a very large pb.bam (351G) from aligned.bam (36G). Arrow still needs a pbi indexed file according to error report.

I really appreciate your help on this!

0
Entering edit mode

You can use pbindex to generate the index files.