How to select an aligner?
5
2
3.6 years ago

Hello,

I am currently working on a WGS project and I would like to ask whether there are specific restrictions on the choice of the short read aligner. Can a particular aligner be used for both RNAseq and WGS work? Or are there peculiarities that limit its application to either RNAseq or WGS?

Specifically, can BWA be used in RNAseq? Or can HISAT2 or Bowtie2 be used for WGS? What would be the impact on downstream applications, for instance structural variant assessment? Would, for instance, the SAM files produced by HISAT2 be recognized by Picard or GATK the same way as those produced by BWA? Would HISAT2 provide the 'read group' header, for instance?

Thank you

RNA-Seq next-gen alignment • 7.1k views
2

I would like to put in a recommendation for bbmap.sh from the BBMap suite. bbmap.sh is a generalist aligner that can handle data ranging from WGS and RNAseq to PacBio. Its options are easy to understand and use, and the only requirement is Java. The suite also includes plenty of other tools.

Since basic SAM file format is codified, any aligner that sticks to the published format should produce valid SAM files that can be read by other tools. If an aligner does not produce standard SAM files then you should stay away from it.
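
As a rough illustration of what "sticking to the published format" means: the SAM specification requires 11 mandatory tab-separated fields per alignment line, with header lines starting with `@`. Below is a minimal Python sketch of that sanity check. The function name `check_sam_lines` is hypothetical, it only covers a small part of the specification, and it is not a substitute for a real validator such as Picard's ValidateSamFile.

```python
def check_sam_lines(lines):
    """Return (line_number, problem) pairs for obviously malformed SAM lines.

    Only a few basic rules are checked; this is an illustrative sketch.
    """
    problems = []
    for i, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # blank lines and header lines are skipped in this sketch
        fields = line.split("\t")
        if len(fields) < 11:
            problems.append((i, "fewer than 11 mandatory fields"))
            continue
        flag, pos, mapq = fields[1], fields[3], fields[4]
        if not (flag.isdigit() and pos.isdigit() and mapq.isdigit()):
            problems.append((i, "FLAG, POS and MAPQ must be non-negative integers"))
        elif int(mapq) > 255:
            problems.append((i, "MAPQ above 255"))
    return problems
```

Calling it on the lines of any aligner's SAM output should return an empty list if these basic rules hold.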

7
3.6 years ago
h.mon 33k

can BWA be used in RNAseq?

No, BWA is intended for DNAseq (genome, exome, etc) only.

can HISAT2 or Bowtie2 be used for WGS?

Yes - in fact, Bowtie2, much like BWA, is intended for DNAseq only. HISAT2 is more flexible, and the authors claim it can be used for both DNAseq and RNAseq.

What would be the impact on downstream applications, for instance structural variant assessment?

It is difficult to predict: it depends on what the mapper does and on what information from the mapper the variant-calling tool uses. For example, different mappers use different mapping-quality scoring schemes, and downstream tools may or may not discard reads based on alignment quality. This has long been noted for variant calling: Mapping short DNA sequencing reads and calling variants using mapping quality scores.
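
To make the mapping-quality point concrete, here is a small Python sketch that tallies MAPQ values (column 5 of a SAM line) and reports the fraction of mapped reads that would survive a given threshold. The function `mapq_summary`, the helper `toy_record`, and all read values are hypothetical and only for illustration. The toy values mimic two aligners with different MAPQ ceilings (BWA-MEM reports values up to 60, Bowtie2 up to 42).

```python
from collections import Counter


def mapq_summary(sam_lines, min_mapq):
    """Count MAPQ values of mapped reads and the fraction kept at a threshold."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:  # 0x4 = read unmapped
            continue
        counts[mapq] += 1
    total = sum(counts.values())
    kept = sum(n for q, n in counts.items() if q >= min_mapq)
    return counts, (kept / total if total else 0.0)


def toy_record(name, mapq, flag=0):
    """Build a made-up SAM record; every value here is invented for illustration."""
    return "\t".join([name, str(flag), "chr1", "100", str(mapq),
                      "4M", "*", "0", "0", "ACGT", "IIII"])


# Hypothetical output of two aligners for the same three reads.
aligner_a = [toy_record("r1", 60), toy_record("r2", 60), toy_record("r3", 0)]
aligner_b = [toy_record("r1", 42), toy_record("r2", 42), toy_record("r3", 1)]

for label, records in (("A", aligner_a), ("B", aligner_b)):
    counts, kept = mapq_summary(records, min_mapq=50)
    print(label, dict(counts), round(kept, 2))
```

With these toy values, a threshold of 50 keeps two of the three reads from aligner A and none from aligner B, which is the kind of silent difference a downstream filter can introduce.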

Would, for instance, the SAM files produced by HISAT2 be recognized by Picard or GATK the same way as those produced by BWA?

I don't know. I suppose yes, but Picard is picky.

  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
Note: @RG line only printed when --rg-id is set.
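
Since GATK tools generally expect read group information, it may be worth checking whether a given SAM file actually carries it. The sketch below collects the IDs declared on `@RG` header lines and counts how many alignment records carry an `RG:Z:` tag. The function name `read_group_report` is hypothetical, and the check is deliberately minimal.

```python
def read_group_report(sam_lines):
    """Collect @RG IDs from the header and count reads with/without an RG:Z: tag."""
    header_ids = set()
    tagged = untagged = 0
    for line in sam_lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@RG":
            for field in fields[1:]:
                if field.startswith("ID:"):
                    header_ids.add(field[3:])
        elif not line.startswith("@"):
            # Optional fields start at column 12 (index 11).
            if any(field.startswith("RG:Z:") for field in fields[11:]):
                tagged += 1
            else:
                untagged += 1
    return {"header_ids": header_ids, "tagged": tagged, "untagged": untagged}
```

If `header_ids` comes back empty, the aligner was probably run without a read group option (for HISAT2, the `--rg-id` option quoted above); read groups can also be added afterwards, for example with Picard's AddOrReplaceReadGroups.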

6
3.6 years ago

As always, you have to use the right tool for the right job. It is unlikely that one aligner is going to be the best at everything, so genomic DNA alignment will require a different aligner (most often bwa mem) than spliced RNA-seq alignment (commonly STAR or HISAT2).

bwa can do split alignments, but is not designed to do spliced alignment to span introns.
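
The difference between split and spliced alignment is visible in the SAM records themselves. In the SAM specification, a spliced alignment uses the `N` CIGAR operation (a skipped region of the reference, i.e. an intron) within one record, whereas a split (chimeric) alignment is reported as several records linked by the supplementary flag (0x800) and the `SA:Z:` tag. A minimal Python sketch follows; the function name `classify_alignment` is hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def classify_alignment(sam_line):
    """Report whether a SAM record looks spliced (N in CIGAR) or split (0x800 / SA tag)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    ops = CIGAR_OP.findall(cigar)
    intron_lengths = [int(length) for length, op in ops if op == "N"]
    # 0x800 = supplementary alignment; SA:Z: lists the other parts of a chimeric read.
    is_split = bool(flag & 0x800) or any(f.startswith("SA:Z:") for f in fields[11:])
    return {
        "spliced": bool(intron_lengths),
        "split": is_split,
        "longest_intron": max(intron_lengths, default=0),
    }
```

A record with CIGAR `50M2000N50M` would be reported as spliced with a 2000 bp intron, while a record with CIGAR `60M40S` and an `SA:Z:` tag would be reported as split.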

4
3.6 years ago
Garan ▴ 670

Might also want to look at Minimap2 by Heng Li

https://github.com/lh3/minimap2

Has a nice introductory tutorial and the readme gives some great examples.

*Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.*

2

Just a quick note: in practice, minimap2 throws away more reads than bwa on short-read data. The author of bwa also wrote minimap2 (Heng Li, who has a blog on this stuff), and he recommends bwa for short reads and minimap2 for long reads. Minimap2 was not designed for the short-read overlap used in sequence assembly, so it is suboptimal for the much less error-prone short-read assembly.

edit to include blog link: http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

4
3.6 years ago
YaGalbi ★ 1.5k

It sounds like your final goal is variant calling on WGS data, in which case let the variant calling software decide the aligner you use.

0

That's the point: is it possible to move away from BWA? I have the impression that BWA is widely used for historical reasons rather than because of a fundamental link between BWA and GATK...

1

There is no "fundamental" link between BWA and GATK, they are developed by different teams. It just so happens the GATK team tested and found BWA works best for some GATK workflows. Nothing "historical" here.

Heng Li, the author of both BWA and minimap2, once thought he would retire BWA, but later realized BWA is still better in some use cases, and he is even considering further improving BWA, by backporting minimap2 features. Read about it here: Minimap2 and the future of BWA.

0

Why wouldn't you use BWA?

0

Because, from experience, I find HISAT faster than BWA...

1

So, it's your call to trade the speed that you gain from HISAT for the years of testing and benchmarking that BWA and GATK provide. I'm not saying HISAT won't do the job, there's just no way of knowing unless you do your own benchmark or dig up a paper that has.

EDIT: So, to answer the original question on how to choose an aligner: Find papers that have compared aligners for the type of downstream analysis you want to do and pick the one that seems best suited for your tasks at hand. Sometimes speed is critical. Sometimes sensitivity is critical. And sometimes it's something else entirely.

2

Bear in mind the time lost:

1) While you make your decision is likely to be far more than the run time difference between the 2 aligners. It's already been 2 days since the question was asked; just pointing out something we often overlook. I don't mean that in any sarcastic way. We often go for the "fastest" but take ages to decide; when we look back after some experience we think, "I should have just got on with it."

2) While you are manipulating the GATK pipeline to replace BWA with another aligner. This may not be trivial. The reference genome indexing for both is different, I think. There may be more differences to consider.

1

Well, who said I stopped using BWA? It is just that if there is a faster and more multipurpose aligner, the future analysis might be faster... The main problem, as it has been pointed out, is the interlink with the downstream applications. So probably Friederike is right: there is the need for some benchmarking.

3
3.6 years ago

The main issue these days isn't that some tools don't know how to produce valid SAM/BAM files, it's really about the intricacies of specific types of data.

As Wouter wrote: different aligners were developed with different types of data in mind.

The main challenge for RNA-seq, for example, is the lack of a true full reference, since mature mRNA lacks the introns that would be needed to align the transcript sequences contiguously to the genome. Therefore, some aligners were developed specifically for RNA-seq, optimizing splice-aware alignment (STAR) and possibly isoform prediction (HISAT2). Introns can be several hundred kilobases long, which is something BWA or Bowtie2, for example, would not be able to take into consideration when aligning reads.

A quick search yielded this paper that might be a good starting point to find the aligner(s) you may want to use. Generally, most benchmarks have shown that the choice of aligner (if an appropriate one for the data at hand was used) is not the most crucial one these days, so your best bet might be to think about the downstream analyses you want to do and whether any recommendations regarding a specific aligner are indicated (see YaGalbi's link for variant calling, for example).