Question: How to select an aligner?
marongiu.luigi (380), Germany, Mannheim, UMM, wrote 10 months ago:

Hello,

I am currently working on a WGS project and would like to ask whether there are specific restrictions on the choice of short-read aligner. Can a particular aligner be used for both RNAseq and WGS work? Or are there peculiarities that limit its application to either RNAseq or WGS?

Specifically, can BWA be used for RNAseq? Or can HISAT2 or Bowtie2 be used for WGS? What would be the impact on downstream applications, for instance structural variant assessment? Would, for instance, the SAM files produced by HISAT2 be recognized by Picard or GATK in the same way as those produced by BWA? Would HISAT2 provide the 'read group' header, for instance?

Thank you

rna-seq alignment next-gen • 1.8k views
modified 10 months ago by Friederike (3.8k) • written 10 months ago by marongiu.luigi (380)

I would like to put in a recommendation for bbmap.sh from the BBMap suite. bbmap.sh is a generalist aligner that can handle data ranging from WGS and RNAseq to PacBio. Its options are easy to understand (and use), and the only requirement is Java. The suite also includes plenty of other tools.

Since the basic SAM file format is codified, any aligner that sticks to the published specification should produce valid SAM files that other tools can read. If an aligner does not produce standard SAM files, you should stay away from it.
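
To make the point about the codified format concrete, here is a minimal sketch of a sanity check on single SAM lines. It only looks at the 11 mandatory columns defined in the SAM specification; the function name `check_sam_line` is made up for this illustration, and this is nowhere near a full validator (for real work a dedicated validator such as Picard's ValidateSamFile is the better choice).

```python
# The 11 mandatory SAM alignment fields, in the order the SAM specification lists them.
MANDATORY_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def check_sam_line(line):
    """Return a list of problems found in one SAM line (empty list = looks fine).

    Hypothetical helper for illustration only; it checks a few basics, not the whole spec.
    """
    line = line.rstrip("\n")
    if line.startswith("@"):
        # Header lines start with '@' plus a two-letter record type (@HD, @SQ, @RG, @PG, @CO).
        tag = line.split("\t")[0]
        return [] if len(tag) == 3 else [f"odd header tag: {tag}"]

    fields = line.split("\t")
    if len(fields) < len(MANDATORY_FIELDS):
        return [f"only {len(fields)} of 11 mandatory fields present"]

    problems = []
    # FLAG, POS, MAPQ, PNEXT and TLEN must be integers.
    for idx in (1, 3, 4, 7, 8):
        try:
            int(fields[idx])
        except ValueError:
            problems.append(f"{MANDATORY_FIELDS[idx]} is not an integer: {fields[idx]!r}")

    # SEQ and QUAL must have the same length unless one of them is '*'.
    seq, qual = fields[9], fields[10]
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append("SEQ and QUAL lengths differ")
    return problems
```

Running this over the first few thousand lines of the output of an unfamiliar aligner is a cheap first test before handing the file to stricter tools.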

modified 10 months ago • written 10 months ago by genomax (66k)
WouterDeCoster (38k), Belgium, wrote 10 months ago:

As always, you have to use the right tool for the right job. It is unlikely that one aligner is going to be the best at everything, so genomic DNA alignment will require a different aligner (most often bwa mem) than spliced RNA-seq alignment (commonly STAR or HISAT2).

bwa can do split alignments, but is not designed to do spliced alignment to span introns.
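
The difference between split and spliced alignments shows up directly in the SAM records. A minimal sketch, assuming the usual conventions: a spliced alignment carries an `N` (skipped reference region) operation in its CIGAR string, while a split alignment is reported as several records linked by the supplementary flag (0x800) and/or an `SA:Z:` tag. The function names are made up for this example.

```python
import re

# One CIGAR operation: a length followed by an operation letter from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
SUPPLEMENTARY = 0x800  # SAM flag bit marking a supplementary (split/chimeric) record


def classify(flag, cigar, tags=()):
    """Label one alignment record as 'spliced', 'split' or 'contiguous' (illustrative only)."""
    ops = CIGAR_OP.findall(cigar)
    if any(op == "N" for _, op in ops):
        # 'N' means a stretch of reference was skipped, i.e. an intron for RNA-seq reads.
        return "spliced"
    if flag & SUPPLEMENTARY or any(t.startswith("SA:Z:") for t in tags):
        # The read was broken into pieces that are reported as separate records.
        return "split"
    return "contiguous"


def intron_lengths(cigar):
    """Return the lengths of all skipped regions (putative introns) in a CIGAR string."""
    return [int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"]
```

For example, `classify(0, "50M1200N50M")` gives `'spliced'`, whereas a record with flag 2048 and CIGAR `60H40M` is labelled `'split'`.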

h.mon (24k), Brazil, wrote 10 months ago:

"can BWA be used for RNAseq?"

No, BWA is intended for DNAseq (genome, exome, etc.) only.

"can HISAT2 or Bowtie2 be used for WGS?"

Yes. In fact, Bowtie2, much like BWA, is intended for DNAseq only. HISAT2 is more flexible, and its authors state that it can be used for both DNAseq and RNAseq.

"What would be the impact on downstream applications, for instance structural variant assessment?"

It is difficult to predict: it depends on what the mapper does and on what information from the mapper the variant-calling tool uses. For example, different mappers use different mapping quality scoring schemes, and downstream tools may or may not discard reads based on mapping quality. This has long been noted for variant calling; see "Mapping short DNA sequencing reads and calling variants using mapping quality scores".
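
To illustrate why the scoring scheme matters, here is a minimal sketch that counts how many mapped records survive a mapping-quality cutoff. The function name `count_passing` and the idea of a single fixed cutoff are assumptions made for the example; real downstream tools apply their own, often more involved, filters.

```python
def count_passing(sam_lines, min_mapq):
    """Return (kept, total) for mapped records with MAPQ >= min_mapq.

    Illustrative helper: header lines and unmapped reads are ignored.
    """
    kept = total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # 0x4 = read unmapped, MAPQ is meaningless here
        total += 1
        if mapq >= min_mapq:
            kept += 1
    return kept, total
```

If one aligner reports a confidently placed read with a high MAPQ and another gives the same read a low value because its scale is different, the same cutoff keeps the read in one case and drops it in the other, and the variant caller never sees it. Running the same cutoff over the output of each candidate aligner shows quickly how far the two disagree.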

"Would, for instance, the SAM files produced by HISAT2 be recognized by Picard or GATK in the same way as those produced by BWA?"

I don't know. I suppose yes, but Picard is picky.

"Would HISAT2 provide the 'read group' header, for instance?"

Yes, HISAT2 has options for setting the read group (from its help output):

  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
  --rg <text>        add <text> ("lab:value") to @RG line of SAM header.
                     Note: @RG line only printed when --rg-id is set.
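
A minimal sketch of how these two options could be used, assuming a paired-end run. Only `--rg-id` and `--rg` come from the help text above; `-x`, `-1`, `-2` and `-S` are the standard HISAT2 options for index, mates and SAM output. The file names, sample name and the helper functions are made up for the example, and the command is only assembled here, not executed.

```python
def hisat2_command(index_prefix, reads1, reads2, out_sam, rg_id, sample, platform="ILLUMINA"):
    """Build a HISAT2 argument list that sets a read group (illustrative, not executed)."""
    return [
        "hisat2",
        "-x", index_prefix,        # basename of the HISAT2 index
        "-1", reads1,              # mate 1 FASTQ
        "-2", reads2,              # mate 2 FASTQ
        "--rg-id", rg_id,          # required, otherwise no @RG line is printed
        "--rg", f"SM:{sample}",    # each --rg adds one "label:value" field to @RG
        "--rg", f"PL:{platform}",
        "-S", out_sam,
    ]


def expected_rg_header(rg_id, sample, platform="ILLUMINA"):
    """The @RG header line one would expect to find in the output for the command above."""
    return "\t".join(["@RG", f"ID:{rg_id}", f"SM:{sample}", f"PL:{platform}"])


# Example with made-up names; pass the list to subprocess.run() to actually run it.
cmd = hisat2_command("grch38/genome", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1.sam", "run1", "sampleA")
```

It is worth checking the header of the resulting file (for example with `samtools view -H`) before moving on, since Picard and GATK tools generally refuse input without a read group carrying a sample name.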
Garan (560), United Kingdom, wrote 10 months ago:

Might also want to look at Minimap2 by Heng Li

https://github.com/lh3/minimap2

https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/bty191/4994778

Has a nice introductory tutorial and the readme gives some great examples.

*Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy reads sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignment ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 paper or the preprint.*


Just a quick note: in practice, minimap2 discards more short reads than bwa. Heng Li, who wrote both bwa and minimap2 (he has a blog on this topic), recommends bwa for short reads and minimap2 for long reads. Minimap2 was not designed for the short-read overlap used in sequence assembly, so it is suboptimal for the much less error-prone short-read assembly.

edit to include blog link: http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa

modified 10 months ago • written 10 months ago by drkennetz (360)
YaGalbi (1.4k), Biocomputing, MRC Harwell Institute, Oxford, UK, wrote 10 months ago:

It sounds like your final goal is variant calling on WGS data, in which case let the variant calling software decide the aligner you use.

E.g. the GATK Best Practices use BWA.

modified 10 months ago • written 10 months ago by YaGalbi (1.4k)

That's the point: is it possible to move away from BWA? I have the impression that BWA is widely used for historical reasons rather than because of a fundamental link between BWA and GATK...

written 10 months ago by marongiu.luigi (380)

There is no "fundamental" link between BWA and GATK; they are developed by different teams. It just so happens that the GATK team tested BWA and found that it works best for some GATK workflows. Nothing "historical" here.

Heng Li, the author of both BWA and minimap2, once thought he would retire BWA, but later realized BWA is still better in some use cases, and he is even considering further improving BWA, by backporting minimap2 features. Read about it here: Minimap2 and the future of BWA.

written 10 months ago by h.mon (24k)

Why wouldn't you use BWA?

written 10 months ago by WouterDeCoster (38k)

Because, from experience, I find HISAT faster than BWA...

written 10 months ago by marongiu.luigi (380)

So, it's your call to trade the speed that you gain from HISAT for the years of testing and benchmarking that BWA and GATK provide. I'm not saying HISAT won't do the job, there's just no way of knowing unless you do your own benchmark or dig up a paper that has.

EDIT: So, to answer the original question on how to choose an aligner: find papers that have compared aligners for the type of downstream analysis you want to do, and pick the one that seems best suited to the task at hand. Sometimes speed is critical. Sometimes sensitivity is critical. And sometimes it's something else entirely.
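
If you do end up running your own benchmark, the core comparison can be very small. A minimal sketch, assuming you have a trusted truth set for a sample and the variant calls obtained downstream of each aligner, each reduced to (chromosome, position, ref, alt) tuples. The function name and the exact-match criterion are assumptions for illustration; dedicated comparison tools handle representation differences that this ignores.

```python
def score_calls(called, truth):
    """Compare a call set against a truth set by exact (chrom, pos, ref, alt) match."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)  # calls that are also in the truth set
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": len(called - truth), "fn": len(truth - called),
            "precision": precision, "recall": recall, "f1": f1}
```

Calling `score_calls` once per aligner on the same sample and truth set, and putting the resulting precision and recall next to the wall-clock time of each run, gives a direct view of the speed versus accuracy trade-off discussed above.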

modified 10 months ago • written 10 months ago by Friederike (3.8k)

Bear in mind the time lost:

1) While you make your decision. This is likely to be far more than the run-time difference between the two aligners. It has already been 2 days since the question was asked; I am just pointing out something we often overlook, and I don't mean that in any sarcastic way. We often go for the "fastest" tool but take ages to decide, and when we look back with some experience we think, "I should have just got on with it."

2) While you are manipulating the GATK pipeline to replace BWA with another aligner. This may not be trivial. The reference genome indexing is different for the two aligners, I think, and there may be more differences to consider.
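
On the indexing point: each aligner builds and reads its own index format, so the reference has to be indexed once per aligner. A minimal sketch of the two indexing commands, assembled as argument lists only; `bwa index` and `hisat2-build` are the real indexing programs, while the file names and the helper function are made up for the example.

```python
def index_commands(reference_fasta, hisat2_prefix):
    """Return the indexing command for each aligner (illustrative, not executed)."""
    return {
        # BWA writes its index files next to the FASTA, using the FASTA name as prefix.
        "bwa": ["bwa", "index", reference_fasta],
        # HISAT2 takes an explicit output basename for its .ht2 index files.
        "hisat2": ["hisat2-build", reference_fasta, hisat2_prefix],
    }


# Example with made-up paths; pass each list to subprocess.run() to actually build the index.
commands = index_commands("ref/genome.fa", "ref/genome_hisat2")
```

Any pipeline step that points at the BWA index would need to be repointed at the other aligner's index, which is one of the places where swapping aligners can go wrong quietly.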

modified 10 months ago • written 10 months ago by YaGalbi (1.4k)

Well, who said I stopped using BWA? It is just that if there is a faster and more multipurpose aligner, the future analysis might be faster... The main problem, as it has been pointed out, is the interlink with the downstream applications. So probably Friederike is right: there is the need for some benchmarking.

written 10 months ago by marongiu.luigi (380)
Friederike (3.8k), United States, wrote 10 months ago:

The main issue these days isn't that some tools don't know how to produce valid SAM/BAM files, it's really about the intricacies of specific types of data.

As Wouter wrote: different aligners were developed with different types of data in mind.

The main challenge for RNA-seq, for example, is the lack of a true full reference: mature mRNA lacks the introns, so reads that span exon-exon junctions cannot be aligned contiguously to the genome. Therefore, some aligners were developed specifically for RNA-seq, optimizing splice-aware alignment (STAR) and possibly isoform prediction (HISAT2). Introns can be several (even hundreds of) kilobases long, which is something BWA or bowtie2, for example, would not be able to take into account when aligning reads.

A quick search yielded this paper that might be a good starting point to find the aligner(s) you may want to use. Generally, most benchmarks have shown that the choice of aligner (if an appropriate one for the data at hand was used) is not the most crucial one these days, so your best bet might be to think about the downstream analyses you want to do and whether any recommendations regarding a specific aligner are indicated (see YaGalbi's link for variant calling, for example).
