RNA-seq differential expression analysis, which aligner to choose between BWA/tophat/Bowtie?
6
2
Entering edit mode
7.0 years ago
EagleEye 7.2k

I have RNAseq reads (single-end, 49-bp), I would like to do differential expression analysis. I know the standard pipeline is some rnaseq aligners ( e.g. Tophat, soap etc,) followed by featureCounts/Cufflinks(with assembly)/HTSeq and EdgeR/DESeq.

My question is, 1) will it be appropriate to use BWA aligner instead of other standard RNAseq specific aligners? And 2) why there is lot of difference in percentage of mapping reads when you do alignment with BWA and other RNAseq specific aligners?

RNA-Seq alignment next-gen • 11k views
4
Entering edit mode
7.0 years ago

BWA isn't going to handle spliced alignment terribly well, so it's not normally recommended for RNAseq datasets. That's also why it gives lower alignment rates.

BTW, give STAR a try. It's much faster than tophat.

0
Entering edit mode

Thanks a lot.

6
Entering edit mode
7.0 years ago

As long as we are suggesting different splice-aware aligners...  I'll toss in BBMap and Seal.  BBMap is faster and more sensitive than Tophat, and can find splice junctions even with little 49bp single-ended reads, with no need for annotation files (e.g., gtf).

If you have the transcriptome, you can directly quantify the RPKM/FPKM and coverage using Seal (distributed with BBMap) which uses an alignment-free approach, and is ~80x faster than using BBMap to map to the genome.  For usage information, you can run "seal.sh" with no arguments, but basically you run it like this:

2
Entering edit mode

Hi Brian,

I'm interested in using Seal, however I'm not sure where I would get a 'transcriptome.fa' file. Is that an output generated from BBmap?

On a side note - I'm extremely interested in your background; your training in computer science, your bioinformatics training and experience, how you managed to write an entire next generation analysis suite... I feel like you need a wikipedia page.

2
Entering edit mode

Hi Paul,

Most organisms that have a transcriptome fasta have one through the concerted effort of a group of people dedicated to finding the transcriptome of that organism. Often, that happens through JGI. Sometimes it happens through other agencies. Unfortunately, the only thing that is guaranteed is that for a given organism, it will be hard to figure out whether it has already been sequenced, and where the best version of its genome or transcriptome is located.

NCBI's ftp site is always my first go-to site for genomic data, though. JGI's website is second, since it has a lot of useful information, but is less convenient to use. Recently I've been irritated that NCBI's ftp site lacks the organisms I'm interested in; but, JGI's is much more confusing to navigate. If you have trouble at JGI please send me a message and I'll forward it to those responsible.

BBMap is not normally used to generate transcriptome fastas (it can, but Tadpole is designed for DNA assembly and is not optimal for spliced RNA assembly). JGI uses Trinity for eukaryote RNA assembly, which (should) work better for spliced RNA. For prokaryotic RNA assembly, which does not have multiple isoforms, Tadpole will do a good job.

If you have a transcriptome fasta, you can use either BBMap or Seal to quantify the expression levels. Seal is probably 100x faster (the difference depends on the genome), but it is less tolerant of errors in your data. Which is generally fine. If I had a transcriptome fasta (prok or euk), and my goal was to quantify expression levels, I would use Seal over BBMap. With only a genome, you must use a splice-aware aligner like BBMap. Theoretically, you can get better results (assuming optimal quantification postprocessing) when aligning to the genome compared to the transcriptome; but in practice it's not all that important because there is no current program that does a good job of quantifying gene isoform expression from RNA reads mapped to a genome. I still recommend mapping to the genome, though, if you are interested in the unbiased ground truth or in discovering new isoforms.

As for my background - my formal training is pure computer science. Most of what I know about genomics was taught to me at UT Southwestern and JGI, and thus I am very grateful to my co-workers.

I started writing an analysis suite because at UT Southwestern I was exposed to the Life Technologies Bioscope package, which is horrible, but UTSW was using it for production. It is intolerant of errors, even though the Solid 4 platform has a high error rate; and Bioscope is incapable of detecting indels, which are the most critical mutation type for the data we were analyzing. And it is difficult to use, with myriads of config files needing hundreds of pages of documentation, for the simplest task (like map reads to human -> call variants). So the standard pipeline was producing useless output. It appeared to be both easier and more useful to start from scratch with a new mapper, variant-caller, and annotator than to use anything from Life, so I did that, with good results.

UTSW considers my variant-caller and annotator (and various human-specific tools) their property, which is unfortunate, and the reason they are not in the BBTools package. I wrote BBMap in my spare time while at UTSW, though, so I own it. Most of the rest of BBTools was written at JGI and thus belongs to Berkeley Labs, which is very permissive and allows me to make it publicly available with no restrictions.

As for Wikipedia - it's the website anyone can edit! I've added a page myself (about a JPop vocalist) and fixed another (about circuit verification). Maybe someday I'll get my own!

0
Entering edit mode

sorry Brian I should perform miRNA DE analysis..I have remote access to linux but it is too slow, I have fedora virtual box on my lap top but without anything installed such as python, I tried galaxy but even uploading my files took many hours at last failed to complete... then my last chance is something working on windows like what you designed. am I right about miRNA DE analysis by bbmap in windows?????

1
Entering edit mode

BBMap and Seal can be used for expression analysis, but not differential-expression analysis, without an additional program. However, they do the hard work of the actual mapping. Maybe I should write something for DE analysis...

0
Entering edit mode

sorry Brian,

I have access to recently sequenced smallRNA data from the pathogen fungi Aspergillus fumigatus and Candida albicans. my mentor is interested in the differential expression of miRNAs in response to human blood infection. I trimmed 3 prime illumina universal adapters and reads shorter thsn 15 bp via BBDuk then I mapped reads on reference genome fasta through bowtie2, removed non-aligned reads by samtools. now I have sam files. how I can extract RPKM values from sam or bam files by BBmap please?????? I found a transcriptome fasta contains CDS, proteins and ncRNA might be came in handy.

thank you

1
Entering edit mode

Actually, Pileup can output rpkm values with the "rpkm=file.txt" flag. But that will only be useful if the reads were mapped to the transcriptome, not the genome.

0
Entering edit mode

6
Entering edit mode
7.0 years ago
Rob 5.1k

For spliced RNA-seq aligners, I'd recommend STAR or HISAT.  However, let me just say that I would argue against doing quantification and differential expression analysis with a feature counts / HTSeq-based pipeline.  These tools can tell you about the number of reads that map to a particular genomic position, but they do nothing to help disambiguate which of the (possibly numerous) overlapping transcripts they originate from.  This is not a minor detail, but rather, the transcript of origin (and the overall length of that transcript) can have a huge impact on the appropriate estimate of that transcript's relative abundance.  This leads to the type of problems discussed in e.g. the CuffDiff2 manuscript (take a look at figure 1 in particular).  This is true even if you're only interested in differential expression at the gene (as opposed to, say, the isoform) level --- the count-based models for handling that (like the union-intersection model) still fall short.  You're much better off using one of the many tools that is able to properly quantify transcript abundance (Cufflinks, RSEM, eXpress, Salmon, BitSeq, TIGAR, . . . the list goes on).

0
Entering edit mode

Thanks, I will try to check with these tools.

4
Entering edit mode
7.0 years ago
rtliu ★ 2.1k

HISAT is a fast and sensitive spliced alignment program for mapping RNA-seq reads. It is designed as TopHat2 replacement. In the preprint manuscript

"we show that HISAT is the fastest system currently available, approximately 50 times faster than TopHat2 and 12 times faster than GSNAP, with equal or better accuracy than any other method."

2
Entering edit mode
7.0 years ago

A genomic aligner like BWA won't handle intron splicing putting a huge indel in your reads. Tophat was made for this, and finds a lot of potential places where a read could fall, it's particularly good at finding new splice junctions .

If however you already know what transcripts you want to quantify, say by a database of sequences, or a reference GTF, then my favorite tool is RSEM, it aligns without splicing to a pre-spliced transcriptome reference set, so I trust the accuracy and it even reports confidence intervals around expression quantification.

Tophat by comparison runs for a long time and reports splicing between gene paralogs, because my read pairs fit between them cleanly. That's paired-end, so maybe with your single-end data it will be more linear.

0
Entering edit mode

2
Entering edit mode
5.4 years ago
EagleEye 7.2k

After trying all the tools, STAR seems to be more reliable and much faster (tried Single- and paired-end samples). Thanks again for all your suggestions.

(If possible, I will try to provide some detailed comparison soon with the dataset I used.)

0
Entering edit mode

if possible share a brief about STAR workflow

thank you