For alignment, TopHat is the only one I know of that is intended for spliced reads (transcriptome). You'll get poor results with a genomic aligner. Nothing is perfect, even with tophat, I have found several islands of reads that mapped equally well in several places, and tophat was forced to choose one probabilistically, leaving me with some reads that belong to a gene, just floating in the middle of no where.
For differential expression, the team that made tophat has a continuation tool called CuffLinks that constructions alternate isoforms (multiple transcript variants) and assigns probabilistic expression levels to them quite well. Here you have a choice of supplying it an annotation for exons vs letting it decide blindly. I let it construct novel transcripts because that's what we really wanted to see, but found this to be a noisy process (many novel forms that look like mistakes). Then when every gene has three to six transcripts, comparing expression levels across samples is tricky indeed. Cufflinks will do it all for you, but I'm suspicious of just handing over the p-values to my bosses given they come from such tenuous associations.
Another tool for differential expression is called DESeq, an R/Bioconductor package that requires a set of read counts per region and just does the math on that. So it doesnt try to construct new transcripts, you just supply it a table of regions (I used the refseq gene list from UCSC) by counts (coverageBED is a tool to count how many reads in a huge BAM file fall onto those locations). DESeq isn't trying to make up new transcripts, and doesnt handle overlapping genes so well, (that's up to the human preparing the read count table).. but the math is sound and sensible. Simpler tools feel more trustworthy.
With about any fixed amount of reads, you'll see some transcripts quite clearly, and some only roughly. More reads is better, but youre never going to have enough to see the lightest expression. CL sort of hides this from you, but DESeq is clear about their estimations.
I have found in regions of poor coverage depth, CL can construct incorrect transcripts, (best guess). It shouldnt be expected to be accurate in the lack of data at low coverage genes (most of them!). Since DESeq starts with an annotated set of genes, and simple read counts, the results are more robust in low coverage areas, but I repeat it will not construct novel isoforms or do anything with reads falling outside of known annotated regions!
9.5 years ago by
Karl • 340