I am analyzing paired-end RNA-seq data. I've been using TopHat to align to my reference genome. When I align using the default settings and calculate the insert size, I get very inconsistent results and large standard deviations across various tools (Picard, Qualimap, etc.). I was hoping to estimate the mean inner distance and then go back and re-run TopHat. While changing this parameter does not seem to affect the alignment much, my ultimate results using Cufflinks and Cuffdiff to find DE genes are greatly affected.

Searching for an answer, I found some information suggesting that aligning to a reference genome for this calculation may be misleading because of introns, so I tried aligning to a reference transcriptome using Bowtie and calculated yet another insert size value (which results in a negative mean inner distance). This value does, however, seem to be quite consistent across my different data sets.

Could anyone explain to me the proper way to calculate the insert size for paired-end data? Is initially mapping to a reference transcriptome the way to go?
I think when TopHat asks for an inner distance, what it really depends on is the actual fragment size of the library that was sequenced (the inner distance is just the fragment size minus the two read lengths). As you say, the alignment data will give you a very poor estimate of the mean inner distance because of introns: when the two ends of a fragment land on different exons, there's no way to know which introns were spliced out of the fragment, and thus what the actual fragment size was. How long is a piece of string.
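To make the terminology concrete, here is a minimal sketch (with hypothetical numbers, not taken from the poster's data) of how insert/fragment size relates to TopHat's inner distance, and why a negative inner distance is perfectly possible:

```python
# Hypothetical example: 2 x 100 bp paired-end reads from a 180 bp fragment.
read_len = 100          # length of each read in the pair
insert_size = 180       # fragment size, e.g. estimated from a transcriptome mapping

# Inner distance = fragment size minus both read lengths (the gap between mates).
inner_dist = insert_size - 2 * read_len
print(inner_dist)       # -20: negative means the two reads overlap in the middle
```

A consistently negative mean inner distance after mapping to the transcriptome, as described in the question, simply means the fragments are shorter than the combined read length, so the mates overlap.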
So I'd track down the original library, or whoever prepared it, and ask them what the insert size was.
Use BBMap tools as recommended by @Brian in this answer (Post #2) to estimate insert/fragment size: http://seqanswers.com/forums/showthread.php?t=49548
Thank you for the suggestion. I tested BBMap on a few data sets and found the avg. insert size to be larger than expected. However, looking at the histograms, I believe the median insert sizes to be better estimates. These sizes are also very comparable to my previous results using Picard after mapping to the transcriptome. The only difference is the std. deviations, which are much larger with BBMap. I guess my question has now become: are my initial transcriptome-mapping/Picard estimates valid? Both approaches should theoretically address the problem of the introns, correct?
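The preference for the median here can be illustrated with a toy sample (hypothetical values, not the poster's data): a long right tail of intron-spanning or mis-paired alignments inflates the mean, while the median barely moves.

```python
from statistics import mean, median

# Toy insert-size sample: most fragments near 200 bp, plus one
# extreme value from an alignment spanning a large intron.
sizes = [185, 190, 195, 200, 200, 205, 210, 215, 5000]

print(mean(sizes))    # ≈733.3 — dragged far from the bulk of the data by the tail
print(median(sizes))  # 200   — robust to the outlier
```

This is why, for a skewed histogram, the median is usually the more trustworthy single-number summary of fragment size.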
Using the aligner's own alignments to estimate inner distances, then feeding those back in to get more aligned fragments, is the definition of overfitting a model.
I'm not saying it isn't done. I'm not saying you can't do it. I'm just saying you will probably get closer to empirical truth by using BBMap/STAR instead of TopHat to begin with, and forgetting about optimising TopHat altogether. Also, mapping RNA-Seq data is a really hard problem: it's always going to be a trade-off between sensitivity and accuracy. The tools therefore will not agree, and probably shouldn't. Many people try to overcome this by taking the union of two or three tools, but I think this has been shown to amplify errors rather than improve mapping. In short, choose your fav. aligner and stick to it :) You can spend days comparing the outputs of these tools, but you're better off spending those days looking at just one of those results in detail.
As for the standard deviations being different but the medians the same: is it possible that one tool outputs the standard deviation and the other upper/lower quartiles or the standard error? They might all be std. dev., but whenever I hear that, alarm bells ring to check that they're actually measuring the same thing.
I concur. At this point in time you should either use HISAT2 (the tool that superseded TopHat) or one of the standalone splice-aware aligners.
I don't think the inner-distance settings are going to drastically change the alignment result. They may affect a few edge cases, but most reads should align in their proper spot with most aligners. If you only used a million reads, as @Brian had suggested, to get the insert size, then you need to keep in mind that you are sampling a small subset of your data.
Thank you both for your comments, I very much appreciate your help and input. What you've said makes a lot of sense.
John, what you mentioned about whether the tools are measuring the same thing made me look into how both Picard (InsertSizeMetrics) and BBMap were calculating the average insert size and std. dev. The histograms revealed the difference. From what I understand, the InsertSizeMetrics tool trims the "core" distribution around the median insert size using a specified number of median absolute deviations, "to prevent calculation of nonsensical mean and std. dev. values."
This doesn't appear to be the case with BBMap. But when I trim the BBMap distribution myself and recalculate, the average insert size and std. dev. match very closely between the two tools.
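The trimming described above can be sketched in a few lines. This is a minimal illustration of median-absolute-deviation "core" trimming in the spirit of what Picard's InsertSizeMetrics is described as doing; the cutoff of 10 deviations and the sample values are assumptions for the example, not Picard's exact algorithm.

```python
from statistics import mean, median, stdev

def trim_core(sizes, deviations=10):
    """Keep only values within `deviations` median absolute deviations
    of the median, discarding the tails before computing summary stats."""
    med = median(sizes)
    mad = median(abs(x - med) for x in sizes)
    lo, hi = med - deviations * mad, med + deviations * mad
    return [x for x in sizes if lo <= x <= hi]

# Toy histogram sample with one intron-spanning outlier at 9000 bp.
sizes = [180, 190, 200, 200, 210, 220, 9000]
core = trim_core(sizes)
print(mean(core), stdev(core))   # 200 and ~14.1 once the outlier is dropped
```

Applying the same trim to both tools' histograms before computing the mean and std. dev. is what makes the numbers comparable, as CF found.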
Genomax2, I will look into switching over to HISAT2 as you suggest. Thanks again!
Great detective work! That's really good to know for all of us - thanks CF :)