Question

Estimated Fragment Mean In Cuffdiff

11.5 years ago

apt.university ▴ 70

Dear all, Cuffdiff estimates my mean fragment length to be 188.34 with a standard deviation of 58. I am analyzing a paired-end library with 100-base reads, which means that, according to TopHat's definition, the mate-inner-dist should be 188 - (2 * 100) = -12. The sequencing center told me that the mean fragment length is in fact 320 (which I thought was without the primers), so I initially set mate-inner-dist to 120 and got 85% of the reads aligned (I got the number of aligned reads with samtools flagstat).
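To make the arithmetic explicit, here is a minimal sketch of the calculation, assuming the inner distance is the mean fragment length minus both read lengths. The function name `mate_inner_distance` is just an illustrative helper, not part of TopHat or Cuffdiff.

```python
def mate_inner_distance(fragment_mean, read_length):
    """Inner distance between mates: fragment length minus the two reads.

    A negative value means the two mates are expected to overlap.
    """
    return round(fragment_mean) - 2 * read_length


# Fragment mean estimated by Cuffdiff
print(mate_inner_distance(188.34, 100))  # -12

# Fragment mean reported by the sequencing center
print(mate_inner_distance(320, 100))  # 120
```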

On the other hand, using a mate-inner-dist of -12 and a standard deviation of 58 produces about 70% aligned reads. I have three issues:

1. If my inner distance is indeed -12, shouldn't my mates overlap by 12 bases on average? I aligned a few thousand sequences and none of them do.
2. I don't understand why the percentage of aligned reads is larger with a mate-inner-dist of 120.
3. Are there any other ways of getting useful statistics about a BAM alignment besides samtools (idxstats and flagstat)?

Thanks for any suggestions you might be able to provide!

Madi

rna-seq cuffdiff • 3.2k views

updated 6.2 years ago by Biostar 20 • written 11.5 years ago by apt.university ▴ 70

score 1 · Answer 1 · 2012-11-04

Yes, this (higher TopHat mapping rates with a "wrong" mate inner distance) is something of a mystery that I and others have observed. There are some discussions on SEQanswers about it that I don't have time to locate at the moment (sorry about that, I'm in a bit of a rush). Briefly, setting the mate inner distance "too high" somehow seems to give higher mapping rates, just as you have observed.

Useful stats about BAM files other than samtools: try

RSeQC (the bam_stat.py tool)
Picard CollectAlignmentSummaryMetrics (and other related tools)
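If you want a quick independent check of the fragment length, one option is to summarize the TLEN column of the alignments yourself. The following is only a rough sketch, not what Cuffdiff or Picard do internally: the helper name `tlen_stats` and the `max_tlen` cutoff are assumptions made for illustration, and the three records are made-up examples. For RNA-seq, TLEN is measured on the genome, so pairs that span an intron look longer than the real fragment, which is why the sketch discards large values.

```python
import statistics


def tlen_stats(sam_lines, max_tlen=1000):
    """Return (count, mean, std dev) of TLEN over properly paired reads.

    Only positive TLEN values are used so each pair is counted once.
    max_tlen is an arbitrary cutoff (an assumption) to drop pairs whose
    genomic span is inflated by introns.
    """
    tlens = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])  # column 2: bitwise FLAG
        tlen = int(fields[8])  # column 9: observed template length
        if flag & 0x2 and 0 < tlen <= max_tlen:  # 0x2 = properly paired
            tlens.append(tlen)
    if not tlens:
        return None
    return len(tlens), statistics.mean(tlens), statistics.pstdev(tlens)


# Made-up records for illustration only (QNAME, FLAG, ..., TLEN, SEQ, QUAL)
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t50\t100M\t=\t220\t220\t*\t*",
    "r1\t147\tchr1\t220\t50\t100M\t=\t100\t-220\t*\t*",
    "r2\t99\tchr1\t500\t50\t100M\t=\t580\t180\t*\t*",
    "r3\t65\tchr1\t900\t50\t100M\t=\t5800\t5000\t*\t*",
]
print(tlen_stats(example))
```

On real data you would pass in the text lines produced by `samtools view` for your BAM file instead of the example list.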

score 0 · Answer 2 · 2012-11-05

Thanks Mikael! I did find the thread on SEQanswers, and it is a long one that, as you said, reaches no conclusion! I tried CollectAlignmentSummaryMetrics and it seems to return exactly the same values for alignments made with different values of -r (mate-inner-dist). Furthermore, it reports a PF_HQ_MEDIAN_MISMATCHES of 0.8 (80%), which seems rather suspicious! I'll have to dig more into this. Thanks again for taking the time to answer my question.