Question

Tophat --max-multihits option

9.1 years ago

biolab ★ 1.4k

Dear all

I have a simple question: TopHat2 has a --max-multihits option. If I set it to 1, does it mean that each read is mapped to a unique locus? This will lose many reads from multi-copy genes (for example, actin genes). Could you please explain why some studies use only "uniquely mapped" reads?

I appreciate any of your comments. Thank you very much!

tophat • 3.9k views

updated 22 months ago by Ram 43k • written 9.1 years ago by biolab ★ 1.4k

Ram · Answer 1 · 2015-04-15

Yes, with --max-multihits 1 you're going to get only uniquely mapped reads. This is not as bad an idea as it may seem: a lot of programs used in subsequent steps of the analysis only use uniquely mapped reads anyway (HTSeq-count, for one). This approach is very conservative and you lose quite a large number of reads, but in my experience (I also ran a few simulations to check this) the results are very reliable. Basically, all the alternatives (such as RSEM) rely on some assumptions. For example, what happens if the ratio between the expression of two paralogs differs between two conditions (because of differential splicing, say)? You will get a bias in the fold change estimate. If you only consider unambiguous reads, on the other hand, you will only get a lower significance for any differential expression. Of these two scenarios I prefer the latter.

Have a look at these slides to make it clearer. (The simulations are based on SMN1 and SMN2, which to my knowledge are two of the most similar paralogs in the human genome: their sequences differ by only 2 mismatches. With 100 bp SE reads, 85% of the total reads from these two genes will be ambiguous, i.e. multi-mapped. 1000 simulations are plotted. The DE analysis was done with DESeq2.)

https://www.dropbox.com/s/6w55godj2wetbed/unambiguous_counts.pdf?dl=0
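The fold change argument above can be illustrated with a small arithmetic sketch. This is not the simulation from the slides and it is not how RSEM allocates reads (RSEM uses an expectation-maximization model). It uses a deliberately naive equal split of ambiguous reads as a stand-in for "an allocation whose assumption does not hold". The read totals are hypothetical; only the 85% ambiguous fraction comes from the text.

```python
# Toy illustration: unique-only counting vs. a naive equal split of
# ambiguous reads between two paralogs. All read totals are hypothetical.

AMBIGUOUS_FRACTION = 0.85  # share of reads that multi-map (figure quoted above)


def unique_counts(true_reads):
    """Reads that map unambiguously to each gene."""
    return {g: n * (1 - AMBIGUOUS_FRACTION) for g, n in true_reads.items()}


def equal_split_counts(true_reads):
    """Unique reads plus an equal share of the pooled ambiguous reads.

    Assumption for illustration only: the pool is divided evenly,
    regardless of the true expression ratio of the paralogs.
    """
    unique = unique_counts(true_reads)
    pool = sum(true_reads.values()) * AMBIGUOUS_FRACTION
    share = pool / len(true_reads)
    return {g: unique[g] + share for g in true_reads}


def fold_changes(counts_a, counts_b):
    """Per-gene ratio of condition B over condition A."""
    return {g: counts_b[g] / counts_a[g] for g in counts_a}


# Hypothetical truth: SMN1 doubles between conditions, SMN2 is unchanged.
cond1 = {"SMN1": 1000, "SMN2": 1000}
cond2 = {"SMN1": 2000, "SMN2": 1000}

fc_unique = fold_changes(unique_counts(cond1), unique_counts(cond2))
fc_split = fold_changes(equal_split_counts(cond1), equal_split_counts(cond2))

for gene in cond1:
    print(gene, "unique-only:", round(fc_unique[gene], 3),
          "equal split:", round(fc_split[gene], 3))
```

With these numbers the unique-only counts recover the true fold changes (2 and 1) from far fewer reads, while the equal split pulls both genes towards each other (about 1.58 and 1.43). Fewer reads means less statistical power, which is the "lower significance" trade-off described above.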

score 0 · Answer 2 · 2016-09-20

To complete the response of Martombo, uniquely mapped reads can be useful when you are studying a specific biological process such as translation (with ribosome profiling), for instance. When we filter the data from the sequencer, we select good-quality reads, and the mapping is then done keeping only uniquely mapped reads!

So some duplicated regions will be dropped (I mean, no reads will map to these regions), but other mappings can be used to study these special cases if needed. ;)
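Keeping only uniquely mapped reads can also be done explicitly after alignment. Below is a minimal sketch that filters SAM text on the standard NH tag (number of reported alignments for the read). It assumes the aligner writes NH and was run with a multihit limit above 1, so that NH can actually exceed 1. The TopHat manual describes --max-multihits as a cap on the number of alignments reported per read, so an explicit check like this may be worth doing. The function names and the example records are hypothetical.

```python
# Sketch: keep only reads whose NH tag says they were reported at one locus.
# Works on plain SAM lines; function names are illustrative, not a library API.

def is_uniquely_mapped(sam_line):
    """Return True if the record is mapped and carries NH:i:1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # 0x4 = read is unmapped
        return False
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag: uniqueness cannot be confirmed


def filter_unique(lines):
    """Yield header lines unchanged and alignment lines with NH == 1."""
    for line in lines:
        if line.startswith("@") or is_uniquely_mapped(line):
            yield line


# Hypothetical records: one unique hit, one multi-mapped read, one unmapped read.
header = "@HD\tVN:1.0\tSO:coordinate"
unique_read = "read1\t0\tchr5\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1"
multi_read = "read2\t0\tchr5\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:2"
unmapped_read = "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"

kept = list(filter_unique([header, unique_read, multi_read, unmapped_read]))
for line in kept:
    print(line)
```

For real data the same logic would normally be applied to BAM files with a dedicated tool. The plain-text version here is only meant to make the filtering criterion explicit.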