Tophat/cuffdiff versus CLC for mapping RNA-seq reads
7.7 years ago
muppetleague ▴ 10

I've noticed poor overlap between the sets of differentially expressed genes reported by CLC's 'Empirical Analysis of DGE' (an edgeR-based test) and by Cuffdiff. CLC maps 20% more reads on average and finds 200-400% more differentially expressed genes. My understanding is that CLC uses a proprietary read mapper that usually gives a higher mapping percentage, while Tophat's probabilistic model spends more time resolving overlapping ends to reduce random placement. However, it is less obvious how to relax the stringency of Tophat's parameters to account for the high amount of variation between the hybrids we sequenced and our reference genome. Any insight would be greatly appreciated.

RNA-Seq Tophat Cuffdiff CLC EdgeR • 2.8k views
7.7 years ago
Dan D 7.3k

Tophat is very flexible, so you'll probably want to experiment with its parameters. Here are some to check out, taken from the manual (which is well worth reading through to get an idea of the knobs you can manipulate):

-N The maximum number of mismatches a read alignment can have before it is discarded

-a How many bases must be mapped on each side of a junction

-m How many mismatches are allowed in the anchor region of a spliced alignment

--report-secondary-alignments Report secondary alignments as well, not just the best ones

-g The maximum total number of alignments that will be reported for a read

--segment-length You can reduce this from the default of 25 to cut the reads into smaller chunks, which makes the mapping more generous

--segment-mismatches How many mismatches each segment is allowed to have

You can also try --b2-very-sensitive for a quick setup. It is a Bowtie2 preset, so it only applies when Tophat is running on top of Bowtie2.
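To make the options above concrete, here is a minimal sketch that assembles a relaxed Tophat command line in Python without running it. The index name, FASTQ file names, output directory and the specific values are placeholders, not recommendations. The defaults in the signature are what I recall from the TopHat2 manual, so check them against your version. Passing --read-edit-dist alongside -N assumes TopHat2, which as far as I know wants the edit distance to be at least as large as the mismatch limit.

```python
import shlex


def build_tophat_command(index_base, reads, out_dir, *,
                         read_mismatches=2, min_anchor=8,
                         splice_mismatches=0, max_multihits=20,
                         segment_length=25, segment_mismatches=2,
                         very_sensitive=False):
    """Return a Tophat argument list. Nothing is executed here."""
    cmd = [
        "tophat",
        "-o", out_dir,
        "-N", str(read_mismatches),
        # Assumption: keep the edit distance in step with -N (TopHat2).
        "--read-edit-dist", str(read_mismatches),
        "-a", str(min_anchor),
        "-m", str(splice_mismatches),
        "-g", str(max_multihits),
        "--segment-length", str(segment_length),
        "--segment-mismatches", str(segment_mismatches),
    ]
    if very_sensitive:
        cmd.append("--b2-very-sensitive")  # Bowtie2 preset
    # Positional arguments come last: index base name, then read files.
    cmd.append(index_base)
    cmd.extend(reads)
    return cmd


# Hypothetical inputs, for illustration only.
cmd = build_tophat_command(
    "hybrid_ref_index",
    ["sample_R1.fastq", "sample_R2.fastq"],
    "tophat_relaxed",
    read_mismatches=4,
    segment_length=20,
    segment_mismatches=3,
    very_sensitive=True,
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once the values look sensible for your data.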

Without knowing more about your experiment, I can't recommend specific starting parameters, but it's worth varying the parameters of interest by large amounts and comparing the resulting alignments side by side in IGV or IGB.
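One way to organise that kind of comparison is a small parameter sweep, sketched below in Python. It only generates the command lines, one output directory per combination, so each run's alignments can be loaded as a separate track. The grid values, index name and read file are made-up placeholders, and pairing --read-edit-dist with -N is again an assumption about TopHat2.

```python
from itertools import product

# Hypothetical grid: the flags to vary and the values to try for each.
GRID = {
    "-N": [2, 4, 6],
    "--segment-length": [25, 20],
    "--segment-mismatches": [2, 3],
}


def sweep_commands(index_base, reads, grid):
    """Build one Tophat argument list per combination of grid values."""
    flags = list(grid)
    commands = []
    for values in product(*(grid[flag] for flag in flags)):
        # Encode the settings in the output directory name so the runs
        # are easy to tell apart later.
        tag = "_".join(f"{flag.lstrip('-')}{value}"
                       for flag, value in zip(flags, values))
        cmd = ["tophat", "-o", f"tophat_{tag}"]
        for flag, value in zip(flags, values):
            cmd += [flag, str(value)]
            if flag == "-N":
                # Assumption: edit distance follows the mismatch limit.
                cmd += ["--read-edit-dist", str(value)]
        cmd += [index_base, *reads]
        commands.append(cmd)
    return commands


commands = sweep_commands("hybrid_ref_index", ["sample_R1.fastq"], GRID)
for cmd in commands:
    print(" ".join(cmd))
```

Running this on a subsample of reads first keeps the sweep cheap. Each output directory should then contain its own accepted_hits.bam to open in the genome browser.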