Question: Rna Seq: Alignment Pipeline Using Tophat - Opinion Pls!
gravatar for k.nirmalraman
5.7 years ago by
k.nirmalraman990 wrote:

Dear All,

Can some one help me with the following? Are there any pipelines that one would like to suggest on the other hand that I can refer to?

Using: Tophat2, Bowtie2

  • I have multiplexed single end sequencing data with 40bp reads and I would like to perform differential analysis on this.
  • I am using the index files from Tophat index & annotation downloads for UCSC mouse mm10

I use the following Tophat command to do the alignment

tophat -p 4 -N 3 --no-coverage-search --read-edit-dist 3 --output-dir <path> /Bowtie2Index/genome <input_path>

I would like to know how can I improve the alignment using tophat. I seem to have the following statistics on alignment

`3208092 reads; of these:`
 3208092 (100.00%) were unpaired; of these:
 724883 (22.60%) aligned 0 times
 1845395 (57.52%) aligned exactly 1 time
637814 (19.88%) aligned >1 times
77.40% overall alignment rate`

This averall alignment percentage varies from 75% to 80% for different samples. Is this normal? I only wonder how to account for this 20% - 25% of reads. Increasing the number of mismatches could be one option but any suggestions on that?

Note: I also checked for alignment against phix controls using Bowtie, which is less than 0.15%(in all samples).

Can you also comment on the number of reads (depth) that is required for a good DE gene analysis for a single sample library (say with 3 bio-replicates and 2 tech replicates for each condition) or size of each sample library?.

Anything else, that you think I am missing out or consider?

Thank you very much!

ADD COMMENTlink modified 5.7 years ago by Devon Ryan90k • written 5.7 years ago by k.nirmalraman990
gravatar for Mikael Huss
5.7 years ago by
Mikael Huss4.6k
Mikael Huss4.6k wrote:

I think the mapping rate sounds OK considering that the reads are short.

Alternatives include:

  • Use STAR instead of TopHat (usually gives me higher mapping rate)
  • Use the --transcriptome-index option in TopHat (also usually increases mapping rate)
  • Try adapter and quality trimming before mapping (may or may not change the results)

Are you saying that you have 3 tech replicates for each bio replicate? It's hard to say how many reads you "need" for DE analysis but the consensus seems to be that it is more important to have bio replicates than deep sequencing when it comes to DE analysis.

ADD COMMENTlink written 5.7 years ago by Mikael Huss4.6k

Thanks a lot!!

An yeah... Sorry it is 2 tech replicates!! My mistake!! Can I extend this question further: I used HTSeq on the TopHat output (bam to sam) for counting features... I had the following kind of stats!

no_feature 782850 ambiguous 5151 too_low_aQual 0 not_aligned 0 alignment_not_unique 2153777

there were 4.1 M features in total... The total number of reads aligned to one of the genes is about 1.2 Million. About 2 million reads not being aligned uniquely is disturbing.

What do you think about this? Does this look normal too?

ADD REPLYlink written 5.7 years ago by k.nirmalraman990

For comparisons sake, I looked at a dataset that I aligned with STAR. Of the ~36 million (trimmed) reads for one sample, 31.6 million aligned uniquely. The equivalent numbers from HTSeq-count are: no_feature 3160988, ambiguous 526705, alignment_not_unique 5838118.

BTW, by 4.1M features, do you mean 4.1 million reads or that there were 4.1 million features in the annotation file that you fed to htseq-count?

ADD REPLYlink written 5.7 years ago by Devon Ryan90k

So it looks STAR gives better alignment than TopHat from your numbers!! I shall try STAR then...

4.1M is the total number in HTSeq output (no feature+ ambiguous+too_low_qQual+not_aligned+GeneCountsReported)

ADD REPLYlink written 5.7 years ago by k.nirmalraman990
gravatar for Devon Ryan
5.7 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

To add to what Mikael Huss mentioned, there's not much benefit to technical replicates except to get enough reads. It's vastly more useful to have biological replicates. The exact number of replicates and reads needed is almost impossible to predict a priori, since it will depend entirely on the exact type of experiment you're doing. Have a look at scotty, which allows you to estimate needed sample sizes given results from similar experiments. There's also a Bioconductor package that says it can do power calculations for RNAseq, but I'd be pretty skeptical about its output.

Edit: If the number of reads that you have in your example is typical of the replicates that you have, then that's way too few unless you're expecting huge changes.

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by Devon Ryan90k

Thank you very much for those links. But I wonder is there anything like a standard number/range of reads, which you think would be a safe bet?

ADD REPLYlink written 5.7 years ago by k.nirmalraman990

I usually get 20-30 million reads/sample and have at least 6 samples/group for my experiments. That's the minimum number of replicates that I'm comfortable working with for my experiments and would do more except the number of mice needed to do that would be huge (our samples come from independent mouse litters and we're constantly filling all of our cage space as is...). There are no standard numbers for these sorts of things, since what's needed can vary drastically (not to mention that prices have only recently become low enough that people have been able to sequence sufficient numbers of replicates).

ADD REPLYlink written 5.7 years ago by Devon Ryan90k

I sure agree with that.. Thanks!! I also guessed 20 -30M would be safe!! Thanks again for clarifying!

ADD REPLYlink written 5.7 years ago by k.nirmalraman990
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2121 users visited in the last hour