Question

Rna Seq: Alignment Pipeline Using Tophat - Opinion Pls!

2

Entering edit mode

10.6 years ago

k.nirmalraman ★ 1.1k

Dear All,

Can some one help me with the following? Are there any pipelines that one would like to suggest on the other hand that I can refer to?

Using: Tophat2, Bowtie2

I have multiplexed single end sequencing data with 40bp reads and I would like to perform differential analysis on this.
I am using the index files from Tophat index & annotation downloads for UCSC mouse mm10

I use the following Tophat command to do the alignment

tophat -p 4 -N 3 --no-coverage-search --read-edit-dist 3 --output-dir <path> /Bowtie2Index/genome <input_path>

I would like to know how can I improve the alignment using tophat. I seem to have the following statistics on alignment

`3208092 reads; of these:`
 3208092 (100.00%) were unpaired; of these:
 724883 (22.60%) aligned 0 times
 1845395 (57.52%) aligned exactly 1 time
637814 (19.88%) aligned >1 times
77.40% overall alignment rate`

This averall alignment percentage varies from 75% to 80% for different samples. Is this normal? I only wonder how to account for this 20% - 25% of reads. Increasing the number of mismatches could be one option but any suggestions on that?

Note: I also checked for alignment against phix controls using Bowtie, which is less than 0.15%(in all samples).

Can you also comment on the number of reads (depth) that is required for a good DE gene analysis for a single sample library (say with 3 bio-replicates and 2 tech replicates for each condition) or size of each sample library?.

Anything else, that you think I am missing out or consider?

Thank you very much!

tophat alignment rnaseq bowtie pipeline • 7.4k views

ADD COMMENT • link updated 10.6 years ago by Devon Ryan 104k • written 10.6 years ago by k.nirmalraman ★ 1.1k

score 3 · Answer 1 · 2013-10-01

3

Entering edit mode

10.6 years ago

Mikael Huss 4.8k

I think the mapping rate sounds OK considering that the reads are short.

Alternatives include:

Use STAR instead of TopHat (usually gives me higher mapping rate)
Use the --transcriptome-index option in TopHat (also usually increases mapping rate)
Try adapter and quality trimming before mapping (may or may not change the results)

Are you saying that you have 3 tech replicates for each bio replicate? It's hard to say how many reads you "need" for DE analysis but the consensus seems to be that it is more important to have bio replicates than deep sequencing when it comes to DE analysis.

ADD COMMENT • link 10.6 years ago by Mikael Huss 4.8k

0

Entering edit mode

Thanks a lot!!

An yeah... Sorry it is 2 tech replicates!! My mistake!! Can I extend this question further: I used HTSeq on the TopHat output (bam to sam) for counting features... I had the following kind of stats!

no_feature 782850 ambiguous 5151 too_low_aQual 0 not_aligned 0 alignment_not_unique 2153777

there were 4.1 M features in total... The total number of reads aligned to one of the genes is about 1.2 Million. About 2 million reads not being aligned uniquely is disturbing.

What do you think about this? Does this look normal too?

ADD REPLY • link 10.6 years ago by k.nirmalraman ★ 1.1k

1

Entering edit mode

For comparisons sake, I looked at a dataset that I aligned with STAR. Of the ~36 million (trimmed) reads for one sample, 31.6 million aligned uniquely. The equivalent numbers from HTSeq-count are: no_feature 3160988, ambiguous 526705, alignment_not_unique 5838118.

BTW, by 4.1M features, do you mean 4.1 million reads or that there were 4.1 million features in the annotation file that you fed to htseq-count?

ADD REPLY • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

So it looks STAR gives better alignment than TopHat from your numbers!! I shall try STAR then...

4.1M is the total number in HTSeq output (no feature+ ambiguous+too_low_qQual+not_aligned+GeneCountsReported)

ADD REPLY • link 10.6 years ago by k.nirmalraman ★ 1.1k

score 1 · Answer 2 · 2013-10-01

1

Entering edit mode

10.6 years ago

Devon Ryan 104k

To add to what Mikael Huss mentioned, there's not much benefit to technical replicates except to get enough reads. It's vastly more useful to have biological replicates. The exact number of replicates and reads needed is almost impossible to predict a priori, since it will depend entirely on the exact type of experiment you're doing. Have a look at scotty, which allows you to estimate needed sample sizes given results from similar experiments. There's also a Bioconductor package that says it can do power calculations for RNAseq, but I'd be pretty skeptical about its output.

Edit: If the number of reads that you have in your example is typical of the replicates that you have, then that's way too few unless you're expecting huge changes.

ADD COMMENT • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you very much for those links. But I wonder is there anything like a standard number/range of reads, which you think would be a safe bet?

ADD REPLY • link 10.6 years ago by k.nirmalraman ★ 1.1k

1

Entering edit mode

I usually get 20-30 million reads/sample and have at least 6 samples/group for my experiments. That's the minimum number of replicates that I'm comfortable working with for my experiments and would do more except the number of mice needed to do that would be huge (our samples come from independent mouse litters and we're constantly filling all of our cage space as is...). There are no standard numbers for these sorts of things, since what's needed can vary drastically (not to mention that prices have only recently become low enough that people have been able to sequence sufficient numbers of replicates).

ADD REPLY • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

I sure agree with that.. Thanks!! I also guessed 20 -30M would be safe!! Thanks again for clarifying!

ADD REPLY • link 10.6 years ago by k.nirmalraman ★ 1.1k