After mapping my paired-end RNA-seq reads with TopHat using three different sets of options (given below), I obtained the mapping summary shown in the table below.
1) Without reference annotation:
tophat -p 8 -r 50 -o "output" "indexed_genome_file" R1.fastq R2.fastq
2) With reference annotation:
tophat -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq
3) With reference annotation, disabling novel junctions:
tophat --no-novel-juncs -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq
Mapping reads to genome with TopHat

| Metric | With reference annotation | With reference annotation disabling novel junctions | Without reference annotation |
|---|---|---|---|
| Left reads: input | 71926313 | 71926313 | 71926313 |
| Left reads: mapped | 62199375 (86.5% of input) | 60663645 (84.3% of input) | 61835864 (86.0% of input) |
| Left reads: multiple alignment | 10352306 (16.6%) (477254 have >20) | 11540865 (19.0%) (508565 have >20) | 15034571 (24.3%) (665149 have >20) |
| Right reads: input | 71926313 | 71926313 | 71926313 |
| Right reads: mapped | 62071170 (86.3% of input) | 60575371 (84.2% of input) | 61694450 (85.8% of input) |
| Right reads: multiple alignment | 10352990 (16.7%) (477253 have >20) | 11553883 (19.1%) (508573 have >20) | 15030545 (24.4%) (665010 have >20) |
| Overall mapping rate | 86.40% | 84.30% | 85.90% |
| Aligned pairs | 57244789 | 55041529 | 56591033 |
| Aligned pairs: multiple alignment | 9527350 (16.6%) | 10609333 (19.3%) | 13776265 (24.3%) |
| Aligned pairs: discordant alignment | 4048217 (7.1%) | 4044795 (7.3%) | 3755391 (6.6%) |
| Concordant alignment | 74.00% | 70.90% | 73.50% |
| No. of junctions | 144075 | 97906 | 140296 |
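For anyone checking how the pair-level percentages relate to the raw counts, here is a small sketch that recomputes them for the "with reference annotation" run. The helper name `pct` is made up for illustration, and the formula for the concordant rate, (aligned pairs - discordant pairs) / input pairs, is inferred from the numbers in the table rather than taken from the TopHat documentation.

```shell
# pct NUMERATOR DENOMINATOR: print a percentage with one decimal place.
# (Hypothetical helper, not part of TopHat.)
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Counts copied from the "with reference annotation" column of the table.
input=71926313       # input read pairs
aligned=57244789     # aligned pairs
multi=9527350        # aligned pairs with multiple alignments
discordant=4048217   # aligned pairs with discordant alignments

echo "multiple:   $(pct "$multi" "$aligned")% of aligned pairs"
echo "discordant: $(pct "$discordant" "$aligned")% of aligned pairs"
# Assumed formula: concordant pairs = aligned pairs minus discordant pairs.
echo "concordant: $(pct "$((aligned - discordant))" "$input")% of input pairs"
```

With these inputs the three values come out as 16.6, 7.1 and 74.0, which agree with the table. Swapping in the counts from the other two columns gives 70.9 and 73.5 for the concordant rate.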
Accordingly, I thought “with reference annotation” was the best option. But when I viewed the BAM file together with the junctions, I found many junctions with high depth between very distantly located genes. My genes of interest are duplicated genes. I guess that pre-filtering multi-mapped reads, along with some other arguments, would further improve the mapping, so I thought of running it with the following options:
tophat -p 8 -G genes.gtf -o SRX528281_tophat_RABM_Prefilter --no-mixed --no-discordant --max-multihits 2 --prefilter-multihits --read-realign-edit-dist 0 Genome R1.fastq R2.fastq
Is my approach correct? Will the options I included improve the mapping without excluding important information? Any suggestions will be highly appreciated.
I think you are just making it complicated. If you have a GTF file, just use it. If you are not interested in novel transcripts, disable novel junction discovery.
Anyway, quantitative changes like these exist even if you run the tool with the same set of options multiple times.
Some of the duplicated genes are highly similar in sequence. So I think that if I do not filter the multiple hits, some false-positive novel junctions will appear to be well supported. Although I have a GTF file, the genes I am interested in are mostly predicted. So from the RNA-seq results I am trying to re-annotate the genes and also to check whether they have isoforms.
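One way to see how much multi-mapping is actually present before and after filtering is to tabulate the `NH` tag (number of reported alignments for the read) in the alignments. The sketch below assumes the aligner writes `NH:i:` tags, which TopHat does as far as I know. The function name `count_by_nh` is made up for illustration.

```shell
# count_by_nh: read SAM records on stdin and print, for each NH value,
# the number of alignment records carrying it (tab-separated, sorted by NH).
# Note: this counts alignment records, not distinct reads, so a read with
# NH:i:2 normally contributes two records.
count_by_nh() {
    awk -F '\t' '
        /^@/ { next }                        # skip SAM header lines
        {
            for (i = 12; i <= NF; i++)       # optional tags start at column 12
                if ($i ~ /^NH:i:/) { n[substr($i, 6)]++; break }
        }
        END { for (k in n) print k "\t" n[k] }
    ' | sort -n
}
```

Typical usage would be something like `samtools view accepted_hits.bam | count_by_nh`, run once per TopHat output directory. Comparing the tables from the different runs shows how many alignments `--max-multihits` and `--prefilter-multihits` removed, and restricting the input to the region of the duplicated genes shows whether the suspicious junctions are supported mainly by records with NH greater than 1.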