Selecting best RNA-seq mapping with TopHat
6.4 years ago
mjoyraj ▴ 80

I mapped my paired-end RNA-seq reads with TopHat using three different sets of options (given below). The mapping summaries for the three runs are shown in the table that follows.

1) Without reference annotation:

tophat -p 8 -r 50 -o "output" "indexed_genome_file" R1.fastq R2.fastq

2) With reference annotation:

tophat -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq

3) With reference annotation disabling novel junctions:

tophat --no-novel-juncs -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq
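As written, commands 2 and 3 both use `-o tophat_RABM`, so they would write into the same output directory. A minimal sketch that only prints the three commands with one output directory each is given here; the directory names `tophat_noannot`, `tophat_annot`, and `tophat_annot_nonovel` are hypothetical, and the flags are the ones from the post (including `-r 50` on the first run only).

```shell
#!/usr/bin/env bash
# Print (do not run) the three TopHat commands from the post,
# giving each run its own output directory.
# The output directory names are hypothetical placeholders.
tophat_cmds() {
  local genome="$1" gtf="$2" r1="$3" r2="$4"
  echo "tophat -p 8 -r 50 -o tophat_noannot $genome $r1 $r2"
  echo "tophat -p 8 -G $gtf -o tophat_annot $genome $r1 $r2"
  echo "tophat --no-novel-juncs -p 8 -G $gtf -o tophat_annot_nonovel $genome $r1 $r2"
}

tophat_cmds Genome genes.gtf R1.fastq R2.fastq
```

Printing first makes it easy to check the command lines before committing to three long alignment runs.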

     

Mapping reads to genome with TopHat

| | With reference annotation | With reference annotation, novel junctions disabled | Without reference annotation |
|---|---|---|---|
| Left reads: input | 71926313 | 71926313 | 71926313 |
| Left reads: mapped | 62199375 (86.5% of input) | 60663645 (84.3% of input) | 61835864 (86.0% of input) |
| Left reads: multiple alignments | 10352306 (16.6%) (477254 have >20) | 11540865 (19.0%) (508565 have >20) | 15034571 (24.3%) (665149 have >20) |
| Right reads: input | 71926313 | 71926313 | 71926313 |
| Right reads: mapped | 62071170 (86.3% of input) | 60575371 (84.2% of input) | 61694450 (85.8% of input) |
| Right reads: multiple alignments | 10352990 (16.7%) (477253 have >20) | 11553883 (19.1%) (508573 have >20) | 15030545 (24.4%) (665010 have >20) |
| Overall mapping rate | 86.40% | 84.30% | 85.90% |
| Aligned pairs | 57244789 | 55041529 | 56591033 |
| Aligned pairs: multiple alignments | 9527350 (16.6%) | 10609333 (19.3%) | 13776265 (24.3%) |
| Aligned pairs: discordant alignments | 4048217 (7.1%) | 4044795 (7.3%) | 3755391 (6.6%) |
| Concordant pair alignment rate | 74.00% | 70.90% | 73.50% |
| No. of junctions | 144075 | 97906 | 140296 |
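Because the three runs differ more in multi-mapped reads than in overall mapping rate, it can help to look at the share of input reads that map to exactly one location. The sketch below assumes that the "multiple alignment" count is a subset of the "mapped" count, and the helper name `unique_pct` is made up for illustration.

```shell
#!/usr/bin/env bash
# Percentage of input reads that map to exactly one location.
# Arguments: mapped reads, multi-mapped reads, input reads.
unique_pct() {
  local unique=$(( $1 - $2 ))
  awk -v u="$unique" -v n="$3" 'BEGIN { printf "%.1f\n", 100 * u / n }'
}

# Left-read counts copied from the table above.
echo "with annotation:     $(unique_pct 62199375 10352306 71926313)"
echo "no novel junctions:  $(unique_pct 60663645 11540865 71926313)"
echo "without annotation:  $(unique_pct 61835864 15034571 71926313)"
```

By this measure the run with the reference annotation keeps the most uniquely mapped left reads (roughly 72% of input, against roughly 68% and 65% for the other two runs).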

 

From these numbers, I concluded that mapping with the reference annotation is the best option. But when I viewed the BAM file together with the junctions, I found many junctions with high depth between very distantly located genes. My genes of interest are duplicated genes. I suspect that pre-filtering multi-mapped reads, along with some other options, would further improve the mapping, so I am thinking of running it with the following options:

tophat -p 8 -G genes.gtf -o SRX528281_tophat_RABM_Prefilter --no-mixed --no-discordant --max-multihits 2 --prefilter-multihits --read-realign-edit-dist 0 Genome R1.fastq R2.fastq

Is my approach correct? Will these options improve the mapping without excluding important information? Any suggestions would be highly appreciated.
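To see how many of the suspicious long, high-depth junctions each run produces, the `junctions.bed` file that TopHat writes to the output directory can be filtered directly. This is only a sketch: it assumes the usual layout of that file (BED12, two blocks per junction, score column holding the number of spanning alignments), and the function name and thresholds are hypothetical.

```shell
#!/usr/bin/env bash
# List junctions whose intron is at least min_len bases long and whose
# depth (BED score column) is at least min_depth.
# Arguments: junctions.bed, min_len, min_depth.
# Output columns: chrom, intron start, intron end, depth, intron length.
long_junctions() {
  awk -v min_len="$2" -v min_depth="$3" '
    BEGIN { FS = OFS = "\t" }
    /^track/ { next }                      # skip the track header line
    {
      split($11, sz, ",")                  # block sizes = read overhangs
      intron = $3 - $2 - sz[1] - sz[2]     # span between the two blocks
      if (intron >= min_len && $5 >= min_depth)
        print $1, $2 + sz[1], $3 - sz[2], $5, intron
    }' "$1"
}

# Example call (thresholds are placeholders, adjust to the genome):
# long_junctions tophat_RABM/junctions.bed 500000 20
```

Comparing this list before and after adding the stricter multi-hit options shows whether they actually remove the distant junctions.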

RNA-Seq alignment • 3.3k views

I think you are just making it complicated. If you have a GTF file, just use it. If you are not interested in novel transcripts, disable novel junctions.

Anyway, small quantitative differences like these exist even if you run the tool with the same set of options multiple times.


Some of the duplicated genes have highly similar sequences. So I think that if I do not filter the multiple hits, some false-positive novel junctions will show up with apparent significance. Although I have a GTF file, the genes I am interested in are mostly predicted. So I am trying to use the RNA-seq results to re-annotate these genes and to check whether they have isoforms.
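One way to check how much of the signal over the duplicated genes comes from multi-mapped reads is to tally alignments by their `NH` tag (the number of reported alignments for the read). A rough sketch that reads SAM text on standard input follows; the function name `nh_summary` is made up, and it counts alignment records rather than reads, so a read with `NH:i:2` contributes two records.

```shell
#!/usr/bin/env bash
# Tally SAM alignment records by NH tag: unique (NH=1), multi (NH>1),
# and records that carry no NH tag at all.
nh_summary() {
  awk '
    /^@/ { next }                          # skip SAM header lines
    {
      nh = ""
      for (i = 12; i <= NF; i++)           # optional tags start at field 12
        if ($i ~ /^NH:i:/) { nh = substr($i, 6); break }
      if (nh == "") other++
      else if (nh + 0 == 1) uniq++
      else multi++
    }
    END { printf "unique=%d multi=%d no_NH=%d\n", uniq, multi, other }'
}

# Example call on a TopHat output file:
# samtools view accepted_hits.bam | nh_summary
```

Restricting the input to the region of a duplicated gene pair (for example with a region argument to `samtools view` on an indexed BAM) gives the same tally for just those loci.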
