Question

Migrating pipeline from TopHat to STAR

1

Entering edit mode

5.9 years ago

fadhil.abubaker ▴ 20

Hey all,

Our current pipeline for alignment and analysis uses TopHat + Cufflinks and we wish to replace TopHat with STAR. We are in particular interested in splice junctions so we ran STAR against a sample and compared both the junctions file from TopHat and STAR.

For some reason, TopHat seems to be reporting more junction reads than STAR, which goes against everything I've read online.

Here are the commands I used for both aligners:

STAR --genomeDir '/home/data/STAR_indexes' --runThreadN 32 --readFilesIn 1B_Stem_Cells_S42_R1_001.fastq 1B_Stem_Cells_S42_R2_001.fastq --outFileNamePrefix ./1B/1B_ --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --sjdbGTFfile /home/data/hg38_ref/hg38.refGene_gene_longest.gtf  --twopass1readsN -1 --twopassMode Basic

tophat2 --b2-sensitive --no-coverage-search -G /home/data/hg38_ref/hg38_refGene_gene_longest.gtf -p 32 -o ./1A /home/data/bowtie2idx/hg38 1A_Stem_Cells_S41_R1_001.fastq 1A_Stem_Cells_S41_R2_001.fastq

Here are the first few lines of the sorted junctions file:

TopHat junctions.bed

chr1    14679   15100   JUNC00000001    1074    -       14679   15100   255,0,0 2       150,131 0,290
chr1    14902   185595  JUNC00000002    453     -       14902   185595  255,0,0 2       75,97   0,170596
chr1    14903   15907   JUNC00000003    215     -       14903   15907   255,0,0 2       135,112 0,892
chr1    15927   16728   JUNC00000004    60      -       15927   16728   255,0,0 2       20,122  0,679
chr1    15962   16714   JUNC00000005    109     -       15962   16714   255,0,0 2       65,108  0,644
chr1    15962   187236  JUNC00000006    42      -       15962   187236  255,0,0 2       65,108  0,171166
chr1    16606   187273  JUNC00000007    68      -       16606   187273  255,0,0 2       42,103  0,170564
chr1    16606   187245  JUNC00000008    54      -       16606   187245  255,0,0 2       48,69   0,170570
chr1    16716   17001   JUNC00000009    177     -       16716   17001   255,0,0 2       49,144  0,141
chr1    16857   187577  JUNC00000010    95      -       16857   187577  255,0,0 2       93,105  0,170615
chr1    16912   187577  JUNC00000011    23      -       16912   187577  255,0,0 2       104,39  0,170626
chr1    16908   17368   JUNC00000012    1277    -       16908   17368   255,0,0 2       147,136 0,324
chr1    16912   17719   JUNC00000013    24      -       16912   17719   255,0,0 2       143,114 0,693
chr1    16919   17974   JUNC00000014    13      -       16919   17974   255,0,0 2       136,60

STAR sj.out.tab

chr1    14830   14929   2       2       1       0       4       69
chr1    14830   14969   2       2       1       117     411     73
chr1    14830   15020   2       2       1       0       3       18
chr1    15039   15795   2       2       1       20      117     69
chr1    15039   186316  2       2       1       0       362     70
chr1    15948   16606   2       2       1       0       4       19
chr1    16028   16606   2       2       1       0       7       49
chr1    16766   16853   2       2       1       1       27      48
chr1    16766   16857   2       2       1       0       21      48
chr1    17056   17232   2       2       1       45      416     75

As you can see, STAR seems to be barely reporting any reads over the junctions, and I'm not sure why.

Has anyone in the community tried moving away from TopHat to STAR and encountered problems with junction reads?

If not, what are some recommended aligners for working with splice junctions?

RNA-Seq rna-seq alignment • 1.3k views

ADD COMMENT • link updated 5.9 years ago by Ram 43k • written 5.9 years ago by fadhil.abubaker ▴ 20

1

Entering edit mode

Note that you can also run STAR in two-pass mode, which might be more accurate.

What makes you think having more junction reads is better?

ADD REPLY • link 5.9 years ago by WouterDeCoster 47k

0

Entering edit mode

I did try two-pass mode with STAR but it barely made any difference. It's just that the discrepancy between the two is so large I'm not actually sure which to trust.

ADD REPLY • link 5.9 years ago by fadhil.abubaker ▴ 20

1

Entering edit mode

Tophat has been shown to be inferior in pretty much every way. That it produces more junctions suggests that more of them are wrong.

ADD REPLY • link 5.9 years ago by Devon Ryan 104k

1

Entering edit mode

Yes, no one should be using tophat anymore. This paper (https://www.nature.com/articles/nmeth.3317.pdf) explains the differences between the programs well and why HISAT/STAR are more superior. The makers of tophat recommend not using it anymore. Per the tophat2 website:

"Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way"

ADD REPLY • link 5.9 years ago by achits ▴ 20

0

Entering edit mode

Is it safe to say that TopHat2's results above are wrong and should not be used for any downstream analysis?

Does anyone know of a different program that I can use to extract splice junction information? This way I have a third dataset for compare.

ADD REPLY • link 5.9 years ago by fadhil.abubaker ▴ 20