Question

HISAT2 and Stringtie: difference in Stringtie results when using pre-built indexes or creating new ones in HISAT2

1

7.0 years ago

iraun 6.2k

Hello!

I'm using HISAT2 to map a paired-end RNA-Seq dataset. The reference genome I want to align against is GRCh37. I downloaded the pre-built genome_snp_tran index and ran hisat2 with it. On the other hand, I also tried to create my own index using the following command:

hisat2-build -p 6 GRCh37-75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa HISAT_index/

I ran StringTie on both BAM files and the results are quite different. Can anyone help me understand why? Is it because I'm not using SNP and transcript information when generating my own index? The pre-built one is described as: genome_snp_tran: HGFM index for reference plus SNPs and transcripts

Is it better to use pre-built indexes rather than creating new ones?

EDIT

Here are some of the differences seen in the StringTie output for the two BAM files:

Number of reference transcripts (ENST....) reported in the *assembled_transcripts.gtf file: 64500 (pre-built) / 44116 (own)
1442 transcripts missing in pre-built but present in own.
21826 transcripts missing in own but present in pre-built.
42673 transcripts in common; let's evaluate the concordance of their FPKM values:

     avg = 6.819 / 7.421
     std = 82.219 / 90.447
     max = 8040.339 / 9355.599
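A comparison like the one above can be reproduced with a short script. This is only a minimal sketch, not the method actually used for the numbers in this post: it assumes the standard StringTie GTF layout (tab-separated, FPKM stored as an attribute on "transcript" rows), it assumes reference transcripts are recognised by an "ENST" prefix, and the function names read_fpkm and compare_runs are made up for illustration.

```python
import re
import statistics

# Matches attribute pairs such as: transcript_id "ENST00000331789";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def read_fpkm(gtf_lines, prefix="ENST"):
    """Return {transcript ID: FPKM} for the 'transcript' rows of a StringTie GTF.

    Assumption: the reference ID is in 'reference_id' when present,
    otherwise in 'transcript_id'. IDs not starting with `prefix` are skipped.
    """
    fpkm = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("reference_id", attrs.get("transcript_id"))
        if tid is None or "FPKM" not in attrs or not tid.startswith(prefix):
            continue
        fpkm[tid] = float(attrs["FPKM"])
    return fpkm


def summarize(values):
    """Mean, population standard deviation and maximum, or None if empty."""
    if not values:
        return None
    return {
        "avg": statistics.mean(values),
        "std": statistics.pstdev(values),
        "max": max(values),
    }


def compare_runs(fpkm_a, fpkm_b):
    """Compare two {transcript ID: FPKM} maps (e.g. pre-built vs own index)."""
    shared = sorted(set(fpkm_a) & set(fpkm_b))
    return {
        "only_a": set(fpkm_a) - set(fpkm_b),
        "only_b": set(fpkm_b) - set(fpkm_a),
        "shared": shared,
        "stats_a": summarize([fpkm_a[t] for t in shared]),
        "stats_b": summarize([fpkm_b[t] for t in shared]),
    }
```

To use it, pass open file handles, for example read_fpkm(open("prebuilt.gtf")) and read_fpkm(open("own.gtf")), where both file names are placeholders, and then hand the two dictionaries to compare_runs.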

EDIT 2:

Let's take a look at a particular transcript that gives completely different results in StringTie depending on the HISAT2 index used: ENST00000331789

Results for pre-built index:

Format: Chr start end transcript_id gene_id FPKM

7   5566787 5570232 ENST00000425660 ENSG00000075624 646.288086
7   5566782 5570340 ENST00000331789 ENSG00000075624 20.998787
7   5566787 5570232 ENST00000462494 ENSG00000075624 1949.242188
7   5567742 5570233 ENST00000484841 ENSG00000075624 179.878967
7   5567372 5569294 ENST00000493945 ENSG00000075624 13.227101
7   5568223 5603415 ENST00000432588 ENSG00000075624 15.464212
7   5568866 5569613 ENST00000417101 ENSG00000075624 0.410293
7   5568101 5570221 ENST00000477812 ENSG00000075624 0.111127
7   5566782 5567729 ENST00000464611 ENSG00000075624 0.061247
7   5567781 5570235 ENST00000473257 ENSG00000075624 0.002322
7   5568698 5570214 ENST00000480301 ENSG00000075624 0.568141

Results for own index:

7   5566782 5570340 ENST00000331789 ENSG00000075624 4963.816895
7   5566787 5570232 ENST00000462494 ENSG00000075624 105.154701
7   5567372 5569294 ENST00000493945 ENSG00000075624 18.443560
7   5568101 5570221 ENST00000477812 ENSG00000075624 1.338749
7   5568698 5570214 ENST00000480301 ENSG00000075624 0.182248

hisat2 • 3.9k views

ADD COMMENT • link 7.0 years ago by iraun 6.2k

0

In what ways are they "quite different"?

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Sorry @Devon Ryan, my question wasn't clear. See my edit. I'm thinking that maybe the difference is not that big...

ADD REPLY • link 7.0 years ago by iraun 6.2k

0

I believe the index you built would be equivalent to the genome index, not to the genome_snp_tran index.

ADD REPLY • link 7.0 years ago by h.mon 35k

0

Yes. So is including the SNP and transcript information when building the index "very" important? I mean, can we assume that the analysis performed using the genome_snp_tran index is better than the other?

ADD REPLY • link 7.0 years ago by iraun 6.2k

1

Yes, though I would really encourage you not to use hisat2 if you care about finding splice sites unless you like tweaking settings.

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Mmmm... Can you take a look at my second edit? Why, with the pre-built index, is all the weight (expression) given to ENST00000462494, while with my own index it is given to ENST00000331789?

ADD REPLY • link 7.0 years ago by iraun 6.2k

0

Have a look at the BAM files, I suspect that'll be rather more telling.
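One possible way to look at the BAM files for this locus is to tally the splice junctions supported by each alignment. The sketch below is an assumption about what such a check could look like, not something prescribed in this thread: it works on SAM text lines (for instance the output of samtools view on a region of an indexed BAM), treats every N operation in the CIGAR string as an intron, and the function names junction_counts and junction_differences are hypothetical.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def junction_counts(sam_lines):
    """Count reads supporting each intron (chrom, start, end), 1-based inclusive.

    Header lines, unmapped reads (flag 0x4) and reads without a CIGAR
    string are skipped.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op == "N":
                # The skipped region spans pos .. pos + length - 1 on the reference.
                counts[(chrom, pos, pos + length - 1)] += 1
            if op in "MDN=X":
                # Only these operations advance the reference position.
                pos += length
    return counts


def junction_differences(counts_a, counts_b):
    """Return {junction: (count in a, count in b)} where the two runs disagree."""
    return {
        junction: (counts_a[junction], counts_b[junction])
        for junction in set(counts_a) | set(counts_b)
        if counts_a[junction] != counts_b[junction]
    }
```

Feeding it the reads from both alignments over the same region (for example 7:5566782-5570340, the span of ENST00000331789 in the tables above) would show which junctions are supported in one BAM but not the other.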

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

What is the difference between genome, genome_tran and genome_snp_tran?

ADD REPLY • link 5.8 years ago by Arindam Ghosh ▴ 510

0

genome is the basic index of the genome. genome_tran additionally includes annotated splice boundaries. genome_snp_tran additionally includes a number of SNPs, so you can (theoretically) get better alignment around them.
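For anyone who wants to build something closer to a genome_tran index themselves, the sketch below assembles the commands involved. It rests on some assumptions: the helper scripts hisat2_extract_splice_sites.py and hisat2_extract_exons.py shipped with HISAT2 are on the PATH, a GTF annotation matching the FASTA is available, and the function names and file paths are placeholders. The hisat2-build options --ss and --exon are the ones that take the extracted splice sites and exons, and the last argument is a base name for the output files rather than a directory. The HISAT2 manual notes that building with these options needs far more memory than a plain genome index.

```python
import subprocess


def index_build_commands(fasta, gtf, index_base, threads=6):
    """Return (argv, stdout_path) pairs for a transcript-aware HISAT2 index.

    stdout_path is the file that should receive the command's standard
    output, or None when the command writes its own output files.
    """
    splice_sites = index_base + ".ss"
    exons = index_base + ".exon"
    return [
        # Extract annotated splice sites and exons from the GTF.
        (["hisat2_extract_splice_sites.py", gtf], splice_sites),
        (["hisat2_extract_exons.py", gtf], exons),
        # Build the index using both files in addition to the genome FASTA.
        (
            [
                "hisat2-build", "-p", str(threads),
                "--ss", splice_sites,
                "--exon", exons,
                fasta, index_base,
            ],
            None,
        ),
    ]


def run_commands(commands):
    """Run each command in order, redirecting stdout to a file where requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)
```

A call such as run_commands(index_build_commands("genome.fa", "annotation.gtf", "HISAT_index/genome_tran")) would then run the three steps in order; the three paths here are illustrative only.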

ADD REPLY • link 5.8 years ago by Devon Ryan 104k