Question: HISAT2 and Stringtie: difference in Stringtie results when using pre-built indexes or creating new ones in HISAT2
Asked 2.4 years ago by iraun (Norway):

Hello!

I'm using HISAT2 to map a paired-end RNA-Seq dataset. The reference genome I want to align against is GRCh37. I downloaded the pre-built genome_snp_tran index and ran hisat2. On the other hand, I also tried to build my own index using the following command:

hisat2-build -p 6 GRCh37-75/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa HISAT_index/

I ran Stringtie on both BAM files and the results are quite different. Can anyone help me understand why? Is it because I'm not using SNP and transcript information when generating my own index? The pre-built one is described as: genome_snp_tran: HGFM index for reference plus SNPs and transcripts

Is it better to use pre-built indexes rather than creating new ones?

EDIT

Here I post some of the differences seen in Stringtie output for the two BAM files:

  • Number of reference transcripts (ENST...) reported in the *assembled_transcripts.gtf file: 64500 (pre-built) / 44116 (own)
  • 1442 transcripts missing in pre-built but present in own.
  • 21826 transcripts missing in own but present in pre-built.
  • 42673 transcripts in common; let's evaluate the concordance of their FPKM values (pre-built / own):
     avg = 6.819 / 7.421
     std = 82.219 / 90.447
     max = 8040.339 / 9355.599
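
A comparison like the one above can be reproduced with a short script. The following is only a minimal sketch: it assumes both GTF files carry `transcript_id` and `FPKM` attributes on their `transcript` lines, and the file names in the usage comment are hypothetical.

```python
import math
import re
from statistics import mean

# Matches GTF attribute pairs such as: transcript_id "ENST00000331789";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def read_fpkm(gtf_lines):
    """Return {transcript_id: FPKM} from the 'transcript' lines of a GTF."""
    fpkm = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        if "transcript_id" in attrs and "FPKM" in attrs:
            fpkm[attrs["transcript_id"]] = float(attrs["FPKM"])
    return fpkm


def compare(a, b, top=10):
    """Summarise how two {transcript_id: FPKM} tables differ."""
    common = sorted(set(a) & set(b))

    def abs_log2_ratio(tx):
        # Pseudocount of 1 keeps near-zero FPKM values from dominating.
        return abs(math.log2((a[tx] + 1.0) / (b[tx] + 1.0)))

    return {
        "only_a": len(set(a) - set(b)),
        "only_b": len(set(b) - set(a)),
        "common": len(common),
        "mean_a": mean(a[t] for t in common) if common else float("nan"),
        "mean_b": mean(b[t] for t in common) if common else float("nan"),
        "largest_changes": sorted(common, key=abs_log2_ratio, reverse=True)[:top],
    }


# Usage (hypothetical file names):
# with open("prebuilt.gtf") as fa, open("own.gtf") as fb:
#     print(compare(read_fpkm(fa), read_fpkm(fb)))
```

Sorting the shared transcripts by fold change points at the loci worth inspecting by eye, which summary statistics alone do not.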
  

EDIT 2:

Let's take a look at a particular transcript that gives completely different results in Stringtie depending on the hisat2 index used: ENST00000331789

Results for pre-built index:

Format: Chr start end transcript_id gene_id FPKM

7   5566787 5570232 ENST00000425660 ENSG00000075624 646.288086
7   5566782 5570340 ENST00000331789 ENSG00000075624 20.998787
7   5566787 5570232 ENST00000462494 ENSG00000075624 1949.242188
7   5567742 5570233 ENST00000484841 ENSG00000075624 179.878967
7   5567372 5569294 ENST00000493945 ENSG00000075624 13.227101
7   5568223 5603415 ENST00000432588 ENSG00000075624 15.464212
7   5568866 5569613 ENST00000417101 ENSG00000075624 0.410293
7   5568101 5570221 ENST00000477812 ENSG00000075624 0.111127
7   5566782 5567729 ENST00000464611 ENSG00000075624 0.061247
7   5567781 5570235 ENST00000473257 ENSG00000075624 0.002322
7   5568698 5570214 ENST00000480301 ENSG00000075624 0.568141

Results for own index:

7   5566782 5570340 ENST00000331789 ENSG00000075624 4963.816895
7   5566787 5570232 ENST00000462494 ENSG00000075624 105.154701
7   5567372 5569294 ENST00000493945 ENSG00000075624 18.443560
7   5568101 5570221 ENST00000477812 ENSG00000075624 1.338749
7   5568698 5570214 ENST00000480301 ENSG00000075624 0.182248
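
To make the shift between isoforms easier to see, the two tables above can be turned into per-transcript shares of the gene's summed FPKM. This is just a sketch over the rows posted above; the helper name `transcript_shares` is made up for illustration.

```python
from collections import defaultdict

# Rows copied from the two tables above: chr start end transcript gene FPKM
PREBUILT = """\
7 5566787 5570232 ENST00000425660 ENSG00000075624 646.288086
7 5566782 5570340 ENST00000331789 ENSG00000075624 20.998787
7 5566787 5570232 ENST00000462494 ENSG00000075624 1949.242188
7 5567742 5570233 ENST00000484841 ENSG00000075624 179.878967
7 5567372 5569294 ENST00000493945 ENSG00000075624 13.227101
7 5568223 5603415 ENST00000432588 ENSG00000075624 15.464212
7 5568866 5569613 ENST00000417101 ENSG00000075624 0.410293
7 5568101 5570221 ENST00000477812 ENSG00000075624 0.111127
7 5566782 5567729 ENST00000464611 ENSG00000075624 0.061247
7 5567781 5570235 ENST00000473257 ENSG00000075624 0.002322
7 5568698 5570214 ENST00000480301 ENSG00000075624 0.568141
"""

OWN = """\
7 5566782 5570340 ENST00000331789 ENSG00000075624 4963.816895
7 5566787 5570232 ENST00000462494 ENSG00000075624 105.154701
7 5567372 5569294 ENST00000493945 ENSG00000075624 18.443560
7 5568101 5570221 ENST00000477812 ENSG00000075624 1.338749
7 5568698 5570214 ENST00000480301 ENSG00000075624 0.182248
"""


def transcript_shares(table):
    """Return {gene_id: {transcript_id: fraction of the gene's summed FPKM}}."""
    fpkm = defaultdict(dict)
    for line in table.strip().splitlines():
        _chrom, _start, _end, tx, gene, value = line.split()
        fpkm[gene][tx] = float(value)
    shares = {}
    for gene, by_tx in fpkm.items():
        total = sum(by_tx.values())
        shares[gene] = {tx: v / total for tx, v in by_tx.items()} if total > 0 else {}
    return shares


prebuilt_shares = transcript_shares(PREBUILT)["ENSG00000075624"]
own_shares = transcript_shares(OWN)["ENSG00000075624"]

print("transcript\tpre-built\town")
for tx in sorted(set(prebuilt_shares) | set(own_shares)):
    print(f"{tx}\t{prebuilt_shares.get(tx, 0.0):.3f}\t{own_shares.get(tx, 0.0):.3f}")
```

Transcripts absent from one run are printed with a share of 0, so the two columns can be compared side by side.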

In what ways are they "quite different"?

ADD REPLYlink written 2.4 years ago by Devon Ryan92k

Sorry @Devon Ryan, my question wasn't clear. See my edit. I'm thinking that maybe the difference is not that big...

ADD REPLYlink written 2.4 years ago by iraun3.6k

I believe the index you built would be equivalent to the genome index, not to the genome_snp_tran index.

ADD REPLYlink written 2.4 years ago by h.mon27k

Yes. So is the SNP and transcript information "very" important when building the index? I mean, can we assume that the analysis performed with the genome_snp_tran index is better than the other?

ADD REPLYlink written 2.4 years ago by iraun3.6k

Yes, though I would really encourage you not to use hisat2 if you care about finding splice sites unless you like tweaking settings.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Devon Ryan92k

Mmmm... Can you take a look at my second edit? Why, with the pre-built index, is nearly all the weight (expression) given to ENST00000462494, while with my own index it is given to ENST00000331789?

ADD REPLYlink written 2.4 years ago by iraun3.6k

Have a look at the BAM files, I suspect that'll be rather more telling.

ADD REPLYlink written 2.4 years ago by Devon Ryan92k
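
One way to act on that suggestion is to tally the splice junctions each BAM supports at the locus. The sketch below assumes the alignments have been dumped as SAM text (for example with `samtools view` on the region 7:5566782-5570340) and counts introns implied by `N` operations in the CIGAR strings; the function name and the file names in the usage comment are hypothetical.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junction_counts(sam_lines):
    """Count reads supporting each intron (chrom, start, end), 1-based inclusive."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read or no alignment
            continue
        ref = pos  # reference position of the next aligned base
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op == "N":  # skipped region, i.e. an intron
                counts[(chrom, ref, ref + length - 1)] += 1
            if op in "MDN=X":  # operations that consume the reference
                ref += length
    return counts


# Usage (hypothetical file names, one SAM dump per index):
# with open("prebuilt_region.sam") as fa, open("own_region.sam") as fb:
#     a, b = junction_counts(fa), junction_counts(fb)
# for junction in sorted(set(a) | set(b)):
#     print(junction, a[junction], b[junction])
```

Junctions that are well supported in one BAM but rare or absent in the other are the likely reason the isoform weights move.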

What is the difference between genome, genome_tran and genome_snp_tran?

ADD REPLYlink written 14 months ago by Arindam Ghosh170

genome is the basic index of the genome. genome_tran additionally includes annotated splicing boundaries. genome_snp_tran additionally includes a number of SNPs, so you can (theoretically) get better alignment around them.

ADD REPLYlink written 14 months ago by Devon Ryan92k