Question: Hisat2 with HTSeq-Count
gravatar for dacacis
5 weeks ago by
University of Granada
dacacis0 wrote:

Hello everyone!

I am working with Tophat2 and HTSeq-count until now but, I want to use HISAT2 and learn about it. For that, I have been trying and working with this tool this week. Everything was fine until I tried to calculate the count from the SAM files using HTSeq-count. When I used this tool for the count, all the transcript counts in the output files are 0. This represents a real problem for me because I need to use HTSeq-count to keep the same distribution than the data that I already have.

So, Is it possible to use these two tools together? or Is there another tool that mantain the same distribution in the output file that HTSeq-count but works with HISAT2?

Both HISAT2 and HTSeq-count are in their last version.

Thank you in advance for your time.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by dacacis0

The first thing that you should check is if the chromosome identifiers in your fasta genome used for creating the index are the same as in your gtf used for counting. Most common mistake.

ADD REPLYlink written 5 weeks ago by WouterDeCoster26k

I am using the release 89 of the human genome for both the fasta and the gtf files. The FASTA file is the Homo_sapiens.GRCh38.cdna and the GFT is the Homo_sapiens.GRCh38.89

ADD REPLYlink written 5 weeks ago by dacacis0

You aligned to the cdna? That's not right, you should align to the genome. The GTF doesn't correspond to the cdna fasta, only to the genome fasta.

ADD REPLYlink written 5 weeks ago by WouterDeCoster26k

Ok, I though that the CDNA is used with tools like HISAT2, SALMON, KALLISTO,etc... So I'll go to try with the whole genome such as in TOPHAT2. I hope this fix my problem. Thank you!

ADD REPLYlink written 5 weeks ago by dacacis0

The CDS FASTA transcriptome is indeed used for Kallisto and Salmon, but not HISAT2

ADD REPLYlink written 5 weeks ago by Kevin Blighe15k

I just aligned the sample with the whole genome, but the problem persist. I have zeros in all the counts of each transcript. This is the message that I obtain when I create the SAM file before the count file:

Converting SRR1663209 sample to SAM file...
14605626 reads; of these:
  14605626 (100.00%) were paired; of these:
    10936288 (74.88%) aligned concordantly 0 times
    1176660 (8.06%) aligned concordantly exactly 1 time
    2492678 (17.07%) aligned concordantly >1 times
    10936288 pairs aligned concordantly 0 times; of these:
      24476 (0.22%) aligned discordantly 1 time
    10911812 pairs aligned 0 times concordantly or discordantly; of these:
      21823624 mates make up the pairs; of these:
        20311900 (93.07%) aligned 0 times
        786761 (3.61%) aligned exactly 1 time
        724963 (3.32%) aligned >1 times
30.47% overall alignment rate

And this is the output of htseq-count:

Warning: 215252 reads with missing mate encountered.
22742460 SAM alignment pairs processed.

I am using the Homo_sapiens.GRCh38.90 for both the FASTA and GTF files. Is it possible that the version of the genome should be the problem?

ADD REPLYlink written 5 weeks ago by dacacis0

Why do you need to use HISAT2? - to identify novel transcripts?

What I would do is the following:

  1. align with HISAT2
  2. Create novel transcriptome GTF with StringTie (guided by the existing known transcriptome GTF)
  3. Commence the HT-seq pipeline from the very beginning (FASTQ files) but use your new GTF
ADD REPLYlink written 5 weeks ago by Kevin Blighe15k

I am working integrating microarray gene expression with RNA-seq gene expression. For that, I use tophat with htseq-count for the read counts. The problem is that tophat2 is very slowly, for that reason I want to accelerate the pipeline with Hisat2.

HTSeq-count is the only tool that maintain the gene expression values in the same range and distribution than microarray, at least as far as I know. For that reason I need solve this problem.

Thank you again for your supporting!

ADD REPLYlink written 5 weeks ago by dacacis0

...but, TopHat2 / HISAT2 are mainly designed for de novo transcript identification and de novo splice isoform identification. They can also be used for standard RNA-seq differential expression analysis, of course.

HTseq does not produce counts on the same distribution as microarray. HTseq produces expression values that follow a negative binomial distributon, whilst microarray expression values are binomially-distributed log (base 2) values.

In my opinion, it is not a good idea to attempt to merge RNA-seq data with microarray data - sorry / lo siento. It would be better to use one dataset as the training and the other as a validation.


ADD REPLYlink written 5 weeks ago by Kevin Blighe15k

A large fraction of your reads do not appear to align. You should investigate that first (grab a sample of those reads and blast at NCBI to see if you have a problem of contamination of some sort) before worrying about 0 counts.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by genomax44k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1758 users visited in the last hour