Question

Hisat2 with HTSeq-Count

1

Entering edit mode

6.2 years ago

dacacis ▴ 10

Hello everyone!

I am working with Tophat2 and HTSeq-count until now but, I want to use HISAT2 and learn about it. For that, I have been trying and working with this tool this week. Everything was fine until I tried to calculate the count from the SAM files using HTSeq-count. When I used this tool for the count, all the transcript counts in the output files are 0. This represents a real problem for me because I need to use HTSeq-count to keep the same distribution than the data that I already have.

So, Is it possible to use these two tools together? or Is there another tool that mantain the same distribution in the output file that HTSeq-count but works with HISAT2?

Both HISAT2 and HTSeq-count are in their last version.

Thank you in advance for your time.

sequencing alignment count RNA-Seq genome • 6.1k views

ADD COMMENT • link 6.2 years ago by dacacis ▴ 10

1

Entering edit mode

The first thing that you should check is if the chromosome identifiers in your fasta genome used for creating the index are the same as in your gtf used for counting. Most common mistake.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

I am using the release 89 of the human genome for both the fasta and the gtf files. The FASTA file is the Homo_sapiens.GRCh38.cdna and the GFT is the Homo_sapiens.GRCh38.89

ADD REPLY • link 6.2 years ago by dacacis ▴ 10

1

Entering edit mode

You aligned to the cdna? That's not right, you should align to the genome. The GTF doesn't correspond to the cdna fasta, only to the genome fasta.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

Ok, I though that the CDNA is used with tools like HISAT2, SALMON, KALLISTO,etc... So I'll go to try with the whole genome such as in TOPHAT2. I hope this fix my problem. Thank you!

ADD REPLY • link 6.2 years ago by dacacis ▴ 10

1

Entering edit mode

The CDS FASTA transcriptome is indeed used for Kallisto and Salmon, but not HISAT2

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

I just aligned the sample with the whole genome, but the problem persist. I have zeros in all the counts of each transcript. This is the message that I obtain when I create the SAM file before the count file:

Converting SRR1663209 sample to SAM file...
14605626 reads; of these:
  14605626 (100.00%) were paired; of these:
    10936288 (74.88%) aligned concordantly 0 times
    1176660 (8.06%) aligned concordantly exactly 1 time
    2492678 (17.07%) aligned concordantly >1 times
    ----
    10936288 pairs aligned concordantly 0 times; of these:
      24476 (0.22%) aligned discordantly 1 time
    ----
    10911812 pairs aligned 0 times concordantly or discordantly; of these:
      21823624 mates make up the pairs; of these:
        20311900 (93.07%) aligned 0 times
        786761 (3.61%) aligned exactly 1 time
        724963 (3.32%) aligned >1 times
30.47% overall alignment rate

And this is the output of htseq-count:

Warning: 215252 reads with missing mate encountered.
22742460 SAM alignment pairs processed.

I am using the Homo_sapiens.GRCh38.90 for both the FASTA and GTF files. Is it possible that the version of the genome should be the problem?

ADD REPLY • link 6.2 years ago by dacacis ▴ 10

1

Entering edit mode

Why do you need to use HISAT2? - to identify novel transcripts?

What I would do is the following:

align with HISAT2
Create novel transcriptome GTF with StringTie (guided by the existing known transcriptome GTF)
Commence the HT-seq pipeline from the very beginning (FASTQ files) but use your new GTF

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Entering edit mode

I am working integrating microarray gene expression with RNA-seq gene expression. For that, I use tophat with htseq-count for the read counts. The problem is that tophat2 is very slowly, for that reason I want to accelerate the pipeline with Hisat2.

HTSeq-count is the only tool that maintain the gene expression values in the same range and distribution than microarray, at least as far as I know. For that reason I need solve this problem.

Thank you again for your supporting!

ADD REPLY • link 6.2 years ago by dacacis ▴ 10

0

Entering edit mode

...but, TopHat2 / HISAT2 are mainly designed for de novo transcript identification and de novo splice isoform identification. They can also be used for standard RNA-seq differential expression analysis, of course.

HTseq does not produce counts on the same distribution as microarray. HTseq produces expression values that follow a negative binomial distributon, whilst microarray expression values are binomially-distributed log (base 2) values.

In my opinion, it is not a good idea to attempt to merge RNA-seq data with microarray data - sorry / lo siento. It would be better to use one dataset as the training and the other as a validation.

Kevin

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

1

Entering edit mode

A large fraction of your reads do not appear to align. You should investigate that first (grab a sample of those reads and blast at NCBI to see if you have a problem of contamination of some sort) before worrying about 0 counts.

ADD REPLY • link 6.2 years ago by GenoMax 141k