Question

HISAT2 output statistics does not match the actual output

0

Entering edit mode

7.0 years ago

Tobias ▴ 150

I have been running HISAT2 for some of my FASTQ files via

hisat2 -x genome --sp 1000,1000 -U /home/a.fastq.gz -S /home/output.sam

It worked and I got out:

16419826 reads; of these:
  16419826 (100.00%) were unpaired; of these:
    1719313 (10.47%) aligned 0 times
    9510754 (57.92%) aligned exactly 1 time
    5189759 (31.61%) aligned >1 times
89.53% overall alignment rate

However, when I looked into the data I found, that it was indeed 16419826 reads and 1719313 were not aligned, but it was 12448816 reads aligning exactly 1 time and 2251697 reads aligning more than 1 time.

Hence, the number I got out while looking at the data were a bit different.

How is that explicable? Did I do something wrong?

Many thanks for your help in advance!

RNA-Seq alignment sequencing • 2.9k views

ADD COMMENT • link updated 7.0 years ago by Biostar 20 • written 7.0 years ago by Tobias ▴ 150

0

Entering edit mode

How did you determine "aligned only once"? My guess is that this corresponds to MAPQ > 3 or so in hisat2.

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon, many thanks for question! Well, I just counted the read IDs appearing just once!

ADD REPLY • link 7.0 years ago by Tobias ▴ 150

1

Entering edit mode

Give the counting by MAPQ a go, though make sure to exclude secondary alignments :)

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Entering edit mode

But if I would be doing that I would obtain more not aligned reads (which would not make much sense, because that number is a perfect match) and I would obtain less multiple aligned reads. But since I already have fewer multiple aligned reads in my analysis, the impairment would get stronger!

ADD REPLY • link 7.0 years ago by Tobias ▴ 150

1

Entering edit mode

In your original post you asked why you were getting different numbers than expected given what hisat2 output. I gave a possible reason (namely, that you're using different definitions for a multimapper). Whether you prefer one definition or another or whether there are downstream consequences of that is wholly immaterial.

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Entering edit mode

I agree. All I wanted to say is that if they would have been using MAPQ>3 for the statistics, they would have been observing less multiple aligned reads than I did by simply counting them, and not more.

ADD REPLY • link 7.0 years ago by Tobias ▴ 150

0

Entering edit mode

No, it would have found more multimappers and fewer "unique" mappers, which is indeed what it found. The presumption is that multimappers aren't always discernible by having multiple entries, but rather by MAPQ (this is essentially how it works in bowtie2 and since the code bases are related...).

ADD REPLY • link 7.0 years ago by Devon Ryan 104k

0

Entering edit mode

Okay great, do you know some code (in R or by using some bash script) with which I could test that out?

ADD REPLY • link 7.0 years ago by Tobias ▴ 150

1

Entering edit mode

samtools view -F 256 foo.bam | cut -f 5 | sort | uniq -c will give you the number of primary alignments per MAPQ. My guess is that some combination of the lower MAPQs will at least be closer to the numbers output by hisat2.

ADD REPLY • link updated 7.0 years ago by h.mon 35k • written 7.0 years ago by Devon Ryan 104k