HISAT2 output statistics does not match the actual output
0
0
Entering edit mode
4.0 years ago
Tobias ▴ 140

I have been running HISAT2 for some of my FASTQ files via

hisat2 -x genome --sp 1000,1000 -U /home/a.fastq.gz -S /home/output.sam

It worked and I got out:

16419826 reads; of these:
16419826 (100.00%) were unpaired; of these:
1719313 (10.47%) aligned 0 times
9510754 (57.92%) aligned exactly 1 time
5189759 (31.61%) aligned >1 times
89.53% overall alignment rate


However, when I looked into the data I found, that it was indeed 16419826 reads and 1719313 were not aligned, but it was 12448816 reads aligning exactly 1 time and 2251697 reads aligning more than 1 time.

Hence, the number I got out while looking at the data were a bit different.

How is that explicable? Did I do something wrong?

RNA-Seq alignment sequencing • 2.0k views
0
Entering edit mode

How did you determine "aligned only once"? My guess is that this corresponds to MAPQ > 3 or so in hisat2.

0
Entering edit mode

Hi Devon, many thanks for question! Well, I just counted the read IDs appearing just once!

1
Entering edit mode

Give the counting by MAPQ a go, though make sure to exclude secondary alignments :)

0
Entering edit mode

But if I would be doing that I would obtain more not aligned reads (which would not make much sense, because that number is a perfect match) and I would obtain less multiple aligned reads. But since I already have fewer multiple aligned reads in my analysis, the impairment would get stronger!

1
Entering edit mode

In your original post you asked why you were getting different numbers than expected given what hisat2 output. I gave a possible reason (namely, that you're using different definitions for a multimapper). Whether you prefer one definition or another or whether there are downstream consequences of that is wholly immaterial.

0
Entering edit mode

I agree. All I wanted to say is that if they would have been using MAPQ>3 for the statistics, they would have been observing less multiple aligned reads than I did by simply counting them, and not more.

0
Entering edit mode

No, it would have found more multimappers and fewer "unique" mappers, which is indeed what it found. The presumption is that multimappers aren't always discernible by having multiple entries, but rather by MAPQ (this is essentially how it works in bowtie2 and since the code bases are related...).

0
Entering edit mode

Okay great, do you know some code (in R or by using some bash script) with which I could test that out?

1
Entering edit mode

samtools view -F 256 foo.bam | cut -f 5 | sort | uniq -c will give you the number of primary alignments per MAPQ. My guess is that some combination of the lower MAPQs will at least be closer to the numbers output by hisat2.