Question: HISAT2 output statistics does not match the actual output
0
gravatar for Tobias
2.6 years ago by
Tobias140
Tobias140 wrote:

I have been running HISAT2 for some of my FASTQ files via

hisat2 -x genome --sp 1000,1000 -U /home/a.fastq.gz -S /home/output.sam

It worked and I got out:

16419826 reads; of these:
  16419826 (100.00%) were unpaired; of these:
    1719313 (10.47%) aligned 0 times
    9510754 (57.92%) aligned exactly 1 time
    5189759 (31.61%) aligned >1 times
89.53% overall alignment rate

However, when I looked into the data I found, that it was indeed 16419826 reads and 1719313 were not aligned, but it was 12448816 reads aligning exactly 1 time and 2251697 reads aligning more than 1 time.

Hence, the number I got out while looking at the data were a bit different.

How is that explicable? Did I do something wrong?

Many thanks for your help in advance!

sequencing rna-seq alignment • 1.5k views
ADD COMMENTlink modified 2.5 years ago by Biostar ♦♦ 20 • written 2.6 years ago by Tobias140

How did you determine "aligned only once"? My guess is that this corresponds to MAPQ > 3 or so in hisat2.

ADD REPLYlink written 2.6 years ago by Devon Ryan92k

Hi Devon, many thanks for question! Well, I just counted the read IDs appearing just once!

ADD REPLYlink written 2.6 years ago by Tobias140
1

Give the counting by MAPQ a go, though make sure to exclude secondary alignments :)

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Devon Ryan92k

But if I would be doing that I would obtain more not aligned reads (which would not make much sense, because that number is a perfect match) and I would obtain less multiple aligned reads. But since I already have fewer multiple aligned reads in my analysis, the impairment would get stronger!

ADD REPLYlink written 2.6 years ago by Tobias140
1

In your original post you asked why you were getting different numbers than expected given what hisat2 output. I gave a possible reason (namely, that you're using different definitions for a multimapper). Whether you prefer one definition or another or whether there are downstream consequences of that is wholly immaterial.

ADD REPLYlink written 2.6 years ago by Devon Ryan92k

I agree. All I wanted to say is that if they would have been using MAPQ>3 for the statistics, they would have been observing less multiple aligned reads than I did by simply counting them, and not more.

ADD REPLYlink written 2.6 years ago by Tobias140

No, it would have found more multimappers and fewer "unique" mappers, which is indeed what it found. The presumption is that multimappers aren't always discernible by having multiple entries, but rather by MAPQ (this is essentially how it works in bowtie2 and since the code bases are related...).

ADD REPLYlink written 2.6 years ago by Devon Ryan92k

Okay great, do you know some code (in R or by using some bash script) with which I could test that out?

ADD REPLYlink written 2.6 years ago by Tobias140
1

samtools view -F 256 foo.bam | cut -f 5 | sort | uniq -c will give you the number of primary alignments per MAPQ. My guess is that some combination of the lower MAPQs will at least be closer to the numbers output by hisat2.

ADD REPLYlink modified 2.6 years ago by h.mon28k • written 2.6 years ago by Devon Ryan92k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1789 users visited in the last hour