Why does Hisat2 report a different number of aligned pairs than samtools view?
1
1
Entering edit mode
3.5 years ago
adam.faranda ▴ 110

Hisat2 reports fewer total aligned reads than samtools view -F 4 my.alignment.bam, and I would like to try and understand why.

According to my Hisat2 results:

54623054 reads; of these:
  54623054 (100.00%) were paired; of these:
    30215613 (55.32%) aligned concordantly 0 times
    18622638 (34.09%) aligned concordantly exactly 1 time
    5784803 (10.59%) aligned concordantly >1 times
    ----
    30215613 pairs aligned concordantly 0 times; of these:
      222801 (0.74%) aligned discordantly 1 time
    ----
    29992812 pairs aligned 0 times concordantly or discordantly; of these:
      59985624 mates make up the pairs; of these:
        39341118 (65.58%) aligned 0 times
        17349298 (28.92%) aligned exactly 1 time
        3295208 (5.49%) aligned >1 times
63.99% overall alignment rate

a total of 18622638 + 5784803 + 222801+ (17349298 / 2) + ( 3295208 / 2)= 34952495 reads aligned to the target transcript.

The SAM flag value of 4 indicates that the given read failed to map.

When I select reads that do not have the SAM flag "4":

gawk '{F4[$1]} 
          END {
               x=0
               for(unique_read_id in F4) {
                        x=x+1
               }
               print x
          }' <(samtools view -F 4 my.alignment.bam)

I get a different number: 44852131 reads in this case; I count ~10 million more unique fragment ids than Hisat2 has reported. What combination of SAM flags can I use to get a result that matches Hisat2?

I am trying to evaluate the efficiency of ribosomal RNA depletion in an RNA Sequencing experiment. I have aligned my fastQ file against the Rn45s pre-spliced transcript in order to estimate ribosomal RNA abundance. I'm aligning reads with Hisat2, and using samtools and gawk for downstream analysis.

According to the Hisat2 manual, it adheres to the SAM Flags standard. I know that there is something I am missing here, I'm just having a hard time seeing it.

Since all of my reads are paired, I get the same results by calling samtools view -F 8 my.alignment.bam

RNA-Seq samtools • 998 views
ADD COMMENT
0
Entering edit mode
16 months ago

I had a similar problem. It wasn't easy to use samtools to fetch non-paired mapped reads since the reads countings didn't match hisat2.

My solution was bamtobed from bedtools to convert the mapping file into a BED file. This BED file has the read ID in the 4th column followed by /1 for paired-end 1 and /2 for paired-end 2. Having this information I could split the mapped reads in 2 different files, remove /1 and /2 from the ID, use sort | uniq and compare the two IDs files using comm. When I check the number of lines exclusive of paired-end 1 and 2, divide /2 and sum with the aligned, then the numbers matched with hisat2.

I strongly believe there is a manner to do this using samtools, but for this purpose samtools is surprisingly user unfriendly.

ADD COMMENT

Login before adding your answer.

Traffic: 2432 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6