Question: How to select uniquely and concordantly reads from hisat2 alingment for raw read count
gravatar for Bioinfonext
3.0 years ago by
Bioinfonext320 wrote:

I mapped pair end RNAseq read to genome using HISAT2 and got statistics like this:

31026735 (100.00%) were paired; of these:
    3579612 (11.54%) aligned concordantly 0 times
    25541051 (82.32%) aligned concordantly exactly 1 time
    1906072 (6.14%) aligned concordantly >1 times
    3579612 pairs aligned concordantly 0 times; of these:
      290787 (8.12%) aligned discordantly 1 time
    3288825 pairs aligned 0 times concordantly or discordantly; of these:
      6577650 mates make up the pairs; of these:
        6068493 (92.26%) aligned 0 times
        475539 (7.23%) aligned exactly 1 time
        33618 (0.51%) aligned >1 times
90.22% overall alignment rate

Now how I can select uniquely and concordantly reads for raw read count using samtools?

rna-seq • 3.4k views
ADD COMMENTlink modified 20 months ago by al.bodrug50 • written 3.0 years ago by Bioinfonext320
gravatar for Macspider
3.0 years ago by
Vienna - BOKU
Macspider3.3k wrote:

Ok, let's go through it. The -f 0x2 bitwise flag selection keeps proper pairs, which are read pairs mapped in correct orientation (I suppose illumina reads facing each other) and within insert size (-X and -I) -> concordantly.

Now, if you want the uniquely mapped, you should basically exclude all those reads that have a secondary alignment. From the HISAT2 manual:

ZS:i:    <N>     Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i].

Basically, if the line contains the ZS: field, it has a secondary alignment, i.e. it is not unique. Filter those out, for example using awk:

samtools view -h -f 0x2 file.bam | awk 'substr($1, 0, 1)=="@" || $0 !~ /ZS:/' | samtools view -h -b > filtered_file.bam

The awk part selects only lines that either start with @ (like the header) or don't contain the ZS: tag, like the reads with secondary alignments (i.e. non-uniquely mapping).

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Macspider3.3k

Thanks Macspider! I am also interested to use HTseq for read count which only count uniquely mapped read. Could anyone suggest how to proceed with hisat2 sam file to do read count using HTseq?

ADD REPLYlink written 3.0 years ago by Bioinfonext320

Documentation is your friend!

ADD REPLYlink written 3.0 years ago by Macspider3.3k
gravatar for al.bodrug
20 months ago by
al.bodrug50 wrote:

I think like this you're gonna end up with unpaired reads where one of the mates has a ZS tag and the other doesn't and is the only one kept. So they won't be concordant.

ADD COMMENTlink written 20 months ago by al.bodrug50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 912 users visited in the last hour