Question: How to select uniquely and concordantly reads from hisat2 alingment for raw read count
0
gravatar for Bioinfonext
8 months ago by
Bioinfonext110
Korea
Bioinfonext110 wrote:

I mapped pair end RNAseq read to genome using HISAT2 and got statistics like this:

31026735 (100.00%) were paired; of these:
    3579612 (11.54%) aligned concordantly 0 times
    25541051 (82.32%) aligned concordantly exactly 1 time
    1906072 (6.14%) aligned concordantly >1 times
    ----
    3579612 pairs aligned concordantly 0 times; of these:
      290787 (8.12%) aligned discordantly 1 time
    ----
    3288825 pairs aligned 0 times concordantly or discordantly; of these:
      6577650 mates make up the pairs; of these:
        6068493 (92.26%) aligned 0 times
        475539 (7.23%) aligned exactly 1 time
        33618 (0.51%) aligned >1 times
90.22% overall alignment rate

Now how I can select uniquely and concordantly reads for raw read count using samtools?

rna-seq • 614 views
ADD COMMENTlink modified 8 months ago by Macspider2.5k • written 8 months ago by Bioinfonext110
2
gravatar for Macspider
8 months ago by
Macspider2.5k
Vienna - BOKU
Macspider2.5k wrote:

Ok, let's go through it. The -f 0x2 bitwise flag selection keeps proper pairs, which are read pairs mapped in correct orientation (I suppose illumina reads facing each other) and within insert size (-X and -I) -> concordantly.

Now, if you want the uniquely mapped, you should basically exclude all those reads that have a secondary alignment. From the HISAT2 manual:

ZS:i:    <N>     Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i].

Basically, if the line contains the ZS: field, it has a secondary alignment, i.e. it is not unique. Filter those out, for example using awk:

samtools view -h -f 0x2 file.bam | awk 'substr($1, 0, 1)=="@" || $0 !~ /ZS:/' | samtools view -h -b > filtered_file.bam

The awk part selects only lines that either start with @ (like the header) or don't contain the ZS: tag, like the reads with secondary alignments (i.e. non-uniquely mapping).

ADD COMMENTlink modified 8 months ago • written 8 months ago by Macspider2.5k

Thanks Macspider! I am also interested to use HTseq for read count which only count uniquely mapped read. Could anyone suggest how to proceed with hisat2 sam file to do read count using HTseq?

ADD REPLYlink written 8 months ago by Bioinfonext110

Documentation is your friend! https://htseq.readthedocs.io/en/release_0.9.1/tour.html

ADD REPLYlink written 8 months ago by Macspider2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 766 users visited in the last hour