Question: How to select uniquely and concordantly reads from hisat2 alingment for raw read count
0
gravatar for Bioinfonext
20 months ago by
Bioinfonext160
Korea
Bioinfonext160 wrote:

I mapped pair end RNAseq read to genome using HISAT2 and got statistics like this:

31026735 (100.00%) were paired; of these:
    3579612 (11.54%) aligned concordantly 0 times
    25541051 (82.32%) aligned concordantly exactly 1 time
    1906072 (6.14%) aligned concordantly >1 times
    ----
    3579612 pairs aligned concordantly 0 times; of these:
      290787 (8.12%) aligned discordantly 1 time
    ----
    3288825 pairs aligned 0 times concordantly or discordantly; of these:
      6577650 mates make up the pairs; of these:
        6068493 (92.26%) aligned 0 times
        475539 (7.23%) aligned exactly 1 time
        33618 (0.51%) aligned >1 times
90.22% overall alignment rate

Now how I can select uniquely and concordantly reads for raw read count using samtools?

rna-seq • 1.9k views
ADD COMMENTlink modified 5 months ago by al.bodrug50 • written 20 months ago by Bioinfonext160
3
gravatar for Macspider
20 months ago by
Macspider3.0k
Vienna - BOKU
Macspider3.0k wrote:

Ok, let's go through it. The -f 0x2 bitwise flag selection keeps proper pairs, which are read pairs mapped in correct orientation (I suppose illumina reads facing each other) and within insert size (-X and -I) -> concordantly.

Now, if you want the uniquely mapped, you should basically exclude all those reads that have a secondary alignment. From the HISAT2 manual:

ZS:i:    <N>     Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i].

Basically, if the line contains the ZS: field, it has a secondary alignment, i.e. it is not unique. Filter those out, for example using awk:

samtools view -h -f 0x2 file.bam | awk 'substr($1, 0, 1)=="@" || $0 !~ /ZS:/' | samtools view -h -b > filtered_file.bam

The awk part selects only lines that either start with @ (like the header) or don't contain the ZS: tag, like the reads with secondary alignments (i.e. non-uniquely mapping).

ADD COMMENTlink modified 20 months ago • written 20 months ago by Macspider3.0k

Thanks Macspider! I am also interested to use HTseq for read count which only count uniquely mapped read. Could anyone suggest how to proceed with hisat2 sam file to do read count using HTseq?

ADD REPLYlink written 20 months ago by Bioinfonext160

Documentation is your friend! https://htseq.readthedocs.io/en/release_0.9.1/tour.html

ADD REPLYlink written 20 months ago by Macspider3.0k
0
gravatar for al.bodrug
5 months ago by
al.bodrug50
France
al.bodrug50 wrote:

I think like this you're gonna end up with unpaired reads where one of the mates has a ZS tag and the other doesn't and is the only one kept. So they won't be concordant.

ADD COMMENTlink written 5 months ago by al.bodrug50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1036 users visited in the last hour