Question

Multiple Hits and Unique Mapped Reads

1

Entering edit mode

7.6 years ago

fusion.slope ▴ 250

Hi All,

am trying to use different softwares for my new sequencing protocol and am comparing 3 different alignment tools:

BWA
Bowtie2
Segemehl

Now am on the step in which i want to retrieve unique mapped reads and as in the community has been already explained different alignment tools have different solution. For example:

BWA uniquely mapped reads are retrieved selecting Quality Score > 0 and removing unmapped reads;
Bowtie2 uniquely mapped reads are retrieved removing all the reads with MAPQ < 1 as well as removing the unmapped reads and not primary aligned

For Segemehl i can't use these information since in the SAM format are not present the flags in wihich i can retrieve these information but i can align using the parameter multiple hits (-M). So can I use this parameter -M to 1 in order to extract unique mapped reads? Does Multiple Hits means that the reads are aligned in several regions of the Genome, so that -M > 2 means that the reads are mapped in different regions of the genome and then are not uniquely mapped?

Thanks in advance for any suggestions.

Cheers!

alignment sequencing Multiple Hits • 5.5k views

ADD COMMENT • link updated 3.1 years ago by Arindam Ghosh ▴ 510 • written 7.6 years ago by fusion.slope ▴ 250

1

Entering edit mode

Your title is wrong. Multiple hits, not hist :)
Hist looks like histogram

ADD REPLY • link 7.6 years ago by chen ★ 2.5k

0

Entering edit mode

thanks chen i have corrected :))

ADD REPLY • link 7.6 years ago by fusion.slope ▴ 250

score 1 · Answer 1 · 2016-10-10

1

Entering edit mode

7.6 years ago

fusion.slope ▴ 250

The NH:i:1 will allow to retrieve the uniquely mapped reads from Segemehl output.

doing:

grep "NH:i:1" myfile.sam > myfile.uniq.sam

will allow to retrieve uniquely mapped reads. I did not know before the NH:i information in SAM format.

Here at page 6 the explanation

http://epbi-radivot.cwru.edu/EPBI473/files/week7seqData/bak/SAMv1.pdf

NM:i Number of reported alignments that contains the query in the current record

ADD COMMENT • link 7.6 years ago by fusion.slope ▴ 250

2

Entering edit mode

Just a couple of notes... grep "NH:i:1" will match also NH:i:10 or NH:i:11 etc.

I don't know Segemehl but NH can be greater than 1 and still you can have a "uniquely" mapped read. For example if one hit has zero mismatches and another hit has several mismatches than NH= 2 but the the second hit is so poor compared to the first one that you can say you have a unique mapping. In fact, the concept of "uniquely mapped" is a bit misleading as you can align any read anywhere you want, you just need to allow enough edits (mismatches), for this reason mapq is preferable.

ADD REPLY • link 7.6 years ago by dariober 14k

0

Entering edit mode

thanks dariober, here the correct script then:

awk -F"\t" '$14 == "NH:i:1" { print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15}' segemehl_output.sam > segemehl_output.uniq.sam

thanks also for the second points about the mapq l

ADD REPLY • link 7.6 years ago by fusion.slope ▴ 250

4

Entering edit mode

I don't like this solution very much. You rely on the NH tag to be in position 14 but the sam format doesn't enforce that. You can still use grep as grep -w 'NH:i:1', this should be fine. Or even better a proper parser for sam format like python pysam, but that requires a bit more coding.

ADD REPLY • link 7.6 years ago by dariober 14k

0

Entering edit mode

ok thanks, you are right about the position 14 but for Segemehl it is in the 14th column the NH:i information in the SAM file. For a general purpose is not a good option i agree..

ADD REPLY • link 7.6 years ago by fusion.slope ▴ 250

0

Entering edit mode

Is read counting on this extracted unique alignments same as not using countMultiMappingReads in featureCounts?

ADD REPLY • link 3.1 years ago by Arindam Ghosh ▴ 510