Question: Multiple Hits and Unique Mapped Reads
1
gravatar for fusion.slope
2.5 years ago by
fusion.slope200
fusion.slope200 wrote:

Hi All,

am trying to use different softwares for my new sequencing protocol and am comparing 3 different alignment tools:

  • BWA
  • Bowtie2
  • Segemehl

Now am on the step in which i want to retrieve unique mapped reads and as in the community has been already explained different alignment tools have different solution. For example:

  • BWA uniquely mapped reads are retrieved selecting Quality Score > 0 and removing unmapped reads;
  • Bowtie2 uniquely mapped reads are retrieved removing all the reads with MAPQ < 1 as well as removing the unmapped reads and not primary aligned

For Segemehl i can't use these information since in the SAM format are not present the flags in wihich i can retrieve these information but i can align using the parameter multiple hits (-M). So can I use this parameter -M to 1 in order to extract unique mapped reads? Does Multiple Hits means that the reads are aligned in several regions of the Genome, so that -M > 2 means that the reads are mapped in different regions of the genome and then are not uniquely mapped?

Thanks in advance for any suggestions.

Cheers!

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by fusion.slope200
1

Your title is wrong. Multiple hits, not hist :)
Hist looks like histogram

ADD REPLYlink written 2.5 years ago by chen1.8k

thanks chen i have corrected :))

ADD REPLYlink written 2.5 years ago by fusion.slope200
1
gravatar for fusion.slope
2.5 years ago by
fusion.slope200
fusion.slope200 wrote:

The NH:i:1 will allow to retrieve the uniquely mapped reads from Segemehl output.

doing:

grep "NH:i:1" myfile.sam > myfile.uniq.sam

will allow to retrieve uniquely mapped reads. I did not know before the NH:i information in SAM format.

Here at page 6 the explanation

http://epbi-radivot.cwru.edu/EPBI473/files/week7seqData/bak/SAMv1.pdf

NM:i Number of reported alignments that contains the query in the current record

ADD COMMENTlink written 2.5 years ago by fusion.slope200
2

Just a couple of notes... grep "NH:i:1" will match also NH:i:10 or NH:i:11 etc.

I don't know Segemehl but NH can be greater than 1 and still you can have a "uniquely" mapped read. For example if one hit has zero mismatches and another hit has several mismatches than NH= 2 but the the second hit is so poor compared to the first one that you can say you have a unique mapping. In fact, the concept of "uniquely mapped" is a bit misleading as you can align any read anywhere you want, you just need to allow enough edits (mismatches), for this reason mapq is preferable.

ADD REPLYlink written 2.5 years ago by dariober10.0k

thanks dariober, here the correct script then:

awk -F"\t" '$14 == "NH:i:1" { print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15}' segemehl_output.sam > segemehl_output.uniq.sam

thanks also for the second points about the mapq l

ADD REPLYlink written 2.5 years ago by fusion.slope200
4

I don't like this solution very much. You rely on the NH tag to be in position 14 but the sam format doesn't enforce that. You can still use grep as grep -w 'NH:i:1', this should be fine. Or even better a proper parser for sam format like python pysam, but that requires a bit more coding.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by dariober10.0k

ok thanks, you are right about the position 14 but for Segemehl it is in the 14th column the NH:i information in the SAM file. For a general purpose is not a good option i agree..

ADD REPLYlink written 2.5 years ago by fusion.slope200
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 685 users visited in the last hour