Question: How To Tell How Well A Sam File Was Mapped
2
gravatar for Applet
7.7 years ago by
Applet140
Applet140 wrote:

I'm comparing using STAMPY vs BWA to map some reads to a reference genome.

Now that I've generated two sam files how do I tell how well the mapping did?

samtools mapping bwa sam • 5.3k views
ADD COMMENTlink modified 4.6 years ago by Biostar ♦♦ 20 • written 7.7 years ago by Applet140
4
gravatar for Swbarnes2
7.7 years ago by
Swbarnes21.5k
Swbarnes21.5k wrote:

Convert the .sam to .bam with samtools view. .sams are huge. samtools flagstat will take a .bam, and tell you how many reads mapped total.

ADD COMMENTlink written 7.7 years ago by Swbarnes21.5k
1

I think that's probably true. You could use awk to filter out reads that have mapping quality of 0 from your .sam. I'm not sure that you'd expect a great deal of difference in how many reads have mapping quality of 0 between two aligners, since that's mostly about what your reference sequence is, and how long your reads are.

ADD REPLYlink written 7.7 years ago by Swbarnes21.5k

I fear samtools flagstat counts reads mapping multiple times not unique. Correct me, if I'm wrong.

ADD REPLYlink written 7.7 years ago by David Langenberger9.1k

Well...even a small difference that (might) happen, are differences, right? ;)

ADD REPLYlink written 7.7 years ago by David Langenberger9.1k

Well...even small differences that (might) happen, are differences, right? ;)

ADD REPLYlink written 7.7 years ago by David Langenberger9.1k

Quick Question: Does bam file need to be sorted in order to get the correct flagstat file?

ADD REPLYlink written 7.5 years ago by Alby90

The sorting does not have an effect on flagstat, I do not believe.

ADD REPLYlink written 7.5 years ago by Sean Davis25k

I would recommend you to check it. Run flagstat on both files and compare the results.

ADD REPLYlink written 7.5 years ago by David Langenberger9.1k

Just confirmed.BAM file does not need to be sorted for flagstat!

ADD REPLYlink written 7.5 years ago by Alby90
3
gravatar for Sean Davis
7.7 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

One can generate some pretty useful stats using Picard tools:

http://picard.sourceforge.net/picard-metric-definitions.shtml#AlignmentSummaryMetrics

http://picard.sourceforge.net/picard-metric-definitions.shtml#DuplicationMetrics

If this is a capture experiment, you can also try:

http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics

ADD COMMENTlink modified 8 weeks ago by RamRS24k • written 7.7 years ago by Sean Davis25k
2
gravatar for David Langenberger
7.7 years ago by
Deutschland
David Langenberger9.1k wrote:

e.g. you can count the mapped reads and use the percentage of mapped reads as an indicator of how well the mapping tool performed.

ADD COMMENTlink written 7.7 years ago by David Langenberger9.1k
2
gravatar for David Langenberger
7.7 years ago by
Deutschland
David Langenberger9.1k wrote:

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

ADD COMMENTlink written 7.7 years ago by David Langenberger9.1k
0
gravatar for Thomasvangurp
7.7 years ago by
Nederland
Thomasvangurp10 wrote:

Use grep to find reads that match to the reference. Generally speaking, only lines with matches contain 'M' at position 6 in each line . Use cut -f 6 file.sam| grep M | wc -l

ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Thomasvangurp10

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F't' -ane 'if($F[5]=~/M/){print "$F[0]n";};' file.sam | sort | uniq | wc -l

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

ADD REPLYlink written 7.7 years ago by David Langenberger9.1k

In order to assure, you don't have an M somewhere else (read-ID, mitochondrial chromosome, ...), or even reads mapping to multiple positions in the genome (both would basically destroy your statistics), I recommend to do something like this:

perl -F'\t' -ane 'if($F[5]=~/M/){print "$F[0]\n";};' file.sam | sort | uniq | wc -l

short explanation: if the 6th column contains a 'M' (a match exists), print the first column (the read-ID). then sort and uniq the read-IDs, to assure you count multiple mapped reads only once. and then the word-count as ever.

ADD REPLYlink written 7.7 years ago by David Langenberger9.1k

That doesn't work if you have an M somewhere else, e.g. in your read name

ADD REPLYlink written 7.7 years ago by Christof Winter990

@Christof thanks. Edited the response

ADD REPLYlink written 7.7 years ago by Thomasvangurp10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1329 users visited in the last hour