How to calculate percent identity from sam file?
0
0
Entering edit mode
2.1 years ago
O.rka ▴ 290

I'm using pysam to calculate the percent identity from each mapped read in a samfile. I tried to get some of the more complicated CIGAR strings so I did len(CIGAR) > 4. I can't figure out how to calculate the percent identity from this. It would make sense to get the number of matches - the number of mismatches / alignment length but how can I get this from the CIGAR string?

How will indels affect the percent identity? Is there a method to do this straight from pysam ?

In [39]: import pysam
In [40]: samfile = pysam.AlignmentFile("mapped.sam")
In [41]: i = 0
...:         if len(cigar) > 4:
...:             print()
...:             i += 1
...:             if i == 20:
...:                 break

5S55M
(array('I', [55, 0, 0, 0, 5, 0, 0, 0, 0, 0, 1]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(4, 5), (0, 55)]
HJGKNBGX5:1:11101:10032:11438   0   37  1411003 60  5S55M   -1  -1  55  ACGTGAATTCCTGCACACTGTTGAAATCGTGGACAGTCAGACTGCCTGCTCAAATTGAAC    array('B', [32, 32, 32, 32, 32, 36, 14, 27, 36, 14, 36, 32, 14, 36, 36, 14, 14, 36, 36, 32, 14, 32, 27, 36, 14, 36, 36, 36, 14, 36, 27, 32, 36, 36, 27, 14, 32, 27, 32, 32, 27, 36, 27, 14, 36, 36, 32, 27, 36, 36, 36, 36, 36, 36, 27, 36, 32, 36, 36, 32])    [('AS', -8), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '7C47'), ('YT', 'UU'), ('NH', 1)]

54M6S
(array('I', [54, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))
[(0, 54), (4, 6)]
HJGKNBGX5:1:11101:10044:3447    16  204 64576   60  54M6S   -1  -1  54  ATCGAGAGACTGTAGCCGAAAGGGAAAACAGTGGTCGAGTTGCATAACGTCTGCTCATTC    array('B', [36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 14, 36, 36, 32, 14, 36, 36, 36, 36, 36, 14, 36, 36, 36, 32, 14, 36, 36, 36, 27, 14, 36, 36, 36, 27, 36, 36, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 32, 32, 32, 32, 32])    [('AS', -6), ('XN', 0), ('XM', 0), ('XO', 0), ('XG', 0), ('NM', 0), ('MD', '54'), ('YT', 'UU'), ('NH', 1)]

46M79N14M
(array('I', [60, 0, 0, 79, 0, 0, 0, 0, 0, 0, 3]), array('I', [2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))
[(0, 46), (3, 79), (0, 14)]
HJGKNBGX5:1:11101:10029:1903    0   144 1949847 60  46M79N14M   -1  -1  60  GAATAATTCGGTTACGTTGGCCCAGTTGACATCACAAAGCTCAAATGGAGAAAAGAAGCC    array('B', [14, 21, 14, 14, 32, 14, 32, 36, 36, 14, 14, 14, 14, 14, 14, 27, 32, 36, 36, 32, 14, 14, 14, 14, 36, 14, 27, 32, 14, 14, 14, 36, 32, 14, 32, 36, 14, 14, 36, 36, 36, 36, 36, 27, 14, 14, 36, 36, 36, 14, 32, 14, 32, 36, 32, 14, 32, 36, 36, 27])    [('AS', -9), ('XN', 0), ('XM', 3), ('XO', 0), ('XG', 0), ('NM', 3), ('MD', '12C0C15A30'), ('YT', 'UU'), ('XS', '+'), ('NH', 1)]

mapping assembly sam samtools cigar • 1.9k views
2
Entering edit mode

you cannot use such cigar strings: The M operator can be a match(=) or a mismatch(X). See also the sam attribute 'NM:i:' ( "Edit distance to the reference" )

1
Entering edit mode

The blog post The history the MD tag and the CIGAR X operator explains why the CIGAR shouldn't be used to calculate percent identity, but a TL;DR is:

First, CIGAR describes alignment, but sequence matches and mismatches are not indispensable properties of alignment.

0
Entering edit mode

I've changed the question to how to calculate from sam file. Is there a straight forward way to do this from the sam record?

0
Entering edit mode

maybe you could use a combination of the MD field and cigar string, explained here https://zombieprocess.wordpress.com/2013/05/21/calculating-percent-identity-from-sam-files/

0
Entering edit mode

How should indels be handled?

0
Entering edit mode

Just out of curiosity, why do you need this?

0
Entering edit mode

This pipeline I use relies on CLC mapper which has a percent identity parameter. It seems really useful and I want to make a public version that only uses open source tools so I can throw it on my GitHub. I want to make it so you can use any mapping program and then you can filter out the reads that don't match a certain percent identity post hoc. For example, with environmental metagenomics you might want a lower threshold but with single cell stuff maybe you want it to be higher percent identity (maybe those are poor examples).

0
Entering edit mode

That would not be very simple to do with large BAM files. It may just be easier/faster to run alignments at two (or more) thresholds and let people pick one they like. Not sure how much difference this will make on final number of alignments. If you go through that exercise post the results here.

With BBMap for example you can set it while aligning:

minid=0.76              Approximate minimum alignment identity to look for.
Higher is faster and less sensitive.

0
Entering edit mode

Yea BBMap is definitely my go for 99% of my mapping jobs mostly because of this feature (and it's super fast, easy to use, and reliable). One thing about BBMap is it can't take in GTF/GFF3 files for exon mapping like HISAT2 & STAR.

0
Entering edit mode

You could post-filter your BBMap alignments using samtools view regions with bed files for the exons.

0
Entering edit mode

Nice, I didn't know you could do that! So you're suggesting for eukaryotic organisms with lots of splice sites to map directly to transcript, then filter with samtools view regions?