Question: Need Help Interpreting The Percent Identity From Blat M8/9 Output
0
7.4 years ago by
biohack92150
United States
biohack92150 wrote:

I'm new to BLAT and I need help interpreting the output.

This is a portion of the m9 output from a BLAT alignment

``````# BLAT 35 [2009/02/26]
# Query: 488:2:6:613:1563
# Database: reference.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     1212216 1212109 7.9e-44 175.0
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     808780  808673  7.9e-44 175.0
``````

Does the percent identity (90.74%) indicate % identity of the entire read or is this % identity of the fraction of the read?

I'm interpreting it as the %identity of the aligned portion of a read; how would you calculate the % identity of that particular read to that position?

alignment blat • 3.9k views
modified 7.4 years ago by a.zielezinski9.1k • written 7.4 years ago by biohack92150
6
7.4 years ago by
a.zielezinski9.1k
a.zielezinski9.1k wrote:

Percent identity in BLAT results indicates a fraction of the read sequence that is aligned to a subject sequence. BLAT aims at identifying the best pair of regions, one from each sequence, query and subject, such that the alignment of these two regions is the best possible, but not necessarily across the entire two sequences.

Before you calculate the percent identity of a full-length read's sequence, you have to know four things:

1. Alignment length (`alen`)
2. Number of mismatches (`mis`)
3. Number of gaps (`gaps`)
4. Full length of a read sequence (`len`)

First three things are present in the m9 BLAT output.

Then, to calculate the percent identity use the following equation:

``````Pident = (alen-mis-gaps)/len*100
``````

Awesome, thank you!

I have not used blast for a long time, so I could be wrong. But as I remember, identity=(alen-mis-gaps)/alen. This actually also makes more sense to me.

Be aware that this formula assumes that all non-aligned portions of the read are mismatches, which is not necessarily true if you were actually doing the complete alignment. The FASTA package has a global-local aligner that fources one of the two sequences (the read in this case) to be entirely aligned against a part of another sequence. The PID of such a glocal alignment might be higher than the one of this formula.

I'm aligning sequencing reads to reference genomes so the equation seems to be suitable for the job.

1
7.4 years ago by
Jeremy Leipzig19k wrote:

You might have better luck deriving the actual % identity from .psl output. The m9 output is provided strictly for BLAST lovers - the "gap openings" are not "gapped bases" and the percent identity (google "millibad") is ridiculously affine to account for spliced alignments. Here is some awful PHP code I wrote in my youth to rebuild the alignment from the PSL:

``````function displayAlignment(\$seq,\$target,\$db,\$qName,\$strand,\$qStart,\$qEnd,\$tName,\$tStart,\$tEnd,\$qStarts,\$tStarts,\$blockCount,\$blockSizes,\$qSize)
{
if(!\$target)
{
\$alignHash['alignment']='no target';
\$alignHash['verdict']='none';
return(\$alignHash);
}
\$rawTargetSeq=\$target;
if(!\$seq)
{
\$alignHash['alignment']='no sequence';
\$alignHash['verdict']='none';
return(\$alignHash);
}

if(\$strand=='-')
{
\$seq=revcomp(\$seq);
}

#query sequence
for(\$seqCnt=0,\$qPos=\$qStarts[0],\$tPos=\$tStarts[0]-\$tStart;\$seqCnt<sizeof(\$qStarts);\$seqCnt++)
{
\$tStarts[\$seqCnt]=\$tStarts[\$seqCnt]-\$tStart;
#double gap
#&& \$qStarts[\$seqCnt]-\$qPos==\$tStarts[\$seqCnt]-\$tPos
if(\$qStarts[\$seqCnt]>\$qPos && \$tStarts[\$seqCnt]>\$tPos && true)
{
\$doubleGap=min(\$qStarts[\$seqCnt]-\$qPos,\$tStarts[\$seqCnt]-\$tPos);
for(\$i=0;\$i<\$doubleGap;\$i++)
{
\$qSeq.=substr(\$seq,\$qPos+\$i,1);
\$tSeq.=substr(\$rawTargetSeq,\$tPos+\$i,1);
}
\$qPos+=\$doubleGap;
\$tPos+=\$doubleGap;
}
if(\$qStarts[\$seqCnt]>\$qPos)
{
for(\$i=0;\$i<\$qStarts[\$seqCnt]-\$qPos;\$i++)
{
\$tSeq.='-'; #insert in query seq, next alignment block starts later
\$qSeq.=substr(\$seq,\$qPos+\$i,1);
}
}
if(\$tStarts[\$seqCnt]>\$tPos)
{
for(\$i=0;\$i<\$tStarts[\$seqCnt]-\$tPos;\$i++)
{
\$qSeq.='-';
\$tSeq.=substr(\$rawTargetSeq,\$tPos+\$i,1);
}
}
\$qSeq.=substr(\$seq,\$qStarts[\$seqCnt],\$blockSizes[\$seqCnt]);
\$tSeq.=substr(\$rawTargetSeq,\$tStarts[\$seqCnt],\$blockSizes[\$seqCnt]);
\$qPos=\$qStarts[\$seqCnt]+\$blockSizes[\$seqCnt];
\$tPos=\$tStarts[\$seqCnt]+\$blockSizes[\$seqCnt];
}

if(\$strand=='-')
{
\$qSeq=revcomp(\$qSeq);
\$tSeq=revcomp(\$tSeq);
}

\$mismatches=0;
\$mismatches50=0;
\$mismatches100=0;
\$redMode=false;
\$alignLen=strlen(\$qSeq);#arbitrary, could also be tSeq
for(\$i=0;\$i<strlen(\$qSeq);\$i++) {
if(preg_match('/[NP]/i',substr(\$qSeq, \$i, 1)) || preg_match('/[NP]/i',substr(\$tSeq, \$i, 1)))
{
#do not count for or against
\$alignLen--;
}
else
{
if (strtoupper(substr(\$qSeq, \$i, 1)) != strtoupper(substr(\$tSeq, \$i, 1))) {
\$mismatches++;
if(\$i<50) \$mismatches50++;
if(\$i<100) \$mismatches100++;
if(!\$redMode)
{
\$fqSeq.="<SPAN class=\"red\">";
\$redMode=true;
}
\$fqSeq.=substr(\$qSeq, \$i, 1);
\$ftSeq.=substr(\$tSeq, \$i, 1);
}
else
{
if(\$redMode)
{
\$fqSeq.="</SPAN>";
\$redMode=false;
}
\$fqSeq.=substr(\$qSeq, \$i, 1);
\$ftSeq.=substr(\$tSeq, \$i, 1);
}
}
}
if(\$alignLen>=100)
{
if(\$mismatches50<=1 || \$mismatches100 <=2 || (((\$alignLen-\$mismatches)/\$alignLen)*100) > 98) \$verdict="PASS"; else \$verdict="FAIL";
\$stats.="<br>\$mismatches50 first 50 (".(((50-\$mismatches50)/50)*100)."%)<br>\$mismatches100 first 100 (".(((100-\$mismatches100)/100)*100)."%)<br>\$mismatches total mismatches (".number_format((((\$alignLen-\$mismatches)/\$alignLen)*100),2)."%)";
}
else if(\$alignLen>=50)
{
if(\$mismatches50<=1 || (((\$alignLen-\$mismatches)/\$alignLen)*100) > 98) \$verdict="PASS"; else \$verdict="FAIL";
\$stats.="<br>\$mismatches50 first 50 (".(((50-\$mismatches50)/50)*100)."%)<br>\$mismatches total mismatches (".number_format((((\$alignLen-\$mismatches)/\$alignLen)*100),2)."%)";
}
else
{
if(\$mismatches==0) \$verdict="PASS"; else \$verdict.="FAIL";
\$stats.="<br>\$mismatches total mismatches (".number_format((((\$alignLen-\$mismatches)/\$alignLen)*100),2)."%)";
}

\$alignment.=\$stats;
\$alignment.="<br>query: ".\$fqSeq;
\$alignment.="<br>target:".\$ftSeq;
#fclose(\$fa);
if(strlen(\$alignment)<8000)
\$alignHash['alignment']=\$alignment;
else
\$alignHash['alignment']='too long to display';
\$alignHash['verdict']=\$verdict;
return(\$alignHash);
}
``````