Question: Need Help Interpreting The Percent Identity From Blat M8/9 Output
0
gravatar for biohack92
6.8 years ago by
biohack92150
United States
biohack92150 wrote:

I'm new to BLAT and I need help interpreting the output.

This is a portion of the m9 output from a BLAT alignment

# BLAT 35 [2009/02/26]
# Query: 488:2:6:613:1563
# Database: reference.fasta
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     1212216 1212109 7.9e-44 175.0
488:2:6:613:1563        gi|315125111|ref|NC_014803.1|   90.74   108     10      0       1       108     808780  808673  7.9e-44 175.0

Does the percent identity (90.74%) indicate % identity of the entire read or is this % identity of the fraction of the read?

I'm interpreting it as the %identity of the aligned portion of a read; how would you calculate the % identity of that particular read to that position?

alignment blat • 3.7k views
ADD COMMENTlink modified 6.8 years ago by a.zielezinski8.8k • written 6.8 years ago by biohack92150
6
gravatar for a.zielezinski
6.8 years ago by
a.zielezinski8.8k
a.zielezinski8.8k wrote:

Percent identity in BLAT results indicates a fraction of the read sequence that is aligned to a subject sequence. BLAT aims at identifying the best pair of regions, one from each sequence, query and subject, such that the alignment of these two regions is the best possible, but not necessarily across the entire two sequences.

Before you calculate the percent identity of a full-length read's sequence, you have to know four things:

  1. Alignment length (alen)
  2. Number of mismatches (mis)
  3. Number of gaps (gaps)
  4. Full length of a read sequence (len)

First three things are present in the m9 BLAT output.

Then, to calculate the percent identity use the following equation:

Pident = (alen-mis-gaps)/len*100
ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by a.zielezinski8.8k

Awesome, thank you!

ADD REPLYlink written 6.8 years ago by biohack92150

I have not used blast for a long time, so I could be wrong. But as I remember, identity=(alen-mis-gaps)/alen. This actually also makes more sense to me.

ADD REPLYlink written 6.8 years ago by lh331k

Be aware that this formula assumes that all non-aligned portions of the read are mismatches, which is not necessarily true if you were actually doing the complete alignment. The FASTA package has a global-local aligner that fources one of the two sequences (the read in this case) to be entirely aligned against a part of another sequence. The PID of such a glocal alignment might be higher than the one of this formula.

ADD REPLYlink written 6.8 years ago by Christian2.8k

I'm aligning sequencing reads to reference genomes so the equation seems to be suitable for the job.

ADD REPLYlink written 6.8 years ago by biohack92150
1
gravatar for Jeremy Leipzig
6.8 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

You might have better luck deriving the actual % identity from .psl output. The m9 output is provided strictly for BLAST lovers - the "gap openings" are not "gapped bases" and the percent identity (google "millibad") is ridiculously affine to account for spliced alignments. Here is some awful PHP code I wrote in my youth to rebuild the alignment from the PSL:

function displayAlignment($seq,$target,$db,$qName,$strand,$qStart,$qEnd,$tName,$tStart,$tEnd,$qStarts,$tStarts,$blockCount,$blockSizes,$qSize)
{
    if(!$target)
    {
        $alignHash['alignment']='no target';
        $alignHash['verdict']='none';
        return($alignHash);
    }
    $rawTargetSeq=$target;
    if(!$seq)
    {
        $alignHash['alignment']='no sequence';
        $alignHash['verdict']='none';
        return($alignHash);
    }

    if($strand=='-')
    {
        $seq=revcomp($seq);
    }

    #query sequence
    for($seqCnt=0,$qPos=$qStarts[0],$tPos=$tStarts[0]-$tStart;$seqCnt<sizeof($qStarts);$seqCnt++)
    {
        $tStarts[$seqCnt]=$tStarts[$seqCnt]-$tStart;
        #double gap
        #&& $qStarts[$seqCnt]-$qPos==$tStarts[$seqCnt]-$tPos
        if($qStarts[$seqCnt]>$qPos && $tStarts[$seqCnt]>$tPos && true)
        {
            $doubleGap=min($qStarts[$seqCnt]-$qPos,$tStarts[$seqCnt]-$tPos);
            for($i=0;$i<$doubleGap;$i++)
            {
                $qSeq.=substr($seq,$qPos+$i,1);
                $tSeq.=substr($rawTargetSeq,$tPos+$i,1);
            }
            $qPos+=$doubleGap;
            $tPos+=$doubleGap;
        }
        if($qStarts[$seqCnt]>$qPos)
        {
            for($i=0;$i<$qStarts[$seqCnt]-$qPos;$i++)
            {
                $tSeq.='-'; #insert in query seq, next alignment block starts later
                $qSeq.=substr($seq,$qPos+$i,1);
            }
        }
        if($tStarts[$seqCnt]>$tPos)
        {
            for($i=0;$i<$tStarts[$seqCnt]-$tPos;$i++)
            {
                $qSeq.='-';
                $tSeq.=substr($rawTargetSeq,$tPos+$i,1);
            }
        }
        $qSeq.=substr($seq,$qStarts[$seqCnt],$blockSizes[$seqCnt]);
        $tSeq.=substr($rawTargetSeq,$tStarts[$seqCnt],$blockSizes[$seqCnt]);
        $qPos=$qStarts[$seqCnt]+$blockSizes[$seqCnt];
        $tPos=$tStarts[$seqCnt]+$blockSizes[$seqCnt];
    }

    if($strand=='-')
    {
        $qSeq=revcomp($qSeq);
        $tSeq=revcomp($tSeq);
    }

    $mismatches=0;
    $mismatches50=0;
    $mismatches100=0;
    $redMode=false;
    $alignLen=strlen($qSeq);#arbitrary, could also be tSeq
    for($i=0;$i<strlen($qSeq);$i++) {
        if(preg_match('/[NP]/i',substr($qSeq, $i, 1)) || preg_match('/[NP]/i',substr($tSeq, $i, 1)))
        {
            #do not count for or against
            $alignLen--;
        }
        else
        {
            if (strtoupper(substr($qSeq, $i, 1)) != strtoupper(substr($tSeq, $i, 1))) {
                $mismatches++;
                if($i<50) $mismatches50++;
                if($i<100) $mismatches100++;
                if(!$redMode)
                {
                    $fqSeq.="<SPAN class=\"red\">";
                    $redMode=true;
                }
                $fqSeq.=substr($qSeq, $i, 1);
                $ftSeq.=substr($tSeq, $i, 1);
            }
            else
            {
                if($redMode)
                {
                    $fqSeq.="</SPAN>";
                    $redMode=false;
                }
                $fqSeq.=substr($qSeq, $i, 1);
                $ftSeq.=substr($tSeq, $i, 1);
            }
        }
    }
    if($alignLen>=100)
    {
        if($mismatches50<=1 || $mismatches100 <=2 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
        $stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches100 first 100 (".(((100-$mismatches100)/100)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }
    else if($alignLen>=50)
    {
        if($mismatches50<=1 || ((($alignLen-$mismatches)/$alignLen)*100) > 98) $verdict="PASS"; else $verdict="FAIL";
        $stats.="<br>$mismatches50 first 50 (".(((50-$mismatches50)/50)*100)."%)<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }
    else
    {
        if($mismatches==0) $verdict="PASS"; else $verdict.="FAIL";
        $stats.="<br>$mismatches total mismatches (".number_format(((($alignLen-$mismatches)/$alignLen)*100),2)."%)";
    }

        $alignment.=$stats;
        $alignment.="<br>query: ".$fqSeq;
        $alignment.="<br>target:".$ftSeq;
    #fclose($fa);
    if(strlen($alignment)<8000)
        $alignHash['alignment']=$alignment;
    else
        $alignHash['alignment']='too long to display';
    $alignHash['verdict']=$verdict;
    return($alignHash);
}
ADD COMMENTlink written 6.8 years ago by Jeremy Leipzig18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 768 users visited in the last hour