Sequence Identity : Nucleotide Resolution
11.8 years ago

Hi,

It's a pretty simple question, but I cannot find a good answer on the internet: how can I compute the identity between sequences at nucleotide resolution?

Example :

[image: four aligned sequences shown in Geneious, with a per-column identity bar plot]

Thanks a lot,

N.

sequence • 5.6k views

Hi! Can you provide the link from which you took the figure? I think it is somehow related to PSSM and consensus sequences.


It's from Geneious, and the data are test data to illustrate my question. I aligned the four sequences with ClustalW and then opened the output file in Geneious.


You could always try emailing Geneious technical support and asking. I've had to deal with them before and found them very helpful.

11.8 years ago
Vikas Bansal ★ 2.4k

So I downloaded Geneious and first ran ClustalW on these 5 sequences:

>1
ATCT
>2
AGCA
>3
ATGC
>4
ATGG
>5
CGTA

When I opened the alignment file in Geneious:

image

and if you hover your cursor over any bar:

image

So, as it clearly says, the bar shows the "Mean pairwise identity over all pairs in column". In your image (the image in your post, not mine), let's take column 2. You have 6 pairs: 3 pairs are identical (T-T) and 3 are not (G-T). So (100+100+100+0+0+0)/6 = 50%. In your last column, (0+0+0+0+0+0)/6 = 0, so there is no bar. The colours are assigned accordingly.

EDIT: In response to the OP's comment:

So in your image, the 2nd column contains:

T
G
T
T

Make all possible pairs: 1st and 2nd (T,G) are not identical, so 0%; 1st and 3rd (T,T) are identical, so 100%; 1st and 4th (T,T), 100%; 2nd and 3rd (G,T), 0%; 2nd and 4th (G,T), 0%; 3rd and 4th (T,T), 100%.

Now calculate the mean pairwise identity: (0+100+100+0+0+100)/6 = 50%

We divide by 6 because there are 6 pairs (mean = sum / number of pairs).
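If you want to reproduce this calculation outside Geneious, here is a minimal Python sketch of the "mean pairwise identity over all pairs in column" worked through above. The function names are my own, and it assumes the sequences are already aligned to equal length. It treats a gap character like any other symbol, which may not match how Geneious handles gaps.

```python
from itertools import combinations


def column_identity(column):
    """Mean pairwise identity (percent) over all pairs in one alignment column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0  # assumption: a column with fewer than two sequences scores 0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def alignment_identity(seqs):
    """Per-column identity for aligned sequences (assumed to be equal length)."""
    # zip(*seqs) walks the alignment column by column
    return [column_identity(col) for col in zip(*seqs)]


# Column 2 from the question: T, G, T, T
print(column_identity("TGTT"))  # 50.0

# The five test sequences from this answer
print(alignment_identity(["ATCT", "AGCA", "ATGC", "ATGG", "CGTA"]))
# [60.0, 40.0, 20.0, 10.0]
```

With N sequences there are N(N-1)/2 pairs per column, which is where the 6 comes from for four sequences.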

Hope this helps.


Could someone explain why this answer was downvoted?


I don't understand this part of your explanation: "You have 6 pairs. 3 pairs are identical (T-T) and 3 are not (G-T). So (100+100+100+0+0+0)/6"


Have a look at my edit.


Thanks, it's much clearer now!

11.8 years ago

I use alistat from the HMMER package.

alistat reads a multiple sequence alignment from the file alignfile in any supported format (including SELEX, GCG MSF, and CLUSTAL), and shows a number of simple statistics about it. These statistics include the name of the format, the number of sequences, the total number of residues, the average and range of the sequence lengths, the alignment length (e.g. including gap characters).

Also shown are some percent identities. A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively. The "most distant seq" is calculated by finding the maximum pairwise identity (best relative) for all N sequences, then finding the minimum of these N numbers (hence, the most outlying sequence).
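To make the definition above concrete, here is a rough Python sketch of the same statistics. This illustrates the quoted formula and is not alistat's actual code. The function names, the choice of `-` and `.` as gap characters, and the rule that two aligned gaps do not count as an identity are all my assumptions. It needs at least two aligned sequences.

```python
from itertools import combinations

GAP_CHARS = set("-.")  # assumption: these characters mark gaps


def pairwise_identity(a, b):
    """idents / MIN(len1, len2), where len1 and len2 are gap-free lengths."""
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    len1 = sum(1 for x in a if x not in GAP_CHARS)
    len2 = sum(1 for y in b if y not in GAP_CHARS)
    shortest = min(len1, len2)
    return idents / shortest if shortest else 0.0


def identity_summary(seqs):
    """Average, maximum and minimum identity over all N(N-1)/2 pairs."""
    pairs = {
        (i, j): pairwise_identity(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }
    values = list(pairs.values())
    # Best relative of each sequence: its highest identity to any other sequence
    best = [
        max(v for (i, j), v in pairs.items() if k in (i, j))
        for k in range(len(seqs))
    ]
    return {
        "average": sum(values) / len(values),
        "most_related": max(values),
        "most_unrelated": min(values),
        # Index of the sequence whose best relative is the weakest
        "most_distant_seq": best.index(min(best)),
    }


summary = identity_summary(["ATCT", "AGCA", "ATGC", "ATGG", "CGTA"])
print(summary["average"], summary["most_related"], summary["most_unrelated"])
```

Note that this gives one number per pair of sequences (or per alignment), not one number per column, so it answers a slightly different question from the per-column bar plot.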

11.8 years ago

Hi, you are probably confused by 'K' and the other strange letters. These are IUPAC codes:

http://www.bioinformatics.org/sms/iupac.html

For example, K stands for G or T. These codes carry additional information that you would lose by writing just N for an unknown base. The first column of your alignment is A in every sequence, so it's easy. Or what do you mean by identity? How similar all the sequences are to each other, how similar they are to the consensus, or something else?
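As a small illustration of how a letter like K arises, here is a Python sketch that maps the set of bases seen in one alignment column to its IUPAC nucleotide code. The function name is made up for this example. It reports every base present in the column, whereas real consensus builders often apply a frequency threshold, so this is not necessarily how Geneious derives its consensus line.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they cover
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_code(column):
    """IUPAC code covering every base present in one alignment column."""
    # Assumption: anything that is not A, C, G or T (gaps, for example) is ignored
    bases = frozenset(b for b in column.upper() if b in "ACGT")
    return IUPAC.get(bases, "-")  # "-" when the column holds no bases at all


print(ambiguity_code("TGTT"))  # K
print(ambiguity_code("AAAA"))  # A
```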


I'm interested in how they compute the identity score for each position. In this example, it's shown as a bar plot.
