Question

Sequence Identity : Nucleotide Resolution

2

11.8 years ago

Nicolas Rosewick 10k

Hi,

It's a pretty simple question, but I can't find a good answer online: how can I compute the identity between sequences at nucleotide resolution?

Example:

[image: four aligned sequences in Geneious with a per-column identity bar plot]

Thanks a lot,

N.

sequence • 5.6k views

ADD COMMENT • link updated 8.5 years ago by Biostar 20 • written 11.8 years ago by Nicolas Rosewick 10k

0

Hi! Can you provide the link from which you took the figure? I think it is somehow related to PSSM and consensus sequences.

ADD REPLY • link 11.8 years ago by Vikas Bansal ★ 2.4k

0

It's from Geneious, and the data are test data to illustrate my question. I aligned the four sequences with ClustalW and then opened the output file in Geneious.

ADD REPLY • link 11.8 years ago by Nicolas Rosewick 10k

0

You could always try emailing Geneious technical support and asking. I've had to deal with them before and found them very helpful.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

score 2 · Answer 1 · 2012-06-21

2

11.8 years ago

Vikas Bansal ★ 2.4k

So I downloaded Geneious and first ran ClustalW on these 5 sequences:

>1
ATCT
>2
AGCA
>3
ATGC
>4
ATGG
>5
CGTA

When I opened the alignment file in Geneious and moved the cursor over any bar, it clearly says: "Mean pairwise identity over all pairs in column".

In your image (the one in your post, not mine), take column 2. You have 6 pairs: 3 pairs are identical (T-T) and 3 are not (G-T), so the identity is (100+100+100+0+0+0)/6 = 50%. In your last column it is (0+0+0+0+0+0)/6 = 0, so there is no bar. The bars are coloured accordingly.

EDIT (following the OP's comment):

So in your image, the 2nd column contains:

T
G
T
T

Make all possible pairs: 1st and 2nd (T,G) are not identical, so 0%; 1st and 3rd (T,T) are identical, so 100%; 1st and 4th (T,T), 100%; 2nd and 3rd (G,T), 0%; 2nd and 4th (G,T), 0%; 3rd and 4th (T,T), 100%.

Now calculate the mean pairwise identity: (0+100+100+0+0+100)/6 = 50%

We divide by 6 because there are 6 pairs (mean = sum / number of pairs).
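The same calculation can be sketched in a few lines of Python. This is only a minimal sketch of "mean pairwise identity over all pairs in column", not Geneious's actual code: the function names `column_identity` and `per_column_identity` are made up for illustration, the sequences are assumed to be already aligned to the same length, and a gap character would be compared like any other symbol, which may differ from how Geneious treats gaps.

```python
from itertools import combinations


def column_identity(column):
    """Mean pairwise identity (in percent) over all pairs in one alignment column."""
    pairs = list(combinations(column, 2))  # N*(N-1)/2 unordered pairs
    if not pairs:
        return 0.0  # fewer than two sequences: no pairs to compare
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def per_column_identity(aligned_seqs):
    """Apply column_identity to every column of equal-length aligned sequences."""
    # zip(*rows) turns the list of rows into an iterator over columns
    return [column_identity(col) for col in zip(*aligned_seqs)]


# Column 2 from the question's image: T, G, T, T
print(column_identity("TGTT"))  # 50.0

# The five test sequences from this answer, assumed here to align without gaps
print(per_column_identity(["ATCT", "AGCA", "ATGC", "ATGG", "CGTA"]))
```

With four sequences there are 6 pairs per column, with five there are 10, so the divisor changes with the number of sequences in the alignment.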

Hope this helps.

ADD COMMENT • link 11.8 years ago by Vikas Bansal ★ 2.4k

1

Could someone explain why this answer was downvoted?

ADD REPLY • link 11.8 years ago by Vikas Bansal ★ 2.4k

0

I don't understand this part of your explanation: "You have 6 pairs. 3 pairs are identical (T-T) and 3 are not (G-T). So (100+100+100+0+0+0)/6"

ADD REPLY • link 11.8 years ago by Nicolas Rosewick 10k

0

Have a look at my edit.

ADD REPLY • link 11.8 years ago by Vikas Bansal ★ 2.4k

0

Thanks, it's much clearer now!

ADD REPLY • link 11.8 years ago by Nicolas Rosewick 10k

score 1 · Answer 2 · 2012-06-21

I use alistat from the HMMER package.

alistat reads a multiple sequence alignment from the file alignfile in any supported format (including SELEX, GCG MSF, and CLUSTAL), and shows a number of simple statistics about it. These statistics include the name of the format, the number of sequences, the total number of residues, the average and range of the sequence lengths, the alignment length (e.g. including gap characters).

Also shown are some percent identities. A percent pairwise alignment identity is defined as (idents / MIN(len1, len2)) where idents is the number of exact identities and len1, len2 are the unaligned lengths of the two sequences. The "average percent identity", "most related pair", and "most unrelated pair" of the alignment are the average, maximum, and minimum of all (N)(N-1)/2 pairs, respectively. The "most distant seq" is calculated by finding the maximum pairwise identity (best relative) for all N sequences, then finding the minimum of these N numbers (hence, the most outlying sequence).
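For anyone who wants to see what those definitions amount to, here is a rough Python sketch of the statistics described above. It is not alistat's source code, just a reading of the quoted definition: the function names are hypothetical, the gap characters are assumed to be "-" and ".", and identities are returned as fractions rather than percentages.

```python
from itertools import combinations

GAP_CHARS = set("-.")  # assumption: these two characters mark gaps


def unaligned_length(seq):
    """Length of an aligned sequence once gap characters are removed."""
    return sum(1 for c in seq if c not in GAP_CHARS)


def pairwise_identity(a, b):
    """idents / MIN(len1, len2), using the unaligned lengths of both sequences."""
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in GAP_CHARS)
    min_len = min(unaligned_length(a), unaligned_length(b))
    return idents / min_len if min_len else 0.0


def alignment_summary(seqs):
    """Average, maximum and minimum identity over all N(N-1)/2 pairs,
    plus the 'most distant seq' value (minimum of each sequence's best match)."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two aligned sequences")
    ident = {(i, j): pairwise_identity(seqs[i], seqs[j])
             for i, j in combinations(range(n), 2)}
    values = list(ident.values())
    # Best relative of sequence i = its highest identity to any other sequence
    best = [max(ident[tuple(sorted((i, j)))] for j in range(n) if j != i)
            for i in range(n)]
    return {
        "average": sum(values) / len(values),
        "most_related": max(values),
        "most_unrelated": min(values),
        "most_distant": min(best),
    }


print(alignment_summary(["ATCT", "AGCA", "ATGC", "ATGG", "CGTA"]))
```

Note that this gives one number per pair of sequences (and summaries over the whole alignment), not one number per column, so it answers a slightly different question from the per-position bar plot.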

score 0 · Answer 3 · 2012-06-21

0

11.8 years ago

Biomonika (Noolean) 3.2k

Hi, you are probably confused by 'K' and the other unusual letters. These are IUPAC codes:

http://www.bioinformatics.org/sms/iupac.html

For example, K stands for G or T. This keeps some additional information that you would lose by writing just N for an unknown base. The first column of your alignment is A in every sequence, so that one is easy. Or what do you mean by identity? How similar all the sequences are to each other, how similar they are to the consensus, or something else?
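To make the IUPAC point concrete, here is a small Python sketch that turns each alignment column into the IUPAC letter covering every base seen in it. The function name `iupac_consensus` is made up for illustration, and the rule used (include every observed base) is an assumption: Geneious may build its consensus with a different threshold. Only uppercase A, C, G and T are handled, so gaps or lowercase letters would raise a `KeyError`.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(aligned_seqs):
    """One IUPAC letter per column, covering every base observed in that column."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*aligned_seqs))


# A column containing T, G, T, T mixes G and T, which is written as K
print(iupac_consensus(["T", "G", "T", "T"]))  # K
```

A column where all sequences agree keeps its plain letter, which is why the first position simply shows A.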

ADD COMMENT • link 11.8 years ago by Biomonika (Noolean) 3.2k

0

I'm interested in how they compute the identity score for each position. In this example, it's plotted as a bar plot.

ADD REPLY • link 11.8 years ago by Nicolas Rosewick 10k