How To Spot Mutation In Msa ?
4
2
Entering edit mode
10.5 years ago
Giselle ▴ 130

Hey there,

i have a set of genes, where i want to find a mutations ? ( preferably in one specific organism )

example :

• organism1 ACGTACTAC
• organism2 ACGTACTCC
• organism3 ACGTACTCC
• organism4 ACGTACTCC
• organism5 ATGTACTCC

output would be something like that : oragnism1 has mutation on position 8. optional : organism5 has muation on position 2.

I am not an bioinformatician. So maybe, it will be easy to code OR there is already a tool, but i'm not aware of it.

Thanks !!

multiple mutation • 3.0k views
2
Entering edit mode
10.5 years ago

Here is a utility in R custom built for you.

## Step 1

Create the script file get-msa-variants.R

#!/usr/bin/env Rscript

# Parse arguments
script_args <- commandArgs(trailingOnly=T)
file_name <- script_args[[1]]

ma <- msa.tab$V2 # Convert to matrix num_sequences <- length(ma) ma <- paste(ma, sep="", collapse="") ma <- strsplit(ma, "")[[1]] ma <- matrix(ma, nrow=num_sequences, byrow=T) rownames(ma) <- msa.tab$V1

# Calculate consensus
get_consensus <- function(x){
t <- table(x)
consensus <- names(t[t == max(t)])
if (length(consensus) > 1) {
return(sample(consensus, 1)) # Break a tie randomly
} else {
return(consensus)
}
}
consensus <- apply(ma, 2, get_consensus)

# How to Spot mutation in MSA? The answer.
mutations <- apply(ma, 1, function(x){ which(x != consensus) })

# Print results
invisible(
lapply(names(mutations), function(seq_name) {
lapply(mutations[[seq_name]], function(mut_idx) {
cat(seq_name, "has a mutation", ma[seq_name,][[mut_idx]],
"at index", mut_idx, "\n")
})
})
)

## DEBUGGING
#print(file_name)
#print(msa.tab)
#print(ma)
#print(consensus)
#print(mutations)


## Step 2

Make the scirpt executable

chmod +x get-msa-variants.R


## Step 3

Create an input file msa.tab and put your MSA into a file in this format:

organism1 ACGTACTAC
organism2 ACGTACTCC
organism3 ACGTACTCC
organism4 ACGTACTCC
organism5 ATGTACTCC


## Step 4

Run the script

./get-msa-variants.R msa.tab


## Output

organism1 has a mutation A at index 8
organism5 has a mutation T at index 2

1
Entering edit mode
10.5 years ago

without any "programming" you can put your sequences in a HTML+javascript file and let your browser performing the calculation. Here is my naive implementation:

<html>
<script type="text/javascript">
function init()
{
var result="";
var sequences=[
{name:"organism1", seq:"ACGTACTAC"},
{name:"organism2", seq:"ACGTACTCC"},
{name:"organism3", seq:"ACGTACTCC"},
{name:"organism4", seq:"ACGTACTCC"},
{name:"organism5", seq:"ATGTACTCC"}
];
/* all sequences have the same length L */
var L=sequences[0].seq.length;

/* iterate over length */
for(var j=0;j < L;++j)
{
var count={};
var max=0;
var consensus="*";
/* iterator over sequences , get the consensus*/
for(var i=0;i < sequences.length;++i)
{
var base= sequences[i].seq[j];
if(count[base]==undefined)
{
count[base]=1;
}
else
{
count[base]++;
}
if(max<count[base])
{
max=count[base];
consensus=base;
}
}
for(var i=0;i < sequences.length;++i)
{
var base= sequences[i].seq[j];
if(base!=consensus)
{
result+= sequences[i].name+" has mutation "+base+" at index "+(j+1)+"\n";
}
}
}
var pre=document.getElementById("main");
pre.appendChild(document.createTextNode(result));
}
</script>
<pre id="main"/>
</body>
</html>


Result:

organism5 has mutation T at index 2
organism1 has mutation A at index 8

1
Entering edit mode

+2 for using JS

1
Entering edit mode
10.5 years ago

What you are asking for is software that will generate a list of variable sites from a multiple alignment. This is a very common task in population genetics, where you have resequencing data for a genomic region and you just want to summarize the table of variant sites (i.e. those that are polymorphic or divergent).

The exact answer to this will depend on whether you are using DNA or protein sequences, if you are doing this for a handful of genes, and if you want to install/run any command-line code. Assuming that you have DNA sequences for a few genes and just want to use a GUI-based program, then try DNAsp. If you have big data or are comfortable working at the command line, I can suggest other options including compute from the libsequence package.

0
Entering edit mode
10.5 years ago
Lyco ★ 2.3k

I don't know of any tool that would to this for you in one swoop. Assuming that you have a bunch of sequences that are almost identical, the first step would be to create a multiple alignment (should be easy for nearly identical sequences). If the sequences are not too big you could use a general purpose multiple aligner such as MAFFT. If your DNAs are hundreds of kb long, I am sure that the NGS folks (who dominate Biostar) can tell you a good alternative. The next step would be mutation detection. With a little bit of PERL or another programming language it would be easy to write a program for doing that. If you just have to do it once (or a few times), you could probably do it visually by either having the alignment program printing a consensus sequence or by using an alignment rendering tool such as boxshade. You would then have to screen the output for positions where the consensus is not 100% or where boxshade highlights a sequence difference.