How To Spot Mutation In Msa ?
4
2
Entering edit mode
12.8 years ago
Giselle ▴ 130

Hey there,

i have a set of genes, where i want to find a mutations ? ( preferably in one specific organism )

example :

  • organism1 ACGTACTAC
  • organism2 ACGTACTCC
  • organism3 ACGTACTCC
  • organism4 ACGTACTCC
  • organism5 ATGTACTCC

output would be something like that : oragnism1 has mutation on position 8. optional : organism5 has muation on position 2.

I am not an bioinformatician. So maybe, it will be easy to code OR there is already a tool, but i'm not aware of it.

Thanks !!

multiple mutation • 3.8k views
ADD COMMENT
2
Entering edit mode
12.8 years ago

Here is a utility in R custom built for you.

Step 1

Create the script file get-msa-variants.R

#!/usr/bin/env Rscript

# Parse arguments
script_args <- commandArgs(trailingOnly=T)
file_name <- script_args[[1]]

# Load data
msa.tab <- read.table(file_name)
ma <- msa.tab$V2

# Convert to matrix
num_sequences <- length(ma)
ma <- paste(ma, sep="", collapse="")
ma <- strsplit(ma, "")[[1]]
ma <- matrix(ma, nrow=num_sequences, byrow=T)
rownames(ma) <- msa.tab$V1

# Calculate consensus
get_consensus <- function(x){
  t <- table(x)
  consensus <- names(t[t == max(t)])
  if (length(consensus) > 1) {
    return(sample(consensus, 1)) # Break a tie randomly
  } else {
    return(consensus)
  }
}
consensus <- apply(ma, 2, get_consensus)

# How to Spot mutation in MSA? The answer.
mutations <- apply(ma, 1, function(x){ which(x != consensus) })

# Print results
invisible(
  lapply(names(mutations), function(seq_name) {
    lapply(mutations[[seq_name]], function(mut_idx) {
        cat(seq_name, "has a mutation", ma[seq_name,][[mut_idx]],
          "at index", mut_idx, "\n")
    })
  })
)

## DEBUGGING
#print(file_name)
#print(msa.tab)
#print(ma)
#print(consensus)
#print(mutations)

Step 2

Make the scirpt executable

chmod +x get-msa-variants.R

Step 3

Create an input file msa.tab and put your MSA into a file in this format:

organism1 ACGTACTAC
organism2 ACGTACTCC
organism3 ACGTACTCC
organism4 ACGTACTCC
organism5 ATGTACTCC

Step 4

Run the script

./get-msa-variants.R msa.tab

Output

organism1 has a mutation A at index 8 
organism5 has a mutation T at index 2
ADD COMMENT
1
Entering edit mode
12.8 years ago

without any "programming" you can put your sequences in a HTML+javascript file and let your browser performing the calculation. Here is my naive implementation:

<html>
<head>
<script type="text/javascript">
function init()
    {
    var result="";
    var sequences=[
        {name:"organism1", seq:"ACGTACTAC"},
        {name:"organism2", seq:"ACGTACTCC"},
        {name:"organism3", seq:"ACGTACTCC"},
        {name:"organism4", seq:"ACGTACTCC"},
        {name:"organism5", seq:"ATGTACTCC"}
        ];
    /* all sequences have the same length L */
    var L=sequences[0].seq.length;

    /* iterate over length */
    for(var j=0;j < L;++j)
        {
        var count={};
        var max=0;
        var consensus="*";
        /* iterator over sequences , get the consensus*/
        for(var i=0;i < sequences.length;++i)
            {
            var base= sequences[i].seq[j];
            if(count[base]==undefined)
                {
                count[base]=1;
                }
            else
                {
                count[base]++;
                }
            if(max<count[base])
                {
                max=count[base];
                consensus=base;
                }
            }
        for(var i=0;i < sequences.length;++i)
            {
            var base= sequences[i].seq[j];
            if(base!=consensus)
                {
                result+= sequences[i].name+" has mutation "+base+" at index "+(j+1)+"\n";
                }
            }
        }
    var pre=document.getElementById("main");
    pre.appendChild(document.createTextNode(result));
    }
</script>
</head>
<body onload="init()">
<pre id="main"/>
</body>
</html>

Result:

organism5 has mutation T at index 2
organism1 has mutation A at index 8
ADD COMMENT
1
Entering edit mode

+2 for using JS

ADD REPLY
1
Entering edit mode
12.8 years ago

What you are asking for is software that will generate a list of variable sites from a multiple alignment. This is a very common task in population genetics, where you have resequencing data for a genomic region and you just want to summarize the table of variant sites (i.e. those that are polymorphic or divergent).

The exact answer to this will depend on whether you are using DNA or protein sequences, if you are doing this for a handful of genes, and if you want to install/run any command-line code. Assuming that you have DNA sequences for a few genes and just want to use a GUI-based program, then try DNAsp. If you have big data or are comfortable working at the command line, I can suggest other options including compute from the libsequence package.

ADD COMMENT
0
Entering edit mode
12.8 years ago
Lyco ★ 2.3k

I don't know of any tool that would to this for you in one swoop. Assuming that you have a bunch of sequences that are almost identical, the first step would be to create a multiple alignment (should be easy for nearly identical sequences). If the sequences are not too big you could use a general purpose multiple aligner such as MAFFT. If your DNAs are hundreds of kb long, I am sure that the NGS folks (who dominate Biostar) can tell you a good alternative. The next step would be mutation detection. With a little bit of PERL or another programming language it would be easy to write a program for doing that. If you just have to do it once (or a few times), you could probably do it visually by either having the alignment program printing a consensus sequence or by using an alignment rendering tool such as boxshade. You would then have to screen the output for positions where the consensus is not 100% or where boxshade highlights a sequence difference.

ADD COMMENT

Login before adding your answer.

Traffic: 2289 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6