Question: How To Generate Mutations In Sequences Of A Fasta File?
2
gravatar for Geparada
8.3 years ago by
Geparada1.4k
Cambridge
Geparada1.4k wrote:

HI!

For example, if I have a file with 3 sequences:

>1
gtccgatgat
>2
gatggatcggacgttagcaa
>3
acatttgaga

I want to generate sequences with 10% of the nucleotides randomly mutated.

>1
gtcTgatgat
>2
gaAggatcggacgtGagcaa
>3
acatCtgaga

Does anybody know a program or code to do that?

And if you know a way to do this in python would be great!

Thanks!!!

fasta python sequence mutation tool • 9.3k views
ADD COMMENTlink modified 8.3 years ago by brentp23k • written 8.3 years ago by Geparada1.4k
9
gravatar for brentp
8.3 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

call the program below like:

python mutate.py some.fasta 0.01 > mutated.fasta

And it will change 1% (0.01) of the bases.

from random import random, choice
import sys
from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq

def main(fasta_name, mutation_freq):

    for header, seq in fasta_iter(fasta_name):
        seq = list(seq)
        for i, s in enumerate(seq):
            val = random()
            if val < mutation_freq:
                # choose a random nucleotide that's different.
                seq[i] = choice([x for x in "ACTG" if x != s.upper()])
        print ">%s\n%s" % (header, "".join(seq))

if __name__ == "__main__":
    main(sys.argv[1], float(sys.argv[2]))
ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by brentp23k
2

on 1) i dont disagree but favor simplicity over performance unless it proves a problem. For 2) it probably depends on the mutation rate which is more efficient. And for 3) I guess that depends on what "mutation rate" means.

ADD REPLYlink written 8.3 years ago by brentp23k
1

actually, you were right. I changed it to s.upper() :) ... thanks.

ADD REPLYlink written 8.3 years ago by brentp23k

nice! just one few suggestions, 1)choice is really slower than random, as we only want to get one element the corresponding line could be changed using 'ATGC'[int(random()4)] 2) seq can be let as string, seq = seq[i:] + 'ATGC'[int(random()4)] + seq[:i+1]... it is perhaps more efficient 3)why should the mutation result in a different nucleotide?

ADD REPLYlink written 8.3 years ago by fransua390

Nice code brentp! thanks !!!

ADD REPLYlink written 8.3 years ago by Geparada1.4k

I've learned a lot from this code... you really help me brentp :)

ADD REPLYlink written 8.3 years ago by Geparada1.4k

Am I mistaken to believe this will only give a mutation rate equivalent to 3/4 of the one you specify? Namely, 1 out of 4 of your mutations will be mutations to the same nucleotide, thus not being a mutation? Maybe a while loop would correct for that easily. (while new == old: pick new) or something.

ADD REPLYlink written 8.3 years ago by Eric Normandeau10k

@Eric, yes, you are mistaken. See @fransua's comment. When the threshold is exceeded. It changes the base. That's the part about: choice([x for x in "ACTG" if not x == s.lower()])

ADD REPLYlink written 8.3 years ago by brentp23k

I was not sure I understood your list comprehension at first either. Now I see. Thanks

ADD REPLYlink written 8.3 years ago by Eric Normandeau10k

Ups... I didn't saw the error when I analyzed the script. Thanks for the correction!!

ADD REPLYlink written 8.3 years ago by Geparada1.4k

This is great. Sorry, Im new to phyton ... how would you go about to modify the script to directly introduce, say, 2 mutations? (I mean, not depending on a mutation frequency, but telling the script to introduce 2 mutations in each given sequence regardless of size). Thanks!

ADD REPLYlink written 3.6 years ago by boludopublico0
7
gravatar for iw9oel_ad
8.3 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

The EMBOSS program msbar will do this. E.g. to introduce 10 point mutations

msbar -count 10 -point 4 seq.fasta

The argument '4' allows only base changes, but msbar can produce other types of mutation. There is also an argument to ensure that the mutation does not create another specific sequence (e.g. the one you started with). You can use infoseq to obtain the sequence length so that you know how many mutations to request.

ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by iw9oel_ad6.0k

EMBOSS (the package that msbar is a part of) takes quite a bit of time to compile. Does anyone know of a place to get binaries?

ADD REPLYlink written 2.0 years ago by wkretzsch0

What about Bioconda?

conda install emboss

https://bioconda.github.io/recipes/emboss/README.html

ADD REPLYlink written 15 days ago by Manuel370

It seems that msbar couldn't work with multiple sequences in the same fasta file, do you have any idea how we could mutate multiple seqs with msbar

ADD REPLYlink written 10 months ago by Daiwei Tang10
4
gravatar for Larry_Parnell
8.3 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

If this effort is to support research on human genetics, for example, I would add statements that the random variation mimic known frequencies of transitions (T <-> C, or A <-> G) and transversions (A -> C or T, or G -> C or T, C -> G or A, T -> G or A) in the human genome. The most common single nucleotide change in human is C <-> T. Thus, if T is picked as the base to change, then it should change more often to a C than to another base. This would be important if the project wants to align with a biological situation.

The frequency with which a given nucleotide, selected at random by one of the scripts above, changes to any of the other three nucleotides can be taken by scanning the literature for such a table or by parsing dbSNP. I just don't have number on hand.

ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by Larry_Parnell16k

+1 if you put more time/realism/thought in, you'll likely get more useful stuff out.

ADD REPLYlink written 8.3 years ago by brentp23k
2
gravatar for Martin A Hansen
8.3 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i test.fna | mutate_seq -p 1 | write_fasta -x

There is also indel_seq for insertions and deletions.

http://code.google.com/p/biopieces/wiki/mutate_seq

http://code.google.com/p/biopieces/wiki/indel_seq

Cheers,

Martin

ADD COMMENTlink written 8.3 years ago by Martin A Hansen3.0k

This looks nice, but I can't figure out how to install biopieces. It looks like homebrew has dropped support for it. :/

ADD REPLYlink written 2.0 years ago by wkretzsch0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 827 users visited in the last hour