Question

How To Generate Mutations In Sequences Of A Fasta File?

3

Entering edit mode

13.8 years ago

Geparada ★ 1.5k

Hi!

For example, if I have a file with 3 sequences:

>1
gtccgatgat
>2
gatggatcggacgttagcaa
>3
acatttgaga

I want to generate sequences with 10% of the nucleotides randomly mutated.

>1
gtcTgatgat
>2
gaAggatcggacgtGagcaa
>3
acatCtgaga

Does anybody know a program or code to do that?

And if you know a way to do this in python would be great!

Thanks!

fasta python mutation sequence • 17k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 13.8 years ago by Geparada ★ 1.5k

score 9 · Answer 1 · 2011-09-22

9

Entering edit mode

13.8 years ago

brentp 24k

call the program below like:

python mutate.py some.fasta 0.01 > mutated.fasta

And it will change 1% (0.01) of the bases.

from random import random, choice
import sys
from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file. yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq

def main(fasta_name, mutation_freq):

    for header, seq in fasta_iter(fasta_name):
        seq = list(seq)
        for i, s in enumerate(seq):
            val = random()
            if val < mutation_freq:
                # choose a random nucleotide that's different.
                seq[i] = choice([x for x in "ACTG" if x != s.upper()])
        print ">%s\n%s" % (header, "".join(seq))

if __name__ == "__main__":
    main(sys.argv[1], float(sys.argv[2]))

ADD COMMENT • link 13.8 years ago by brentp 24k

2

Entering edit mode

on 1) i dont disagree but favor simplicity over performance unless it proves a problem. For 2) it probably depends on the mutation rate which is more efficient. And for 3) I guess that depends on what "mutation rate" means.

ADD REPLY • link 13.8 years ago by brentp 24k

1

Entering edit mode

actually, you were right. I changed it to s.upper() :) ... thanks.

ADD REPLY • link 13.8 years ago by brentp 24k

0

Entering edit mode

nice! just one few suggestions, 1)choice is really slower than random, as we only want to get one element the corresponding line could be changed using 'ATGC'[int(random()4)] 2) seq can be let as string, seq = seq[i:] + 'ATGC'[int(random()4)] + seq[:i+1]... it is perhaps more efficient 3)why should the mutation result in a different nucleotide?

ADD REPLY • link 13.8 years ago by fransua ▴ 390

0

Entering edit mode

Nice code brentp! thanks !!!

ADD REPLY • link 13.8 years ago by Geparada ★ 1.5k

0

Entering edit mode

I've learned a lot from this code... you really help me brentp :)

ADD REPLY • link 13.8 years ago by Geparada ★ 1.5k

0

Entering edit mode

Am I mistaken to believe this will only give a mutation rate equivalent to 3/4 of the one you specify? Namely, 1 out of 4 of your mutations will be mutations to the same nucleotide, thus not being a mutation? Maybe a while loop would correct for that easily. (while new == old: pick new) or something.

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric, yes, you are mistaken. See @fransua's comment. When the threshold is exceeded. It changes the base. That's the part about: choice([x for x in "ACTG" if not x == s.lower()])

ADD REPLY • link 13.8 years ago by brentp 24k

0

Entering edit mode

I was not sure I understood your list comprehension at first either. Now I see. Thanks

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

0

Entering edit mode

Ups... I didn't saw the error when I analyzed the script. Thanks for the correction!!

ADD REPLY • link 13.8 years ago by Geparada ★ 1.5k

0

Entering edit mode

This is great. Sorry, Im new to phyton ... how would you go about to modify the script to directly introduce, say, 2 mutations? (I mean, not depending on a mutation frequency, but telling the script to introduce 2 mutations in each given sequence regardless of size). Thanks!

ADD REPLY • link 9.0 years ago by boludopublico • 0

score 7 · Answer 2 · 2011-09-22

7

Entering edit mode

13.8 years ago

biobot 0.0.77.a.1099 6.2k

The EMBOSS program msbar will do this. E.g. to introduce 10 point mutations

msbar -count 10 -point 4 seq.fasta

The argument '4' allows only base changes, but msbar can produce other types of mutation. There is also an argument to ensure that the mutation does not create another specific sequence (e.g. the one you started with). You can use infoseq to obtain the sequence length so that you know how many mutations to request.

ADD COMMENT • link 13.8 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

EMBOSS (the package that msbar is a part of) takes quite a bit of time to compile. Does anyone know of a place to get binaries?

ADD REPLY • link 7.5 years ago by wkretzsch • 0

0

Entering edit mode

What about Bioconda?

conda install emboss

https://bioconda.github.io/recipes/emboss/README.html

ADD REPLY • link 5.5 years ago by Manuel ▴ 410

0

Entering edit mode

It seems that msbar couldn't work with multiple sequences in the same fasta file, do you have any idea how we could mutate multiple seqs with msbar

ADD REPLY • link 6.3 years ago by Daiwei Tang ▴ 10

score 4 · Answer 3 · 2011-09-22

If this effort is to support research on human genetics, for example, I would add statements that the random variation mimic known frequencies of transitions (T <-> C, or A <-> G) and transversions (A -> C or T, or G -> C or T, C -> G or A, T -> G or A) in the human genome. The most common single nucleotide change in human is C <-> T. Thus, if T is picked as the base to change, then it should change more often to a C than to another base. This would be important if the project wants to align with a biological situation.

The frequency with which a given nucleotide, selected at random by one of the scripts above, changes to any of the other three nucleotides can be taken by scanning the literature for such a table or by parsing dbSNP. I just don't have number on hand.

score 2 · Answer 4 · 2011-09-22

2

Entering edit mode

13.8 years ago

Martin A Hansen 3.0k

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i test.fna | mutate_seq -p 1 | write_fasta -x

There is also indel_seq for insertions and deletions.

http://code.google.com/p/biopieces/wiki/mutate_seq

http://code.google.com/p/biopieces/wiki/indel_seq

Cheers,

Martin

ADD COMMENT • link 13.8 years ago by Martin A Hansen 3.0k

0

Entering edit mode

This looks nice, but I can't figure out how to install biopieces. It looks like homebrew has dropped support for it. :/

ADD REPLY • link 7.5 years ago by wkretzsch • 0