How To Generate Mutations In Sequences Of A Fasta File?
4
3
Entering edit mode
11.4 years ago

HI!

For example, if I have a file with 3 sequences:

>1
gtccgatgat
>2
gatggatcggacgttagcaa
>3
acatttgaga


I want to generate sequences with 10% of the nucleotides randomly mutated.

>1
gtcTgatgat
>2
gaAggatcggacgtGagcaa
>3
acatCtgaga


Does anybody know a program or code to do that?

And if you know a way to do this in python would be great!

Thanks!!!

fasta mutation python tool sequence • 13k views
9
Entering edit mode
11.4 years ago
brentp 24k

call the program below like:

python mutate.py some.fasta 0.01 > mutated.fasta


And it will change 1% (0.01) of the bases.

from random import random, choice
import sys
from itertools import groupby

def fasta_iter(fasta_name):
"""
given a fasta file. yield tuples of header, sequence
"""
fh = open(fasta_name)
# ditch the boolean (x[0]) and just keep the header or sequence since
# we know they alternate.
faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
# drop the ">"
# join all sequence lines to one.
seq = "".join(s.strip() for s in faiter.next())

def main(fasta_name, mutation_freq):

seq = list(seq)
for i, s in enumerate(seq):
val = random()
if val < mutation_freq:
# choose a random nucleotide that's different.
seq[i] = choice([x for x in "ACTG" if x != s.upper()])

if __name__ == "__main__":
main(sys.argv[1], float(sys.argv[2]))

2
Entering edit mode

on 1) i dont disagree but favor simplicity over performance unless it proves a problem. For 2) it probably depends on the mutation rate which is more efficient. And for 3) I guess that depends on what "mutation rate" means.

1
Entering edit mode

actually, you were right. I changed it to s.upper() :) ... thanks.

0
Entering edit mode

nice! just one few suggestions, 1)choice is really slower than random, as we only want to get one element the corresponding line could be changed using 'ATGC'[int(random()4)] 2) seq can be let as string, seq = seq[i:] + 'ATGC'[int(random()4)] + seq[:i+1]... it is perhaps more efficient 3)why should the mutation result in a different nucleotide?

0
Entering edit mode

Nice code brentp! thanks !!!

0
Entering edit mode

I've learned a lot from this code... you really help me brentp :)

0
Entering edit mode

Am I mistaken to believe this will only give a mutation rate equivalent to 3/4 of the one you specify? Namely, 1 out of 4 of your mutations will be mutations to the same nucleotide, thus not being a mutation? Maybe a while loop would correct for that easily. (while new == old: pick new) or something.

0
Entering edit mode

@Eric, yes, you are mistaken. See @fransua's comment. When the threshold is exceeded. It changes the base. That's the part about: choice([x for x in "ACTG" if not x == s.lower()])

0
Entering edit mode

I was not sure I understood your list comprehension at first either. Now I see. Thanks

0
Entering edit mode

Ups... I didn't saw the error when I analyzed the script. Thanks for the correction!!

0
Entering edit mode

This is great. Sorry, Im new to phyton ... how would you go about to modify the script to directly introduce, say, 2 mutations? (I mean, not depending on a mutation frequency, but telling the script to introduce 2 mutations in each given sequence regardless of size). Thanks!

7
Entering edit mode
11.4 years ago

The EMBOSS program msbar will do this. E.g. to introduce 10 point mutations

msbar -count 10 -point 4 seq.fasta


The argument '4' allows only base changes, but msbar can produce other types of mutation. There is also an argument to ensure that the mutation does not create another specific sequence (e.g. the one you started with). You can use infoseq to obtain the sequence length so that you know how many mutations to request.

0
Entering edit mode

EMBOSS (the package that msbar is a part of) takes quite a bit of time to compile. Does anyone know of a place to get binaries?

0
Entering edit mode

conda install emboss

0
Entering edit mode

It seems that msbar couldn't work with multiple sequences in the same fasta file, do you have any idea how we could mutate multiple seqs with msbar

4
Entering edit mode
11.4 years ago

If this effort is to support research on human genetics, for example, I would add statements that the random variation mimic known frequencies of transitions (T <-> C, or A <-> G) and transversions (A -> C or T, or G -> C or T, C -> G or A, T -> G or A) in the human genome. The most common single nucleotide change in human is C <-> T. Thus, if T is picked as the base to change, then it should change more often to a C than to another base. This would be important if the project wants to align with a biological situation.

The frequency with which a given nucleotide, selected at random by one of the scripts above, changes to any of the other three nucleotides can be taken by scanning the literature for such a table or by parsing dbSNP. I just don't have number on hand.

0
Entering edit mode

+1 if you put more time/realism/thought in, you'll likely get more useful stuff out.

2
Entering edit mode
11.4 years ago

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i test.fna | mutate_seq -p 1 | write_fasta -x


There is also indel_seq for insertions and deletions.

Cheers,

Martin

0
Entering edit mode

This looks nice, but I can't figure out how to install biopieces. It looks like homebrew has dropped support for it. :/