Question

How can I randomly choose and delete a set number of nucleotides from a fasta file?

0

Entering edit mode

22 months ago

Jonathan ▴ 20

I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.

So let's say I have a fasta file that looks like this...

>GCA_900186885_1_000000000001  
ATGCAAACATTTGTAAAAAACTTAATCGAT

I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:

>GCA_900186885_1_000000000001   
ATGAAACATTGTAAAAACTTAATCGAT

The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.

Any insight would be much appreciated. Thank you!

EDIT: Solution found using bash

awk -v s=$RANDOM 'BEGIN {
    if(s)                                      # if seed provided as var s
        srand(s)                               # use s as seed 
    else                                       # or not
        srand()
}
!/^>/ {                                        # process non > starting records
    i=int(0.1*(length($0)))                    # 10 % of length to remove
    for(j=1;j<=i;j++) {
        p=int(rand()*length($0))+1             # p the point from where to remove
        $0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char 
    }
}
1' file_in.fa  >>  file_out.fa

random fasta trim • 1.1k views

ADD COMMENT • link 22 months ago by Jonathan ▴ 20

1

Entering edit mode

maybe you can use mutate.sh from the bbmap package

ADD REPLY • link 22 months ago by andres.firrincieli 3.6k

1

Entering edit mode

I tried to use mutate.sh and it seems to have not only delete a portion of nucleotides, but also randomly shuffle nucleotides around. Regardless, I found an elegant solution using bash. See my edit if you're interested.

ADD REPLY • link 22 months ago by Jonathan ▴ 20

0

Entering edit mode

from bbmap

reformat.sh in=input.fa out=output.fa samplerate=0.1

from seqkit

seqkit sample -p 0.1 -o out.fa input.fa

ADD REPLY • link 22 months ago by cpad0112 21k

0

Entering edit mode

Correct me if I am wrong, but I believe bbmap's reformat.sh script only works for subsetting reads, not bases. So the command that you suggested would not work because my fasta file is a single read with many bases.

I believe the same applies to seqkit sample, where it only outputs the specified proportion of reads, not bases.

ADD REPLY • link 22 months ago by Jonathan ▴ 20

score 0 · Answer 1 · 2022-06-12

0

Entering edit mode

22 months ago

Mensur Dlakic ★ 27k

It seems pretty straightforward, at least conceptually.

read the fasta file and determine sequence length N (BioPython)
pick N/10 random integers in the range 1-N (NumPy)
delete the letters at chosen index positions (string slicing)

ADD COMMENT • link 22 months ago by Mensur Dlakic ★ 27k