How can I randomly choose and delete a set number of nucleotides from a fasta file?
1
0
Entering edit mode
22 months ago
Jonathan ▴ 20

I am trying to randomly remove 10%, 15% and 20% of the nucleotides in a fasta file.

So let's say I have a fasta file that looks like this...

>GCA_900186885_1_000000000001  
ATGCAAACATTTGTAAAAAACTTAATCGAT

I want to randomly choose and delete 10% of the nucleotides, which in this case would be 3, resulting in a fasta file with the same header, but with 3 fewer nucleotides:

>GCA_900186885_1_000000000001   
ATGAAACATTGTAAAAACTTAATCGAT

The above is a simple example, which could easily be done manually, but I have a large fasta file with 2132142 nucleotides and thus want to generate three new fasta files using the original, but with 1918928, 1812321, and 1705714 nucleotides, representing a 10%, 15% and 20% reduction.

Any insight would be much appreciated. Thank you!

EDIT: Solution found using bash

awk -v s=$RANDOM 'BEGIN {
    if(s)                                      # if seed provided as var s
        srand(s)                               # use s as seed 
    else                                       # or not
        srand()
}
!/^>/ {                                        # process non > starting records
    i=int(0.1*(length($0)))                    # 10 % of length to remove
    for(j=1;j<=i;j++) {
        p=int(rand()*length($0))+1             # p the point from where to remove
        $0=substr($0,1,(p-1)) substr($0,(p+1)) # ie. skip p:s char 
    }
}
1' file_in.fa  >>  file_out.fa
random fasta trim • 1.1k views
ADD COMMENT
1
Entering edit mode

maybe you can use mutate.sh from the bbmap package

ADD REPLY
1
Entering edit mode

I tried to use mutate.sh and it seems to have not only delete a portion of nucleotides, but also randomly shuffle nucleotides around. Regardless, I found an elegant solution using bash. See my edit if you're interested.

ADD REPLY
0
Entering edit mode

from bbmap

reformat.sh in=input.fa out=output.fa samplerate=0.1

from seqkit

seqkit sample -p 0.1 -o out.fa input.fa
ADD REPLY
0
Entering edit mode

Correct me if I am wrong, but I believe bbmap's reformat.sh script only works for subsetting reads, not bases. So the command that you suggested would not work because my fasta file is a single read with many bases.

I believe the same applies to seqkit sample, where it only outputs the specified proportion of reads, not bases.

ADD REPLY
0
Entering edit mode
22 months ago
Mensur Dlakic ★ 27k

It seems pretty straightforward, at least conceptually.

  • read the fasta file and determine sequence length N (BioPython)
  • pick N/10 random integers in the range 1-N (NumPy)
  • delete the letters at chosen index positions (string slicing)
ADD COMMENT

Login before adding your answer.

Traffic: 2618 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6