Question

How to split all fasta sequences in a multifasta file into ten equal parts

3.5 years ago

debarunacharya • 0

Hi.

I have a multifasta file containing human proteins of interest (N > 5000). The proteins generally differ in length.

I want to split every sequence in that file into TEN equal, non-overlapping parts. For example, if a protein is 200 amino acids long, each part should be 20 amino acids long. This should apply to every sequence in the file, whatever its length.

Also, the output should be TEN multifasta files, one for each of the TEN SEGMENTS. For example, for a 100 amino acid protein, Multifasta1 should contain the first segment (amino acids 1-10), Multifasta2 the second (amino acids 11-20), and so on.

Each output multifasta file should contain the corresponding segment from every protein. I am not an expert coder, so any help will be greatly appreciated.

Thanks in advance.

fasta multifasta split equal parts splitfasta • 1.8k views

updated 3.5 years ago by Alex Nesmelov ▴ 200 • written 3.5 years ago by debarunacharya • 0

Thank you for the data description. Please post example input, expected output, and what you have tried so far. You could also try the seqkit split function.

3.5 years ago by cpad0112 21k

You will first need to split the original file into one file per protein, THEN split each of those files into 10 pieces. The faSplit utility from Jim Kent, which is linked in @Juke34's tutorial, would be one option for both steps.

3.5 years ago by GenoMax 141k

score 2 · Answer 1 · 2020-11-02

Here is a solution in R using the tidyverse and seqinr packages. Don't forget to replace the value of fasta_path with your file name. When a sequence length is not divisible by 10, the pieces cannot all be equal: cut() bins the residue positions into ten equal-width intervals, so the pieces come out with slightly different lengths.

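library(tidyverse)  # provides map() (purrr) and str_c() (stringr)
library(seqinr)     # provides read.fasta() and write.fasta()

n_chunks <- 10

# Replace with the path to your multifasta file
fasta_path <- "your_file.fa"

# Read protein sequences; each entry is a vector of single-letter residues
fasta_data <- read.fasta(fasta_path, seqtype = "AA")

# Split one sequence into n_chunks consecutive, non-overlapping pieces
split_sequence <- function(sequence) {
  cut_vector <- cut(seq_along(sequence), breaks = n_chunks)
  split(sequence, cut_vector)
}

fasta_splitted <- map(fasta_data, split_sequence)

# Write one multifasta file per piece index
for (current_chunk in seq_len(n_chunks)) {

  # Take the current piece from every protein
  current_multifasta <- map(fasta_splitted, ~ .[[current_chunk]])

  current_multifasta_name <- str_c("Multifasta_part_", current_chunk, ".fa")

  write.fasta(current_multifasta,
              names = names(current_multifasta),
              file.out = current_multifasta_name)
}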
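
To see how the binning behaves when a length is not a multiple of ten, here is a minimal base-R sketch. It is only an illustration: the 25-residue toy_sequence is made-up data, not taken from the question, and the names pieces, piece_lengths and rebuilt are introduced here for the example. It applies the same cut()/split() step to one sequence and prints how many residues land in each piece.

```r
# Toy 25-residue "protein" (made-up data); 25 is not divisible by 10.
toy_sequence <- strsplit("MKTAYIAKQRQISFVKSHFSRQLEE", "")[[1]]
n_chunks <- 10

# Same binning as in the answer: equal-width intervals over residue positions.
pieces <- split(toy_sequence, cut(seq_along(toy_sequence), breaks = n_chunks))

# Number of residues that ends up in each of the ten pieces.
piece_lengths <- unname(lengths(pieces))
print(piece_lengths)

# Joining the pieces in order restores the original sequence,
# so no residue is lost or duplicated.
rebuilt <- paste(unlist(pieces), collapse = "")
print(identical(rebuilt, paste(toy_sequence, collapse = "")))
```

If you need a different rule (for example, nine pieces of fixed length plus a shorter remainder), you would have to compute the piece index yourself instead of relying on cut().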

score 0 · Answer 2 · 2020-11-02

3.5 years ago

Juke34 8.5k

Many solutions are available; have a look here: Tutorial: FASTA file split
