Question

Split multi fasta based on line header

0

Entering edit mode

4.8 years ago

ian.voorhees • 0

Hello,

I have a multifasta containing several gene segments from multiple organisms. The fasta is formatted as so:

>OrganismA_GeneSegment1

ATG..

>OrganismA_GeneSegment2

ATG...

>OrganismB_GeneSegment1

ATG...

>OrganismB_GeneSegment2

ATG....

I would like to split this into several sub-multi fasta files, each containing all the segments from each organism and named after that organism. Also, preferably sorted in length descending...

Any suggestions appreciated!

Thanks.

sequence split fasta edit fasta order fasta • 3.4k views

ADD COMMENT • link updated 4.7 years ago by Matteo Schiavinato ★ 3.6k • written 4.8 years ago by ian.voorhees • 0

2

Entering edit mode

This awk script could work but does not sort by length.

awk '{if(substr($0,1,1) == ">"){split(substr($0,2,length($0)),a,/_/);filename=a[1]};print $0 > filename }' your_input_file.fa

ADD REPLY • link 4.8 years ago by microfuge ★ 1.9k

0

Entering edit mode

@microfuge I find your suggestion very useful, but the files are saved without format specification. I´m new to awk and don´t get where to place the .fa in the script?

ADD REPLY • link 4.7 years ago by Konrad • 0

0

Entering edit mode

One idea would be to recover the headers (grep "^>" yourfile > headers). Split the headers into organism specific sub-files. Then use faSomeRecords utility from Jim Kent (UCSC) to get the data separated. Add execute permissions after downloading the utility (chmod a+x faSomeRecords).

ADD REPLY • link 4.8 years ago by GenoMax 141k

0

Entering edit mode

USEARCH has couple of options to sort fasta sequences by length.

ADD REPLY • link 4.8 years ago by Mensur Dlakic ★ 27k

score 1 · Answer 1 · 2019-08-23

If you can code in Python, then your easiest shot is to use Biopython. Something like this:

#!/usr/bin/env python

from Bio import SeqIO
from operator import itemgetter

Separated_reads = {}

input_file = "<put file="" here="">"
INPUT = open(input_file, "r")

for line in SeqIO.parse(INPUT, "fasta"):

    read_name = strrecord.id)
    organism = read_name.split("_")[0]
    read_seq = str(record.seq)

    try:
        Separated_reads[organism].append((len(read_seq), ">" + read_name + "\n" + read_seq + "\n"))
    except:
        Separated_reads[organism] = [(len(read_seq), ">" + read_name + "\n" + read_seq + "\n")]

INPUT.close()

Sorted_reads = { organism : sorted(Separated_reads[organism], key=itemgetter(0), reverse=True) for organism in Separated_reads.keys() }

for organism in Sorted_reads.keys():
    OUTPUT = open("{0}.fa".format(organism), "w")

    for x in Sorted_reads[organism]:
        gene_segment = x[1]
        OUTPUT.write(gene_segment)

    OUTPUT.close()