Question

How to split a multi-FASTA file into single FASTA files named by header

2

4.3 years ago

Kumar ▴ 120

I have a multi-FASTA file named genome.fasta, as follows:

genome.fasta
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG
>1582.LC madi kg 5/58/8
GATGAT

I need to split genome.fasta into single-sequence FASTA files, each named after the first word of the corresponding FASTA header. The expected output is as follows:

LI5896452.1.fasta
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG

1582.LC.fasta
>1582.LC madi kg 5/58/8
GATGAT

I found many scripts online, but they all split the file and name the pieces by their own scheme; I could not find one that uses the header as the file name. Please help me do this.

genome perl python3 bash python • 11k views

ADD COMMENT • link updated 13 months ago by shenwei356 8.7k • written 4.3 years ago by Kumar ▴ 120

1

Linearize your fasta file using code here

Then use the solutions in: Split Fasta file and rename output files with contig names
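For reference, the linearization step can be sketched in Python as below. This is a minimal illustration, not the linked code; the function names linearize_fasta and write_linearized are hypothetical, and it assumes plain-text FASTA where every record starts with a ">" line.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            # Sequence line: collect it; blank lines are skipped.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def write_linearized(in_path, out_path):
    """Write a copy of in_path in which every record is exactly two lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in linearize_fasta(src):
            dst.write(f"{header}\n{seq}\n")
```

Calling write_linearized("genome.fasta", "genome.linear.fasta") would produce a two-lines-per-record file that single-line approaches can then process safely.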

ADD REPLY • link 4.3 years ago by GenoMax 152k

1

With awk and a flattened FASTA (each sequence on a single line):

$ cat test.fa                                                                                                                                                                           
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG
>1582.LC madi kg 5/58/8
GATGAT

$ awk -v OFS="\n" '/^>/ {getline seq; print $0,seq > (substr($1,2) ".fa")}' test.fa

$ tree .                                                                                                                                                                                
.
├── 1582.LC.fa
├── LI5896452.1.fa
└── test.fa

0 directories, 3 files

$ cat 1582.LC.fa                                                                                                                                                                        
>1582.LC madi kg 5/58/8
GATGAT

$ cat LI5896452.1.fa                                                                                                                                                                    
>LI5896452.1 Liverpool 2 kg/dp/Kng
ATGCTAG

ADD REPLY • link 4.3 years ago by cpad0112 21k

0

This only works when each sequence is on a single line. For multi-line records, only the first line of each sequence is written.

ADD REPLY • link 14 months ago by rsieber ▴ 10

score 3 · Answer 1 · 2021-03-23

4.3 years ago

GenoMax 152k

Use the faSplit utility from Jim Kent (LINK for the Linux version). Add execute permissions after you download it (chmod a+x faSplit).

$ faSplit byname scaffolds.fa outRoot/ 
This breaks up scaffolds.fa using sequence names as file names.
       Use the terminating / on the outRoot to get it to work correctly.

ADD COMMENT • link 4.3 years ago by GenoMax 152k

score 3 · Answer 2 · 2021-03-23

4.3 years ago

Jukka Matilainen ▴ 80

Another solution using AWK:

awk '/^>/ {if (out) close(out); out = substr($1, 2) ".fasta"} out {print > out}' genome.fasta
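The same streaming idea can be sketched in Python, since the question is also tagged python3. This is a minimal sketch under stated assumptions, not a tested tool: the function name split_fasta_by_id is hypothetical, "/" in an ID is replaced with "_" so the name stays inside the output directory, and a repeated ID overwrites the earlier file.

```python
import os


def split_fasta_by_id(in_path, out_dir=".", ext=".fasta"):
    """Write each record of in_path to out_dir/<first word of header><ext>.

    Returns the list of file paths written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    try:
        with open(in_path) as src:
            for line in src:
                if line.startswith(">"):
                    # Close the previous record's file so only one is open at a time.
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    if not fields:
                        raise ValueError("FASTA header with no ID")
                    # Assumption: replace "/" so an ID cannot point outside out_dir.
                    name = fields[0].replace("/", "_") + ext
                    path = os.path.join(out_dir, name)
                    out = open(path, "w")
                    written.append(path)
                # Header and sequence lines alike go to the current file;
                # lines before the first header are ignored.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

For the example in the question, split_fasta_by_id("genome.fasta") would write LI5896452.1.fasta and 1582.LC.fasta in the current directory, and multi-line sequences are kept intact.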

ADD COMMENT • link 4.3 years ago by Jukka Matilainen ▴ 80

0

This also works well for multi-line FASTA files.

ADD REPLY • link 14 months ago by rsieber ▴ 10

score 2 · Answer 3 · 2021-03-23

4.3 years ago

shenwei356 8.7k

Try seqkit split2

out=result

seqkit split2 --by-size 1 genomes.fasta -O $out

find "$out" -name "*.fasta" \
    | while read -r f; do \
        mv "$f" "$out/$(seqkit seq --name --only-id "$f").fasta"; \
    done

Result

$ tree
.
├── genomes.fasta
└── result
    ├── 1582.LC.fasta
    └── LI5896452.1.fasta

ADD COMMENT • link 4.3 years ago by shenwei356 8.7k

0

Thank you, shenwei356. However, your script fails on a large dataset with the following error:

[INFO] split seqs from genomic.fasta
[INFO] split into 1 seqs per file
[INFO] write 1 sequences to file: result/genomic.part_001.fasta
[INFO] write 1 sequences to file: result/genomic.part_002.fasta
-
-
-
-
-
:line 5:    /bin/ls: Argument list too long

ADD REPLY • link 4.3 years ago by Kumar ▴ 120

1

Use find instead of ls.

find "$out" -name "*.fasta" \
    | while read -r f; do \
        mv "$f" "$out/$(seqkit seq --name --only-id "$f").fasta"; \
    done

I'm not good with find -exec. You can also use find or fd together with parallel.
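The rename step can also be sketched in Python, which avoids shell argument-list limits because the directory is listed with os.scandir rather than expanded on a command line. This is only an illustration: rename_by_first_id is a hypothetical name, and it reads the ID straight from the first header line of each file instead of calling seqkit.

```python
import os


def rename_by_first_id(directory, ext=".fasta"):
    """Rename every *ext file in directory to <first word of its header><ext>.

    Returns a dict mapping old paths to new paths.
    """
    # Snapshot the listing first so files renamed below are not visited twice.
    entries = [e.path for e in os.scandir(directory)
               if e.is_file() and e.name.endswith(ext)]
    renamed = {}
    for path in entries:
        with open(path) as fh:
            first = fh.readline()
        fields = first[1:].split() if first.startswith(">") else []
        if not fields:
            continue  # no FASTA header on the first line; leave the file alone
        # Assumption: replace "/" so the new name stays inside the directory.
        target = os.path.join(directory, fields[0].replace("/", "_") + ext)
        # Skip files already named correctly and never overwrite an existing file.
        if not os.path.exists(target):
            os.rename(path, target)
            renamed[path] = target
    return renamed
```

Running rename_by_first_id("result") after the split would turn names like genomic.part_001.fasta into names based on the sequence ID.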

ADD REPLY • link 4.3 years ago by shenwei356 8.7k

0

This means that you created too many files when splitting the original FASTA file.

How many entries do you have in your original file? With anything above 50-60k entries, you will need to subdivide the files into subfolders for the directory to remain workable.
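One way to subdivide the output is to derive a numbered subfolder from the record's position in the input. The sketch below is an assumption-laden illustration: bucket_path and open_in_bucket are hypothetical helpers, and the default of 10000 files per folder is an arbitrary choice, not a recommendation from this thread.

```python
import os


def bucket_path(out_dir, index, name, per_dir=10000):
    """Return out_dir/<bucket>/<name>, where each bucket holds per_dir files.

    index is the zero-based position of the record in the input file.
    """
    bucket = f"{index // per_dir:04d}"
    return os.path.join(out_dir, bucket, name)


def open_in_bucket(out_dir, index, name, per_dir=10000):
    """Create the bucket folder if needed and open the file for writing."""
    path = bucket_path(out_dir, index, name, per_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return open(path, "w")
```

With the default setting, records 0 to 9999 would land in a folder named 0000, the next 10000 in 0001, and so on, so no single directory grows past per_dir entries.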

ADD REPLY • link 4.3 years ago by lieven.sterck 15k

0

The following command will give you FASTA files named by ID:

seqkit split -i [input_fasta] --out-dir [output_directory]

The files are written to the output directory, though you will have to rename them because they come with an automatic prefix.

ADD REPLY • link 13 months ago by azmigueldario • 0

1

There's a flag to remove the prefix.

--by-id-prefix ""

ADD REPLY • link 13 months ago by shenwei356 8.7k

score 2 · Answer 4 · 2021-03-23

4.3 years ago

Jorge Amigo 14k

Quick perl one-liner:

perl -ne 'if (/^>(\S+)/) { close OUT; open OUT, ">$1.fasta" } print OUT' genome.fasta

ADD COMMENT • link 4.3 years ago by Jorge Amigo 14k

score 1 · Answer 5 · 2021-03-23

4.3 years ago

Juke34 9.3k

On this subject, here is a review of ways to split a FASTA file: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/split_fasta.md The Bash and faSplit approaches name the output files by sequence name. For the other tools this is not mentioned, but that does not mean they cannot do it.

ADD COMMENT • link 4.3 years ago by Juke34 9.3k