How to split a big .faa file into smaller .faa files
10 months ago
Shaurya • 0

I have a 10 GB .faa proteome file that I want to run MAFFT on, but it is too big, so I need to divide it. How do I split it into smaller files on Windows without losing any data? The solutions I have come across are for a UNIX/LINUX based environment.

faa proteomes
From the statement "The solutions I have come across are for a UNIX/LINUX based environment", I am assuming that you are on Windows. Even on Windows (10 or later), you can use GNU/Linux tools through WSL2. I would suggest seqkit, which has a Windows version. Please go through the manual; it describes multiple ways to split a FASTA file with seqkit.

10 months ago
Mark ★ 1.1k

Seqkit is the answer.

To split into 100 parts:

seqkit split myfile.faa --by-part 100
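
If installing seqkit is not an option, splitting into a fixed number of parts can also be sketched in plain Python, which runs natively on Windows. This is a minimal sketch and not seqkit's implementation: the function name `split_fasta_into_parts` and the `part_NNN` output naming are made up for illustration, and records are dealt out round-robin, which may not match the way seqkit assigns records to parts. The file is read in binary mode, one line at a time, so memory use stays small and bytes are copied unchanged.

```python
from contextlib import ExitStack
from pathlib import Path


def split_fasta_into_parts(src, out_dir, n_parts):
    """Deal FASTA records round-robin into n_parts files (hypothetical helper)."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Illustrative naming scheme: myfile.part_001.faa, myfile.part_002.faa, ...
    paths = [
        out_dir / f"{src.stem}.part_{i + 1:03d}{src.suffix}"
        for i in range(n_parts)
    ]

    with ExitStack() as stack:
        fh = stack.enter_context(src.open("rb"))
        outs = [stack.enter_context(p.open("wb")) for p in paths]
        record_index = -1  # no header seen yet
        for line in fh:
            if line.startswith(b">"):
                record_index += 1  # a header line starts a new record
            if record_index >= 0:  # ignore anything before the first header
                outs[record_index % n_parts].write(line)
    return paths
```

Calling `split_fasta_into_parts("myfile.faa", "parts", 100)` would write 100 files into a `parts` folder. All output files are held open at once, so a very large number of parts may run into the operating system's limit on open files.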


To split by number of desired sequences per file (eg 5000 per file):

seqkit split myfile.faa --by-size 5000
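
The same idea (a fixed number of sequences per output file) can be sketched in plain Python for cases where only a stock Python install is available on Windows. This is an illustrative sketch under stated assumptions, not what seqkit does internally: the function name `split_fasta_by_size` and the `part_NNNN` file naming are hypothetical. The input is streamed in binary mode, so a 10 GB file is never loaded into memory, and concatenating the chunks in order reproduces the original bytes (provided the file starts with a header line).

```python
from pathlib import Path


def split_fasta_by_size(src, out_dir, records_per_file):
    """Write at most records_per_file FASTA records per chunk (hypothetical helper)."""
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_paths = []
    out = None
    in_current = 0  # records written to the currently open chunk
    try:
        with src.open("rb") as fh:
            for line in fh:
                if line.startswith(b">"):
                    # Start a new chunk at the first header or when the chunk is full.
                    if out is None or in_current == records_per_file:
                        if out is not None:
                            out.close()
                        name = f"{src.stem}.part_{len(chunk_paths) + 1:04d}{src.suffix}"
                        path = out_dir / name
                        out = path.open("wb")
                        chunk_paths.append(path)
                        in_current = 0
                    in_current += 1
                if out is not None:  # ignore anything before the first header
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Calling `split_fasta_by_size("myfile.faa", "chunks", 5000)` would write `myfile.part_0001.faa`, `myfile.part_0002.faa`, and so on into a `chunks` folder, with only the last file possibly holding fewer than 5000 sequences.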


"The solutions I have come across are for a UNIX/LINUX based environment"

Yes, use Linux. If you want to do any bioinformatics, you need to use Linux.

10 months ago
Divon ▴ 180

You can use my Genozip tool:

genozip myfile.faa
genocat --downsample 3,1 myfile.faa.genozip   <--- get part 1 out of 3


Works on Windows (as well as Linux and Mac)

10 months ago
Juke34 ★ 7.1k

14 methods reviewed here: https://github.com/Juke34/knowledge/blob/main/split_fasta.md