I have a fasta file with more than 2.7 million headers, and I want to break it into chunks.
>gene1
ACTG...
>gene2
ATTT...
...
>gene2,700,000
GCAC...
The way I do it is:
grep -n "^>" my.fasta > headersofmy.fasta
This gives me the positional information of the headers.
1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000
I then use the positional information to grab a set number of genes:
awk 'NR>=position1 && NR<=position2' my.fasta > set1.fasta
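Concretely, on a toy fasta the two-step workflow looks like this (the file contents, names, and line ranges are made-up stand-ins for the real ones):

```shell
# Work in a scratch directory with a tiny demo fasta (3 headers)
cd "$(mktemp -d)"
printf '>gene1\nACTG\n>gene2\nATTT\n>gene3\nGCAC\n' > my.fasta

# Step 1: record the line number of every header
grep -n "^>" my.fasta > headersofmy.fasta
# headersofmy.fasta now contains: 1:>gene1, 3:>gene2, 5:>gene3

# Step 2: pull out genes 1-2, i.e. lines 1 through 4
# (everything up to the line just before gene3's header on line 5)
awk 'NR>=1 && NR<=4' my.fasta > set1.fasta
```

The end position for each chunk has to be computed by hand as "the line before the next chunk's first header", which is the manual step that gets tedious.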
I do this a couple of times to break the initial huge fasta file into smaller files, each with a set number of headers. I first broke it into chunks of 500,000 headers, then into chunks of 100,000.
I feel there is a smarter way to do this if I want to break it into even smaller chunks based on the number of headers. The other ways I've seen of splitting a fasta file split based on file size or k-mer size, not header count.
Any suggestions on how to approach this?
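For reference, this is roughly the kind of thing I'm after: a single pass that starts a new output file every n headers, with no manual position bookkeeping. A minimal sketch (the chunk size n, the demo fasta, and the set*.fasta names are placeholders):

```shell
# Scratch directory with a tiny demo fasta (3 headers)
cd "$(mktemp -d)"
printf '>gene1\nACTG\n>gene2\nATTT\n>gene3\nGCAC\n' > my.fasta

# c counts headers seen so far; every n-th header opens the next
# output file, and every line is written to the current file.
awk -v n=2 '/^>/ {c++; if ((c-1) % n == 0) f = sprintf("set%d.fasta", ++i)} {print > f}' my.fasta

ls set*.fasta  # should list set1.fasta and set2.fasta
```

With n=2 the demo yields set1.fasta holding gene1 and gene2, and set2.fasta holding gene3; rerunning with a different n re-chunks without any grep step.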