I have a FASTA file that has more than 2.7 million headers. I want to break it into chunks.
>gene1
ACTG...
>gene2
ATTT...
...
>gene2,700,000
GCAC...
The way I do it is:
grep -n "^>" my.fasta > headersofmy.fasta
This gives me the positional information of the headers.
1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000
I then use the positional information to grab a set number of genes:
awk 'NR>=position1&&NR<=position2' my.fasta > set1.fasta
I do this a couple of times to break my initial huge FASTA file into smaller files, each with a set number of headers.
I first broke it into chunks of 500,000 headers, then into chunks of 100,000.
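One note on the extraction step above: as written, the awk command keeps scanning to the end of the file even after the range has been printed, which is wasteful on a file with 2.7 million records. Exiting as soon as the range is past avoids that (a sketch with placeholder positions; p1 and p2 stand for the line numbers taken from headersofmy.fasta):

```shell
# Print lines p1..p2 of my.fasta, then stop reading immediately
# instead of scanning the rest of the file.
awk -v p1=1 -v p2=10 'NR > p2 { exit } NR >= p1' my.fasta > set1.fasta
```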
I feel that there is a smarter way to do this if I want to break the file into even smaller chunks based on the number of headers. I've seen other ways to split a FASTA file, but they split based on file size or k-mer size.
Any suggestion on how to approach this?
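The repeated grep-then-awk rounds described above can be collapsed into a single pass: count headers as they stream by and start a new output file every n headers. This is a sketch, not a tested pipeline for the actual data; the chunk size n and the set%d.fasta naming scheme are arbitrary choices:

```shell
# One-pass split of my.fasta into chunks of n headers each.
# Every time c (the running header count) crosses a multiple of n,
# the current output file is closed and a new setK.fasta is started.
awk -v n=100000 '
  /^>/ { if (c++ % n == 0) { if (f) close(f); f = sprintf("set%d.fasta", ++k) } }
  { print > f }
' my.fasta
```

Closing each finished file matters here: without close(), awk keeps every output file open and can hit the open-file-descriptor limit when the number of chunks grows.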
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Hello sicat.paolo20,

There are multiple answers posted below. If an answer was helpful, you should upvote it; if an answer resolved your question, you should mark it as accepted. You can accept more than one answer if they work.

Sorry I was occupied with another issue and forgot to check my account again.