Question: Splitting a fasta file based on the number of headers.
0
7 months ago by
sicat.paolo2030 wrote:

I have a FASTA file with more than 2.7 million headers, and I want to break it into chunks.

>gene1
ACTG...

>gene2
ATTT...

...

>gene2,700,000
GCAC...

The way I do it is:

grep -n "^>" my.fasta > headersofmy.fasta

This gives me the positional information of the headers.

1:>gene1
4:>gene2
11:>gene3
...
n:>gene2,700,000

I then use the positional information to grab a set number of genes:

awk 'NR>=position1&&NR<=position2'  my.fasta > set1.fasta

I do this a couple of times to break my initial huge FASTA file into smaller files, each with a set number of headers.

I first broke it into chunks of 500,000 headers, then into chunks of 100,000.

I feel that there is a smarter way to do this, especially if I want to break the file into even smaller chunks based on the number of headers. The other ways I've seen to split a FASTA file split by file size or k-mer size.

Any suggestions on how to approach this?

bash python fasta • 421 views
modified 7 months ago by Alex Reynolds28k • written 7 months ago by sicat.paolo2030
1

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

written 7 months ago by RamRS21k

Hello sicat.paolo20 ,

There are multiple answers posted below. If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer, if they work.

modified 7 months ago • written 7 months ago by genomax65k

Sorry, I was occupied with another issue and forgot to check my account again.

written 7 months ago by sicat.paolo2030
5
7 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello,

seqkit split has these two options for this:

-p, --by-part int        split sequences into N parts
-s, --by-size int        split sequences into multi parts with N sequences

fin swimmer

written 7 months ago by finswimmer11k

Sorry for the late reply. Thanks for this; it worked for me. It just took a while to install.

written 7 months ago by sicat.paolo2030

Hello again,

Have a look at conda. I described how to use it in the first part of this tutorial. After that, you won't have any install issues anymore ;)

fin swimmer

written 7 months ago by finswimmer11k
3
7 months ago by
Bastien Hervé3.9k
Limoges, CBRS, France
Bastien Hervé3.9k wrote:

If your FASTA file has exactly two lines per record (line 1 header, line 2 sequence, line 3 header, ...),

you can use the Unix split command. If you want 100,000 sequences in each subfile (1 sequence = 2 lines):

split -l 200000 filename

Note that this will not work if your sequences span multiple lines.
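
If the sequences do span multiple lines, they can be joined onto a single line first, so that every record is exactly two lines and the split -l arithmetic holds. The following is only a minimal Python sketch of that preprocessing step; the function names `linearize_fasta` and `write_linearized` are made up for illustration and are not part of any tool mentioned in this thread.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines into one."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line.strip() and header is not None:
            # Non-empty sequence line; blank lines are dropped.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def write_linearized(in_path, out_path):
    """Rewrite in_path as a two-lines-per-record FASTA file at out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in linearize_fasta(src):
            dst.write(f"{header}\n{seq}\n")
```

After something like `write_linearized("my.fasta", "my.linear.fasta")`, running split -l 200000 on the new file should give 100,000 records per subfile.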

modified 7 months ago • written 7 months ago by Bastien Hervé3.9k

Thanks for the idea. I think I forgot that I could simply do this to the fasta file (convert the sequence into a single line). Cheers!

modified 7 months ago • written 7 months ago by sicat.paolo2030
2
7 months ago by
cpad011211k
India
cpad011211k wrote:

Use faSplit from kentUtils (https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64 for 64-bit Linux binaries).

modified 7 months ago • written 7 months ago by cpad011211k
1

OP: Remember to chmod a+x faSplit after you download the binary.

written 7 months ago by genomax65k

Thanks. I haven't tried this yet, but I will look into it. Cheers,

written 7 months ago by sicat.paolo2030
1
7 months ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

The following will split multiline FASTA records by every SPLIT number of headers.

$ export SPLIT=1234
$ awk -v s="${SPLIT}" 'BEGIN{ RS=">"; k=0; i=0; f=sprintf("split.%06d.fa",k); }{ if (length($0)>0) { printf("%d,%s\n",i,$1); printf(">%s",$0) >> f; i++; } if (i==s) { close(f); i=0; k++; f=sprintf("split.%06d.fa",k); } }' seqs.fa

The output files are written to split.000000.fa, split.000001.fa, split.000002.fa, etc.

Each file contains SPLIT records, except for the last file, which can contain from 1 to SPLIT records, depending on whether SPLIT evenly divides the number of records in seqs.fa.
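
For readers who find the one-liner hard to follow, the same idea can be spelled out in Python. This is a rough sketch rather than an exact port: the function name `split_fasta` and its `prefix` parameter are hypothetical, it overwrites output files instead of appending to them, and it does not print the per-record index to standard output as the awk command does.

```python
def split_fasta(in_path, records_per_file, prefix="split"):
    """Write consecutive groups of records to prefix.000000.fa, prefix.000001.fa, ...

    A record starts at a line beginning with ">" and runs until the next one,
    so multiline sequences are handled. Returns the list of output paths.
    """
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    out_paths = []
    out = None
    count = 0  # records written to the current output file
    try:
        with open(in_path) as src:
            for line in src:
                if line.startswith(">"):
                    # Start a new output file at the first header and
                    # whenever the current file is full.
                    if out is None or count == records_per_file:
                        if out is not None:
                            out.close()
                        path = f"{prefix}.{len(out_paths):06d}.fa"
                        out = open(path, "w")
                        out_paths.append(path)
                        count = 0
                    count += 1
                if out is not None:  # ignore anything before the first header
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_paths
```

For example, `split_fasta("seqs.fa", 1234)` would be expected to produce the same file names as the awk command above.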

modified 7 months ago • written 7 months ago by Alex Reynolds28k

Thanks for the bash solution. I'm starting to like using bash to solve issues like this. I think it will take a while for me to understand your one-liner. Thanks!

written 7 months ago by sicat.paolo2030