Cutting fasta file into chunks?
4
3.5 years ago
zhousun21 ▴ 40

Hi everyone, I'm using some chromosome-level genome assemblies, which have very large sequences. I need to cut these into 1 kb chunks for an analysis.

I have looked at a variety of available tools but I don't see any that do this. Does anyone know of one?

Any advice will be appreciated.

fasta • 4.7k views
1

There might be a more straightforward method out there, but you can use split and then fix up the divided sequences (the first and last sequence of each chunk).

 split -b 1k myfile segment
1

Are we sure this works? split is a standard Unix tool with no notion of base pairs; you may be confusing kilobases with kilobytes. To my knowledge, split doesn't understand FASTA headers at all.
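As a quick illustration (the file and prefix names here are just placeholders), split -b 1k cuts the raw text every 1024 bytes, so header lines and newlines count towards the size and a chunk can begin in the middle of a record:

 split -b 1k genome.fa chunk_
 head -c 120 chunk_ab   # usually starts mid-sequence or mid-header, with no ">" line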

0

Thank you, this works!

0

The second step sounds complicated. How would you clean up these "divided" sequences and still keep every chunk at 1 kb across the board?

0

Thank you everyone!

All useful solutions.

0

Please do not add an answer if you're not answering the principal question.

0

try seqkit split
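If I remember the subcommands correctly, seqkit split partitions records into separate files rather than cutting sequences into fixed-length pieces, e.g.:

 seqkit split -p 10 input.fa   # split into 10 parts, one file per part

For fixed 1 kb pieces per sequence, the seqkit sliding answer below is the relevant subcommand.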

3
3.5 years ago
GenoMax 141k

faSplit from Jim Kent's (UCSC) tools will do this. A pre-compiled Linux binary is available from the UCSC utilities download site; chmod a+x faSplit after downloading.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

For what you want:

faSplit size input.fa 1000 outRoot

This breaks up input.fa into 1000-base chunks.
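If I remember the options correctly, size mode also accepts -oneFile, which writes all the chunks to a single outRoot.fa instead of one file per chunk (worth confirming against faSplit's built-in help):

faSplit size input.fa 1000 outRoot -oneFile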

0

This also works, thanks!

3
3.5 years ago

We use seqkit to slide a 1 kb window over the sequences:

seqkit sliding -s 1000 -W 1000 seqs.fa.gz -o out.fa.gz

You can also adjust the step size to make overlapping chunks in some scenarios, e.g., -s 900 -W 1000 for 1 kb fragments with 100 bp overlap.
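Going by the -g flag described in the usage below, the trailing piece of each sequence that is shorter than the window is dropped unless greedy mode is enabled, so to keep those final short fragments you would run something like:

seqkit sliding -s 1000 -W 1000 -g seqs.fa.gz -o out.fa.gz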

Usage and examples:

sliding sequences, circular genome supported

Usage:
  seqkit sliding [flags]

Flags:
  -C, --circular-genome   circular genome.
  -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
  -s, --step int          step size
  -W, --window int        window size
2
3.5 years ago
Juke34 8.5k

Plenty of solutions are described here: Tutorial: FASTA file split

0

Love the comparison table!

0
3.5 years ago
GenoMax 141k

Use shred.sh from the BBMap suite.

Usage:  shred.sh in=<file> out=<file> length=<number> minlength=<number> overlap=<number>

in=<file>       Input sequences.
out=<file>      Destination of output shreds.
length=500      Desired length of shreds.
minlength=1     Shortest allowed shred.  The last shred of each input sequence may be shorter than desired length.
overlap=0       Amount of overlap between successive reads.
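A minimal invocation for this use case, based on the parameters above (file names are placeholders):

shred.sh in=genome.fa out=chunks_1kb.fa length=1000

With the default overlap=0 the shreds are non-overlapping, and as noted above the last shred of each sequence may be shorter than 1000 bp.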
0

This is perfect, thank you. I had checked BBMap but somehow this wasn't on the list of functions. It is there, though.

0

I don't think this would work based on file size, GenoMax.

0

Should work fine. I recall Brian using it with the rice genome.
