Cutting fasta file into chunks?
4
3.5 years ago
zhousun21 ▴ 40

Hi everyone, I'm using some chromosome-level genome assemblies, which have very large sequences. I need to cut these into 1 kb chunks for an analysis.

I have looked at a variety of available tools but I don't see any that do this. Does anyone know of one?

Any advice will be appreciated.

fasta • 4.7k views
1

There might be a more straightforward method out there, but you can use split and then fix up the divided sequences (the first and last sequence of each chunk).

 split -b 1k myfile segment
1

Are we sure this works? split is a standard Unix tool with no notion of base pairs; you may be confusing kilobases with kilobytes. To my knowledge, split doesn't understand FASTA headers at all.
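As a quick illustration (the file and prefix names here are just placeholders), split -b 1k cuts the raw text every 1024 bytes, so header lines and newlines count towards the size and a chunk can begin in the middle of a record:

 split -b 1k genome.fa chunk_
 head -c 120 chunk_ab   # usually starts mid-sequence or mid-header, with no ">" line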

0

Thank you, this works!

0

The second step sounds complicated. How would you clean up these "divided" sequences and still keep every chunk at 1 kb across the board?

0

Thank you everyone!

All useful solutions.

0

Please do not add an answer if you're not answering the principal question.

0

try seqkit split
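If I remember the subcommands correctly, seqkit split partitions records into separate files rather than cutting sequences into fixed-length pieces, e.g.:

 seqkit split -p 10 input.fa   # split into 10 parts, one file per part

For fixed 1 kb pieces per sequence, the seqkit sliding answer below is the relevant subcommand.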

3
3.5 years ago
GenoMax 141k

faSplit from Jim Kent's (UCSC) tools will do this. A pre-compiled Linux binary is available from the UCSC utilities download site; chmod a+x faSplit after downloading.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

For what you want:

faSplit size input.fa 1000 outRoot

This breaks up input.fa into 1000-base chunks.
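If I remember the options correctly, size mode also accepts -oneFile, which writes all the chunks to a single outRoot.fa instead of one file per chunk (worth confirming against faSplit's built-in help):

faSplit size input.fa 1000 outRoot -oneFile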

0

This also works, thanks!

3
3.5 years ago

We use seqkit to slide a 1 kb window over the sequences:

seqkit sliding -s 1000 -W 1000 seqs.fa.gz -o out.fa.gz

You can also adjust the step size to make overlapping chunks in some scenarios, e.g., -s 900 -W 1000 for 1 kb fragments with 100 bp overlap.
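Going by the -g flag described in the usage below, the trailing piece of each sequence that is shorter than the window is dropped unless greedy mode is enabled, so to keep those final short fragments you would run something like:

seqkit sliding -s 1000 -W 1000 -g seqs.fa.gz -o out.fa.gz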

Usage and examples:

sliding sequences, circular genome supported

Usage:
  seqkit sliding [flags]

Flags:
  -C, --circular-genome   circular genome.
  -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
  -s, --step int          step size
  -W, --window int        window size
2
3.5 years ago
Juke34 8.5k

Plenty of solutions are described here: Tutorial: FASTA file split

0

Love the comparison table!

0
3.5 years ago
GenoMax 141k

Use shred.sh from the BBMap suite.

Usage:  shred.sh in=<file> out=<file> length=<number> minlength=<number> overlap=<number>

in=<file>       Input sequences.
out=<file>      Destination of output shreds.
length=500      Desired length of shreds.
minlength=1     Shortest allowed shred.  The last shred of each input sequence may be shorter than desired length.
overlap=0       Amount of overlap between successive reads.
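A minimal invocation for this use case, based on the parameters above (file names are placeholders):

shred.sh in=genome.fa out=chunks_1kb.fa length=1000

With the default overlap=0 the shreds are non-overlapping, and as noted above the last shred of each sequence may be shorter than 1000 bp.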
0

This is perfect, thank you. I had checked BBMap but somehow this wasn't on the list of functions. It is there, though.

0

I don't think this would work based on file size, GenoMax.

0

Should work fine. I recall Brian using it with the rice genome.
