Question: Cutting fasta file into chunks?
zhousun2120 wrote:

Hi everyone, I'm using some chromosome-level genome assemblies, which have very large sequences. I need to cut these into 1 kb chunks for an analysis.

I have looked at a variety of available tools but I don't see any that do this. Does anyone know of one?

Any advice will be appreciated.

fasta • 328 views
modified 4 months ago by Juke345.2k • written 4 months ago by zhousun2120

There might be a more straightforward method out there, but you can use split and then fix the sequences that get divided at the chunk boundaries (the first and last sequences of each piece).

 split -b 1k myfile segment

written 4 months ago by Fatima930

Are we sure this works? split is a standard Unix tool with no understanding of base pairs; you may be confusing kilobases with kilobytes. To my knowledge, split doesn't understand anything about FASTA headers either.
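
To illustrate the concern, here is a minimal Python sketch (not anything from split itself) that counts how many bases actually land in the first 1024 bytes of a FASTA file. The record name, the 60-column wrapping, and the helper name bases_in_byte_chunk are all made up for the example.

```python
def bases_in_byte_chunk(fasta_text: str, n_bytes: int = 1024) -> int:
    """Count sequence characters found in the first n_bytes of a FASTA text."""
    chunk = fasta_text.encode("ascii")[:n_bytes].decode("ascii")
    # Header lines start with '>'; everything else is sequence.
    return sum(len(line) for line in chunk.split("\n") if not line.startswith(">"))


# Toy record (hypothetical): one header plus 5000 bases wrapped at 60 columns.
seq = "ACGT" * 1250
wrapped = "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))
fasta = ">chr1 toy example\n" + wrapped + "\n"

# The header and the newline characters use up part of the 1024 bytes,
# so the byte chunk holds fewer than 1024 bases.
print(bases_in_byte_chunk(fasta))
```

Later byte chunks would also start mid-record with no header at all, which is why the pieces would need fixing afterwards.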

written 4 months ago by Joe19k

Thank you, this works!

written 4 months ago by zhousun2120

The second step sounds complicated. How would you clean up these "divided" sequences and still maintain a file size of 1 kb across the board?

written 4 months ago by Ram32k

Thank you everyone!

All useful solutions.

written 4 months ago by zhousun2120

Please do not add an answer if you're not answering the principal question.

written 4 months ago by Ram32k

Try seqkit split.

written 4 months ago by cpad011215k

GenoMax96k wrote:

Use faSplit from Jim Kent's tools (link for the Linux version). Run chmod a+x faSplit after downloading.

faSplit - Split an fa file into several files.
usage:
   faSplit how input.fa count outRoot
where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
Files split by sequence will be broken at the nearest fa record boundary. 
Files split by base will be broken at any base.  
Files broken by size will be broken every count bases.

For what you want:

faSplit size input.fa 1000 outRoot

This breaks input.fa up into 1000-base chunks.
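
If it helps to see the idea spelled out, below is a rough Python sketch of cutting each record into consecutive, non-overlapping chunks. It is not faSplit's implementation; the function names and the name_start_end chunk labels (1-based coordinates) are assumptions made for the example.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            parts = []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def chunk_records(records: Iterable[Tuple[str, str]],
                  size: int = 1000) -> Iterator[Tuple[str, str]]:
    """Cut every sequence into consecutive pieces of at most `size` bases."""
    for name, seq in records:
        for start in range(0, len(seq), size):
            piece = seq[start:start + size]
            # Hypothetical labelling scheme: name_start_end, 1-based inclusive.
            yield f"{name}_{start + 1}_{start + len(piece)}", piece


if __name__ == "__main__":
    demo = [">chr1 toy record", "ACGTACGTAC", "GTACGT", ">chr2", "TTTT"]
    for chunk_name, piece in chunk_records(read_fasta(demo), size=5):
        print(f">{chunk_name}\n{piece}")
```

For a real file you would pass an open file handle to read_fasta instead of the demo list. Note that the last chunk of each record is usually shorter than the requested size.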

modified 4 months ago • written 4 months ago by GenoMax96k

This also works, thanks!

written 4 months ago by zhousun2120

Juke345.2k wrote:

Plenty of solutions are described here: Tutorial: FASTA file split

written 4 months ago by Juke345.2k

Love the comparison table!

written 4 months ago by Ram32k

shenwei3565.8k wrote:

We can use seqkit sliding to slide along the sequences with a 1 kb window:

seqkit sliding -s 1000 -W 1000 seqs.fa.gz -o out.fa.gz

You can also adjust the step size to produce overlapping chunks in some scenarios, e.g., -s 900 -W 1000 for 1 kb fragments with a 100 bp overlap.

Usage and examples:

sliding sequences, circular genome supported

Usage:
  seqkit sliding [flags]

Flags:
  -C, --circular-genome   circular genome.
  -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
  -s, --step int          step size
  -W, --window int        window size
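
For anyone wondering how the step and window interact, here is a small Python sketch of the sliding-window idea. It only approximates the behaviour described in the help text above and is not seqkit's code; the function name and the greedy argument are assumptions for illustration, and circular genomes are not handled.

```python
from typing import Iterator, Tuple


def sliding_windows(seq: str, window: int, step: int,
                    greedy: bool = False) -> Iterator[Tuple[int, str]]:
    """Yield (0-based start, subsequence) for windows taken every `step` bases.

    Windows shorter than `window` (at the end of the sequence) are only
    emitted when `greedy` is True.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, len(seq), step):
        piece = seq[start:start + window]
        if len(piece) == window or greedy:
            yield start, piece


if __name__ == "__main__":
    toy = "ABCDEFGHIJ"
    # step == window gives back-to-back chunks; step < window gives overlap.
    print(list(sliding_windows(toy, window=4, step=4)))
    print(list(sliding_windows(toy, window=4, step=3)))
    print(list(sliding_windows(toy, window=4, step=3, greedy=True)))
```

With step equal to window the chunks are back to back, and the overlap between neighbouring chunks is simply window minus step.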

written 4 months ago by shenwei3565.8k

GenoMax96k wrote:

Use shred.sh from BBMap suite.

Usage:  shred.sh in=<file> out=<file> length=<number> minlength=<number> overlap=<number>

in=<file>       Input sequences.
out=<file>      Destination of output shreds.
length=500      Desired length of shreds.
minlength=1     Shortest allowed shred.  The last shred of each input sequence may be shorter than desired length.
overlap=0       Amount of overlap between successive reads.

written 4 months ago by GenoMax96k

This is perfect, thank you. I had checked bbmap but somehow this wasn't on the list of functions. But it's there anyway.

written 4 months ago by zhousun2120

I don't think this would work based on file size, genomax.

written 4 months ago by Ram32k

Should work fine. I recall Brian using it with Rice genome.

written 4 months ago by GenoMax96k