Question

Remove sequences <300 bases from FASTA file

7

5.7 years ago

zoppisemma ▴ 70

I have a multi-FASTA file containing contigs derived from metagenomic data. I need to remove all contigs shorter than 300 bp. How do I proceed?

genome next-gen sequencing assembly • 18k views

ADD COMMENT • link updated 5.7 years ago by harish ▴ 450 • written 5.7 years ago by zoppisemma ▴ 70

2

Hi zoppisemma

Several users have provided solutions. You should upvote or accept the answers that helped; this will help others looking for such solutions.

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

1

See this post and tweak for 300.

ADD REPLY • link 5.7 years ago by Jeffin Rockey ★ 1.3k

0

This should be just a comment and not an answer, as you're only pointing to an existing post/answer. I've moved it to one.

ADD REPLY • link 5.7 years ago by Ram 43k

0

Thanks for the correction Ram.

ADD REPLY • link 5.7 years ago by Jeffin Rockey ★ 1.3k

1

duplicate: How To Filter Multi Fasta By Length??

ADD REPLY • link 5.7 years ago by Pierre Lindenbaum 161k

1

Other people gave you excellent solutions. Nevertheless, you may also be interested in SEDA (http://www.sing-group.org/seda/), an open-source tool for processing FASTA files. Among other functions, it provides an operation for applying different filters, including one on sequence length (https://www.sing-group.org/seda/manual/operations.html#filtering). Regards.

ADD REPLY • link updated 5.7 years ago by finswimmer 16k • written 5.7 years ago by Hugo ▴ 380

score 14 · Answer 1 · 2018-07-30

14

5.7 years ago

lakhujanivijay 5.8k

Using seqkit (`-m` sets the minimum sequence length to keep; output goes to stdout):

seqkit seq -m 300 your_fasta.fa

download here

ADD COMMENT • link 5.7 years ago by lakhujanivijay 5.8k

score 8 · Answer 2 · 2018-07-30

8

5.7 years ago

GenoMax 141k

Using reformat.sh from the BBMap suite:

reformat.sh in=your.fa out=filtered.fa minlength=300

ADD COMMENT • link updated 5.7 years ago by Ram 43k • written 5.7 years ago by GenoMax 141k

score 6 · Answer 3 · 2018-07-31

6

5.7 years ago

harish ▴ 450

Hi!

You can use seqtk for this; `-L 300` drops sequences shorter than 300 bases. The command should be:

seqtk seq -L 300 contigs.fasta > file.fasta

ADD COMMENT • link updated 5.7 years ago by Ram 43k • written 5.7 years ago by harish ▴ 450

1

@harish

Just FYI (for larger datasets), see the seqkit benchmark:

https://bioinf.shenwei.me/seqkit/#benchmark

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

0

Ahh. That's nice. Glad to learn something new today!

ADD REPLY • link 5.7 years ago by harish ▴ 450

0

I use it for all sorts of FASTA/FASTQ manipulation and have found it really fast and effective.

ADD REPLY • link 5.7 years ago by lakhujanivijay 5.8k

score 4 · Answer 4 · 2018-07-30

4

5.7 years ago

finswimmer 16k

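An awk solution that should also work for multi-line FASTA files. The length counter `l` is reset for every record, and sequences of exactly 300 bases are kept:

awk -v RS=">" -v FS="\n" '{l=0; for(i=2;i<=NF;i++) {l+=length($i)}; if(l>=300) printf ">%s", $0}' test.fasta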

ADD COMMENT • link 5.7 years ago by finswimmer 16k
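
For anyone who wants the same filter without installing extra tools, the per-record logic of the awk answer above (sum the lengths of all sequence lines under a header, then keep the record if the total is at least 300) can be sketched in plain Python. This is only a minimal sketch, not something posted in the thread: the function names `read_fasta`, `filter_by_length` and `filter_fasta_file` are made up for illustration, and it assumes a well-formed FASTA file that fits the usual `>header` / sequence-lines layout.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) per record; supports multi-line FASTA."""
    header = None
    seq_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # tolerate Unix and Windows line endings
        if line.startswith(">"):
            if header is not None:
                yield header, seq_lines
            header, seq_lines = line, []
        elif header is not None and line:
            # Skip blank lines; ignore anything before the first header.
            seq_lines.append(line)
    if header is not None:
        yield header, seq_lines


def filter_by_length(lines: Iterable[str], min_len: int = 300) -> Iterator[str]:
    """Yield the lines of every record whose total length is >= min_len."""
    for header, seq_lines in read_fasta(lines):
        if sum(len(s) for s in seq_lines) >= min_len:
            yield header
            yield from seq_lines


def filter_fasta_file(in_path: str, out_path: str, min_len: int = 300) -> None:
    """Hypothetical convenience wrapper: filter in_path into out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in filter_by_length(src, min_len):
            dst.write(line + "\n")
```

Calling `filter_fasta_file("contigs.fasta", "filtered.fasta")` would write only the contigs of 300 bp or more, keeping the original line wrapping. For large metagenomic assemblies the dedicated tools suggested in the answers above are likely to be considerably faster.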