Question: How to remove sequences from a scaffold/contig file?
17 months ago, mandigene94 wrote:

Hi everyone,

I've been looking around and I can't seem to find a good solution to my problem, so I will ask here, in hopes that you may have some insight (of course, if there's a better thread that I've missed, please point me in that direction).

I have sequences for a few different strains of bacteria that I've manipulated in the lab, and each of these strains has a single plasmid. I've sequenced the whole genome of each strain, plasmid and everything. I didn't have an issue with this part - I trimmed the files with Trimmomatic, and aligned my sequences to a known reference for the original bacterial strain (no plasmid) using bowtie2, and asked for an output file of the reads that did not match the reference (so, hopefully all of my plasmid reads).

From there, I did a de novo assembly using SPAdes, which worked well, but my problem is this: after assembly, I have many contigs/scaffolds (1000+) in the respective output files from SPAdes. I used BLAST to determine the likely origin of the sequences and found a mix of short and longer leftover chromosomal sequences (which I don't want) alongside longer contigs that match plasmids. I'm only interested in assembling and characterizing my plasmids, but I'm not sure whether there is a good way to remove the chromosomal sequences while keeping the plasmid ones, which I hope to use in further steps to possibly get a whole map of the plasmids. I naively thought that I could simply delete each unwanted sequence manually in a word processor, but there has to be a better way. From the BLAST results, I know exactly which contigs/scaffolds contain plasmid sequences (or what are thought to be), but I don't know how to filter out the rest of the sequences that I don't need.

I'd really appreciate any suggestions about what to do from here. I'm still fairly new to bioinformatics, so I'm not entirely sure what other types of software/tools are available in this situation.

Many thanks,


17 months ago, Brian Bushnell (Walnut Creek, USA) wrote:

Hi Amanda,

You can use BBMap's FilterByName tool like this:

    filterbyname.sh in=contigs.fa out=filtered.fa names=names.txt exclude

...where names.txt contains one contig/scaffold name per line.

That will remove all of the listed contigs/scaffolds. Alternatively, you can run in "include" mode if names.txt contains the names of the sequences you want to keep.
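
For readers who want to see what name-based filtering does, or who do not have BBMap installed, here is a minimal Python sketch of the same idea. It is not BBMap's implementation. The function names `read_names` and `filter_fasta` are made up for this illustration, the contig names in the demo are hypothetical SPAdes-style names, and matching on either the full header or its first whitespace-delimited token is an assumption about how the names were recorded.

```python
import io
from typing import Iterable, Iterator, Set, TextIO


def read_names(handle: TextIO) -> Set[str]:
    """Read one sequence name per line; blank lines and a leading '>' are ignored."""
    names = set()
    for line in handle:
        name = line.strip()
        if name.startswith(">"):
            name = name[1:]
        if name:
            names.add(name)
    return names


def filter_fasta(fasta: Iterable[str], names: Set[str],
                 include: bool = False) -> Iterator[str]:
    """Yield FASTA lines, dropping listed records (or keeping only them if include=True)."""
    keep = False  # lines before the first header are discarded
    for line in fasta:
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: a name matches the whole header or its first token.
            record_id = header.split()[0] if header else ""
            listed = header in names or record_id in names
            keep = listed if include else not listed
        if keep:
            yield line


if __name__ == "__main__":
    # Hypothetical contigs and name list, for illustration only.
    contigs = io.StringIO(
        ">NODE_1_length_8_cov_10.0\nACGTACGT\n"
        ">NODE_2_length_4_cov_3.5\nGGCC\n"
        ">NODE_3_length_6_cov_7.2\nTTAACC\n"
    )
    unwanted = read_names(io.StringIO("NODE_2_length_4_cov_3.5\n"))
    # Exclude mode: NODE_2 is removed, NODE_1 and NODE_3 are kept.
    print("".join(filter_fasta(contigs, unwanted)), end="")
```

On real data, the two `io.StringIO` objects would be replaced by open file handles for contigs.fa and names.txt, and the output written to filtered.fa. Because the function streams line by line, multi-line sequences are handled without loading the whole assembly into memory.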


Hi Brian,

Thank you very much for the suggestion - this sounds like it'll solve my problem, so I will try this out!


Edit: I've just tried it out and it works great! Thanks again!

Reply written 17 months ago by mandigene94