Question

How to remove sequences from a scaffold/contig file?

6.6 years ago

mandigene94 ▴ 30

Hi everyone,

I've been looking around and I can't seem to find a good solution to my problem, so I will ask here, in hopes that you may have some insight (of course, if there's a better thread that I've missed, please point me in that direction).

I have sequences for a few different strains of bacteria that I've manipulated in the lab, and each of these strains has a single plasmid. I've sequenced the whole genome of each strain, plasmid and everything. I didn't have an issue with this part: I trimmed the reads with Trimmomatic, aligned them with bowtie2 to a known reference for the original bacterial strain (no plasmid), and asked for an output file of the reads that did not align to the reference (so, hopefully, all of my plasmid reads).

From there, I've done a de novo assembly using SPAdes, which worked well, but my problem is this: after assembly, the contigs and scaffolds files output by SPAdes each contain many sequences (1000+). I used BLAST to determine the likely origin of the sequences and discovered that leftover chromosomal sequences, both short and long (which I don't want), are mixed in with longer contigs that BLAST to plasmids. I'm only interested in assembling and characterizing my plasmids, but I'm not sure if there is a good way to remove the chromosomal sequences while keeping the plasmid ones, which I hope to use in further steps to possibly get a whole map of the plasmids. I naively thought that I could simply delete each unwanted sequence manually in a word processor, but there has to be a better way. From the BLAST results, I know exactly which contigs/scaffolds have plasmid sequences (or what are thought to be), but I just don't know how to filter out the rest of the sequences that I don't need.

I'd really appreciate any suggestions about what to do from here. I'm still fairly new to bioinformatics, so I'm not entirely sure what other types of software/tools are available in this situation.

Many thanks,

Amanda

plasmid assembly • 3.2k views

score 3 · Answer 1 · 2017-10-09

6.6 years ago

Brian Bushnell 20k

Hi Amanda,

You can use BBMap's FilterByName tool like this:

filterbyname.sh in=contigs.fa out=filtered.fa names=names.txt exclude

...where names.txt contains one name per line, like this:

contig_15
contig_73
contig_192

That will remove all of the listed contigs/scaffolds and write the remaining sequences to filtered.fa. Alternatively, if names.txt contains the names of the sequences you want to keep, you can run in "include" mode by using include=t in place of exclude.
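
If BBMap is not available, the same exclude/include filtering can be approximated with plain awk. This is only a minimal sketch, not FilterByName itself: the function name filter_fasta is hypothetical, and it assumes that each line of names.txt exactly matches the first whitespace-delimited word of a FASTA header (without the leading ">"), which may differ from how FilterByName matches names.

```shell
# filter_fasta NAMES FASTA [exclude|include]
# Hypothetical helper: prints FASTA records to stdout.
#   exclude (default): drop records whose name is listed in NAMES
#   include:           keep only records whose name is listed in NAMES
filter_fasta() {
    awk -v names="$1" -v mode="${3:-exclude}" '
        BEGIN {
            # Load the list of names, one per line, skipping blank lines.
            while ((getline line < names) > 0) {
                sub(/\r$/, "", line)          # tolerate Windows line endings
                if (line != "") listed[line] = 1
            }
            close(names)
        }
        /^>/ {
            # Header line: the record name is the first word after ">".
            id = substr($1, 2)
            hit = (id in listed)
            keep = (mode == "include") ? hit : !hit
        }
        keep    # print header and sequence lines of records being kept
    ' "$2"
}

# Example usage (file names taken from the command above):
#   filter_fasta names.txt contigs.fa > filtered.fa
#   filter_fasta names.txt contigs.fa include > filtered.fa
```

Because the decision is made once per header and then applied to every following sequence line, multi-line FASTA records are handled without loading the whole file into memory.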

Hi Brian,

Thank you very much for the suggestion - this sounds like it'll solve my problem, so I will try this out!

Amanda

Edit: I've just tried it out and it works great! Thanks again!

ADD REPLY • link 6.6 years ago by mandigene94 ▴ 30