Question: Blast and sort contigs by species
Asked 3.1 years ago by spirowol (TAMU):

Hello, I am trying to BLAST and sort thousands of contigs generated from my assemblies. My target contigs belong to a bacterium, but the DNA I used for sequencing wasn't pure, so I have a mixture of contigs from at least two different species. I'd like to separate them by species once they are identified by BLAST. Has anybody done this before? Thanks.

blast next-gen assembly • 1.2k views
Answered 3.1 years ago by pld (United States):

You can set BLAST to output taxonomic IDs in hits and filter based on that.

If the contaminating species have sequenced genomes, you may want to filter out the reads that map to those genomes and then rerun the assembly.


Yes, I already filtered my reads to discard those from unwanted organisms, but since most of the DNA belongs to a large eukaryotic organism (not yet sequenced), I still see host DNA and other bacterial contaminants (whose reads I also filtered beforehand). Does the taxonomic output return the FASTA sequences, or just the IDs of the BLAST hits?

Reply written 3.1 years ago by spirowol

Using the tabular format (-outfmt 6) or a few others, you can set BLAST to output the taxonomic IDs along with the standard fields (query ID, subject ID, etc.). See the BLAST documentation for more detail.
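As a rough sketch of what that could look like with BLAST+: the `staxids` specifier adds the subject taxonomy ID(s) as a 13th column after the twelve standard ones (`std`). The Python helpers below only build the command and parse a result line; the function names are hypothetical, and the taxid column is only meaningful if the database carries taxonomy information.

```python
# Standard 12 tabular columns ("std") plus the subject taxonomy ID(s).
OUTFMT = "6 std staxids"


def blastn_command(query_fasta, db, out_tsv):
    """Build a blastn argument list that appends taxids to tabular output."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", db,
        "-outfmt", OUTFMT,
        "-out", out_tsv,
    ]


def parse_hit(line):
    """Split one tabular line into (query id, subject id, bitscore, taxids)."""
    fields = line.rstrip("\n").split("\t")
    # Column 13 (index 12) is staxids; several IDs are separated by ';'.
    taxids = fields[12].split(";")
    return fields[0], fields[1], float(fields[11]), taxids
```

The list returned by `blastn_command` can be passed to `subprocess.run(cmd, check=True)`; the query, database, and output paths are whatever you use locally.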

If you want the full subject sequences, it would be fairly trivial to extract them from the database searched using blastdbcmd and a list of sequence IDs from your results.
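A minimal sketch of that extraction, assuming the database was built so that sequences can be looked up by ID (for example with `makeblastdb -parse_seqids`). The helper names and file names are made up for illustration; `-entry_batch` takes a file with one ID per line, and FASTA is the default output of blastdbcmd.

```python
def unique_subject_ids(blast_tsv_lines):
    """Collect subject IDs (column 2) in first-seen order, without duplicates."""
    seen = {}
    for line in blast_tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seen.setdefault(fields[1], None)
    return list(seen)


def blastdbcmd_fetch_command(db, id_list_file, out_fasta):
    """Build a blastdbcmd argument list that dumps the listed entries as FASTA."""
    return [
        "blastdbcmd",
        "-db", db,
        "-entry_batch", id_list_file,
        "-out", out_fasta,
    ]
```

Write the IDs from `unique_subject_ids` to a text file, one per line, then run the command with `subprocess.run(cmd, check=True)`.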

Reply written 3.1 years ago by pld

I used -outfmt 6 and now have a list, with all the IDs, of my contigs that hit the desired bacterium. But I want to recover my BLASTed contigs (the queries), not the subject sequences. The objective is to create two datasets of contigs: one with the contaminant sequences, and the other containing only contigs that belong to the target bacterium. The contigs belonging to the target bacterium will be used later for scaffolding and genome finishing.

Reply written 3.1 years ago by spirowol

Taxids are not output by default, so you'll need to add them to the output. Run BLAST, then split the results by taxid into hits that match contaminating species and hits that do not. After that, use the query IDs in those two files to filter your contigs accordingly.
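Something along these lines might work for the split, sketched in Python. It assumes tabular output with the taxids in a 13th column (`-outfmt "6 std staxids"`), classifies each contig by its highest-bitscore hit only, and puts contigs with no hit in the non-target set. Those are simplifications for illustration, not the only reasonable policy, and all function names are hypothetical.

```python
def best_taxids_per_query(blast_lines):
    """Map each query ID to the taxid set of its highest-bitscore hit."""
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank lines or lines without a taxid column
        query, bitscore = fields[0], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, set(fields[12].split(";")))
    return {query: taxids for query, (_, taxids) in best.items()}


def read_fasta(lines):
    """Yield (record id, full record text) for each FASTA record."""
    header, chunk = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunk)
            # The record ID is the first whitespace-delimited token.
            header = (line[1:].split() or [""])[0]
            chunk = [line]
        elif header is not None:
            chunk.append(line)
    if header is not None:
        yield header, "".join(chunk)


def split_contigs(fasta_lines, blast_lines, target_taxids):
    """Return (target_records, other_records) as lists of FASTA text."""
    best = best_taxids_per_query(blast_lines)
    target, other = [], []
    for contig_id, record in read_fasta(fasta_lines):
        taxids = best.get(contig_id, set())
        # A contig is "target" if its best hit shares any taxid with the set.
        (target if taxids & target_taxids else other).append(record)
    return target, other
```

Pass open file handles (or lists of lines) for the contig FASTA and the BLAST table, plus a set of taxid strings for the target organism, then write each returned list to its own file. Hits are often annotated at strain level, so the target set may need to include strain taxids as well as the species taxid.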

Another option, again assuming your BLAST database stores the taxonomy IDs, would be to use blastdbcmd to extract the taxids for your hits, then map those taxids to your contigs via that file and filter accordingly. This would avoid having to run BLAST again if you've already run it and didn't collect taxids in your results.
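For this second route, a hedged sketch: blastdbcmd accepts an `-outfmt` string in which `%a` is the accession and `%T` is the taxid, so a two-column lookup table can be produced for the subject IDs already in your results. The helper names below are hypothetical, and the subject IDs in your BLAST table need to be in the same form as the accessions blastdbcmd prints for the join to work.

```python
def taxid_lookup_command(db, id_list_file, out_file):
    """Build a blastdbcmd argument list printing 'accession taxid' per entry."""
    return [
        "blastdbcmd",
        "-db", db,
        "-entry_batch", id_list_file,
        "-outfmt", "%a %T",
        "-out", out_file,
    ]


def parse_taxid_table(lines):
    """Turn 'accession taxid' lines into a dict, ignoring malformed lines."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[parts[0]] = parts[1]
    return table


def annotate_hits(blast_lines, subject_to_taxid):
    """Yield (query id, subject id, taxid or None) for each BLAST hit."""
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        yield fields[0], fields[1], subject_to_taxid.get(fields[1])
```

Hits that come back with `None` are subjects missing from the lookup table, which usually means the ID formats differ or the database has no taxonomy for that entry.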

Reply written 3.1 years ago by pld