Genome Contamination Check with Blobtools
1
0
Entering edit mode
9 weeks ago
saamhasan55 ▴ 30

I generated a genome assembly and wanted to check for contaminant sequences using blobtools. I carried out a genome alignment between the assembly and fastq reads using bwa-mem, used to diamond blastx to search my sequences against the blast database and then used blobtools. But blobtools finds no taxonomic hits, I only get undefined or no-hits. The diamond output has taxonomic IDs and when I search some sample predicted proteins from my assembly they map. Why is blobtools not finding any taxonomic hits?

diamond blobtools genome blast • 1.7k views
ADD COMMENT
0
Entering edit mode

I generated a genome assembly and wanted to check for contaminant sequences

Why do you think you have contaminant sequences in your assembly? Generally when libraries are made from an organismm there should be no extraneous sequence present.

ADD REPLY
0
Entering edit mode

Not true, depending on how the wet-lab experiments were conducted.

ADD REPLY
1
Entering edit mode
13 days ago
Kevin Blighe ★ 90k

The problem that you describe is commonly caused by the format of the Diamond output file not matching what Blobtools expects for taxonomic assignment. Blobtools requires a hits file in tab-separated format with exactly three columns: the sequence identifier (from your assembly), the NCBI taxonomic identifier, and the alignment score (typically the bitscore). If the second column does not contain the taxonomic identifier, Blobtools will report no taxonomic hits, even if Diamond produced alignments with taxonomic information elsewhere in the output.

To resolve this, first ensure that your Diamond database was built with taxonomic mappings. Download the necessary NCBI taxonomy files:

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzf taxdump.tar.gz nodes.dmp names.dmp
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

Then, build the database (using UniRef90 as an example):

diamond makedb --in uniref90.faa --db uniref90 --taxonmap prot.accession2taxid.gz --taxonnodes nodes.dmp

Run Diamond with a custom output format to include the taxonomic identifiers:

diamond blastx --db uniref90 --query your_assembly.fasta --out matches.tsv --outfmt 6 qseqid staxids bitscore

This produces the required three-column format. When creating the BlobDB, specify the path to the taxonomy dump:

blobtools create --infile your_assembly.fasta --out your_blobdb --taxrule bestsum --taxdump path/to/taxdump --hits matches.tsv --cov your_bam.cov

If your current Diamond output includes taxonomic identifiers but not in the second column, reformat it using awk:

awk '{print $1 "\t" $13 "\t" $12}' your_diamond_output.tsv > reformatted_hits.tsv

(Adjust column numbers based on your output format; $13 is typically staxids if using outfmt 6 with staxids, and $12 is bitscore.)

This should allow Blobtools to assign taxonomy correctly.

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 3488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6