What Is A Good Strategy For Finding Common Bacterial Sequences In A Potentially Contaminated Human Sample?
3
0
Entering edit mode
9.5 years ago
Dan D 7.3k

I typically work on the programming/data management side of things, but we're short-staffed and tricky analysis jobs sometimes work their way to me when others can't figure them out. In this case an investigator has human tissue that may contain bacterial contamination. He would like to determine whether this is true by extracting and sequencing the RNA from the sample, and then seeing if any of the reads are bacterial in origin. I don't know if this is preferable to other tests (like biochemical assays and whatnot), but I don't know enough to offer a better suggestion.

My first thought (as someone who doesn't do this kind of analysis regularly) is to do a standard alignment via bwa or tophat and store the unmapped reads in a file. Then these unmapped reads can be searched against a file or database to see if anything interesting pops up.
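A minimal sketch of that first step, assuming `bwa` and `samtools` are installed and a pre-indexed human reference is on hand (the file names here are placeholders):

```shell
# Hypothetical paths -- adjust to your environment.
REF="hg19.fa"            # human reference, pre-indexed with `bwa index`
FASTQ="sample.fastq.gz"  # the sequenced reads

# Skip gracefully if the tools are not on PATH.
if command -v bwa >/dev/null && command -v samtools >/dev/null; then
    # Align against the human reference, then keep only the reads that
    # did NOT map (SAM flag 4 = unmapped) and write them back out as
    # FASTQ for downstream screening.
    bwa mem -t 4 "$REF" "$FASTQ" \
        | samtools fastq -f 4 - > unmapped.fastq
else
    echo "bwa/samtools not found; install them first" >&2
fi
```

The `unmapped.fastq` output is then the candidate pool of non-human reads to search against whatever bacterial database you settle on.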

The tricky part here is that while it would be trivial to search against a single organism's genome, I don't know if a database of conserved bacterial sequences exists that I could search against. If such a thing exists then that's a possible solution.

Anyone have any ideas?

mapping • 4.0k views
3
Entering edit mode
9.5 years ago

I think that is total overkill. First, there are some very simple wet-lab techniques. If you want to check for the presence of bacteria (and do a taxonomic analysis), it should be sufficient to demonstrate the presence of bacterial 16S rRNA: take some universal 16S primers and do an RT-PCR. If you need to find out *which* bacteria, you could then sequence the product by simple Sanger sequencing; no NGS required. But in principle, for just proving presence, even that is not necessary.

1
Entering edit mode

Sequencing was the method that was recently used to find Fusobacterium associated with colon carcinoma, so there is some precedence: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3266037/ http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3266036/

2
Entering edit mode
9.5 years ago

the nr database?

I think what's worth trying first is to de novo assemble the unmapped reads. See if you get any contigs, then blast those against nr.
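One way to sketch that (assuming SPAdes for the assembly and the BLAST+ tools with a local `nr` database; all file names here are placeholders):

```shell
# Hypothetical input -- the unmapped reads saved from the human alignment.
UNMAPPED="unmapped.fastq"

# Skip gracefully if the tools are not on PATH.
if command -v spades.py >/dev/null && command -v blastx >/dev/null; then
    # De novo assemble the unmapped reads into contigs (single-end mode).
    spades.py -s "$UNMAPPED" -o spades_out

    # Translate-and-search the contigs against the nr protein database;
    # tabular output (-outfmt 6) is easy to sort and filter afterwards.
    blastx -query spades_out/contigs.fasta -db nr \
        -outfmt 6 -evalue 1e-20 -max_target_seqs 5 > contigs_vs_nr.tsv
else
    echo "spades.py/blastx not found" >&2
fi
```

Assembling first means you blast a few hundred contigs instead of millions of short reads, which is far cheaper and gives much more specific hits.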

0
Entering edit mode
6.2 years ago
steve ★ 3.3k

It sounds like you might want to try out Kraken (after you've sequenced the samples):

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3991327/

https://ccb.jhu.edu/software/kraken/

I have used it before in the sample script below (credit to igor for his work on it):

#!/bin/bash

SAMPLEID="UniqueSampleIdentifier"
INPUTFILE="/path/to/fastq.gz"

# use 6 threads if NSLOTS is undefined, e.g. not submitted with qsub
THREADS=${NSLOTS:=6}

# Kraken settings
export KRAKEN_DEFAULT_DB="$HOME/ref/Kraken/nt-48G"
export KRAKEN_NUM_THREADS="$THREADS"

# Unzip the fastq.gz and run Kraken on the first 1,000,000 reads
# (4 lines per read), then do some formatting of the output.
#
# NOTE: $HOME/software/bin/kraken is the path to the Kraken binary; adjust this if needed
zcat "$INPUTFILE" | head -4000000 | \
    "$HOME"/software/bin/kraken --fastq-input /dev/fd/0 | \
    "$HOME"/software/bin/kraken-report --show-zeros | \
    awk -F$'\t' '$1 > 0.1' > "kraken_contaminant_analysis.${SAMPLEID}.txt"

# filter for top hits: taxa with >1% of reads, skipping "-" rank entries
awk -F$'\t' '$1 > 1.0 && $4 != "-"' "kraken_contaminant_analysis.${SAMPLEID}.txt" \
    | cut -f 1,2,4,5,6 > tophit.txt