Question

How to check for contamination by one contaminant across multiple whole genome sequenced samples?

0

Entering edit mode

7.8 years ago

jerrybug109 ▴ 10

Hello Friends,

So recently our lab had a batch of different strains of a bacteria whole genome sequenced. We got the short read sequences back from the sequencing facility, and had bit of a surprise. After selecting a random 10k base pairs from each sample and BLASTing them, half of the samples were more similar to a different species of bacteria than the species we had expected to get back.

So I'm thinking, maybe during our DNA extraction process we screwed up by accidentally contaminating half the samples with one individual of the new surprising bacteria species, and then that surprise species got sequenced instead of the one we hoped for. I'm thinking that a good way to check this would be to see if any of these surprising samples are identical in terms of sequence to each other. If so, that would suggest that a single contaminant made its way across several sample tubes we sent to the NGS facility.

What's a good way to check to see if whether all these samples of the suspicious bacterial species identity are completely identical to each other? Each sample is in FASTQ reverse and forward end and paired end 150 bp reads each. Would I have to assemble each genome and then align them against each other using BWA-MEM or Bowtie2 in a pairwise fashion, or is there a faster way?

I would appreciate any input you have. Thank you.

alignment sequencing genomics • 4.6k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 7.8 years ago by jerrybug109 ▴ 10

1

Entering edit mode

You could use BBsplit.sh from BBMap suite with a selection of the genomes you expect to quickly try and bin your reads. You would get an idea of the problem at hand that way and cleaned data that you can use for downstream steps if you wish.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Hi thanks for the suggestion, that looks interesting. In this case would I use a couple of my genomes as the references and see if any of the reads bin "heavily" towards one reference versus the other? Thanks.

ADD REPLY • link 7.8 years ago by jerrybug109 ▴ 10

score 1 · Answer 1 · 2016-06-28

1

Entering edit mode

7.8 years ago

Brian Bushnell 20k

If this is the result of contamination by a single organism, then yes, all of the contamination will be identical (aside from sequencing errors) and it will be easy to remove once you identify the contaminating organism. To identify the contaminant, BLASTing against RefSeq bacterial and/or NT will probably work.

What are the organisms in question? I understand that the entire run was supposed to be just a single species, and all the contamination is from a different single species, correct? Are these species with existing references? And, were your libraries amplified, and if so, how?

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

We expected all 40 of our samples to be Bacillus subtilis. After we got the short reads from all samples back from the core facility, I BLASTed 10,000 random reads from each sample. 20/40 of the samples got BLAST results that indicated Bacillus licheniformis. So I suspect that the Bacillus licheniformis samples might be the result of contamination from a single Bacillus licheniformis individual, cross contaminated across multiple samples.

Yes, there are existing references for both species.

The libraries were amplified via Clonal Bridge Amplification / PCR. The sequencing done was Illumina Hiseq, and the read output was short reads of 150 bp each.

Thanks I really appreciate you taking the time to help out!

ADD REPLY • link 7.8 years ago by jerrybug109 ▴ 10

0

Entering edit mode

20/40 of the samples got BLAST results that indicated Bacillus licheniformis.

Do you mean that ALL of the reads were Bacillus licheniformis, or just some? Were any of the reads from Bacillus subtilis? And specifically, what percent were from each species? And do you happen to know how closely related those species are (in terms of ANI)?

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

score 0 · Answer 2 · 2016-06-28

0

Entering edit mode

7.8 years ago

Brian Bushnell 20k

There are many possible vectors of contamination. If these bacteria were multiplexed on the same plate/run, there could be cross-contamination in which each sample is contaminated by other samples. There's also the possibility of your reagents being contaminated, or there being carryover contamination from the previous run, or your robots causing contamination, etc. The first thing to do is find out whether everything is contaminated by the same organism, which can be easily accomplished by assembling all of them and then blasting the contigs, or blasting a thousand reads from each sample, and examining the blast hits. Assembling, shredding the contigs, and aligning them against each other and creating a heatmap indicating which samples map to which other samples, in which the heatmap axes are sorted by taxonomy, is another way of looking at it.

There are ways of digitally fixing different contamination vectors, if you can figure out which one it was. Can you describe the experiment in more detail? For example, are these MDA'd single cells, or isolates extracted from colonies?

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

The first thing to do is find out whether everything is contaminated by the same organism, which can be easily accomplished by assembling all of them and then blasting the contigs, or blasting a thousand reads from each sample, and examining the blast hits

I've already done this step. We were expecting all 40 of our samples to be "Species A", but based on blasting 10,000 random reads from each sample across the public NCBI database, 20 of them are more similar to "Species B" than they are to "Species A".

These samples were isolated at a field side where it's likely to get either bacterial species A or B, and it's possible that those 20 "species B-like" individuals were simply just miss classified in the past as "species A". In this experiment, these are isolates extracted from colonies. So I need to find out whether contamination is going on or not.

But for now, I hope to at least find out whether those 20 "species B-like" individuals are indeed the result of a carryover contamination event from one single "Species B-like" organism. In this case I would expect all the "species B-like" samples to be identical in sequence identity, correct? (Or am I being naive?)

Or based on your suggestion, are you saying I should BLAST the contigs from each sample against a local database made up of the assembled genomes from each of my samples? And then look to see if any of them have 100% sequence identity with each other?

Thanks!

Or are you saying I should blast the contigs across the publicly available NCBI database?

ADD REPLY • link 7.8 years ago by jerrybug109 ▴ 10

score 0 · Answer 3 · 2016-06-28

0

Entering edit mode

7.8 years ago

jerrybug109 ▴ 10

Do you mean that ALL of the reads were Bacillus licheniformis, or just some? Were any of the reads from Bacillus subtilis? And specifically, what percent were from each species? And do you happen to know how closely related those species are (in terms of ANI)?

Just some. For example, here are the top 10 BLAST hits for one of our samples (Sample A) which heavily indicate B. licheniformis, along with percent from each species. Some reads are from subtilis but the results of the NT blast contain a higher proportion of mapped reads in total reads associated with B. licheniformis.

Sample A: Bacillus licheniformis DSM (59.13%) Bacillus licheniformis ATCC (58.95%)
Bacillus licheniformis 9945A, (57.74%)
Bacillus subtilis subsp. (1.31%)
Bacillus licheniformis plasmid (1.27%)
Bacillus atrophaeus subsp. (1.08%)
Bacillus subtilis PY79, (1.01%) B.licheniformis gene for (0.98%)
Bacillus amyloliquefaciens subsp. (0.97%)
Bacillus sp. JS, (0.92%)

I am not sure of how close they are in terms of ANI. But I have mapped all 40 samples across B. subtilis and B. licheniformis reference genomes, and the 20 samples that were indicated by blast to be B. licheniformis do map ~90% to B. licheniformis reference genomes; and the 20 samples that were indicated by blast to be b. subtilis similarly map ~90% to B. subtilis reference genomes.

ADD COMMENT • link 7.8 years ago by jerrybug109 ▴ 10

1

Entering edit mode

You can certainly use BBSplit as Genomax suggested to separate the licheniformis from the subtilis reads, but it seems like you would end up with virtually nothing left. This might be a situation where you don't actually have contamination (it's rare for contamination to be the vast majority of an isolate library) but rather the organism you are sequencing is not what you expect, but some species of Bacillus that is not in the BLAST database, but kind of related to licheniformis and subtilis. It's really hard to say. I suggest you assemble and look at the size of the resulting assembly and number of single-copy genes present to see whether the library appears to have one organism or two organisms in it.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Yes, we considered the possibility that perhaps what we sequenced was not what we expected to get.

However, I still need to check for contamination.

To eliminate the possiblity that a single organism was responsible for cross contamination across 20 samples, couldn't I also do the following:

separate b. licheniformis reads from b. subtilis reads
assemble b. licheniformis reads into contigs
align each samples' contigs against other samples' contigs using software like BWA-MEM
if a carryover contamination of one organism across multiple samples did occur, then I would expect to see 100% alignment between the suspicious samples?

Or is that approach not good for my purpose? Thanks again. I'm new to this field and appreciate your advice.

ADD REPLY • link 7.8 years ago by jerrybug109 ▴ 10

0

Entering edit mode

Question you need to ask is how far apart are B. subtilis and B. licheniformis evolutionarily? If they are close relatives then you may be seeing false positives with blast. If you were to make independent DB of the two organisms then all your reads may align well to a large extent to both.

Perhaps there is no contamination at all. Have you considered this possibility?

Why did you go looking with blast in first place? Was there a red flag of some kind?

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

I did blast because after we got the reads back from the core facility I assembled their genomes & aligned contigs to B. subtilis reference with the intention to do some comparative genomic analysis on our 40 samples (they are different strains of B. subtilis, or so we thought at the time). But after 20 of the strains aligned with low % identity to B. subtilis reference genomes, I then tried to blast the contigs to check for species identity.

It's very possible that there's no contamination at all and that half our samples are B. licheniformis. I'll have to look into this possibility too. Thanks.

ADD REPLY • link 7.8 years ago by jerrybug109 ▴ 10

0

Entering edit mode

You can try this. There's no single recipe for solving contamination in general. But as Genomax indicated, you may indeed not have contamination. Also, separating the reads works best when you are absolutely sure what species each organism is; in this case we are not really sure.

Another way to look for contamination is to look at kmer-frequency histograms; bacteria, which are haploid and not very repetitive, will have most of their kmers in a single primary peak. A bacteria contaminated with another bacteria, assuming they are at different coverage levels, will typically have two primary peaks. With the BBMap package you can generate a histogram like this:

kmercountexact.sh in=reads.fq khist=khist.txt

...then make a scatter plot it and look at it in a log-log scale in something like Excel. But if the two species are very closely related, this method won't work very well (in fact, no method will work very well in that case).

I still also advise you to assemble and look at the genome size and number of single-copy genes; that's a very good way of determining contamination.

ADD REPLY • link 7.8 years ago by Brian Bushnell 20k

1

Entering edit mode

@jerrybug109: You could take the reference genomes of the two bacilli, use randomreads.sh from BBMap (randomreads.sh ref=genome.fasta out=reads.fq len=NNN reads=NNNNNNN), mix them and then run kmercountexact.sh to see what a control plot looks like. Comparing it to your own data should be somewhat revealing. At worst you can convince yourself that you can't tell the contamination apart this way.

ADD REPLY • link 7.8 years ago by GenoMax 141k