Question: Removing bacterial contamination from mouse gut sequences with metagenomic techniques
Assa Yeroslaviz (Munich) wrote, 4.3 years ago:

Hi,

We have multiple samples from the gut/intestinal tract of mice, and we suspect there is contamination from bacterial genome(s) in the data set, as only a very low fraction of the reads maps to the mouse genome.

What methods can I use to map the FASTQ files (or FASTA after conversion) against other, unknown bacterial genomes? I would prefer not to BLAST all the reads of all the samples.

As we don't know exactly which bacterial genomes we have, I can't really download all of them.

Any ideas?

thanks

Assa

modified 4.3 years ago by Josh Herr • written 4.3 years ago by Assa Yeroslaviz

Why do you think there is contamination? When you look at metagenomic data, bacterial reads are mostly what you'd expect to see; there wouldn't be much mouse genome unless your mice are seriously sick and have pieces of their gut in the poop. Anyway, even if there is bacterial contamination, if you have no idea which bacteria it is you can hardly remove the bacterial reads, because there is no way to distinguish the real ones from the contaminants. Of course, you can map your reads only to known genomes, but how do you know the contaminant is necessarily an unknown bacterium?

written 4.3 years ago by marina.v.yurieva

We didn't start with a bacterial genome. The biologist did a normal RNA-Seq experiment on the gut/intestine of mice. We have tried to map the data, but with very little success (~30-50%).

We now suspect the reason might be that the RNA cleanup wasn't as thorough as expected, and bacterial material was also sequenced in the process.

written 4.3 years ago by Assa Yeroslaviz

Did you guys do an rRNA depletion step?

written 4.3 years ago by marina.v.yurieva

Yes, we did - for some of the samples even twice!

written 4.3 years ago by Assa Yeroslaviz

What did you map those 30-50% of the reads to? I still don't get why you think it's contamination, but have you tried taking a subset of the unmapped reads, BLASTing it, and seeing what it maps to, if it maps at all?

written 4.3 years ago by marina.v.yurieva

I mapped the FASTQ files to the mouse genome (using TopHat2) and got only 30-50% of the reads mapped.

Then I tried first mapping to only the mouse rRNA sequences, took the unmapped reads, and used TopHat2 again to map them to the mouse genome. Here too I got only ~50-60% success, which is still very low. As the sequences were extracted from gut samples, we thought there might be some bacterial residue in the total RNA samples, which was also sequenced.
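As a quick sanity check on mapping rates like these, the mapped percentage can be pulled out of `samtools flagstat` output with a few lines of Python. A minimal sketch; the flagstat line format is assumed from samtools 1.x, and the example string is hypothetical:

```python
import re

def mapped_percent(flagstat_text):
    """Extract the mapped-read percentage from `samtools flagstat`
    output (line format assumed: 'N + 0 mapped (52.30% : N/A)')."""
    m = re.search(r"\d+ \+ \d+ mapped \(([\d.]+)%", flagstat_text)
    return float(m.group(1)) if m else None

# Hypothetical flagstat line for a run with ~52% of reads mapped:
print(mapped_percent("20000000 + 0 mapped (52.30% : N/A)"))  # → 52.3
```

Comparing this number before and after the rRNA-first pass makes it easy to see how much each filtering step actually recovers.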

written 4.3 years ago by Assa Yeroslaviz

Another reason is that we see an abnormal GC distribution:

[GC content plot]

written 4.3 years ago by Assa Yeroslaviz

Wait, are you doing metagenomics/metatranscriptomics, or mouse RNA-Seq? If the latter, I'm surprised that so many of your reads map to the mouse, because in a gut sample most of the material IS bacterial, and usually people use such samples to analyse bacteria. Is it a stool sample or an actual tissue sample from the gut?

written 4.3 years ago by marina.v.yurieva

Yes, the project is RNA-Seq; the samples are from the colon, spleen and ileum. There shouldn't be any bacterial genomic residue in the data at all.

written 4.3 years ago by Assa Yeroslaviz

Oh, okay. Then you probably shouldn't title your post "metagenomics"; that's confusing. Check what the unmapped reads BLAST to, and also see whether quality trimming improves the mapping.

written 4.3 years ago by marina.v.yurieva

I called it metagenomics because I would like to try a metagenomic analysis of the files, to see whether we have any bacterial genomic residue in the data.

Sorry if it was a bit confusing.

modified 4.3 years ago • written 4.3 years ago by Assa Yeroslaviz

Really dumb question, but what are the quality and length of your reads? If you have a problem there, trimming might help. Can you link a FastQC report here?

written 4.3 years ago by cyril-cros

Here is the link to the report (fastqc_data) of the untrimmed file, and of the same file after trimming, filtering and cutting as much as I can afford.

Here are the links to the HTML files (converted to PDF) of the trimmed and untrimmed FASTQ files.

Hope this helps.

written 4.3 years ago by Assa Yeroslaviz

Ok, this site http://bionumbers.hms.harvard.edu/bionumber.aspx?&id=102409&ver=6 gives around 47% average GC content for Mus musculus. With your trimmed reads, you are at 64%. I don't know if my reasoning is valid, but if you assume contamination by bacteria with, say, a solid 80% average GC content, then a bit less than 50% of your sequences would come from Mus musculus, at best. You would have 10-20M mouse reads (of a good length, granted; I don't know if you are doing paired-end either). I don't know if it is still worth it for you.

I am no expert on GC content - if anyone knows better, please reply, it would be interesting for me. Just doing:

mouseDNA * GC_mouse + (1 - mouseDNA) * GC_bact = average_GC

(with mouseDNA the fraction of reads from mouse). I assume a very high GC for bacteria - the higher it is, the less bacteria there must be.

EDIT: Just noticed that my smart-ass math is exactly the same as saying "only 30 to 50% of the reads align to the mouse genome"... sigh.
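The mixture equation above can be rearranged to estimate the mouse read fraction directly. A minimal sketch, assuming the 47% mouse and 80% bacterial GC figures from the comment:

```python
# Solve the two-component GC mixture for the mouse read fraction f:
#   f * GC_mouse + (1 - f) * GC_bact = GC_observed
# => f = (GC_bact - GC_observed) / (GC_bact - GC_mouse)

def mouse_fraction(gc_observed, gc_mouse=0.47, gc_bact=0.80):
    """Estimate the fraction of reads from mouse, assuming a
    two-source mixture with known average GC contents."""
    return (gc_bact - gc_observed) / (gc_bact - gc_mouse)

# With the observed 64% GC from the trimmed reads:
print(round(mouse_fraction(0.64), 3))  # → 0.485, i.e. just under half
```

As the EDIT notes, this is only a back-of-the-envelope consistency check against the observed mapping rate, and it is very sensitive to the assumed bacterial GC value.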

modified 4.3 years ago • written 4.3 years ago by cyril-cros

You don't need to BLAST all the reads; just do a few thousand, to indicate which bacterial references you need to download. Then map all reads to mouse and those references at the same time.
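A few-thousand-read subset like this can be drawn without loading the whole file. A minimal reservoir-sampling sketch in Python; the file name is hypothetical:

```python
import random

def subsample_fastq(path, n=5000, seed=42):
    """Reservoir-sample n reads (4-line records) from a FASTQ file,
    so memory use stays constant regardless of file size."""
    random.seed(seed)
    reservoir = []
    with open(path) as fh:
        # zip(fh, fh, fh, fh) yields one 4-line FASTQ record per step.
        for i, record in enumerate(zip(fh, fh, fh, fh)):
            if i < n:
                reservoir.append(record)
            else:
                j = random.randrange(i + 1)
                if j < n:
                    reservoir[j] = record
    return reservoir

# Write the subset out as FASTA for BLAST (hypothetical file name):
# for header, seq, _, _ in subsample_fastq("unmapped.fastq"):
#     print(">" + header[1:].strip())
#     print(seq.strip())
```

The taxa hit by the BLASTed subset then tell you which reference genomes are worth downloading for the combined mapping.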

written 4.3 years ago by Brian Bushnell
Josh Herr (University of Nebraska) wrote, 4.3 years ago:

There are numerous ways to parse out contaminant reads -- none of them will be perfect.

I would first parse by GC content, then pull your reads out by mapping (BWA or Bowtie) to the mouse genome. This would be much faster and more accurate than using all the bacterial (microbial) genomes.

In addition to mapping, you can also BLAST all the reads and parse the results with MEGAN.

written 4.3 years ago by Josh Herr
h.mon (Brazil) wrote, 4.3 years ago:

You may assemble the reads, BLAST the contigs to identify taxa, and get the average coverage and GC content of each contig (I used blobology for this; there are other papers/programs with a similar approach). If desired, you can then filter the reads, keeping just those that map to the contigs of interest.
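The per-contig GC and coverage values for such a blobology-style plot can be computed with a small helper. A sketch, assuming the contig sequences and per-base depths (e.g. from `samtools depth`) are already loaded into dictionaries:

```python
def contig_stats(contigs, depths):
    """contigs: {name: sequence}; depths: {name: [per-base depth]}.
    Returns {name: (gc_fraction, mean_coverage)}, the two axes of a
    blobology-style GC-vs-coverage plot."""
    stats = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        cov = sum(depths[name]) / len(depths[name])
        stats[name] = (gc, cov)
    return stats
```

Contigs from different organisms tend to form separate clusters in that GC-vs-coverage plane, which is what makes the taxon assignment and subsequent read filtering possible.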

written 4.3 years ago by h.mon

Sorry for chiming in. I have a similar problem, although with metatranscriptomics data: I'd like to separate eukaryotic from prokaryotic metatranscriptomic reads. Do you know whether blobology can help me do this? And could you please point me to other similar programs? Thank you.

written 4.3 years ago by sentausa
stolarek.ir (Poland) wrote, 4.3 years ago:

Actually, you have to download the whole database and do the alignment. Even so, most of the reads won't be identified, or will map spuriously to different organisms. The databases of bacterial genomes are not as comprehensive as we would like them to be. I have a similar situation mapping ancient DNA, where often 99% of the reads are bacterial, and possibly some of them come from ancient, extinct bacteria, so there is no way any reference exists for them.

written 4.3 years ago by stolarek.ir

Thanks, but which database should I use?

written 4.3 years ago by Assa Yeroslaviz