Question

Removing bacterial contamination of mouse gut sequences through metagenomic techniques

1

Entering edit mode

10.2 years ago

Assa Yeroslaviz ★ 1.9k

Hi,

We have multiple samples from the gut/intestine track of mice and we think that there is a contamination from bacterial genome(s) in the data set, as only a very low number of genes is mapped to the mouse genome

What methods can I use to map the fastq files (or even fastA after conversion) to other, unknown bacterial genomes. I prefer not to BLAST all the reads of all the samples.

As we don't know exactly which bacterial genomes we have, I can't really download all of them.

Any ideas?

Thanks

Assa

metagenomics de-novo blast RNA-Seq • 6.5k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Why do you think there is a contamination? When you look at metagenomic data, bacterial reads is what you mostly suppose to see, there wouldn't be much mouse genome, unless your mice are seriously sick and have pieces of their gut in the poop. Anyway, even if there is a bacterial contamination but you have no idea which bacteria it is you can pretty much remove all the bacterial reads because there is no way to distinguish the real ones from the contaminants. Of course, you can map your reads only to known genomes but how do you know that contaminant is necessarily unknown bacteria?

ADD REPLY • link 10.2 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

we didn't start with bacterial genome. The biologist did a normal RNA-Seq experiment of the gut/intestine of mice. We have tried to map the data, but with very little success (~30-50%).

We now would like to know the reason for that might be due to the fact, that cleaning the RNA wasn't as good as expected and the bacterial genome was also sequenced in the process.

ADD REPLY • link 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Did you guys do rRNA depletion step?

ADD REPLY • link 10.2 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

yes we did. for some of the sample even twice!

ADD REPLY • link 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

What did you map those 30-50% of the reads to? I still don't get why do you think it's contamination but have you tried to take a subset of the unmapped reads, blast it and see what does it map to if it maps at all?

ADD REPLY • link 10.2 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

I have mapped the fastq files to the mouse genome and got only 30-50% of the reads mapped (using tophat2).

Than I tried to first map it to only the mouse rRNA sequences, took the unmapped reads and used tophat2 again to map them to the mouse genome. Here also I got ~50-60% success. But this is still very low. As the sequences were extracted from gut samples, we were thinking, there might be some bacterial residues in the total RNA samples, which were also sequenced.

ADD REPLY • link 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Another reason is the fact that we see a GC distribution, which is not normal.

image: GC content

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Wait, are you doing metagenomics/metatranscriptomics or trying to do mouse RNA-seq? If it's the last one, I'm surprised that so many of your reads are mapped to the mouse because in the gut sample most of stuff IS bacteria and usually people use it to analyse bacteria. Is it a stool sample or an actual sample from the gut?

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

Yes the project is RNASeq, the samples are from the colon, spleen and ileum. There shouldn't be any bacterial genome residues in the data at all.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Oh, okay. Then you probably shouldn't name your post "metagenomics", that's confusing. Look what do the unmapped reads blast to and also see if quality trimming improves the mapping.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by marina.v.yurieva ▴ 580

0

Entering edit mode

I called it metagenomics, because I would like to try a metagenomics analysis of the files to see if we have any bacterial genomic residues in the data.

Sorry, if it was a bit confusing.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Really dumb question, but what is the quality/length of your reads? If you have a problem there trimming might help. Can you link here a FastQC report?

ADD REPLY • link 10.2 years ago by cyril-cros ▴ 950

0

Entering edit mode

here is the link to the report (fastqc_data) of the untrimmed file and the same file after trimming, filtering and cutting as much as I can afford.

here are the links to the html files (converted to pdf) of the trimmed and untrimmed fastq files.

hope this helps.

ADD REPLY • link 10.2 years ago by Assa Yeroslaviz ★ 1.9k

0

Entering edit mode

Ok, this site gives us around 47% average GC content for mus_musculus. With your trimmed reads, you are at 64%. I don't know if my reasoning is valid, but if you assumed contamination by bacteria with say, a solid 80% average GC content, a bit less than 50% of your sequences would come from mus_musculus, at best. You will have 10-20M mouse reads (of a good length granted; I don't know if you are doing paired-end either). I don't know if it is still worth it for you.

I am no expert on GC content - if anyone knows better, please reply, it would be interesting for me. Just doing: mouseDNA*GC_mouse+(1-mouseDNA)*GC_bact= average_GC (with mouseDNA the percentage of reads from mice). I assume a very high GC for bacteria - the higher it is, the less bacteria there must be.

EDIT: Just noticed that my smart-ass math is exactly the same as saying "30 to 50% reads only are aligned to the mouse genome" .....SIGH....

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by cyril-cros ▴ 950

0

Entering edit mode

You don't need to BLAST all the reads; just do a few thousand, to indicate which bacterial references you need to download. Then map all reads to mouse and those at the same time.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

Ram · Answer 1 · 2015-05-22

There are numerous ways to parse contaminant reads -- no way to parse the reads will be perfect.

I would first parse by GC content, then pull your reads out by mapping (BWA or bowtie) to the mouse genome. This would be much faster and more accurate than using all the bacterial (microbial) genomes.

In addition to mapping, you can also BLAST all the reads and parse with MEGAN.

Ram · Answer 2 · 2015-05-22

1

Entering edit mode

10.2 years ago

h.mon 35k

You may assemble the reads, blast to identify taxon, get average coverage and GC content of each contig (I used blobology for this, there are other papers/programs with a similar approach). If desired, then you can filter the reads mapping just to the contigs of your interest.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 10.2 years ago by h.mon 35k

0

Entering edit mode

Sorry for chiming in. I have a similar problem, although with metatranscriptomics data. I'd like to separate eukaryotic from prokaryotic metatranscriptomic reads. Do you know if blobology can help me to do this? And could you please tell me other similar programs? Thank you.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 10.1 years ago by sentausa ▴ 650

score 0 · Answer 3 · 2015-05-22

0

Entering edit mode

10.2 years ago

stolarek.ir ▴ 700

Actually you have to download the whole database and do the alignment. Even so, most of the reads won't be identified, or will be spurious by mapping to different organisms. The databases of bacterial genomes are not so huge as we would like them to be. I have similar situation mapping ancient DNA, where often 99% of the reads are bacterial, and possibly some of them are ancient extinct bacteria, so there is no way, that any reference exists for them.