Question: How to differentiate between "horizontal gene transfer" and "contamination" in NGS data?
I am looking for a suitable approach to filter "contamination" out of a genome that is known to contain HGT. Any ideas and approaches are welcome.

Thanks for your help.

I suppose you are talking about DNA-seq data, right?

In the case of contamination, you can expect a roughly uniform distribution of the contaminant reads across the contaminant's genome, whereas with a true HGT event only one or a few extra genes will be represented.

A possible approach to distinguish the two cases is to:

  1. Identify the origin of the contaminant/HGT using BLAST (for instance, E. coli).
  2. Take the reads that don't map to the genome of your model organism and map them to the genome of that possible contaminant.
  3. Look at the distribution of the mapped reads (for instance, with IGV). If the coverage is more or less uniform across the whole genome, it is probably contamination; if you see only one or a few spikes, it probably comes from HGT.
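
Step 3 can also be checked numerically instead of by eye. Below is a minimal sketch, assuming you have per-base depths for the candidate contaminant genome, e.g. from `samtools depth -a` (tab-separated: reference, position, depth). The function names, the window size and the thresholds (`min_breadth`, `max_cv`) are illustrative assumptions, not established cut-offs, and should be tuned to your data.

```python
from statistics import mean, pstdev


def read_depths(path):
    """Read the depth column (3rd field) from `samtools depth -a` output.

    Assumes a single sample; depths of all reference sequences are
    concatenated in file order.
    """
    with open(path) as handle:
        return [int(line.split("\t")[2]) for line in handle if line.strip()]


def coverage_summary(depths, window=1000):
    """Summarise how evenly reads cover the candidate contaminant genome."""
    if not depths:
        raise ValueError("depths must not be empty")
    # Breadth: fraction of positions covered by at least one read.
    breadth = sum(1 for d in depths if d > 0) / len(depths)
    # Mean depth per non-overlapping window, then its coefficient of variation.
    windows = [mean(depths[i:i + window]) for i in range(0, len(depths), window)]
    avg = mean(windows)
    cv = pstdev(windows) / avg if avg > 0 else float("inf")
    return {"breadth": breadth, "window_cv": cv}


def classify(summary, min_breadth=0.5, max_cv=1.0):
    """Heuristic call; both thresholds are assumptions for illustration."""
    if summary["breadth"] >= min_breadth and summary["window_cv"] <= max_cv:
        return "contamination-like"
    return "HGT-like or inconclusive"


if __name__ == "__main__":
    uniform = [10] * 10_000               # reads spread over the whole genome
    spiky = [0] * 9_000 + [50] * 1_000    # reads pile up on one region only
    print(classify(coverage_summary(uniform)))
    print(classify(coverage_summary(spiky)))
```

With real data you would call `coverage_summary(read_depths("depth.txt"))`, where `depth.txt` is a hypothetical file name. Low sequencing depth can make contamination look patchy, so treat the result as a hint and still inspect the coverage visually.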
A tale of two tardigrades

The discussion about contamination vs. HGT is still in flux. You might want to have a look at the following interesting case in PNAS on real or false HGT in tardigrades:

  • Boothby TC, et al. (2015) Evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc Natl Acad Sci USA 112(52):15976–15981.
  • Koutsovoulos G, et al. (2016) No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc Natl Acad Sci USA 113:5053–5058.

In my opinion the evidence has shifted towards no HGT and towards contamination in this case.

In the following commentary, some guidelines for distinguishing contamination from HGT have been given:

  • Richards TA, Monier A (2016) A tale of two tardigrades (Commentary). Proc Natl Acad Sci USA 113(18):4892–4894; published ahead of print April 15, 2016, doi:10.1073/pnas.1603862113
First of all, you are more likely to see contamination than true HGT.

If it is contamination, you have to distinguish two cases: 1) contamination happened before preparation of the sequencing library, or 2) contamination happened after library preparation (your library was cross-contaminated with another library, most likely while loading the sequencer). Each case has characteristic features. If the contaminant is a fast-growing bacterium, you should mainly see its high-copy-number genes, which are rRNA genes and plasmid-borne genes. If the contaminant is another library, that library may have an insert size distinct from your main library; see "Plant viruses sequences are found in human brain Rna-seq sample: how to evaluate it?"
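
The insert-size comparison can be sketched as follows. This is a minimal illustration, assuming you have already extracted the absolute template lengths (the TLEN field, column 9 of a SAM record) for properly paired reads from the main library and from the suspicious reads. The function name and the idea of scaling the shift by the median absolute deviation (MAD) are my own choices for the example, not a standard test.

```python
from statistics import median


def insert_size_shift(main_sizes, suspect_sizes):
    """Shift of the suspect median insert size, in units of the main library's MAD.

    Values near 0 mean both read sets look like the same library;
    large absolute values hint at a different library.
    """
    main_med = median(main_sizes)
    # Median absolute deviation of the main library as a robust spread estimate.
    mad = median([abs(s - main_med) for s in main_sizes])
    scale = mad if mad > 0 else 1.0  # avoid division by zero
    return (median(suspect_sizes) - main_med) / scale


if __name__ == "__main__":
    main = [300, 310, 290, 305, 295]   # toy insert sizes of the target library
    suspect = [500, 510, 490]          # toy insert sizes of the suspicious reads
    print(insert_size_shift(main, suspect))
```

A small shift does not rule out cross-contamination, since two libraries can be prepared with the same protocol; only a clear difference is informative.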

You can identify true HGT if you find a chimeric contig, where a stretch of foreign DNA sequence is inserted into your target genome and is flanked by sequences of your target organism at BOTH ends. The whole contig should show rather uniform read coverage, especially across the insertion sites.
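
The two criteria above (host flanks on both sides, no coverage break at the junctions) can be sketched in code. This is a rough illustration under stated assumptions: `segments` is a hypothetical list of `(start, end, label)` tuples along one contig, e.g. derived from BLAST hits, with 0-based half-open coordinates, and `depths` holds the per-base read depth of that contig. The flank width and `max_ratio` are arbitrary example values.

```python
from statistics import mean


def foreign_inserts(segments, host="host"):
    """Return (start, end) of foreign segments flanked by host on BOTH sides."""
    hits = []
    for i in range(1, len(segments) - 1):
        start, end, label = segments[i]
        if (label != host
                and segments[i - 1][2] == host
                and segments[i + 1][2] == host):
            hits.append((start, end))
    return hits


def junction_is_smooth(depths, pos, flank=100, max_ratio=2.0):
    """True if mean depth left and right of `pos` differs by at most max_ratio."""
    left = depths[max(0, pos - flank):pos]
    right = depths[pos:pos + flank]
    if not left or not right:
        return False
    low, high = sorted((mean(left), mean(right)))
    return low > 0 and high / low <= max_ratio


def supports_hgt(segments, depths, host="host"):
    """A contig supports HGT if some flanked insert has two smooth junctions."""
    return any(
        junction_is_smooth(depths, start) and junction_is_smooth(depths, end)
        for start, end in foreign_inserts(segments, host)
    )


if __name__ == "__main__":
    segs = [(0, 500, "host"), (500, 800, "foreign"), (800, 1500, "host")]
    even = [30] * 1500
    dip = [30] * 500 + [2] * 300 + [30] * 700  # coverage drops inside the insert
    print(supports_hgt(segs, even))
    print(supports_hgt(segs, dip))
```

A coverage break at the junctions, as in the second toy case, is more consistent with a misassembly that joins host and contaminant sequence than with a real insertion.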

