K-Mer Based Sequencing Contamination Detection
4
12
Entering edit mode
12.2 years ago
Darked89 4.2k

In a plant genome project I got a draft assembly (> 500Mbp, >500k contigs). A number of contigs is no doubt bacterial in origin.

There are at least 3 peaks when it comes to GC content (40% - my plant, 50% largest contig, 65-70% another group).

Blastn takes ages, and there is no point of doing it every time we change assembler parameters even slightly. So while rather sooner than later I will have to split 454 sff files into my_plant vs not_my_plant, I will still need a faster method of classifying contigs to not_my_plant group.

In metagenomics this is often being done by calculating k-mer frequencies, see i.e (not supported anymore) TETRA: http://www.megx.net/tetra/ (see the manual for the algorithm)

Do you use any program for fast clustering/classification of sequences from say 150bp to 1Mbp using k-mer frequencies?

sequencing • 8.1k views
ADD COMMENT
9
Entering edit mode
12.2 years ago

I would recommend R/Bioconductor to do these kinds of analysis even though I personally doubt that this method is precise enough to separate the reads correctly. You can find the function oligonucleotideFrequency in the Biostrings package. The code for the first step would look somewhat like this:

library(Biostrings)
reads = read.DNAStringSet("yourReads.fas", format="fasta")
nf = oligonucleotideFrequency(reads[1:100], width=4)
hclust(dist(nf)) # do hierarchical clustering of your tetra freq.

That would be a very simple form of clustering. Then you have all the powerful classification algorithms built in R available, for example a support vector machine classifier. Create a training and test set of reads from 2 or more sequenced genomes and mix them. Then you will see if it is possible.

But if you look at your frequencies it might look like this:

     TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT
[1,]    0    0    0    0    0    0    0    0    1    0    0    1    2    0    0    4    0    0
[2,]    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
[3,]    0    0    0    1    1    1    0    0    0    1    0    0    0    0    0    1    0    0
[4,]    0    0    1    1    0    1    0    2    0    0    1    0    2    1    0    0    1    1
[5,]    0    1    0

So, lots of 0 or 1. Maybe not enough to classify correctly. That's from some 454 reads as an example and it seems that one should try di- and tri- nucleotides as well.

Alternative: blastx on the individual reads and discard only those with good best hit to a bacterium. A few wrong reads should do no big harm, so it is maybe good not to risk to filter out too many beforehand.

ADD COMMENT
1
Entering edit mode

Thank you, that's a very useful tip.

It is quite fast to count all 4-mers in my draft sequence. I think I have to count them on a reverse strand and add it to forward counts (done). For implementing TETRA-measure I will start with RPy.

ADD REPLY
4
Entering edit mode
11.5 years ago
Monzoor ▴ 300

You can try out this site for the problem you have.

The software available at this site helps you in separating eukaryotic sequences from prokaryotic sequences without the need for blast alignment. It can analyze a million sequences in roughly an hour.

ADD COMMENT
2
Entering edit mode
11.5 years ago
Bach ▴ 550

I'd do it the other way round.

  1. Find contaminated contigs: take the contigs of one of your assemblies and screen those against a microbial database. Check the contigs which come out with a significant hit and, once validated it's not some spurious error, put them in a separate file.
  2. Find contaminated reads: using the validated contigs from above, simply perform a mapping assembly of your 454 reads against those. Use stringent alignment parameters (perhaps something like 95% identity). The assembler will tell you which reads mapped and which didn't.

Well, those reads which mapped are the ones you want to exclude in the future.

One last thing: be very careful ... more often than not there's been gene transfer between plants and bacteria living close to them. This will probably lead to some perfectly valid plant reads to be wrongly sorted out.

ADD COMMENT
1
Entering edit mode

Hi Bach, the approach that you had suggested still involves alignment. Better to use Eu-Detect.

This is an algorithm that does not require any alignment at all. Test it yourself using a few sequences from eukaryotes and prokaryotes and do let me know your feedback.

ADD REPLY
1
Entering edit mode
11.5 years ago

Though I agree with others that this may not be the best strategy to solve your problem, for fast and efficient computation of k-mers on large sequence databases, try Tallymer from the Kurtz lab. Tallymer is a part of the GenomeTools package, which compiles and runs very cleanly and has a number of other very nice algorithms for genome analysis. Of course this doesn't fully solve your problem, but Tallymer should allow you to quickly generate various k-mer indices and counts as input data for a clustering/classification method.

See also paper here

ADD COMMENT

Login before adding your answer.

Traffic: 1184 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6