What distance metric to use for clustering fragments of genes, like 16S rRNA, or DNA fragments?
9.1 years ago
vmirjalily ▴ 10

I am trying to apply different clustering algorithms to sequence data with lengths of about 240-260. The sequences mainly come from next-generation sequencing technology. For any clustering (OTU) method, we need some notion of distance between points/objects to assess how far apart or how close the sequences are. So far I have been using the edit distance for this purpose, but now I want to use a more biologically relevant distance!

I have found some measures such as Kimura distance, Gamma distance, ... but among all these distances I don't know which one would fit the type of data that I have! Is Kimura applicable to 16S rRNA sequences? Do you have any other suggestions, or papers that review the application of these evolutionary distance measures to RNA/DNA fragments (not the whole genome)?
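For reference, here is a minimal Python sketch of two of the simpler alignment-based distances: the uncorrected p-distance (fraction of mismatched sites) and the Kimura 2-parameter (K2P) correction. The function names are made up for illustration, and the choice to skip any column that contains a gap or an ambiguity code is an assumption; tools differ in how they count gaps, which matters for gap-rich alignments like the example sequence in this question.

```python
import math

VALID = set("ACGU")          # T is mapped to U before comparison
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def _aligned_pairs(a, b):
    """Return aligned base pairs, skipping columns with a gap or ambiguity code."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    a = a.upper().replace("T", "U")
    b = b.upper().replace("T", "U")
    pairs = [(x, y) for x, y in zip(a, b) if x in VALID and y in VALID]
    if not pairs:
        raise ValueError("no comparable columns")
    return pairs


def p_distance(a, b):
    """Uncorrected distance: proportion of compared sites that differ."""
    pairs = _aligned_pairs(a, b)
    return sum(x != y for x, y in pairs) / len(pairs)


def k2p_distance(a, b):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    pairs = _aligned_pairs(a, b)
    transitions = transversions = 0
    for x, y in pairs:
        if x == y:
            continue
        # A<->G and C<->U are transitions; every other mismatch is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(pairs)
    q = transversions / len(pairs)
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)


# Hypothetical toy alignment: the gapped column is ignored by both functions.
print(p_distance("A-AAAAAAAAA", "GCAAAAAAAAA"))
print(k2p_distance("A-AAAAAAAAA", "GCAAAAAAAAA"))
```

For closely related sequences the two values are nearly the same; the correction only grows noticeable as the sequences diverge.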

Below, I have provided some detailed information about the dataset that I am studying:

One example sequence from the dataset:

>M02127_29_000000000-A9TRU_1_1106_3986_10247
UAC--GG-AA-GGU---CCG-G-G-C-G-U--U--AU-C-CGG-AU----UU-A--U-U--GG-GU---UU-A----AA-GG-GA-GC--G-UA-G-G-C-C-G--G-UC-U-U-U---AA-G-C-G-U--G-C-C-G--UG--A-AA-UU-U-U-GU-G-G--CU-C-AA-C-C-A-U-G-A-G-A-G--U-G-C-G-G-C-G--CGA-A-CU-G-G--AG-A-C-C-U-U-G-A-G-U--G-C-GC--GG-A-A-G-G-C-A--GG-C--GG-A--AUU--CG-U-G-GU--GU-A-G-CG-GU-G-A-A-A-UG-C-UU-AG--AU-A-UC-A-C-G-A-A-G-A-AC-C-CC--GA-U-U-GC-GAA-GG-C-A-G--C-C-U-G--CCG-C--AG-C-G-U-U-----A-C-U--GA--CG-C-U-G-A-AG-C-U-CG-A--AA-G-C-G-CG--GG-U--AU-C-G-AA-CAGG

Here are the taxonomy levels for 10 of the sequences that I have:

Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Bacteria Firmicutes Clostridia Clostridiales unclassified unclassified
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Rikenellaceae Alistipes
Bacteria unclassified unclassified unclassified unclassified unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae unclassified
Bacteria unclassified unclassified unclassified unclassified unclassified
next-gen-sequencing sequence • 3.3k views

I am confused about what your actual question is. You mention you want to cluster 16S sequences, but then you also refer to Kimura and Gamma distances, which are typically used as evolutionary measures of presumed sequence change. These are different analyses that address different research questions.

Can you tell us something about your research question and where your data comes from (what type it is -- mixed sample, single bacterial genome)? This will help us guide you toward what you want to do.


Hi Josh, thanks for your comment! I have updated my question and provided an example sequence and some taxonomy levels from my data.

I want to use a proper distance measure for clustering biological sequences into OTUs. I want to know which of these distance metrics is suitable for the kind of data that I have!

9.1 years ago
Josh Herr 5.8k

I'm still not entirely clear on what you want to do. It looks like you already have a taxonomy list, which is typically one of the numerous outputs from OTU clustering.

There are a lot of methods to cluster sequences into OTUs -- I use usearch, but I would also recommend swarm (which was written by Frédéric Mahé), mothur (DOTUR is the clustering program inside, from Pat Schloss), and cd-hit, which is an oldie but still works. There are many more I haven't mentioned here.

Depending on what you want to do, you might find these tutorials handy (they are from a course I co-instruct): http://edamame-course.org/


Yes, I am familiar with MOTHUR, DOTUR, ... and in fact what I want to know is which distance metric MOTHUR uses for building clusters/OTUs.

So here is a line from the DOTUR paper (abstract): "We present a method that addresses the challenge of assigning sequences to operational taxonomic units (OTUs) based on the genetic distances between sequences." What exactly is the genetic distance between sequences? Is it Kimura, Tamura?
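To illustrate how pairwise distances turn into OTUs, here is a small Python sketch that groups sequences whenever their distance is at or below a cutoff (nearest-neighbor, i.e. single-linkage, clustering). The function name, the 0.03 default cutoff, and the toy distances are assumptions for illustration only; this is not the implementation of DOTUR, mothur, or any other tool, and the distance function plugged in could be an uncorrected or a corrected one.

```python
from itertools import combinations


def cluster_otus(names, dist, cutoff=0.03):
    """Single-linkage OTUs: merge two groups when any pair is within the cutoff."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if dist(a, b) <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    otus = {}
    for name in names:
        otus.setdefault(find(name), []).append(name)
    # Largest OTUs first, ties broken by member names.
    return sorted(otus.values(), key=lambda members: (-len(members), members))


# Hypothetical pairwise distances for four reads.
toy = {
    ("s1", "s2"): 0.01, ("s1", "s3"): 0.10, ("s1", "s4"): 0.20,
    ("s2", "s3"): 0.02, ("s2", "s4"): 0.20, ("s3", "s4"): 0.20,
}


def toy_dist(a, b):
    return toy[(a, b)] if (a, b) in toy else toy[(b, a)]


print(cluster_otus(["s1", "s2", "s3", "s4"], toy_dist))
# prints [['s1', 's2', 's3'], ['s4']]
```

Note that s1 and s3 end up in the same OTU even though they are 0.10 apart, because s2 links them; furthest-neighbor or average-neighbor linkage would split them differently, so the linkage rule matters as much as the distance itself.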

(maybe my terminology is not compatible with what is used in biology, sorry about that!)

PS. I am also a student at MSU! That summer course is certainly useful for me, so I will try to register for the course in summer!


My understanding is that mothur/DOTUR uses a criterion devised specifically for mothur -- so it's neither Kimura nor Gamma, etc. I checked the mothur OTU forum and don't see anything specific regarding the genetic distance metrics, so you'll have to read further to find more information. You can always post a question to the mothur help forum -- both Pat and Sarah are extremely helpful and diligent in addressing questions.
