Question: What distance metric to use for clustering fragments of genes, like 16S rRNA, or DNA fragments?
vmirjalily (United States) wrote, 4.1 years ago:

I am trying to apply different clustering algorithms to sequence data with lengths of about 240-260. The sequences mainly come from next-generation sequencing technology. For any clustering (OTU) method, we need some notion of distance between sequences to assess how far apart or close together they are. So far I have been using the edit distance for this purpose, but now I want to use a more biologically relevant distance!

I have found some measures such as Kimura distance, Gamma distance, ... but among all these distances I don't know which one would fit the type of data that I have! Is Kimura applicable to 16S rRNA sequences? Do you have any other suggestions, or papers that review the application of these evolutionary distance measures to RNA/DNA fragments (not the whole genome)?

Below, I have provided some detailed information about the dataset that I am studying.

One example sequence in the dataset:

>M02127_29_000000000-A9TRU_1_1106_3986_10247
UAC--GG-AA-GGU---CCG-G-G-C-G-U--U--AU-C-CGG-AU----UU-A--U-U--GG-GU---UU-A----AA-GG-GA-GC--G-UA-G-G-C-C-G--G-UC-U-U-U---AA-G-C-G-U--G-C-C-G--UG--A-AA-UU-U-U-GU-G-G--CU-C-AA-C-C-A-U-G-A-G-A-G--U-G-C-G-G-C-G--CGA-A-CU-G-G--AG-A-C-C-U-U-G-A-G-U--G-C-GC--GG-A-A-G-G-C-A--GG-C--GG-A--AUU--CG-U-G-GU--GU-A-G-CG-GU-G-A-A-A-UG-C-UU-AG--AU-A-UC-A-C-G-A-A-G-A-AC-C-CC--GA-U-U-GC-GAA-GG-C-A-G--C-C-U-G--CCG-C--AG-C-G-U-U-----A-C-U--GA--CG-C-U-G-A-AG-C-U-CG-A--AA-G-C-G-CG--GG-U--AU-C-G-AA-CAGG

 

Here are the taxonomy levels for 10 sequences that I have:

Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Bacteria Firmicutes Clostridia Clostridiales unclassified unclassified
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Rikenellaceae Alistipes
Bacteria unclassified unclassified unclassified unclassified unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae unclassified
Bacteria unclassified unclassified unclassified unclassified unclassified

 


I am confused about what your actual question is. You mention you want to cluster 16S sequences, but you also refer to Kimura and Gamma distances, which are typically used to estimate evolutionary change between sequences. These are different analyses that address different research questions.

Can you tell us something about your research question and where your data comes from (what type it is -- mixed sample, single bacterial genome) -- this will help us to actually be able to guide you with what you want to do.

written 4.1 years ago by Josh Herr

Hi Josh, thanks for your comment! I have updated my question and provided an example sequence and some taxonomy levels from my data.

I want to use a proper distance measure for clustering biological sequences into OTUs. I want to know which of these distance metrics is suitable for the kind of data that I have!

written 4.1 years ago by vmirjalily
Josh Herr (University of Nebraska) wrote, 4.1 years ago:

I'm still not entirely clear on what you want to do. It looks like you already have a taxonomy list, which is typically one of the numerous outputs from OTU clustering.

There are a lot of methods to cluster sequences into OTUs -- I use usearch/uclust/uparse, but I would also recommend swarm (written by Frédéric Mahé) and mothur (DOTUR, from pschloss, is the clustering program inside); cd-hit is an oldie, but still works. There are many more I haven't mentioned here.

Depending on what you want to do, you might find these tutorials handy (they are from a course I co-instruct): http://edamame-course.org/


Yes, I am familiar with MOTHUR, DOTUR, ... and in fact what I want to know is which distance metric MOTHUR uses for building clusters/OTUs.

So here is a line from the DOTUR paper (abstract): "We present a method that addresses the challenge of assigning sequences to operational taxonomic units (OTUs) based on the genetic distances between sequences". What exactly is the genetic distance between sequences? Is it Kimura, Tamura?
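To illustrate the step that sentence describes: once some pairwise distance has been chosen, assigning sequences to OTUs is a clustering step on the resulting distance matrix. The sketch below assumes furthest-neighbor (complete-linkage) agglomeration with a fixed cutoff; the function name, the toy matrix, and the 0.03 and 0.05 cutoffs are illustrative assumptions, not values taken from the paper or from any tool's source code.

```python
def furthest_neighbor_otus(dist, cutoff):
    """Group items 0..n-1 into OTUs by furthest-neighbor agglomeration.

    dist is a symmetric n x n matrix (list of lists) of pairwise distances.
    At each step the two clusters whose most distant pair of members is
    closest are merged; merging stops once that distance exceeds cutoff.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two most distant members.
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical distances among four sequences.
    toy = [
        [0.00, 0.01, 0.02, 0.20],
        [0.01, 0.00, 0.04, 0.21],
        [0.02, 0.04, 0.00, 0.19],
        [0.20, 0.21, 0.19, 0.00],
    ]
    print(furthest_neighbor_otus(toy, 0.03))
    print(furthest_neighbor_otus(toy, 0.05))
```

With this linkage every member of an OTU is within the cutoff of every other member, which is why sequence 2 stays separate at the tighter cutoff even though it is close to sequence 0. The clustering itself does not care which distance filled the matrix.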

(maybe my terminology is not compatible with what is used in biology, sorry about that!)

PS. I am also a student at MSU! That summer course is certainly useful for me, so I will try to register for the course in summer!

written 4.1 years ago by vmirjalily

My understanding is that mothur/DOTUR uses a criterion devised specifically for mothur, so it's neither Kimura nor Gamma. I checked the mothur OTU forum and don't see anything specific regarding the genetic distance metrics, so you'll have to read further to find more information. You can always post a question to the mothur help forum; both Pat and Sarah are extremely helpful and diligent in addressing questions.

written 4.1 years ago by Josh Herr