weighted mean pairwise distance among samples containing multiple sequences?
0
0
Entering edit mode
2.6 years ago

Weighted pairwise genetic distance between samples from fasta file and count table of sequences?

I have a series of samples, all containing multiple DNA sequences. I’m looking to calculate the mean pairwise genetic distance between samples. I’ve figured out a way of how to do this in R using dist.dna in ape and dist_groups in usedist in a dummy dataset as follows:


library(adegenet)

library(ape)

library(usedist)

library(tidyverse)

# open dna file and calculate genetic distance matrix:

dna <- fasta2DNAbin(file="http://adegenet.r-forge.r-project.org/files/usflu.fasta")

D <- dist.dna(dna, model = "N")

# make dummy vector of samples a - d

samples <- c(repsamples <- c(rep("sample.a", 20), rep("sample.b", 20), rep("sample.c", 20), rep("sample.d", 20)))

# rename distance labels

D <- dist_setNames(D, samples)

# plot matrix by samples

table.paint(as.matrix(D), cleg=0, clabel.row=.5, clabel.col=.5)

# Create a data frame of distances between groups of items.

D2 <- dist_groups(D, samples) D2.grouped <- D2 %>% group_by(Label) %>% summarise(mean.distance=mean(Distance))


The problem I have is that in the actual dataset I have ~400 samples, and each sample contains upwards of 20,000 sequences. Combining these together would make a (very) large distance matrix.

As there are only ~1500 unique sequences in total, I can get around this problem by calculating a distance matrix on JUST the unique samples. What I’d then like to do is calculate mean pairwise distance BETWEEN samples (as above with dist_groups from usedist) except somehow weight each sample by the number of times the sequence occurs in the sample?

Specific example using dummy data below (10 samples, 3 sequences per sample, lots of sequence counts):


# Open fasta and extract sequence names

dna <- fasta2DNAbin(file="http://adegenet.r-forge.r-project.org/files/usflu.fasta")

sequencenames <- dimnames(dna)[[1]]

# Make dummy count table

sequences <- sample(sequencenames,30,replace=TRUE)

count <- sample(1000:5000,30)

sample <- rep(LETTERS[seq( from = 1, to = 10 )],3)

counttable <- cbind(sample, sequences, count) %>% as.data.frame() %>% arrange(sample)


One way I could see this working would be:

1) Take the pairwise distances among sequences from dist.dna(dna, model = "N")

2) Expand the fasta file using the count table so that every row of dist.dna would be an individual sequence within a sample

While this could work, the fasta file would be… sum(as.numeric(counttable$count)) = 101430 sequences long, and the dist matrix would be 101430* 101430 - which would rapidly become computationally intractable as the actual dataset has… 60 million sequences.

My ideal workflow would be this:

1) Take the pairwise distances among unique sequences from dist.dna(dna, model = "N")

2) Use dist_groups as above to calculate within and between sample pairwise distance, but someway make this a weighted distance by incorporating the replicated sequences within each sample.

Any ideas how to do this in either R or Python? At this stage i'm up for anything!

distance dna ape • 689 views
ADD COMMENT

Login before adding your answer.

Traffic: 2337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6