Randomised subsetting of sequences in a fasta file using R
9.4 years ago
confusedious ▴ 490

I have a sequence alignment in fasta format with 219 sequences in it. I am testing a new phylogenetic method and I am curious about how subsets of differing sizes and compositions from my full alignment might impact upon selection of sites for inclusion in tree building and thus tree topology.

I am using 'ape' and 'phangorn' in R and have found that I can subset defined sequences using the following method:

library(phangorn)  # provides read.phyDat() and the subset() method for phyDat objects
testalign <- read.phyDat("alignment.fasta", format = "fasta", type = "DNA")
subset(testalign, subset = 1:10)

In this case I am creating a subset of sequences 1 through 10. Ideally I would like to extract subsets of a random size between 3 and 218 from this alignment and then write each subset out as its own alignment file. I would prefer, of course, that the subsets not be taken in the order in which the sequences appear in the original file (i.e. not 1:10, but 10 sequences drawn at random from the 219 in the alignment).

Could anyone advise on how I might achieve this?

fasta alignment R • 3.7k views
9.4 years ago
David W 4.9k

I don't have phangorn on the computer in front of me to test the whole thing, but you can get a random sample of integers with... sample() :)

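sample(1:219, size=n)

By default sample() draws without replacement, so this gives n distinct sequences, which is what you want for a subset. Adding replace=TRUE allows the same sequence to be drawn more than once, as in a bootstrap sample of size n.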
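To make the difference concrete, here is a minimal base-R sketch. The seed, the subset size of 10 and the variable names are arbitrary choices for illustration; only the total of 219 sequences comes from the question.

```r
set.seed(42)   # arbitrary seed, only so the example is reproducible
n_seqs <- 219  # number of sequences in the alignment (from the question)
n <- 10        # example subset size (assumption)

# With replacement: the same index may be drawn more than once
boot_idx <- sample(seq_len(n_seqs), size = n, replace = TRUE)

# Without replacement (the default): every index is distinct
subset_idx <- sample(seq_len(n_seqs), size = n, replace = FALSE)

# anyDuplicated() returns 0 when no element is repeated
anyDuplicated(subset_idx)
```

The indices in subset_idx come back in random order, so passing them to subset() should also avoid taking the sequences in file order.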

You could do the same to draw the subset sizes themselves from a uniform distribution (Ns <- sample(3:218, replace=TRUE, size=100)), or use sapply and replicate to sample repeatedly at each of several values of n.
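Putting the pieces together might look something like the sketch below. It is untested against a real alignment, so treat it as a starting point rather than a finished solution. The helper random_subsets(), the count of 100 subsets and the output file names are all hypothetical; the phangorn part assumes write.phyDat() with format = "fasta" and only runs if the package and alignment.fasta are available.

```r
# Hypothetical helper: draw `n_subsets` random index sets. Each set has a
# random size in [min_size, max_size] and contains distinct indices.
random_subsets <- function(n_seqs, n_subsets, min_size = 3, max_size = n_seqs - 1) {
  # sample.int() avoids the surprise of sample(x) when x has length 1
  sizes <- min_size - 1 + sample.int(max_size - min_size + 1, n_subsets, replace = TRUE)
  # For each size, draw that many distinct sequence indices in random order
  lapply(sizes, function(n) sample.int(n_seqs, size = n))
}

# Only runs when phangorn is installed and the alignment file exists
if (requireNamespace("phangorn", quietly = TRUE) && file.exists("alignment.fasta")) {
  testalign <- phangorn::read.phyDat("alignment.fasta", format = "fasta", type = "DNA")
  # A phyDat object is a list with one element per sequence
  idx_list <- random_subsets(length(testalign), n_subsets = 100)
  for (i in seq_along(idx_list)) {
    sub <- subset(testalign, subset = idx_list[[i]])
    # Output names such as subset_001.fasta are an arbitrary choice
    phangorn::write.phyDat(sub, file = sprintf("subset_%03d.fasta", i), format = "fasta")
  }
}
```

Keeping the index sets in a list (and calling set.seed() beforehand) means you can record exactly which sequences went into each file and reproduce the subsets later.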

