Question

split supermatrix by gene

0

Entering edit mode

9.4 years ago

nwhelan • 0

I'm trying to figure out if there is a good way to split an alignment by positions. For example, let's say I have a supermatrix alignment of 5000 positions, and I want to split it by 1-501, 502-4500, 45001-5000 and then drop those new alignments into a new file. So three files with each taxon but only the positions specified in each file. Is there an easy way to do this? The starting file could either be a fasta or phylip file.

This would be different than taking a fasta file with thousands of sequences and splitting it up so only 500 different sequences are in any given file (that's been talked about elsewhere on this forum).

supermatrix alignment phylogeny gene • 2.0k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by nwhelan • 0

Ram · Answer 1 · 2014-12-05

Very straightforward using R:

#via Bioconductor

library(Biostrings)

sequenceData <- readDNAStringSet("your_fasta_file.fa", format="fasta")

#you said it's an aligned supermatrix, so it should be a 'square' matrix
subset <- DNAStringSet(sequenceData, start=[YOUR START POSITION], end=[YOUR STOP POSITION])
writeXStringSet(x=subset, format="fasta")

You can iterate this using a loop or more sophisticated apply()-based approaches.

If you are going to be working with datasets of this size frequently, I'd recommend Mark Holder's Nexus Class Library for fast and accurate conversion among filetypes. Also, depending on what you need to do (combine, split, etc.), Stephen Smith's Phyutility works well for most things. It can be a bit buggy, but I've used it to combine transcripts from whole mouse exomes, for example.

Ram · Answer 2 · 2014-12-04

If you are going to do this sort of thing a lot, it's probably better to make your storage format something like Nexus or NexML that is designed to contain character sets and make it easy to include/exclude them for analyses.

To just get started you can whatever scripting language you like to slice out parts of the sequence. Biopython has AlignIO which lets you select columns, and SeqIO to process one sequence at time (if memory is problem for very large alignments). The R package ape can allows you to select alignments by column. I'm sure there are perl and ruby functions to do the same