Question: split supermatrix by gene
gravatar for nwhelan
5.6 years ago by
United States
nwhelan0 wrote:

I'm trying to figure out if there is a good way to split an alignment by positions. For example, let's say I have a supermatrix alignment of 5000 positions, and I want to split it by 1-501, 502-4500, 45001-5000 and then drop those new alignments into a new file. So three files with each taxon but only the positions specified in each file. Is there an easy way to do this? The starting file could either be a fasta or phylip file. 

This would be different than taking a fasta file with thousands of sequences and splitting it up so only 500 different sequences are in any given file (that's been talked about elsewhere on this forum).

ADD COMMENTlink modified 4.0 years ago by Biostar ♦♦ 20 • written 5.6 years ago by nwhelan0
gravatar for Brice Sarver
5.6 years ago by
Brice Sarver3.5k
United States
Brice Sarver3.5k wrote:

Very straightforward using R:

#via Bioconductor


sequenceData <- readDNAStringSet("your_fasta_file.fa", format="fasta")

#you said it's an aligned supermatrix, so it should be a 'square' matrix
subset <- DNAStringSet(sequenceData, start=[YOUR START POSITION], end=[YOUR STOP POSITION])
writeXStringSet(x=subset, format="fasta")

You can iterate this using a loop or more sophisticated apply()-based approaches.

If you are going to be working with datasets of this size frequently, I'd recommend Mark Holder's Nexus Class Library for fast and accurate conversion among filetypes. Also, depending on what you need to do (combine, split, etc.), Stephen Smith's Phyutility works well for most things. It can be a bit buggy, but I've used it to combine transcripts from whole mouse exomes, for example.

ADD COMMENTlink modified 8 months ago by RamRS27k • written 5.6 years ago by Brice Sarver3.5k
gravatar for David W
5.6 years ago by
David W4.7k
New Zealand
David W4.7k wrote:

If you are going to do this sort of thing a lot, it's probably better to make your storage format something like Nexus or NexML that is designed to contain character sets and make it easy to include/exclude them for analyses. 

To just get started you can whatever scripting language you like to slice out parts of the sequence. Biopython has AlignIO which lets you select columns, and SeqIO to process one sequence at time (if memory is problem for very large alignments). The R package ape can allows you to select alignments by column. I'm sure there are perl and ruby functions to do the same


ADD COMMENTlink written 5.6 years ago by David W4.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 784 users visited in the last hour