Question: Split a concatenated MSA and parse out taxa with only missing data
gravatar for kroschie
5.4 years ago by
kroschie0 wrote:

Hi all,

In a nutshell, this is the problem I’m trying to solve: essentially I want to go from a concatenated (non-interleaved) MSA to splitting that into the component individual gene alignments as separate files. I want to do this so I can infer gene trees for all loci separately. I have two datasets currently, one of 14 species and 51 genes, the other of 24 species and 1000 genes, to give you some idea of scale. The data is from transcriptomes compared against OrthoDB COGs using Orthograph.

Sounds straightforward, and indeed there are tools out there that I have used already to do so (Raxml being one). 

HOWEVER, because of random dropout or non-expression of some genes in some species, this results in some species in some gene alignments with no data for that gene, just a long string of missing data characters. So here is the problem: I need a way of parsing those sequences out either during splitting or after so I can build phylogenetic trees for each gene from the species that have data present. But I can’t find a good way of doing this (with just bash type commands) and maintaining the correct format.

So far I’ve tried in Nexus and Phylip format, using sed-based commands because in these formats the species name and corresponding sequence are or can be on the same line – so if I match to a string of missing data characters then I can remove the whole line. The problem with this is that this does not also change the 'ntax' block automatically and I would have to go back through and do it by hand. On the other hand, if I used fasta format which doesn’t have that extra information, then I’d need a way to remove both the sequence line and the one preceding it.

Even better would be a way to search for lines that contain only the missing data character, rather than a string of predetermined but arbitrary length as I have been doing with sed. This is because sequence length changes from gene to gene. This would probably work best with fasta as the sequence takes up the whole line, but again I’d also need a way to remove the species name in the preceding line.

So, my question is, has anyone been able to solve this problem directly - ie, splitting a MSA by gene while also parsing out any sequences for a particular gene that contain only missing data? Alternatively, are there ways to do this post-splitting using Bash, Perl, Python, R, etc? 


sequence alignment gene • 1.5k views
ADD COMMENTlink modified 5.4 years ago by Brice Sarver3.6k • written 5.4 years ago by kroschie0
gravatar for Brice Sarver
5.4 years ago by
Brice Sarver3.6k
United States
Brice Sarver3.6k wrote:

Since there are a couple different versions of Nexus floating around, I would recommend the following approach:

  1. Convert your Nexus file to FASTA. I use NCLconverter, part of the Nexus Class Library.
  2. Read your FASTA file using R/Python/your choice.
  3. You said you had an MSA, so the sequences ought to be of equal length. Scan by position and look for a certain percentage of missing data. Off the top of my head (no guarantees for this quick code!) using Biostrings in R:

    #read it in
    a <- readDNAStringSet("your file", format="fasta")
    #convert to a matrix
    b <- as.matrix(a)
    #create a function to determine if a site is all missing data
    miss <- function(x){
     number_of_Ns <- length(which(x == "N")))
     if (number_of_Ns <= [some threshold you define based on your dataset]){
         site <- NA #(you'll want to make sure this will work; you might have to return a number of NAs equal to the number of rows in your matrix.)
     } else {
         site <- x
    #apply the function across your table of sites
    new_matrix <- apply(a, 2, miss)
    #remove NAs
    new_matrix <- new_matrix[!is.finite(new_matrix)]
    #coerce back to DNAStringSet
    dna <- DNAStringSet(apply(new_matrix, 1, function(x) paste(x, collapse="")))
    writeXStringSet(dna, "your_new_file.fa", format="fasta")

    For extracting certain regions, subseq() is fast (if you know the coords for your genes, perhaps).

  4. Convert back.

Hope this general approach gives you some ideas!

ADD COMMENTlink modified 15 months ago by Ram32k • written 5.4 years ago by Brice Sarver3.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2032 users visited in the last hour