In a nutshell, this is the problem I’m trying to solve: essentially I want to go from a concatenated (non-interleaved) MSA to splitting that into the component individual gene alignments as separate files. I want to do this so I can infer gene trees for all loci separately. I have two datasets currently, one of 14 species and 51 genes, the other of 24 species and 1000 genes, to give you some idea of scale. The data is from transcriptomes compared against OrthoDB COGs using Orthograph.
Sounds straightforward, and indeed there are tools out there that I have used already to do so (Raxml being one).
HOWEVER, because of random dropout or non-expression of some genes in some species, this results in some species in some gene alignments with no data for that gene, just a long string of missing data characters. So here is the problem: I need a way of parsing those sequences out either during splitting or after so I can build phylogenetic trees for each gene from the species that have data present. But I can’t find a good way of doing this (with just bash type commands) and maintaining the correct format.
So far I’ve tried in Nexus and Phylip format, using sed-based commands because in these formats the species name and corresponding sequence are or can be on the same line – so if I match to a string of missing data characters then I can remove the whole line. The problem with this is that this does not also change the 'ntax' block automatically and I would have to go back through and do it by hand. On the other hand, if I used fasta format which doesn’t have that extra information, then I’d need a way to remove both the sequence line and the one preceding it.
Even better would be a way to search for lines that contain only the missing data character, rather than a string of predetermined but arbitrary length as I have been doing with sed. This is because sequence length changes from gene to gene. This would probably work best with fasta as the sequence takes up the whole line, but again I’d also need a way to remove the species name in the preceding line.
So, my question is, has anyone been able to solve this problem directly - ie, splitting a MSA by gene while also parsing out any sequences for a particular gene that contain only missing data? Alternatively, are there ways to do this post-splitting using Bash, Perl, Python, R, etc?