Split a concatenated alignment into individual gene alignments
1
0
Entering edit mode
6.9 years ago
jon.brate ▴ 250

Hi,

I want to split a concatenated alignment in nexus-format into the individual gene alignments. The gene boundaries are indicated below the alignment. Any programs that can do this? My goal is to create phylogenies for each individual gene.

Thanks,

Jon

alignment nexus concatenation phylogeny • 4.3k views
0
Entering edit mode

Do you have an interleaved nexus file or does each sample just have one long line with the sample name and then sequence?

0
Entering edit mode

Non-interleaved, each taxon on a long line.  It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):

BEGIN ASSUMPTIONS;
CHARSET gene1 = 1 - 93 ;
CHARSET gene2 = 94 - 313 ;
CHARSET gene3 = 314 - 597 ;

...

OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
END;

0
Entering edit mode

An individual nexus file for each phylogeny? and also how many genes in that ASSUMPTIONS block?

0
Entering edit mode

The format of the individual files is not that important. I will run RAxML so phylip is the best. But nexus or fasta is also fine.

There are 406 genes in ASSUMPTIONS.

0
Entering edit mode

Hmm, no I don't have any scripts on hand to deal with that situation sorry.  Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.

The only suggestion I can give  is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.

0
Entering edit mode

Ok, but thanks a lot anyway,

3
Entering edit mode
6.9 years ago
Whetting ★ 1.5k
from Bio import SeqIO
import re

charset=[]
with open ("alignment.nexus", "rU") as f:
for line in f:
line=line.strip ()
if "CHARSET" in line:
line = line.split (" = ")
charset.append (line [1])

for c in charset:
with open (c+"_aligned.fas","w") as out:
for record in SeqIO.parse ("alignment.nexus", "nexus"):
m=re.search("(.*) - (.*) ;",c)
if m:
start = int(m.groups()[0])
end = int(m.groups()[1])
print >> out, ">"+record.id
print >> out, record.seq [start:end]


0
Entering edit mode

Thanks!

I get some errors though. Does the nexus parsers assume interleaved format?

Traceback (most recent call last):
File "nexus.py", line 19, in <module>
for record in SeqIO.parse ("est.holozoa.nex", "nexus"):
File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 586, in parse
for r in i:
File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 580, in <genexpr>
i = (r for alignment in AlignIO.parse(fp, format,
File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 370, in parse
for a in i:
File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/NexusIO.py", line 40, in NexusIterator
n = Nexus.Nexus(handle)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 587, in __init__
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 638, in read
self._parse_nexus_block(title, contents)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 679, in _parse_nexus_block
getattr(self, '_' + line.command)(line.options)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 899, in _matrix
"(check dimensions/interleaving)" % (id, c, iupac_seq))


0
Entering edit mode

I edited the file. it worked on a test nexus file for me. Are you sure that your nexus format is ok? if you try

from Bio import SeqIO

for record in SeqIO.parse("alignment.nexus","nexus"):

print record.id

does that throw an error?

0
Entering edit mode

Thanks, really appreciate this.

Yes this gives the same error. It says that the first taxon contains an illegal character and prints the whole alignment-line for that taxon, ending with (check dimensions/interleaving).

I believe my nexus file is correct, starts like this:

#NEXUS

BEGIN TAXA;
TITLE EST_Holozoa_Taxa;
DIMENSIONS NTAX=64;
END;

BEGIN DATA;
TITLE  EST_Holozoa_Matrix;
DIMENSIONS  NCHAR=88384;
FORMAT DATATYPE = Protein GAP = - MISSING = ?;
MATRIX

Oscarella_carmela                 LFYSFFKSLVGKEVMVELKNDLSICGTLHSVDQFLNIKLTDVSVPD....
....

;

END;

BEGIN ASSUMPTIONS;
CHARSET gene1 = 1 - 93 ;
CHARSET gene2 = 94 - 313 ;
...
OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
END;


0
Entering edit mode

just to do some more QC. If you use this server (http://www.bugaco.com/bioinf/nexusfasta.php) to convert your nexus to a fasta file. Save it as "alignment.fas" and change

for record in SeqIO.parse ("alignment.nexus", "nexus"):

to

for record in SeqIO.parse ("alignment.fas", "fasta"):

Does that work?

0
Entering edit mode

Yep, that works!

0
Entering edit mode

0
Entering edit mode

Thanks again :)

0
Entering edit mode

Hello, how did you parse the character sets?

0
Entering edit mode

Hello, how did you parse the character sets?

0
Entering edit mode

I got the following error:

Traceback (most recent call last): File "file.py", line 25, in <module> print >> out, record.seq [start:end] NameError: name 'start' is not defined

Any ideas?

Thanks!

0
Entering edit mode

In the above example, start and end are only defined in an if statement:

if m:

start = int(m.groups()[0])

end = int(m.groups()[1])

If your regular expression doesn't match, they won't be defined. This can be handled in a couple of ways depending on what you want to do (move the writing into the if statement, for example), but the problem lies here.