Split a concatenated alignment into individual gene alignments
1
0
Entering edit mode
9.5 years ago
jon.brate ▴ 290

Hi,

I want to split a concatenated alignment in nexus-format into the individual gene alignments. The gene boundaries are indicated below the alignment. Any programs that can do this? My goal is to create phylogenies for each individual gene.

Thanks,
Jon

alignment phylogeny nexus concatenation • 6.3k views
ADD COMMENT
0
Entering edit mode

Do you have an interleaved nexus file or does each sample just have one long line with the sample name and then sequence?

ADD REPLY
0
Entering edit mode

Non-interleaved, each taxon on a long line. It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
 CHARSET gene3 = 314 - 597 ;

...

 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;
ADD REPLY
0
Entering edit mode

An individual nexus file for each phylogeny? and also how many genes in that ASSUMPTIONS block?

ADD REPLY
0
Entering edit mode

The format of the individual files is not that important. I will run RAxML so phylip is the best. But nexus or fasta is also fine.

There are 406 genes in ASSUMPTIONS.

ADD REPLY
0
Entering edit mode

Hmm, no I don't have any scripts on hand to deal with that situation sorry. Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.

The only suggestion I can give is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.

ADD REPLY
0
Entering edit mode

Ok, but thanks a lot anyway,

ADD REPLY
3
Entering edit mode
9.5 years ago
Whetting ★ 1.6k
from Bio import SeqIO
import re

charset=[]
with open ("alignment.nexus", "rU") as f:
    for line in f:
        line=line.strip ()
        if "CHARSET" in line:
            line = line.split (" = ")
            charset.append (line [1])

for c in charset:
    with open (c+"_aligned.fas","w") as out:
        for record in SeqIO.parse ("alignment.nexus", "nexus"):
            m=re.search("(.*) - (.*) ;",c)
            if m:
                start = int(m.groups()[0])
                end = int(m.groups()[1])
            print >> out, ">"+record.id
            print >> out, record.seq [start:end]
ADD COMMENT
0
Entering edit mode

Thanks!

I get some errors though. Does the nexus parsers assume interleaved format?

Traceback (most recent call last):
  File "nexus.py", line 19, in <module>
    for record in SeqIO.parse ("est.holozoa.nex", "nexus"):
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 586, in parse
    for r in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 580, in <genexpr>
    I = (r for alignment in AlignIO.parse(fp, format,
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 370, in parse
    for a in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/NexusIO.py", line 40, in NexusIterator
    n = Nexus.Nexus(handle)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 587, in __init__
    self.read(input)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 638, in read
    self._parse_nexus_block(title, contents)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 679, in _parse_nexus_block
    getattr(self, '_' + line.command)(line.options)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 899, in _matrix
    "(check dimensions/interleaving)" % (id, c, iupac_seq))
ADD REPLY
0
Entering edit mode

I edited the file. it worked on a test nexus file for me. Are you sure that your nexus format is ok? If you try:

from Bio import SeqIO

for record in SeqIO.parse("alignment.nexus","nexus"):
    print record.id

does that throw an error?

ADD REPLY
0
Entering edit mode

Thanks, really appreciate this.

Yes this gives the same error. It says that the first taxon contains an illegal character and prints the whole alignment-line for that taxon, ending with (check dimensions/interleaving).

I believe my nexus file is correct, starts like this:

#NEXUS

BEGIN TAXA;
    TITLE EST_Holozoa_Taxa;
    DIMENSIONS NTAX=64;
END;

BEGIN DATA;
    TITLE  EST_Holozoa_Matrix;
    DIMENSIONS  NCHAR=88384;
    FORMAT DATATYPE = Protein GAP = - MISSING = ?;
    MATRIX

    Oscarella_carmela                 LFYSFFKSLVGKEVMVELKNDLSICGTLHSVDQFLNIKLTDVSVPD....
....

;

END;

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
...
 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;
ADD REPLY
0
Entering edit mode

Just to do some more QC. If you use this server to convert your nexus to a fasta file. Save it as alignment.fas and change

for record in SeqIO.parse ("alignment.nexus", "nexus"):

to

for record in SeqIO.parse ("alignment.fas", "fasta"):

Does that work?

ADD REPLY
0
Entering edit mode

Yep, that works!

ADD REPLY
0
Entering edit mode

Cool. Glad you got your data :)

ADD REPLY
0
Entering edit mode

Thanks again :)

ADD REPLY
0
Entering edit mode

Hello, how did you parse the character sets?

ADD REPLY
0
Entering edit mode

Hello, how did you parse the character sets?

ADD REPLY
0
Entering edit mode

I got the following error:

Traceback (most recent call last):
  File "file.py", line 25, in <module>
    print >> out, record.seq [start:end]
NameError: name 'start' is not defined

Any ideas?

Thanks!

ADD REPLY
0
Entering edit mode

In the above example, start and end are only defined in an if statement:

if m: 
    start = int(m.groups()[0])
    end = int(m.groups()[1])

If your regular expression doesn't match, they won't be defined. This can be handled in a couple of ways depending on what you want to do (move the writing into the if statement, for example), but the problem lies here.

ADD REPLY

Login before adding your answer.

Traffic: 2509 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6