Question

Split a concatenated alignment into individual gene alignments

0

Entering edit mode

9.5 years ago

jon.brate ▴ 290

Hi,

I want to split a concatenated alignment in nexus-format into the individual gene alignments. The gene boundaries are indicated below the alignment. Any programs that can do this? My goal is to create phylogenies for each individual gene.

Thanks,
Jon

alignment phylogeny nexus concatenation • 6.3k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

Do you have an interleaved nexus file or does each sample just have one long line with the sample name and then sequence?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by smilefreak ▴ 420

0

Entering edit mode

Non-interleaved, each taxon on a long line. It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
 CHARSET gene3 = 314 - 597 ;

...

 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

An individual nexus file for each phylogeny? and also how many genes in that ASSUMPTIONS block?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by smilefreak ▴ 420

0

Entering edit mode

The format of the individual files is not that important. I will run RAxML so phylip is the best. But nexus or fasta is also fine.

There are 406 genes in ASSUMPTIONS.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

Hmm, no I don't have any scripts on hand to deal with that situation sorry. Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.

The only suggestion I can give is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by smilefreak ▴ 420

0

Entering edit mode

Ok, but thanks a lot anyway,

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

Ram · Accepted Answer · 2014-11-02

3

Entering edit mode

9.5 years ago

Whetting ★ 1.6k

from Bio import SeqIO
import re

charset=[]
with open ("alignment.nexus", "rU") as f:
    for line in f:
        line=line.strip ()
        if "CHARSET" in line:
            line = line.split (" = ")
            charset.append (line [1])

for c in charset:
    with open (c+"_aligned.fas","w") as out:
        for record in SeqIO.parse ("alignment.nexus", "nexus"):
            m=re.search("(.*) - (.*) ;",c)
            if m:
                start = int(m.groups()[0])
                end = int(m.groups()[1])
            print >> out, ">"+record.id
            print >> out, record.seq [start:end]

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Thanks!

I get some errors though. Does the nexus parsers assume interleaved format?

Traceback (most recent call last):
  File "nexus.py", line 19, in <module>
    for record in SeqIO.parse ("est.holozoa.nex", "nexus"):
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 586, in parse
    for r in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 580, in <genexpr>
    I = (r for alignment in AlignIO.parse(fp, format,
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 370, in parse
    for a in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/NexusIO.py", line 40, in NexusIterator
    n = Nexus.Nexus(handle)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 587, in __init__
    self.read(input)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 638, in read
    self._parse_nexus_block(title, contents)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 679, in _parse_nexus_block
    getattr(self, '_' + line.command)(line.options)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 899, in _matrix
    "(check dimensions/interleaving)" % (id, c, iupac_seq))

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

I edited the file. it worked on a test nexus file for me. Are you sure that your nexus format is ok? If you try:

from Bio import SeqIO

for record in SeqIO.parse("alignment.nexus","nexus"):
    print record.id

does that throw an error?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Thanks, really appreciate this.

Yes this gives the same error. It says that the first taxon contains an illegal character and prints the whole alignment-line for that taxon, ending with (check dimensions/interleaving).

I believe my nexus file is correct, starts like this:

#NEXUS

BEGIN TAXA;
    TITLE EST_Holozoa_Taxa;
    DIMENSIONS NTAX=64;
END;

BEGIN DATA;
    TITLE  EST_Holozoa_Matrix;
    DIMENSIONS  NCHAR=88384;
    FORMAT DATATYPE = Protein GAP = - MISSING = ?;
    MATRIX

    Oscarella_carmela                 LFYSFFKSLVGKEVMVELKNDLSICGTLHSVDQFLNIKLTDVSVPD....
....

;

END;

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
...
 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

Just to do some more QC. If you use this server to convert your nexus to a fasta file. Save it as alignment.fas and change

for record in SeqIO.parse ("alignment.nexus", "nexus"):

to

for record in SeqIO.parse ("alignment.fas", "fasta"):

Does that work?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Yep, that works!

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

Cool. Glad you got your data :)

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Thanks again :)

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by jon.brate ▴ 290

0

Entering edit mode

Hello, how did you parse the character sets?

ADD REPLY • link 5.7 years ago by jrparedesmontero • 0

0

Entering edit mode

Hello, how did you parse the character sets?

ADD REPLY • link 5.7 years ago by jrparedesmontero • 0

0

Entering edit mode

I got the following error:

Traceback (most recent call last):
  File "file.py", line 25, in <module>
    print >> out, record.seq [start:end]
NameError: name 'start' is not defined

Any ideas?

Thanks!

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 8.0 years ago by lurking_variable • 0

0

Entering edit mode

In the above example, start and end are only defined in an if statement:

if m: 
    start = int(m.groups()[0])
    end = int(m.groups()[1])

If your regular expression doesn't match, they won't be defined. This can be handled in a couple of ways depending on what you want to do (move the writing into the if statement, for example), but the problem lies here.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 8.0 years ago by Brice Sarver ★ 3.8k