Question: Split a concatenated alignment into individual gene alignments
0
gravatar for jon.brate
3.8 years ago by
jon.brate160
Norway
jon.brate160 wrote:

Hi,

I want to split a concatenated alignment in nexus-format into the individual gene alignments. The gene boundaries are indicated below the alignment. Any programs that can do this? My goal is to create phylogenies for each individual gene.

Thanks,

Jon

ADD COMMENTlink modified 3.8 years ago by Whetting1.5k • written 3.8 years ago by jon.brate160

Do you have an interleaved nexus file or does each sample just have one long line with the sample name and then sequence?

ADD REPLYlink written 3.8 years ago by smilefreak400

Non-interleaved, each taxon on a long line.  It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
 CHARSET gene3 = 314 - 597 ;

...

 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;

 

ADD REPLYlink written 3.8 years ago by jon.brate160

An individual nexus file for each phylogeny? and also how many genes in that ASSUMPTIONS block?

 

ADD REPLYlink written 3.8 years ago by smilefreak400

The format of the individual files is not that important. I will run RAxML so phylip is the best. But nexus or fasta is also fine.

There are 406 genes in ASSUMPTIONS.

ADD REPLYlink written 3.8 years ago by jon.brate160

Hmm, no I don't have any scripts on hand to deal with that situation sorry.  Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.

The only suggestion I can give  is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by smilefreak400

Ok, but thanks a lot anyway,

ADD REPLYlink written 3.8 years ago by jon.brate160
3
gravatar for Whetting
3.8 years ago by
Whetting1.5k
Bethesda, MD
Whetting1.5k wrote:
from Bio import SeqIO
import re

charset=[]
with open ("alignment.nexus", "rU") as f:
    for line in f:
        line=line.strip ()
        if "CHARSET" in line:
            line = line.split (" = ")
            charset.append (line [1])

for c in charset:
    with open (c+"_aligned.fas","w") as out:
        for record in SeqIO.parse ("alignment.nexus", "nexus"):
            m=re.search("(.*) - (.*) ;",c)
            if m:
                start = int(m.groups()[0])
                end = int(m.groups()[1])
            print >> out, ">"+record.id
            print >> out, record.seq [start:end]

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Whetting1.5k

Thanks!

I get some errors though. Does the nexus parsers assume interleaved format?

Traceback (most recent call last):
  File "nexus.py", line 19, in <module>
    for record in SeqIO.parse ("est.holozoa.nex", "nexus"):
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 586, in parse
    for r in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 580, in <genexpr>
    i = (r for alignment in AlignIO.parse(fp, format,
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 370, in parse
    for a in i:
  File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/NexusIO.py", line 40, in NexusIterator
    n = Nexus.Nexus(handle)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 587, in __init__
    self.read(input)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 638, in read
    self._parse_nexus_block(title, contents)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 679, in _parse_nexus_block
    getattr(self, '_' + line.command)(line.options)
  File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 899, in _matrix
    "(check dimensions/interleaving)" % (id, c, iupac_seq))

ADD REPLYlink written 3.8 years ago by jon.brate160

I edited the file. it worked on a test nexus file for me. Are you sure that your nexus format is ok? if you try

 

from Bio import SeqIO

for record in SeqIO.parse("alignment.nexus","nexus"):

    print record.id

 

does that throw an error?

ADD REPLYlink written 3.8 years ago by Whetting1.5k

Thanks, really appreciate this.

Yes this gives the same error. It says that the first taxon contains an illegal character and prints the whole alignment-line for that taxon, ending with (check dimensions/interleaving).

I believe my nexus file is correct, starts like this:

#NEXUS

BEGIN TAXA;
    TITLE EST_Holozoa_Taxa;
    DIMENSIONS NTAX=64;
END;

BEGIN DATA;
    TITLE  EST_Holozoa_Matrix;
    DIMENSIONS  NCHAR=88384;
    FORMAT DATATYPE = Protein GAP = - MISSING = ?;
    MATRIX

    Oscarella_carmela                 LFYSFFKSLVGKEVMVELKNDLSICGTLHSVDQFLNIKLTDVSVPD....
....

;

END;

BEGIN ASSUMPTIONS;
 CHARSET gene1 = 1 - 93 ;
 CHARSET gene2 = 94 - 313 ;
...
 OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
 END;

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by jon.brate160

just to do some more QC. If you use this server (http://www.bugaco.com/bioinf/nexusfasta.php) to convert your nexus to a fasta file. Save it as "alignment.fas" and change

for record in SeqIO.parse ("alignment.nexus", "nexus"):

to

for record in SeqIO.parse ("alignment.fas", "fasta"):

 

Does that work?

ADD REPLYlink written 3.8 years ago by Whetting1.5k

Yep, that works!

ADD REPLYlink written 3.8 years ago by jon.brate160

cool. Glad you got your data :)

ADD REPLYlink written 3.8 years ago by Whetting1.5k

Thanks again :)

ADD REPLYlink written 3.8 years ago by jon.brate160

Hello, how did you parse the character sets?

ADD REPLYlink written 24 days ago by jrparedesmontero0

Hello, how did you parse the character sets?

ADD REPLYlink written 24 days ago by jrparedesmontero0

I got the following error:

Traceback (most recent call last): File "file.py", line 25, in <module> print >> out, record.seq [start:end] NameError: name 'start' is not defined

Any ideas?

Thanks!

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by lurking_variable0

In the above example, start and end are only defined in an if statement:

if m:

start = int(m.groups()[0])

end = int(m.groups()[1])

If your regular expression doesn't match, they won't be defined. This can be handled in a couple of ways depending on what you want to do (move the writing into the if statement, for example), but the problem lies here.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Brice Sarver2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1288 users visited in the last hour