Question: How To Reformat Msa Output To Show Only 50 Nucleotide Sequences In Each Line
1
gravatar for Thaman
9.2 years ago by
Thaman3.2k
Finland
Thaman3.2k wrote:

I have Multiple sequence alignment (clustal) file and I want to read this file and arrange nucleotide sequences in such a way that it looks more clear and precise in order.

I am doing this from biopython using AlignIO object. My codes goes like this:

alignment = AlignIO.read("opuntia.aln", "clustal")
print "Number of rows: %i" % len(align)
for record in alignment:
    print "%s - %s" % record.id, record.seq)

I have been more fascinated by the output of http://www.ebi.ac.uk/Tools/clustalw2/ project because it looks clear.

My Output -- , it looks messy and long scrolling. What i want to do is print only 50 sequences (nucleotide) in each line and continue till the end of alignment file.

I wish to have output like this from http://www.ebi.ac.uk/Tools/clustalw2/.

Any suggestions, algorithm and sample code is appreciated

Here is the link to the input file-- http://pastebin.com/EaeKsyvg

Thanks in advance

Br,

multiple clustalw • 2.3k views
ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k
1

By '50 sequences', do you mean 50 nucleotides? @Thaman: I have seen a few of your questions and they are all great, in the sens that they have the potential to be very useful to others. However, it would be much better if you took the time to formulate them more clearly, giving both more details about the context and what you desire, and also making it more English. That way, you will get much better replies and won't have to ask the same question twice because the first version was confusing, as happened recently. Thanks for your contribution! Cheers.

ADD REPLYlink written 9.2 years ago by Eric Normandeau10k

Sequence ID and 50 sequences? That's not clear.

ADD REPLYlink written 9.2 years ago by Paulo Nuin3.7k
gi|6273285|gb|AF191659.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA 50
ADD REPLYlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k

I mean like this...........

ID                             Sequences
gi|6273285-----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
ADD REPLYlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k
ID--------------------Sequence----------------------------------------
gi|6273285|gb|------TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
ADD REPLYlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k
ID------Sequence----------------------------------------
gi|6273285|gb|----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
gi|6273284|gb|----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
gi|6273287|gb-----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
ADD REPLYlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k

@Eric, thanks for your kind suggestion, i have edited my query and make it more precise. I am just trying to make alignment output similar to http://www.ebi.ac.uk/Tools/clustalw2/ project. According to my output its long nucleotide record, but i want to split it into 50 nucleotide each in individual line. Thanks for listening

ADD REPLYlink written 9.2 years ago by Thaman3.2k

Could you please put the input data on the web. That way we can write a code and test it withouth having to generate the input data manually. Just paste it into http://pastebin.com/ and link it.

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 80k

Could you please put the input data (alignment) on the web. That way we can write a code and test it without having to generate the input data manually. Just paste it into pastebin.com and link it

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 80k

How can i link it to Istan Albert?

ADD REPLYlink written 9.2 years ago by Thaman3.2k

just edit the post and add the link into the main post

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 80k

Ok now finally alignment file is here.

ADD REPLYlink written 9.2 years ago by Thaman3.2k

okei i provided the input file link in the main post.

ADD REPLYlink written 9.2 years ago by Thaman3.2k

The input file is the file you want to transform, the one that has long lines. The file that you provide seems like the result of the transformation.

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 80k

I tried with readlines() and seems working, seems likes i couldn't make my problem clear to you.

ADD REPLYlink written 9.2 years ago by Thaman3.2k
5
gravatar for Istvan Albert
9.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

Here is my solution with Python, using the formatter by nuin. It transforms data from the original format shown here to the a visually more pleasing format shown here

import string
from itertools import islice

# read the file
lines = file('input.txt').readlines()

# remove empty lines if present
lines = map(string.strip, lines)
lines = filter(None, lines)

# break each line into id and sequence pairs
pairs = map(string.split, lines)

# this breaks each sequence into the small pieces
# generates each individual line in the file
def chunk(pair, length=50):
    seqid, sequence = pair
    temp = []
    for j in range(0,len(sequence),length):
        temp.append( (seqid + ' ' + sequence[j:j+length]) )
    return temp

# here we apply the chunking function
data = map(chunk, pairs)

# this is really an idiomatic expression to perform 
# the inverse of what zip usually does
transposed = zip( *data )

# generate the output
for elems in transposed:
    print '\n'.join(elems)
    print

The only tricky part is the zip(*data), to best understand what it does see:

a = [ (1,2,3), (4,5,6) ]
b = zip(*a)
c = zip(*b)

print a
print b
print c

it will print:

[(1, 2, 3), (4, 5, 6)]
[(1, 4), (2, 5), (3, 6)]
[(1, 2, 3), (4, 5, 6)]
ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.2 years ago by Istvan Albert ♦♦ 80k

@ Istvan, how can i get clustal multiple sequence alignment scores in each line like on the right side column of

ADD REPLYlink modified 10 months ago by RamRS22k • written 9.2 years ago by Thaman3.2k

it that alignment score or simply the step count, 50, 100, 150 etc. You see this is why it is important to give people access to the actual files for both the input and output, otherwise we are left guessing

ADD REPLYlink written 9.2 years ago by Istvan Albert ♦♦ 80k

I am trying to make project using Clustalw2(multiple sequence alignment) biopython wrapper. Its somehow similar to http://www.ebi.ac.uk/Tools/ -> sequence analysis-> clustalw. Fasta Input file for example is (http://pastebin.com/JA76QTJG) and output alignment file is (http://pastebin.com/M4KQX46u). Now i want to read the alignment file and show (id, sequence and scores) records. I'm not sure whether its count or alignment score.

ADD REPLYlink written 9.2 years ago by Thaman3.2k
2
gravatar for Paulo Nuin
9.2 years ago by
Paulo Nuin3.7k
Canada
Paulo Nuin3.7k wrote:

This is a function that you can use to do that

def format_output(sequence, length):
    temp = []
    for j in range(0,len(sequence),length):
        temp.append(sequence[j:j+length])
    return '\n'.join(temp)

just loop your sequences through it.

ADD COMMENTlink modified 10 months ago by RamRS22k • written 9.2 years ago by Paulo Nuin3.7k

Really nice approach! Thx :)

ADD REPLYlink written 9.2 years ago by Eric Normandeau10k

Not as Pythonic as it could be, but simple to understand.

ADD REPLYlink written 9.2 years ago by Paulo Nuin3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 567 users visited in the last hour