Question

Remove gap from MSA alignment file

0

Entering edit mode

6.4 years ago

skjobs1234 ▴ 40

     import sys
    import re
    t=open(sys.argv[1],'r').read()
    #reading the template sequences
    blocks=re.split(r'>.*.pdb',t)
    #Reading all the sequences
    blocks1=t.split(">")

    count=[]
    #Calculate the start gap from the template
    for fa in blocks[1:]:
        fasta=fa.split('\n')
        for i in range(len(fasta)):
            #modulate the gap size per line here
#but I want to remove one by one character not per line
            if fasta[i]=='-'*60:
                count.append(i)

    #Get the maximum lines of gap from start in all template sequences
    count=[i for i in count if count.count(i)>1]
    j=list(set(count))
    out=[]
    #Removing the lines form query+template sequence file
    for fa1 in blocks1[1:]:
        fasta1=fa1.split('\n',)
        for i in sorted(j, reverse=True): 
            del fasta1[i]
        #Remove comment for commandline output
        #print ">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:]))
        out.append(">%s\n%s"%(fasta1[0],"\n".join(fasta1[1:])))
    #Output file operation
    f = open(sys.argv[1],'w')
    for k in out:
        f.write(k)
    f.close()
    print "DONE\nOutput name: "+sys.argv[1]

This script is removing line by line string. I want to remove (-) character wise from input files. Please guide how to split into string wise rather than line (\n)

alignment sequence software error • 2.0k views

ADD COMMENT • link updated 6.4 years ago by Pierre Lindenbaum 161k • written 6.4 years ago by skjobs1234 ▴ 40

0

Entering edit mode

What is your input file format?

ADD REPLY • link 6.3 years ago by Joe 21k

0

Entering edit mode

Input file will be fasta sequence  Example:

 >P1_2vt4.pdb
 ---------------------------------------------------------WEA GMSLLMALVVLLIVAGNVLVIAAIGSTQRLQTLTNLFITSLACADLVVGLLVVPFGATLV
 -----------------------------------
 >P1_3kj6.pdb
 ------------------------------------------------------------ GMGIVMSLIVLAIVFGNVLVITAI-AFERLQTVTNYFITSLACADLVMGLAVVPFGA---
 LRRSS------------------------------
 >P1_F7EQ49 LSCLYTIFLFPIGFVGNILILVVNISFREKMTIPDLYFINLAVADLILVA-DSLIEVF-N
 LHEQYYDIAVLCTFMSLFLQVNMYSSVFFLTWMSFDRYIALARAMRCSLFRTKHHARLSC
 QKTNLPALNRFCHAALKAVIPDSTEQSDVRFSSAV

And I want output like this 1) Remove the gap if found continuous gap (-) more than 20 in >.pdb file. 2) Also remove at the same position from without >.pdb file like >P1_F7EQ49.. 3) Removing position should be same in all the files. Output should look like this

> >P1_2vt4.pdb 
>          WEA  GMSLLMALVVLLIVAGNVLVIAAIGSTQRLQTLTNLFITSLACADLVVGLLVVPFGATLV
>          ----------
>      >P1_3kj6.pdb
>     
>          ---  GMGIVMSLIVLAIVFGNVLVITAI-AFERLQTVTNYFITSLACADLVMGLAVVPFGA--- LRRSS
>          >P1_F7EQ49 
>     LSC  LHEQYYDIAVLCTFMSLFLQVNMYSSVFFLTWMSFDRYIALARAMRCSLFRTKHHARLSC QKTNL

ADD REPLY • link 6.3 years ago by skjobs1234 ▴ 40

0

Entering edit mode

Is there any specific reason why gaps are not removed in one go, instead of character wise?

ADD REPLY • link 6.3 years ago by cpad0112 21k

0

Entering edit mode

Your formatting is still very janky, so I’m not sure I understand yet. The first point is straightforward, remove any gaps >20 characters in length. I don’t really understand the last 2 requests. You want to leave stretches of gaps within the sequence intact? So just remove - characters from the ends of the sequence?

Why exactly do you want to do this? You’ll be ruining the alignment..

ADD REPLY • link 6.3 years ago by Joe 21k