Question: Remove gap from files
0
3 months ago by
skjobs12340
skjobs12340 wrote:

I want to remove the long unaligned gaps from my files, position-wise. My input file looks like this:

 >f1
--------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
--------------------------------------------------------------------------------------------------------------------
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
MRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLC
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

And I want the output like this

>f1
------------------------------------------------------------------------------------------------------GTTYGVC
SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2
-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC
TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

>f3
IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ
CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

So an equal number of gap positions is removed from f1 and f2, and one line is removed from f3 at the same positions that were removed from f1 and f2.

Thanks in advance

sequence alignment • 411 views
ADD COMMENTlink modified 3 months ago by shoujun.gu290 • written 3 months ago by skjobs12340
5

I want remove the long unaligned gap from the files with position wise

Why? This sounds like an XY problem: http://xyproblem.info/

ADD REPLYlink written 3 months ago by Pierre Lindenbaum102k
1

How about Gblocks: http://molevol.cmima.csic.es/castresana/Gblocks/Gblocks_documentation.html

ADD REPLYlink written 3 months ago by Sej Modha2.3k

I have tried Gblocks and trimAl, but neither is suitable for my data set. Please help me write a Perl or Python script.

ADD REPLYlink written 3 months ago by skjobs12340

I want remove the long unaligned gap from the files with position wise

Do you know the positions of the gaps to remove? Are they based on the blast result?

ADD REPLYlink written 3 months ago by st.ph.n1.9k

No, please guide me. Yes, it is based on the BLAST result. I used PROMAL3D to get the aligned file, and yes, the positions are known.

ADD REPLYlink written 3 months ago by skjobs12340

Please provide examples of the positions at which gaps should be removed.

ADD REPLYlink written 3 months ago by st.ph.n1.9k

Example Input file

>f1

------------------------------------------------------------------------------------------------------GTTYGVC SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

>f2

-----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

f3 MRCVGVGNRDFVEGLSGATWVDVVLEHGGCVTTMAKNKPTLDIELQKTEATQLATLRKLC IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

Output file example

f1 ------------------------------------------------------------------------------------------------------GTTYGVC SKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

f2 -----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMC TEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

f3 IEGKITNITTDSRCPTQGEAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQ CLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

ADD REPLYlink written 3 months ago by skjobs12340
1

Why do you keep the gaps at the beginning of the f1 and f2 output? How many gaps do you want to keep?

ADD REPLYlink modified 3 months ago • written 3 months ago by shoujun.gu290

It does not look like you want to remove gaps, but rather just the newline characters after the FASTA header line.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax39k

This question is unrelated to Perl or Blast so I removed those tags.

ADD REPLYlink written 3 months ago by Brian Bushnell15k

Hello skjobs1234!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/2524/r

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 3 months ago by Pierre Lindenbaum102k
3
3 months ago by
shoujun.gu290
Rockville/MD
shoujun.gu290 wrote:
  1. Save the following code in a file called biostar.py (or any other name).
  2. In a shell, run: python3 biostar.py input_file output_file
  3. Note that this code may not work on very large files, because it reads the whole file into memory.

code:

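import sys

# Usage: python3 biostar.py input_file output_file
if len(sys.argv) != 3:
    sys.exit('usage: python3 biostar.py input_file output_file')
inp = sys.argv[1]
output = sys.argv[2]

# Find the largest number of gap-only lines in any single record.
count = 0
tempcount = 0
with open(inp, 'r') as file:
    for line in file:
        if line[0] == '>':
            count = max(count, tempcount)
            tempcount = 0
        elif set(line.strip()) == {'-'}:
            # strip() also catches a last line without a trailing newline
            tempcount += 1
count = max(count, tempcount)

# Split the file into records; the text before the first '>' is dropped.
with open(inp, 'r') as file2:
    records = file2.read().split('>')[1:]

# For every record, keep the header and skip the first `count` sequence lines.
out_lines = []
for fa in records:
    fasta = fa.split('\n')
    out_lines += ['>' + fasta[0]] + fasta[count + 1:]
out_lines = [item + '\n' for item in out_lines if item]

with open(output, 'w') as out:
    out.writelines(out_lines)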
ADD COMMENTlink modified 3 months ago • written 3 months ago by shoujun.gu290

I'm getting this error on Terminal

Traceback (most recent call last):
  File "biostar.py", line 3, in <module>
    inp=sys.argv[1]
IndexError: list index out of range

ADD REPLYlink modified 3 months ago • written 3 months ago by skjobs12340
1

Did you provide the absolute paths of the input and output files on the command line? Or could you copy the command you ran here?
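
For what it's worth, that IndexError is what Python raises when sys.argv has fewer entries than the script indexes, i.e. when the script is started without both file names. A minimal sketch of a more defensive start for the script, using the standard argparse module; the function name parse_args and the argument names are only illustrative, not part of the posted script:

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional file arguments.

    argparse prints a usage message and exits if either one is missing,
    instead of failing later with an IndexError.
    """
    parser = argparse.ArgumentParser(
        description="Remove leading gap-only lines from a FASTA alignment.")
    parser.add_argument("input_file", help="aligned FASTA file to read")
    parser.add_argument("output_file", help="file to write the result to")
    return parser.parse_args(argv)


# Passing an explicit list here only for demonstration; in the real script
# you would call parse_args() with no argument so it reads sys.argv.
args = parse_args(["in.fasta", "out.fasta"])
print(args.input_file, args.output_file)
```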

ADD REPLYlink modified 3 months ago • written 3 months ago by shoujun.gu290

Yes, it's working now. Thank you for your valuable suggestion and time.

ADD REPLYlink written 3 months ago by skjobs12340

Dear Shoujun, your script runs well, but the output is not what I want. Please have another look at this problem. Thank you for helping me.

Actually, your script does not remove the gaps properly. I want to remove only the unaligned (gap) regions from the > records. Here is a brief example with the expected output:

tem1------------------------------------------------------FHLTTR GGEPHMIVSKQERGKSLLFKTSAGVNMCTLIAMDLGELCEDTMTYKCPRITETEPDDVDC

tem2------------------------------------------------------------ --------------------------TPVECFEPSMLKKKQLTVLDLHPG-G-KTRRVLP

query MNNQRKKTGKPSINMLKRVRNRVSTGSQLAKRFSKGLLNGQGPMKLVMAFIAFLRFLAIP PTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINQRKKTSLCLMMILPAALAFHLTSR

I want output like this

tem1 FHLTTRGGEPHMIVSKQERGKSLLFKTSAGVNMCTLIAMDLGELCEDTMTYKCPRITETEPDDVDC tem2-------------------------------------TPVECFEPSMLKKKQLTVLDLHPG-G-KTRRVLP query MNNQRPTAGVLARWGTFKKSGAIKVLKGFKKEISNMLSIINQRKKTSLCLMMILPAALAFHLTSR

Condition 1: As you can see here, if gaps are found at the same positions in both tem1 and tem2, then remove those gaps and also remove the same positions from the query (it does not matter whether the query has a gap there or not; an equal number of residues should be removed at the same positions).

Condition 2: If a gap of more than 20 residues is found at the same position in all >tem sequences, then it should be removed.
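
A rough sketch of how the two conditions could be expressed in Python, assuming every sequence has already been joined into a single string and all of them have the same aligned length. The function names shared_gap_runs and drop_columns and the toy sequences are made up for illustration; min_run=21 encodes "more than 20":

```python
def shared_gap_runs(templates, min_run=21):
    """Return (start, end) column ranges, end exclusive, in which every
    template has '-' for at least `min_run` consecutive columns."""
    length = len(templates[0])
    runs, start = [], None
    # Iterate one column past the end so a run touching the end is closed.
    for col in range(length + 1):
        all_gap = col < length and all(t[col] == '-' for t in templates)
        if all_gap and start is None:
            start = col
        elif not all_gap and start is not None:
            if col - start >= min_run:
                runs.append((start, col))
            start = None
    return runs


def drop_columns(seq, runs):
    """Return `seq` without the columns covered by `runs`."""
    kept, prev = [], 0
    for start, end in runs:
        kept.append(seq[prev:start])
        prev = end
    kept.append(seq[prev:])
    return ''.join(kept)


# Toy alignment (hypothetical data): the first 22 columns are gaps in both templates.
templates = {"tem1": "-" * 25 + "ACDEFG", "tem2": "-" * 22 + "KLMNPQRST"}
query = "Q" * 22 + "WWWWWWWWW"

runs = shared_gap_runs(list(templates.values()))
trimmed = {name: drop_columns(seq, runs) for name, seq in templates.items()}
trimmed_query = drop_columns(query, runs)
```

The same column ranges are cut from the query whether or not it has gaps there, which is how I read Condition 1. Gaps that appear in only one template, or shared runs shorter than the threshold, are left alone.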

ADD REPLYlink modified 20 days ago • written 3 months ago by skjobs12340
1

The format of your given example is messed up. I cannot really understand how you want to remove the gap.

Could you just upload it as a figure?

ADD REPLYlink written 3 months ago by shoujun.gu290

Yes I can

please give me email ID?

You can ping on my email id

redacted (see below)

ADD REPLYlink modified 3 months ago by genomax39k • written 3 months ago by skjobs12340
1

You can upload to the forum. Or send to my email: redacted

ADD REPLYlink modified 3 months ago by genomax39k • written 3 months ago by shoujun.gu290

Personal email addresses are not shared on Biostars.

ADD REPLYlink written 3 months ago by genomax39k

Please use the "101" button in the edit window to apply code formatting to your text. Do not use spaces and the (") icon to format text. I tried to format your example above but can't make exact sense of what you have/need.

ADD REPLYlink written 3 months ago by genomax39k

I want to remove all positions containing a gap in a multiple alignment, using Perl, Python, or any other scripting language.
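
Taken literally, "all positions containing a gap" means dropping every alignment column in which at least one sequence has '-'. A minimal Python sketch of that reading, assuming the sequences are already single strings of equal length (the name strip_gapped_columns is just illustrative):

```python
def strip_gapped_columns(seqs):
    """Keep only the alignment columns in which no sequence has '-'."""
    # zip(*seqs) walks the alignment column by column.
    keep = [col for col in zip(*seqs) if '-' not in col]
    if not keep:
        return ['' for _ in seqs]
    # Transpose the kept columns back into one string per sequence.
    return [''.join(chars) for chars in zip(*keep)]


aligned = ["A-CD", "AB-D", "ABCD"]
print(strip_gapped_columns(aligned))
```

Note that this is stricter than removing only the gaps shared by all templates, so it may throw away far more of the alignment than intended.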

ADD REPLYlink written 3 months ago by skjobs12340

If you want to remove all gaps from the file:

sed -i 's/-//g' input.fasta

Otherwise, you need to know the positions, for specific gaps if you're not removing all of them. The OP is vague on how you intend to identify gaps to be removed.
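
One caveat with the sed one-liner: it applies to every line, so a '-' in a header line would be removed as well. A small Python sketch that only touches sequence lines (the function name ungap_fasta_lines is made up here); as with sed, the result is no longer an alignment, just ungapped sequences:

```python
def ungap_fasta_lines(lines):
    """Yield FASTA lines with '-' removed from sequence lines only."""
    for line in lines:
        if line.startswith('>'):
            yield line  # header: leave untouched
        else:
            yield line.replace('-', '')


example = [">seq-1\n", "--AC-G\n"]
print(''.join(ungap_fasta_lines(example)), end='')
```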

ADD REPLYlink written 3 months ago by st.ph.n1.9k

This script works well, but I want to remove gaps character by character, not whole lines at a time. Can you modify the program so that it works character by character and removes a gap only if it continues for more than 20-25 positions?
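
Working character by character mostly requires joining the wrapped sequence lines of each record first, so that a position means the same thing in every sequence. A minimal sketch of such a reader, assuming a plain FASTA layout with '>' header lines (read_fasta is an illustrative name, not an existing library function):

```python
def read_fasta(lines):
    """Return {name: sequence} with wrapped sequence lines joined."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith('>'):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: ''.join(parts) for n, parts in records.items()}


example = [">f1\n", "--A\n", "CD\n", "\n", ">f2\n", "EFG\n", "HI\n"]
print(read_fasta(example))
```

With a real file you would pass the open file object, e.g. read_fasta(open(path)), since iterating a file yields its lines.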

ADD REPLYlink written 21 days ago by skjobs12340

Could you please modify this script? It removes gaps line by line, but I would like to remove gaps character by character, and only if the gap is longer than 20. This should be possible with arrays: store every gap of one file in an array and match it against the other files. If the gap is found in all files, remove it; if it is found in only one of them, skip it.
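
The array idea described above could look roughly like this in Python: collect the gap spans of each template with the standard re module, then intersect the span lists. The helper names gap_runs and intersect and the toy sequences are only for illustration, and the sketch assumes the sequences are already joined into single aligned strings:

```python
import re


def gap_runs(seq):
    """Return a list of (start, end) spans, end exclusive, of '-' runs."""
    return [m.span() for m in re.finditer(r'-+', seq)]


def intersect(runs_a, runs_b):
    """Return the spans covered by both sorted span lists."""
    out, i, j = [], 0, 0
    while i < len(runs_a) and j < len(runs_b):
        start = max(runs_a[i][0], runs_b[j][0])
        end = min(runs_a[i][1], runs_b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever span finishes first.
        if runs_a[i][1] < runs_b[j][1]:
            i += 1
        else:
            j += 1
    return out


f1 = "-" * 30 + "GTTYGVC"
f2 = "-" * 10 + "MVHRQ" + "-" * 15 + "GTTYGMC"

shared = intersect(gap_runs(f1), gap_runs(f2))
# Keep only shared gaps longer than 20 columns.
long_shared = [(s, e) for s, e in shared if e - s > 20]
```

In this toy case the residues in f2 split the shared gap into two shorter spans, so nothing passes the length filter and nothing would be removed.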

ADD REPLYlink written 21 days ago by skjobs12340

You should try and give it a shot and post your version here for feedback.

ADD REPLYlink written 21 days ago by Sej Modha2.3k

import sys

inp=sys.argv[1]
output=sys.argv[2]

count=0
tempcount=0
with open(inp, 'r') as file:
    for line in file:
        if line[0]=='>':
            if count<tempcount:
                count=tempcount
            tempcount=0
        elif set(line)=={'-','\n'}:
            tempcount=tempcount+1
if count<tempcount:
    count=tempcount

with open(inp, 'r') as file2:
    lines=file2.read().split('>')[1:]

list=[]
i=count+1
for fa in lines:
    fasta=fa.split('\n')
    list=list+['>'+fasta[0]]+fasta[i:]
list=[i+'\n' for i in list if i]

with open(output, 'w') as out:
    out.writelines(list)

This script removes gaps (-) line by line, but I would like to remove gaps character by character. If a run of more than 20 consecutive gap characters is found at the same position in all input template sequences, remove it; if any template has no gap at that position, skip it. Example input:

f1 ------------------------------------------------------------------------------------------------------GTTYGVCSKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

f2 -----------------------------MVHRQWFFDLPLPWA-----------------------------------------GTTYGMCTEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

Query IEGKITNITTDSRCPTAVLPEEQDQNYVCKHTYVDRGWGNGCGLFGKGSLVTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

The output should look like this:

f1 --------------------------------GTTYGVCSKAFKFLGTPADTGHGTVVLELQYTGTDGPCKVPISSVASLNDLTPVGRLVTVNPFVSVA

f2 MVHRQWFFDLPLPWAGTTYGMCTEKFSFAKNPADTGHGTVVIELSYSGSDGPCKIPIVSVASLNDMTPVGRLVTVNPFVATS

f3 AVLPEEQDQNYVCKHTCAKFQCLEPIEGKVVQYENLKYTVIITVHTGDQHQVGNETQGVTAEITPQASTTEAILPEYGTLGGG

ADD REPLYlink modified 20 days ago • written 20 days ago by skjobs12340