Question: Removing gaps from fasta sequence file
0
gravatar for mdsiddra
9 months ago by
mdsiddra20
mdsiddra20 wrote:

I am using python (3.6)/biopython(1.72) to read sequence files. I have an aligned sequence file in fasta format.

>Human
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSHLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Chimpanzee
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSRLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Dog
----------------------------MKLRVRLQKRTWPLDLPDAEPTL-RAHLSQALLPS-LPSSTDSEHSSLQN-NDPPSL
>Mouse
----------------------------MKLRVRLQKRTQPLEVPESEPTL-RAHLSQVLLPT-LPSSTDTEHSSLQD-NDQPSL

I need to remove the gaps '-' from the file and have the result file like this:

>Human
MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Chimpanzee
MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Dog
MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL
>Mouse
MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL

I have been trying this using python:

file_var = input ("Enter your file name: ")
sequences = []
for seq_record in SeqIO.parse(file_var, "fasta"):
    sequences.append(seq_record.seq)
print (sequences)
list2 = [] # list for extracting "-"
list3 = [] # list for sequence without "-"

for seq_record in alignment:
    if "-" in alignment:
        list2.append(seq_record)
    else:
        list3.append(seq_record)

But this outputs me the error:

    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.

Can I have any suggestions?? (P.S: I have been working with sequence file using windows OS, not linux)

biopython python • 896 views
ADD COMMENTlink modified 9 months ago by finswimmer11k • written 9 months ago by mdsiddra20
1

you should be using ungap function in biopython from here: https://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html. some thing like seq.ungap() assuming that seq object holds the sequence.

from Bio.Seq import Seq
from Bio import SeqIO
with open("out.fa", "w") as o:
    for record in SeqIO.parse("test.fa", "fasta"):
        record.seq = record.seq.ungap("-")
        SeqIO.write(record, o, "fasta")

output (input is from OP fasta):

$ cat out.fa 
>Human
MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Chimpanzee
MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Dog
MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL
>Mouse
MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011211k

Hello mdsiddra!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/4846/removing-sequences-gaps-from-fasta-file

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 9 months ago by Benn6.8k

Thankyou for bringing it to my notice, I will better be careful next time.

ADD REPLYlink written 9 months ago by mdsiddra20

Hi, you can also use the "Undo alignment" function of the SEDA software (https://www.sing-group.org/seda/manual/operations.html#undo-alignment). Regards.

ADD REPLYlink written 9 months ago by Hugo150
2
gravatar for finswimmer
9 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello mdsiddra,

there is more than one thing weired in your code:

file_var = input ("Enter your file name: ")

Of course it depends on what your program should do. But in bioinformatics it is more common to give inputs/outputs as parameters to the program.

for seq_record in alignment:

You've never defined alignment before.

if "-" in alignment:
        list2.append(seq_record)
    else:
        list3.append(seq_record)

If I assume you are thinking that in alignment there is the sequence string itself, than here you are checking whether there is a - in. But you never remove - from your sequence.

This should work better:

import sys
from Bio import SeqIO
from Bio.Seq import Seq

input_file = sys.argv[1]
output_file = sys.argv[2]

with open(output_file, "w") as output:
    for record in SeqIO.parse(input_file, "fasta"):
        record.seq = Seq(str(record.seq).replace("-", ""))
        SeqIO.write(record, output, "fasta")

Run it like this:

$ remove.py input.fasta output.fasta

fin swimmer

ADD COMMENTlink modified 9 months ago • written 9 months ago by finswimmer11k

@finswimmer #1: I will modify the input file to be as argument. But whenever I use this ;

input_file = sys.argv[1]
output_file = sys.argv[2]

I come across with error like this:

    input_file = sys.argv[1]
IndexError: list index out of range

#2: the "alignment" comes from;

alignment = AlignIO.read(open(file_var), "fasta")
for seq_record in alignment:
    sequences.append(seq_record.seq)
print (sequences)

#3: Would it be fine to read sequence file using AlignIO function of Biopython??

ADD REPLYlink written 9 months ago by mdsiddra20
2

#3: with alignIO.read:

with open("out.fa", "w") as o:
    for record in AlignIO.read("test.fa", "fasta"):
        record.seq = record.seq.ungap("-")
        SeqIO.write(record, o, "fasta")
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011211k
1

I guess first error is due to building code within IDE, without input. For eg. if you build the code in ST3, one would get the error. This would go away if you run the script with input and output from the shell. As fin or some other star here suggested, please do not use input in python code. @ mdsiddra

ADD REPLYlink modified 9 months ago • written 9 months ago by cpad011211k
1

mdsiddra you need to provide the arguments on the commandline, as finswimmer showed you. This was discussed in your last question.

If you want to provide the arguments 'interactively' you need to replace the lines:

input = sys.argv[1]
output = sys.argv[2]

with

input = str(input("Enter input file: "))
output = str(input("Enter output file: "))

This error:

input_file = sys.argv[1]
IndexError: list index out of range

Means that the command wasn't run correctly, as it didn't find the right number of commandline arguments.

ADD REPLYlink modified 9 months ago • written 9 months ago by jrj.healey12k

Yeah okay, I must follow the suggestions given. Thankyou for your kind response.

ADD REPLYlink written 9 months ago by mdsiddra20
1

To #1 and #3 other people already posted solution.

Concerning #2: You haven't posted this line in your initial post. Please always post your complete code. Or at least in example that we can run and see the same behaviour as you. Otherwise it's hard to help.

fin swimmer

ADD REPLYlink written 9 months ago by finswimmer11k

Alright, I will take care about these mistakes. Thankyou so much for giving me this much guidance.

ADD REPLYlink written 9 months ago by mdsiddra20

mdsiddra You have not been accepting answers to your questions. Please go back through your post history and accept any answers you like (you can accept more than one), by clicking the tick symbol on the left hand side - I have done some for you.

ADD REPLYlink written 9 months ago by jrj.healey12k

I had been upvoting every answer to my question except one that I did'nt accept, for which you have made me attentive. Thankyou for this kind guidance.

ADD REPLYlink written 9 months ago by mdsiddra20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1390 users visited in the last hour