Removing gaps from fasta sequence file
1
0
Entering edit mode
3.3 years ago
mdsiddra ▴ 30

I am using python (3.6)/biopython(1.72) to read sequence files. I have an aligned sequence file in fasta format.

>Human
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSHLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Chimpanzee
----------------------------MRLRVRLLKRTWPLEVPETEPTL-RSRLRQSLLCT-IPSSTDSEHSSLQN-NEQPSL
>Dog
----------------------------MKLRVRLQKRTWPLDLPDAEPTL-RAHLSQALLPS-LPSSTDSEHSSLQN-NDPPSL
>Mouse
----------------------------MKLRVRLQKRTQPLEVPESEPTL-RAHLSQVLLPT-LPSSTDTEHSSLQD-NDQPSL


I need to remove the gaps '-' from the file and have the result file like this:

>Human
MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Chimpanzee
MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL
>Dog
MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL
>Mouse
MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL


I have been trying this using python:

file_var = input ("Enter your file name: ")
sequences = []
for seq_record in SeqIO.parse(file_var, "fasta"):
sequences.append(seq_record.seq)
print (sequences)
list2 = [] # list for extracting "-"
list3 = [] # list for sequence without "-"

for seq_record in alignment:
if "-" in alignment:
list2.append(seq_record)
else:
list3.append(seq_record)


But this outputs me the error:

    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.


Can I have any suggestions?? (P.S: I have been working with sequence file using windows OS, not linux)

Python biopython • 3.6k views
1
Entering edit mode

you should be using ungap function in biopython from here: https://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html. some thing like seq.ungap() assuming that seq object holds the sequence.

from Bio.Seq import Seq
from Bio import SeqIO
with open("out.fa", "w") as o:
for record in SeqIO.parse("test.fa", "fasta"):
record.seq = record.seq.ungap("-")
SeqIO.write(record, o, "fasta")


output (input is from OP fasta):

cat out.fa >Human MRLRVRLLKRTWPLEVPETEPTLRSHLRQSLLCTIPSSTDSEHSSLQNNEQPSL >Chimpanzee MRLRVRLLKRTWPLEVPETEPTLRSRLRQSLLCTIPSSTDSEHSSLQNNEQPSL >Dog MKLRVRLQKRTWPLDLPDAEPTLRAHLSQALLPSLPSSTDSEHSSLQNNDPPSL >Mouse MKLRVRLQKRTQPLEVPESEPTLRAHLSQVLLPTLPSSTDTEHSSLQDNDQPSL  ADD REPLY 0 Entering edit mode Hello mdsiddra! It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/4846/removing-sequences-gaps-from-fasta-file This is typically not recommended as it runs the risk of annoying people in both communities. ADD REPLY 0 Entering edit mode Thankyou for bringing it to my notice, I will better be careful next time. ADD REPLY 0 Entering edit mode Hi, you can also use the "Undo alignment" function of the SEDA software (https://www.sing-group.org/seda/manual/operations.html#undo-alignment). Regards. ADD REPLY 0 Entering edit mode hello I want to remove columns that contain gaps in multiple sequence alignment like human ATGC-GTGC--- campanz GT-A-TTGC--- output: human ATCGTGC campanz GTATTGC ADD REPLY 0 Entering edit mode OK - and what about the solutions provided here don't resolve that for you? ADD REPLY 2 Entering edit mode 3.3 years ago Hello mdsiddra, there is more than one thing weired in your code: file_var = input ("Enter your file name: ")  Of course it depends on what your program should do. But in bioinformatics it is more common to give inputs/outputs as parameters to the program. for seq_record in alignment:  You've never defined alignment before. if "-" in alignment: list2.append(seq_record) else: list3.append(seq_record)  If I assume you are thinking that in alignment there is the sequence string itself, than here you are checking whether there is a - in. But you never remove - from your sequence. This should work better: import sys from Bio import SeqIO from Bio.Seq import Seq input_file = sys.argv[1] output_file = sys.argv[2] with open(output_file, "w") as output: for record in SeqIO.parse(input_file, "fasta"): record.seq = Seq(str(record.seq).replace("-", "")) SeqIO.write(record, output, "fasta")  Run it like this:  remove.py input.fasta output.fasta


fin swimmer

0
Entering edit mode

@finswimmer #1: I will modify the input file to be as argument. But whenever I use this ;

input_file = sys.argv[1]
output_file = sys.argv[2]


I come across with error like this:

    input_file = sys.argv[1]
IndexError: list index out of range


#2: the "alignment" comes from;

alignment = AlignIO.read(open(file_var), "fasta")
for seq_record in alignment:
sequences.append(seq_record.seq)
print (sequences)


#3: Would it be fine to read sequence file using AlignIO function of Biopython??

2
Entering edit mode

with open("out.fa", "w") as o:
for record in AlignIO.read("test.fa", "fasta"):
record.seq = record.seq.ungap("-")
SeqIO.write(record, o, "fasta")

1
Entering edit mode

I guess first error is due to building code within IDE, without input. For eg. if you build the code in ST3, one would get the error. This would go away if you run the script with input and output from the shell. As fin or some other star here suggested, please do not use input in python code. @ mdsiddra

1
Entering edit mode

mdsiddra you need to provide the arguments on the commandline, as finswimmer showed you. This was discussed in your last question.

If you want to provide the arguments 'interactively' you need to replace the lines:

input = sys.argv[1]
output = sys.argv[2]


with

input = str(input("Enter input file: "))
output = str(input("Enter output file: "))


This error:

input_file = sys.argv[1]
IndexError: list index out of range


Means that the command wasn't run correctly, as it didn't find the right number of commandline arguments.

0
Entering edit mode

Yeah okay, I must follow the suggestions given. Thankyou for your kind response.

1
Entering edit mode

To #1 and #3 other people already posted solution.

Concerning #2: You haven't posted this line in your initial post. Please always post your complete code. Or at least in example that we can run and see the same behaviour as you. Otherwise it's hard to help.

fin swimmer

0
Entering edit mode

Alright, I will take care about these mistakes. Thankyou so much for giving me this much guidance.

0
Entering edit mode

mdsiddra You have not been accepting answers to your questions. Please go back through your post history and accept any answers you like (you can accept more than one), by clicking the tick symbol on the left hand side - I have done some for you.

0
Entering edit mode

I had been upvoting every answer to my question except one that I did'nt accept, for which you have made me attentive. Thankyou for this kind guidance.