Hi everyone. I have a multi-FASTA file that contains roughly 150k sequences. The first sequence is my reference, and the rest are sequences that carry some mutations.
I'm working with single-point mutations, and I plan to find them by comparing each sequence with the reference position by position. So I need to remove the sequences that have deletions, because the shifted positions would look like single-point mutations. Does anyone have an idea how to do that?
The deletion mutations are like this:
There is a deletion in the selected area, so the whole sequence after it is shifted. There are many mutations like that, and I have no idea how to detect and remove them.
I don't know how to align that many sequences, but if I could, that would solve my problem too.
Thanks in advance for your opinions and help.
Hello, this could be done using the Python pandas library, as data engineers usually do in the preprocessing stage of a machine-learning project. pandas is part of the Python data ecosystem and is one of the most widely used tools for data manipulation today. Feel free to set up your Python development environment and follow some of the many pandas tutorials available online. I hope this helps!
There are already utilities in the Python ecosystem for working with FASTA files, such as pyfaidx. As much as I do like pandas, I'm not following the need for it here. Question for Mustafa: are the ones with deletions shorter, and the others the same length as your reference, so you could filter on sequence length?
All of them are the same length. I think there are also some insertions in the ones with deletions. This is where my problem starts :') Wayne
So the one on the 6th line in your image ends up being the same length and wouldn't get eliminated by the length filter. But it has more differences than a single-point mutation, so why not follow the length filter with a second filtering step that collects the counts of every amino acid and eliminates the sequence if one or more of the counts differs from the reference by more than one (sketched below)? Alternatively, you don't need to align that many sequences with a multiple sequence alignment (MSA), as you seem to suggest by your statement "I don't know how to align that many sequences, but if I could, that would solve my problem too." You just do a pairwise alignment to the reference 150K times and toss out both (a) ones where the aligned length isn't the same and (b) ones where more than one amino acid mismatches among those with the same aligned length. Biopython has pairwise alignment built in.
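For the count idea, something like this is what I have in mind (just a sketch; the file name `sequences.fasta` is a placeholder, and I'm assuming the first record in the file is your reference):

```python
from collections import Counter
from Bio import SeqIO

# Placeholder file name; the first record is the reference, the rest are test sequences.
records = list(SeqIO.parse("sequences.fasta", "fasta"))
reference = str(records[0].seq)
others = [str(r.seq) for r in records[1:]]

ref_counts = Counter(reference)

def passes_count_filter(seq):
    """True if no amino-acid count differs from the reference count by more than one."""
    counts = Counter(seq)
    return all(abs(counts[aa] - ref_counts[aa]) <= 1
               for aa in set(counts) | set(ref_counts))

# Length filter first, then the count comparison as the second step.
kept = [s for s in others if len(s) == len(reference) and passes_count_filter(s)]
```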
Are you saying that I should compare the total amino acid counts of each sequence with my reference sequence? If so, wouldn't that mean losing the sequences that contain more than one single-point mutation? We can't know how many count differences a sequence would have.
EDIT: Oh... I didn't know that. This could solve my problem. I'm going to try the pairwise alignment. Thank you very much!!
I thought you were looking to limit it to single-point mutations, and that is why I suggested the counting as a second layer. I think assessing the pairwise alignment is better, though, as you could have combinations that keep the count totals within one yet have substitutions at multiple locations.
I tried to understand the logic of Biopython's built-in pairwise alignment. I prepared a test FASTA file to study the code. There are 40 sequences in the file, but I keep getting 64 different alignments with the following code:
What am I doing wrong?
I think it gives you alternative alignments, as there are usually a number of ways to make a viable alignment when gaps are involved. Also, I'm concerned about your code: you are only looking at the alignments for the last pairwise comparison? (Unless you are printing them as you go along and just not including that here?) You'd want to collect them for each pairing. One idea would be to make a list for your 40 sequences; a dictionary will help you keep track easily by assigning each a number corresponding to the 40 sequences. Here is an idea for reworking your final block:
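Something along these lines (only a sketch, since I don't have your exact code; I'm assuming Bio.pairwise2 and a placeholder file name `test.fasta` here, and the variable names are just suggestions):

```python
from Bio import SeqIO, pairwise2

# Parse the test FASTA file; the file name is a placeholder.
records = list(SeqIO.parse("test.fasta", "fasta"))
reference = str(records[0].seq)                      # first record as the reference
test_sequences = [str(r.seq) for r in records[1:]]   # the remaining test sequences

# Collect the alignments for every pairing, keyed by a 1-based sequence number.
alignments_per_sequence = {}
for seq_index, seq in enumerate(test_sequences):
    seq_number = seq_index + 1
    alignments_per_sequence[seq_number] = pairwise2.align.globalxx(reference, seq)
```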
(Note: `seq_number = seq_index + 1` means the first sequence will be numbered one. Because Python is zero-indexed, it would by default be the zero-th one if `seq_index` were used directly.)

Now examine `alignments_per_sequence`. You can look at the results for, say, the third sequence like so: `alignments_per_sequence[3]`.

For the cases where you are getting multiple alignments for a sequence, you'll have to figure out which one to use, or just take the first one. My quick assessment was that you take the top scorers and see if there is any where there are no gaps; if all of them have gaps, then you mark that sequence as one to toss out. It seems those with no gaps would also have the `end` value match the length of the reference sequence. So you have some options on what to do once you have the alignments; you'll have to assess what gets you closest to what you want when deciding to keep or toss each test sequence.

Minor note: looping on the reference sequence may cause you issues later. I'm not seeing why you aren't just taking the first sequence there.
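Something along these lines could work for the keep/toss step, building on the sketch above (again only a sketch; I'm assuming the pairwise2-style alignments, which carry `score`, `seqA`, `seqB`, and `end` fields, plus the `alignments_per_sequence` and `reference` names from that sketch):

```python
def keep_sequence(alignments, reference):
    """Keep a test sequence only if one of its top-scoring alignments
    has no gaps and spans the full reference length."""
    best_score = max(a.score for a in alignments)
    top_scorers = [a for a in alignments if a.score == best_score]
    return any("-" not in a.seqA and "-" not in a.seqB and a.end == len(reference)
               for a in top_scorers)

# Sequence numbers of the ones to keep; everything else gets tossed.
kept_numbers = [number for number, alignments in alignments_per_sequence.items()
                if keep_sequence(alignments, reference)]
```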
Just noticed here that you basically reworked that question and posted it as a new question, which is fine. However, in the future if you ask a question in a thread and then make a separate question that is a variant of that, update the thread to point at the new question so that you don't waste people's time answering in two locations.
You are right. I wasn't thinking clearly because I was in a tight situation. I sincerely apologize. Sorry for taking your time, and thank you for your help.
Is there a reason you're pushing Python in your various comments by giving generic Python advice instead of addressing the actual questions?
EDIT: I see that you've been pushing your pages on Medium wherever you can. Instead of doing that, create a Blog-type post pointing to your stuff on Medium, and help users address their actual questions.
I know about pandas, but I have no idea how to use it. Could you help me work out an approach?
Hello, could you explain again what you have and what you would like to do? It's still not completely clear to me. Thanks!
I managed to finish the code I needed; it's running right now. Thank you for asking, though. Have a nice day!