I'm comparing two different strains of COVID-19 using Python. I did the alignment using ClustalW2. Now I want to detect the dissimilar regions between the sequences and extract them using Python. Does anyone know how to do this?
This question is too general. Give more details on your input and desired output. Also, what have you tried so far?
@liorglic makes a lot of good points. I have some of the pieces to make an analysis workflow that would do what you need, I think. However, it's hard to advise or help without specifics.
For example, I have a script that may handle an essential part of the first step you describe. It uses the consensus symbols you often get when you do an alignment, for example with MUSCLE, and uses those to categorize the positions. See the description of my script categorize_residues_based_on_conservation_relative_consensus_line.py here. (The script code is available from the top of that page.) In the description, I also point to an example of using it to classify positions and then using those results with some other code to make some visualization scripts. There are many formats in which alignment data can be expressed. I like clustal format for a lot of things, but that may not be what you have. If it is what you have and it lacks the line of symbols indicating conservation, I have scripts that can add that line back; see calculate_cons_for_clustal_nucleic.py and calculate_cons_for_clustal_protein.py, described and available in that same sub-repo.
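To make the idea of categorizing positions from the consensus line concrete, here is a minimal sketch. It is not the script mentioned above; the function name `categorize_columns` and the toy sequences are made up for illustration. It assumes the usual clustal conventions, where `*` marks an identical column, `:` and `.` mark strongly and weakly similar columns (protein alignments only), and a space marks a non-conserved column. It also assumes you have already joined the consensus lines of all blocks into one string.

```python
def categorize_columns(consensus_line, alignment_length=None):
    """Group 0-based alignment columns by their clustal conservation symbol.

    Hypothetical helper for illustration only. Trailing spaces are often
    stripped from consensus lines, so pass alignment_length to pad the line
    back out to the full width of the alignment.
    """
    if alignment_length is not None:
        consensus_line = consensus_line.ljust(alignment_length)
    labels = {"*": "identical", ":": "strongly_similar", ".": "weakly_similar"}
    categories = {
        "identical": [],
        "strongly_similar": [],
        "weakly_similar": [],
        "not_conserved": [],
    }
    for column, symbol in enumerate(consensus_line):
        # Anything that is not a known symbol (normally a space) is not conserved.
        categories[labels.get(symbol, "not_conserved")].append(column)
    return categories


# Made-up two-sequence nucleotide alignment and its consensus line.
seq_a = "ATGCTAGGTA"
seq_b = "ATGAT-GGCA"
consensus = "*** * ** *"

result = categorize_columns(consensus, alignment_length=len(seq_a))
print(result["not_conserved"])
```

The column indices are positions in the alignment, not in either ungapped sequence, so they still need to be mapped back to sequence coordinates if that is what you want to report.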
Once you have the 'not_conserved' positions categorized via the script categorize_residues_based_on_conservation_relative_consensus_line.py, you can use some Python code to group those positions into contiguous spans and then use the spans to extract the dissimilar regions. On that same page, I have a script called extract_regions_from_clustal_alignment.py that may be useful for this last step. I'll also add that Biopython has methods to extract sequences from alignments in various formats, so you may find that easier to use for the extraction step. There's also a script I have called MSA_to_corresponding_residue_numbers.py that may be good for getting the corresponding positions from the alignment before the extraction step. That way you only have to determine the spans (regions) of interest for one of the sequences in each pair.
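As a rough sketch of the grouping and extraction step, something like the following could work. Again, this is not the code from the scripts named above; `positions_to_spans`, `extract_regions`, and the toy sequences are hypothetical. It assumes you already hold the two aligned (gapped) sequences as plain strings of equal length, which you could get, for instance, by reading the alignment with Biopython's `AlignIO.read` and converting each record's `seq` to `str`.

```python
def positions_to_spans(positions):
    """Merge 0-based column indices into inclusive (start, end) spans."""
    spans = []
    for pos in sorted(set(positions)):
        if spans and pos == spans[-1][1] + 1:
            # Adjacent to the previous span, so extend it.
            spans[-1] = (spans[-1][0], pos)
        else:
            spans.append((pos, pos))
    return spans


def extract_regions(aligned_seq, spans, keep_gaps=False):
    """Slice each span out of a gapped sequence, dropping '-' unless asked."""
    regions = []
    for start, end in spans:
        fragment = aligned_seq[start:end + 1]
        if not keep_gaps:
            fragment = fragment.replace("-", "")
        regions.append(fragment)
    return regions


# Made-up aligned pair; in practice these come from your alignment file.
seq_a = "ATGCCTAGGTA"
seq_b = "ATGAAT-GGCA"

# For a pairwise alignment, the non-conserved columns are simply the
# columns where the two sequences differ (including gap columns).
differing = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
spans = positions_to_spans(differing)
regions_a = extract_regions(seq_a, spans)
regions_b = extract_regions(seq_b, spans)

for span, frag_a, frag_b in zip(spans, regions_a, regions_b):
    print(span, repr(frag_a), repr(frag_b))
```

The spans here are alignment columns. A region that is purely a gap in one sequence comes back as an empty string for that sequence, which is one way to spot insertions and deletions.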
My profile on Biostars links to my GitHub profile, which lists my email. Feel free to contact me if you would like help putting some of the pieces together and don't feel like digging into it yourself right now.
Another thing to ponder is that you say you used ClustalW2. One of the biggest resources for tools such as this lists that tool as retired and only useful for three or more sequences, see here. Maybe you meant that your alignment output is in clustal format?