I have a FASTA file with sequences of about 100 amino acids, and I need to expand them on both sides with the corresponding amino acids to get a FASTA file that contains the entire domain sequences instead of the 100 amino acid stretches. I am trying to write a Biopython script that does the job, but I'm an absolute beginner and would be glad for any advice on how to do that. I figured that my script should first perform a BLAST search for each sequence, take the top hit, and then somehow use it to expand the query sequence. However, I don't really know how to implement that (except for performing the BLAST search). Any help would be appreciated, thank you.
Like Steve said above, your use of the word "expand" is vague and confusing, but it sounds like you have a FASTA file with about 100 residues of protein sequence in each entry, and you would like to write a Python script to automate your search. Since you are new to Python, I think you should first write out the steps of what you would like to do: formulate your algorithm.
For example:
1. BLAST the query sequence against a reference protein database and take the top hit.
2. Take X amino acids upstream and X amino acids downstream of the matched region in the hit.
3. Return the full extended sequence.
4. Repeat for the other sequences.
This list becomes the operations in your script. Figure out how to write each step in Python. I'm not sure exactly what you want to do, so it's hard to say how you should develop your algorithm beyond that.
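To make the steps above concrete, here is a minimal sketch of the extension step in plain Python. It assumes you already have the full-length sequence of the top hit and the 1-based start and end of the match on that sequence (Biopython's BLAST XML parser exposes these on each HSP as `sbjct_start` and `sbjct_end`). The function names `extend_region` and `expand_all` are made up for illustration, and the exact substring lookup in `expand_all` is only a stand-in for the BLAST search, not a replacement for it.

```python
def extend_region(subject_seq, hit_start, hit_end, flank):
    """Return the matched region of subject_seq widened by `flank` residues per side.

    hit_start and hit_end are 1-based and inclusive, the way BLAST reports them.
    The result is clipped at the ends of the subject sequence.
    """
    start = max(0, hit_start - 1 - flank)          # convert to a 0-based slice start
    end = min(len(subject_seq), hit_end + flank)   # slice end is exclusive
    return subject_seq[start:end]


def expand_all(fragments, subject_seq, flank):
    """Extend every fragment found in subject_seq; skip fragments with no match.

    `fragments` maps a sequence name to its ~100-residue stretch.
    str.find() is an assumption standing in for a real BLAST hit.
    """
    expanded = {}
    for name, fragment in fragments.items():
        pos = subject_seq.find(fragment)
        if pos == -1:
            continue  # no exact match in this subject
        expanded[name] = extend_region(
            subject_seq, pos + 1, pos + len(fragment), flank
        )
    return expanded


if __name__ == "__main__":
    subject = "MKTAYIAKQRQISFVKSHFSRQ"
    print(expand_all({"frag1": "IAKQR"}, subject, flank=3))
```

Note that a fixed flank of X residues only recovers the whole domain if you already know how far the domain reaches on each side; otherwise you need domain boundaries from somewhere else.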
What exactly is it that you're trying to do? You have a multiple sequence alignment (MSA) that has been cut to ca. 100 amino acids and you wish to find the original sequences from which it was made? Or simply homologs of those sequences? What is the context of this problem? Do you feel that the ca. 100 residue alignment is a domain? If so, you could build a profile HMM using HMMER3 and extract all known homologs from a reference database (such as UniRef100, which contains all known non-redundant protein sequences). If not ... why have you aligned them?
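If the HMMER3 route fits your case, the two command lines involved can be assembled from Python roughly like this. This is only a sketch: the helper name `hmmer_commands` and all file names are hypothetical, and it assumes `hmmbuild` and `hmmsearch` are installed and on your PATH.

```python
def hmmer_commands(alignment, database, prefix="domain"):
    """Build the hmmbuild and hmmsearch command lines as argument lists.

    alignment: path to the MSA of the ~100-residue stretches (hypothetical file).
    database:  path to a FASTA sequence database, e.g. a UniRef100 download.
    prefix:    made-up prefix for the output files.
    """
    hmm_file = f"{prefix}.hmm"
    # hmmbuild <hmmfile_out> <msafile>
    build = ["hmmbuild", hmm_file, alignment]
    # hmmsearch --tblout <table> <hmmfile> <seqdb>
    search = ["hmmsearch", "--tblout", f"{prefix}_hits.tbl", hmm_file, database]
    return build, search


if __name__ == "__main__":
    build_cmd, search_cmd = hmmer_commands("fragments.aln", "uniref100.fasta")
    print(" ".join(build_cmd))
    print(" ".join(search_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to actually run the tool; the hits table then tells you which database sequences to fetch.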