Question

Expanding Multiple Sequences In Fasta File

11.4 years ago

david ▴ 10

I have a FASTA file with sequences of about 100 amino acids, and I need to expand them on both sides with the corresponding amino acids to get a FASTA file that contains the entire domain sequences instead of the 100-amino-acid stretches. I am trying to write a Biopython script that does the job, but I'm an absolute beginner and would be glad for any advice on how to do that. I figured that my script should first perform a BLAST search for all the sequences, take the top hit, and then somehow use it to expand the query sequence. However, I don't really know how to implement that (except for performing the BLAST search). Any help would be appreciated, thank you.

blast • 3.4k views

updated 11.4 years ago by SaggiSardar ▴ 20 • written 11.4 years ago by david ▴ 10

Could you define 'expand' and perhaps elaborate on the first sentence of your question?

11.4 years ago by steve ▴ 40

Sorry for the confusion. What I need is a dataset of the full sequences of homologous domains (as many as possible) from a given protein class. There is a publication where they use a library of amino acid stretches (from homologous proteins) as a training set for their algorithm. However, these stretches comprise only about 1/3 of the sequence I need. So I need a script that goes through the entire list of sequences, finds for each stretch the full amino acid sequence of the corresponding protein, and then adds to both ends of the query sequence the amino acids that are missing from the domain I'm interested in. As you suggest, my plan is to first define a strategy, step by step, and then implement it in Python. My idea was to (1) perform a BLAST search for each sequence, (2) take the top hit, (3) create an alignment of the two sequences, and (4) add a defined number of amino acids on both sides according to the alignment. Does that make any sense? Any hints on what steps 3 and 4 could look like? I hope it's clearer now.
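Steps 3 and 4 can be sketched in plain Python for the simplest case. This is only a minimal sketch: it assumes the 100-residue stretch occurs verbatim in the full-length top hit, so an exact string search can stand in for the alignment. The function name `extend_fragment`, the `flank` parameter, and the toy sequences are made up for illustration; if the stretch differs from the hit, a real alignment is needed to get the coordinates.

```python
def extend_fragment(fragment, full_seq, flank):
    """Return the fragment plus up to `flank` residues on each side.

    Assumption: `fragment` occurs verbatim in `full_seq`.
    Returns None when no exact match is found.
    """
    start = full_seq.find(fragment)
    if start == -1:
        return None  # no exact match; an alignment would be needed instead
    end = start + len(fragment)
    # Clamp the window so it never runs past either end of the protein.
    left = max(0, start - flank)
    right = min(len(full_seq), end + flank)
    return full_seq[left:right]


# Toy data (hypothetical sequences, far shorter than a real domain).
full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
frag = "QRQISFVKSH"
print(extend_fragment(frag, full, 5))
```

Clamping with `max` and `min` means a stretch near a terminus simply gets a shorter flank on that side instead of raising an error.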

11.4 years ago by david ▴ 10

score 2 · Answer 1 · 2012-11-28

Like Steve said above, your use of the word "expand" is vague and confusing, but it sounds like you have a FASTA file with 100 characters of protein data in each sequence, and you would like to write a Python script to automate your search. Since you are new to Python, I think you should first write out the steps of what you would like to do: formulate your algorithm.

For example:

1. BLAST query sequence to reference protein sequence
2. Query downstream X amino acids and upstream X amino acids
3. Return entire X length of what you are looking for
4. Repeat for other sequences

This list becomes the operations in your script. Figure out how to write each step in Python. I'm not sure exactly what you want to do, so it's hard to say how you should develop your algorithm here.
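The four steps above could map onto a loop like the following. This is a rough sketch under stated assumptions, not a finished tool: `read_fasta`, `expand_all`, and the `lookup` callable are hypothetical names, and `lookup` stands in for step 1 (in a real script it would wrap a BLAST search and return the full sequence of the top hit). The FASTA parsing is written out by hand to keep the example dependency-free; Biopython's `SeqIO.parse(handle, "fasta")` would normally do that job.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def expand_all(handle, lookup, flank):
    """Run the four steps for every record in the FASTA handle."""
    results = []
    for header, fragment in read_fasta(handle):      # step 4: repeat
        full_seq = lookup(fragment)                  # step 1: find the source protein
        if full_seq is None:
            continue
        start = full_seq.find(fragment)              # assumes an exact match
        if start == -1:
            continue
        left = max(0, start - flank)                 # step 2: X residues upstream
        right = min(len(full_seq), start + len(fragment) + flank)  # and downstream
        results.append((header, full_seq[left:right]))  # step 3: return the window
    return results


# Toy stand-in for a sequence database and a BLAST lookup (hypothetical data).
database = ["MACDEFGH", "WIKLMNPQRS"]
toy_lookup = lambda frag: next((s for s in database if frag in s), None)

fasta_text = ">frag1\nCDEF\n>frag2\nKLM\nNP\n"
expanded = expand_all(io.StringIO(fasta_text), toy_lookup, flank=1)
for name, seq in expanded:
    print(">" + name)
    print(seq)
```

Passing the lookup in as a function keeps the loop testable with toy data before any real search is wired in.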

score 0 · Answer 2 · 2012-11-29

11.4 years ago

SaggiSardar ▴ 20

Hi David,

What exactly is it that you're trying to do? Do you have a multiple sequence alignment (MSA) that has been cut to ca. 100 amino acids, and you wish to find the original sequences from which it was made? Or simply homologs of those sequences? What is the context of this problem? Do you feel that the ca. 100-residue alignment is a domain? If so, you could build an HMM using HMMER3 and extract all known homologs from a reference database (such as UniRef100 - all known non-redundant protein sequences). If not ... why have you aligned them?
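The HMMER3 route might look roughly like this when driven from Python. It is a sketch that assumes HMMER3 is installed and on the PATH; all file names (`stretches.sto`, `domain.hmm`, `uniref100.fasta`, `hits.domtbl`) are placeholders, and the helper names are made up. `hmmbuild` takes the output HMM file followed by the alignment, and `hmmsearch --domtblout` writes a per-domain table that includes the coordinates of each hit in the target sequence.

```python
import subprocess


def hmmbuild_cmd(msa_path, hmm_path):
    """Command that builds a profile HMM from a multiple sequence alignment."""
    return ["hmmbuild", hmm_path, msa_path]


def hmmsearch_cmd(hmm_path, db_path, domtbl_path):
    """Command that searches a sequence database with the profile and
    saves the per-domain hit table."""
    return ["hmmsearch", "--domtblout", domtbl_path, hmm_path, db_path]


def run(cmd):
    """Execute a command, raising CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


# Placeholder file names; nothing is executed until run() is called.
build = hmmbuild_cmd("stretches.sto", "domain.hmm")
search = hmmsearch_cmd("domain.hmm", "uniref100.fasta", "hits.domtbl")
print(" ".join(build))
print(" ".join(search))
# run(build); run(search)  # uncomment once the input files exist
```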

Hi SaggiSardar,

I have a list of 100-amino-acid stretches. I need to find the original sequence for each query, identify the domain that contains the stretch in this original sequence, and replace every stretch in the initial list with the domain sequence found this way. I'm now writing a Biopython script that, for each of the 100-amino-acid stretches, (1) performs a BLAST search, (2) retrieves the FASTA sequence of the best hit, (3) performs a ClustalW alignment of the query and the original sequence, (4) appends a defined number of amino acids on both sides of the stretch (so it becomes longer than the desired domain), and (5) crops the resulting sequence based on a ClustalW alignment with a reference domain. Does that make sense? Cheers
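The final cropping step could be prototyped without ClustalW. The sketch below swaps in a plain Smith-Waterman local alignment with a simple match/mismatch score in place of ClustalW and a substitution matrix such as BLOSUM62, so it only illustrates how alignment coordinates translate into a crop. The function names, scoring values, and toy sequences are all assumptions made for this example.

```python
def local_align_span(seq, ref, match=2, mismatch=-1, gap=-2):
    """Return (start, end) such that seq[start:end] is the region of `seq`
    covered by the best local alignment against `ref`."""
    rows, cols = len(seq) + 1, len(ref) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq[i - 1] == ref[j - 1] else mismatch
            s = max(0,
                    score[i - 1][j - 1] + sub,   # align the two residues
                    score[i - 1][j] + gap,       # gap in ref
                    score[i][j - 1] + gap)       # gap in seq
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq[i - 1] == ref[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return i, end


def crop_to_domain(extended, reference):
    """Crop the over-extended sequence to the part that aligns with the
    reference domain."""
    start, end = local_align_span(extended, reference)
    return extended[start:end]


# Toy data: the flanking W residues are the excess to be cropped away.
extended = "WWWWACDEFHIKWWWW"
print(crop_to_domain(extended, "ACDEFHIK"))
```

A local alignment trims the excess flanks by construction, but it can also clip poorly conserved domain termini; extending the reference slightly, or using a domain-aware tool, would be the cautious alternative.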

11.4 years ago by david ▴ 10