Question: Help with domain sequence parsing from protein FASTA file
6 weeks ago
Siclari.jimmy0 wrote:

Hi all, I am very new to bioinformatics and have just started using Biopython. I am looking for a way to extract part of each sequence from a large number of protein sequences, based on the domain. I have sequences for ~500 proteins and I know the location of the domain in question, but I need the sequence for just that domain plus about 50 residues on both sides so I can do an alignment. The solution does not need to be in Biopython. I just really need some help. Thank you.

6 weeks ago
vnyksm10 wrote:

Since you say you have the domain location, let's take the location of your domain to be residues 25-40.

You have ~500 protein sequences.

You can write code that:

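  • Opens the protein sequence file.

  • Stores the protein sequence in a string variable 's' (get rid of any header present in the sequence file).

  • Slices the required part, i.e. s[max(0, 24-50) : 40+50], where 50 is the number of flanking residues on each side. Residue 25 is index 24 with 0-based indexing, the slice end is exclusive, and max() keeps the start from going below 0.

  • Saves the slice to a file.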

Repeat the above process for each protein sequence using a 'for' loop.

Now that you know the steps, you can easily implement this in any language you know.
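The steps above can be sketched in plain Python. This is only a minimal sketch, assuming 1-based inclusive domain coordinates and a standard FASTA layout; the helper names read_fasta and slice_with_flank and the file names in the usage comment are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def slice_with_flank(seq, start, end, flank=50):
    """Return residues start..end (1-based, inclusive) plus up to `flank` residues per side."""
    lo = max(0, start - 1 - flank)   # clamp so the slice never starts before residue 1
    hi = min(len(seq), end + flank)  # clamp at the end of the sequence
    return seq[lo:hi]


# Example usage with hypothetical file names:
# with open("proteins.fasta") as fin, open("domains.fasta", "w") as fout:
#     for header, seq in read_fasta(fin):
#         fout.write(f">{header}\n{slice_with_flank(seq, 25, 40)}\n")
```

Clamping at both ends means a domain near a terminus simply gets fewer flanking residues instead of raising an error or wrapping around.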

I hope this is what you needed.

Yes, this makes sense. Thank you. However, the domain does not occupy the same location in each protein. Sometimes it lies at residues 50-100, while in others it may be at 100-150, and so on.
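When the position differs per protein, one option is to look up each protein's coordinates by its sequence ID before slicing. The following is a rough sketch, assuming the coordinates are available as tab-separated lines of ID, start, end (1-based, inclusive); that layout and the function names load_domain_coords and extract_domains are assumptions for illustration, not something given in this thread.

```python
import csv


def load_domain_coords(lines):
    """Map sequence ID -> (start, end) from tab-separated lines: id, start, end."""
    coords = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        coords[row[0]] = (int(row[1]), int(row[2]))
    return coords


def extract_domains(records, coords, flank=50):
    """Yield (seq_id, domain plus flanks) for every record that has coordinates.

    `records` is any iterable of (seq_id, sequence_string) pairs.
    """
    for seq_id, seq in records:
        if seq_id not in coords:
            continue  # no known domain location for this protein
        start, end = coords[seq_id]
        lo = max(0, start - 1 - flank)   # 1-based start -> 0-based index, clamped
        hi = min(len(seq), end + flank)  # slice end is exclusive, clamped
        yield seq_id, seq[lo:hi]
```

With Biopython, the records could be supplied as `((rec.id, str(rec.seq)) for rec in SeqIO.parse("proteins.fasta", "fasta"))`, where the file name is a placeholder. The IDs in the coordinate table must match the FASTA record IDs exactly.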

