Question

Help with domain sequence parsing from protein Fasta file

0

3.9 years ago

Siclari.jimmy • 0

Hi all, I am very new to bioinformatics and have just started using Biopython. Is there a way to extract part of each sequence from a large number of protein sequences, based on a domain? I have sequences for ~500 proteins and I know the location of the domain in question, but I need the sequence of just that domain plus about 50 residues on either side so I can do an alignment. The solution does not need to be in Biopython. I would really appreciate some help. Thank you.

sequence alignment • 906 views

score 0 · Answer 1 · 2020-05-19

0

3.9 years ago

vinaykusuma ▴ 10

Since you say you know the domain location, let's take residues 25-40 as an example.

You have ~500 protein sequences.

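Write code that does the following:

Opens the protein sequence file.
Stores the protein sequence in a string variable 's' (strip any header line present in the sequence file).
Slices the domain plus its flanks, i.e. s[max(0, 24-50) : 40+50], where 50 is the number of extra residues on each side. Python slices are 0-based and exclude the end index, so residues 25-40 alone are s[24:40]; max(0, ...) keeps the start from going negative when the domain sits near the N-terminus.
Saves the slice to a file.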

Repeat the above for each protein sequence using a 'for' loop.

Now that you know the steps, you can implement this in any language you know.
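A minimal Python sketch of these steps, assuming the domain coordinates are 1-based and inclusive (25-40 in the example) and that the same coordinates apply to every sequence. The function names `read_fasta`, `slice_with_flank` and `extract_fixed_domain` are made up for illustration; Biopython's `SeqIO.parse` could replace the small parser here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def slice_with_flank(seq, start, end, flank=50):
    """Return residues start..end (1-based, inclusive) plus up to `flank`
    residues on each side, clamped to the ends of the sequence."""
    lo = max(0, start - 1 - flank)   # convert to 0-based, never below 0
    hi = min(len(seq), end + flank)  # slice end is exclusive
    return seq[lo:hi]


def extract_fixed_domain(fasta_lines, start, end, flank=50):
    """Apply the same domain coordinates to every record (an assumption)."""
    return [(header, slice_with_flank(seq, start, end, flank))
            for header, seq in read_fasta(fasta_lines)]


# Usage with in-memory lines; with a real file, pass the open file handle
# instead, e.g. open("proteins.fasta") (hypothetical file name).
example = [">p1 example protein", "MKTAYIAKQRQISFVKSHFSRQ", "LEERLGLIEVQAPILSRVGDGT"]
for header, piece in extract_fixed_domain(example, 25, 40, flank=5):
    print(">" + header)
    print(piece)
```

If a flank runs past either end of the protein, the slice is simply shorter than domain length + 100, so it is worth checking the lengths before aligning.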

I hope this is what you needed.

0

Yes, this makes sense. Thank you. However, the domain does not occupy the same location in each protein. Sometimes it lies at residues 50-100, while in other proteins it may be at 100-150, and so on.

3.9 years ago by Siclari.jimmy • 0
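When the domain sits at a different position in each protein, one option is to look the coordinates up per sequence instead of hard-coding them. The sketch below assumes a tab-separated table with three columns (sequence ID, domain start, domain end; 1-based, inclusive, no header row) and that each ID matches the first word of the corresponding FASTA header. The table layout and the names `load_domains` and `extract_domains` are assumptions for illustration, not something given in the thread.

```python
import csv


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_domains(table_lines):
    """Map sequence ID -> (start, end) from tab-separated lines.
    Assumes no header row; lines starting with '#' are skipped."""
    domains = {}
    for row in csv.reader(table_lines, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        domains[row[0]] = (int(row[1]), int(row[2]))
    return domains


def extract_domains(fasta_lines, domains, flank=50):
    """Return (id, first_residue, last_residue, subsequence) for every
    record whose ID appears in `domains`; other records are skipped."""
    out = []
    for header, seq in read_fasta(fasta_lines):
        parts = header.split()
        if not parts or parts[0] not in domains:
            continue
        start, end = domains[parts[0]]
        lo = max(0, start - 1 - flank)   # 0-based slice start, clamped
        hi = min(len(seq), end + flank)  # exclusive slice end, clamped
        out.append((parts[0], lo + 1, hi, seq[lo:hi]))
    return out


# Usage with in-memory data; real input would be open file handles.
fasta = [">p1 kinase", "A" * 10 + "B" * 5 + "C" * 10, ">p2", "XXXXYYYYZZZZ"]
table = ["p1\t11\t15", "p2\t5\t8"]
for seq_id, first, last, piece in extract_domains(fasta, load_domains(table), flank=2):
    print(f">{seq_id}/{first}-{last}")
    print(piece)
```

Writing the extracted range into each header (for example `p1/9-17`) keeps track of where the piece came from once the sequences are aligned.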