Question

Bioinformatics questions sequence

3.2 years ago

anapaolavi • 0

YKYRYLRHGKLRPFERDI
YKYRYLKHGKLRPFERDI
YKYRYLXHGKLRPFERDI
YKYRSLRHGKLRPFERDI
YKYRCLRHGKLRPFERDI
YKFRYLRHGKLRPFERDI
YKHRYLRHGKLRPFERDI
YKXRYLRHGKLRPFERDI
YLYRWVRRSKLNPYERDL
FYYRLFRHGKIKPYERDI
FFYRRFRHGKIKPYGRDL
FYYRLFRHGKIKPYGRDL
YYYRIWRSEKLRPFERDI
YYYRSHRKTKLKPFERDL
YFYRSHRSTKLKPFERDL
YFYRSHRSSKLKPFERDL
YYYRSSRKTKLKPFERDL
YYYRSYRKEKLKPFERDL

1. Write a regular expression that describes the alignment in the box.
2. Find 5 protein sequences from different organisms or strains that contain the pattern described by the regular expression from Q1. List the ID, name, size, source, and function of each protein.
3. Find 2 proteins with known structures that contain the pattern described by the regular expression from Q1. List the IDs of found protein structures.
4. Build a multiple sequence alignment for all protein sequences from Q2 and Q3.
5. Identify the conserved regions in the alignment from Q4 and explore their biological significance.
6. Evaluate statistical parameters of the regular expression from Q1 based on similar expressions in the Prosite database.

sequence • 1.1k views

ADD COMMENT • link updated 3.2 years ago by Mensur Dlakic ★ 27k • written 3.2 years ago by anapaolavi • 0

Please change your title, "Bioinformatics questions sequence". Of course it is a question about bioinformatics...

ADD REPLY • link 3.2 years ago by Pierre Lindenbaum 161k

This looks like homework. What have you tried so far?

ADD REPLY • link 3.2 years ago by Pierre Lindenbaum 161k

not sure where to start

ADD REPLY • link 3.2 years ago by anapaolavi • 0

The first question asks you to write a regular expression that captures those sequences. Depending on the language you are writing this in, there will be regex tutorials that you should go through.

ADD REPLY • link 3.2 years ago by rpolicastro 13k

would this be correct? regex = ([A-Z])+

ADD REPLY • link 3.2 years ago by anapaolavi • 0

Yes, but it looks like it's an amino-acid alphabet (not A to Z) with a specific length...
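
To illustrate the difference, here is a minimal Python sketch comparing the attempted `[A-Z]+` with a stricter pattern. The stricter pattern is only an assumption for illustration: it uses the 20 standard amino-acid letters plus `X` (which appears in the alignment) and a fixed length of 18, the width of the posted alignment. The test strings `AA` and `BJOZU` are made up.

```python
import re

# The attempted pattern: any run of uppercase letters.
loose = re.compile(r"[A-Z]+")

# Assumption: 20 standard amino acids plus X, exactly 18 residues long
# (the width of the alignment in the question).
strict = re.compile(r"[ACDEFGHIKLMNPQRSTVWYX]{18}")

candidates = ["AA", "BJOZU", "YKYRYLRHGKLRPFERDI", "YKYRYLXHGKLRPFERDI"]

# fullmatch requires the whole string to match, not just a part of it.
loose_hits = [s for s in candidates if loose.fullmatch(s)]
strict_hits = [s for s in candidates if strict.fullmatch(s)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The stricter pattern still accepts any 18-residue peptide, so it is only a first step toward describing this particular alignment.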

ADD REPLY • link 3.2 years ago by Pierre Lindenbaum 161k

what would you recommend then?

ADD REPLY • link 3.2 years ago by anapaolavi • 0

would this be correct? regex = ([A-Z])+

Well, yes, but it also covers the sequences A and AA and AAA and every other sequence of alphabetical uppercase characters that is conceivable (including all sequences that contain non-amino acid letters).

You need to find one that covers (exactly) the given alignment. So best to look at the individual columns of the alignment and see what amino acids they're composed of. This should then give you an idea of how to build the regex.
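
The column-by-column idea can be sketched in Python as follows. This is one possible approach, not necessarily what the assignment expects: the helper name `column_regex` is hypothetical, and `X` is kept as a literal symbol so that every posted row still matches (you may prefer to treat `X` as "any residue" instead).

```python
import re

# The 18 aligned sequences from the question.
alignment = """
YKYRYLRHGKLRPFERDI YKYRYLKHGKLRPFERDI YKYRYLXHGKLRPFERDI
YKYRSLRHGKLRPFERDI YKYRCLRHGKLRPFERDI YKFRYLRHGKLRPFERDI
YKHRYLRHGKLRPFERDI YKXRYLRHGKLRPFERDI YLYRWVRRSKLNPYERDL
FYYRLFRHGKIKPYERDI FFYRRFRHGKIKPYGRDL FYYRLFRHGKIKPYGRDL
YYYRIWRSEKLRPFERDI YYYRSHRKTKLKPFERDL YFYRSHRSTKLKPFERDL
YFYRSHRSSKLKPFERDL YYYRSSRKTKLKPFERDL YYYRSYRKEKLKPFERDL
""".split()


def column_regex(seqs):
    """Build a regex with one character class per alignment column."""
    parts = []
    for column in zip(*seqs):  # iterate over columns, not rows
        residues = "".join(sorted(set(column)))
        # A fully conserved column becomes a single letter,
        # a variable column becomes a character class such as [IL].
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "".join(parts)


pattern = column_regex(alignment)
print(pattern)

# Sanity check: every row of the alignment should match the pattern.
all_match = all(re.fullmatch(pattern, s) for s in alignment)
print(all_match)
```

A pattern built this way accepts every combination of the residues seen in each column, so it also matches sequences that are not in the alignment. Whether that is acceptable depends on what the instructor means by "describes the alignment".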

ADD REPLY • link 3.2 years ago by cschu181 ★ 2.8k

score 2 · Answer 1 · 2021-01-26

This is clearly a homework assignment, and you should ask your instructor for details. It defeats the educational goals of your instructor if we show you exactly how to do this. That said, here are a couple of hints.

I am guessing that a regular expression assignment is about individual columns in your alignment rather than a full set of sequences. For example, this is a regular expression of the last 4 columns in your alignment:

[EG]-R-D-[IL]

This means that the last column is either I or L, the next to last is always D, the one before it is always R, and the one before that is either E or G. You should check with your instructor, but I think that your assignment is to find this pattern across all columns, and then search the database for proteins that match the pattern you found.
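
The hyphen-separated notation above is the style used in Prosite patterns. If you want to test such a pattern in Python, a rough conversion might look like the sketch below. The function name `prosite_to_regex` is hypothetical, and it handles only a small subset of the syntax (hyphens, `x` for any residue, `{...}` for excluded residues, `(n)` for repeats), so treat it as an illustration rather than a complete converter.

```python
import re


def prosite_to_regex(prosite):
    """Convert a simple Prosite-style pattern to a Python regex (subset only)."""
    regex = prosite.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
    regex = regex.replace("(", "{").replace(")", "}")     # x(2) -> x{2}
    regex = regex.replace("-", "")                        # drop separators
    return regex.replace("x", ".")                        # x -> any residue


tail = prosite_to_regex("[EG]-R-D-[IL]")
print(tail)

# The last four residues of a few rows from the alignment in the question.
tails = ["ERDI", "ERDL", "GRDL"]
tail_matches = [bool(re.fullmatch(tail, t)) for t in tails]
print(tail_matches)
```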

For example, here is one protein that matches the whole pattern (the match, highlighted in red in the original post, is YKYRYLRHGKLRPFERDI):

MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEIFRSDTLYLTQDLFLPFYSNV TGFHTINHTFGNPVIPFKDGIYFAATEKSNVVRGWVFGSTMNNKSQSVIIINNSTNVVIRACNFEL CDNPFFAVSKPMGTQTHTMIFDNAFNCTFEYISDAFSLDVSEKSGNFKHLREFVFKNKDGFLYVYK GYQPIDVVRDLPSGFNTLKPIFKLPLGINITNFRAILTAFSPAQDIWGTSAAAYFVGYLKPTTFML KYDENGTITDAVDCSQNPLAELKCSVKSFEIDKGIYQTSNFRVVPSGDVVRFPNITNLCPFGEVFN ATKFPSVYAWERKKISNCVADYSVLYNSTFFSTFKCYGVSATKLNDLCFSNVYADSFVVKGDDVRQ IAPGQTGVIADYNYKLPDDFMGCVLAWNTRNIDATSTGNYNYKYRYLRHGKLRPFERDISNVPFSP DGKPCTPPALNCYWPLNDYGFYTTTGIGYQPYRVVVLSFELLNAPATVCGPKLSTDLIKNQCVNFN FNGLTGTGVLTPSSKRFQPFQQFGRDVSDFTDSVRDPKTSEILDISPCSFGGVSVITPGTNASSEV AVLYQDVNCTDVSTAIHADQLTPAWRIYSTGNNVFQTQAGCLIGAEHVDTSYECDIPIGAGICASY HTVSLLRSTSQKSIVAYTMSLGADSSIAYSNNTIAIPTNFSISITTEVMPVSMAKTSVDCNMYICG DSTECANLLLQYGSFCTQLNRALSGIAAEQDRNTREVFAQVKQMYKTPTLKYFGGFNFSQILPDPL KPTKRSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLPPLLTDDMIAAYTAA LVSGTATAGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKQIANQFNKAISQIQESLTTTS TALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYV TQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQAAPHGVVFLHVTYVPSQERN FTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGNCDVVIGIINNTVYDP LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKY EQYIKWPWYVWLGFIAGLIAIVMVTILLCCMTSCCSCLKGACSCGSCCKFDEDDSEPVLKGVKLHY T
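
To locate a match like this yourself, a short Python sketch such as the following may help. It uses only a fragment of the protein above (copied from around the matching region) and the four-column pattern from the hint, so the reported positions are relative to the fragment, not to the full protein.

```python
import re

# Fragment of the protein shown above, around the matching region.
fragment = "STGNYNYKYRYLRHGKLRPFERDISNVPFSP"

# First row of the alignment, searched as a literal string.
row = "YKYRYLRHGKLRPFERDI"
row_start = fragment.find(row)  # 0-based index, -1 if absent

# Pattern for the last four alignment columns, in Python regex syntax.
tail_hit = re.search(r"[EG]RD[IL]", fragment)

print(row_start, row_start + len(row))
print(tail_hit.group(), tail_hit.start(), tail_hit.end())
```

Both searches end at the same position in the fragment, which is a quick check that the four-column pattern hits the end of the aligned region.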