Question: How Do I Predict The Sequence Of A Protein Split Over Multiple Contigs?
Michael Kuhn (EMBL Heidelberg) wrote, 8.3 years ago:

I'm hunting for a member of a protein family in an unannotated genome. With a bit of luck and the help of transcriptome data from a distantly related species, I found that the gene seems to be split across three contigs because of TA (or AT?) repeats:

contig 1-(TA)*-small contig 2-(TA)*-contig 3

How would you predict the gene in this case: Run separate predictions for the three contigs? Glue the contigs together manually?

(Also, I'd be happy for any pointer on the function of the TA repeat.)


Just curious: how do you know there are (AT)n repeats between the contigs? ... I would start by predicting partial genes on each contig. Merging the contigs can be justified if you have some kind of linking information, such as paired reads mapped to adjacent contigs. A transcript from a distant relative sounds a bit risky as evidence.

(reply by Haibao Tang, 8.3 years ago)

Contig 1 has AT repeats at its end, contig 2 at both ends, and contig 3 at its start. Of course I don't know whether they're really connected, but given that all three contigs harbor a fragment of the same gene, it seems likely.
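
A quick way to check contig ends for such repeats is sketched below. This is only an illustration: the function name `terminal_dinucleotide_repeat` and the `min_units` cutoff are assumptions made for the example, not part of any existing tool, and a real check would read the contigs from a FASTA file.

```python
import re

def terminal_dinucleotide_repeat(seq, end="3prime", min_units=4):
    """Return the length in bases of an alternating A/T run at one contig end.

    Returns 0 if the run is shorter than min_units dinucleotide units.
    min_units is an arbitrary illustrative cutoff, not an established threshold.
    """
    seq = seq.upper()
    # An alternating A/T run reads as (AT)n or (TA)n depending on phase,
    # which is why "TA (or AT?)" describes the same kind of repeat.
    run = r"(?:(?:AT)+A?|(?:TA)+T?)"
    pattern = run + "$" if end == "3prime" else "^" + run
    match = re.search(pattern, seq)
    if match and len(match.group(0)) >= 2 * min_units:
        return len(match.group(0))
    return 0

# Toy sequences, not real data.
contig_1 = "GGCCGTTAACATATATATAT"
contig_3 = "TATATATATAGGCTTAACCG"
print(terminal_dinucleotide_repeat(contig_1, end="3prime"))
print(terminal_dinucleotide_repeat(contig_3, end="5prime"))
```

Finding the repeat on the facing ends of two contigs is consistent with them being neighbors, but it is not proof: the same repeat can occur in many places in the genome.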

(reply by Michael Kuhn, 8.2 years ago)
Larry_Parnell (Boston, MA, USA) wrote, 8.3 years ago:

I would join the contigs if a couple important criteria are met:

1) Each contig must align to a unique portion of the distant relative, with no overlap in the residue positions covered (on that relative). You don't want contig A to match amino acids 15 - 97 and contig B to match amino acids 85 to 188, as this indicates that those two contigs overlap and should be joined during assembly of the genome, not manually for the sake of this gene hunting/modeling.

2) Each contig should have roughly the same percent identity and percent similarity to the relative. "Roughly" is key here, and it is hard to define an acceptable range. You do not want to be dealing with paralogs (2 genes) when you're assuming a single gene. In other words, don't manually create a gene fusion.
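
The two criteria above can be checked with a few lines of Python once each contig has been aligned to the relative's protein. This is a minimal sketch: the `Hit` record, its field names, and the percent identity values are made up for illustration, and the coordinates reuse the contig A / contig B example from criterion 1. No identity cutoff is applied, since an acceptable range is hard to define.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Hit:
    contig: str
    ref_start: int       # first aligned residue on the relative's protein (1-based)
    ref_end: int         # last aligned residue on the relative's protein (inclusive)
    pct_identity: float  # illustrative values only

def overlapping_pairs(hits, max_overlap=0):
    """Criterion 1: list contig pairs that cover the same reference residues."""
    pairs = []
    for a, b in combinations(hits, 2):
        overlap = min(a.ref_end, b.ref_end) - max(a.ref_start, b.ref_start) + 1
        if overlap > max_overlap:
            pairs.append((a.contig, b.contig, overlap))
    return pairs

def identity_spread(hits):
    """Criterion 2: difference between the best and worst percent identity."""
    values = [h.pct_identity for h in hits]
    return max(values) - min(values)

hits = [
    Hit("contig A", 15, 97, 60.0),
    Hit("contig B", 85, 188, 58.0),
]
print(overlapping_pairs(hits))  # contig A and contig B share residues 85 to 97
print(identity_spread(hits))
```

A non-empty result from `overlapping_pairs` argues against joining by hand, and a large spread in identity is a hint that the fragments may come from paralogs.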

I would also run the translations of the contigs against motif finders (e.g., Pfam) to help identify what, if anything, may be missing from the protein-coding portion of your gene model.

As to the biological function of the TA repeats: they could be transposon insertion sites or remnants, they could be structural for the DNA itself, or they could be (but are unlikely to be) binding sites for DNA modification enzymes or transcription factors. Genetically, they can be used as markers.

Lee Katz (Atlanta, GA) wrote, 8.3 years ago:

I know that this is a bioinformatics forum and I do love bioinformatics solutions, but this might be a case for the wet lab. Just see if you or someone in your lab can PCR the gaps to verify that the contigs should be glued together.

If a PCR product forms across the gap between two contigs, then the predicted contig order is supported and you can glue the contigs together.


Sure, if I had a specimen of the organism, this would be an option... ;-) But you're right, I was thinking about contacting a lab that works with the organism to see if we could culture it as well.

(reply by Michael Kuhn, 8.3 years ago)

This is what a sequencing center would call "finishing." Even if it is a wet lab solution, you still need bioinformatics to propose primers and analyze any joins made. You can use bioinformatics to predict if this gene is in a family of one member or more by looking at the same gene in other similar organisms.

(reply by Larry_Parnell, 8.3 years ago)