Question

Est Sequence Database: Which Frame To Choose ?

3

13.2 years ago

Woa ★ 2.9k

Hi all,

I have an EST DNA sequence database that I want to search for protein identification using tandem mass spectrometry data. However, I have reason to believe that the protein is not encoded on the reverse strand, so for a six-frame translation I can safely ignore the 3 reading frames from the reverse strand.

My question is: how should I use the remaining three reading frames from the positive strand as a database? Common software tools generally translate all six frames, but in this case I need only three. Should I use the translated longest open reading frame (ORF) from the 3 reading frames? Or should I keep ALL ORFs generated from the 3 frames? Or should I keep the whole translated reading frame that contains the longest ORF? In the last case, however, there will be a lot of translated stop codons (marked as *) inside the sequence.

proteomics est orf • 3.5k views

updated 13.2 years ago by Larry_Parnell 16k • written 13.2 years ago by Woa ★ 2.9k

score 3 · Answer 1 · 2011-02-03

AFAIK, with a classical EST approach you can't ignore the reverse strand, because of the way cDNA clone libraries for EST sequencing are constructed. The single-stranded cDNA is converted into double-stranded DNA and inserted into the library vector. The double-stranded clones are then sequenced, so you cannot know which strand was the original. If you used a strand-specific next-generation sequencing protocol, your assumption might hold, though.

However, if your library is not too big, I would still run the full six-frame translation and filter afterwards. That will let you verify your + strand assumption. I wouldn't be surprised if you got a lot of unexpected reverse-strand hits even though your theoretical approach told you otherwise.
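To make the "translate all six frames, filter afterwards" idea concrete, here is a minimal sketch in plain Python. It assumes the standard genetic code, splits each translated frame at stop codons, and writes the strand and frame into each header so that hits can be sorted by strand after the search. The function names, the header layout (`name|frame=+1|seg=0`) and the `min_len` cutoff are illustrative assumptions, not something any particular search engine requires.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(seq):
    """Translate from position 0; codons with ambiguous bases become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_segments(name, seq, min_len=7):
    """Yield (header, peptide) for every stop-free stretch in all six frames.

    min_len is a hypothetical cutoff to drop stretches too short to
    produce an identifiable peptide; tune it to your own data.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(s[offset:])
            for idx, segment in enumerate(protein.split("*")):
                if len(segment) >= min_len:
                    header = f"{name}|frame={strand}{offset + 1}|seg={idx}"
                    yield header, segment


if __name__ == "__main__":
    # Toy EST; in practice you would loop over the records of your FASTA file.
    for header, peptide in six_frame_segments("est1", "ATGGCCAAATAAGGG", min_len=3):
        print(f">{header}\n{peptide}")
```

Keeping every stop-free stretch, rather than only the longest ORF, avoids throwing away real coding sequence from ESTs with frameshifts or partial coverage, and a forward-only database is then just the subset of entries whose header contains `frame=+`.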

score 3 · Answer 2 · 2011-02-03

Exactly as Michael writes: EST libraries are not cloned into the sequencing vector in an orientation-specific manner. Some clones are, but this is not a reliable feature, especially with more modern methods that allow cloning of the middle to 5' regions of the gene/transcript. The best diagnostic you will have for orientation, and therefore for reducing your search space to 3 reading frames on one strand, is the polyA (vs. polyT) run at the extreme 3' end of the clone. However, those clones, when short, may not contain any protein-coding sequence...
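As a rough illustration of the polyA/polyT check, the sketch below looks for a trailing run of A (read is in the sense orientation) or a leading run of T (read is the reverse complement of the transcript). The function name and the 12-base threshold are assumptions made for this example, and it presumes vector and adaptor sequence has already been trimmed from the reads.

```python
import re


def guess_orientation(seq, min_run=12):
    """Guess EST orientation from a terminal homopolymer run.

    Returns "+" for a trailing polyA, "-" for a leading polyT,
    and None when neither is present. min_run is a hypothetical
    threshold, not an established standard.
    """
    seq = seq.upper()
    if re.search(r"A{%d,}$" % min_run, seq):
        return "+"
    if re.search(r"^T{%d,}" % min_run, seq):
        return "-"
    return None


if __name__ == "__main__":
    print(guess_orientation("ACGT" * 5 + "A" * 15))
    print(guess_orientation("T" * 15 + "ACGT" * 5))
    print(guess_orientation("ACGTACGT"))
```

Reads that return `None` carry no orientation signal, so for those all six frames still need to be searched.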

In addition, weird s#!t can happen during cloning. For example, I have seen many hybrid ESTs where two different parts of two different genes are joined in vitro to produce a single clone. Also, sloppy library prep will give genomic sequence in the EST reads.

Just search against all 6 frames, then examine and filter your results afterward. You should also compare your peptides to RefSeq sequences from the source organism of the peptides, or from as close an evolutionary relative as you can get. That gives you a full-length sequence against which to judge the quality of your matches to the ESTs.