Hi all -
I'm working on a particular gene in Plasmodium falciparum that I have recently shown is misannotated in the reference genome. The current annotation shows that the gene has three exons; however, our recent data suggest that "exon 1" is not transcribed and spliced with exons 2 and/or 3. Therefore, we believe that the actual gene consists only of exons 2 and 3. I'm wondering how we can identify the new start codon. At my disposal I have several RNA-seq datasets, the ability to do RT-PCR and sequencing, and most common molecular biology experiments to probe this. I may be able to get a proteome dataset from a collaborator if needed. Is there a way to identify the start codon based on read coverage around a particular methionine, for example if there is a read gap around a Met (i.e. a density of reads that start at the methionine and no reads that contain sequence 5' of the methionine)? Should I look for peptides that start with methionines in a proteomic dataset? Should I design primers to try to amplify the transcript and sequence in reverse to find the transcription start site? Help me please! :)
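On the read-coverage idea, here is a minimal sketch of how it could be checked, assuming you already have per-base RNA-seq depth for the region in transcript orientation (5' to 3') with no intron between the positions examined. The function names, the `min_depth` threshold and the `anchor` position (the first base of a codon you trust to be coding, e.g. in the conserved domain) are all hypothetical. Note that coverage includes the 5' UTR, so the boundary approximates the transcript start and only tells you which Met codons are transcribed at all, not which one is used.

```python
STOPS = {"TAA", "TAG", "TGA"}

def five_prime_boundary(coverage, anchor, min_depth=5):
    """Walk 5' from `anchor` while depth stays >= min_depth.

    Returns the 5'-most index of the contiguous covered block.
    """
    if coverage[anchor] < min_depth:
        raise ValueError("anchor position is not covered")
    pos = anchor
    while pos > 0 and coverage[pos - 1] >= min_depth:
        pos -= 1
    return pos

def in_frame_atgs(seq, boundary, anchor):
    """ATG positions in [boundary, anchor) that are in frame with `anchor`
    and have no in-frame stop codon before reaching `anchor`."""
    seq = seq.upper()
    hits = []
    for p in range(boundary, anchor):
        if (anchor - p) % 3 != 0 or seq[p:p + 3] != "ATG":
            continue
        codons = (seq[q:q + 3] for q in range(p, anchor, 3))
        if not any(c in STOPS for c in codons):
            hits.append(p)
    return hits

# Toy example: positions 0-5 have no reads, the rest are covered.
seq = "AAATGTAAATGCCCGGGTTT"
coverage = [0] * 6 + [9] * 14
boundary = five_prime_boundary(coverage, anchor=14)
print(boundary, in_frame_atgs(seq, boundary, anchor=14))
```

Any ATG that survives this filter is still only a candidate; the coverage data cannot distinguish between several in-frame Met codons that all lie inside the transcript.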
With RACE-PCR you could determine a more exact transcript to validate your hypotheses about the missing exon.
Thanks for this suggestion. Do you know if this technique is going to be limited by transcript size? While I can make a guess at which Met is the start codon, I could certainly be wrong, and the possible transcript length for 5' RACE-PCR would then be much longer. My thinking is that I can design it so I would aim to have a nice 300-1000 bp product (if my theories are correct), but if I'm wrong, I could be trying to amplify a very large fragment, which could be difficult.
Honestly, this takes the question far away from bioinformatics, and wet-lab work is not my core competence ;) I just know that our technicians do this on a regular basis with L. salmonis transcripts, and I have never heard of a limitation on transcript size; we recently got a much longer transcript validated. Normally, we obtain a 5'- and 3'-RACE and a normal PCR-based consensus sequence.
Problems are more likely to come from alternative transcripts or recently duplicated genes.
Here is the M&M part describing the setting in our paper naming the relevant kits:
I think OP's main interest is in the translation start site (given that he is talking about the Met/start codon), which, if I'm not mistaken, will not be found using RACE, because RACE finds the transcription start site.
I got that point, but without a ribo-seq or proteomics approach there is afaik nothing better than having the correct transcript. We normally get the validated transcript and take the longest ORF. At least nobody has complained about that ;)
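For reference, the "take the longest ORF" step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the transcript is given 5' to 3' on the sense strand, the standard genetic code applies, and the helper names (`find_orfs`, `longest_orf`) are made up for this example.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def find_orfs(transcript):
    """Return (start, end, protein) for each ATG...stop ORF in the 3 forward frames.

    Coordinates are 0-based, `end` is exclusive and includes the stop codon.
    ORFs that run off the 3' end without a stop are not reported.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first ATG after the previous stop in this frame
            elif start is not None and CODON_TABLE.get(codon) == "*":
                orfs.append((start, pos + 3, translate(seq[start:pos])))
                start = None
    return orfs

def longest_orf(transcript):
    """Return the ORF with the longest protein, or None if there is none."""
    orfs = find_orfs(transcript)
    return max(orfs, key=lambda orf: len(orf[2])) if orfs else None

print(longest_orf("CCATGAAATGGGCTTAACC"))
```

The longest ORF always starts at the most upstream in-frame ATG after a stop, so any Met further inside the returned protein remains an alternative start candidate that this step alone cannot rule out.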
In an oversimplified world, you could do a simple western blot and see if the mass of your protein matches the theoretical mass based on each potential start codon. That being said, I've never done (and never will do) a western blot. This will also crucially depend on the availability of an antibody.
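The theoretical mass for each potential start codon is easy to tabulate. A rough sketch, assuming an unmodified polypeptide and average residue masses; `protein` is the translation from the most upstream in-frame Met to the stop codon, and `tag` is an optional amino-acid sequence for an epitope tag fused to the C-terminus (both names are hypothetical, and linker residues around a real tag would need to be added by hand).

```python
# Average residue masses in Da (free amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

def protein_mass(protein):
    """Average molecular mass (Da) of an unmodified polypeptide."""
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER

def masses_by_start(protein, tag=""):
    """For each Met in `protein`, the mass of the product that would start
    at that Met and run to the C-terminus, with `tag` appended."""
    return [(i, protein_mass(protein[i:] + tag))
            for i, aa in enumerate(protein) if aa == "M"]

# Toy example with a made-up sequence; positions are 0-based residue indices.
for position, mass in masses_by_start("MKTAYIAKQRMSLVE"):
    print(f"Met at residue {position}: {mass / 1000:.2f} kDa")
```

Migration on SDS-PAGE only approximates mass, so this comparison can realistically separate candidates that differ by several kDa, not neighbouring methionines.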
Both really - as Michael points out (and in his comment below), it's probably technically easier to do RACE-PCR to figure out the full-length transcript and then take the longest ORF. At least this will validate the splicing variation or genome misannotation theory. I can do a simple RT-PCR across the exon 2/3 junction and sequence that product to verify they are spliced as annotated (I can't amplify across exons 1 and 2 by RT-PCR no matter how I try, while I get a clear product from exon 2/3). The 3' end of the gene that spans the exon 2/3 gap contains a well-conserved functional domain, so I'm less worried about that region.
Fortunately, I have recently HA-tagged the endogenous locus by CRISPR-Cas9 editing, so I should be able to do a rough validation of protein size/ORF selection via Western, as you suggest, WouterDeCoster.
What about Edman sequencing? ;-)
Good one! Theoretically I could do it, but what an expensive mess that would be.
You could try to find an institution with which you can collaborate... I'm too young to have used the technology myself, but it might prove useful. But I would definitely start with the western. Having the HA tag is a huge advantage. How did you do that CRISPR-based tagging exactly?
Simply pick a suitable PAM site as close to the STOP codon as possible (unfortunately for me this was about 200 bp away) and then design a repair template to silently recodonise the CDS between the cut site and the STOP with a 3xHA tag stuck on the end of the CDS. You flank your changes (in this case the 200 bp recodonisation and exogenous HA sequence) with 300-500 bp "arms" that are homologous to the sequences flanking the desired integrated change (in this case 500 bp upstream of the cut site and 500 bp downstream of the STOP). Transfect and drug select for parasites containing the plasmid and bingo you've got your tag!
This works well in P. falciparum because, although transfection and CRISPR are horribly inefficient, the parasites don't have canonical non-homologous end joining pathways. Once you've cleaved the chromosome, the only way for the parasite to survive is to use homology-directed repair, using the template you design and supply to fix it.
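The repair-template layout described above can be sketched as follows. This is an illustration only: it assumes the standard genetic code, that the CDS segment between the cut site and the STOP starts on a codon boundary, and that you supply the homology arms and the tag's DNA sequence yourself. The function names are hypothetical, and the codon choice simply maximises mismatches to the original without considering codon usage.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate an in-frame DNA string using the standard code."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def recodonise(cds):
    """Replace each codon with the synonymous codon that differs from it at
    the most positions; Met and Trp have no synonyms and are left as-is."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("segment length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        synonyms = [c for c, aa in CODON_TABLE.items()
                    if aa == CODON_TABLE[codon] and c != codon]
        if synonyms:
            codon = max(synonyms,
                        key=lambda c, ref=codon: sum(x != y for x, y in zip(c, ref)))
        out.append(codon)
    return "".join(out)

def build_template(left_arm, cds_segment, tag_dna, right_arm, stop="TAA"):
    """left arm + recodonised CDS (cut site to STOP) + tag + stop + right arm."""
    recoded = recodonise(cds_segment)
    # Sanity check: the recoding must be silent at the protein level.
    assert translate(recoded) == translate(cds_segment.upper())
    return left_arm + recoded + tag_dna + stop + right_arm

print(recodonise("GATAAATGG"))  # Asp-Lys-Trp: only the first two codons can change
```

In a real design you would also want to confirm that the recoded segment no longer matches the guide and PAM, so that the repaired locus is not cut again.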
I would suggest ribo-seq, but perhaps that's beyond what's possible given what you described.
Thanks - unfortunately you're probably right. It's a good idea to have in my back pocket if needed, but it's technically beyond what I would like to do as a first approach.