I have a plate of colonies to sequence. I pick 2 colonies and sequence each in Fwd and Rev directions. I get back a single bp difference between the 2 strands. 2 of the base calls are a T, two are a C. How should I call this base? Can I call it a Y (C or T) and leave it at that, or do I need to sequence another colony to be sure?
I agree with chrisamiller and PhiS. I'll just add that it also greatly depends on what you will do with your sequence.
I understand from your question that:
- You have picked only 2 [bacterial] colonies for sequencing
- These colonies result from the cloning of a PCR product (?)
- They were sequenced using Sanger sequencing
[NOTE: when describing your problem it is very important to give this kind of detail, so please correct me if my assumptions are wrong.]
I am guessing that:
- You might want to check that the sequence is correct (maybe verifying that your qPCR product is correct)?
- You might be cloning a gene (or fragment thereof) in order to express a protein?
[NOTE: here again, this kind of detail is crucial in determining whether you can accept an ambiguous base or not. Please add a comment or edit your post if you have yet another purpose in mind.]
Finally, as Istvan has asked, you need to be clear as to what the difference is: are you looking at a different base call between the two sequenced colonies or between the forward and reverse sequencing events?
If it is the first (i.e. difference between the two colonies) then you need to check the quality of the call at that base (quality scores if you have them, or look at the chromatogram to see if there's a miscall, a double peak, etc.). If the calls are good quality, then you probably have at least these two different variants of the sequence you're targeting.
If it is the second (i.e. difference between the forward and reverse) then you should also look at the quality in each read. If they are bad quality, sequence again. If they are good quality, then I'm scratching my head and making a funny face, since two reads of the same clone should agree. Start over from scratch.
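To make "check the quality" a bit more concrete, here is a minimal sketch of how a Phred score translates into an error probability. The read names, the scores, and the Q20 cutoff are all made-up assumptions for illustration; use whatever your sequencing facility reports and whatever threshold you are comfortable with.

```python
def phred_to_error_prob(q: float) -> float:
    """Phred definition: Q = -10 * log10(P_error), so P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def flag_low_quality(calls, min_q=20):
    """Return the names of reads whose call falls below min_q.

    calls is a list of (read_name, base, phred_q) tuples.
    min_q=20 (1 error in 100) is an assumed cutoff, not a rule.
    """
    return [name for name, _base, q in calls if q < min_q]


# Hypothetical calls at the disputed position (invented numbers).
calls = [
    ("colony1_fwd", "T", 38),
    ("colony1_rev", "T", 35),
    ("colony2_fwd", "C", 40),
    ("colony2_rev", "C", 12),
]

for name, base, q in calls:
    print(f"{name}: {base} Q{q} -> P(error) = {phred_to_error_prob(q):.4f}")

print("Low-quality calls:", flag_low_quality(calls))
```

A call flagged here is a reason to open the chromatogram at that position, not a verdict by itself.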
Now to your question about leaving it ambiguous or not:
- If you just wanted to check that the sequence is "fairly" OK, then fine, leave it as a Y.
- If you're checking the amplicon of a qPCR reaction, then it is crucial to know if you have only one sequence or two different ones (even if they differ by only a SNP). This will change your interpretation.
- If you want to express a protein from this sequence, then you need to check if the difference (T or C) changes the resulting protein sequence: if yes, you need to choose the correct clone. If not, you can go with either.
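For the protein-expression case, the check is simply whether the T and the C versions of the codon encode the same amino acid. A rough sketch, assuming the standard genetic code and a coding sequence that starts in frame at its first base; the example sequence and positions are invented, and `variant_effect` is just a hypothetical helper name:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str) -> str:
    """Translate an in-frame DNA sequence; trailing partial codons are ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def variant_effect(cds: str, pos: int, alt: str):
    """Compare the codon at 0-based position pos with and without the change.

    Returns (ref_codon, alt_codon, ref_aa, alt_aa, is_synonymous).
    Assumes cds is in frame starting at index 0.
    """
    cds = cds.upper()
    start = pos - pos % 3          # first base of the affected codon
    offset = pos % 3               # position of the change within the codon
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    return ref_codon, alt_codon, ref_aa, alt_aa, ref_aa == alt_aa


# Toy coding sequence (hypothetical): ATG GCT TAA
cds = "ATGGCTTAA"
print(variant_effect(cds, 5, "C"))  # third codon position, T -> C
print(variant_effect(cds, 4, "T"))  # second codon position, C -> T
```

If the change is synonymous, either clone gives the same protein; if not, you need to know which base is the real one before going further.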
There are lots of factors to consider here:
1) What do the quality scores tell you about the base call at that position?
2) How deep is your coverage? If you've got 1x coverage, it's possible that you may be seeing a miscalled base. If you're taking consensus from 30x coverage, it's much less likely.
3) You're sequencing from a population. It's completely possible that within this population there are individuals with both alleles that you're describing, right?
As chrisamiller says, it depends on the details of what you're trying to do. The question is whether what you're seeing is variation due to technical error or due to biological variation.
However, without any additional information, if you've essentially only got 2 reads per sequence with contradicting information at a given position, you can't really call the base with any degree of certainty. In this case, the use of an ambiguity base call (i.e. Y instead of C or T) would be justified, in my view.
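If you do end up recording an ambiguity code, it is just a lookup from the set of bases observed at the position to the IUPAC nucleotide symbol. A small sketch (the function name `ambiguity_code` is my own, and it assumes the calls have already been quality-checked and are all on the same strand):

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def ambiguity_code(bases) -> str:
    """Return the IUPAC symbol covering every base observed at one position."""
    observed = frozenset(b.upper() for b in bases)
    return IUPAC[observed]


# The situation in the question: two calls say T, two say C.
print(ambiguity_code(["T", "T", "C", "C"]))
```

Note that this only records the disagreement; it says nothing about whether the disagreement is technical or biological.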