Question: If I Have 4 Sequence Runs, 2 In Each Direction, 1 Bp Is Different, On Each, Should I Resequence?
9.6 years ago by
John770 wrote:

I have a plate of colonies to sequence. I pick 2 colonies and sequence each in Fwd and Rev directions. I get back a single bp difference between the 2 strands. Two reads have a T, two have a C. How should I call this base? Can I call it a Y (C or T) and leave it at that, or do I need to sequence another colony to be sure?

sequencing dna • 1.9k views

Thanks all, yes I had two good reads on each strand and the single bp on one of the colonies was different from that of the other colony (I'd picked 3 originally, but one was just an insert). I went with picking another 2 colonies to be sure. I'm sequencing ~100 markers though, so I was trying to weigh up the extra $$ / time in sequencing another colony against the extra information a C or T gives me over a Y. This is only the 15th sequence or so and the first time this has happened, so I'll see how the others turn out before deciding on a general policy.

written 9.6 years ago by John770

Actually, what I'd really like to know is when you go to publish a sequence like this, how much coverage should you have? Is it acceptable to put a sequence with a Y into GenBank because you didn't go to the effort / cost of re-sequencing to resolve it? Or does the Y represent natural variation... and how many would you need to sequence to answer that question... :)

written 9.6 years ago by John770

In this case I sequenced more colonies and found a consensus sequence. However, I'm cloning PCR products, so I don't think it is possible to rule out natural variation in the PCR amplicon pool. One good example would be a bacterium with two different 16S rRNA genes inside a single cell; this could produce two populations of PCR products. If the ratio were 1:3, how many cloned colonies would you have to sample in order to observe this natural variation?

written 9.6 years ago by John770
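
The 1:3 sampling question above can be roughed out numerically. The following is only a sketch under simplifying assumptions that are not stated in the thread: each picked colony is an independent draw from the amplicon pool, both variants clone equally well, and there is no PCR or sequencing error. The function names are hypothetical. With a minor-variant fraction p, the chance of seeing both variants among n colonies is 1 - p^n - (1 - p)^n.

```python
def prob_both_variants_seen(n_colonies, minor_fraction=0.25):
    """Probability that n independent colonies include both variants.

    Assumes each colony carries the minor variant with probability
    `minor_fraction` (0.25 corresponds to a 1:3 ratio) and the major
    variant otherwise.
    """
    major_fraction = 1.0 - minor_fraction
    # Subtract the two "all colonies identical" outcomes from 1.
    return 1.0 - major_fraction ** n_colonies - minor_fraction ** n_colonies


def colonies_needed(target=0.95, minor_fraction=0.25):
    """Smallest number of colonies giving at least `target` probability
    of observing both variants (under the assumptions above)."""
    n = 2  # at least two colonies are needed to see two variants
    while prob_both_variants_seen(n, minor_fraction) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (2, 4, 8, 11):
        print(n, round(prob_both_variants_seen(n), 4))
    print("colonies for 95%:", colonies_needed(0.95))
```

Under these assumptions, two colonies show both variants only 37.5% of the time, and about 11 colonies are needed to reach 95%. Real cloning bias would shift these numbers.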

Just to clarify the situation: in each of your two sequencing runs you found a single base difference between the two strands, and that difference was in the same position in both cases?

written 9.6 years ago by Istvan Albert ♦♦ 81k

You should not put unreliable data into GenBank! Either you prove there is natural variation and you submit all the variants, or you make sure you have reliable data and you don't have the problem.

written 9.6 years ago by Nicojo1.1k

In this particular case, you cannot talk about natural variation: you are sequencing fragments you've cloned into a plasmid! Unless you've contaminated your prep with two colonies from the plate, all the plasmids in one prep should be identical. If you have ambiguous bases, then it's because your sequencing is of bad quality. You should never submit bad quality data to GenBank.

written 9.6 years ago by Nicojo1.1k
9.6 years ago by
Kyoto, Japan
Nicojo1.1k wrote:

I agree with chrisamiller and PhiS. I'll just add that it also greatly depends on what you will do with your sequence.

I understand from your question that:

  • You have picked only 2 [bacterial] colonies for sequencing
  • These colonies result from the cloning of a PCR product (?)
  • They were sequenced using Sanger sequencing

[NOTE: when describing your problem it is very important to give these kinds of details, so please correct me if my assumptions are wrong.]

I am guessing that:

  • You might want to check that the sequence is correct (maybe verifying that your qPCR product is correct)?
  • You might be cloning a gene (or fragment thereof) in order to express a protein?

[NOTE: here again, these kinds of details are crucial in determining whether you can accept an ambiguous base or not. Please add a comment or edit your post if it is yet another purpose.]

Finally, as Istvan has asked, you need to be clear as to what the difference is: are you looking at a different base call between the two sequenced colonies or between the forward and reverse sequencing events?

If it is the first (i.e. difference between the two colonies) then you need to check the quality of the call at that base (quality scores if you have them, or look at the chromatogram to see if there's a mistake or a double peak etc.). If they are good quality, then you probably have at least these two different variants of the sequence you're targeting.

If it is the second (i.e. difference between the forward and reverse) then you should also look at the quality in each read. If they are bad quality, sequence again. If they are good quality, then I'm scratching my head making a funny face. Start over from scratch.

Now to your question about leaving it ambiguous or not:

  • If you just wanted to check that the sequence is "fairly" OK, then fine, leave it as a Y.
  • If you're checking the amplicon of a qPCR reaction, then it is crucial to know if you have only one sequence or two different ones (even if it's a SNP). This will change your interpretation.
  • If you want to express a protein from this sequence, then you need to check if the difference (T or C) changes the resulting protein sequence: if yes, you need to choose the correct clone. If not, you can go with either.
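
The last point (does T versus C change the protein?) can be checked with a few lines of code. This is a minimal sketch, assuming the standard genetic code, unambiguous A/C/G/T bases, and that both clone sequences are already trimmed to the coding region in frame from the first base. The helper names and the example sequences are hypothetical, not taken from the question.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate an in-frame DNA sequence; trailing partial codons are ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def variant_changes_protein(seq_a, seq_b):
    """True if the two clone sequences encode different proteins."""
    return translate(seq_a) != translate(seq_b)


if __name__ == "__main__":
    # Third-position T/C difference: GCT and GCC both encode alanine.
    print(variant_changes_protein("ATGGCT", "ATGGCC"))
    # First-position T/C difference: TTT (Phe) versus CTT (Leu).
    print(variant_changes_protein("ATGTTT", "ATGCTT"))
```

If the function returns False for your two clones, the difference is synonymous and either clone encodes the same protein; if it returns True, you need to work out which clone carries the correct base.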
9.6 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

There are lots of factors to consider here:

1) What do the quality scores tell you about the base call at that position?

2) How deep is your coverage? If you've got 1x coverage, it's possible that you may be seeing a miscalled base. If you're taking consensus from 30x coverage, it's much less likely.

3) You're sequencing from a population. It's completely possible that within this population there are individuals with both alleles that you're describing, right?
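
To put numbers on points 1 and 2, a Phred quality score Q corresponds to an estimated error probability of 10^(-Q/10). The sketch below is illustrative only: the function names are hypothetical, and the second function assumes that errors in separate reads are independent, which real Sanger reads of the same template may not satisfy.

```python
def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def prob_all_reads_wrong(quals):
    """Probability that every read miscalled the base at one position.

    Assumes independent errors between reads (an idealisation).
    """
    p = 1.0
    for q in quals:
        p *= phred_to_error_prob(q)
    return p


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    # Forward and reverse read of one colony, both called at Q30.
    print(prob_all_reads_wrong([30, 30]))
```

A base supported by two Q30 reads is very unlikely to be a miscall under this model, which points towards a real difference between the clones rather than a sequencing artefact.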

9.6 years ago by
Phis1.0k wrote:

As chrisamiller says, it depends on the details of what you're trying to do. The question is whether what you're seeing is variation due to technical error or due to biological variation.

However, without any additional information, if you've essentially only got 2 reads per sequence with contradicting information at a given position, you can't really call the base with any degree of certainty. In this case, the use of an ambiguity base call (i.e. Y instead of C or T) would be justified, in my view.
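
For reference, the ambiguity call described here follows the IUPAC nucleotide codes, where Y stands for "C or T". A minimal sketch of calling a position from the bases observed across reads is shown below; the function name is hypothetical, and the sketch ignores quality scores and read counts entirely, so it only reports which bases were seen.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC_CODES = {
    frozenset("A"): "A",
    frozenset("C"): "C",
    frozenset("G"): "G",
    frozenset("T"): "T",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CGT"): "B",
    frozenset("AGT"): "D",
    frozenset("ACT"): "H",
    frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(observed_bases):
    """Return the IUPAC code covering every base observed at one position."""
    seen = frozenset(base.upper() for base in observed_bases)
    return IUPAC_CODES[seen]


if __name__ == "__main__":
    # Two reads with T and two with C, as in the question.
    print(call_base(["T", "T", "C", "C"]))  # Y
```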
