What happens if a UMI fragment has a PCR artefact?
3.4 years ago
asumani ▴ 70

How could we detect whether a base change in a UMI sequence during PCR amplification has made it identical to another UMI? The same transcript would then be propagated under a different UMI. What would the consequences be, and how could this be solved?

unique molecular identifier • 1.4k views

That is a common problem in sequencing, and specialized software that addresses sequencing errors in UMIs has been developed. See for example UMI-tools, where i.sudbery was the senior author. From what I understand, their approach uses networks to cluster UMIs (in the presence of random sequencing errors) into groups which most likely (according to their model) originated from the same actual (true) underlying UMI sequence before PCR/sequencing, therefore accounting for the sequencing errors. You would need to dive into the paper to see to what extent this improves quantification and how prominent the sequence errors actually are. Maybe i.sudbery stumbles over this thread and gives you an expert's summary.

So, simplified: yes, you will have sequencing errors in UMIs, and the goal of e.g. UMI-tools is to use its network-based approach to identify UMIs that, despite errors, actually come from the same original UMI, and to group them together so that the erroneous UMIs are not counted as unique or independent of each other.


I see. So that wrong UMI is created during a PCR amplification cycle, and that single read with the wrong UMI will keep propagating during further PCR amplification. I thought it would mess everything up. But do you mean that its effect will be negligible since it is just one read?


Do not add an answer unless you're answering the top level post. Use Add Comment or Add Reply instead. I'm moving this post to a comment for now, but please be more careful in the future.


"That wrong UMI is created during a PCR amplification cycle, and that single read with the wrong UMI will keep propagating during further PCR amplification."

If the error happened in an early cycle, then yes. Isn't that the reason people use robust enzymes that do not allow errors to creep in?
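
To put a rough number on "early cycle", here is a minimal sketch under a deliberately idealised model: one starting molecule, perfect doubling in every cycle, and a single error introduced on one newly made copy. The function name `error_fraction` and the 20-cycle example are hypothetical, and real PCR is less efficient than this.

```python
def error_fraction(error_cycle: int, total_cycles: int) -> float:
    """Fraction of final copies carrying an error that arose in `error_cycle`.

    Idealised assumptions: one starting molecule, every molecule is copied
    exactly once per cycle, and the error occurs on a single new copy.
    """
    if not 1 <= error_cycle <= total_cycles:
        raise ValueError("error_cycle must be between 1 and total_cycles")
    total_copies = 2 ** total_cycles
    # The erroneous copy doubles in each of the remaining cycles.
    erroneous_copies = 2 ** (total_cycles - error_cycle)
    return erroneous_copies / total_copies


if __name__ == "__main__":
    for cycle in (1, 5, 10, 20):
        print(f"error in cycle {cycle:2d} of 20: {error_fraction(cycle, 20):.6f}")
```

Under these assumptions the fraction is simply 2 to the power of minus the cycle number, so an error in the first cycle ends up in half the copies, while a late error is rare.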

3.4 years ago

This is precisely the problem addressed by UMI-tools, which we created after we noticed that many of the UMIs for reads that mapped to the same co-ordinates were suspiciously similar...

You can read a detailed description of how the different deduplication schemes supported by UMI-tools work in the UMI-tools publication linked by @ATPoint, or on the UMI-tools Read the Docs site. I'll briefly describe one method: "directional", which is the default and usually the most appropriate.

This diagram may be helpful: [figure not shown]

We begin by identifying all the reads that could possibly be PCR duplicates of each other, and extract their UMIs. In some library prep protocols this means reads (or read pairs) with the same mapping co-ordinates; for other methods it just means reads that map to the same gene/transcript/contig. We then examine the edit distances between them, that is, the number of bases we would have to change in UMI i to make it into UMI j.
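
As a minimal sketch of that distance calculation: "number of bases we would have to change" is taken here to mean the Hamming distance between equal-length UMIs. The function name `hamming` and the example UMIs are illustrative assumptions, not taken from UMI-tools.

```python
from itertools import combinations


def hamming(umi_a: str, umi_b: str) -> int:
    """Number of positions at which two equal-length UMIs differ."""
    if len(umi_a) != len(umi_b):
        raise ValueError("UMIs must have the same length")
    return sum(a != b for a, b in zip(umi_a, umi_b))


# Hypothetical UMIs observed on reads at one mapping position.
umis = ["ACGT", "ACAT", "ACAG", "AAAT"]

# Pairwise distances between every pair of UMIs at this position.
distances = {(a, b): hamming(a, b) for a, b in combinations(umis, 2)}

if __name__ == "__main__":
    for (a, b), d in distances.items():
        print(a, b, d)
```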

The first intuition is that if two UMIs differ by only a single base, then it is more likely that one was formed from the other through a PCR error (or sequencing error) than that two different, but very similar, UMIs were attached to two genuinely independent molecules. The first instinct is therefore to merge these UMIs together. But this becomes less obviously the correct thing to do when you get long chains of similar UMIs (say ACGT, ACAT and ACAG, see the figure). In "directional" we rely on the idea that a UMI that arose as a PCR error from another UMI will be observed less frequently. We form networks where an edge is formed from UMI i to UMI j if the edit distance between them is no more than some threshold (usually 1) AND UMI j is observed at that location at most about half as often as UMI i (strictly f_i >= 2*f_j - 1, where f_x is the frequency of UMI x at a particular location; this accounts for cases where there are 2 reads with UMI i and one read with UMI j). We then identify connected sub-networks, and correct all UMIs in each sub-network to the UMI with the highest count. Thus in the sample on the right in the figure, we find two independent sequences (ACGT and AAAT) and infer that the remaining sequences arose as PCR errors from ACGT.
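
The grouping described above can be sketched roughly as follows. This is an illustrative toy, not the UMI-tools implementation: the function names, the example counts, and the `offset` parameter are assumptions made for the example. The count test defaults to the rule given in the UMI-tools documentation, count_A >= 2 * count_B - 1, and `offset` can be changed to try a stricter rule.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts, threshold=1, offset=-1):
    """Group UMIs seen at one location; returns {representative: [members]}.

    An edge i -> j exists when hamming(i, j) <= threshold and
    counts[i] >= 2 * counts[j] + offset.
    """
    # Visit the most abundant UMIs first so each group is rooted at its top UMI.
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    edges = {
        u: [v for v in umis
            if v != u
            and hamming(u, v) <= threshold
            and counts[u] >= 2 * counts[v] + offset]
        for u in umis
    }
    assigned = set()
    groups = {}
    for root in umis:
        if root in assigned:
            continue
        assigned.add(root)
        members, stack = [root], [root]
        while stack:  # follow directed edges outwards from the root
            node = stack.pop()
            for neighbour in edges[node]:
                if neighbour not in assigned:
                    assigned.add(neighbour)
                    members.append(neighbour)
                    stack.append(neighbour)
        groups[root] = members
    return groups


if __name__ == "__main__":
    # Hypothetical read counts per UMI at one mapping position.
    example = {"ACGT": 456, "ACAT": 72, "ACAG": 1, "AAAT": 90}
    for representative, members in directional_groups(example).items():
        print(representative, members)
```

With these made-up counts, ACAT and ACAG chain back to ACGT, while AAAT is too abundant relative to ACAT to be absorbed, so two molecules would be counted.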

Things get more complex in droplet-based single-cell RNA sequencing with UMIs, because you have the complication that reads might be compatible with coming from more than one transcript. Let's say UMI i could have come from transcript A or from transcript B. In transcript A it is unique, but in transcript B there is a UMI that is within an edit distance of 1. What do we do here? Alevin addresses this by implementing a deduplication scheme that is based on the same intuitions as "directional", but that also accounts for this transcript assignment problem. @Rob would probably be better placed than me to describe how that works.

3.4 years ago

You mean if you get a single read with a wrong UMI? That happens to have the same cell barcode and belong to the same gene as another read?

You'd wrongly lose...one read.
