How could we detect whether a base change in a UMI sequence during PCR amplification makes it identical to another UMI? The same transcript would then be propagated under a different UMI. What would the consequences be, and how could this be solved?
This is precisely the problem addressed by UMI-tools, which we created after we noticed that many of the UMIs for reads that mapped to the same co-ordinates were suspiciously similar...
You can read a detailed description of how the different deduplication schemes supported by UMI-tools work in the UMI-tools publication linked by @ATPoint, or on the UMI-tools Read the Docs site. I'll briefly describe one method: "directional", which is the default and usually the most appropriate.
This diagram may be helpful:
We begin by identifying all the reads that could possibly be PCR duplicates of each other, and extract their UMIs. In some library prep protocols this means reads (or read pairs) with the same mapping co-ordinates; for other methods it just means reads that map to the same gene/transcript/contig. We then examine the edit distances between them, that is, the number of bases we would have to change in UMI i to make it into UMI j.
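As a minimal sketch of that edit-distance step, assuming all UMIs have the same length so that the distance is simply the number of mismatching positions: the function name `edit_distance` and the example UMIs are illustrative, not taken from the UMI-tools code.

```python
from itertools import combinations


def edit_distance(umi_i: str, umi_j: str) -> int:
    """Count the positions at which two equal-length UMIs differ."""
    if len(umi_i) != len(umi_j):
        raise ValueError("UMIs must have the same length")
    return sum(a != b for a, b in zip(umi_i, umi_j))


# Illustrative UMIs seen on reads at a single mapping position
umis = ["ACGT", "ACAT", "ACAG", "AAAT"]

# Distance for every unordered pair of UMIs
pairwise = {(i, j): edit_distance(i, j) for i, j in combinations(umis, 2)}

for (i, j), d in pairwise.items():
    print(i, j, d)
```

Here ACGT and ACAT are one change apart, while ACGT and ACAG are two apart, even though each is one change from ACAT. That is the "chain" situation that makes naive merging questionable.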
The first intuition is that if two UMIs differ by only a single base, then it is perhaps more likely that they were formed through a PCR error (or sequencing error) than by two different, but very similar, UMIs being attached to two genuinely independent molecules. The first instinct is to merge these UMIs together. But this becomes less obviously the correct thing to do when you get long chains of similar UMIs (say ACGT, ACAT and ACAG; see the figure).
In "directional" we rely on the idea that a UMI that arose as a PCR error from another UMI will be less frequently observed. We form networks in which an edge runs from UMI i to UMI j if the edit distance between them is no greater than some threshold (usually 1) AND UMI j is observed at that location no more than about half as often as UMI i (strictly f_i >= 2*f_j - 1, where f_x is the frequency of UMI x at a particular location; so, for example, 2 reads with UMI i and one read with UMI j satisfy the condition). We then identify connected sub-networks, and correct all UMIs in a sub-network to the UMI with the highest count. Thus, in the sample on the right in the figure, we find two independent sequences (ACGT and AAAT) and infer that the remaining sequences arose as PCR errors from ACGT.
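The grouping can be sketched roughly as follows. This is a simplified illustration, not the UMI-tools implementation: the names `hamming` and `directional_groups` and the counts are made up for the example. The count test is written as `counts[i] >= 2 * counts[j] - 1`, which is the form given in the UMI-tools documentation, and groups are built by following edges outward from the most abundant UMI not yet assigned.

```python
from collections import deque
from itertools import permutations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts: dict, threshold: int = 1) -> dict:
    """Map each representative UMI to the list of UMIs merged into it."""
    # Directed edge i -> j: the UMIs are close, and i is abundant enough
    # relative to j that j plausibly arose from i as an error.
    edges = {umi: [] for umi in counts}
    for i, j in permutations(counts, 2):
        if hamming(i, j) <= threshold and counts[i] >= 2 * counts[j] - 1:
            edges[i].append(j)

    groups = {}
    assigned = set()
    # Start from the most abundant UMI and follow edges breadth-first.
    for root in sorted(counts, key=lambda u: (-counts[u], u)):
        if root in assigned:
            continue
        assigned.add(root)
        members = [root]
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nxt in edges[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    members.append(nxt)
                    queue.append(nxt)
        groups[root] = members
    return groups


# Illustrative counts at one mapping position (assumed for this example)
counts = {"ACGT": 456, "AAAT": 90, "ACAT": 72, "TCGT": 2, "CCGT": 2, "ACAG": 1}
groups = directional_groups(counts)
print(groups)
```

With these counts, AAAT is only one base away from ACAT, but neither is abundant enough relative to the other for an edge to form, so AAAT stays as its own group while ACAT, ACAG, TCGT and CCGT are all assigned to ACGT.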
Things get more complex in droplet-based single-cell RNA sequencing with UMIs, because you have the complication that reads might be compatible with coming from more than one transcript. Let's say UMI i could have come from transcript A or from transcript B. In transcript A it is unique, but in transcript B there is a UMI that is within an edit distance of 1. What do we do here? Alevin addresses this by implementing a deduplication scheme that is based on the same intuitions as "directional" but also accounts for this transcript assignment problem. @Rob would probably be better placed than me to describe how that works.