Unfortunately, UMI-tools does not allow for uncertainty in the mapping positions. We have considered it, but it would require a complete change to the UMI-tools algorithm and a massive increase in memory requirements.
We have studied this problem ourselves. You can find our analysis here.
We find that adjacent positions (up to around 5 bp away) are far more likely to share UMIs than distant ones. However, this enrichment disappears when expression levels are accounted for. Our belief is that adjacent positions are enriched for similar UMIs because highly expressed positions, which have more UMIs, are likely to be adjacent to other highly expressed positions with many UMIs. Note that this analysis was performed using a sample from an iCLIP experiment rather than an RNA-seq experiment, so it is possible that it doesn't hold for RNA-seq. We should probably check.
One important point to bear in mind is that UMI-tools doesn't use the mapping position defined in the "pos" field of the BAM file, but rather adjusts it for any soft-clipped bases. This is another source of apparently adjacent positions sharing UMIs. If you want to be super conservative you could de-duplicate "per-gene": that is, only allow one instance of a particular UMI per gene, rather than per position. This is activated using the
--per-gene flag to UMI-tools, and requires you to either map to a transcriptome and tell UMI-tools that each contig represents a gene, or use featureCounts to assign each read to a gene (see our single cell tutorial for details). You then tell UMI-tools which of these you are using with
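
To make the soft-clipping and per-gene points above more concrete, here is a minimal Python sketch. It is not the UMI-tools code itself. It assumes 0-based positions, CIGAR strings given as plain text, exact UMI matching only (no merging of similar UMIs), and a hypothetical read representation as dictionaries with `umi`, `gene`, `pos`, `cigar` and `is_reverse` keys. The function names are made up for illustration.

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def adjusted_position(pos, cigar, is_reverse):
    """Approximate soft-clip-adjusted position for one read (assumption:
    forward reads are anchored at their start, reverse reads at their end)."""
    # Hard clips are ignored so that a soft clip next to them is still seen.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not is_reverse:
        leading_clip = ops[0][0] if ops[0][1] == "S" else 0
        return pos - leading_clip
    aligned_end = pos + sum(n for n, op in ops if op in REF_CONSUMING)
    trailing_clip = ops[-1][0] if ops[-1][1] == "S" else 0
    return aligned_end + trailing_clip


def count_unique_umis(reads, key):
    """Count distinct UMIs within each group defined by key(read)."""
    groups = defaultdict(set)
    for read in reads:
        groups[key(read)].add(read["umi"])
    return sum(len(umis) for umis in groups.values())


# Toy data (invented for illustration): four reads from one gene.
reads = [
    {"umi": "AACG", "gene": "geneA", "pos": 100, "cigar": "50M", "is_reverse": False},
    {"umi": "AACG", "gene": "geneA", "pos": 103, "cigar": "3S47M", "is_reverse": False},
    {"umi": "AACG", "gene": "geneA", "pos": 104, "cigar": "50M", "is_reverse": False},
    {"umi": "TTGC", "gene": "geneA", "pos": 104, "cigar": "50M", "is_reverse": False},
]

# Grouping on the raw "pos" field.
raw = count_unique_umis(reads, lambda r: (r["pos"], r["is_reverse"]))

# Grouping on the soft-clip-adjusted position.
per_position = count_unique_umis(
    reads,
    lambda r: (adjusted_position(r["pos"], r["cigar"], r["is_reverse"]), r["is_reverse"]),
)

# Grouping on the gene, in the spirit of per-gene de-duplication.
per_gene = count_unique_umis(reads, lambda r: r["gene"])

print(raw, per_position, per_gene)  # 4 3 2
```

In this toy example the read at "pos" 103 with three soft-clipped bases collapses onto position 100, and grouping by gene is the most conservative of the three, keeping only one count per distinct UMI.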