Question: How to deduplicate UMI with mapping shift ?
gravatar for tatsuomano
22 months ago by
tatsuomano20 wrote:

Dear group,

I'm trying to integrate PCR duplicates using UMI tools, and I have one problem.

I could integrate PCR duplicates just at the same genomic position by UMI tools, but can not integrate those with mapping shifts. As the same with the recently reported paper (, our data also have PCR duplicates with some mapping shifts, which biased the read counts. How can I deduplicate such reads? Does UMI tools implement some options ?


umi • 981 views
ADD COMMENTlink modified 22 months ago by i.sudbery9.8k • written 22 months ago by tatsuomano20
gravatar for i.sudbery
22 months ago by
Sheffield, UK
i.sudbery9.8k wrote:

UMI tools does allow for uncertainty in the mapping positions unfortunately - we have considered it, but it would require a complete change to the UMI-tools algorithm and a massive increase in memory requirements.

We have studied this problem ourselves. You can find our analysis here.

We find that adjacent positions (upto around 5bp away) are far more likely to share UMIs than distant ones. However, this enrichment disappears when expression levels are accounted for - it is our belief that adjecent positions are enriched for similar UMIs because highly expressed positions, with more UMIs are likely to be adjacent to other highly expressed positions with many UMIs. Note that this analysis was performed using an sample from an iCLIP expriment, rather than an RNA-seq experiment, so it possible that doesn't hold for RNA-seq. We should probably check.

One important point to bare in mind is that UMI-tools doesn't use the mapping position defined in the "pos" field of the BAM file, but rather adjusts that for any softclipped bases - another source of apparently adjacent bases sharing UMIs. If you want to be super conservative you could de-duplicate "per-gene": that is, only allow one instance of a particular UMI per gene, rather than per-position. This is activated using the --per-gene flag to UMI-tools, and requires you to either map to a transcriptome and tell UMIs that the contig represents the gene, or use featureCounts to assign each read to a gene (see our single cell tutorial for details). You then tell UMI-tools which of these your are using with --per-contig or --gene-tag=XT.

ADD COMMENTlink written 22 months ago by i.sudbery9.8k

Looking carefully at that reference, not only do they not seem to account for expression or soft-clipping, but two of the datasets they use of 3'-tagging protocols. 3'-tagging protocols definately generally have their fragmentation step after the amplification step, and therefore the mapping position of the read is not expected to be informative for deduplication, and thus should always be run with the --per-gene option.

ADD REPLYlink written 22 months ago by i.sudbery9.8k

Thank you very much for your kind comment, and I appreciate your great contribution of UMI-tools. I'm sorry I didn't make it clear enough, but, actually, I'm trying to analyze DNA damage accumulation by BLISS (, so I can't use featureCounts to assign each read to a gene. Closely looking at the data deposited by the authors, I found several reads with the same UMI mapped to the adjacent positions. Some of the sequences might be lost during preparation of the reads by in situ transcription, reverse transcription or PCR processes, but the detail is not clear. Because we want to focus the effects of the DNA damages, so the position and the exact number of the reads matter.

ADD REPLYlink written 22 months ago by tatsuomano20

You should check whether, where you see cases of reads mapping to adjacent bases with the same UMI, the reads mapping to one location are soft-clipped - that is the actual start of the read is at the same base, even if the mapping co-ordinate given in the BAM file is different.

ADD REPLYlink written 22 months ago by i.sudbery9.8k

Sorry for late reply. Your advice worked greatly. As your pointed out, soft-clipping is the cause of mapping shifts with the same UMI. Thank you!

ADD REPLYlink written 22 months ago by tatsuomano20

Sorry, I know this thread has ended a while ago. But I am actually very curious about your statement "However, this enrichment disappears when expression levels are accounted for". I have checked your notebook. The final plot, correct me if I am wrong which I believe is the one account for expression levels, still shows ~ 3% increase in the fraction of intersecting UMIs in the +1,-1 positions compared to remote positions. Are there other caveats that I haven't understand or you haven't explicitly said?

We are working on the liquid biopsy space which heavily relies on UMI for DNA error correction and we have super high depth, sometimes > 100k. Also because we use pair-end read. It is more likely to have 3' end mapping shift since the sequence quality of the 3'end of the reads drops significantly. I have seen fairly a lot of "spillovers" where a small UMI group that are +1,-1 mapping shift from a large group has the exactly same UMI as a large UMI group, just like the UMI error created a small group. I really like this analysis you have done in this notebook. I probably should try it on our data.

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by Ruolin Liu0

So there IS an increase in the fraction of intersecting UMIs in the +1 and -1 positions. But that enrichment is also present in the null distribution, which is what suggests its an artifact driven by the fact that position super high coverage positions are likely to be close to other super-high coverage positions, and two high coverage positions are likely to share a portion of their UMIs by chance.

Now in DNA sequencing I don't know if its true to assume that one high coverage position is likely to be adjacent to another, so I recommend you do do the analysis for your self. However, you need to make sure you account for clipping - that is, if the pos is say 100, but the CIGAR is 1S99M, we usually regard the read mapping location as 95, not 100, as we think a sequencing error at the first base is the most likely explaination.

ADD REPLYlink written 4 weeks ago by i.sudbery9.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1951 users visited in the last hour