Question: How to take UMIs into consideration in a WES pipeline?
2.7 years ago by
jonessara770220 wrote:


I have received WES data prepared with the SureSelectXT HS reagent kit, which includes UMIs. How should I analyze this data? What should I change in my WES pipeline to take the UMIs into account? Is there a tutorial for analyzing WES data with UMIs?

Could you please share your experiences?

Thanks, Sara


TIL: UMIs are "unique molecular identifiers", more widely known as barcodes

Comment written 2.7 years ago by Jeremy Leipzig
2.7 years ago by
Sheffield, UK
i.sudbery10k wrote:

UMIs are taken into account during the de-duplication stage of your pipeline. Normally during a WES analysis you will collapse all reads that have the same start and end coordinates into a single read. This is on the assumption that reads that start and end at the same location are PCR duplicates.

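However, it is possible that reads that start and end at the same location represent genuinely independent biological molecules. We can distinguish the two possibilities by adding a random barcode (the UMI) to each molecule before PCR: PCR duplicates carry the same UMI, whereas independent molecules will almost always carry different ones.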

How to deal with UMIs in exome sequencing

There are basically two ways to deal with this. The first is to mark any two reads that start at the same position and have the same UMI as duplicates of each other, and to keep only one of them according to some criterion.
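To make the first approach concrete, here is a minimal sketch in plain Python. It is not how any particular tool implements it: the read records are hypothetical dictionaries, and "highest mean base quality" is just one example of a selection criterion.

```python
def dedup_by_position_and_umi(reads):
    """Keep one read per (chrom, start, strand, UMI) group.

    `reads` is a list of dicts with hypothetical keys; the read with the
    highest mean base quality in each group is kept (an assumed criterion).
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"], read["umi"])
        kept = best.get(key)
        if kept is None or read["mean_qual"] > kept["mean_qual"]:
            best[key] = read
    return list(best.values())


reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT", "mean_qual": 30.0},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT", "mean_qual": 35.0},
    # Same position but a different UMI: treated as an independent molecule.
    {"name": "r3", "chrom": "chr1", "start": 100, "strand": "+", "umi": "TTGA", "mean_qual": 28.0},
]
kept_names = sorted(r["name"] for r in dedup_by_position_and_umi(reads))
print(kept_names)  # ['r2', 'r3']
```

Without the UMI in the key, r3 would have been discarded as a duplicate of r1 and r2.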

The second is to group together all reads that have the same mapping position and UMI and "average" them to find the consensus sequence of the reads.
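The "averaging" can be sketched as a per-column majority vote. This is only an illustration under simplifying assumptions: the reads in a family are already aligned to the same coordinates, have equal length, and base qualities are ignored. The function name and the `min_fraction` threshold are made up for the example.

```python
from collections import Counter


def consensus(sequences, min_fraction=0.6):
    """Per-column majority vote over reads from one UMI family.

    Emits 'N' at positions where the most common base does not reach
    `min_fraction` of the reads (an assumed, arbitrary threshold).
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must have equal length")
    out = []
    for column in zip(*sequences):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_fraction else "N")
    return "".join(out)


family = ["ACGTAC", "ACGTAC", "ACCTAC"]  # third read carries an error
print(consensus(family))            # ACGTAC
print(consensus(["ACGT", "ACCT"]))  # ACNT (no base reaches 60%)
```

A real consensus caller would also weight by base quality and handle indels and soft clipping.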

Our software, UMI-Tools, will do the first for you and help with the second. In all modes it starts by grouping together all reads that have the same start co-ordinates, mate inner distance and UMI sequence. It then further groups these where it believes that one UMI could have arisen from another as a PCR or sequencing error and determines which UMI represents the "parent" sequence.
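As a rough illustration of the error-aware grouping step (not the actual UMI-Tools code), the sketch below clusters the UMIs observed at one mapping position. It assumes UMIs of equal length, links two UMIs when they differ by at most one base, and uses the count rule count(parent) >= 2 * count(child) - 1, which is modeled on the "directional" method described for UMI-Tools; treat the details as assumptions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_counts, max_dist=1):
    """Map every UMI at one position to its inferred "parent" UMI.

    UMIs are visited from most to least abundant. An unassigned UMI becomes
    a parent and absorbs, transitively, neighbours within `max_dist`
    mismatches whose count is low enough to be explained as an error.
    """
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = {}
    for parent in order:
        if parent in assigned:
            continue
        assigned[parent] = parent
        stack = [parent]
        while stack:
            node = stack.pop()
            for other in order:
                if other in assigned:
                    continue
                if (hamming(node, other) <= max_dist
                        and umi_counts[node] >= 2 * umi_counts[other] - 1):
                    assigned[other] = parent
                    stack.append(other)
    return assigned


counts = {"ACGT": 100, "ACGA": 3, "ACGG": 1, "TTTT": 50}
parents = cluster_umis(counts)
print(parents["ACGA"], parents["ACGG"], parents["TTTT"])  # ACGT ACGT TTTT
```

Two similar UMIs with similar counts (for example 10 and 9 reads) are left as separate molecules, because the rarer one is too abundant to be a plausible error of the other.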

In dedup mode it will then select one read from each group as the representative, based on various metrics and a good dose of randomness, and output that read and its mate.

In group mode it assigns a unique group code to each group of reads, adding it as a new tag on the BAM entry for each read. You could then take all the reads assigned to the same group and determine the consensus sequence. I believe others have done this before.
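For reference, the two modes could be driven from a script roughly as follows. This is a sketch only: the helper function and file names are hypothetical, and the option names (`-I`, `-S`, `--paired`, `--output-bam`, `--group-out`, `--extract-umi-method`, `--umi-tag`) should be checked against `umi_tools dedup --help` and `umi_tools group --help` for your installed version. Whether the UMI sits in the read name or in a BAM tag depends on how the data were preprocessed.

```python
import shlex


def umi_tools_command(mode, in_bam, out_bam, paired=True,
                      umi_tag=None, group_tsv=None):
    """Build (but do not run) a umi_tools command line as a list of args."""
    if mode not in ("dedup", "group"):
        raise ValueError("mode must be 'dedup' or 'group'")
    cmd = ["umi_tools", mode, "-I", in_bam, "-S", out_bam]
    if paired:
        cmd.append("--paired")
    if umi_tag is not None:
        # Assumes the UMI was stored in a BAM tag rather than the read name.
        cmd += ["--extract-umi-method=tag", f"--umi-tag={umi_tag}"]
    if mode == "group":
        cmd.append("--output-bam")  # write the tagged reads to out_bam
        if group_tsv is not None:
            cmd.append(f"--group-out={group_tsv}")
    return cmd


dedup_cmd = umi_tools_command("dedup", "mapped.bam", "dedup.bam")
group_cmd = umi_tools_command("group", "mapped.bam", "grouped.bam",
                              group_tsv="groups.tsv")
print(shlex.join(dedup_cmd))
print(shlex.join(group_cmd))
# To execute: subprocess.run(dedup_cmd, check=True)
```

The input BAM needs to be coordinate-sorted and indexed before either command is run.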


Hi, are there any suggestions for filtering the consensus reads once they have been created? Thank you,

Comment written 9 months ago by deniselavezzari